Publications List of a Researcher in BibTeX Format¶
Fetch the publications of a specific person identified by ORCID, and export the data as a BibTeX file. Visualise publication counts over the years with a bar plot and generate a keyword word cloud from the titles.
Related Notebooks:
- ORCID Notebook: Query for researchers' data by passing an ORCID to the Augment API. Visualise co-author relationships in a graph.
- DOI Notebook: Query publications data by passing a DOI to the API.
- Affiliations Notebook: Query researchers and affiliations by passing an ORCID to the API. Extract the geolocation data and map affiliations data on a world map. Plot researcher-organisation relationships in a graph.
import sys
sys.path.append('../')
# Packages for plotting charts, graphs and wordcloud
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import altair as alt
from wordcloud import WordCloud, STOPWORDS
# Packages for data manipulation
import pandas as pd
from datetime import datetime, date
# Packages to use API
import requests
import json
# packages to read API_KEY
import os
from os.path import join, dirname
from dotenv import load_dotenv
load_dotenv();
API Errors¶
When using the API, we load the API_KEY and the ORCID we want to search into variables and add them to the URL string. The Python requests package then passes those values to the API and returns the data we want. This section shows the two common types of error you might get when using the Augment API: either the ORCID passed is invalid, or the API_KEY has not been loaded successfully from your environment file.
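The cells below walk through each case separately. As a minimal sketch, the same request pattern could also be wrapped in a small helper function (the name query_augment_api is ours, not part of the notebook or the API):

# Hypothetical helper wrapping the Augment API call (illustrative only)
def query_augment_api(orcid, api_key):
    url = f'https://augmentapi.researchgraph.com/v1/orcid/{orcid}?subscription-key={api_key}'
    response = requests.get(url)
    print('Augment API query complete ', response.status_code)
    if response.status_code == 200:
        return response.json()
    # Non-200 responses carry an error payload instead of data
    return None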
ORCID ID Not Found¶
Here we assign an invalid value to the ORCID variable. When an error occurs, requests.get() returns a response object whose status code indicates the error type, along with an error message.
# ORCID ID not found
API_KEY = os.environ.get("API_KEY")
ORCID = "0000-0003-XXXX-XXXX"
url = f'https://augmentapi.researchgraph.com/v1/orcid/{ORCID}?subscription-key={API_KEY}'
r = requests.get(url)
# print a short confirmation on completion
print('Augment API query complete ', r.status_code)
if r.status_code == 400:
    print(r.json()[0]["error"])
Missing API_KEY¶
You will receive an authentication error if the API_KEY is missing or invalid.
# Missing API_KEY
API_KEY = ''
ORCID = "0000-0002-0715-6126"
url = f'https://augmentapi.researchgraph.com/v1/orcid/{ORCID}?subscription-key={API_KEY}'
r = requests.get(url)
# print a short confirmation on completion
print('Augment API query complete ', r.status_code)
if r.status_code == 401:
    print('Authentication error:', r.json()['message'])
Data Extraction for Valid ORCID ID¶
For a valid ORCID, the record retrieved is a nested dictionary containing all data connected to the requested ORCID. The first level has 3 keys, as shown in the block below.
# ORCID ID does exist
API_KEY = os.environ.get("API_KEY")
ORCID = "0000-0002-0068-716X"
url = f'https://augmentapi.researchgraph.com/v1/orcid/{ORCID}?subscription-key={API_KEY}'
r = requests.get(url)
# print a short confirmation on completion
print('Augment API query complete ', r.status_code)
# Shows data
print('The data returned has below fields: ',r.json()[0].keys())
Within 'nodes', the data is stored under 5 labels from the ResearchGraph schema:
r.json()[0]["nodes"].keys()
Each label above holds a list of dictionaries. To extract the data we need, iterate through the list and check for the matching ORCID.
# ORCID ID does exist
API_KEY = os.environ.get("API_KEY")
ORCID = "0000-0002-0068-716X"
url = f'https://augmentapi.researchgraph.com/v1/orcid/{ORCID}?subscription-key={API_KEY}'
r = requests.get(url)
# print a short confirmation on completion
print('Augment API query complete ', r.status_code)
if r.status_code == 200 and r.json()[0]["nodes"]["researchers"]:
    researchers = r.json()[0]["nodes"]["researchers"]
    researcher = None
    # Find the record whose ORCID matches the one we queried
    for i in range(len(researchers)):
        if researchers[i]["orcid"] == ORCID:
            researcher = researchers[i]
    print()
    print(f'ORCID: {researcher["orcid"]}')
    print(f'First name: {researcher["first_name"]}')
    print(f'Last name: {researcher["last_name"]}')
    print()
    print(f'The researcher {researcher["full_name"]} is connected to {r.json()[0]["stats"]}.')
List of publications as bibtex¶
In this section, we use another API, from Crossref, to retrieve each publication in bibtex format and export the results to a file.
# include all publications
pd.set_option("display.max_rows", None)
df = pd.DataFrame(r.json()[0]["nodes"]["publications"], columns=['doi', 'publication_year', 'title'])
df = df.dropna()
df = df.drop_duplicates(subset=['title'])
df = df.sort_values(by=['publication_year','doi'], ascending=False)
Similar to our API, we pass the DOI of each publication in the URL. To specify the text format, we also pass headers in the query. For more information on the Crossref API, please see the documentation.
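Before running the full loop, you may first want to test the request for a single DOI; this is a quick sketch, assuming df has been built as in the cell above:

# Optional check: request bibtex data for the first DOI in the dataframe
test_doi = df.iloc[0]['doi']
test_headers = {'Accept': 'text/bibliography', 'style': 'bibtex'}
test_response = requests.get(f'http://dx.doi.org/{test_doi}', headers=test_headers)
print(test_response.status_code)
print(test_response.text)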
# Use the Crossref API to get bibtex format data for each publication
# This may take a few minutes
data = []
for index, row in df.iterrows():
    url = f'http://dx.doi.org/{row["doi"]}'
    headers = {'Accept': 'text/bibliography', 'style': 'bibtex'}
    ra = requests.get(url, headers=headers)
    print(f'Crosscite API query for {row["doi"]} complete', ra.status_code)
    data.append(ra.text)
bib = '\n'.join(data)
# Export data into a bib document
with open(researcher['last_name'].lower() + '_publications.bib', 'a') as fp:
    fp.write(bib)
This document will be saved in the same directory as this notebook. Now you can navigate to the folder and see the result.
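If you would rather confirm the output without leaving the notebook, a small sketch (assuming the file was written as above) is to print the first few lines of the exported file:

# Preview the first lines of the exported .bib file
bib_path = researcher['last_name'].lower() + '_publications.bib'
with open(bib_path) as fp:
    for line in fp.readlines()[:10]:
        print(line.rstrip())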
Publications by year¶
There are lots of ways to use the publication information. This section shows the publication trend by counting the publication records for each year. The code below also handles years without any publications and uses the ResearchGraph color for the bars.
plot_title = alt.TitleParams(f'{researcher["full_name"]} (ORCID {ORCID})', subtitle=['Publications by Year'])
alt.Chart(df, title=plot_title).mark_bar(color='#49B1F4').properties(width=500).encode(
x=alt.X("publication_year:O", axis=alt.Axis(title='Publication Year', labelAngle=0, labelSeparation=10)),
y=alt.Y("count:Q", impute=alt.ImputeParams(value=0, keyvals={"start": int(min(df['publication_year'].tolist())), "stop": datetime.now().year }), axis=alt.Axis(title=None))
).transform_aggregate(
count='count(publication_year)',
groupby=["publication_year"]
).configure_title(
fontSize=18
).configure_axis(
grid=False
).configure_view(
strokeWidth=0
)
Topics of publications¶
If we want to know the topics covered by this researcher's publications, we can also create a word cloud from the keywords in the titles.
# High frequency meaningless words to be removed, e.g. the, a, of...
stopWords = set(STOPWORDS)
stopWords.add('_')
titleWords = []
for index, row in df.iterrows():
    tokens = [t.lower() for t in row['title'].split()]
    titleWords += tokens
# Circular mask so the word cloud is drawn inside a circle
x, y = np.ogrid[:800, :800]
mask = (x - 400) ** 2 + (y - 400) ** 2 > 345 ** 2
mask = 255 * mask.astype(int)
wordcloud = WordCloud(width = 600, height = 600,
max_words = 100,
background_color ='white',
stopwords = stopWords,
min_font_size = 12,
mask = mask).generate(" ".join(titleWords))
fig, ax = plt.subplots(1, 1, figsize = (8, 8), facecolor = None)
ax.set_title(f'{researcher["full_name"]} (ORCID {researcher["orcid"]}) \n Word cloud of publication titles', fontsize=18, fontweight="semibold")
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
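If a plain frequency list is easier to scan than the word cloud, the same tokens can be counted with the standard library; this is a small sketch reusing titleWords and stopWords from the cell above:

# Count title words, skipping the same stop words used by the word cloud
from collections import Counter
cleaned = (t.strip('.,:;()') for t in titleWords)
word_counts = Counter(t for t in cleaned if t and t not in stopWords)
print(word_counts.most_common(10))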