This is a small tutorial on how to work with the Scopus API using R. Most bibliographic databases nowadays allow you to download bibliographic data through their API in an automated way, provided you are registered and have an institutional subscription. In many cases they lack good documentation; this is not the case for Elsevier, who provide quite a lot of information on how to work with their data.
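The package startup messages below come from loading the libraries the tutorial relies on. Judging from the conflict messages (which mention `jsonlite::flatten()`) and the functions used later (`GET()`, `fromJSON()`), the loading step presumably looks like this; treat the exact set of packages as an assumption:

```r
# Load the packages used throughout the tutorial
# (inferred from the startup messages and the functions called below)
library(tidyverse)  # Data wrangling: dplyr, tidyr, stringr, etc.
library(httr)       # HTTP requests: GET(), add_headers(), status_code()
library(jsonlite)   # JSON parsing and writing: fromJSON(), write_json()
```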
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The next step is to store the API key in an .Renviron file. To do this, we first open the .Renviron file in RStudio:
file.edit("~/.Renviron")
Add a new line to the file with this info:
SCOPUS_API_KEY=your_api_key_here
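Note that R only reads .Renviron at startup, so after saving the file you either restart your R session or reload the file manually. A quick check could look like this:

```r
# Reload the .Renviron file without restarting R
readRenviron("~/.Renviron")

# Confirm the key is visible to the session
# (returns "" if the variable was not found)
Sys.getenv("SCOPUS_API_KEY")
```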
Now let’s check that it works:
# Retrieve API key from .Renviron
api_key <- Sys.getenv("SCOPUS_API_KEY")

# Test API call to check validity of the key
response <- GET(
  "https://api.elsevier.com/content/search/scopus",
  add_headers("X-ELS-APIKey" = api_key),
  query = list(query = "AUTHOR-NAME(Robinson-Garcia)", count = 1)
)

# Check status
if (status_code(response) == 200) {
  print("API key is valid and working.")
} else {
  print(paste("Error:", status_code(response), "Check your API key or access permissions."))
}
[1] "API key is valid and working."
Step 3. Reading and preparing the list of Author IDs
In my case I already have a list of publications, with their author IDs in each row. I want to extract just the Author IDs and clean them so that I end up with one ID per element in a vector, for querying the API later on.
library(readr) # For reading the CSV file

# Step 1: Import the CSV file
# Replace "your_file.csv" with your actual file path
data <- read_csv("G:/Mi unidad/1. Work sync/Projects/z2025_01-SELECT/Contributions-inequalites/raw_data/contrib_data.csv")
Rows: 675080 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): doi, auid_list, affiliation_id_list, author_affiliation_mapping, s...
dbl (22): Eid, Year, n_authors, source_id, CitationCount, DDR, DDA, SV_topic...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 2: Extract and clean the 'auid_list' column
author_ids <- data %>%
  select(auid_list) %>%                        # Select the relevant column
  separate_rows(auid_list, sep = ",") %>%      # Split each row by commas
  mutate(auid_list = str_trim(auid_list)) %>%  # Trim whitespace
  distinct(auid_list) %>%                      # Remove duplicate IDs
  pull(auid_list)                              # Extract as a vector

# Optional: Check the length of the vector
length(author_ids)
[1] 2082499
I end up with over 2 million author IDs.
Step 4. Query the API for metadata
Let’s create a function to download the data we want:
# Function to query Scopus API for author metadata
query_author <- function(author_id, api_key, output_dir = "author_data") {
  # Ensure the output directory exists
  if (!dir.exists(output_dir)) dir.create(output_dir)

  # Construct the API URL
  url <- paste0("https://api.elsevier.com/content/author/author_id/", author_id)

  # Query the API
  response <- GET(
    url,
    add_headers("X-ELS-APIKey" = api_key),
    query = list(httpAccept = "application/json")
  )

  if (status_code(response) == 200) {
    # Parse the response content
    content_raw <- content(response, as = "text", encoding = "UTF-8")
    content <- fromJSON(content_raw)

    # Save to a JSON file
    output_file <- file.path(output_dir, paste0(author_id, ".json"))
    write_json(content, output_file, pretty = TRUE)

    return(TRUE)  # Indicate success
  } else {
    # Log the error
    print(paste("Error: Status code", status_code(response), "for author ID:", author_id))
    return(FALSE)  # Indicate failure
  }
}
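To run this function over the full vector of IDs, a minimal sketch could look like the following. It assumes the `query_author()` function and the `author_ids` vector defined above; the pause between requests is an assumption, so adjust it to whatever rate limits your Scopus quota imposes. Skipping files that already exist lets you resume after an interruption, which matters with 2M+ IDs:

```r
# Loop over all author IDs, skipping those already downloaded
# NOTE: the sleep time is an assumption; check your Scopus rate limits
results <- vapply(author_ids, function(id) {
  out_file <- file.path("author_data", paste0(id, ".json"))
  if (file.exists(out_file)) return(TRUE)  # Already fetched: skip

  ok <- query_author(id, api_key)
  Sys.sleep(0.5)  # Crude rate limiting between requests
  ok
}, logical(1))

# How many queries succeeded?
sum(results)
```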