Extract percentile values from Scopus using R

ghz 8 months ago ⋅ 93 views

I'm trying to get the Highest percentile value from Scopus for all the journals.

The URL is: https://www.scopus.com/sources.uri

This value is updated every year, so getting the value for each year would be even better.

There's no way to get this data using its API.

I tried rvest, but with no success.

library(rvest)
library(dplyr)

url <- "https://www.scopus.com/sources.uri"

site <- read_html(url)

site %>% 
  html_node("table") %>% 
  html_table()
#> # A tibble: 2 × 3
#>   ``         CiteScore CiteScoreTracker 
#>   <chr>      <chr>     <chr>            
#> 1 Calculated Annually  12 times per year
#> 2 Updates    None      Monthly

Answers

The data you are trying to scrape from the Scopus website appears to be loaded dynamically with JavaScript. rvest only parses the HTML the server returns and does not execute JavaScript, which is why your code finds only the small static CiteScore/CiteScoreTracker table.
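To make the static-versus-dynamic point concrete, here is a minimal sketch on a made-up page. The HTML string and the `#sourceResults` container are hypothetical stand-ins, not the actual Scopus markup. It shows that rvest sees the table written into the markup but nothing that a script would add later.

```r
library(rvest)

# Hypothetical page: one static table plus an empty container
# that a script would fill in at runtime (NOT the real Scopus markup).
static_html <- '
<html><body>
  <table>
    <tr><th>Metric</th><th>Updates</th></tr>
    <tr><td>CiteScore</td><td>Annually</td></tr>
  </table>
  <div id="sourceResults"></div>
  <script>/* rows would be injected into #sourceResults here */</script>
</body></html>'

page <- read_html(static_html)

# Tables present in the markup as served
n_tables <- length(html_nodes(page, "table"))

# Rows inside the container the script would populate: none, because
# the script is never executed when the HTML is only parsed
n_rows_in_container <- length(html_nodes(page, "#sourceResults tr"))
```

If a similar check on the real page shows that the element you need is empty or missing, that is a strong hint the content is rendered client-side.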

To scrape dynamically loaded content, you can use a tool like RSelenium, which can control a web browser programmatically to interact with the page and access the dynamically generated content. Below is an example of how you can achieve this using RSelenium:

library(RSelenium)
library(rvest)  # provides read_html(), html_nodes() and html_table()
library(dplyr)

# Start a Selenium server
driver <- rsDriver(browser = "chrome")
remDr <- driver[["client"]]

# Navigate to the URL
remDr$navigate("https://www.scopus.com/sources.uri")

# Wait for the page to load
Sys.sleep(5) # Adjust the sleep time as needed

# Extract the table
site <- remDr$getPageSource()[[1]] %>%
  read_html()

table_data <- site %>%
  html_nodes("table") %>%
  html_table()

# Stop the Selenium server
remDr$close()
driver$server$stop()

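# Take the first table found; check length(table_data) and adjust the
# index if the journal list turns out to be in a different table
scopus_table <- table_data[[1]]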

# Print the table
print(scopus_table)
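A fixed `Sys.sleep(5)` can be too short on a slow connection and wasteful on a fast one. A possible alternative is to poll until the content shows up. The helper below, `wait_until()`, is a hypothetical name of my own and not part of RSelenium, and the CSS selector in the usage comment is a guess that you would need to confirm in the browser's inspector.

```r
# Poll `condition()` until it returns TRUE or `timeout` seconds have passed.
# Returns TRUE on success and FALSE on timeout. Errors raised by
# `condition()` are treated as "not ready yet".
wait_until <- function(condition, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Usage with the session above (assumes `remDr` is still open;
# the "table tbody tr" selector is an assumption):
# loaded <- wait_until(function() {
#   length(remDr$findElements(using = "css selector",
#                             value = "table tbody tr")) > 0
# })
# if (!loaded) stop("Timed out waiting for the table to render")
```

Call it in place of the `Sys.sleep(5)` line, before reading the page source.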

However, keep in mind that web scraping may violate the terms of service of the website, so it's essential to review the terms of use and the website's robots.txt file before scraping any data. Additionally, consider contacting the website owners to inquire about accessing the data programmatically. They may have an API or other means for accessing the data legally and ethically.
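As a first pass at the robots.txt check, something like the following could work. It is a deliberately simplified sketch in base R: `path_allowed()` is a hypothetical helper, it only looks at the `User-agent: *` group, and it uses plain prefix matching, so it ignores `Allow` rules, wildcards, and bot-specific groups. It does not replace reading the terms of use.

```r
# Return TRUE if `path` is not matched by any Disallow rule in the
# `User-agent: *` group of a robots.txt file (given as a character vector
# of lines). Simplified: prefix matching only, no Allow rules or wildcards.
path_allowed <- function(robots_lines, path) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # drop comments
  lines <- lines[nzchar(lines)]                   # drop empty lines
  field <- tolower(trimws(sub(":.*$", "", lines)))
  value <- trimws(sub("^[^:]*:", "", lines))

  in_star_group  <- FALSE
  prev_was_agent <- FALSE
  disallowed     <- character(0)

  for (i in seq_along(lines)) {
    if (field[i] == "user-agent") {
      # Consecutive User-agent lines belong to the same group
      if (!prev_was_agent) in_star_group <- FALSE
      if (value[i] == "*") in_star_group <- TRUE
      prev_was_agent <- TRUE
    } else {
      if (in_star_group && field[i] == "disallow" && nzchar(value[i])) {
        disallowed <- c(disallowed, value[i])
      }
      prev_was_agent <- FALSE
    }
  }

  !any(startsWith(path, disallowed))
}

# Usage (needs network access):
# robots <- readLines("https://www.scopus.com/robots.txt")
# path_allowed(robots, "/sources.uri")
```

For anything beyond a quick check, a dedicated robots.txt parser would be a safer choice than this prefix test.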