Antonello Zanini for Writech

Best Web Scraping Libraries for R

In recent years, web scraping has become an essential tool for data analysts and data scientists. This technique involves extracting data from the web through automated tools. R is one of the most popular languages for data analysis and provides several web scraping libraries.

In this article, you will look at the best R web scraping libraries, along with their pros and cons.

Top 5 Libraries for Web Scraping with R

Here is the list of the most useful open-source libraries to perform web scraping in R.

1. rvest

rvest is one of the most popular R packages for web scraping. It is built on top of the xml2 package and provides a set of functions for parsing HTML/XML documents and extracting data from them. In detail, it supports CSS and XPath selectors, making it easy to select HTML elements and read their content. It also comes with built-in functionality to extract data from tables.

Let's see rvest in action in the code example below:

library(rvest)

url <- "https://en.wikipedia.org/wiki/R_(programming_language)"
page <- read_html(url)

# extract data from
# the first table on the page
# (html_element() requires rvest >= 1.0.0;
# use html_node() on older versions)
table <- page %>%
  html_element("table") %>%
  html_table()

# extract text from the first p tag
# on the page
paragraph <- page %>%
  html_element("p") %>%
  html_text()

👍 Pros:

  • Easy to use for beginners

  • Built-in support for scraping tables

  • Good documentation and community support

👎 Cons:

  • Does not support JavaScript-rendered sites

  • Can be slow when extracting large amounts of data
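
To make the selector support described above more concrete, here is a minimal sketch that selects the same element once with a CSS selector and once with an XPath expression, then parses a table. The inline HTML snippet is made up for illustration so the code runs without network access, and html_element() assumes rvest 1.0.0 or later.

```r
library(rvest)

# Made-up HTML (not from a real site) so the example runs offline
page <- read_html('
  <html><body>
    <h1 id="title">Languages</h1>
    <table>
      <tr><th>name</th><th>year</th></tr>
      <tr><td>R</td><td>1993</td></tr>
      <tr><td>S</td><td>1976</td></tr>
    </table>
  </body></html>
')

# select the heading with a CSS selector
title_css <- page %>%
  html_element("#title") %>%
  html_text(trim = TRUE)

# select the same heading with an XPath expression
title_xpath <- page %>%
  html_element(xpath = "//h1[@id='title']") %>%
  html_text(trim = TRUE)

# turn the table into a data frame; the <th> row becomes the header
langs <- page %>%
  html_element("table") %>%
  html_table()
```

Both selector styles return the same node here, so the choice between them is mostly a matter of preference. XPath tends to be handier when you need to match on text content or walk up the tree.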

2. RSelenium

RSelenium is a set of R bindings for the Selenium 2.0 WebDriver. It lets you instruct a browser to perform operations on a web page as a human user would. In particular, RSelenium can drive headless browsers and scrape sites that require JavaScript.

Here is what a simple RSelenium script looks like:

library(RSelenium)

# connect to a Selenium server that is already running
# (by default on localhost:4444) and open a Firefox session
remDr <- remoteDriver(browserName = "firefox")
remDr$open()

# navigate to the target site's login page
remDr$navigate("https://example.com/login")

# type in the login credentials
# and submit the form
remDr$findElement(using = "name", value = "username")$sendKeysToElement(list("myusername"))
remDr$findElement(using = "name", value = "password")$sendKeysToElement(list("mypassword"))
remDr$findElement(using = "name", value = "submit")$clickElement()

# scrape data from a table
data <- remDr$findElement(using = "css selector", value = "table")$getElementText()

# close the browser session
remDr$close()

👍 Pros:

  • Can handle websites that rely on JavaScript for rendering or data retrieval

  • Supports several browsers, including Chrome, Firefox, Safari, and Edge

  • Can fool anti-bot technologies by simulating human user interaction

👎 Cons:

  • Requires a web browser and the right driver to work

  • Can be slow and resource-intensive

  • Does not support Selenium 3.x and 4.x features
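
Note that getElementText() on a table returns its content as flattened text. One common pattern is to hand the browser-rendered HTML to rvest and parse the tables there. The following is a minimal sketch of that idea: the helper name tables_from_source is hypothetical, and the commented-out call assumes an open remDr session like the one in the script above.

```r
library(rvest)

# Hypothetical helper: parse every <table> found in an HTML string
tables_from_source <- function(page_source) {
  read_html(page_source) %>%
    html_elements("table") %>%
    html_table()
}

# With a live session (assumes `remDr` is still open):
# tables <- tables_from_source(remDr$getPageSource()[[1]])

# Offline check with a made-up snippet standing in for the page source
demo <- tables_from_source(
  "<table><tr><th>id</th></tr><tr><td>1</td></tr></table>"
)
```

Keeping the parsing step separate from the browser session also makes it possible to try out selectors on saved HTML without starting Selenium.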

3. Rcrawler

Rcrawler provides a range of tools for web crawling and extracting structured data from the Web. It uses XPath or CSS selectors, optionally combined with regular expressions, to retrieve data from web pages. Rcrawler also supports JavaScript-rendered pages, allowing dynamic page scraping.

Here is an Rcrawler snippet example:

library(Rcrawler)

# target page
url <- "https://en.wikipedia.org/wiki/R_(programming_language)"

# scrape a single page: each XPath pattern
# is paired with a name for the extracted field
results <- ContentScraper(
  Url = url,
  XpathPatterns = c("//title", "//p"),
  PatternsName = c("title", "p")
)

👍 Pros:

  • Supports JavaScript and can scrape dynamic web pages

  • Supports parallel scraping and crawling

👎 Cons:

  • Last update to the library was 5 years ago

  • Limited documentation and community support

4. xmlTreeParse

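xmlTreeParse is a function from the XML package that parses an XML document into a tree structure you can navigate from R. Its companion htmlTreeParse does the same for HTML. Together with the package's XPath functions, they make it easier to parse XML and HTML documents.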

See xmlTreeParse in action in the sample code below:

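library(XML)
library(httr)

url <- "https://en.wikipedia.org/wiki/R_(programming_language)"

# the XML package typically cannot fetch HTTPS URLs on
# its own, so download the page with httr and parse the text
html <- content(GET(url), as = "text", encoding = "UTF-8")
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# extract data from the first table
# on the page
table <- xmlToList(xpathApply(doc, "//table")[[1]])

# extract the text contained in the
# first paragraph from the page
paragraph <- xmlValue(xpathApply(doc, "//p")[[1]])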

👍 Pros:

  • Lightweight and fast

  • Easy to use for simple parsing tasks

👎 Cons:

  • Does not support JavaScript

  • Limited documentation

  • Very limited community support
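
As a rough illustration of the XPath helpers in the XML package, the sketch below pulls paragraph text and link targets out of a small document. The HTML string and its URLs are made up so the code runs offline. htmlParse() is the package's shortcut for parsing HTML into internal nodes.

```r
library(XML)

# Made-up HTML (not from a real site) so the example runs offline
html <- '<html><body>
  <p class="lead">First paragraph</p>
  <p>Second paragraph</p>
  <a href="https://example.com/a">A</a>
  <a href="https://example.com/b">B</a>
</body></html>'

# asText = TRUE tells the parser the input is markup, not a file path
doc <- htmlParse(html, asText = TRUE)

# text of every <p> element
paragraphs <- xpathSApply(doc, "//p", xmlValue)

# value of the href attribute of every <a> element
links <- xpathSApply(doc, "//a", xmlGetAttr, "href")

# an XPath predicate narrows the match, here by class attribute
lead <- xpathSApply(doc, "//p[@class='lead']", xmlValue)
```

xpathSApply() applies the given function to each matching node and simplifies the result to a vector, which is usually more convenient than the list returned by xpathApply().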

5. httr

httr is an HTTP client that makes it easy to execute HTTP requests in R. Although it is not a dedicated web scraping library, it is used by most R scrapers to call APIs or make HTTP requests.

Perform a GET request with httr as follows:

library(httr)

# perform an HTTP GET request to 
# an API endpoint
url <- "https://api.example.com/data"
response <- GET(url)

# get the API response as text
data <- content(response, "text")

👍 Pros:

  • Provides a simple way to work with HTTP requests

  • Can be useful for scraping data from APIs

👎 Cons:

  • Not a dedicated web scraping library
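
In a scraper, you usually want more than a bare GET: a timeout, an identifying user agent, and an explicit failure on HTTP errors. Below is a minimal sketch of such a wrapper. The helper name fetch_text, the user agent string, and the 10-second timeout are illustrative assumptions, not recommendations from httr itself.

```r
library(httr)

# Hypothetical helper: GET a URL and return the body as text,
# failing loudly on HTTP errors instead of returning an error page
fetch_text <- function(url, query = NULL) {
  response <- GET(
    url,
    query = query,                    # named list of query parameters
    user_agent("my-r-scraper/0.1"),   # placeholder identifier
    timeout(10)                       # give up after 10 seconds
  )
  stop_for_status(response)           # raises an R error on 4xx/5xx
  content(response, as = "text", encoding = "UTF-8")
}

# Usage (placeholder endpoint, as in the example above):
# body <- fetch_text("https://api.example.com/data", query = list(page = 1))
```

The returned text can then be passed to a parser such as rvest's read_html() for HTML pages, or to a JSON parser for API responses.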

Conclusion

In this article, you saw the best R web scraping libraries: rvest, RSelenium, Rcrawler, xmlTreeParse, and httr. Each library has its own strengths and weaknesses, so the right choice depends on your specific scraping goals. By learning how to use these libraries, you can get data from websites and use that information for data mining or machine learning.

Thanks for reading! I hope you found this article helpful.


The post "Best Web Scraping Libraries for R" appeared first on Writech.
