Code_Jedi

Web Scraping in Node.js!

Web scraping:

Web scraping is a way to collect all sorts of publicly available data, like prices, text, images and contact information, from the World Wide Web. This can be useful when trying to gather data that would take a person a lot of time to collect and organize manually.

Some of the most common use cases of web scraping include:

  1. Scraping product prices from e-commerce websites such as Amazon, eBay or Alibaba.

  2. Scraping social media posts, likes, comments, followers or bios.

  3. Scraping contacts from websites like Yellow Pages or LinkedIn.

Puppeteer

While there are a few different libraries for scraping the web with Node.js, in this tutorial I'll be using the Puppeteer library.

Puppeteer is a popular, easy-to-use npm package for web automation and web scraping.

Some of Puppeteer's most useful features include:

  1. Being able to extract a scraped element's text content.
  2. Being able to interact with a webpage by filling out forms, clicking on buttons or running searches inside a search bar (a short sketch of this follows the list).
  3. Being able to scrape and download images from the web.
  4. Being able to watch the scraping in progress by turning headless mode off.
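For example, here's a minimal sketch of interacting with a page (feature 2); the site, the selectors and the search term are made-up placeholders, not taken from any real page:

const puppeteer = require('puppeteer')

async function interact() {
   const browser = await puppeteer.launch({})
   const page = await browser.newPage()

   await page.goto('https://example.com')

   // Type a query into a (hypothetical) search bar, then submit it by clicking a (hypothetical) button
   await page.type('#search-input', 'web scraping')
   await page.click('#search-button')

   await browser.close()
}
interact()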

You can read more about Puppeteer here.

Installation

For this tutorial, I'll assume you already have Node.js and npm installed, and that your project already has a package.json and package-lock.json file.

If you don't, here's a great guide on how to do so: Setup

To install puppeteer, run one of the following commands in your project's terminal:

npm i puppeteer

Or

yarn add puppeteer

Once Puppeteer is installed, it will appear as a directory inside your node_modules folder.

Let's make a simple web scraping script in Node.js

The web scraping script will get the first synonym of "smart" from the web thesaurus by:

  1. Getting the HTML contents of the web thesaurus' webpage.

  2. Finding the element that we want to scrape through its selector.

  3. Displaying the text contents of the scraped element.

Let's get started!

Before scraping this element and extracting its text through its selector in Node.js, we need to set up a few things first:

Create or open an empty JavaScript file. You can name it whatever you want, but I'll name mine "index.js" for this tutorial. Then require Puppeteer on the first line and create the async function inside which we'll write our web scraping code:

index.js

const puppeteer = require('puppeteer')

async function scrape() {
}
scrape()

Next, initiate a new browser instance and define the "page" variable, which is going to be used for navigating to webpages and scraping elements within a webpage's HTML contents:

index.js

const puppeteer = require('puppeteer')

async function scrape() {
   const browser = await puppeteer.launch({})
   const page = await browser.newPage()
}
scrape()
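As a side note, launch({}) starts the browser in headless mode by default, so no window is shown. If you'd like to watch the scraping as it happens (feature 4 from earlier), you can pass options to launch(); the slowMo value below is just an illustrative choice:

// Open a visible browser window and slow every operation down by 50 ms
// so the automation is easier to follow
const browser = await puppeteer.launch({ headless: false, slowMo: 50 })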

Scraping the first synonym of "smart"

To locate and copy the selector of the first synonym of "smart", which is what we'll use to find the synonym inside the web thesaurus' webpage, first go to the web thesaurus' page of synonyms for "smart", right-click on the first synonym and click on "Inspect". This will open the webpage's DOM inspector at the right of your screen:
Inspect

Next, right-click on the highlighted HTML element containing the first synonym and click on "Copy selector":
Copy selector

Finally, here's how to navigate to the web thesaurus, then scrape and display the first synonym of "smart" using the selector we copied earlier:

  1. First, make the "page" variable navigate to https://www.thesaurus.com/browse/smart inside the newly created browser instance.

  2. Next, we define the "element" variable by making the page wait for our desired element's selector to appear in the webpage's DOM.

  3. The text content of the element is then extracted using the evaluate() function, stored in the "text" variable, and printed to the console.

  4. Finally, we close the browser instance.

index.js

const puppeteer = require('puppeteer')

async function scrape() {
   const browser = await puppeteer.launch({})
   const page = await browser.newPage()

   await page.goto('https://www.thesaurus.com/browse/smart')
   var element = await page.waitForSelector("#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(1) > a")
   var text = await page.evaluate(element => element.textContent, element)
   console.log(text)
   browser.close()
}
scrape()
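A quick note on the evaluate() call above: the callback runs inside the page, and the extra "element" argument after the comma is the handle that gets passed into that callback. An equivalent, arguably more readable form calls evaluate() directly on the element handle; a small sketch, assuming the same "element" variable as in the script above:

// Equivalent to page.evaluate(element => element.textContent, element):
// the callback runs in the page with the handle itself as its argument
var text = await element.evaluate(el => el.textContent)
console.log(text)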

Time to test

Now if you run your index.js script using "node index.js", you will see that it displays the first synonym of the word "smart":
Test

Scraping the top 5 synonyms of smart

We can adapt the same code to scrape the top 5 synonyms of "smart" instead of just one:

index.js

const puppeteer = require('puppeteer')

async function scrape() {
   const browser = await puppeteer.launch({})
   const page = await browser.newPage()

   await page.goto('https://www.thesaurus.com/browse/smart')
   for(let i = 1; i < 6; i++){
    var element = await page.waitForSelector("#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(" + i + ") > a")
    var text = await page.evaluate(element => element.textContent, element)
    console.log(text)
   }
   browser.close()
}
scrape()

5 synonyms

The "element" variable will be: "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(1) > a" on the first iteration, "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(2) > a" on the second, and so on until it reaches the last iteration where the "element" variable will be "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(5) > a".

As you can see, the only thing that changes in the selector from one iteration to the next is the "li:nth-child()" value.

This is because, in our case, the elements we're trying to scrape are all "li" elements inside the same "ul" element, so we can scrape them in order simply by increasing the value inside "li:nth-child()" (an alternative approach is sketched right after this list):

  1. li:nth-child(1) for the first synonym.

  2. li:nth-child(2) for the second synonym.

  3. li:nth-child(3) for the third synonym.

  4. li:nth-child(4) for the fourth synonym.

  5. And li:nth-child(5) for the fifth synonym.
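As an aside, instead of waiting for each li:nth-child() selector in a loop, Puppeteer can also collect every matching element in a single call with page.$$eval(). Here's a minimal sketch, assuming the same "#meanings" list structure as above:

// Wait for the synonym list to appear, then grab the text of every link inside it in one call
await page.waitForSelector("#meanings > div.css-ixatld.e15rdun50 > ul > li > a")
var synonyms = await page.$$eval(
   "#meanings > div.css-ixatld.e15rdun50 > ul > li > a",
   links => links.map(link => link.textContent)
)
// Keep only the first 5 synonyms
console.log(synonyms.slice(0, 5))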

Final notes

Web scraping has many advantages, like:

  1. Saving time on manually collecting data.

  2. Being able to programmatically aggregate pieces of data scraped from the web.

  3. Creating datasets that might be useful for machine learning, data visualization or data analytics purposes.

However, it also has a couple of disadvantages:

  1. Some websites don't allow their data to be scraped; one popular example is Craigslist.

  2. Some people consider it to be a gray area, since some use cases of web scraping involve collecting and storing user or entity data.

Wrapping up

Hopefully this article gave you some insight into web scraping in Node.js, its practical applications, pros and cons, and how to extract specific elements and their text contents from webpages using the Puppeteer library.

Top comments (7)

capncrockett

Very helpful and straightforward. I did find that waitFor is deprecated. In this instance I made use of waitForSelector. Thanks!

RayRemnant

waitFor will be deprecated, as mentioned in this GitHub issue of Puppeteer and by ESLint; use waitForSelector instead.
Just tested it and it works.

marky edwards

Excellent article.... I tried several "scraping" suggestions (that is a new concept to me) and your code is the only example I could find that works. Thank you so much for providing this.

a couple points/questions:

waitFor is giving deprecation messages, maybe use waitForSelector instead?
github.com/puppeteer/puppeteer/iss...

let elements = await page.waitForSelector(cssSelector);

What does the ", element)" part do? Is this the same thing?
let text = await page.evaluate(element => { return ( element.textContent)})

Shortcuts can be a bit confusing sometimes.

EDIT:
let text = await page.evaluate(element => { return ( element.textContent)})
Well, that did NOT work at all. 😣 However, this worked for me (below); the word "element" being used multiple times is confusing my newbie brain:
let elements = await page.waitForSelector(cssSelector); // note 'elements' - plural
let text = await page.evaluate(element => element.textContent, elements);

But I am still confused about how the "await page.evaluate(element =>" line works. Is this some sort of shortcut? It's like "elements" is being used, but I cannot figure out how.

Jakub Stibůrek

Thanks for the article. I've been looking into web scraping for some time but everyone out there uses Python, which I don't command. I never thought of using Node. I'll try this.

Code_Jedi

Good luck! I've been web scraping using Node.js for 3 years now and I can say that you won't be disappointed!

Khaleel Ahmad

To-the-point web scraping quick tutorial. I was wondering if you could edit your article and update "waitFor" to waitForSelector, because waitFor is deprecated and gives a warning in the terminal.

Khaleel Ahmad

Thank you @code_jedi for this outstanding tutorial. I was doing web scraping in Python previously and I wanted to shift to Node.js, but Node.js seems to be slower than Python bs4. I scraped this data, which is a post from technewztop, from the internet. I am using Python for scraping and PHP for auto posting. Please tell me if I can auto post from Node.js instead of PHP? Is it possible? Can we increase the speed of web scraping in Node.js? Please do reply to me.