Scraping webpages is really simple and elegant with Puppeteer. Let's scrape Codesnacks and get all the links on the page, together with their anchor text.
We can do this easily with Puppeteer. There's no need to fetch the document first and parse it: you just let Puppeteer visit the page and run your own JavaScript in the context of the page. The best way to do this is to first run the snippet in your browser's console and, once you've made sure everything works as planned, copy it into the script.
// npm i puppeteer
const puppeteer = require("puppeteer");

// we're using async/await - so we need an async function that we can run
const run = async () => {
  // open the browser and prepare a page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // open the page to scrape
  await page.goto("https://codesnacks.net");

  // execute the JS in the context of the page to get all the links
  const links = await page.evaluate(() =>
    // let's just get all links and create an array from the resulting NodeList
    Array.from(document.querySelectorAll("a")).map(anchor => [anchor.href, anchor.textContent])
  );

  // output all the links
  console.log(links);

  // close the browser
  await browser.close();
};

// run the async function
run();
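One small addition that isn't in the original script: if page.goto or page.evaluate throws, the browser is never closed and the headless Chrome process keeps running. Here's a minimal sketch of the same run function with a try/finally around the scraping steps:

const run = async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto("https://codesnacks.net");
    const links = await page.evaluate(() =>
      Array.from(document.querySelectorAll("a")).map(anchor => [anchor.href, anchor.textContent])
    );
    console.log(links);
  } finally {
    // close the browser even if one of the steps above throws
    await browser.close();
  }
};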
Before Puppeteer existed, you had to stitch together several tools, as sketched below:
- a library to fetch the document (e.g. axios or node-fetch)
- a parser to parse the HTML and access the DOM nodes (e.g. cheerio)
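For comparison, a minimal sketch of that older approach (assuming axios and cheerio are installed; it only ever sees the server-rendered HTML):

// npm i axios cheerio
const axios = require("axios");
const cheerio = require("cheerio");

const run = async () => {
  // fetch the raw HTML of the page
  const { data: html } = await axios.get("https://codesnacks.net");

  // parse it and query the DOM nodes with a jQuery-like API
  const $ = cheerio.load(html);
  const links = [];
  $("a").each((_, anchor) => {
    links.push([$(anchor).attr("href"), $(anchor).text()]);
  });

  console.log(links);
};

run();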
The problem with this approach was that dynamically rendered pages were even harder to scrape. That's no issue with Puppeteer, since it actually uses Chrome - just headless.
Top comments
Hello, I'm trying to scrape a website in C#, but I have problems getting some values from it. In Chrome I can see those values, but because they are rendered dynamically I can't see them in my program. What should I do in C# to get those dynamic values?
Since there is no native way to do it, here is a small bash script:
Then you have the document, yes. But you haven't parsed or scraped anything, and you don't interpret the page's JavaScript - you just get the static HTML. That's not what this tutorial is about ;)
I understand, I'm still a noob. I did a code challenge recently where, without using any modules, I had to figure out how to write the HTML of a webpage to stdout. You're right, it works great on a static site. I tried
curl -s $1 | grep -Po '(?<=href=")[^"]*'
and I almost got everything. Thanks for the tutorial.
It's a different use case. With Puppeteer you can also scrape content that's rendered with JavaScript on the client. A lot of applications are client-side only, and scraping them isn't possible with curl.
It's also way easier to write DOM selectors than regular expressions. Imagine, instead of just getting all links like in this simple example, getting all links of every first paragraph inside a div that sits inside an article tag. Good luck writing a regular expression for that. The selector is still easy to write and can be used within the page context with Puppeteer - see the sketch below.
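To make that concrete, here's a sketch of that case using the same pattern as the tutorial; the exact CSS selector (article div p:first-of-type a) is an assumption about the markup you'd be targeting:

const puppeteer = require("puppeteer");

const run = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://codesnacks.net");

  // anchors inside the first paragraph of each div that sits inside an article tag
  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll("article div p:first-of-type a")).map(anchor => [
      anchor.href,
      anchor.textContent,
    ])
  );

  console.log(links);
  await browser.close();
};

run();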