Hey Techies! Today I'm excited to dive into a fascinating topic in the web development community: web scraping.
More specifically, we'll explore how you can use the dynamic duo of Puppeteer and Node.js to collect data from websites like a pro.
What is Web Scraping?
Let's talk about what web scraping actually is. Put simply, it's the process of extracting information from websites and storing it for further analysis or use. Whether you're building a price comparison tool, collecting market research data, or just satisfying your curiosity, web scraping can be a powerful tool in your developer toolbox.
Introducing Puppeteer
So why Puppeteer and Node.js? Puppeteer is a Node library that provides a high-level API for controlling headless Chrome or Chromium over the DevTools Protocol. Simply put, it lets you automate interactions with web pages, such as clicking buttons, filling out forms, and, yes, scraping data. And with the flexibility and versatility of Node.js, the possibilities are endless. Now let's get to work.
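To give you a quick taste of that automation before we dive in, here's a minimal sketch (the URL and the "#search" and button selectors are made up for illustration):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // type into a (hypothetical) search box and submit the form
  await page.type('#search', 'puppeteer');
  await page.click('button[type="submit"]');
  await browser.close();
})();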
Here are step-by-step instructions to help you start web scraping with Puppeteer and Node.js:
Environment Setup:
First, make sure Node.js is installed on your computer; you can download it from the official Node.js website if you haven't already. Once Node.js is set up, initialize a new Node project:
npm init -y
You can then install Puppeteer via npm:
npm install puppeteer
Scripting:
Now that your environment is ready, it's time to start coding! Create a new JavaScript file (let's call it "index.js"):
touch index.js
and import Puppeteer at the top of the file:
const puppeteer = require('puppeteer');
Start Browser:
Next, you need to launch a browser with Puppeteer. Since its API is promise-based, we'll wrap our code in an async function - launching the browser itself takes just one line:
async function scrape() {
  // launch a new headless browser instance
  const browser = await puppeteer.launch();
}
This opens a new instance of Chrome in headless mode (i.e., with no visible browser window).
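By the way, if you want to actually watch the browser while you debug, "launch()" accepts options. A quick sketch (slowMo just slows every operation down so you can follow along):

const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 50 // slow each operation down by 50ms
});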
Navigating to a Web Page:
Once you have a browser instance, you can open a new tab and navigate to any web page using Puppeteer's "newPage()" and "goto()" methods. For example:
async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // wait until the DOM is ready before continuing
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
}
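One thing worth knowing: on pages that render content with JavaScript, you often need to wait for an element to appear before scraping it. Puppeteer's "waitForSelector()" method does exactly that (the '.posts' selector here is just a placeholder):

// pause until the element we want to scrape is in the DOM
await page.waitForSelector('.posts');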
Data Scraping:
Now comes the fun part - collecting the data you need from the site. This may involve selecting elements, extracting text or attributes, and saving the data to a file or database. Puppeteer provides several methods for interacting with the page, such as "evaluate()", which runs a function in the page's context and makes scraping easy. For example, to grab the page title:
async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
  const data = await page.evaluate(() => {
    // select elements with the DOM API -- here we just
    // grab the page title
    return document.title;
  });
  return data;
}
You can also extract several pieces of data at once. For example, to collect the text and author of every post on a page (the ".posts", "p.text", and "p.author" selectors are just placeholders for your target site):

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
  const data = await page.evaluate(() => {
    // select every element with the "posts" class
    const posts = document.querySelectorAll('.posts');
    // pull the text and author out of each post
    return Array.from(posts).map((post) => {
      const text = post.querySelector('p.text').innerText;
      const author = post.querySelector('p.author').innerText;
      return { text, author };
    });
  });
  return data;
}
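Besides text, you can also extract attributes. Here's a sketch that collects the href of every link on the page (it runs the same way, inside "evaluate()"):

const links = await page.evaluate(() => {
  // grab the href attribute of every anchor tag on the page
  return Array.from(document.querySelectorAll('a')).map((a) => a.href);
});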
Closing the Browser:
When you're done collecting data, don't forget to close the browser to free up system resources. You can do this with the "browser.close()" method. Putting it all together:
async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
  const data = await page.evaluate(() => {
    // select every element with the "posts" class
    const posts = document.querySelectorAll('.posts');
    return Array.from(posts).map((post) => {
      const text = post.querySelector('p.text').innerText;
      const author = post.querySelector('p.author').innerText;
      return { text, author };
    });
  });
  // free up system resources before returning the data
  await browser.close();
  return data;
}

Finally, call the function and log the result:
scrape().then((res) => {
  console.log(res);
}).catch((error) => {
  console.error(error);
});
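And if you'd rather save the results than just log them (as mentioned earlier), here's a minimal sketch using Node's built-in fs module - output.json is just an example filename:

const fs = require('fs');

scrape().then((res) => {
  // persist the scraped data as pretty-printed JSON
  fs.writeFileSync('output.json', JSON.stringify(res, null, 2));
}).catch((error) => {
  console.error(error);
});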
And there you have it - a basic overview of web scraping with Puppeteer and Node.js! Of course, you can do a lot more with Puppeteer, from taking screenshots and creating PDF files to testing and debugging web applications. But hopefully this guide has given you a solid foundation to explore the exciting world of web scraping. Good luck scraping! 🕵️♂️✨
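Bonus: since I mentioned screenshots and PDFs, here's a quick sketch of those two APIs (the filenames are just examples, and note that "page.pdf()" only works in headless mode):

// capture the full page as an image
await page.screenshot({ path: 'example.png', fullPage: true });
// render the page as an A4 PDF
await page.pdf({ path: 'example.pdf', format: 'A4' });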
Top comments (5)
I would add that I recommend dockerizing the browser in order to optimize server-side resources.
That's something I did not know. I'm quite new to containers and Docker, so I'll implement this and see how it goes.
Thanks for the info!
Nice post!
I've written a lib that distributes tasks to workers, and one of its use cases is distributed scraping.
The master node distributes URLs to fetch, and all connected workers scrape them and return results back to the master.
Check it out if you're interested: github.com/queue-xec/master/tree/d...
❤️ Thanks a lot, and really nice work you have there!
This is great, except use Playwright.