Hey Techies! Today I'm excited to dive into a fascinating topic in the web development community: web scraping.
More specifically, we'll explore how you can use the dynamic duo of Puppeteer and Node.js to collect data from websites like a pro.
What is Web Scraping?
Let's talk about what web scraping actually is. Put simply, it's the process of extracting information from websites and storing it for further analysis or use. Whether you're building a price comparison tool, collecting market research data, or just satisfying your curiosity, web scraping can be a powerful tool in your developer toolbox.
Introducing Puppeteer
So why Puppeteer and Node.js? Puppeteer is a Node library that provides a high-level API for controlling headless Chrome or Chromium over the DevTools Protocol. Simply put, it lets you automate interactions with web pages, such as clicking buttons, filling out forms, and, yes, scraping data. And with the flexibility and versatility of Node.js, the possibilities are endless. Now let's get to work.
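To give you a quick taste of that automation before we dive in, here's a minimal sketch (the URL and the "#search" and button selectors are made up for illustration):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // type into a (hypothetical) search box and submit the form
  await page.type('#search', 'puppeteer');
  await page.click('button[type="submit"]');
  await browser.close();
})();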
Here are step-by-step instructions to help you start web scraping with Puppeteer and Node.js:
Environment Setup:
First, make sure Node.js is installed on your computer; you can download it from the official Node.js website if you haven't already. Once Node.js is set up, initialize a new Node project:
npm init -y
You can then install Puppeteer via npm:
npm install puppeteer
Scripting:
Now that your environment is ready, it's time to start coding! Create a new JavaScript file (let's call it "index.js"):
touch index.js
and import Puppeteer at the top of the file:
const puppeteer = require('puppeteer');
Start Browser:
Next, you need to launch a browser with Puppeteer. Since its API is promise-based, we'll wrap our code in an async function - launching the browser itself takes just one line:
async function scrape() {
  // launch a new headless browser instance
  const browser = await puppeteer.launch();
}
This opens a new instance of Chrome in headless mode (i.e., with no visible browser window).
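By the way, if you want to actually watch the browser while you debug, "launch()" accepts options. A quick sketch (slowMo just slows every operation down so you can follow along):

const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 50 // slow each operation down by 50ms
});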
Navigating to a Web Page:
Once you have a browser instance, you can open a new tab and navigate to any web page using Puppeteer's "newPage()" and "goto()" methods. For example:
async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // wait until the DOM is ready before continuing
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
}
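One thing worth knowing: on pages that render content with JavaScript, you often need to wait for an element to appear before scraping it. Puppeteer's "waitForSelector()" method does exactly that (the '.posts' selector here is just a placeholder):

// pause until the element we want to scrape is in the DOM
await page.waitForSelector('.posts');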
Data Scraping:
Now comes the fun part - collecting the data you need from the site. This may involve selecting elements, extracting text or attributes, and saving the data to a file or database. Puppeteer provides several methods for interacting with the page, such as "evaluate()", which runs a function in the page's context and makes scraping easy. For example, to grab the page title:
async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
  const data = await page.evaluate(() => {
    // select elements with the DOM API -- here we just
    // grab the page title
    return document.title;
  });
  return data;
}
You can also extract several pieces of data at once. For example, to collect the text and author of every post on a page (the ".posts", "p.text", and "p.author" selectors are just placeholders for your target site):

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
  const data = await page.evaluate(() => {
    // select every element with the "posts" class
    const posts = document.querySelectorAll('.posts');
    // pull the text and author out of each post
    return Array.from(posts).map((post) => {
      const text = post.querySelector('p.text').innerText;
      const author = post.querySelector('p.author').innerText;
      return { text, author };
    });
  });
  return data;
}
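Besides text, you can also extract attributes. Here's a sketch that collects the href of every link on the page (it runs the same way, inside "evaluate()"):

const links = await page.evaluate(() => {
  // grab the href attribute of every anchor tag on the page
  return Array.from(document.querySelectorAll('a')).map((a) => a.href);
});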
Closing the Browser:
When you're done collecting data, don't forget to close the browser to free up system resources. You can do this with the "browser.close()" method. Putting it all together:
async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded'
  });
  const data = await page.evaluate(() => {
    // select every element with the "posts" class
    const posts = document.querySelectorAll('.posts');
    return Array.from(posts).map((post) => {
      const text = post.querySelector('p.text').innerText;
      const author = post.querySelector('p.author').innerText;
      return { text, author };
    });
  });
  // free up system resources before returning the data
  await browser.close();
  return data;
}

Finally, call the function and log the result:
scrape().then((res) => {
  console.log(res);
}).catch((error) => {
  console.error(error);
});
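And if you'd rather save the results than just log them (as mentioned earlier), here's a minimal sketch using Node's built-in fs module - output.json is just an example filename:

const fs = require('fs');

scrape().then((res) => {
  // persist the scraped data as pretty-printed JSON
  fs.writeFileSync('output.json', JSON.stringify(res, null, 2));
}).catch((error) => {
  console.error(error);
});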
And there you have it - a basic overview of web scraping with Puppeteer and Node.js! Of course, you can do a lot more with Puppeteer, from taking screenshots and creating PDF files to testing and debugging web applications. But hopefully this guide has given you a solid foundation to explore the exciting world of web scraping. Good luck scraping! 🕵️♂️✨
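Bonus: since I mentioned screenshots and PDFs, here's a quick sketch of those two APIs (the filenames are just examples, and note that "page.pdf()" only works in headless mode):

// capture the full page as an image
await page.screenshot({ path: 'example.png', fullPage: true });
// render the page as an A4 PDF
await page.pdf({ path: 'example.pdf', format: 'A4' });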
Top comments (5)
I would add that I recommend dockerizing the browser in order to optimize server-side resources.
That's something I did not know. I'm quite new to containers and Docker, so I'll implement this and see how it goes.
Thanks for the info!
Nice post!
I've written a lib that distributes tasks to workers, and one of its use cases is distributed scraping.
The master node distributes URLs to fetch, and all connected workers scrape them and return results back to the master.
Check it out if you're interested: github.com/queue-xec/master/tree/d...
❤️ Thanks a lot, and really nice work you have there!
This is great, except use Playwright.