DEV Community

Niharika Goulikar

Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

Imagine building an e-commerce platform where we can easily fetch product data in real-time from major stores like eBay, Amazon, and Flipkart. Sure, there’s Shopify and similar services, but let's be honest—it can feel a bit cumbersome to buy a subscription just for a project. So, I thought, why not scrape these sites and store the products directly in our database? It would be an efficient and cost-effective way to get products for our e-commerce projects.

What is Web Scraping?

Web scraping involves extracting data from websites by parsing the HTML of web pages to read and collect content. It often involves automating a browser or sending HTTP requests to the site, and then analyzing the HTML structure to retrieve specific pieces of information such as text, links, or images. Puppeteer is one library that can be used to scrape websites.

🟢What is Puppeteer?

Puppeteer is a Node.js library. It provides a high-level API for controlling headless Chrome or Chromium browsers. Headless Chrome is a version of Chrome that runs without a UI (perfect for running things in the background).

We can automate various tasks using Puppeteer, such as:

  • Web Scraping: Extracting content from websites involves interacting with the page's HTML and JavaScript. We typically retrieve the content by targeting the CSS selectors.
  • PDF Generation: Converting web pages into PDFs programmatically is ideal when you want to directly generate a PDF from a web page, rather than taking a screenshot and then converting the screenshot to a PDF. (P.S. Apologies if you already have workarounds for this).
  • Automated Testing: Running tests on web pages by simulating user actions like clicking buttons, filling out forms, and taking screenshots. This eliminates the tedious process of manually going through long forms to ensure everything is in place.
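To make the PDF-generation task above concrete, here is a minimal sketch. It assumes Puppeteer is installed (`npm i puppeteer`); the function name, URL, and file path are placeholders for illustration, not part of the article's project.

```javascript
// Sketch: render a web page directly to a PDF with Puppeteer.
// Assumes Puppeteer is installed; the require is done lazily inside the
// function so this file can be loaded without launching anything.
async function savePageAsPdf(url, outPath) {
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until network activity settles so the page is fully rendered
  await page.goto(url, { waitUntil: "networkidle2" });
  // Render the live page straight to a PDF file (no screenshot step needed)
  await page.pdf({ path: outPath, format: "A4" });
  await browser.close();
}

module.exports = savePageAsPdf;

// Usage (uncomment to run):
// savePageAsPdf("https://example.com", "example.pdf");
```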

🌟How to get started with Puppeteer?

First, we have to install the library. Go ahead and do this.
Using npm:

npm i puppeteer # Downloads compatible Chrome during installation.
npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.

Using yarn:

yarn add puppeteer # Downloads compatible Chrome during installation.
yarn add puppeteer-core # Alternatively, install as a library, without downloading Chrome.

Using pnpm:

pnpm add puppeteer # Downloads compatible Chrome during installation.
pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.

🛠 Example to demonstrate the use of Puppeteer

Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)

const puppeteer = require("puppeteer");
const CategorySchema = require("./models/Category");

// Define the scrape function as a named async function
const scrape = async () => {
    // Launch a new browser instance
    const browser = await puppeteer.launch({ headless: false });

    // Open a new page
    const page = await browser.newPage();

    // Navigate to the target URL and wait until the DOM is fully loaded
    await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens%20sport%20wear', { waitUntil: 'domcontentloaded' });

    // Wait for additional time to ensure all content is loaded
    await new Promise((resolve) => setTimeout(resolve, 25000));

    // Extract product details from the page
    const items = await page.evaluate(() => {
        // Select all product elements
        const elements = document.querySelectorAll('.product-base');
        const elementsArray = Array.from(elements);

        // Map each element to an object with the desired properties
        const results = elementsArray.map((element) => {
            const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src");
            return {
                image: image ?? null,
                brand: element.querySelector(".product-brand")?.textContent,
                title: element.querySelector(".product-product")?.textContent,
                discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent,
                actualPrice: element.querySelector(".product-price .product-strike")?.textContent,
                discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1),
                total: 20, // Placeholder value, adjust as needed
                available: 10, // Placeholder value, adjust as needed
                ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration
            };
        });

        return results; // Return the list of product details
    });

    // Close the browser
    await browser.close();

    // Prepare the data for saving
    const data = {
        category: "mens-sport-wear",
        subcategory: "Mens",
        list: items
    };

    // Create a new Category document and save it to the database
    // Since we want to store product information in our e-commerce store, we use a schema and save it to the database.
    // If you don't need to save the data, you can omit this step.
    const category = new CategorySchema(data);
    console.log(category);
    await category.save();

    // Return the scraped items
    return items;
};

// Export the scrape function as the default export
module.exports = scrape;

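One note on the fixed 25-second `setTimeout` in the code above: it works, but it always waits the full 25 seconds even when the page loads faster. Puppeteer's `page.waitForSelector` resumes as soon as the target elements appear. A small sketch of that alternative, reusing the same `.product-base` selector from the example:

```javascript
// Sketch: instead of sleeping a fixed 25 seconds, wait until at least one
// product card (".product-base", the selector from the example above) is
// present in the DOM, or fail after a timeout.
async function waitForProducts(page, timeoutMs = 30000) {
  await page.waitForSelector(".product-base", { timeout: timeoutMs });
}

module.exports = waitForProducts;

// In the scraper, you would replace the setTimeout line with:
// await waitForProducts(page);
```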

🌄Explanation:

  • In this code, we are using Puppeteer to scrape product data from a website. After extracting the details, we create a schema (CategorySchema) to structure and save this data into our database. This step is particularly useful if we want to integrate the scraped products into our e-commerce store. If storing the data in a database is not required, you can omit the schema-related code.
  • Before scraping, it's important to understand the HTML structure of the page and identify which CSS selectors contain the content you want to extract.
  • In my case, I used the relevant CSS selectors identified on the Myntra website to extract the content I was targeting.
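The chained expression that extracts `discountPercentage` in the code above (`split(' ')[0]?.slice(1, -1)`) is dense, so here is the same string manipulation as a standalone function. The `"(45% OFF)"` input format is an assumption about what Myntra renders inside that selector, based on how the article's expression slices it:

```javascript
// Standalone version of the discountPercentage extraction from the scraper.
// Assumes the element's text looks like "(45% OFF)".
function parseDiscountPercentage(text) {
  if (!text) return null;
  const firstToken = text.split(" ")[0]; // "(45% OFF)" -> "(45%"
  return firstToken.slice(1, -1);        // drop "(" and "%" -> "45"
}

module.exports = parseDiscountPercentage;

// parseDiscountPercentage("(45% OFF)") -> "45"
```

Pulling logic like this out of `page.evaluate` makes it easy to unit-test without launching a browser at all.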

Top comments (26)

Rohan Sharma

Thanks for this. I was waiting for this kind of explanation!

Niharika Goulikar

I'm delighted to know that you found this helpful!

Niharika Goulikar

Hey guys, let me know your thoughts on this...

Jotty John

Great!

Jose Bernard Lagumbay

I'm using this for web scraping. Very nice and detailed explanation.

Priya Yadav

Helpful and thanks di for sharing this😊🤩

Niharika Goulikar

Most welcome priyaaa!

Otto Aleski

So helpful Niharikaa!

Shohanur Rahman Sabber

Amazing!

Harshika

Awesome explanation!

Anisa

Excited to try this out!

Ilya Belous • Edited

Oh, where do I even start? "Web scraping made easy"? With Puppeteer? Really? Sure, if "easy" means spinning up a headless browser and having a memory footprint that rivals Chrome’s absurd hunger for RAM. Let’s be real: Puppeteer is like bringing a bulldozer to plant a flower. Overkill much? Not to mention that Puppeteer scrapes are notoriously fragile. One small change in the target site's structure, and boom! Your scraper falls apart like a house of cards.

And let's not get started on performance. Spawning a browser instance just to scrape HTML when simpler, more efficient solutions like Cheerio or Axios exist is like saying, "Nah, I don't care about scaling or resources." I mean, when you want to parse some basic HTML, using Puppeteer is like trying to hack an egg with a chainsaw. It works, but why?

Oh, and that assumption that it’s "easy"? Tell that to someone trying to debug Puppeteer's often cryptic error messages. Sure, Puppeteer can be handy, but calling it "easy" is like saying skydiving is "just falling."

Niharika Goulikar

I get where you're coming from, but let's put things in perspective. You're right—Puppeteer can feel like overkill if all you need is to scrape some basic HTML. Tools like Cheerio or Axios are indeed more lightweight and can handle simpler tasks without the overhead of a headless browser.

Sure, it's not the go-to for every scraping job, and yes, it has a learning curve. But for cases where you need to interact with a site as a real user would—clicking buttons, waiting for elements to load, bypassing CAPTCHAs, etc.—Puppeteer is invaluable. It’s not the easiest tool for every use case, but in the right hands and for the right job, it’s incredibly powerful.

The fragility you mentioned? That’s true for most scraping tools. Websites change, and scrapers break—whether you’re using Puppeteer, Cheerio, or anything else. It’s the nature of the beast. Debugging can be tricky, but that’s the trade-off for flexibility and power.

So, yeah, it’s not always the simplest option, but dismissing Puppeteer as overkill ignores the complex scenarios where it's not just useful but necessary. It’s about choosing the right tool for the job, and sometimes, you need that chainsaw.

Ilya Belous

fair play mate

Alois Sečkár

If you need to emulate the browser to get the web page client-side rendered, how to do it without a tool like Puppeteer? I am really curious, because I am looking for alternatives.

Ilya Belous

To emulate a browser and handle client-side rendering without a tool like Puppeteer, you have a few alternatives depending on the use case. One common method is using headless browsers like Playwright, which is similar to Puppeteer but offers additional features, such as better cross-browser support (Chromium, Firefox, and WebKit).

If you're looking for something lightweight, consider Selenium, though it might not be as fast or efficient for heavy-duty scraping or automation tasks. Another option is Scrapy with a middleware like Splash, which can handle JavaScript-rendered pages, though it's more tailored to web scraping.

If you're working with React or similar front-end frameworks and want to avoid full browser emulation, you can explore static rendering approaches using server-side rendering (SSR) with tools like Next.js or even Prerender.io, which can generate static HTML content from JavaScript apps.