Serpdog

Posted on Sep 24, 2022 • Edited on Oct 29, 2022 • Originally published at serpdog.io

Web Scraping Google With Node JS - A Complete Guide

#javascript #tutorial #beginners #node

Introduction

In this post, we will learn web scraping Google with Node JS using some of the in-demand web scraping and web parsing libraries present in Node JS.

This article will be helpful to beginners who want to make their career in web scraping, data extraction, or mining in Javascript. Web Scraping can open many opportunities to developers around the world. As we say, “Data is the new oil.”

So, here we end the introduction and get started with our long tutorial on scraping Google with Node JS.

Before we start with the tutorial, let me explain the headers and their importance in scraping.

What are HTTP Headers?

Headers are an important part of an HTTP request or response that provides some additional meta-information about the request or response.

Header names are case-insensitive, and each header is written as a name and a value separated by a single colon in a text string format, for example content-type: text/html.
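To make the "name, colon, value" format concrete, here is a minimal sketch that parses raw header lines into an object. The function name parseHeaders is made up for this illustration, and real HTTP libraries already do this for you.

```javascript
// Parse raw "Name: value" header lines into an object keyed by lowercase name.
// parseHeaders is a hypothetical helper, not part of any library.
function parseHeaders(rawLines) {
  const headers = {};
  for (const line of rawLines) {
    const idx = line.indexOf(":"); // split on the first colon only
    if (idx === -1) continue; // skip lines that are not headers
    const name = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    headers[name] = value;
  }
  return headers;
}

const parsed = parseHeaders([
  "Content-Type: text/html",
  "DATE: Fri, 16 Sep 2022 19:49:16 GMT",
]);

console.log(parsed["content-type"]); // text/html
console.log(parsed["date"]); // Fri, 16 Sep 2022 19:49:16 GMT
```

Lowercasing the names is what lets us look up Content-Type and DATE without caring how the sender capitalized them.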

Headers play an important role in web scraping. Website owners know that their data can be extracted in many different ways, so they implement tools and strategies to protect their websites from being scraped by bots.

Scrapers with non-optimized headers often fail to scrape these types of websites. But when you pass the correct headers, your bot mimics a real user and is much more likely to scrape quality data from the website. Thus, optimized headers can help save your IPs from being blocked by these websites.

Headers can be classified into four different categories:

Request Headers
Response Headers
Representation Headers
Payload Headers

HTTP Request Header

These are the headers sent by the client while fetching data from the server. Like other headers, they consist of key-value pairs in a text string format. The server can identify the request sender with the help of the information in the request headers.

The below example shows some of the request headers:

authority: accounts.google.com
method: GET
accept-language: en-US
origin: https://www.geeksforgeeks.org
referer: https://www.geeksforgeeks.org/
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36

HTTP Request header contains various information like:

The browser version of the request sender.
Requested page URL
Platform from which the request is sent.

HTTP Response Header

The headers sent back by the server after it receives the request from the user are known as Response Headers. They contain information like the date and time and the type of file sent back by the server. They also include information about the server that generated the response.

The below example shows some of the response headers:

content-length: 27098
content-type: text/html
date: Fri, 16 Sep 2022 19:49:16 GMT
server: nginx
cache-control: max-age=21584

Content-Encoding lists all the encodings that are applied to the representation. Content-Length is the size, in bytes, of the body received by the user. The Content-Type header indicates the media type of the resource.

HTTP Representation Header

The representation header describes the type of resource sent in an HTTP message body. The data transferred can be in any format, such as JSON, XML, HTML, etc. These headers tell the client about the data format they received.

The below example shows some of the representation headers:

content-encoding: gzip
content-length: 27098
content-type: text/html
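As a rough illustration of why these headers matter, the sketch below uses the content-encoding value to decide how to decode a body. It only uses Node's built-in zlib module, the decodeBody helper is hypothetical, and the "response" is simulated locally instead of coming from a real server. Most HTTP clients handle this step for you.

```javascript
const zlib = require("zlib");

// Decode a response body according to its content-encoding header.
// Only gzip is handled in this sketch; anything else is treated as plain bytes.
function decodeBody(bodyBuffer, headers) {
  const encoding = (headers["content-encoding"] || "").toLowerCase();
  const raw = encoding === "gzip" ? zlib.gunzipSync(bodyBuffer) : bodyBuffer;
  return raw.toString("utf8");
}

// Simulate a gzip-compressed HTML response instead of calling a real server.
const compressed = zlib.gzipSync(Buffer.from("<html>hello</html>", "utf8"));
const html = decodeBody(compressed, {
  "content-encoding": "gzip",
  "content-type": "text/html",
});

console.log(html); // <html>hello</html>
```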

HTTP Payload Headers

Understanding the payload headers is quite difficult, so first you should know about the meaning of payload, then we will come to an explanation.

What is Payload?

When data is transferred from a server to a recipient, the payload is the actual message content that the recipient is meant to receive, as opposed to the metadata that accompanies it.

Payload Headers are the HTTP headers that carry information about the payload of the original resource representation. They consist of information about the content length and range of the message, any encoding applied in the transfer of the message, etc.

The below example shows some of the payload headers:

content-length: 27098
content-range: bytes 200-1000/67589
trailer: Expires
transfer-encoding: chunked

The content-range header indicates where in a full body message a partial message belongs. The transfer-encoding header specifies the form of encoding used to safely transfer the payload body to the user.
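To see what a content-range value actually encodes, here is a small sketch that splits it into its start, end, and total parts. The helper name parseContentRange is an assumption for this example, and it only handles the "bytes" unit shown above.

```javascript
// Split a value like "bytes 200-1000/67589" into numbers.
// parseContentRange is a hypothetical helper; it returns null for values it does not understand.
function parseContentRange(value) {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value.trim());
  if (!match) return null;
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    // "*" means the total size is unknown
    total: match[3] === "*" ? null : Number(match[3]),
  };
}

const range = parseContentRange("bytes 200-1000/67589");
console.log(range); // { start: 200, end: 1000, total: 67589 }
```

In other words, the example header says this partial message covers bytes 200 to 1000 of a body that is 67589 bytes long in total.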

User Agent

User-Agent identifies the application, operating system, vendor, and version of the requesting user agent. It helps us mimic a real user, which can save our IP from being blocked by Google. It is one of the main headers we use while scraping Google Search Results.

It looks like this:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
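If you want to see which pieces of information sit inside such a string, the following sketch pulls out the platform part and the Chrome major version with two regular expressions. The describeUserAgent helper is made up for this post and is only a rough approach, since User-Agent strings are not strictly standardized.

```javascript
// Extract the platform details and Chrome major version from a User-Agent string.
// describeUserAgent is a hypothetical helper for illustration only.
function describeUserAgent(ua) {
  const platform = /\(([^)]+)\)/.exec(ua); // first parenthesized group
  const chrome = /Chrome\/(\d+)\.[\d.]+/.exec(ua);
  return {
    platform: platform ? platform[1] : null,
    chromeMajor: chrome ? Number(chrome[1]) : null,
  };
}

const info = describeUserAgent(
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
);

console.log(info); // { platform: 'X11; Linux x86_64', chromeMajor: 51 }
```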

Unirest

In this section, we will learn about the Node JS library Unirest which will help us scrape Google Search Results. We will discuss the need for this library and the disadvantages associated with it.

Unirest is a lightweight HTTP library available in many languages, including Java, Python, PHP, .NET, etc. Kong currently maintains Unirest JS. It is also one of the most popular JavaScript libraries for web scraping. It lets us make all types of HTTP requests to scrape the data on the requested page.

Let us take an example of how we can scrape Google using this library:

npm i unirest

Then we will make a request on our target URL:

const unirest = require("unirest");

function getData() {
  const url = "https://www.google.com/search?q=javascript&gl=us&hl=en";

  // Send a browser-like User-Agent so the request looks like a real user.
  const header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Viewer/96.9.4688.89",
  };

  return unirest
    .get(url)
    .headers(header)
    .then((response) => {
      // response.body holds the raw HTML of the results page.
      console.log(response.body);
    });
}

getData();

Step-by-step explanation after header declaration:

get() is used to make a GET request to our target URL.
headers() is used to pass HTTP request headers along with the request.
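The target URL in the example is written by hand. As a small sketch, it can also be built from its parts with Node's built-in URL class, which takes care of encoding spaces and special characters. The buildSearchUrl helper is a name made up for this post; q, gl, and hl are the same query parameters used in the URL above.

```javascript
// Build a Google search URL from a query and optional extra parameters.
// buildSearchUrl is a hypothetical helper, not part of Unirest.
function buildSearchUrl(query, options = {}) {
  const url = new URL("https://www.google.com/search");
  url.searchParams.set("q", query);
  for (const [key, value] of Object.entries(options)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

console.log(buildSearchUrl("javascript", { gl: "us", hl: "en" }));
// https://www.google.com/search?q=javascript&gl=us&hl=en

console.log(buildSearchUrl("life insurance"));
// https://www.google.com/search?q=life+insurance
```

The returned string can be passed straight to get().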

This block of code will print the raw HTML of the Google results page.

Unreadable, right? Don’t worry. We will be discussing a web parsing library in a bit.

As we know, Google can block our request if we request with the same User Agent each time. So, if you want to rotate User-Agents on each request, let us define a function that will return random User-Agent strings from the User-Agent array.

const selectRandom = () => {
  const userAgents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
  ];
  // Pick a random index into the userAgents array.
  const randomNumber = Math.floor(Math.random() * userAgents.length);
  return userAgents[randomNumber];
};

const user_agent = selectRandom();
const header = {
  "User-Agent": user_agent,
};

This logic will ensure we don’t have to use the same User-Agents each time.
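Random selection can still return the same User-Agent twice in a row. If that matters to you, one alternative (a different technique from the random pick above) is a simple round-robin rotator. This is only a sketch: createUserAgentRotator is a made-up name, and the short "UA-one" style strings are placeholders for real User-Agent strings.

```javascript
// Return a function that cycles through the given User-Agents in order.
// createUserAgentRotator is a hypothetical helper for illustration.
function createUserAgentRotator(userAgents) {
  let index = 0;
  return () => {
    const ua = userAgents[index];
    index = (index + 1) % userAgents.length; // wrap around at the end
    return ua;
  };
}

// Placeholder values; use real User-Agent strings in practice.
const nextUserAgent = createUserAgentRotator(["UA-one", "UA-two", "UA-three"]);

const picks = [nextUserAgent(), nextUserAgent(), nextUserAgent(), nextUserAgent()];
console.log(picks); // [ 'UA-one', 'UA-two', 'UA-three', 'UA-one' ]
```

Each call to nextUserAgent() gives the next string, so consecutive requests never share a User-Agent as long as the pool has more than one entry.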

Advantages of Unirest:

It has proxy support.
It supports all HTTP request methods(GET,POST,DELETE,etc).
It supports form downloads.
It supports TLS/SSL protocol.
It supports HTTP authentication.

Axios

Axios is a promise-based HTTP client for Node JS and browsers and one of the most popular and powerful JavaScript libraries. It makes XMLHttpRequests from the browser and HTTP requests from Node JS. It also has client-side support for protecting against CSRF.

Let us take an example of how we can use Axios for web scraping:

npm i axios

The below block of code will return the same HTML file we saw in the Unirest section.

const axios = require("axios");

const headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Viewer/96.9.4688.89",
};

// Headers go inside the config object passed as the second argument.
axios
  .get("https://www.google.com/search?q=javascript&gl=us&hl=en", { headers })
  .then((response) => {
    // Axios exposes the response body as response.data.
    console.log(response.data);
  })
  .catch((e) => {
    console.log(e);
  });

Advantages of Axios:

It can support old browsers also, indicating wider browser support.
It supports response timeout.
It can support multiple requests at the same time.
It can intercept HTTP requests.
Most important for developers, it has brilliant community support.
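To show how the response timeout fits in, here is a minimal sketch of the config object that Axios accepts as the second argument of axios.get(). The buildAxiosConfig helper and the 5000 ms value are assumptions made for this example; headers and timeout are standard Axios request options. The actual request is left as a comment so the snippet runs without a network call.

```javascript
// Build the config object for axios.get(url, config).
// buildAxiosConfig is a hypothetical helper, not part of Axios.
function buildAxiosConfig(userAgent, timeoutMs = 10000) {
  return {
    headers: { "User-Agent": userAgent },
    timeout: timeoutMs, // milliseconds to wait before the request is aborted
  };
}

const config = buildAxiosConfig(
  "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36",
  5000
);
console.log(config.timeout); // 5000

// Usage, assuming axios is installed:
// const axios = require("axios");
// axios
//   .get("https://www.google.com/search?q=javascript&gl=us&hl=en", config)
//   .then((response) => console.log(response.data))
//   .catch((e) => console.log(e.message));
```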

Cheerio

Cheerio is a web parsing library that can parse any HTML and XML document. It implements a subset of jQuery, which is why its syntax is quite similar to jQuery.

Manipulating and rendering the markup can be done very fast with the help of Cheerio. It doesn’t produce a visual rendering, apply CSS, load external resources, or execute Javascript.

Let us take a small example of how we can use Cheerio to parse the Google ads search results.

You can install Cheerio by running the below command in your terminal.

npm i cheerio

Now, we will prepare our parser by finding the CSS selectors using the SelectorGadget extension. Watch the tutorial on the selector gadget website if you want to learn how to use it.

Let us first scrape the HTML with the help of unirest and make a cheerio instance for parsing the HTML.

const cheerio = require("cheerio");
const unirest = require("unirest");

const getData = async () => {
  try {
    const url = "https://www.google.com/search?q=life+insurance";

    const response = await unirest.get(url).headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    });

    // Load the scraped HTML into a Cheerio instance for parsing.
    const $ = cheerio.load(response.body);

In the last line, we created a constant and loaded the scraped HTML into it. Inspecting the page with SelectorGadget shows that each ad result sits under the class .uEierd.

We will scrape the ad's title, snippet, link, displayed link, and site links.

Inspecting an ad in the same way gives the class .v0nnCb for the title (the text sits in a span inside it), .lyLwlc for the snippet, and .qzEoUe for the displayed link.

And if you inspect the title, you will find the tag for the link to be a.sVXRqc.

After searching all the tags, our code will look like this:

let ads = [];
$("#tads .uEierd").each((i, el) => {
  ads[i] = {
    title: $(el).find(".v0nnCb span").text(),
    snippet: $(el).find(".lyLwlc").text(),
    displayed_link: $(el).find(".qzEoUe").text(),
    link: $(el).find("a.sVXRqc").attr("href"),
  };
});

Now, let us find the tags for the sitelinks. If we follow the same process for the sitelink titles, snippets, and links, our code will look like this (it goes inside the .each() callback shown earlier, where el and i are defined):

let sitelinks = [];
// Collect sitelinks only when the ad has a sitelinks section.
if ($(el).find(".UBEOKe").length) {
  $(el).find(".MhgNwc").each((j, sitelinkEl) => {
    sitelinks.push({
      title: $(sitelinkEl).find("h3").text(),
      link: $(sitelinkEl).find("a").attr("href"),
      snippet: $(sitelinkEl).find(".lyLwlc").text(),
    });
  });
  ads[i].sitelinks = sitelinks;
}

And our results:

Complete Code:

const cheerio = require("cheerio");
const unirest = require("unirest");

const getData = async () => {
  try {
    const url = "https://www.google.com/search?q=life+insurance";

    const response = await unirest.get(url).headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    });

    const $ = cheerio.load(response.body);

    let ads = [];

    // Each ad result sits in a .uEierd element inside the #tads block.
    $("#tads .uEierd").each((i, el) => {
      let sitelinks = [];
      ads[i] = {
        title: $(el).find(".v0nnCb span").text(),
        snippet: $(el).find(".lyLwlc").text(),
        displayed_link: $(el).find(".qzEoUe").text(),
        link: $(el).find("a.sVXRqc").attr("href"),
      };

      // Collect sitelinks only when the ad has a sitelinks section.
      if ($(el).find(".UBEOKe").length) {
        $(el).find(".MhgNwc").each((j, sitelinkEl) => {
          sitelinks.push({
            title: $(sitelinkEl).find("h3").text(),
            link: $(sitelinkEl).find("a").attr("href"),
            snippet: $(sitelinkEl).find(".lyLwlc").text(),
          });
        });
        ads[i].sitelinks = sitelinks;
      }
    });

    console.log(ads);
  } catch (e) {
    console.log(e);
  }
};

getData();

You can see how easy it is to use Cheerio JS for parsing HTML. Similarly, we can use Cheerio with other web scraping libraries like Axios, Puppeteer, Playwright, etc.
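Scraped text often carries stray whitespace, and a selector that no longer matches leaves you with empty fields. As a hedged sketch of a clean-up step you might run on the ads array, the code below trims the text and drops entries without a title or link. The cleanAds helper and the sample data are made up for this illustration; they are not real scraped results.

```javascript
// Trim text fields and drop entries that have no title or no link.
// cleanAds is a hypothetical post-processing helper, not part of Cheerio.
function cleanAds(ads) {
  return ads
    .map((ad) => ({
      ...ad,
      title: (ad.title || "").trim(),
      snippet: (ad.snippet || "").trim(),
    }))
    .filter((ad) => ad.title !== "" && Boolean(ad.link));
}

// Made-up sample data in the same shape as the ads array above.
const sampleAds = [
  { title: "  Acme Life Insurance \n", snippet: " Get a quote today. ", link: "https://example.com/life" },
  { title: "", snippet: "No title was found", link: "https://example.com/empty" },
  { title: "No link", snippet: "", link: undefined },
];

const cleaned = cleanAds(sampleAds);
console.log(cleaned.length); // 1
console.log(cleaned[0].title); // Acme Life Insurance
```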


Advantages of Cheerio:

Cheerio implements a subset of jQuery. It reveals its gorgeous API by removing all the DOM inconsistencies from jQuery.
Cheerio JS is incredibly fast as it doesn’t produce visual rendering, apply CSS, load external resources, or execute Javascript, which is common in single-page applications.
It can parse nearly any HTML and XML document.

Headless Browsers

Gone are the days when websites were built with only HTML and CSS. Nowadays, interaction on modern websites can be handled entirely by JavaScript. SPAs (single-page applications) in particular, built on frameworks like React, Next, and Angular, rely heavily on JavaScript for rendering dynamic content.

But when doing web scraping, the content we require is sometimes rendered by Javascript, which is not accessible from the HTML response we get from the server.

And that’s where the headless browser comes into play. Let’s discuss some of the Javascript libraries which use headless browsers for web automation and scraping.

Puppeteer

Puppeteer is a Google-designed Node JS library that provides a high-level API that enables you to control Chrome or Chromium browsers.

Here are some features associated with Puppeteer JS:

It can be used to crawl single-page applications and can generate pre-rendered content, i.e., server-side rendering.
It works in the background and performs actions as directed by the API.
It can generate screenshots and PDFs of web pages.
It can be used to automate form submission and UI testing.

Let us take an example of how we can scrape Google Books Results using Puppeteer JS. We will scrape the book title, image, description, and writer.

First, install puppeteer by running the below command in your project terminal:

npm i puppeteer

Now, let us create a web crawler by launching the puppeteer in a non-headless mode.


const puppeteer = require("puppeteer");

// Run the lines below inside an async function (see the complete code later in this section).
const url = "https://www.google.com/search?q=merchant+of+venice&gl=us&tbm=bks";

const browser = await puppeteer.launch({
  headless: false,
  args: ["--disable-setuid-sandbox", "--no-sandbox"],
});
const page = await browser.newPage();
await page.setExtraHTTPHeaders({
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Agency/97.8.6287.88",
});

await page.goto(url, { waitUntil: "domcontentloaded" });

What each line of code says:

puppeteer.launch() - This will launch the browser in non-headless mode.
browser.newPage() - This will open a new tab in the browser.
page.setExtraHTTPHeaders() - This will send the given extra HTTP headers with every request the page makes.
page.goto() - This will navigate us to our target URL page.

Now, let us find the CSS selector for the book title. Inspecting a result shows that the title uses the selector .DKV0Md and that each book sits in a container with the class .Yr5TG.

We will paste this in our code:

let books_results = [];

books_results = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
    return {
      title: el.querySelector(".DKV0Md")?.textContent,
    };
  });
});

Here I have used the page.evaluate() function, which runs the given function in the page's context and returns the result.

Then I selected the parent element of the title, which is also the parent of the other things we want to scrape (image, writer, description, etc., as stated above), using the document.querySelectorAll() method.

And finally, we selected the title from the elements present in the parent handler container with the help of querySelector(). The textContent will allow us to grab the text inside the selected element.

We will select the other elements just in the same way as we selected the title. Now, let us find the tag for the writer.

books_results = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
    return {
      title: el.querySelector(".DKV0Md")?.textContent,
      writers: el.querySelector(".N96wpd")?.textContent,
    };
  });
});

Let us find the tag for our description as well.

  let books_results = [];

    books_results = await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
            return {    
                title: el.querySelector(".DKV0Md")?.textContent,
                writers: el.querySelector(".N96wpd")?.textContent,
                description: el.querySelector(".cmlJmd")?.textContent,
            }
        })
    });

And finally for the image:

let books_results = [];

books_results = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
    return {
      title: el.querySelector(".DKV0Md")?.textContent,
      writers: el.querySelector(".N96wpd")?.textContent,
      description: el.querySelector(".cmlJmd")?.textContent,
      // Optional chaining avoids a crash when a result has no image.
      thumbnail: el.querySelector("img")?.getAttribute("src"),
    };
  });
});
console.log(books_results);
await browser.close();

We don’t need to find the tag for the image as it is the only image in the container. So we just used the “img” element for reference. Don’t forget to close the browser.

Now, let us run our program to check the results.

The long URL you see as a thumbnail value is nothing but a base64 image URL. So, we got the results we wanted.
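If you want the actual image bytes rather than the long string, a base64 data URL can be decoded with Node's built-in Buffer. The sketch below uses a tiny made-up text payload instead of a real thumbnail, and decodeDataUrl is a hypothetical helper name.

```javascript
// Split a base64 data URL into its MIME type and decoded bytes.
// decodeDataUrl is a hypothetical helper; it returns null for anything else.
function decodeDataUrl(dataUrl) {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (!match) return null;
  return {
    mimeType: match[1],
    bytes: Buffer.from(match[2], "base64"),
  };
}

// Made-up payload standing in for a scraped thumbnail value.
const sample = "data:text/plain;base64," + Buffer.from("hello").toString("base64");
const decoded = decodeDataUrl(sample);

console.log(decoded.mimeType); // text/plain
console.log(decoded.bytes.toString("utf8")); // hello

// For a real thumbnail you could write the bytes to disk, for example:
// require("fs").writeFileSync("thumbnail.jpg", decoded.bytes);
```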

Complete Code:

const puppeteer = require("puppeteer");

const getBooksData = async () => {
  const url = "https://www.google.com/search?q=merchant+of+venice&gl=us&tbm=bks";

  const browser = await puppeteer.launch({
    headless: true,
    args: ["--disable-setuid-sandbox", "--no-sandbox"],
  });
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders({
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Agency/97.8.6287.88",
  });

  await page.goto(url, { waitUntil: "domcontentloaded" });

  // Runs inside the page: collect one object per book container.
  const books_results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll(".Yr5TG")).map((el) => {
      return {
        title: el.querySelector(".DKV0Md")?.textContent,
        writers: el.querySelector(".N96wpd")?.textContent,
        description: el.querySelector(".cmlJmd")?.textContent,
        thumbnail: el.querySelector("img")?.getAttribute("src"),
      };
    });
  });

  console.log(books_results);
  await browser.close();
};

getBooksData();

So, we now have a basic understanding of Puppeteer JS. Now, let's discuss some of its advantages.

Advantages of Puppeteer:

We can scroll the page in Puppeteer JS.
We can click on elements like buttons and links.
We can take screenshots and also make PDFs of the web page.
We can navigate between web pages.
We can also scrape JavaScript-rendered content with the help of Puppeteer JS.

Playwright JS

Playwright JS is a test automation framework used by developers around the world to automate web browsers. It was developed by the same team that previously worked on Puppeteer JS. You will find the syntax of Playwright JS to be similar to Puppeteer JS, and many of the API methods are nearly identical, but the two libraries have some differences. Let's discuss them:

Playwright v/s Puppeteer JS:

Playwright supports multiple languages like JavaScript, Python, Java, and C# (.NET), while Puppeteer officially supports only JavaScript.
Playwright JS is still a new library with limited community support, unlike Puppeteer JS, which has good community support.
Playwright supports browsers like Chromium, Firefox, and WebKit, while Puppeteer's main focus is Chrome and Chromium, with limited support for Firefox.

Let us take an example of how we can use Playwright JS to scrape Top Stories from Google Search Results. First, install playwright by running the below command in your terminal:

npm i playwright

Now, let's create our scraper by launching the chromium browser at our target URL.

    const browser = await playwright['chromium'].launch({ headless: false, args: ['--no-sandbox'] });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto("https://www.google.com/search?q=india&gl=us&hl=en");

Step-by-step explanation:

The first step will launch the chromium browser in non-headless mode.
The second step creates a new browser context. It won't share cookies/cache with other browser contexts.
The third step opens a new tab in the browser.
In the fourth step, we navigate to our target URL.

Now, let us search for the tags for these single stories.

Every single story comes under the .WlydOe class. The page.$$() method finds all elements matching the specified selector within the page and returns an array containing them. We store that array with const single_stories = await page.$$(".WlydOe").

Look for the tags of the title, date, and thumbnail with the same approach as in the Puppeteer section. After finding the tags, push the data into our top_stories array and close the browser.

    let top_stories = [];
    for(let single_story of single_stories)
    {
        top_stories.push({
        title: await single_story.$eval(".mCBkyc", el => el.textContent.replace('\n','')),
        link: await single_story.getAttribute("href"),
        date: await single_story.$eval(".eGGgIf", el => el.textContent),
        thumbnail: await single_story.$eval("img", el => el.getAttribute("src"))
        })
    }
    console.log(top_stories)
    await browser.close();

The $eval will find the specified element inside the parent element we declared above in single_stories array. The textContent will return the text inside the specified element and getAttribute will return the value of the specified element’s attribute.
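One detail worth knowing: replace('\n','') with a string pattern only replaces the first newline it finds. If a title contains several line breaks, a regular expression is a more thorough option. The cleanText helper below is a sketch of that idea with a made-up name and sample string. Note that it swaps each run of whitespace for a single space rather than removing it, so it is not a drop-in copy of the original call.

```javascript
// Collapse every run of whitespace (including newlines) into one space.
// cleanText is a hypothetical helper for illustration.
function cleanText(text) {
  return (text || "").replace(/\s+/g, " ").trim();
}

const messy = "Line one\nLine two\n  Line three ";

console.log(messy.replace("\n", "")); // only the first newline is removed
console.log(cleanText(messy)); // Line one Line two Line three
```

Since the function passed to $eval runs inside the browser, you would apply cleanText to the string that $eval returns, not call it inside the callback.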

Our result should look like this:

Here is the complete code:

const playwright = require("playwright");

const getTopStories = async () => {
  try {
    const browser = await playwright["chromium"].launch({ headless: false, args: ["--no-sandbox"] });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto("https://www.google.com/search?q=football&gl=us&hl=en");

    // Every story card matches the .WlydOe selector.
    const single_stories = await page.$$(".WlydOe");
    let top_stories = [];
    for (let single_story of single_stories) {
      top_stories.push({
        title: await single_story.$eval(".mCBkyc", (el) => el.textContent.replace("\n", "")),
        link: await single_story.getAttribute("href"),
        date: await single_story.$eval(".eGGgIf", (el) => el.textContent),
        thumbnail: await single_story.$eval("img", (el) => el.getAttribute("src")),
      });
    }
    console.log(top_stories);
    await browser.close();
  } catch (e) {
    console.log(e);
  }
};

getTopStories();

Advantages of Playwright:

It enables auto-wait for elements before performing any tasks.
It allows you to test your web applications in mobile browsers.
It is one of the fastest libraries when it comes to web scraping.
It covers all modern web browsers like Chrome, Edge, Safari, and Firefox.

Recap

The above sections taught us to scrape and parse Google Search Results with various JavaScript libraries. We also saw how a combination of Unirest and Cheerio, or Axios and Cheerio, can be used to extract data from Google. Obviously, if you want to scrape millions of Google pages, that won't work without proxies and a way to handle CAPTCHAs.

But, wait! You can still use Serpdog's Google Search API, which solves all your problems of handling proxies and CAPTCHAs, enabling you to scrape millions of Google Search Results without any hindrance.

Also, you need a large pool of user agents to make millions of requests to Google. If you use the same user agent for every request, your proxies will get blocked. Serpdog also solves this problem, as our Google Search API uses a large pool of User Agents to scrape Google Search Results successfully.

Moreover, Serpdog provides its users 100 free credits on the first sign-up.


Other Libraries

In this section, we will discuss some of the alternatives to the above-discussed libraries.

Nightmare JS

Nightmare JS is a web automation library designed to automate browsing tasks on websites that don't offer APIs.
Nightmare JS is mostly used by developers for UI testing and crawling. It can also help mimic user actions (like goto, type, and click) with an API that feels synchronous for each block of scripting.

Let us take an example of how we can use Nightmare JS to scrape the Google Search Twitter Results.

Install the Nightmare JS by running this command:

npm i nightmare

Each Twitter result is under the class .dHOsHb. So, this makes our code look like this:

const Nightmare = require("nightmare");
const nightmare = Nightmare();

nightmare
  .goto("https://www.google.com/search?q=cristiano+ronaldo&gl=us")
  .wait(".dHOsHb")
  .evaluate(() => {
    // Runs inside the page: collect the text of every Twitter result.
    let twitter_results = [];
    const results = document.querySelectorAll(".dHOsHb");
    results.forEach((result) => {
      let row = {
        tweet: result.innerText,
      };
      twitter_results.push(row);
    });
    return twitter_results;
  })
  .end()
  .then((result) => {
    result.forEach((r) => {
      console.log(r.tweet);
    });
  })
  .catch((error) => {
    console.log(error);
  });

Step-by-step explanation:

After importing the library, we created an instance of Nightmare JS with the name nightmare.
Then we use goto() to navigate to our target URL.
In the next step, we used wait() to wait for the selected tag of the twitter result. You can also pass a time value as a parameter to wait for a specific period.
Then we used evaluate(), which invokes functions on the page, in our case, it is querySelectorAll().
In the next step, we used the forEach() function to iterate over the results array and fill each element with the text content.
Finally, we called end() to stop the crawler and returned our scraped value.

Here are our results:

Node Fetch

Node Fetch is a lightweight module that brings the Fetch API to Node JS; in other words, it lets you use fetch() in Node JS.

Features:

It uses native promises and async functions.
It is consistent with the window.fetch API.
It uses native Node streams for the body, on both request and response.

To use Node Fetch, run this command in your project terminal:

npm i node-fetch@2

Let us take a simple example to request our target URL:

const fetch = require("node-fetch");

const getData = async () => {
  const response = await fetch("https://google.com/search?q=web+scraping&gl=us", {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 Agency/97.8.6287.88",
    },
  });

  // Read the response body as text (the raw HTML).
  const body = await response.text();
  console.log(body);
};

getData();

Osmosis

Osmosis is an HTML/XML parser for Node JS.

Features:

It supports CSS 3.0 and XPath 1.0 selector hybrids.
It is a very fast parser with a small memory footprint.
It has no large dependencies like jQuery or Cheerio JS.

Advantages of Osmosis:

It supports fast searching.
It supports single or multiple proxies and handles their failures.
It supports form submission and session cookies.
It supports retries and redirect limits.

Conclusion

In this tutorial, we discussed eight JavaScript libraries that can be used for web scraping Google Search Results. We also walked through some examples of scraping search results. Each of these libraries has unique features and advantages; some are fairly new, and some have been updated and adapted according to developer needs. You should now be able to choose a library according to the circumstances.

If you have any questions about the tutorial, please feel free to ask me.

If you think I have not covered some topics in the tutorial, please feel free to message me.


Frequently Asked Questions

1. Which Javascript library is best for web scraping?

When selecting the best library for web scraping, look for one that is easy to use, has good community support, and can handle large amounts of data.

2. From where should I start learning scraping Google?

This tutorial is designed to give beginners a basic understanding of scraping Google. If you want to learn more, I have written various posts on scraping Google, available on Serpdog's Blog page, which will give you an intermediate to advanced understanding of the topic.

3. Is web scraping Google hard?

Web scraping Google is fairly easy! Even a developer with decent knowledge can kickstart a career in web scraping if given the right tools.

4. Is web scraping legal?

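Generally, yes. Scraping data that is publicly available on the internet is usually considered legal, though this can depend on the jurisdiction and on the website's terms of service.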

Donation Appeal

Hey, can you buy me a coffee? It would help me keep writing these types of blogs.
