Serpdog

Posted on Oct 1, 2022 • Edited on Mar 5, 2023

Web Scraping Google Images

#tutorial #beginners #javascript #programming

Introduction

This post shows how to scrape Google Images results with Node JS using more than one method.

Requirements

Web Parsing with CSS selectors

Searching through raw HTML for the right tags is both difficult and time-consuming. The CSS Selectors Gadget makes this easier by letting you pick the selectors you need directly from the page.

Here is the link to the tutorial, which explains how to use this gadget to choose the best CSS selectors for your needs.

User Agents

The User-Agent request header identifies the application, operating system, vendor, and version of the requesting client. Sending a browser-like User-Agent makes a request to Google look like it comes from a real user. You can also rotate User Agents; read more about this in this article: How to fake and rotate User Agents using Python 3.

If you want to further safeguard your IP from being blocked by Google, you can try these 10 Tips to avoid getting Blocked while Scraping Google.

Install Libraries

Before we begin, install the two libraries our scraper depends on, Unirest and Cheerio. Type the commands below in your project terminal to install them:

  npm i unirest
  npm i cheerio

To fetch the raw HTML, we will use Unirest JS, and to parse it, we will use Cheerio JS.

Target:

Process

Method-1

Everything is now set up to build our scraper. Let us look at the first method for scraping Google Images.

First, we will make a GET request on our target URL with the help of Unirest to extract the raw HTML data.

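  const unirest = require("unirest");
  const cheerio = require("cheerio");

  // A single, fixed User-Agent header for the first attempt.
  let header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
  };
  unirest
  .get("https://www.google.com/search?q=nike&oq=nike&hl=en&tbm=isch&asearch=ichunk&async=_id:rg_s,_pms:s,_fmt:pc&sourceid=chrome&ie=UTF-8")
  .headers(header)
  .then((response) => {
    // Load the returned HTML into Cheerio so it can be queried with CSS selectors.
    let $ = cheerio.load(response.body);
  });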

Step-by-step explanation:

In the first step, we made a GET request to our target URL.
In the second step, we passed the headers required with our target URL.
Then we loaded the returned response body into a Cheerio instance.

A single User Agent might not be enough, as Google can still block your requests. So, we will keep an array of User Agents and pick one at random for every request.

    const selectRandom = () => {
        const userAgents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
        ];
        // Pick a random index into the list of User Agents.
        const randomNumber = Math.floor(Math.random() * userAgents.length);
        return userAgents[randomNumber];
    };
    let user_agent = selectRandom();
    let header = {
        "User-Agent": `${user_agent}`
    };
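
The function above picks a User Agent at random, so two consecutive requests can end up with the same one. If strict rotation is preferred, a round-robin variant is one option. The sketch below is only an illustration and not part of the original scraper; createUserAgentRotator and buildHeader are hypothetical names.

```javascript
// Hypothetical helper: returns a function that walks through the list in order
// and starts again from the beginning after the last entry.
const createUserAgentRotator = (userAgents) => {
  if (!Array.isArray(userAgents) || userAgents.length === 0) {
    throw new Error("userAgents must be a non-empty array");
  }
  let index = 0;
  return () => {
    const userAgent = userAgents[index];
    index = (index + 1) % userAgents.length;
    return userAgent;
  };
};

// Two of the User Agents from the list above, used here as sample input.
const nextUserAgent = createUserAgentRotator([
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
]);

// Build a fresh header object for every request.
const buildHeader = () => ({ "User-Agent": nextUserAgent() });
```

Calling buildHeader() before each request gives every request the next User Agent in the list.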

Copy the target URL below and paste it into your browser, which will download a text file. Open that text file in your code editor and save it as an HTML file.

https://www.google.com/search?q=nike&oq=nike&hl=en&tbm=isch&asearch=ichunk&async=_id:rg_s,_pms:s,_fmt:pc&sourceid=chrome&ie=UTF-8

Additional parameters which can be used with this URL:

tbs - Term By Search parameter. Read more about this parameter in this article.
chips - Used to filter image results.
ijn - Used for pagination. ijn = 0 will return the first page of results, ijn = 1 will return the second page of results and so on.
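
To make these parameters concrete, here is a minimal sketch that builds the target URL for a given query and page. The helper name buildImagesUrl is hypothetical, and the fixed parameters are copied from the URL above. Note that URLSearchParams percent-encodes characters such as ":" and "," in the async value; this should be equivalent once the server decodes it, but that is an assumption worth verifying against the hand-written URL.

```javascript
// Hypothetical helper: builds the Method-1 URL for a query and a page number.
// `extra` can carry optional parameters such as tbs or chips.
const buildImagesUrl = (query, page = 0, extra = {}) => {
  const url = new URL("https://www.google.com/search");
  const params = {
    q: query,
    oq: query,
    hl: "en",
    tbm: "isch",
    asearch: "ichunk",
    async: "_id:rg_s,_pms:s,_fmt:pc",
    sourceid: "chrome",
    ie: "UTF-8",
    ijn: String(page), // 0 = first page, 1 = second page, and so on
    ...extra,
  };
  for (const [key, value] of Object.entries(params)) {
    // Skip optional parameters that were passed as undefined or null.
    if (value !== undefined && value !== null) {
      url.searchParams.set(key, value);
    }
  }
  return url.toString();
};

// URLs for the first three pages of results for "nike".
const pageUrls = [0, 1, 2].map((page) => buildImagesUrl("nike", page));
```

Each entry in pageUrls can then be passed to the same GET request used for the first page.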

Scroll through the HTML file to the end of the style tag, and you will see the HTML tags for the individual image results.

Now, we will parse the fields we want from the response, starting with the title. You will find it under the class .mVDMnf inside an anchor tag. Just below the title is the source, under the class .FnqxG.

    let images_results = [];
    $("div.rg_bx").each((i, el) => {
        images_results.push({
            title: $(el).find(".iKjWAf .mVDMnf").text(),
            source: $(el).find(".iKjWAf .FnqxG").text()
        });
    });

After the end of the anchor tag, you will find the div tag with the class name rg_meta which contains a JSON string.

{"bce":"rgb(249,252,249)","cb":21,"cl":21,"clt":"n","cr":21,"ct":21,"id":"qYZE1rcH_OCntM","isu":"www.nike.com","itg":0,"oh":1088,"os":"15KB","ou":"https://c.static-nike.com/a/images/w_1920,c_limit/bzl2wmsfh7kgdkufrrjq/image.jpg","ow":1920,"pt":"Nike. Just Do It. Nike.com","rh":"www.nike.com","rid":"mgtROrdDu1XGJM","rmt":0,"rt":0,"ru":"https://www.nike.com/","st":"www.nike.com","th":169,"tu":"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcQQAtNCsBlvuD_5pu9bKrTr-Sv5mMwD1-hZE9MS4Px4GKk05naP\u0026s","tw":298}

We will parse it and extract the link and the URL of the original image from it.

    let images_results = [];
    $("div.rg_bx").each((i, el) => {
        // Parse the rg_meta JSON string once and reuse the result.
        let meta = JSON.parse($(el).find(".rg_meta").text());
        images_results.push({
            title: $(el).find(".iKjWAf .mVDMnf").text(),
            source: $(el).find(".iKjWAf .FnqxG").text(),
            link: meta.ru,
            original: meta.ou,
        });
    });
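
JSON.parse throws if the rg_meta text is empty or malformed, and a single bad result would then stop the whole loop. One way to guard against that is a small wrapper like the sketch below. The helper name parseMeta is hypothetical; the keys ru and ou are the ones shown in the JSON string above.

```javascript
// Hypothetical helper: safely extracts the page link (ru) and the original
// image URL (ou) from an rg_meta JSON string.
const parseMeta = (jsonString) => {
  try {
    const meta = JSON.parse(jsonString);
    return {
      link: meta.ru ?? null,
      original: meta.ou ?? null,
    };
  } catch (err) {
    // Empty or malformed JSON: return placeholders instead of throwing.
    return { link: null, original: null };
  }
};

// A shortened sample with the same keys as the rg_meta string above.
const sample = '{"ou":"https://c.static-nike.com/a/images/w_1920,c_limit/bzl2wmsfh7kgdkufrrjq/image.jpg","ru":"https://www.nike.com/"}';
const { link, original } = parseMeta(sample);
```

Inside the loop, the result of parseMeta($(el).find(".rg_meta").text()) could be spread into the pushed object in place of the two JSON.parse calls.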

Finally, we will find the thumbnail URL. If you look at the HTML, there is an image tag under the first anchor tag, which contains the thumbnail URL in either its src or its data-src attribute.

Now, our parser looks like this:

    let images_results = [];
    $("div.rg_bx").each((i, el) => {
        // Parse the rg_meta JSON string once and reuse the result.
        let meta = JSON.parse($(el).find(".rg_meta").text());
        let img = $(el).find(".rg_l img");
        images_results.push({
            title: $(el).find(".iKjWAf .mVDMnf").text(),
            source: $(el).find(".iKjWAf .FnqxG").text(),
            link: meta.ru,
            original: meta.ou,
            // Fall back to data-src when src is missing.
            thumbnail: img.attr("src") || img.attr("data-src")
        });
    });

Results:

 [
  {
    title: 'Shoes, Clothing & Accessories. Nike ...',
    source: 'www.nike.com',
    link: 'https://www.nike.com/in/men',
    original: 'https://c.static-nike.com/a/images/w_1920,c_limit/mdbgldn6yg1gg88jomci/image.jpg',
    thumbnail: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTCsZWS0YPC1NFXd4g_Ucn4jkz8VYxL4VbLvWfKa5QI3PKRuHc&s'
  },
  {
    title: 'Nike. Just Do It. Nike.com',
    source: 'www.nike.com',
    link: 'https://www.nike.com/',
    original: 'https://static.nike.com/a/images/f_jpg,q_auto:eco/61b4738b-e1e1-4786-8f6c-26aa0008e80b/swoosh-logo-black.png',
    thumbnail: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRbbeIzjUozRCMzN8gaujUFBJlIFHheriDFvKhSCMD84JL8KeuX&s'
  },
  ....

Here is the full code:


    const unirest = require("unirest");
    const cheerio = require("cheerio");

    const getImagesData = () => {
        // Pick a random User Agent for each run.
        const selectRandom = () => {
            const userAgents = [
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
            ];
            const randomNumber = Math.floor(Math.random() * userAgents.length);
            return userAgents[randomNumber];
        };
        let user_agent = selectRandom();
        let header = {
            "User-Agent": `${user_agent}`,
        };
        return unirest
            .get(
                "https://www.google.com/search?q=nike&oq=nike&hl=en&tbm=isch&asearch=ichunk&async=_id:rg_s,_pms:s,_fmt:pc&sourceid=chrome&ie=UTF-8"
            )
            .headers(header)
            .then((response) => {
                let $ = cheerio.load(response.body);

                let images_results = [];
                $("div.rg_bx").each((i, el) => {
                    // Parse the rg_meta JSON string once and reuse the result.
                    let meta = JSON.parse($(el).find(".rg_meta").text());
                    let img = $(el).find(".rg_l img");
                    images_results.push({
                        title: $(el).find(".iKjWAf .mVDMnf").text(),
                        source: $(el).find(".iKjWAf .FnqxG").text(),
                        link: meta.ru,
                        original: meta.ou,
                        // Fall back to data-src when src is missing.
                        thumbnail: img.attr("src") || img.attr("data-src"),
                    });
                });

                console.log(images_results);
            });
    };

    getImagesData();

Method-2

In this method, we will use a simple GET request to fetch the first page results of Google Images. So, let us find the tags for the image results.

https://www.google.com/search?q=Badminton&gl=us&tbm=isch

First, we will find the tag for the title. It is the h3 under the div with the class name MSM1fd.

    const images_results = [];

    $(".MSM1fd").each((i, el) => {
        images_results.push({
            title: $(el).find("h3").text(),
        });
    });

Then we will find the tag for the source. The source of the image is under the second anchor tag, which has the class name VFACy, inside the div with the class name MSM1fd. This anchor tag also contains our link. So, our parser would look like this:

    const images_results = [];

    $(".MSM1fd").each((i, el) => {
        images_results.push({
            // Fall back to data-src when src is missing.
            image: $(el).find("img").attr("src") || $(el).find("img").attr("data-src"),
            title: $(el).find("h3").text(),
            source: $(el).find("a.VFACy .fxgdke").text(),
            link: $(el).find("a.VFACy").attr("href"),
        });
    });

The div contains only one img tag, so there is no need to select it by its class name.

Results

   {
    image: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSjxyuvqYQfybxq9F2XgME-ya6xb81WUyw3Dpa-YA40-Fy7fx0IlOhXIrK17kNON-r6vNs&usqp=CAU',
    title: 'Nike for Men - Shop New Arrivals - FARFETCH',
    source: 'farfetch.com',
    link: 'https://www.farfetch.com/in/shopping/men/nike/items.aspx'
    },
    {
    image: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSfCJOrZr0zFxogpQjNT_6kBQ3rmxSPqvCHPpTWLmpOltZinUpptGM-290ssFMCIzFnD1M&usqp=CAU',
    title: "Women's Clothing. Nike IN",
    source: 'nike.com',
    link: 'https://www.nike.com/in/w/womens-clothing-5e1x6z6ymx6'
    },
    ....

Note: You will also find some images with base64 URLs.
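
If you need to handle those base64 entries separately, one possible approach is to detect the data URL prefix and decode the payload. The sketch below is illustrative only; decodeDataUrl is a hypothetical helper name, and it assumes the image value has the usual data:<mime type>;base64,<payload> shape.

```javascript
// Hypothetical helper: returns the MIME type and the decoded bytes for a
// base64 data URL, or null when the value is a regular http(s) URL.
const decodeDataUrl = (src) => {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(src || "");
  if (!match) {
    return null;
  }
  return {
    mimeType: match[1],
    data: Buffer.from(match[2], "base64"),
  };
};

// Split a list of image values into inline (base64) images and remote URLs.
const splitImages = (images) => ({
  inline: images.filter((src) => decodeDataUrl(src) !== null),
  remote: images.filter((src) => decodeDataUrl(src) === null),
});
```

The decoded data is a Buffer, so it could be written to disk with fs.writeFile if you want to save the thumbnail.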

This method is fast, but it does not support pagination, while the first method does. Another option is the Puppeteer Infinite Scrolling Method, which solves the pagination problem but is very time-consuming.

With Google Images API

If you don't want to code and maintain the scraper in the long run and don't want to work with complex URLs and HTML, then you can try this Google Search API for scraping Google Images Results.

Serpdog handles captchas and proxies for you and lets developers scrape Google Search Results smoothly. The ready-made structured JSON data can also save you a lot of time.


  const axios = require('axios');

  axios.get('https://api.serpdog.io/images?api_key=APIKEY&q=football&gl=us')
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log(error);
  });
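
To avoid hardcoding the key and the query in the URL string, the request URL can be assembled from its parts. This is a minimal sketch that uses only the endpoint and the parameters shown above (api_key, q, gl); the helper name buildSerpdogImagesUrl and the environment variable SERPDOG_API_KEY are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the request URL from the parameters used in the
// example above. URLSearchParams takes care of encoding the query text.
const buildSerpdogImagesUrl = ({ apiKey, query, country = "us" }) => {
  if (!apiKey) {
    throw new Error("Missing API key");
  }
  const params = new URLSearchParams({
    api_key: apiKey,
    q: query,
    gl: country,
  });
  return `https://api.serpdog.io/images?${params.toString()}`;
};

// Usage (SERPDOG_API_KEY is an assumed environment variable name):
// const url = buildSerpdogImagesUrl({
//   apiKey: process.env.SERPDOG_API_KEY,
//   query: "football",
// });
// axios.get(url).then((response) => console.log(response.data));
```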

Result:

 "image_results": [
    {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS_Tu78LWxIu_M_sN_kMfj2guqIbu2VcSLyI84CQGbuFRIyTCVR&s",
        "title": "Football - Wikipedia",
        "source": "en.wikipedia.org",
        "link": "https://en.wikipedia.org/wiki/Football",
        "original": "https://upload.wikimedia.org/wikipedia/commons/b/b9/Football_iu_1996.jpg",
        "rank": 1
    },
    {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTxvsz_pjLnFyCnYyCxxY5rSHQCHjNJyYGFZqhQUtTm0XOzOWw&s",
        "title": "Soft toy, American football/brown - IKEA",
        "source": "www.ikea.com · In stock",
        "link": "https://www.ikea.com/us/en/p/oenskad-soft-toy-american-football-brown-90506769/",
        "original": "https://www.ikea.com/us/en/images/products/oenskad-soft-toy-american-football-brown__0982285_pe815602_s5.jpg",
        "rank": 2
    },
    .....

Conclusion:

In this tutorial, we learned to scrape Google Images Results using Node JS. Feel free to message me anything you need clarification on. Follow me on Twitter. Thanks for reading!

Additional Resources

Also published here.

DEV Community
