In this post, we're going to build a Search Engine API with Node/Express & Puppeteer. It will use web scraping to get the top results from Google.
If you haven't read the first post, I highly recommend reading it! It goes over the basics of web scraping with puppeteer.
Note: Unfortunately, while the concepts discussed in Parts 2 and 3 are still valid, the examples used to demonstrate them no longer work. This is the nature of web scraping: if a website changes the class name of a certain HTML element, the web scraper needs to be adjusted to the new class names. In this example, we used the class names Google had at the time of writing this post, but those class names have changed since then, so the example no longer works.
This is why it's sometimes better to find a dynamic way to target an element, so that if a class name or element id were to change, the web scraper would continue to operate.
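For example, instead of relying on a generated class name like bkWMgd, a scraper can anchor on attributes that tend to change less often. A minimal sketch; these selectors are illustrative assumptions, not Google's actual markup:
//Hypothetical examples - prefer attributes that rarely change over generated class names
await page.waitForSelector('input[name="q"]');            //a stable name attribute
await page.waitForSelector('[role="main"]');              //an ARIA role
await page.waitForSelector('a[href*="wikipedia.org"]');   //a substring of an attribute value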
This is Part 2 of a 3-part series:
- 1st Part: Basics of Puppeteer and Creating a Simple Web Scraper.
- 2nd Part: Creating Search Engine API using Google Search with Node/Express and Puppeteer.
- 3rd Part: Optimising our API, Increasing Performance, Troubleshooting basics and Deploying our Puppeteer API to the Web.
API Requirements
Before we get started, it's important to know what we're trying to build. We're going to build an API that takes in a search request and returns a JSON with the top results from Google's search results.
The information we care about from the results:
- Website Title
- Website Description
- Website URL
The search request will be a GET request, and we're going to make use of URL query params to specify the search query. The user will send a request to /search with the search query searchquery=cats:

localhost:3000/search?searchquery=cats

Our API is expected to return the top results about cats from Google in JSON:
[
    {
        title: 'Cats Are Cool',
        description: 'This website is all about cats and cats are cool',
        url: 'catsarecool.com'
    },
    ...
    {
        title: 'Cats funny videos',
        description: 'Videos all about cats and they are funny!',
        url: 'catsfunnyvideos.com'
    }
]
Now that we know our requirements, we can go ahead and start building our API.
Setting up a Node/Express Server
If you want to skip setting up the Node/Express server, you can skip right ahead to the part where we start writing the code for puppeteer to crawl Google, though I recommend reading this part.
To get started, we're going to create a new project directory and initialise npm:
mkdir search-engine-api
cd search-engine-api
npm init -y
For this API, we're going to use Express.js to create a simple API, so we need to install express, puppeteer and nodemon. We're going to use nodemon for development. Nodemon will detect any changes in our server file and automatically restart our server. This will save us time in the long run.
npm i express puppeteer nodemon
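As a side note, since nodemon is only needed during development, you could optionally install it as a dev dependency instead (if you do, it will show up under devDependencies rather than dependencies in the package.json below):
npm i express puppeteer
npm i -D nodemon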
We can now create our server file:
touch server.js
After doing so, we need to configure our package.json and add a script for npm start to start our server. For development purposes, we can create a script with nodemon. We will use npm run dev for running the nodemon script:
{
  "name": "search-engine-api",
  "version": "1.0.0",
  "description": "",
  "main": "server.js",
  "scripts": {
    "start": "node server.js",
    "dev": "nodemon server.js"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "express": "^4.17.1",
    "nodemon": "^2.0.2",
    "puppeteer": "^2.0.0"
  }
}
Now if we run npm run dev and make changes in our server.js file, nodemon will automatically restart the server. We can now start writing code for our server.
Before we get into building our API, we need to set up a simple Express server. We're going to use the Hello World example provided by the Express docs:
const express = require('express');
const app = express();
const port = 3000;
//Catches requests made to localhost:3000/
app.get('/', (req, res) => res.send('Hello World!'));
//Initialises the express server on the port 3000
app.listen(port, () => console.log(`Example app listening on port ${port}!`));
This creates an Express server on port 3000 of our local machine. If someone sends a GET request to localhost:3000/, our server responds with Hello World. We can see it working by opening the URL localhost:3000/ in a browser.
We're going to create a new route for our search. This is where we will pass information in the URL with query params. For example, if we want search results for the query "dogs", we can send a request to:
localhost:3000/search?searchquery=dogs
To implement this, we need to create a new GET request handler with Express, and since we expect this to be a GET request, we can make use of app.get(route, callbackFunc):
const express = require('express');
const puppeteer = require('puppeteer');
const app = express();
const port = 3000;

//Catches requests made to localhost:3000/search
app.get('/search', (request, response) => {
    //Do something when someone makes a request to localhost:3000/search
    //request parameter - information about the request coming in
    //response parameter - response object that we can use to send a response
});

//Catches requests made to localhost:3000/
app.get('/', (req, res) => res.send('Hello World!'));

//Initialises the express server on the port 3000
app.listen(port, () => console.log(`Example app listening on port ${port}!`));
Now that we have a function that catches requests made to localhost:3000/search, we can start looking into how we can make use of any query params in the URL. Any request made to this route will execute the callback function in this handler.

Express allows us to access the query params through the request parameter. In our case, since we named our query field searchquery, we can access it through that:
//Catches requests made to localhost:3000/search
app.get('/search', (request, response) => {
    //Holds value of the query param 'searchquery'
    const searchQuery = request.query.searchquery;
});
However, if this query does not exist, we have nothing to search for, so we can handle that case by only doing something when a search query is provided. If the search query does not exist, we can quickly end the response without any data using response.end():
//Catches requests made to localhost:3000/search
app.get('/search', (request, response) => {
    //Holds value of the query param 'searchquery'.
    const searchQuery = request.query.searchquery;
    //Do something when the searchQuery is not null.
    if (searchQuery != null) {

    } else {
        response.end();
    }
});
Now that we have our Node/Express server set up, we can start writing code for our scraper.
Creating the Search Engine API with Puppeteer
When it comes to web scraping Google, one way to search something directly on Google Search is to pass the search query as a URL query parameter:

https://www.google.com/search?q=cat

This will show us results for the keyword 'cat' on Google. This would be the ideal approach; however, for the purpose of this post, we're going to do things the difficult way: opening google.com (the homepage), having puppeteer type into the search box and press Enter to get the results.
We'll do it this way because not all websites make use of query parameters and sometimes the only way to get to the next step of the website (in our case the results page) is to do things manually in the first step.
At this point, our server.js looks like this:
const express = require('express');
const puppeteer = require('puppeteer');
const app = express();
const port = 3000;

//Catches requests made to localhost:3000/search
app.get('/search', (request, response) => {
    //Holds value of the query param 'searchquery'.
    const searchQuery = request.query.searchquery;
    //Do something when the searchQuery is not null.
    if (searchQuery != null) {

    } else {
        response.end();
    }
});

//Catches requests made to localhost:3000/
app.get('/', (req, res) => res.send('Hello World!'));

//Initialises the express server on the port 3000
app.listen(port, () => console.log(`Example app listening on port ${port}!`));
We're going to create a new function called searchGoogle. This will take in the searchQuery as an input parameter and return an array of JSON objects with the top results.

Before we write searchGoogle with puppeteer, we're going to write the skeleton of the function so we know how the code should behave:
const express = require('express');
const puppeteer = require('puppeteer');
const app = express();
const port = 3000;

//Catches requests made to localhost:3000/search
app.get('/search', (request, response) => {
    //Holds value of the query param 'searchquery'.
    const searchQuery = request.query.searchquery;
    //Do something when the searchQuery is not null.
    if (searchQuery != null) {
        searchGoogle(searchQuery)
            .then(results => {
                //Returns a 200 Status OK with Results JSON back to the client.
                response.status(200);
                response.json(results);
            });
    } else {
        response.end();
    }
});

//Catches requests made to localhost:3000/
app.get('/', (req, res) => res.send('Hello World!'));

//Initialises the express server on the port 3000
app.listen(port, () => console.log(`Example app listening on port ${port}!`));
Since puppeteer works asynchronously, we need to wait for the results to be returned from searchGoogle. For this reason, we add a .then to make sure we wait until searchGoogle has fetched the results before we make use of them; the results can be accessed in the callback function, which receives them as its first parameter. After that, we can respond back to the client using response.json().
response.json() returns a JSON back to the client. There are other methods you can use with the response object; you can read more about them in the official Express docs.
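One thing our skeleton doesn't handle yet is failure: puppeteer will reject the promise if, for example, a selector never appears or navigation times out. Here is a minimal sketch of how you could extend the handler with a .catch (the 500 status is my assumption, not part of the original code):
searchGoogle(searchQuery)
    .then(results => {
        //Returns a 200 Status OK with Results JSON back to the client.
        response.status(200);
        response.json(results);
    })
    .catch(error => {
        //Hypothetical error handling - log the error and answer with 500 Internal Server Error
        console.error(error);
        response.status(500).end();
    });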
We can now start writing code and building the puppeteer function searchGoogle. To do this, we're going to create a new file in the same directory. Having a separate file allows us to test our puppeteer code without having to make a manual request to our server, which can be a time-consuming process. We'll name it searchGoogle.js:
touch searchGoogle.js
Now we need to initialize the function in the file:
const puppeteer = require('puppeteer');

const searchGoogle = async (searchQuery) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://google.com');
    await browser.close();
};

module.exports = searchGoogle;
Right now, we are just launching a headless instance of Chrome and browsing to Google. We now need to find the search bar, where we can type the query. For this, we need to inspect the source code of Google's homepage.
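As a side note, while debugging it can help to watch the browser instead of running it headless. Puppeteer's launch options support this; the values below are just an example, not part of the final code:
//Opens a visible Chrome window and slows every puppeteer action down by 50ms
const browser = await puppeteer.launch({ headless: false, slowMo: 50 });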
After using the mouse tool for selecting elements, we can see the HTML for this search bar:
We can see that it has name="q". We can use this to identify and target the input through puppeteer. To type in our search query, puppeteer provides the page function page.type(selector, textToType). With this we can target any form and input our values directly:
const puppeteer = require('puppeteer');

const searchGoogle = async (searchQuery) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://google.com');
    //Finds input element with name attribute 'q' and types searchQuery
    await page.type('input[name="q"]', searchQuery);
    await browser.close();
};

module.exports = searchGoogle;
Just to make sure everything is working, we can take a screenshot after it is done typing:
const puppeteer = require('puppeteer');

const searchGoogle = async (searchQuery) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://google.com');
    //Finds input element with name attribute 'q' and types searchQuery
    await page.type('input[name="q"]', searchQuery);
    await page.screenshot({path: 'example.png'});
    await browser.close();
};

//Exports the function so we can access it in our server
module.exports = searchGoogle;

searchGoogle('cats');
As you can see, at the end of the file we make a call to the searchGoogle function. This is so we can start testing it. We can now go to our command line and execute:
node searchGoogle.js
After a few seconds, the file should finish executing and you should be able to view the screenshot:
Now, all we need to do is either have puppeteer press 'Enter' on the keyboard or click the 'Google Search' button below the search bar.

Both approaches are suitable solutions; however, for precision, we're going to have puppeteer click 'Google Search'. If you wanted to press Enter instead, this is how you would do it:
await page.keyboard.press('Enter');
We're going to inspect the page once again and find information regarding the 'Google Search' button. Doing so reveals this:

We can see that it has the name "btnK". We can use this to target the element and click it:
//Finds the first input with name 'btnK', after it is found, it executes .click() DOM Event Method
await page.$eval('input[name=btnK]', button => button.click());
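Puppeteer also ships its own page.click(selector) helper, which should work here as well; the $eval approach above simply triggers the click through the DOM instead:
//Equivalent alternative using puppeteer's built-in click helper
await page.click('input[name=btnK]');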
Adding it to our file:
const puppeteer = require('puppeteer');

const searchGoogle = async (searchQuery) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://google.com');
    //Finds input element with name attribute 'q' and types searchQuery
    await page.type('input[name="q"]', searchQuery);
    //Finds an input with name 'btnK'; once found, it executes .click() DOM Method
    await page.$eval('input[name=btnK]', button => button.click());
    await page.screenshot({path: 'example.png'});
    await browser.close();
};

searchGoogle('cats');

//Exports the function so we can access it in our server
module.exports = searchGoogle;
Executing the file and seeing the screenshot yields this result:
We need to make sure to wait for Google to load up all the results before we do anything. There are different ways we can do this. If we want to wait for a certain time we can use:
await page.waitFor(durationInMilliseconds)
Alternatively, if we already know the element we are looking for, we can use waitForSelector to wait for puppeteer to load the first element with a matching selector before proceeding:
await page.waitForSelector('selector');
This will wait for the selector to load before proceeding. To use this, we first need to identify the selector for our results so that puppeteer can wait for it before moving on. Keep in mind that this will only wait for the first matching element it finds.
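waitForSelector also accepts an options object. For example, you can cap how long it waits before throwing an error (the 5000ms value below is just an illustration):
//Throws a TimeoutError if no matching element appears within 5 seconds
await page.waitForSelector('selector', { timeout: 5000 });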
After going through the HTML source code of the search results, I found that all the search results are stored in a div with the id search:

So we can use waitForSelector(selector) and target the div with id=search:
const puppeteer = require('puppeteer');

const searchGoogle = async (searchQuery) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://google.com');
    //Finds input element with name attribute 'q' and types searchQuery
    await page.type('input[name="q"]', searchQuery);
    //Finds an input with name 'btnK'; once found, it executes .click() DOM Method
    await page.$eval('input[name=btnK]', button => button.click());
    //Wait until the first div element with id 'search' loads
    await page.waitForSelector('div[id=search]');
    await page.screenshot({path: 'example.png'});
    await browser.close();
};

searchGoogle('cats');

//Exports the function so we can access it in our server
module.exports = searchGoogle;
Now that our results have loaded, we can start parsing them. If you want to skip the part where we try to find the divs with relevant information then you can skip right ahead to the implementation.
If we take a closer look at the source code to make sense of the HTML, we can see that the information we're looking for is stored in divs with class=bkWMgd. However, not all divs with this class contain relevant information; some of them contain video recommendations, news stories etc. The ones we're interested in are the ones with an h2 title containing the text Web Results.
If we take a closer look at that div, we can see that it's nested very deeply. For this reason, we're going to make use of special selectors to target deep children. The main information is stored in the div with class 'g'
:
We can target the specific divs we care about using the '>' CSS selector, known as the child combinator, to target the nested information.
We can target nested elements like so:
<div class='1'>
    <div class='2'>
        <div class='3'>
            <p>Information</p>
        </div>
    </div>
</div>
For an HTML file with a structure like this, we can access the paragraph with:

'div[class="1"] > div[class="2"] > div[class="3"] > p'

(The quotes around the numeric values are required here, since unquoted attribute values in CSS selectors must be valid identifiers.)
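In plain browser JavaScript, that selector would be used like this (a throwaway example for the HTML above):
//Grabs the <p> nested three divs deep and reads its text
const text = document.querySelector('div[class="1"] > div[class="2"] > div[class="3"] > p').innerText;
console.log(text); //'Information'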
We can select the div with results:
//Finds the first div with class 'bkWMgd' and returns it
let parent = await page.$eval('div[class=bkWMgd]', result => result);
The parent variable here is a DOM node, so we can run HTML DOM methods on it. One thing to keep in mind is that DOM nodes can't actually be returned out of page.$eval() into Node, so in the final version we do all of this DOM work inside the evaluation callback; the snippets below show the logic that will run there. Since all the information is available in the div with class g, we can set the parent to its immediate child.
//Sets the parent to the div with all the information
parent = parent.querySelector('div[class=g]');
With this, we can now target the information we care about. This information can be seen in this image:
Title
//Targets h3 Website Title i.e. 'Cats (2019 film) - Wikipedia'
const title = parent.querySelector('div[class=rc] > div[class=r] > a > h3').innerText;
URL
//Targets the <a> href link i.e. 'https://en.wikipedia.org/wiki/Cats_(2019_film)'
const url = parent.querySelector('div[class=rc] > div[class=r] > a').href;
Description
const desc = parent.querySelector('div[class=rc] > div[class=s] > div > span[class=st]').innerText;
Now that we know how to target our information, we can add this to our file. We only looked at parsing information from one search result, but there are multiple search results, so we need to use page.$$eval to target ALL the divs with the h2 Web Results title and, within those, target the divs with class g. As we can see here, some divs contain multiple search results:

When there are multiple divs with class g, they are nested in another div with class srg. Let's start adding all of this to our code so we can start putting all the pieces together. Please read this code carefully; it might seem confusing, but it's based on the screenshot above.
//Find all div elements with class 'bkWMgd'
const searchResults = await page.$$eval('div[class=bkWMgd]', results => {
    //Array to hold all our results
    let data = [];
    //Iterate over all the results
    results.forEach(parent => {
        //Check if parent has h2 with text 'Web Results'
        const ele = parent.querySelector('h2');
        //If element with 'Web Results' title is not found then continue to next element
        if (ele === null) {
            return;
        }
        //Check if parent contains 1 div with class 'g' or contains many nested in a div with class 'srg'
        let gCount = parent.querySelectorAll('div[class=g]');
        //If there is no div with class 'g' then there must be a group of 'g's in class 'srg'
        if (gCount.length === 0) {
            //Targets all the divs with class 'g' stored in the div with class 'srg'
            gCount = parent.querySelectorAll('div[class=srg] > div[class=g]');
        }
        //Iterate over all the divs with class 'g'
        gCount.forEach(result => {
            //Target the title
            const title = result.querySelector('div[class=rc] > div[class=r] > a > h3').innerText;
            //Target the url
            const url = result.querySelector('div[class=rc] > div[class=r] > a').href;
            //Target the description
            const description = result.querySelector('div[class=rc] > div[class=s] > div > span[class=st]').innerText;
            //Add to the return Array
            data.push({title, description, url});
        });
    });
    //Return the search results
    return data;
});
The code above will parse the page and get us our results in an array. We can now return that array from our main function searchGoogle:
const puppeteer = require('puppeteer');

const searchGoogle = async (searchQuery) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://google.com');
    //Finds input element with name attribute 'q' and types searchQuery
    await page.type('input[name="q"]', searchQuery);
    //Finds an input with name 'btnK'; once found, it executes .click() DOM Method
    await page.$eval('input[name=btnK]', button => button.click());
    //Wait until the div with id 'search' loads
    await page.waitForSelector('div[id=search]');
    const searchResults = await page.$$eval('div[class=bkWMgd]', results => {
        //Array to hold all our results
        let data = [];
        ...
        ...
        //Return the search results
        return data;
    });
    await browser.close();
    return searchResults;
};

module.exports = searchGoogle;
We can now remove the last line, where we manually call the function. We are now finished with this Search Engine API! Now, all we need to do is import this function into our main server.js file:
const express = require('express');
const app = express();
const port = 3000;

//Import puppeteer function
const searchGoogle = require('./searchGoogle');

//Catches requests made to localhost:3000/search
app.get('/search', (request, response) => {
    //Holds value of the query param 'searchquery'.
    const searchQuery = request.query.searchquery;
    //Do something when the searchQuery is not null.
    if (searchQuery != null) {
        searchGoogle(searchQuery)
            .then(results => {
                //Returns a 200 Status OK with Results JSON back to the client.
                response.status(200);
                response.json(results);
            });
    } else {
        response.end();
    }
});

//Catches requests made to localhost:3000/
app.get('/', (req, res) => res.send('Hello World!'));

//Initialises the express server on the port 3000
app.listen(port, () => console.log(`Example app listening on port ${port}!`));
Now if we start our server with npm start, go to our browser and browse to:
http://localhost:3000/search?searchquery=cats
We get a JSON! I'm using a JSON Viewer Chrome extension to be able to view JSON in my browser.
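If you'd rather test from the command line instead of installing an extension, a quick curl request does the same job:
curl "http://localhost:3000/search?searchquery=cats"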
The code for this project can be found on GitHub.
However, we are not done yet. At the moment, our API is ready but it's a bit slow. It's also currently running on our local machine, so we need to deploy it somewhere. This will all be covered in Part 3!
Part 3 will cover:
- Optimising and Improving Performance
- Troubleshooting Basics
- Deploying the API
This is the end of this post! I hope you enjoyed reading this and found this to be useful. Stay tuned for Part 3!
If you're interested in other use cases, check out the Net-Income Calculator, which uses a Node/Express Puppeteer API to scrape information about state taxes and average rent in cities from websites. You can check out its GitHub repo.
If you enjoyed reading this and would like to provide feedback, you can do so anonymously here. Any feedback regarding anything is appreciated!