In this 3-part series, we're going to learn how to turn any webpage into our own personal API. We'll do this by building a Search API that scrapes Google Search for its results, and we'll do all of the web scraping with puppeteer.
This is a 3-part series:
- Part 1: We'll go over the basics and create a simple web scraper.
- Part 2: We'll create a Search Engine API using Google Search.
- Part 3: We'll go over different ways to optimize our API, improve performance, cover troubleshooting basics, and deploy our puppeteer API to the web.
Table Of Contents - Part 1
- Basics of Puppeteer
- Creating a Simple Puppeteer Scraper
Basics of Puppeteer
We first need to understand what puppeteer is and what you can do with it.
What exactly is Puppeteer?
The definition according to official docs:
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
In simple words, puppeteer is a library that lets you control the Chromium browser programmatically, typically in headless mode (without a visible UI).
A headless browser is a great tool for automated testing and server environments where you don't need a visible UI shell. For example, you may want to run some tests against a real web page, create a PDF of it, or just inspect how the browser renders an URL.
The Chrome browser is very powerful and has a lot of features. Anything that is possible in Chrome is possible with puppeteer, and this includes everything possible in DevTools. You can learn more about what you can do with DevTools here.
Here are some use-cases for puppeteer:
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
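Most of these use-cases follow the same pattern: launch a browser, open a page, do something with it, and close the browser. As a quick taste, here's a minimal sketch of the PDF use-case (it assumes puppeteer is already installed, which we'll cover in the next section):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the page and save it as an A4 PDF
  await page.goto('https://example.com');
  await page.pdf({ path: 'example.pdf', format: 'A4' });

  await browser.close();
})();
```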
However, in this post, we're going to use puppeteer exclusively for web-scraping.
Creating a Simple Puppeteer Scraper
To get started, we first need to create a directory and initialize npm (or yarn) from the command line:
```bash
mkdir hello-puppeteer
cd hello-puppeteer
npm init -y
```
We can then install the puppeteer library:

```bash
npm i puppeteer
```
To make sure the library works as intended on all machines, puppeteer bundles its own Chromium browser. This guarantees that the library works out of the box and saves the user from having to configure a browser path or download and install Chrome themselves.
On install, the library downloads a recent version of Chromium (~170MB on Mac, ~282MB on Linux, ~280MB on Windows) that is guaranteed to work with the API.
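If you'd prefer not to rely on the bundled download, puppeteer also accepts an executablePath launch option that points it at a Chrome/Chromium you already have installed. A minimal sketch (the path below is just an example for macOS; adjust it for your system):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Use an existing Chrome install instead of the bundled Chromium.
  // The executablePath below is an example path; yours will differ.
  const browser = await puppeteer.launch({
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```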
For those of you interested, the puppeteer team is currently also working on an experimental version to bring Puppeteer to Firefox.
Since puppeteer is a Node.js library, we need to create a Node.js file and run it with node. For the purpose of this post, we'll name it server.js:

```bash
touch server.js
```
To start our API, we need to configure the package.json file so that we can have node run our server file. We can do this by adding an npm start script in scripts:
```json
{
  "name": "hello-puppeteer",
  "version": "1.0.0",
  "description": "",
  "main": "server.js",
  "scripts": {
    "start": "node server.js"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^2.0.0"
  }
}
```
We are now ready to write the code for a simple scraper in our server.js file:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Creates a headless browser instance in the background
  const browser = await puppeteer.launch();
  // Creates a page instance, similar to opening a new tab
  const page = await browser.newPage();
  // Navigates the page to the url
  await page.goto('https://example.com');
  // Closes the browser instance
  await browser.close();
})();
```
This creates an anonymous async function that gets executed when we run npm start. It launches a browser instance, opens a new page, and navigates to https://example.com. Afterward, it closes the browser instance, and node finishes executing the file.
To make sure this is working as intended, we can take a screenshot of the page after puppeteer has finished navigating to it:

```javascript
await page.screenshot({ path: 'example.png' });
```
After adding this to our file:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Creates a headless browser instance in the background
  const browser = await puppeteer.launch();
  // Creates a page instance, similar to opening a new tab
  const page = await browser.newPage();
  // Navigates the page to the url
  await page.goto('https://example.com');
  // Takes a screenshot of the page after navigating there and saves it as 'example.png'
  await page.screenshot({ path: 'example.png' });
  // Closes the browser instance
  await browser.close();
})();
```
We can replace https://example.com with any working URL. For this example, we'll use https://google.com. We can now run npm start, and after a short while, example.png shows up in our file directory; opening it shows the homepage of Google.
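As an aside, while developing it can be handy to actually watch the browser do its work. Puppeteer supports this through the headless and slowMo launch options; this is purely a debugging tweak, not something our scraper needs. Replacing the launch line in server.js with the following opens a visible browser window and slows things down:

```javascript
// Opens a visible (non-headless) browser window and slows each
// puppeteer operation down by 100ms so you can follow along visually.
const browser = await puppeteer.launch({ headless: false, slowMo: 100 });
```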
We're almost done with our simple web scraper. We can now extract any information we want from Google's homepage. For now, we'll just get the image source of the Google homepage logo. The logo itself isn't useful; the point is that we can access this information programmatically.
To do this, we need to open the URL in our browser and find the element we're looking for by inspecting the page. You can right-click on the page and choose Inspect, or open DevTools directly and navigate to the HTML in the Elements tab.
After using the element-picker tool to highlight the logo, DevTools shows the element it corresponds to (this might be different for you).
The important thing to look for is anything that uniquely identifies the HTML element. In our case, the img element has the id hplogo, so we can use this to get the image source.
There are many different ways to get the specific element(s) from the DOM/page.
To target a single element, we can use the $eval method, passing the id, class, or any other identifying attribute of the element we're looking for as the selector parameter:

```javascript
page.$eval(selector, callbackFunc);
```
This method runs document.querySelector(selector) within the page and passes the matching element as the first argument to callbackFunc. If there's no element matching selector, the method throws an error.
To target multiple elements, we can use:

```javascript
page.$$eval(selector, callbackFunc);
```
This method runs document.querySelectorAll(selector) within the page and passes the resulting array of elements as the first argument to callbackFunc. Unlike $eval, it doesn't throw when nothing matches; the callback simply receives an empty array.
If the element is found, it is passed as the first argument to the callback function and therefore we can use it to get the information we need.
```javascript
const googleLogo = await page.$eval('#hplogo', (img) => img.src);
```
Elements are targeted with the same selectors you would use in CSS or with document.querySelector in JavaScript.
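If we needed several elements instead of one, a small sketch with $$eval (not part of our scraper) might look like this, grabbing the href of every link on the page:

```javascript
// Runs document.querySelectorAll('a') in the page and maps the resulting
// elements to their href attributes, returning a plain array of strings.
const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
console.log(links);
```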
In our case, since we only need a single image, we can use $eval and access the src directly:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Creates a headless browser instance in the background
  const browser = await puppeteer.launch();
  // Creates a page instance, similar to opening a new tab
  const page = await browser.newPage();
  // Navigates the page to the url
  await page.goto('https://google.com');
  // Takes a screenshot of the page after navigating there
  await page.screenshot({ path: 'example.png' });
  // Finds the first element with the id 'hplogo' and returns its source attribute
  const googleLogo = await page.$eval('#hplogo', (img) => img.src);
  console.log(googleLogo);
  // Closes the browser instance
  await browser.close();
})();
```
After running npm start, the server logs the following to the console:
https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png
If we open this URL in a browser tab, we can see that it's the image we were looking for! Our simple web scraper is now complete!
The code for this simple web scraper can be found on GitHub.
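One optional improvement worth mentioning (a suggestion on top of the code above, not something it needs in order to work): if page.goto or $eval throws, the script exits without ever reaching browser.close(), which can leave a Chromium process running in the background. Wrapping the work in try/finally guarantees the browser always gets closed:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://google.com');
    const googleLogo = await page.$eval('#hplogo', (img) => img.src);
    console.log(googleLogo);
  } finally {
    // Always close the browser, even if navigation or scraping fails
    await browser.close();
  }
})();
```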
In the next part, we'll create a Search Engine API using Google Search. The user will be able to request our API with a search query. Our API will then scrape Google and return the top 10 search results.
This is the end of Part 1. I hope you enjoyed reading this, and stay tuned for Part 2! Any feedback is appreciated!