In this 3-part series, we're going to learn how to turn any webpage into our own personal API. We'll do this by building a Search API that scrapes Google Search for its results, and we'll do all of the web scraping with puppeteer.
This is a 3-part series:
- Part 1: We'll go over the basics and create a simple web scraper.
- Part 2: We'll create a Search Engine API using Google Search.
- Part 3: We'll go over different ways to optimize our API, improve performance, cover troubleshooting basics, and deploy our puppeteer API to the web.
Table Of Contents - Part 1
- Basics of Puppeteer
- Creating a Simple Puppeteer Scraper
Basics of Puppeteer
We first need to understand what puppeteer is and what you can do with it.
What exactly is Puppeteer?
The definition according to official docs:
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
In simple words, puppeteer is a library that lets you control the Chromium browser programmatically, typically in headless mode (without a visible UI).
A headless browser is a great tool for automated testing and server environments where you don't need a visible UI shell. For example, you may want to run some tests against a real web page, create a PDF of it, or just inspect how the browser renders an URL.
The Chrome browser is very powerful and has a lot of features. Anything that is possible in Chrome is possible with puppeteer, and this includes everything possible in DevTools. You can learn more about what you can do with DevTools here.
Here are some use-cases for puppeteer:
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
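Most of these use-cases follow the same pattern: launch a browser, open a page, do something with it, and close the browser. As a quick taste, here's a minimal sketch of the PDF use-case (it assumes puppeteer is already installed, which we'll cover in the next section):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the page and save it as an A4 PDF
  await page.goto('https://example.com');
  await page.pdf({ path: 'example.pdf', format: 'A4' });

  await browser.close();
})();
```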
However, in this post, we're going to use puppeteer exclusively for web-scraping.
Creating a Simple Puppeteer Scraper
To get started, we first need to create a directory and initialize npm (or yarn) from the command line:
```bash
mkdir hello-puppeteer
cd hello-puppeteer
npm init -y
```
We can then install the puppeteer library:

```bash
npm i puppeteer
```
To make sure the library works as intended on all machines, puppeteer bundles its own Chromium browser. This guarantees that the library works out of the box and saves the user from having to configure a browser path or download and install Chrome themselves.
On install, the library downloads a recent version of Chromium (~170MB on Mac, ~282MB on Linux, ~280MB on Windows) that is guaranteed to work with the API.
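If you'd prefer not to rely on the bundled download, puppeteer also accepts an executablePath launch option that points it at a Chrome/Chromium you already have installed. A minimal sketch (the path below is just an example for macOS; adjust it for your system):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Use an existing Chrome install instead of the bundled Chromium.
  // The executablePath below is an example path; yours will differ.
  const browser = await puppeteer.launch({
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```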
For those of you interested, the puppeteer team is currently also working on an experimental version to bring Puppeteer to Firefox.
Since puppeteer is a Node.js library, we need to create a Node.js file and run it with node. For the purpose of this post, we'll name it server.js:

```bash
touch server.js
```
To start our API, we need to configure the package.json file so that we can have node run our server file. We can do this by adding an npm start script in scripts:
```json
{
  "name": "hello-puppeteer",
  "version": "1.0.0",
  "description": "",
  "main": "server.js",
  "scripts": {
    "start": "node server.js"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^2.0.0"
  }
}
```
We are now ready to write the code for a simple scraper in our server.js file:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Creates a headless browser instance in the background
  const browser = await puppeteer.launch();
  // Creates a page instance, similar to opening a new tab
  const page = await browser.newPage();
  // Navigates the page to the url
  await page.goto('https://example.com');
  // Closes the browser instance
  await browser.close();
})();
```
This creates an anonymous async function that gets executed when we run npm start. It launches a browser instance, opens a new page, and navigates to https://example.com. Afterward, it closes the browser instance, and node finishes executing the file.
To make sure this is working as intended, we can take a screenshot of the page after puppeteer has finished navigating to it:

```javascript
await page.screenshot({ path: 'example.png' });
```
After adding this to our file:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Creates a headless browser instance in the background
  const browser = await puppeteer.launch();
  // Creates a page instance, similar to opening a new tab
  const page = await browser.newPage();
  // Navigates the page to the url
  await page.goto('https://example.com');
  // Takes a screenshot of the page after navigating there and saves it as 'example.png'
  await page.screenshot({ path: 'example.png' });
  // Closes the browser instance
  await browser.close();
})();
```
We can replace https://example.com with any working URL. For this example, we'll use https://google.com. We can now run npm start, and after a short while, example.png shows up in our file directory; opening it shows the homepage of Google.
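As an aside, while developing it can be handy to actually watch the browser do its work. Puppeteer supports this through the headless and slowMo launch options; this is purely a debugging tweak, not something our scraper needs. Replacing the launch line in server.js with the following opens a visible browser window and slows things down:

```javascript
// Opens a visible (non-headless) browser window and slows each
// puppeteer operation down by 100ms so you can follow along visually.
const browser = await puppeteer.launch({ headless: false, slowMo: 100 });
```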
We're almost done with our simple web scraper. We can now extract any information we want from Google's homepage. For now, we'll just get the image source of the Google homepage logo. The logo itself isn't useful; the point is that we can access this information programmatically.
To do this, we need to open the URL in our browser and find the element we're looking for by inspecting the page. You can right-click on the page and choose Inspect, or open DevTools directly and navigate to the HTML in the Elements tab.
After using the element-picker tool to highlight the logo, DevTools shows the element it corresponds to (this might be different for you).
The important thing to look for is anything that uniquely identifies the HTML element. In our case, the img element has the id hplogo, so we can use this to get the image source.
There are many different ways to get the specific element(s) from the DOM/page.
To target a single element, we can use the $eval method, passing the id, class, or any other identifying attribute of the element we're looking for as the selector parameter:

```javascript
page.$eval(selector, callbackFunc);
```
This method runs document.querySelector(selector) within the page and passes the matching element as the first argument to callbackFunc. If there's no element matching selector, the method throws an error.
To target multiple elements, we can use:

```javascript
page.$$eval(selector, callbackFunc);
```
This method runs document.querySelectorAll(selector) within the page and passes the resulting array of elements as the first argument to callbackFunc. Unlike $eval, it doesn't throw when nothing matches; the callback simply receives an empty array.
If the element is found, it is passed as the first argument to the callback function and therefore we can use it to get the information we need.
```javascript
const googleLogo = await page.$eval('#hplogo', (img) => img.src);
```
Elements are targeted with the same selectors you would use in CSS or with document.querySelector in JavaScript.
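If we needed several elements instead of one, a small sketch with $$eval (not part of our scraper) might look like this, grabbing the href of every link on the page:

```javascript
// Runs document.querySelectorAll('a') in the page and maps the resulting
// elements to their href attributes, returning a plain array of strings.
const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
console.log(links);
```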
In our case, since we only need a single image, we can use $eval and access the src directly:
```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Creates a headless browser instance in the background
  const browser = await puppeteer.launch();
  // Creates a page instance, similar to opening a new tab
  const page = await browser.newPage();
  // Navigates the page to the url
  await page.goto('https://google.com');
  // Takes a screenshot of the page after navigating there
  await page.screenshot({ path: 'example.png' });
  // Finds the first element with the id 'hplogo' and returns its source attribute
  const googleLogo = await page.$eval('#hplogo', (img) => img.src);
  console.log(googleLogo);
  // Closes the browser instance
  await browser.close();
})();
```
After running npm start, the server logs the following to the console:
https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png
If we open this URL in a browser tab, we can see that it's the image we were looking for! Our simple web scraper is now complete!
The code for this simple web scraper can be found on GitHub.
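One optional improvement worth mentioning (a suggestion on top of the code above, not something it needs in order to work): if page.goto or $eval throws, the script exits without ever reaching browser.close(), which can leave a Chromium process running in the background. Wrapping the work in try/finally guarantees the browser always gets closed:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://google.com');
    const googleLogo = await page.$eval('#hplogo', (img) => img.src);
    console.log(googleLogo);
  } finally {
    // Always close the browser, even if navigation or scraping fails
    await browser.close();
  }
})();
```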
In the next part, we'll create a Search Engine API using Google Search. The user will be able to request our API with a search query. Our API will then scrape Google and return the top 10 search results.
This is the end of Part 1. I hope you enjoyed reading this, and stay tuned for Part 2! Any feedback is appreciated!