Introduction
Web scraping is the process of extracting meaningful data from websites. Although it can be done manually, there are now several developer-friendly tools that automate the process for you.
In this tutorial we are going to create a web scraper using Puppeteer, a Node.js library developed by Google that automates tasks in the Chromium engine.
Web scraping is just one of the several applications that make Puppeteer shine. In fact, according to the official documentation on GitHub, Puppeteer can be used to:
- Generate screenshots and PDFs of web pages.
- Crawl a Single-Page Application and generate pre-rendered content.
- Automate form submission, UI testing, keyboard input, and interface interactions.
- Create an up-to-date, automated testing environment.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
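To give you a taste of how terse these automations can be, here is a minimal sketch (not part of the scraper we build in this guide; the URL is just an example) that saves a page as a PDF:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Render the page and save it as an A4 PDF in the current directory
  await page.pdf({ path: 'example.pdf', format: 'A4' });
  await browser.close();
})();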
In this guide, we will first create scripts to showcase Puppeteer's capabilities, then build an API to scrape pages via a simple HTTP call using Express.js, and finally deploy the application on Koyeb.
Requirements
To successfully follow and complete this guide, you need:
- Basic knowledge of JavaScript.
- A local development environment with Node.js installed.
- Basic knowledge of Express.js.
- Docker installed on your machine.
- A Koyeb account to deploy and run the application.
- A GitHub account to version and deploy your application code on Koyeb.
This tutorial does not require any prior knowledge of Puppeteer as we will go through every step of setting up and running a web scraper.
However, make sure your version of Node.js is at least 10.18.1 as we are using Puppeteer v3+.
For more information, take a look at the official README in the Puppeteer GitHub repository.
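You can check which Node.js version is installed on your machine with:
node --version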
Steps
To deploy a web scraper using Puppeteer, Express.js, and Docker on Koyeb, you need to follow these steps:
- Initializing the project
- Your first Puppeteer application
- Puppeteer in action
- Scrape pages via a simple API call using Express
- Deploy the app on Koyeb
Initializing the project
Get started by creating a new directory that will hold the application code.
In a location of your choice, create a new directory and navigate into it by executing the following commands:
mkdir puppeteer-on-koyeb
cd puppeteer-on-koyeb
Inside the freshly created folder, we will create a Node.js application skeleton containing the Express.js dependencies that we will need to build our scraping API. In your terminal run:
npx express-generator
You will be prompted with a set of questions to populate the initial package.json file, including the project name, version, and description. Once the command completes, your package.json content should be similar to the following:
{
  "name": "puppeteer-on-koyeb",
  "version": "1.0.0",
  "description": "Deploy a Web scraper using Puppeteer, ExpressJS and Docker on Koyeb",
  "private": true,
  "scripts": {
    "start": "node ./bin/www"
  },
  "dependencies": {
    "cookie-parser": "~1.4.4",
    "debug": "~2.6.9",
    "express": "~4.16.1",
    "http-errors": "~1.6.3",
    "jade": "~1.11.0",
    "morgan": "~1.9.1"
  },
  "author": "Samuel Zaza",
  "license": "ISC"
}
Creating the application skeleton will come in handy to organize our files, especially later on when we create our API endpoints.
Next, add Puppeteer, the library we will use to perform the scraping, as a project dependency by running:
npm install --save puppeteer
Last, install and configure nodemon. While optional in this guide, nodemon automatically restarts the server when file changes are detected, which greatly improves the local development experience. To install nodemon, run the following in your terminal:
npm install nodemon --save-dev
Then, in your package.json, add the following section so we can launch the application in development with npm run dev (using nodemon) and in production with npm start.
{
  "name": "puppeteer-on-koyeb",
  "version": "1.0.0",
  "description": "Deploy a Web scraper using Puppeteer, ExpressJS and Docker on Koyeb",
  "private": true,
+  "scripts": {
+    "start": "node ./bin/www",
+    "dev": "nodemon ./bin/www"
+  },
  "dependencies": {
    "cookie-parser": "~1.4.4",
    "debug": "~2.6.9",
    "express": "~4.16.1",
    "http-errors": "~1.6.3",
    "jade": "~1.11.0",
    "morgan": "~1.9.1"
  },
  "author": "Samuel Zaza",
  "license": "ISC"
}
Execute the following command to launch the application and ensure everything is working as expected:
npm start
# or npm run dev
Open your browser at http://localhost:3000
and you should see the Express welcome message.
Your first Puppeteer application
Before diving into more advanced Puppeteer web scraping capabilities, we will create a minimalist application to take a webpage screenshot and save the result in our current directory.
As mentioned previously, Puppeteer provides several features to control Chrome/Chromium, and the ability to take screenshots of a webpage comes in very handy.
For this example, we will not dig into each parameter of the screenshot method, as we mainly want to confirm that our installation works properly.
Create a new JavaScript file named screenshot.js
in your project directory puppeteer-on-koyeb
by running:
touch screenshot.js
To take a screenshot of a webpage, our application will:
- Use Puppeteer and create a new instance of Browser.
- Open a webpage.
- Take a screenshot.
- Close the page and the browser.
Add the code below to the screenshot.js
file:
const puppeteer = require('puppeteer');

const URL = 'https://koyeb.com';

const screenshot = async () => {
  console.log('Opening the browser...');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  console.log(`Go to ${URL}`);
  await page.goto(URL);

  console.log('Taking a screenshot...');
  await page.screenshot({
    path: './screenshot.png',
    fullPage: true,
  });

  console.log('Closing the browser...');
  await page.close();
  await browser.close();

  console.log('Job done!');
};

screenshot();
As you can see, we are taking a screenshot of the Koyeb homepage and saving the result as a png
file in the root folder of the project.
You can now run the application by running:
$ node screenshot.js
Opening the browser...
Taking a screenshot...
Closing the browser...
Job done!
Once the execution is completed, a screenshot is saved in the root folder of the application.
You created your first automation using Puppeteer!
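If you want to experiment further, page.screenshot accepts a few other options. As a sketch, you could replace the page.screenshot call in screenshot.js with the following to save a JPEG of the visible viewport only (the file name is arbitrary):
// Alternative options: a JPEG of the visible viewport with reduced quality
await page.screenshot({
  path: './screenshot.jpg',
  type: 'jpeg',
  quality: 80,
  fullPage: false,
});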
Puppeteer in action
Simple Scraper
In this section, we will create a more advanced scenario to scrape and retrieve information from a website's page.
For this example, we will use the Stack Overflow questions page and instruct Puppeteer to extract each question and excerpt present in the page HTML.
Before jumping into the code, open the DevTools in your browser and inspect the webpage source code. You should see a similar block for each question in the HTML:
<div class="question-summary" id="question-summary-11227809">
  <!-- we can ignore the stats wrapper -->
  <div class="statscontainer">...</div>
  <div class="summary">
    <h3><a href="/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array" class="question-hyperlink">Why is processing a sorted array faster than processing an unsorted array?</a></h3>
    <div class="excerpt">
      Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data (before the timed region) miraculously makes the loop almost six times faster. #include &...
    </div>
    <!-- unnecessary wrapper -->
    <div class="d-flex ai-start fw-wrap">...</div>
  </div>
</div>
We will use the JavaScript methods querySelectorAll and querySelector to extract both the question and the excerpt, and return the result as an array of objects.
- querySelectorAll: used to collect each question element with document.querySelectorAll('.question-summary').
- querySelector: used to extract the question title by calling querySelector('.question-hyperlink').innerText and the excerpt using querySelector('.excerpt').innerText.
Back in your terminal, create a new folder lib containing a file called scraper.js:
mkdir lib
touch lib/scraper.js
Inside the file scraper.js
add the following code:
const puppeteer = require('puppeteer');

const URL = 'https://stackoverflow.com/questions';

const singlePageScraper = async () => {
  console.log('Opening the browser...');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  console.log(`Navigating to ${URL}...`);
  await page.goto(URL, { waitUntil: 'load' });

  console.log(`Collecting the questions...`);
  const questions = await page.evaluate(() => {
    return [...document.querySelectorAll('.question-summary')].map((question) => {
      return {
        question: question.querySelector('.question-hyperlink').innerText,
        excerpt: question.querySelector('.excerpt').innerText,
      };
    });
  });

  console.log('Closing the browser...');
  await page.close();
  await browser.close();

  console.log('Job done!');
  console.log(questions);
  return questions;
};

module.exports = {
  singlePageScraper,
};
Although it looks more complex than the screenshot.js script, it performs essentially the same actions, with the scraping step added. Let's list them:
- Create an instance of Browser and open a page.
- Go to the URL (and wait for the website to load).
- Extract the information from the website and collect the questions into an array of objects.
- Close the browser and return the collected questions as question-excerpt pairs.
The scraping syntax might be confusing at first:
const questions = await page.evaluate(() => {
  return [...document.querySelectorAll('.question-summary')].map((question) => {
    return {
      question: question.querySelector('.question-hyperlink').innerText,
      excerpt: question.querySelector('.excerpt').innerText,
    };
  });
});
We first call page.evaluate to interact with the page DOM, and then we extract the question and the excerpt.
Moreover, in the code above, we transform the result of document.querySelectorAll into a JavaScript array to be able to call map on it and return the pair { question, excerpt } for each question.
To run the function, we can import it into a new file singlePageScraper.js
, and run it. Create the file in the root directory of your application:
touch singlePageScraper.js
Then, copy the code below, which imports the singlePageScraper function and calls it:
const { singlePageScraper } = require('./lib/scraper');
singlePageScraper();
Run the script by executing the following command in your terminal:
node singlePageScraper.js
The following output appears in your terminal showing the questions and excerpts retrieved from the StackOverflow questions page:
Opening the browser...
Navigating to https://stackoverflow.com/questions...
Collecting the questions...
Closing the browser...
Job done!
[
{
question: 'Google Places Autocomplete for WPF Application',
excerpt: 'I have a windows desktop application developed in WPF( .Net Framework.)I want to implement Autocomplete textbox using google places api autocomplete, I found few reference which used Web browser to do ...'
},
{
question: 'Change the field of a struct in Go',
excerpt: "I'm trying to change a parameter of n1's left variable, but n1.left is not available, neither is n1.value or n1.right. What's wrong with this declarations? // lib/tree.go package lib type TreeNode ..."
},
...
]
Multi page scraper
In the previous example, we learned how to scrape and retrieve information from a single page.
We can now go even further and instruct Puppeteer to explore and extract information from multiple pages.
For this scenario, we will scrape and extract questions and excerpts from a pre-defined number of Stack Overflow questions pages.
Our script will:
- Receive the number of pages to scrape as a parameter.
- Extract the questions and excerpts from a page.
- Programmatically click on the "next page" element.
- Repeat steps 2 and 3 until the number of pages to scrape is reached.
Based on the singlePageScraper function we previously created in lib/scraper.js, we will create a new function taking the number of pages to scrape as an argument.
Let's take a look at the HTML source code of the page to find the element we need to click on to go to the next page:
<div class="s-pagination site1 themed pager float-left">
  <div class="s-pagination--item is-selected">1</div>
  <a class="s-pagination--item js-pagination-item" href="/questions?tab=votes&page=2" rel="" title="Go to page 2">2</a>
  ...
  <a class="s-pagination--item js-pagination-item" href="/questions?tab=votes&page=1470621" rel="" title="Go to page 1470621">1470621</a>
  <!-- This is the button we need to select and click on -->
  <a class="s-pagination--item js-pagination-item" href="/questions?tab=votes&page=2" rel="next" title="Go to page 2"> Next</a>
</div>
The Puppeteer class Page provides a handy click method that accepts a CSS selector to simulate a click on an element. In our case, to go to the next page, we use the .pager > a:last-child selector.
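Note that a click that triggers a navigation can race with reading the new page: waiting for the .question-summary selector is not enough, because that selector already matches the page we are leaving. A more reliable pattern, which the multi-page function below uses, is to start waiting for the navigation before clicking and await both together:
// Start waiting for the navigation, then click, and resolve both together
await Promise.all([
  page.waitForNavigation({ waitUntil: 'load' }),
  page.click('.pager > a:last-child'),
]);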
In the lib/scraper.js
file, create a new function called multiPageScraper
:
const puppeteer = require('puppeteer');

const URL = 'https://stackoverflow.com/questions';

const singlePageScraper = async () => {
  console.log('Opening the browser...');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  console.log(`Navigating to ${URL}...`);
  await page.goto(URL, { waitUntil: 'load' });

  console.log(`Collecting the questions...`);
  const questions = await page.evaluate(() => {
    return [...document.querySelectorAll('.question-summary')].map((question) => {
      return {
        question: question.querySelector('.question-hyperlink').innerText,
        excerpt: question.querySelector('.excerpt').innerText,
      };
    });
  });

  console.log('Closing the browser...');
  await page.close();
  await browser.close();

  console.log('Job done!');
  console.log(questions);
  return questions;
};

+const multiPageScraper = async (pages = 1) => {
+  console.log('Opening the browser...');
+  const browser = await puppeteer.launch();
+  const page = await browser.newPage();
+
+  console.log(`Navigating to ${URL}...`);
+  await page.goto(URL, { waitUntil: 'load' });
+
+  const totalPages = pages;
+  let questions = [];
+
+  for (let initialPage = 1; initialPage <= totalPages; initialPage++) {
+    console.log(`Collecting the questions of page ${initialPage}...`);
+    let pageQuestions = await page.evaluate(() => {
+      return [...document.querySelectorAll('.question-summary')].map((question) => {
+        return {
+          question: question.querySelector('.question-hyperlink').innerText,
+          excerpt: question.querySelector('.excerpt').innerText,
+        };
+      });
+    });
+
+    questions = questions.concat(pageQuestions);
+    console.log(questions);
+
+    // Go to the next page until the total number of pages to scrape is reached
+    if (initialPage < totalPages) {
+      await Promise.all([
+        page.waitForNavigation({ waitUntil: 'load' }),
+        page.click('.pager > a:last-child'),
+      ]);
+    }
+  }
+
+  console.log('Closing the browser...');
+
+  await page.close();
+  await browser.close();
+
+  console.log('Job done!');
+  return questions;
+};

module.exports = {
  singlePageScraper,
+  multiPageScraper,
};
Since we are collecting questions and related excerpts for multiple pages, we use a for loop to retrieve the list of questions for each page.
The questions retrieved for each page are then concatenated into a questions array, which is returned once the scraping is completed.
As we did for the single page scraper example, create a new file multiPageScraper.js
in the root directory to import and call the multiPageScraper
function:
touch multiPageScraper.js
Then, add the following code:
const { multiPageScraper } = require('./lib/scraper');
multiPageScraper(2);
For the purpose of this script, we are hardcoding the number of pages to fetch to 2. We will make this dynamic when we build the API.
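If you want to play with the page count before the API exists, one option (a sketch, not required for the rest of the guide) is to read it from the command line instead of hardcoding it:
const { multiPageScraper } = require('./lib/scraper');

// e.g. `node multiPageScraper.js 3` scrapes the first three pages
const pages = Number(process.argv[2]) || 2;
multiPageScraper(pages);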
In your terminal execute the following command to run the script:
node multiPageScraper.js
The following output appears in your terminal showing the questions and excerpts retrieved for each page scraped:
Opening the browser...
Navigating to https://stackoverflow.com/questions...
Collecting the questions of page 1...
[
{
question: 'Blazor MAIUI know platform',
excerpt: 'there is some way to known the platform where is running my Blazor maui app?. Select not work property in "Windows" (you need use size=2 or the list not show), i would read the platform in ...'
},
...
]
Collecting the questions of page 2...
[
{
question: 'Blazor MAIUI know platform',
excerpt: 'there is some way to known the platform where is running my Blazor maui app?. Select not work property in "Windows" (you need use size=2 or the list not show), i would read the platform in ...'
},
...
]
Closing the browser...
Job done!
In the next section, we will write a simple API server containing one endpoint to scrape a user-defined number of pages and return the list of questions and excerpts collected.
Scrape pages via a simple API call using Express
We are going to create a simple Express.js API server with a /questions endpoint that accepts a query parameter pages and returns the list of questions and excerpts from page 1 up to the page number sent as a parameter.
For instance, to retrieve the first three pages of questions and their excerpts from Stack Overflow, we will call:
http://localhost:3000/questions?pages=3
To create the questions endpoint, go to the routes directory and create a new file questions.js:
cd routes
touch questions.js
Then, add the code below to the questions.js file:
const express = require('express');
const scraper = require('../lib/scraper');

const router = express.Router();

router.get('/', async (req, res, next) => {
  // 1. Get the parameter "pages"
  const { pages } = req.query;
  // 2. Call the scraper function
  const questions = await scraper.multiPageScraper(pages);
  // 3. Return the array of questions to the client
  res.status(200).json({
    statusCode: 200,
    message: 'Questions correctly retrieved',
    data: { questions },
  });
});

module.exports = router;
Here is what the code does:
- Get the query parameter pages.
- Call the newly created multiPageScraper function and pass it the pages value.
- Return the array of questions back to the client.
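One thing the route does not handle yet is scraping failures: if Puppeteer throws, the request never receives a response. A minimal sketch of a more defensive version wraps the call in a try/catch and forwards the error to Express:
router.get('/', async (req, res, next) => {
  try {
    const { pages } = req.query;
    const questions = await scraper.multiPageScraper(pages);
    res.status(200).json({
      statusCode: 200,
      message: 'Questions correctly retrieved',
      data: { questions },
    });
  } catch (error) {
    // Delegate to the Express error handler
    next(error);
  }
});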
We now need to register the questions route in the Express app. To do so, open app.js and add the following:
const createError = require('http-errors');
const express = require('express');
const path = require('path');
const cookieParser = require('cookie-parser');
const logger = require('morgan');
const indexRouter = require('./routes/index');
+const questionsRouter = require('./routes/questions');
const app = express();
// view engine setup
app.set('views', path.join(__dirname, 'views'));
app.set('view engine', 'jade');
app.use(logger('dev'));
app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use(cookieParser());
app.use(express.static(path.join(__dirname, 'public')));
app.use('/', indexRouter);
+app.use('/questions', questionsRouter);
Note that thanks to the Express generator we used to initialize the project, we do not have to set up our server from scratch: a few middlewares are already configured for us.
Run the server again and try to call the /questions API endpoint using either the browser or cURL.
Here is the output you should get when running it from your terminal using cURL:
$ curl http://localhost:3000/questions\?pages\=2 | jq '.'
{
"statusCode": 200,
"message": "Questions correctly retrieved",
"data": {
"questions": [
{
"question": "Is there a way to return float or integer from a conditional True/False",
"excerpt": "n_level = range(1, steps + 2) steps is user input,using multi-index dataframe for i in n_level: if df['Crest'] >= df[f'L{i}K']: df['Marker'] = i elif df['Trough'] &..."
},
{
"question": "Signin With Popup - Firebase and Custom Provider",
"excerpt": "I am working on an application that authenticates users with a Spotify account. I have a working login page, however, I would prefer users to not be sent back to the home page when they sign in. I ..."
},
...
And we are done! We now have a web scraper that, with minimal changes, can scrape other websites.
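As an illustration of how little has to change, here is a hedged sketch of a generalized scraper that takes a URL, an item selector, and a map of field selectors as hypothetical parameters (none of this is used elsewhere in the guide):
const puppeteer = require('puppeteer');

// Hypothetical generalization: scrape any listing page given a URL and CSS selectors
const genericScraper = async (url, itemSelector, fieldSelectors) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'load' });

  const items = await page.evaluate((itemSel, fieldSels) => {
    return [...document.querySelectorAll(itemSel)].map((item) => {
      const result = {};
      for (const [field, selector] of Object.entries(fieldSels)) {
        const node = item.querySelector(selector);
        result[field] = node ? node.innerText : null;
      }
      return result;
    });
  }, itemSelector, fieldSelectors);

  await browser.close();
  return items;
};

// Reproducing the Stack Overflow example:
// genericScraper('https://stackoverflow.com/questions', '.question-summary', {
//   question: '.question-hyperlink',
//   excerpt: '.excerpt',
// });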
Deploy the app on Koyeb
Now that we have a working server, we can deploy the application on Koyeb.
Koyeb's simple user interface gives you two choices to deploy your app:
- Deploy native code using git-driven deployment
- Deploy pre-built Docker containers from any public or private registries.
Since we are using Puppeteer, we need some extra system packages installed, so we will deploy on Koyeb using Docker.
This guide won't go into every detail of writing Dockerfiles and pushing images to a Docker registry; if you are interested in learning more, we suggest starting with the official documentation.
Before we start working with Docker, we have to make a change to the lib/scraper.js file:
const puppeteer = require('puppeteer');

const URL = 'https://stackoverflow.com/questions';

const singlePageScraper = async () => {
  console.log('Opening the browser...');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  console.log(`Navigating to ${URL}...`);
  await page.goto(URL, { waitUntil: 'load' });

  console.log(`Collecting the questions...`);
  const questions = await page.evaluate(() => {
    return [...document.querySelectorAll('.question-summary')].map((question) => {
      return {
        question: question.querySelector('.question-hyperlink').innerText,
        excerpt: question.querySelector('.excerpt').innerText,
      };
    });
  });

  console.log('Closing the browser...');
  await page.close();
  await browser.close();

  console.log('Job done!');
  console.log(questions);
  return questions;
};

const multiPageScraper = async (pages = 1) => {
  console.log('Opening the browser...');
-  const browser = await puppeteer.launch();
+  const browser = await puppeteer.launch({
+    headless: true,
+    executablePath: '/usr/bin/chromium-browser',
+    args: [
+      '--no-sandbox',
+      '--disable-gpu',
+    ],
+  });
  const page = await browser.newPage();

  console.log(`Navigating to ${URL}...`);
  await page.goto(URL, { waitUntil: 'load' });

  const totalPages = pages;
  let questions = [];

  for (let initialPage = 1; initialPage <= totalPages; initialPage++) {
    console.log(`Collecting the questions of page ${initialPage}...`);
    let pageQuestions = await page.evaluate(() => {
      return [...document.querySelectorAll('.question-summary')].map((question) => {
        return {
          question: question.querySelector('.question-hyperlink').innerText,
          excerpt: question.querySelector('.excerpt').innerText,
        };
      });
    });

    questions = questions.concat(pageQuestions);
    console.log(questions);

    // Go to the next page until the total number of pages to scrape is reached
    if (initialPage < totalPages) {
      await Promise.all([
        page.waitForNavigation({ waitUntil: 'load' }),
        page.click('.pager > a:last-child'),
      ]);
    }
  }

  console.log('Closing the browser...');

  await page.close();
  await browser.close();

  console.log('Job done!');
  return questions;
};

module.exports = {
  singlePageScraper,
  multiPageScraper,
};
These extra parameters are required to properly run Puppeteer inside a Docker container.
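Note that hardcoding executablePath means the scraper will no longer run on your local machine, where Chromium lives at a different path (or is downloaded by Puppeteer itself). If you want to keep both workflows, one option (a sketch, not used in the rest of this guide) is to only override the path when an environment variable is set:
// Hypothetical tweak: only force the Chromium path when PUPPETEER_EXECUTABLE_PATH is set (e.g. in Docker)
const browser = await puppeteer.launch({
  headless: true,
  executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || undefined,
  args: ['--no-sandbox', '--disable-gpu'],
});
You would then set PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser in the Dockerfile and leave it unset locally.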
Dockerize the application and push it to the Docker Hub
Get started by creating a Dockerfile containing the following:
FROM node:lts-alpine

WORKDIR /app

RUN apk update && apk add --no-cache nmap && \
    echo @edge http://nl.alpinelinux.org/alpine/edge/community >> /etc/apk/repositories && \
    echo @edge http://nl.alpinelinux.org/alpine/edge/main >> /etc/apk/repositories && \
    apk update && \
    apk add --no-cache \
      chromium \
      harfbuzz \
      "freetype>2.8" \
      ttf-freefont \
      nss

ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

COPY . /app

RUN npm install

EXPOSE 3000

CMD ["npm", "start"]
This is a fairly simple Dockerfile. We inherit from the Node Alpine base image, install the system packages required by Puppeteer, add our web scraper application code, and indicate how to run it. We also set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD so that npm install does not download a bundled Chromium, since we install Chromium through apk instead.
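You may also want to add a .dockerignore file next to the Dockerfile so that local artifacts are not copied into the image (a suggestion, not part of the original setup):
node_modules
npm-debug.log
screenshot.png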
We can now build the Docker image by running the following command:
docker build . -t <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb
Take care to replace <YOUR_DOCKER_USERNAME>
with your Docker Hub username.
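Optionally, you can test the image locally before pushing it (assuming nothing else is listening on port 3000):
docker run --rm -p 3000:3000 <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb
You should then be able to call http://localhost:3000/questions?pages=1 as before.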
Once the build succeeds, we can push our image to the Docker Hub by running:
docker push <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb
Deploy the app on Koyeb
Let's log in to the Koyeb Control Panel and click the Create App button. You will land on the App creation page.
- In "Deployment method", select Docker.
- Enter the Docker image you just pushed to the Docker Hub: <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb. We do not need to configure the extra args or command fields.
- Pick the container size, server region, and number of instances you'd like to run your application.
- In the Ports section, change the port value from 8080 to 3000. Koyeb uses this port to determine whether your service is healthy, and 3000 is the port our application listens on.
- Give your Koyeb App a name.
Once you click the Create App button, you will automatically be redirected to the Koyeb App page where you can follow the progress of your application deployment. Once your app is deployed, click on the Public URL ending with koyeb.app
.
Then ensure everything is working as expected by retrieving the first two pages of questions from Stack Overflow:
curl http://<KOYEB_APP_NAME>-<KOYEB_ORG_NAME>.koyeb.app/questions?pages=2
If everything is working fine, you should see the list of questions returned by the API.
Conclusion
First of all, congratulations on reaching this point! It was a long journey but we now have all the basic knowledge to successfully create and deploy a web scraper.
Starting from the beginning, we played with Puppeteer's screenshot capabilities and gradually built up a fairly robust scraper that can automatically move to the next page to retrieve questions from Stack Overflow.
From there, we turned a simple script into a running Express API server that exposes an endpoint to run the scraper on a dynamic number of pages based on the query parameter sent along with the API call.
Finally, the cherry on top is the deployment of our server on Koyeb: thanks to the simplicity of deploying pre-built Docker images, we can now perform our scraping in a production environment.
If you have any questions or suggestions regarding this guide, feel free to reach out on Slack.