In this post, we'll look at how we can optimize and improve our puppeteer Web Scraping API. We'll also look at several puppeteer plugins that help the API produce more consistent results. Even though this post refers to a Web Scraping API that we built with puppeteer, some of these tips apply to web scraping in general and can be used with web scrapers built with other tools and languages, e.g. Python.
This is the 3rd part of the 3-Part Series Web Scraping with Puppeteer:
- 1st Part: Basics of Puppeteer and Creating a Simple Web Scraper.
- 2nd Part: Creating Search Engine API using Google Search with Node/Express and Puppeteer.
- 3rd Part: Optimising our API, Increasing Performance, Troubleshooting Basics and Deploying our Puppeteer API to the Web.
Headless Mode Off (Troubleshooting)
The simplest way to troubleshoot puppeteer is to turn headless mode off. Doing so shows the full version of the Chromium browser and you can see exactly what puppeteer is trying to do. To do this, we can set the headless option to false before launching a browser:
const browser = await puppeteer.launch({headless: false}); // default is true
Now if we execute our API, we can see exactly what puppeteer is trying to do! Don't forget to switch headless mode back on once you're done troubleshooting, because running the full browser increases execution time.
For advanced troubleshooting, you can refer to the troubleshooting docs.
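While debugging, it can be handy to toggle headless mode without editing the code each time. The following is a minimal sketch, assuming a hypothetical DEBUG_BROWSER environment variable (the name is made up for this example and is not part of puppeteer). headless and slowMo are regular puppeteer.launch() options; slowMo delays each puppeteer operation by the given number of milliseconds so the steps are easier to follow.

```javascript
// Builds the options object passed to puppeteer.launch().
// DEBUG_BROWSER is a hypothetical variable name chosen for this example.
function buildLaunchOptions(env = process.env) {
  const debug = env.DEBUG_BROWSER === 'true';
  return {
    // show the browser window only while debugging
    headless: !debug,
    // slow every puppeteer operation down by 250 ms in debug mode
    slowMo: debug ? 250 : 0,
  };
}

// Usage (requires puppeteer to be installed):
// const browser = await puppeteer.launch(buildLaunchOptions());
```

With this in place, running DEBUG_BROWSER=true node searchGoogle.js would open the visible browser, while a plain node searchGoogle.js stays headless.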
Improving Performance
To get started with improving our API's performance, we first need to measure the execution time. This gives us a baseline to compare against after we apply the optimizations. Since our puppeteer code lives in the file searchGoogle.js, we'll modify it a bit and execute that file separately.
We can use performance.now() to measure the performance by doing:
const averageTime = async () => {
//snapshot in time
const t0 = performance.now();
//wait for our code to finish
await searchGoogle('cats');
//snapshot in time
const t1 = performance.now();
//console logs the difference in the time snapshots
console.log("Call to searchGoogle took " + (t1 - t0) + " milliseconds.");
}
performance.now() comes from perf_hooks, a module that is built into Node.js, so there is nothing to install. We can import it with:
const {performance} = require('perf_hooks');
We can create an averageTime function that runs searchGoogle 20 times and calculates the average execution time. This takes a long time to execute, but it gives us a reasonable average (you can increase the number of runs for a more stable one). Because of the total time required, I don't recommend calculating the average routinely; I mention it for anyone curious about how to measure execution time. Please keep in mind that the result depends on your network connection and computer. Adding this to our searchGoogle.js file:
const puppeteer = require('puppeteer');
const {performance} = require('perf_hooks');
//minimised code
const searchGoogle = async (searchQuery) => {...};
//calculates average time by executing searchGoogle 20 times, one run after another
const averageTime = async () => {
const averageList = [];
for (let i = 0; i < 20; i++) {
const t0 = performance.now();
//wait for our function to execute
await searchGoogle('cats');
const t1 = performance.now();
//push the difference in performance time instance
averageList.push(t1 - t0);
}
//adds all the values in averageList and divides by length
const average = averageList.reduce((a, b) => a + b) / averageList.length;
console.log('Average Time: ' + average + 'ms');
};
//executing the average time function so we can run the file in node runtime.
averageTime();
module.exports = searchGoogle;
To execute the file we can run the command:
node searchGoogle.js
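If you'd like to reuse the measurement for other functions, the loop above can be pulled into a small helper. This is only a sketch: measure is a name made up for this example, and it also reports the fastest and slowest run, since a single slow network request can skew the average.

```javascript
const { performance } = require('perf_hooks');

// Runs an async task `runs` times, one after another,
// and returns timing statistics in milliseconds.
async function measure(task, runs = 5) {
  const durations = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await task();
    durations.push(performance.now() - start);
  }
  const total = durations.reduce((sum, d) => sum + d, 0);
  return {
    runs,
    average: total / runs,
    min: Math.min(...durations),
    max: Math.max(...durations),
  };
}

// Usage inside searchGoogle.js:
// measure(() => searchGoogle('cats'), 20).then(console.log);
```

Because server.js imports searchGoogle.js, it may be worth wrapping the benchmark call in if (require.main === module) { ... } so that it only runs when the file is executed directly with node and not every time the server starts.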
Now we can go ahead and start optimizing our API.
Getting to know your Webpage
This is one of the most important steps to optimizing your API's performance. Sometimes playing around with a webpage/website reveals different and faster ways to get the necessary information.
In our example, we were manually typing the search query into the Google search bar and waiting for the results to load. We did this to see how typing behaves with puppeteer, but there is a faster way to display the results for our search query: URL Params with Google Search. We simply put our search query after the q= in the URL https://www.google.com/search?:
https://www.google.com/search?q=cats
This will display all the results for the search query 'cats'. To add this, we need to first remove the code that navigates to www.google.com
and enters the search query into the search bar:
//finds input element with name attribute 'q' and types searchQuery
await page.type('input[name="q"]', searchQuery);
//finds the input with name 'btnK' and then calls its .click() DOM method
await page.$eval('input[name=btnK]', button => button.click());
Removing this and adding the Google Search with URL Params to our searchGoogle.js file:
const puppeteer = require('puppeteer');
const {performance} = require('perf_hooks');
const searchGoogle = async (searchQuery) => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
//use google search URL params to directly access the search results for our search query
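//encodeURIComponent escapes characters such as & or spaces so they don't break the URL
await page.goto('https://google.com/search?q=' + encodeURIComponent(searchQuery));
//wait for the div with id 'search' to load
await page.waitForSelector('div[id=search]');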
//minimised - Find all div elements with ... the information we need
const searchResults = await page.$$eval('div[class=bkWMgd]', results => {...});
await browser.close();
return searchResults;
};
//minimised - Calculates average time by executing searchGoogle 20 times, one run after another
const averageTime = async () => {...};
module.exports = searchGoogle;
Sometimes the website you are trying to scrape offers better ways to get at the data, and you can use them to optimize your web scraper. In our case, Google Search can be used through URL Params, so we don't need to manually enter queries into the Google Search bar and press Enter, which saves us some time. This is why it's very important to get to know the webpage you're trying to scrape.
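One thing to watch out for when building the URL by hand is that search queries can contain characters such as & or #, which have a special meaning inside a URL. A minimal sketch of one way to handle this uses the URL class that ships with Node.js; buildSearchUrl is a helper name made up for this example.

```javascript
// Builds a Google Search URL for the given query using the built-in URL API.
function buildSearchUrl(searchQuery) {
  const url = new URL('https://www.google.com/search');
  // searchParams.set() encodes special characters in the value for us
  url.searchParams.set('q', searchQuery);
  return url.href;
}

// Usage inside searchGoogle:
// await page.goto(buildSearchUrl(searchQuery));
```

For example, the query 'cats & dogs' ends up as q=cats+%26+dogs, so the & is treated as part of the query and not as the start of a second URL parameter.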
Blocking Images and CSS
Many webpages make heavy use of images, which are known to increase page load time because of their size. Since we don't really care about the images or the CSS of the webpage, we can prevent the page from requesting image and stylesheet files. This way we can focus on the HTML (the part we care about). The difference in load time will depend on the webpage you're trying to scrape. This example was taken from the official docs.
To proceed to block images we need to add a Request Interceptor.
This provides the capability to modify network requests that are made by a page.
This means that we can prevent the webpage from making any requests to certain resources. In our case, we can use it to prevent the webpage from making requests to images and stylesheets. Setting this up is very simple: we need to turn the Request Interceptor on and abort requests made to images:
//turns request interceptor on
await page.setRequestInterception(true);
//if the page makes a request to a resource type of image then abort that request
page.on('request', request => {
if (request.resourceType() === 'image')
request.abort();
else
request.continue();
});
Similarly, we can also do the same thing if the resource type is a stylesheet:
//turns request interceptor on
await page.setRequestInterception(true);
//if the page makes a request to a resource type of image or stylesheet then abort that request
page.on('request', request => {
if (request.resourceType() === 'image' || request.resourceType() === 'stylesheet')
request.abort();
else
request.continue();
});
Adding this to our searchGoogle.js
:
const searchGoogle = async (searchQuery) => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
//turns request interceptor on
await page.setRequestInterception(true);
//if the page makes a request to a resource type of image or stylesheet then abort that request
page.on('request', request => {
if (request.resourceType() === 'image' || request.resourceType() === 'stylesheet')
request.abort();
else
request.continue();
});
//use google search URL params to directly access the search results for our search query
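//encodeURIComponent escapes characters such as & or spaces so they don't break the URL
await page.goto('https://google.com/search?q=' + encodeURIComponent(searchQuery));
//wait for the div with id 'search' to load
await page.waitForSelector('div[id=search]');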
//minimised - Find all div elements with ... the information we need
const searchResults = await page.$$eval('div[class=bkWMgd]', results => {...});
await browser.close();
return searchResults;
};
This way of blocking supports other types of resources:
document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.
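If you want to block more than two resource types, chaining === checks gets long quickly. Here is a small sketch that keeps the blocked types in a Set. The names BLOCKED_TYPES, handleRequest and blockResources are made up for this example, and which types are safe to block is an assumption you should test against the page you're scraping (blocking script, for instance, will break pages that render their content with JavaScript).

```javascript
// Resource types to skip; adjust this set to the page being scraped.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

// Aborts requests whose type is blocked and lets everything else through.
function handleRequest(request, blocked = BLOCKED_TYPES) {
  if (blocked.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Turns request interception on for a puppeteer page and installs the handler.
async function blockResources(page, blocked = BLOCKED_TYPES) {
  await page.setRequestInterception(true);
  page.on('request', (request) => handleRequest(request, blocked));
}

// Usage inside searchGoogle, right after browser.newPage():
// await blockResources(page);
```

Keeping the decision in handleRequest also makes it easy to try the logic with fake request objects before running it against a real page.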
Using getElementsByTagName Wherever Possible
This might not necessarily lower the execution time, but it might help. The document.getElementsByTagName method is described as:
The getElementsByTagName method of the Document interface returns an HTMLCollection of elements with the given tag name.
This means that if we want all the <a>
tags on the page then we do:
nodes = document.getElementsByTagName('a');
The alternative to doing this is document.querySelectorAll, which is more widely used:
nodes = document.querySelectorAll('a');
Based on tests, document.getElementsByTagName() seems to execute a little faster than document.querySelectorAll() when the aim is to select all elements with a given tag on a page. This might not come as a surprise, but I thought I should mention it since it's not very commonly used. In our case, this is not really applicable since we don't need to fetch a particular HTML tag.
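One practical difference is worth knowing if you do switch: getElementsByTagName returns a live HTMLCollection, which has no forEach or map, while querySelectorAll returns a NodeList that does have forEach. A short sketch of how the collection can be turned into a plain array of link URLs (collectLinkHrefs is a name made up for this example):

```javascript
// Returns the href of every <a> element in the given document.
function collectLinkHrefs(doc) {
  // Array.from converts the HTMLCollection and maps each element in one step
  return Array.from(doc.getElementsByTagName('a'), (anchor) => anchor.href);
}

// Usage with puppeteer (the callback runs inside the page, so it is written inline):
// const links = await page.evaluate(() =>
//   Array.from(document.getElementsByTagName('a'), (anchor) => anchor.href));
```

Only plain values such as strings survive the trip back from page.evaluate, which is why the sketch returns the href strings and not the elements themselves.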
Useful Puppeteer Plugins (Adblock & Stealth)
With the help of puppeteer-extra we can teach puppeteer new tricks through plugins. We'll only be going through puppeteer-extra-plugin-adblocker and puppeteer-extra-plugin-stealth. If you want to check out all available plugins, you can do so here.
We need to first install puppeteer-extra, puppeteer-extra-plugin-adblocker & puppeteer-extra-plugin-stealth
:
npm install puppeteer-extra puppeteer-extra-plugin-adblocker puppeteer-extra-plugin-stealth
Please keep in mind that these plugins might not necessarily help the execution time.
Stealth Plugin
We will be using the Stealth Plugin to create a consistent environment and make the results more similar to what we see when we browse the webpage ourselves. Webpages are able to detect whether the browser visiting them is headless, and they might choose to serve different content or no content at all. This plugin helps us keep the scraping environment consistent. According to the docs:
Applies various evasion techniques to make detection of headless puppeteer harder.
The plugins are very easy to use. We first need to replace our puppeteer client with the puppeteer-extra client, and then we do the following to add the Stealth Plugin:
const puppeteer = require('puppeteer-extra')
// Add stealth plugin and use defaults (all tricks to hide puppeteer usage)
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
However, before we execute it, we need to make sure that we explicitly provide the {headless: true} config to our puppeteer client during launch, otherwise puppeteer-extra will throw an error:
const searchGoogle = async (searchQuery) => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
...
await browser.close();
return searchResults;
};
Keep in mind that this plugin isn't designed to reduce page load time, so we likely won't see any difference in the execution time.
Adblock Plugin
We will be using the Adblock-Plugin to block any ads or trackers on our page since ads/trackers can play a role in our page load-time. According to the docs:
Ads and trackers are on most pages and often cost a lot of bandwidth and time to load pages. Blocking ads and trackers allows pages to load much faster because fewer requests are made and less JavaScript needs to run.
This automatically blocks all ads when using puppeteer. However, at the moment there is a conflict between this plugin and our method of blocking requests to images and stylesheets: both rely on Request Interception, and puppeteer doesn't expect multiple entities to be interested in using Request Interception. For your solution, you therefore have to either block images, stylesheets and other resources yourself or use the Adblock plugin. For your use case, I would recommend testing both and seeing which one yields better results.
Adding this to our searchGoogle.js
:
const puppeteer = require('puppeteer-extra')
const {performance} = require('perf_hooks');
// Add stealth plugin and use defaults (all tricks to hide puppeteer usage)
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
// Add adblocker plugin, which will transparently block ads in all pages you
// create using puppeteer.
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker')
puppeteer.use(AdblockerPlugin({blockTrackers: true}))
//minimised searchGoogle with Image / Stylesheet blocking removed
const searchGoogle = async (searchQuery) => {...};
//minimised averageTime
const averageTime = async () => {...};
module.exports = searchGoogle;
This will block all ads and trackers that might be present on our page. There are other options available with the Adblock Plugin:
interface PluginOptions {
/** Whether or not to block trackers (in addition to ads). Default: false */
blockTrackers: boolean
/** Persist adblocker engine cache to disk for speedup. Default: true */
useCache: boolean
/** Optional custom directory for adblocker cache files. Default: undefined */
cacheDir?: string
}
Deploying Your Puppeteer API
Now that we know about different ways of lowering execution time and creating more consistent results, we can look into how we can deploy our puppeteer API to the cloud. For this post, we'll be deploying to Heroku but the process is very similar for other platforms. If you're interested in deploying to other cloud platforms such as AWS, Google App Engine etc, please refer to this troubleshooting guide.
Before we deploy to Heroku, we need to edit our server.js express file so that the server listens on a dynamic port and IP. This allows Heroku to use the port and IP it needs:
const ip = process.env.IP || '0.0.0.0';
const port = process.env.PORT || 8080;
app.listen(port, ip);
Adding this to our server.js
file:
const express = require('express');
const app = express();
const ip = process.env.IP || '0.0.0.0';
const port = process.env.PORT || 8080;
//Import puppeteer function
const searchGoogle = require('./searchGoogle');
//Catches requests made to /search (localhost:8080/search when running locally with the defaults)
app.get('/search', (request, response) => {
//Holds value of the query param 'searchquery'.
const searchQuery = request.query.searchquery;
//Do something when the searchQuery is not null.
if (searchQuery != null) {
searchGoogle(searchQuery)
.then(results => {
//Returns a 200 Status OK with Results JSON back to the client.
response.status(200);
response.json(results);
});
} else {
response.end();
}
});
//Catches requests made to / (localhost:8080/ when running locally with the defaults)
app.get('/', (req, res) => res.send('Hello World!'));
//Initialises the express server on the configured port and IP
app.listen(port, ip);
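Once the API is public, it's worth thinking about what happens when the scraper fails, for example when a selector times out. In the route above, a rejected searchGoogle promise is never handled, so the client would be left waiting. Below is a minimal sketch of one way to handle this. createSearchHandler is a name made up for this example, and the 400 and 500 status codes are my own choice and not something the original route returns.

```javascript
// Creates an Express-style route handler around any async search function.
function createSearchHandler(search) {
  return async (request, response) => {
    const searchQuery = request.query.searchquery;
    // reject requests that do not carry a search query
    if (!searchQuery) {
      return response.status(400).json({ error: 'Missing "searchquery" parameter' });
    }
    try {
      const results = await search(searchQuery);
      return response.status(200).json(results);
    } catch (error) {
      // e.g. a selector timeout or a browser crash inside the scraper
      return response.status(500).json({ error: 'Search failed' });
    }
  };
}

// Usage in server.js:
// app.get('/search', createSearchHandler(searchGoogle));
```

Passing the search function in as an argument also means the handler can be tried out with a fake function, without launching a browser.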
Once we have that set up, we can start uploading our server to Heroku. Make sure you have a Heroku account before proceeding:
#skip git init if you already have a git repository initialized
git init
git add .
git commit -m "Added files"
heroku login
After logging in through the browser/terminal, we can create a new Heroku app.
heroku create
Please make sure that you do not already have 5 apps on your Heroku account, as free accounts are limited to 5 apps. After Heroku creates the app, all you need to do is push the code to Heroku:
git push heroku master
If this command gives you an error:
fatal: 'heroku' does not appear to be a git repository
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Then you need to go to your Heroku dashboard, copy the name of the app you just created, and run:
heroku git:remote -a your-app-name
We are almost done. We now need to take care of puppeteer's dependencies. The list of dependencies can be found here. No matter where you deploy, you need to make sure that these dependencies are installed on the machine hosting puppeteer. Luckily for us, Heroku has buildpacks. A buildpack is a collection of dependencies that tells Heroku what needs to be installed for the project.
Running Puppeteer on Heroku requires some additional dependencies that aren't included on the Linux box that Heroku spins up for you.
The URL of the buildpack: https://github.com/jontewks/puppeteer-heroku-buildpack
To add the buildpack to our project we can just do:
heroku buildpacks:add https://github.com/jontewks/puppeteer-heroku-buildpack.git
Before we push the changes, we need to add one last configuration to our searchGoogle.js: we need to use '--no-sandbox' mode when launching Puppeteer. This can be done by passing it as an argument to .launch():
const puppeteer = require('puppeteer-extra');
const {performance} = require('perf_hooks');
// Add stealth plugin and use defaults (all tricks to hide puppeteer usage)
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
// Add adblocker plugin, which will transparently block ads in all pages you
// create using puppeteer.
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
puppeteer.use(AdblockerPlugin({blockTrackers: true}));
const searchGoogle = async (searchQuery) => {
const browser = await puppeteer.launch({headless: true, args: ['--no-sandbox']});
...
...
await browser.close();
return searchResults;
};
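Running Chromium without its sandbox removes a layer of isolation, so you may prefer to disable it only on the server and keep it on your own machine. A small sketch of that idea follows. It assumes a hypothetical DISABLE_SANDBOX environment variable (the name is made up for this example), which could be set on the app with heroku config:set DISABLE_SANDBOX=true.

```javascript
// Builds the Chromium argument list passed to puppeteer.launch().
// DISABLE_SANDBOX is a hypothetical variable name chosen for this example.
function buildChromiumArgs(env = process.env) {
  const args = [];
  if (env.DISABLE_SANDBOX === 'true') {
    // only drop the sandbox where the environment explicitly asks for it
    args.push('--no-sandbox');
  }
  return args;
}

// Usage inside searchGoogle:
// const browser = await puppeteer.launch({ headless: true, args: buildChromiumArgs() });
```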
We can now commit the changes and push to heroku master:
git add .
git commit -m 'Disabled sandbox mode'
git push heroku master
After a while, our Puppeteer API should be deployed. We can click the URL from the terminal, or go to our dashboard and open the app from there, and then make requests to the URL provided by Heroku:
https://yourappname.herokuapp.com/search?searchquery=cats
We can change the search query by changing the value of the searchquery URL parameter. Our Search Engine API is ready!
Please make sure that you use only one of the Adblock Plugin and Request Interception for blocking images/resources, not both together; otherwise the Heroku server will run into errors.
The code for this project can be found on Github.
Conclusion
This is the end of this post and the end of the 3-Part Series Web Scraping with Puppeteer! I hope you enjoyed this series and found it to be useful!
If you're interested in other use cases, check out the Net-Income Calculator, which uses a Node/Express Puppeteer API to scrape information about state taxes and average rent in cities from websites. You can check out its Github Repo.
If you enjoyed reading this and would like to provide feedback, you can do so anonymously here. Any feedback regarding anything is appreciated!
Top comments (5)
Hii, I am making an instagram scraping tool.
Instagram divs weren't loading in headless:true mode, then I changed to puppeteer-extra and added stealth plugin. Everything worked fine on localhost, thanks to you.
But, unfortunately when deployed to heroku, the divs are not loading again, even page.waitForSelector shows timeout error.
PS-: 1) I've added the args: ['--no-sandbox']
2) I've also added github.com/jontewks/puppeteer-hero... buildpack in my heroku-app-settings.
Link to my project-: github.com/apanjwani0/Scrape-Insta...
Thanks in advance !
did you find any solution for that?
No, I thought maybe dockerizing my project would solve the issue (that way we can also run headless:false), but never continued with the project.
Do let me know if it works for you, or you find any other solution.
await page.goto(BASE_URL, { waitUntil: "networkidle0" })
waitUntil: "networkidle0" is necessary for this issue; also set headless to "new"
Great article!
Have you faced a problem with Heroku IP being blocked by the website scraped? If yes, how did you bypass it? Example: stackoverflow.com/questions/143289...