How I met your... Scraper?
Hello developer pal!, glad to see you here.
In this post, I'll share my experience after running into a topic I had not met before... web scraping!.
Show Me The Topics
The topics to be focused on are:
- Problem to solve: Booking a weekly service
- Project dependencies
- NodeJS folders structure
- Express, Routing and Services
- Puppeteer(Booking Service)
- Nodemailer(Email Service)
- Local use and remote Deploy
- Bonus: Dealing with Captcha
- Conclusion
Disclaimer: This post comes from a particular scenario I've been struggling with, I'm not preaching this is the best approach to follow for web scraping, nor the worst, any contribution is more than welcome in the threads below!
Note: There is available also a template project on GitHub in case it could be useful and save you some time.
Problem to solve: Booking a weekly service
A couple weeks ago, I got subscribed to a weekly delivery service, I am pretty happy with the service!, it is fast, efficient, always on time!, since day 1 the service has had no issues, not even delays, what's the only fallback I have found so far?, the booking process!
This could be a little picky from my end, I know, but see the steps I need to do every-single-day:
- Open a website(only works on Chrome, no other browser)
- Fill in my user/password
- Move to Members path
- Check my information and select the delivery address(displayed in a dropdown)
- Move to the Next Step
- Select the day of the week I want to book me the service(come on!, it is a week from today, as usual)
- Move to the Next Step
- Select the time of the day I want to book me the service on(it is the same time as every single day, dammit)
- Finish the process
- A "Thank you page is displayed"(without the result of the process I just did)
- Move to Members path(again) and look for my upcoming bookings table result
These steps needs to be done every-single-day, and it is a pain in the back, cause if for some reason I forget to do it, my preferred time could have been taken, and I need to look for a different time, then I need to be aware of the delivery time(it would be different a week from today than the rest of the days); am I clear why this is a pain?, I hope so...
After a few days of missing the booking, I decided to automate the process with the help of some tools, I was not sure about how to start, so I research and gladly met web scrapping(don't get me wrong, I had heard about it, but there is a slightly difference between hearing and researching with a purpose, at least form my end 🤷).
So, what's the web scraping?, there are plenty of definitions out there on the Internet, the one that is more accurate for this post purposes is:
Web scraping is the process of using bots to extract content and data from a website.
- Source: What is web scraping?
This is exactly what this post is about, create a sort of robot that will fill in information on my behalf in a site and later it will extract a result for me, and put it on my inbox.
Project dependencies
The tools used for accomplishing this enterprise are:
Main Dependencies
Dev Dependencies
package.json
"dependencies": {
"express": "^4.17.1",
"nodemailer": "^6.6.2",
"puppeteer": "^10.1.0"
},
"devDependencies": {
"eslint-config-prettier": "^8.3.0",
"eslint-plugin-prettier": "^3.4.0",
"nodemon": "^2.0.9",
"prettier": "^2.3.2"
}
Prettier
and Nodemon
come handy for having a nice experience, not mandatory though, fell free to use any other tool.
NodeJS folders structure
For this project, the structure is simple and set as follows:
scraper-template/
├── index.js
├── package.json
└── routes/
├── booking.js
└── screenshots/
├── home-page.png
└── services/
├── bookingHandler.js
├── emailSender.js
There is one route for express to serve, two services for booking and emailing the results and a folder for screenshots, which only steps in development environment.
Express, Routing and Services
The index.js
is a simple file with a 20 lines extension:
const express = require('express');
const app = express();
const port = process.env.PORT || 3000;
const booking = require('./routes/booking');
app.get('/', (req, res) => {
res.json({ message: 'ok' });
});
app.use('/booking', booking);
/* Error handler middleware */
app.use((err, req, res, next) => {
const statusCode = err.statusCode || 500;
console.error(err.message, err.stack);
res.status(statusCode).json({ message: err.message });
return;
});
app.listen(port, '0.0.0.0', () => {
console.log(`Scrapper app listening at http://localhost:${port}`);
});
The routes/booking.js
includes the expressjs
, services
and config
references, let's decompose it!:
express.js
The references to the packages used:
const express = require('express');
const router = express.Router();
...
...
services.js
The references to the defined services for handling the bookings and sending emails, a preview can be found below on Services Visualization
...
...
const emailSender = require('../services/emailSender');
const bookingHandler = require('../services/bookingHandler');
...
...
config.js
All the vales in here are process.env
vars, these includes keys for login(webSiteUser
, webSitePassword
), email impersonation(authUser
, appPassword
) and email receivers(emailFrom
, emailTo
):
...
...
const {
webSiteUser,
webSitePassword,
authUser,
appPassword,
emailFrom,
emailTo,
preferTime,
} = require('../config');
book-me endpoint
This route does the booking process for a user with a preferred time(if any):
router.get('/book-me', async function (req, res, next) {
try {
const bookMeResult = await bookingHandler.bookMe(
webSiteUser,
webSitePassword,
preferTime
);
res.send(`The result of the booking was::${bookMeResult}`);
} catch (err) {
console.error(`Error while booking me for next week`, err.message);
next(err);
}
});
...
...
book-me endpoint
This route gets the bookings the user has set for the upcoming week:
...
...
router.get('/my-bookings', async function (req, res, next) {
try {
const bookingResult = await bookingHandler.myBookings(
webSiteUser,
webSitePassword
);
emailSender.sendEmail(bookingResult, {
authUser,
appPassword,
emailFrom,
emailTo,
});
res.format({
html: () => res.send(bookingResult),
});
} catch (err) {
console.error(`Error while getting the booking for this week`, err.message);
next(err);
}
});
Services Visualization
Service emailSender
:
Service bookingHandler
:
Puppeteer(Booking Service)
Here is where the magic begins!, only one reference for rule the whole process:
const puppeteer = require('puppeteer');
After this import, puppeteer
is ready to roll!; there are plenty of examples on the internet, most of them apply all the concepts for web scraping in one single file, this is not the case.
This project applies some separations that, from my perspective, makes it easier to understand what's going on every step throughout the whole process, so let's dive into the sections:
-- Start the Browser --
The first interaction is starting the Browser. Puppeteer
works perfectly with Chronium and Nightly, for this project the reference used is the default one, with Chrome
(the web site to scrap only opens on Chrome
), but if Firefox
preferred, take a look at this thread on StackOverflow.
In the piece of code below, there is a var initialized for isProduction
, this var is ready for being used when deployed on a web platform(Heroku we'll talk about it later), and another for isDev
, I repeat, this is for explanation purposes, it is not required to have 2 when one of them can be denied and cause the same result.
When isProduction
the launch is done headless
by default, it means that the process is done in the background without any UI, also some args
are included for a better performance, refer to the list of Chromium
flags here.
When isDev
, the headless
is false, and args
also include one for opening te dev tools after loading the browser.
const isProduction = process.env.NODE_ENV === 'production' ? true : false;
const isDev = !isProduction;
const authenticationError = 'Failed the authentication process';
const bookingError = 'Failed the booking process';
async function startBrowser() {
let browser = null;
let context = null;
let page = null;
if (isProduction) {
browser = await puppeteer.launch({
args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'],
});
page = await browser.newPage();
} else {
browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
slowMo: 75,
args: [
'--auto-open-devtools-for-tabs',
'--disable-web-security',
'--disable-features=IsolateOrigins,site-per-process',
'--flag-switches-begin --disable-site-isolation-trials --flag-switches-end',
],
});
context = await browser.createIncognitoBrowserContext();
page = await context.newPage();
}
return { browser, page };
}
As seen above, the site is loaaded in Incognito, but can be opened in a regular tab.
-- Do Login --
For doing the login, some puppeteer
features come in play:
-
goto
: allows the navigation to a web site -
type
: types a value in an input field -
click
: allows clicking on buttons, table cells, submits -
waitForSelector
: recommended for allowing the page to recognize a particular selector before moving along -
screenshot
: takes a screenshot on demand, and store it in the app(it is possible to redirect the screenshots to remote services, in dev just place them in a root folder)
async function doLogIn(page, webSiteUser, webSitePassword) {
await page.goto(constants.baseUrl + constants.loginEndpoint, {
timeout: constants.timeOut,
waitUntil: 'load',
});
isDev && console.log('Navigation to Landing Page Succeeded!!!');
await page.type('#loginform-email', webSiteUser);
await page.type('#loginform-password', webSitePassword);
await page.click('button[type="submit"]');
isDev && console.log('Login submitted');
await page.waitForSelector('#sidebar');
isDev && (await page.screenshot({ path: 'screenshots/home-page.png' }));
return await findLink(page, constants.scheduleEndpoint);
}
Something to remark in the code above is that when dealing on development environment, the screenshots are taken, in production those are skipped(on purpose for the sake of the example)
-- Find a link --
This can change from page to page, but for this project, there is a link that was tracked down to the point that only loggedin
members are able to see, for finding this or any other, a function is available, which receives as parameters the page
instance and the endpoint
to look for as an href:
async function findLink(page, endpoint) {
const pageLinks = await page.evaluate(() =>
Array.from(document.querySelectorAll('a[href]'), a => a.getAttribute('href')),
);
return pageLinks.includes(endpoint) || null;
}
-- Close the Browser --
Just pass the browser
instance as parameter and close
it.
async function closeBrowser(browser) {
return browser.close();
}
Note: not going to elaborate on the details of the booking process, just take into account:
- It is a wizard
- The wizard has 3 steps, the final one is a submit
- The name of the elements in the query selectors are tied to the site I'm scraping on, feel free to change them as much as you need
- The idea is to share how to find elements, how to use query selectors, how to get the outerHtml on elements, wait for them to be available, all of this using
Puppeteer
Nodemailer(Email Service)
Email service is contained in 30 lines of code, it is a define structure required by the import of nodemailer
Note: When using Gmail
, it is mandatory to enable less secure apps, this will create a new password for just the particular application you are trying to link to, can read more here in nodemailer or in Google Support
const nodemailer = require('nodemailer');
async function sendEmail(weekBookings, { authUser, appPassword, emailFrom, emailTo }) {
const mail = nodemailer.createTransport({
service: 'gmail',
auth: {
user: authUser,
pass: appPassword,
},
});
const mailOptions = {
from: emailFrom,
to: emailTo,
subject: 'Your bookings for this week',
html: weekBookings,
};
mail.sendMail(mailOptions, (error, info) => {
if (error) {
console.log(error);
} else {
console.log('Email sent: ' + info.response);
}
});
}
module.exports = {
sendEmail,
};
There is not too much complication in here, pass the authUser
, appPassword
, email from/to
and the html
to be send as email.
Local use and remote Deploy
How to be sure that everything is working as expected?, well two options:
-- Locally --
For running this locally Postman is the tool(don't judge me too much, I am used to it... used to Postman I meant, anyway)
WEB_SITE_USER=YOUR_USER@YOUR_EMAIL_DOMAIN.com WEB_SITE_PASSWORD=YOUR_PASSWORD
GMAIL_AUTH_USER=YOUR_USER@gmail.com GMAIL_APP_PASSWORD=YOUR_APP_PASSWORD
GMAIL_EMAIL_FROM=YOUR_USER@gmail.com GMAIL_EMAIL_TO=YOUR_USER@gmail.com
BOOKING_PREFER_TIME=06:55:00 npm run dev
This command will start the local server using nodemon
setting all the expected process.env
variables in port 3000 by default, so just use Postman
for hitting http://localhost:3000/booking/book-me
or http://localhost:3000/booking/my-bookings
and a result will be retrieved.
-- Remote --
For deploying remotely the platform used id Heroku, not getting in details but found this helpful post in case you decide to follow that path(read carefully the Heroku's
sections, and highly suggested to use Kaffeine).
All the process.env
passed to the terminal when running locally are set as Heroku's
environment variables, then the deploy is transparent.
Bonus: Dealing with Captcha
Sometimes the sites you try to scrap are sort of "protected" by Captcha
, I say "sort of" cause there are ways to skip it, even some companies pay to regular users to help them to recognize captchas
, you can read more over here.
The page scraped for this post behaves "interesting", sometimes the reCaptcha
is ignored, some others appear right after submitting the login, so randomly fails; I opened an issue in puppeteer-extra
, an npm lib extension for puppeteer
which works hand-to-hand with 2captcha, I'm watching the issue closely, in case of getting a fix for the random issue I'll edit the post.
In case you were wondering, the hit of the endpoints after deployed to Heroku
are done by a Cron-Job, it is fast and easy, and I received a custom email when the process randomly fails(the idea is to make it work permanently!).
Conclusion
As shown above, the web scraping is a great technique for making life easier, some hiccups could appear along the way(Captcha, Deploy servers restrictions or conditions) though some how it is possible to make it through!; maybe you could have a better way to do it, let's discuss in a thread below!
Thanks for reading!
Top comments (0)