What is Web scraping?
Web scraping is the process of extracting content and data from a website. Unlike screen scraping, which only copies the pixels displayed on screen, web scraping extracts the underlying HTML code and, with it, data stored in a database.
Note: Not every site allows scraping. Before scraping a site, check whether you are allowed to scrape it, for example by reviewing the site's privacy policy and terms and conditions.
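Many sites also publish a robots.txt file that describes which paths automated clients may access. As a minimal, illustrative sketch (it assumes axios, which we install later in this tutorial, and relies on the usual convention that robots.txt is served at the site root), you could fetch and read it like this:
const axios = require("axios");

(async () => {
  try {
    // robots.txt is conventionally served at the site root
    const res = await axios.get("https://stackoverflow.com/robots.txt");
    // The body is plain text listing Allow/Disallow rules for crawlers
    console.log(res.data);
  } catch (error) {
    console.log(error);
  }
})();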
Fetching the webpage
The site we will be scraping is Stack Overflow Jobs, a section of Stack Overflow where job vacancies are listed.
Getting started
Step 1: Setting up the working directory
Now that we have Node.js and npm installed, we can start with the project. Open up your preferred terminal and run these commands:
If you don't have Node.js and npm installed, you can check the official docs on how to do that: Node.js Docs.
Create a directory:
mkdir web-scraper
Move into the directory:
cd web-scraper
Now we have a directory for our web scraper, but we still need a package.json file, which tells npm information about our project. To create one, run this in the same terminal window:
npm init
This command tells npm to initialize a pre-made package.json in our project directory. Just hit Enter at all of the prompts; we can worry about those later.
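If you accepted the defaults, the generated package.json will look roughly like this (the exact values depend on your answers and your npm version):
{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}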
Step 2: Install necessary packages
For this project, we will only need two npm packages: axios and cheerio. An npm package is essentially a piece of code (“package”) published to the npm registry that we can download with a simple command, npm install.
npm install axios
npm install cheerio
Step 3: Write some code!
Create a file named app.js in the project directory and add the following code:
const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://stackoverflow.com/jobs";

(async () => {
  try {
    const res = await axios.get(url);
    const html = res.data;

    // Load the response data into a Cheerio instance
    const $ = cheerio.load(html);

    // This returns the site name
    const siteName = $(".-logo").text();
    console.log(siteName);
  } catch (error) {
    console.log(error);
  }
})();
Essentially, what the code above does is:
- Include the modules used in the project with the require function, which is built into Node.js.
- Make a GET HTTP request to the target web page with Axios. When a request is sent to the web page, it returns a response. This Axios response object is made up of various components, including data, which refers to the payload returned from the server. So, when the GET request succeeds, we read the response data, which is in HTML format.
- Load the response data into a Cheerio instance. This way, we create a Cheerio object that helps us parse the HTML from the target web page and find the DOM elements for the data we want, just like when using jQuery. To uphold the familiar jQuery convention, we name the Cheerio object $.
- Use Cheerio's selector syntax to search for the element containing the data we want, which is the site name (see the selector sketch after this list).
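Cheerio's selector syntax mirrors jQuery, so beyond .text() you can read attributes, iterate over matches, and chain finds. Here is a small standalone sketch; the HTML and class names are made up for illustration and are not taken from the Stack Overflow page:
const cheerio = require("cheerio");

const $ = cheerio.load(
  '<ul><li class="item"><a href="/a">First</a></li><li class="item"><a href="/b">Second</a></li></ul>'
);

// .text() concatenates the text of every matched element
console.log($(".item").text()); // FirstSecond

// .attr() reads an attribute from the first matched element
console.log($(".item a").attr("href")); // /a

// .each() iterates over every match, just like jQuery
$(".item").each((i, el) => {
  console.log(i, $(el).find("a").text());
});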
Now, run the app.js file with this command:
node app.js
You should see something like this:
static@Abdulfatais-MacBook web-scraper $ node app.js
Stack Overflow
Now let's proceed with writing a script to get the job vacancies.
The code below looks for the parent class of every job listing, loops through each listing, and then gets its properties, e.g. the title, link, and date.
You can still select more, such as the location and salary; just target the element name.
After that, it stores the values in an object and then logs the data to the console.
const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://stackoverflow.com/jobs";

(async () => {
  try {
    const res = await axios.get(url);
    const html = res.data;

    // Load the response data into a Cheerio instance
    const $ = cheerio.load(html);

    // Loop through every job listing (parent class .fl1)
    $(".fl1").each((i, el) => {
      // Strip the extra whitespace around the title
      const title = $(el).find(".fs-body3").text().replace(/\s\s+/g, "");
      const link = $(el).find(".s-link").attr("href");
      const date = $(el).find(".fc-orange-400").text();

      const data = {
        title,
        link: `https://stackoverflow.com/${link}`,
        date,
      };

      console.log(data);
    });
  } catch (error) {
    console.log(error);
  }
})();
If everything goes well, you should get a response like this in your console.
static@Abdulfatais-MacBook web-scraper $ node app.js
{
title: '\nFull-Stack Software Engineer ',
link: 'https://stackoverflow.com//jobs/471179/full-stack-software-engineer-unhedged',
date: '5d ago'
}
{
title: '\nSoftware Engineering ',
link: 'https://stackoverflow.com//jobs/473617/software-engineering-jpmorgan-chase-bank-na',
date: '5h ago'
}
{
title: '\nSenior Software Engineer (Backend) (m/w/d) ',
link: 'https://stackoverflow.com//jobs/471126/senior-software-engineer-backend-m-w-d-gp-9000-gmbh',
date: '7d ago'
}
{
title: '\nSenior Backend Engineer Who LoveTypescript ',
link: 'https://stackoverflow.com//jobs/470542/senior-backend-engineer-who-loves-typescript-well-health-inc',
date: '6d ago'
}
{
title: '\nJava Developer - Software Engineering ',
link: 'https://stackoverflow.com//jobs/473621/java-developer-software-engineering-jpmorgan-chase-bank-na',
date: '5h ago'
}
{
title: '\nSenior Software Engineer ',
link: 'https://stackoverflow.com//jobs/473494/senior-software-engineer-nori',
date: '7h ago'
}
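So far we only print each listing as we find it. If you would rather keep all the results together, for example to count them or pass them to another function, you could push each object into an array first. This is a sketch of that variation using the same selectors as above (the class names are simply what Stack Overflow used when this article was written, so they may change):
const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://stackoverflow.com/jobs";

(async () => {
  try {
    const res = await axios.get(url);
    const $ = cheerio.load(res.data);
    const jobs = [];

    $(".fl1").each((i, el) => {
      jobs.push({
        title: $(el).find(".fs-body3").text().replace(/\s\s+/g, ""),
        link: `https://stackoverflow.com${$(el).find(".s-link").attr("href")}`,
        date: $(el).find(".fc-orange-400").text(),
      });
    });

    // One array with every listing, ready for further processing
    console.log(`Found ${jobs.length} listings`);
    console.log(jobs);
  } catch (error) {
    console.log(error);
  }
})();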
Hopefully, this article was able to take you through the steps of scraping your first website.
In articles to come, if I have the opportunity, I will write about more Node.js topics. Kindly drop your requests in the comment section, and give this article a like.
You can also check out my previous article on Creating a Telegram Bot with Nodejs.
Conclusion
We saw the possibility of web scraping with Node.js and learned how to scrape a site with it. If you have any questions, don't hesitate to contact me on Twitter: @iamnotstatic