Data is the new gold, according to some. To an extent, maybe they are right: getting data, especially at scale, can be expensive. In this article, I will show you, step by step, how to build a multi-page web scraper using the Rust programming language.
Imagine you want to get the price of an iPhone X from an eCommerce website with multiple pages. You could do that manually, but it would take a lot of time. Writing code for the purpose makes it easy: in seconds you can scrape ten pages or more.
Create a new project using cargo new multipage. We are going to need the following dependencies; add them to your Cargo.toml under [dependencies]:
reqwest = { version = "0.11.9", features = ["blocking"] }
select = "0.5.0"
anyhow = "1.0.56"
Reqwest is an HTTP client; it will be useful for making HTTP calls to the web pages we intend to scrape data from.
Select is the heart of this program. It's a cool crate with functions to parse and query HTML pages.
Anyhow helps with error handling.
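As a minimal sketch of how that works (the file name here is just an example), anyhow's Context trait attaches a human-readable message to any error, and the ? operator propagates it upward:

use anyhow::{Context, Result};

fn read_settings() -> Result<String> {
    // If the read fails, the returned error carries the message below.
    std::fs::read_to_string("settings.txt").context("opening settings.txt")
}

We will use the same .context()/.with_context() pattern on every fallible call in the scraper.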
Let's dive in:
First, we are going to inspect the web page we want to scrape. If you are using Chrome, right-click and pick the Inspect option, then dig through the HTML until you reach the elements you intend to scrape. I used this link as practice:
"https://www.jumia.com.ng/catalog/?q=iphone&viewType=grid&page=1#catalog-listing"
The site is an eCommerce website. After inspecting the page, I got the following HTML for one item in the catalog listing:
<a class="core" href="/castillo-castillo-de-liria-red-75cl-x1-81158848.html" data-id="CA254FF1EJDRYNAFAMZ" data-name="Castillo De Liria RED 75cl x1" data-price="3.85" data-brand="Castillo" data-category="Grocery/Beer, Wine & Spirits/Wine/Red Wine" data-dimension23="" data-dimension26="18"
data-dimension27="4.6" data-dimension28="1" data-dimension37="0" data-dimension43="CP_MT38|CP_MT72|CP_MT73|CP_MT76|CP_MT77|CP_MT78|CP_MT80|CP_MT81|CP_MT84|CP_MT92|Camp_45|Camp_70|Camp_72|FDY2020|FDYJE|Merch_148|Merch_150" data-dimension44="0" data-list=""
data-position="1" data-track-onclick="eecProduct" data-track-onview="eecProduct"
data-track-onclick-bound="true"><div class="img-c"><img data-src="https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/84/885118/1.jpg?0770" src="https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/84/885118/1.jpg?0770"
class="img" width="208" height="208" alt="">
<img data-src="https://ng.jumia.is/badges/fdyje/10/138x18.png?4886" src="https://ng.jumia.is/badges/fdyje/10/138x18.png?4886" class="_ni camp" alt="FDYJE"></div>
<div class="info"><h3 class="name">Castillo Castillo De Liria RED 75cl x1</h3>
<div class="prc">β¦ 1,725</div><div class="s-prc-w"><div class="old">β¦ 1,898</div>
<div class="tag _dsct _sm">9%</div></div>
<div class="rev"><div class="stars _s">4.6 out of 5<div class="in" style="width:91.99999999999999%">
</div></div>(18)</div><svg aria-label="Express Shipping" viewBox="0 0 114 12" class="ic xprss" width="94" height="10">
<use xlink:href="https://www.jumia.com.ng/assets_he/images/i-shop-jumia.c8de1c55.svg#express"></use>
</svg><p class="shipp">Jumia Express items in your order will be delivered for free (Lagos & Abuja only, excluding large items)</p></div></a>
The following class attributes are the hooks we will use (a short sketch follows this list):
class="core" as the product card node; its href attribute holds the product link
class="info" as the container holding the name and price
class="name" to get the name of the product
class="prc" to get the price of the product
Calling the URL with Reqwest
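For the snippets below to compile, src/main.rs needs these imports, plus a main function that returns anyhow::Result so the ? operator can propagate errors; the loop body that follows lives inside it:

use std::fs::OpenOptions;
use std::io::{LineWriter, Write};

use anyhow::{Context, Result};
use select::document::Document;
use select::predicate::Class;

fn main() -> Result<()> {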
    let mut page = 1;
    while page <= 10 {
        // Build the URL for the current results page.
        let url = format!(
            "https://www.jumia.com.ng/catalog/?q=iphone&viewType=grid&page={}#catalog-listing",
            page
        );
        let res = reqwest::blocking::get(url.as_str())
            .with_context(|| format!("opening url {}", url))?;
        let document = Document::from_read(res).context("parsing response")?;
The first thing to note is that this is a multi-page scraper: we keep looping through pages until we are satisfied. In the code above, I go through the iPhone listing ten times, pages 1 to 10, and collect the name and price of every available iPhone. The format! macro builds each page's URL, which is an easy way to concatenate: it inserts the current value of page into the query string on every pass through the loop.
I used reqwest to fetch the URL and select's Document::from_read to parse the response body into a document the program can query.
        // Append scraped rows to box.txt, creating the file on the first run.
        let writer = OpenOptions::new()
            .append(true)
            .create(true)
            .open("box.txt")
            .context("opening box.txt")?;
        let mut writer = LineWriter::new(writer);
        // Every product card is an <a class="core"> element; its href is the
        // product link, and the name and price live in child nodes. Iterating
        // over the cards keeps each product paired with its own link.
        for node in document.find(Class("core")) {
            let link = node.attr("href").context("missing product link")?;
            let name = node
                .find(Class("name"))
                .next()
                .context("missing product name")?;
            let price = node
                .find(Class("prc"))
                .next()
                .context("missing product price")?;
            writeln!(writer, "{}---{}---{}", name.text(), price.text(), link)
                .context("writing to the output file")?;
        }
        page += 1;
    }
    Ok(())
}
The first part of the code above specifies where the scraped data will be saved: it is appended to box.txt.
The page is pulled down into the document variable, and every node can easily be accessed and its data scraped out.
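Each product becomes one line in box.txt, in the name---price---link format produced by the writeln! call. An illustrative line (the values are made up) would look like:

Apple iPhone X 64GB---₦ 250,000---/apple-iphone-x-64gb-12345678.html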
Thanks for reading! You can also check out nutrisoft, my nutrition web application.
The code is available on GitHub.