Note
This probably won't work on Windows. It hasn't been tested there, and because the code builds paths with forward slashes, it may break on Windows.
tldr: code
Prereqs: having Rust installed
This is a tutorial for building a basic archiver with Reqwest. If you want to handle JavaScript-heavy sites, I suggest Fantoccini (jonhoo is a beast), and I've heard of people doing good things with Selenium as well. Keep in mind that a lot of sites don't like bots and take anti-bot measures, so if you go that route you'd want to change things like user agents and other client options. In this code we'll only be changing the user agent within Reqwest.
Note
The archiver uses the URL's path to build directories. So
rustlang.org/rust/language
will translate to
rustlang.org/rust/language/<date>/index.html, and that directory will also contain an images folder, a css folder, and a js folder.
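For example, with a base path of /home/you/archives (just an illustration), archiving that page would produce a layout roughly like this, where <date> is the timestamp the record was created at:
/home/you/archives/rustlang.org/rust/language/<date>/
  index.html
  images/
  css/
  js/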
So let's start setting up the project. Run
cargo new <Project Name>
replacing <Project Name> with whatever you want to call it.
Now in the Cargo.toml file, add the following dependencies:
[dependencies]
url = "2.5.0"
reqwest = "0.11.23"
bytes = "1.5"
dirs = "5.0.1"
regex = "1.10.2"
scraper = "0.18.1"
lazy_static = "1.4.0"
chrono = "0.4.31"
rand = "0.8.5"
image = "0.24.6"
tokio = { version = "1.35", features = ["full"] }
Let's add the code that parses out the links a webpage needs: things like images, CSS, and JS.
Create a file under ./src called html.rs
Now let's add the required imports:
use chrono::Utc;
use lazy_static::lazy_static;
use regex::Regex;
use scraper::{Html, Selector};
use std::collections::HashSet;
use url::{ParseError, Url};
Now that those are out of the way, it's time to add the HtmlRecord struct.
#[derive(Debug)]
pub struct HtmlRecord {
pub origin: String,
pub date_time: String,
pub body: String,
pub html: Html,
}
Now let's implement new:
impl HtmlRecord {
pub fn new(origin: String, body: String) -> HtmlRecord {
HtmlRecord {
origin,
date_time: Utc::now().format("%d-%m-%Y-%H:%M:%S").to_string(),
html: Html::parse_document(&body),
body,
}
}
}
new takes the time of creation and stores it in the date_time field. The html field is the parsed HTML from the body, which we'll use to pull out links. The body field is the raw HTML we'll save to disk.
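To make that concrete, here's a tiny made-up example of constructing a record:
let record = HtmlRecord::new(
    "https://example.com/".to_string(),
    "<html><body>hello</body></html>".to_string(),
);
//record.body holds the raw html, record.html the parsed document,
//and record.date_time a timestamp like "28-01-2024-14:30:05"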
Next, under new add the following method:
//the tuple's first element is the link exactly as it appears in the html
//the second element is the fully resolved, absolute url
pub fn get_image_links(&self) -> Result<Option<HashSet<(String, String)>>, String> {
//checks for base64 images
lazy_static! {
static ref RE3: Regex = Regex::new(r";base64,").unwrap();
}
let mut link_hashset: HashSet<(String, String)> = HashSet::new();
//select image tags
let selector = Selector::parse("img").unwrap();
//loop through img tags
for element in self.html.select(&selector) {
//grab the source attribute of the tag
match element.value().attr("src") {
//if we have a link
Some(link) => {
//see if a relative link
if Url::parse(link) == Err(ParseError::RelativeUrlWithoutBase) {
//get base url
let plink = Url::parse(&self.origin)
.expect("get image links, origin could not be parsed")
.join(link)
.expect("image links, could not join")
.to_string();
//push to return vector
link_hashset.insert((link.to_string(), plink.to_string()));
//check if base64 and continue if so
} else if RE3.is_match(link) {
continue;
//if fully formed link, push to return vector
} else if let Ok(parsed_link) = Url::parse(link) {
link_hashset.insert((link.to_string(), parsed_link.to_string()));
}
}
//No src attribute, continue
None => continue,
};
}
//If hashset is empty return an Ok of None
if link_hashset.is_empty() {
Ok(None)
//return some image links
} else {
Ok(Some(link_hashset))
}
}
I commented the code; if you have any questions, feel free to ask.
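To make the tuple shape concrete, consuming the result looks roughly like this (record being an HtmlRecord you've already built):
if let Ok(Some(links)) = record.get_image_links() {
    for (original_src, absolute_url) in links {
        //original_src is the value exactly as it appears in the src attribute,
        //absolute_url is the fully qualified url we can actually fetch
        println!("{} -> {}", original_src, absolute_url);
    }
}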
Next we want to add get_css_links under get_image_links:
pub fn get_css_links(&self) -> Result<Option<HashSet<(String, String)>>, String> {
let mut link_hashset: HashSet<(String, String)> = HashSet::new();
//get links
let selector = Selector::parse("link").unwrap();
//loop through elements
for element in self.html.select(&selector) {
//check if stylesheets
if element.value().attr("rel") == Some("stylesheet") {
//get the href
match element.value().attr("href") {
Some(link) => {
//take care of relative links here
if Url::parse(link) == Err(ParseError::RelativeUrlWithoutBase) {
//create url
let plink = Url::parse(&self.origin)
.expect("get css links, origin could not be parsed")
.join(link)
.expect("css links, could not join")
.to_string();
//add to hashset
link_hashset.insert((link.to_string(), plink.to_string()));
} else if let Ok(parsed_link) = Url::parse(link) {
link_hashset.insert((link.to_string(), parsed_link.to_string()));
}
}
None => continue,
};
}
}
if link_hashset.is_empty() {
Ok(None)
} else {
Ok(Some(link_hashset))
}
}
Now let's add get_js_links under get_css_links:
//get js links
pub fn get_js_links(&self) -> Result<Option<HashSet<(String, String)>>, String> {
//create hashset
let mut link_hashset: HashSet<(String, String)> = HashSet::new();
//get the selector which is basically used for getting the script tags
let selector = Selector::parse("script").unwrap();
for element in self.html.select(&selector) {
//get src attribute of the script tag
match element.value().attr("src") {
Some(link) => {
if Url::parse(link) == Err(ParseError::RelativeUrlWithoutBase) {
//parse relative url
let plink = Url::parse(&self.origin)
.expect("get js links, origin could not be parsed ")
.join(link)
.expect("js links, could not join")
.to_string();
link_hashset.insert((link.to_string(), plink.to_string()));
} else if let Ok(parsed_link) = Url::parse(link) {
//url is already absolute, add it to the hashset
link_hashset.insert((link.to_string(), parsed_link.to_string()));
}
}
None => continue,
};
}
//if hashset is empty return a result of None
if link_hashset.is_empty() {
Ok(None)
} else {
//return a result of some
Ok(Some(link_hashset))
}
}
And that's it for the html.rs file.
Now on to the client.rs file.
First set up the imports:
use crate::html::HtmlRecord;
use bytes::Bytes;
use reqwest::header::USER_AGENT;
use url::Url;
Now set up the AGENT const under the imports; we'll be using a Firefox-style user agent.
const AGENT: &str =
"Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion";
Now set up the Client struct
pub(crate) struct Client {
pub client: reqwest::Client,
}
Now let's set up the new function:
impl Client {
pub fn new() -> Self {
Self {
client: reqwest::Client::new(),
}
}
}
Below the impl block's closing brace, add the following function:
pub fn replace_encoded_chars(body: String) -> String {
body.replace("&lt;", "<")
.replace("&gt;", ">")
.replace("&quot;", "\"")
.replace("&amp;", "&")
.replace("&apos;", "\'")
}
This does what the name says: it replaces HTML-encoded characters so we end up with clean markup to work with.
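Assuming the entity decoding above, here's a quick made-up example of what it does:
let decoded = replace_encoded_chars("&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;".to_string());
assert_eq!(decoded, "<p>Tom & Jerry</p>");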
Now, back inside the Client impl block, under the new method, add the following method:
pub async fn fetch_html_record(&mut self, url_str: &str) -> Result<HtmlRecord, reqwest::Error> {
let url_parsed = Url::parse(url_str).expect("cannot parse");
let res = self
.client
.get(url_parsed.as_str())
.header(USER_AGENT, AGENT)
.send()
.await?;
let body = res.text().await.expect("unable to parse html text");
let body = replace_encoded_chars(body);
let record: HtmlRecord = HtmlRecord::new(url_parsed.to_string(), body);
Ok(record)
}
This fetches the HTML and creates an HtmlRecord from it.
Next add the fetch_image_bytes method:
pub async fn fetch_image_bytes(&mut self, url_str: &str) -> Result<Bytes, String> {
let url_parsed = Url::parse(url_str).expect("cannot parse");
let res = self
.client
.get(url_parsed.as_str())
.header(USER_AGENT, AGENT)
.send()
.await
.map_err(|e| format!("fetch image bytes failed for url {}:\n {}", url_parsed, e))?;
let status_value = res.status().as_u16();
if status_value == 200 {
let image_bytes = res.bytes().await.expect("unable to read image bytes");
Ok(image_bytes)
} else {
Err("status on image call not a 200 OKAY".to_string())
}
}
Lastly, add the fetch_string_resource method. This grabs the CSS and JS for the webpage we are archiving.
pub async fn fetch_string_resource(&mut self, url_str: &str) -> Result<String, String> {
let url_parsed = Url::parse(url_str).expect("cannot parse");
let res = self
.client
.get(url_parsed.as_str())
.header(USER_AGENT, AGENT)
.send()
.await
.map_err(|e| format!("fetch string resource failed for url {}: {}", url_parsed, e))?;
let status_value = res.status().as_u16();
if status_value == 200 {
let string_resource = res.text().await.expect("unable to parse html text");
Ok(string_resource)
} else {
Err("status on resource call not a 200 OK".to_string())
}
}
The client was pretty easy. Now for the bread and butter: the archiver.
Create a file under ./src called archiver.rs and add the archiver struct:
pub struct Archiver;
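The save_page method below also relies on a few imports plus two small helpers, get_file_name and random_name_generator, that the post doesn't show. Here's a minimal sketch of what they could look like, assuming get_file_name simply takes the last path segment of the URL and random_name_generator produces a random alphanumeric name with the rand crate; adapt them as you like. Add these to archiver.rs as well:
use crate::client::Client;
use crate::html::HtmlRecord;
use rand::distributions::Alphanumeric;
use rand::Rng;
use std::fs;
use std::fs::File;
use std::io::Write;
use url::Url;

//assumed helper: take the last path segment of a url as the file name
fn get_file_name(url_str: &str) -> Option<String> {
    let url = Url::parse(url_str).ok()?;
    let name = url.path_segments()?.last()?.to_string();
    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}

//assumed helper: random alphanumeric name for resources without a usable file name
fn random_name_generator() -> String {
    rand::thread_rng()
        .sample_iter(&Alphanumeric)
        .take(12)
        .map(char::from)
        .collect()
}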
Now create the save_page method. This uses everything we've built so far to save the page under the base directory we provide:
impl Archiver {
async fn save_page(
html_record: &mut HtmlRecord,
client: &mut Client,
base_path: &str,
) -> Result<String, String> {
//set up the directory to save the page in
let url = Url::parse(&html_record.origin).expect("can't parse origin url");
let host_name = url.host().expect("can't get host").to_string();
let mut url_path = url.path().to_string();
let mut base_path = base_path.to_string();
if !base_path.ends_with('/') {
base_path.push('/');
}
if !url_path.ends_with('/') {
url_path.push('/');
}
let directory = format!(
"{}{}{}{}",
base_path, host_name, url_path, html_record.date_time
);
//create the directory
fs::create_dir_all(&directory).map_err(|e| format!("Failed to create directory: {}", e))?;
//Get images
match html_record.get_image_links() {
Ok(Some(t_image_links)) => {
assert!(fs::create_dir_all(format!("{}/images", directory)).is_ok());
for link in t_image_links {
if let Ok(image_bytes) = client.fetch_image_bytes(&link.1).await {
if let Ok(tmp_image) = image::load_from_memory(&image_bytes) {
let file_name = get_file_name(&link.1)
.unwrap_or_else(|| random_name_generator() + ".png");
let fqn = format!("{}/images/{}", directory, file_name);
let body_replacement_text = format!("./images/{}", file_name);
if (file_name.ends_with(".png")
&& tmp_image
.save_with_format(&fqn, image::ImageFormat::Png)
.is_ok())
|| (!file_name.ends_with(".png") && tmp_image.save(&fqn).is_ok())
{
html_record.body =
html_record.body.replace(&link.0, &body_replacement_text);
}
}
}
}
}
Ok(None) => {
println!("no images for url: {}", url);
}
Err(e) => {
println!("error {}", e)
}
}
//Get css links
match html_record.get_css_links() {
Ok(Some(t_css_links)) => {
assert!(fs::create_dir_all(format!("{}/css", directory)).is_ok());
for link in t_css_links {
let file_name =
get_file_name(&link.1).unwrap_or_else(|| random_name_generator() + ".css");
if let Ok(css) = client.fetch_string_resource(&link.1).await {
let fqn = format!("{}/css/{}", directory, file_name);
let mut file = File::create(&fqn).unwrap();
if file.write(css.as_bytes()).is_ok() {
let body_replacement_text = format!("./css/{}", file_name);
html_record.body =
html_record.body.replace(&link.0, &body_replacement_text);
} else {
println!("couldnt write css for url {}", &fqn);
}
}
}
}
Ok(None) => {
println!("no css for url: {}", url);
}
Err(e) => {
println!("error for url {}\n error: {}", url, e)
}
}
//get js links
match html_record.get_js_links() {
Ok(Some(t_js_links)) => {
assert!(fs::create_dir(format!("{}/js", directory)).is_ok());
for link in t_js_links {
let file_name =
get_file_name(&link.1).unwrap_or_else(|| random_name_generator() + ".js");
if let Ok(js) = client.fetch_string_resource(&link.1).await {
let fqn = format!("{}/js/{}", directory, file_name);
if let Ok(mut output) = File::create(fqn) {
if output.write(js.as_bytes()).is_ok() {
let body_replacement_text = format!("./js/{}", file_name);
html_record.body =
html_record.body.replace(&link.0, &body_replacement_text);
}
}
}
}
}
Ok(None) => {
println!("no js for url: {}", url);
}
Err(e) => {
println!("error for url : {}\n error :{}", url, e);
}
}
//write html to file
let fqn_html = format!("{}/index.html", directory);
let mut file_html = File::create(fqn_html.clone()).unwrap();
if file_html.write(html_record.body.as_bytes()).is_ok() {
Ok(fqn_html)
} else {
Err("error archiving site".to_string())
}
}
}
Run through this code. Each resource block follows the same pattern: fetch the resource, save it next to the page, then rewrite the matching link in the HTML body so it points at the local copy.
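For example, if the original page contained an img tag pointing at /static/logo.png (a made-up path), the replace calls rewrite it to the saved copy, roughly like this:
let mut body = String::from(r#"<img src="/static/logo.png">"#);
//link.0 would be "/static/logo.png" and the replacement text points at the local file
body = body.replace("/static/logo.png", "./images/logo.png");
assert_eq!(body, r#"<img src="./images/logo.png">"#);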
Now right above save_page, create the following method:
pub async fn create_archive(
&mut self,
client: &mut Client,
url: &str,
path: &str,
) -> Result<String, String> {
//create record
let mut record = client
.fetch_html_record(url)
.await
.unwrap_or_else(|_| panic!("fetch_html_record failed \n url {}", url));
//save the page
match Archiver::save_page(&mut record, client, path).await {
Ok(archive_path) => Ok(archive_path),
Err(e) => Err(e),
}
}
This fetches the HtmlRecord and then hands it to save_page, which extracts and stores all of its resources.
Now, lastly, in main.rs add the mods:
mod archiver;
mod client;
mod html;
Now add the use statements:
use crate::archiver::Archiver;
use crate::client::Client;
Now change the main to this:
#[tokio::main]
async fn main() {
let url = "https://en.wikipedia.org/wiki/Rust_(programming_language)";
/*
change these two lines if you want to use an absolute path, or create the directory "/Projects/archive_test"
*/
//this will grab your home directory
let home_dir = dirs::home_dir().expect("Failed to get home directory");
//make sure this directory exists:
let custom_path = "/Projects/archive_test";
//this is the absolute path: your home directory plus the custom path
//where you want your archives to go
let new_dir = format!("{}{}", home_dir.to_str().unwrap(), custom_path);
//create the client and pass it to the archiver
let mut client = Client::new();
let mut archiver = Archiver;
let path = archiver.create_archive(&mut client, url, &new_dir).await;
//path of the archived site
println!("{:?}", path);
}
Change the values I used: the site URL and the path (which needs to be absolute).
Now run the following command in the base directory of the project:
cargo run
The output path is where your archive now resides.
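If everything worked you should see an Ok with the path to the saved index.html, something along these lines (with your own home directory and a real timestamp in place of <date>):
Ok("/home/you/Projects/archive_test/en.wikipedia.org/wiki/Rust_(programming_language)/<date>/index.html")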