Rust Keyword Extraction: Creating the TF-IDF algorithm from scratch (and a Tokenizer)

Disclosure: The material in this Tutorial has not been reviewed, endorsed, or approved of by the Rust Foundation. For more information on the Rust Foundation Trademark Policy, click here.

Series Intro

You know, like many of you, I was blown away by ChatGPT, especially GPT-4, which reignited my love for machine learning after a two-year break. But once I was back, I hit the same issue that made me quit: slow training times.

I tried unsupervised NLP models and keyword extraction, but even those were too slow for a simple VPS. That is when I realised Rust and Actix-Web could be the answer.

Therefore, let's explore Rust for NLP together and dive into the awesome world of text mining and keyword extraction.

What will I cover in this series

I will mainly cover 3 algorithms:

  1. TF-IDF: the Term Frequency-Inverse Document Frequency algorithm converts raw text into insightful numerical data.
  2. Co-occurrence Matrix: the Co-occurrence Matrix uncovers underlying patterns and associations in a language by examining word pairs and building semantic networks from textual data.
  3. RAKE: the Rapid Automatic Keyword Extraction algorithm identifies crucial words and phrases to enhance text summarization, topic modeling, and information retrieval in any text-based project.

Intro

Before implementing the TF-IDF algorithm, we have to know what it is and what it does. The Term Frequency-Inverse Document Frequency algorithm weights the importance of each word in a group of documents based on two components:

  • Term Frequency (TF): this component measures how often a term appears in a document. It is calculated by dividing the number of times the word appears by the total number of words.
  • Inverse Document Frequency (IDF): this component measures the importance of a word relative to the whole corpus of documents. It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents in which the word appears (the exact formulas we will implement are summarised right after this list).
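
Putting the two components together, the scores we will compute look roughly like this (the IDF shown here is the smoothed variant used by the implementation later in this article):

TF(t)     = (number of times t appears) / (total number of words)
IDF(t)    = ln((1 + N) / (1 + df(t))) + 1
TF-IDF(t) = TF(t) * IDF(t)

where N is the total number of documents and df(t) is the number of documents containing the term t.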

Steps to implement TF-IDF

Note: if you do not have time to read this entire article, you can find all the code in this repository. While you are at it, consider buying me a coffee if you liked what you saw.

Here are the steps to implement TF-IDF:

  1. Tokenize the documents by converting them into a vector of words, and then remove punctuation and stop-words. Stop-words are common, language-specific words that do not carry much meaning and can typically be ignored.
  2. Calculate the Term Frequency (TF) for each word.
  3. Calculate the Inverse Document Frequency (IDF) for each word.
  4. Multiply the TF by IDF to get the Term Frequency-Inverse Document Frequency (TF-IDF) for each word.
  5. L2-normalize (using the Euclidean norm, which is essentially the Pythagorean theorem applied to the whole vector) the base TF-IDF scores.

Implementation

Let's start by creating a new Rust project with Cargo:

$ cargo new keyword_extraction --lib

Since we will have several algorithms, we can take advantage of Cargo's workspaces functionality.

Text Processing

Tokenizer initialisation

Create a library called tokenizer:

$ cargo new tokenizer --lib

And add it to the main Cargo.toml file:

[package]
name = "keyword_extraction"
version = "0.1.0"
edition = "2021"
license = "GPL-3.0-or-later"
authors = ["Afonso Barracha <barracha.afonso@gmail.com>"]
publish = false

[lib]
name = "keyword_extraction"
path = "src/lib.rs"

[workspace]
members = [
    ".",
    "tokenizer"
]

[dependencies]
tokenizer = { path = "./tokenizer" }

And update the tokenizer's Cargo.toml:

[package]
name = "tokenizer"
version = "0.1.0"
edition = "2021"
publish = false
authors = ["Afonso Barracha <barracha.afonso@gmail.com>"]
description = "A simple tokenizer for the Rust programming language"
license = "GPL-3.0-or-later"

[lib]
name = "tokenizer"
path = "src/lib.rs"

To process the document we will use a commonly used string-processing library, unicode-segmentation. As the description of the library states:

This crate provides Grapheme Cluster, Word and Sentence boundaries according to Unicode Standard Annex #29 rules.

So we will use it to break (tokenize) the block of text (article, job offer, etc.) into sentences, paragraphs and words.

Start by adding it and the regex library to the tokenizer's Cargo.toml file:

# ...

[dependencies]
unicode-segmentation = "^1"
regex = "^1"

At the top of the tokenizer's lib.rs file, import the UnicodeSegmentation trait from that crate, the Regex struct from the regex crate, and HashSet from the standard library:

use std::collections::HashSet;

use regex::Regex;
use unicode_segmentation::UnicodeSegmentation;

In the same lib.rs file, add a new struct that holds the text, the stop-words, and the punctuation used in the text:

pub struct Tokenizer {
    text: String,
    stopwords: HashSet<String>,
    punctuation: HashSet<String>,
}

Implement a new method on the struct for its initialisation:

impl Tokenizer {
    pub fn new(text: &str, stopwords: Vec<String>, punctuation: Option<Vec<String>>) -> Self {
        Self {
            text: text.to_owned(),
            stopwords: stopwords.into_iter().collect::<HashSet<String>>(),
            punctuation: punctuation
                .unwrap_or_else(|| {
                    vec![
                        "!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", ";", ".",
                        "/", ":", "<", "=", ">", "?", "@", "[", "\\", "]", "^", "_", "`", "{",
                        "|", "}", "~", "-",
                    ]
                    .into_iter()
                    .map(|s| s.to_string())
                    .collect::<Vec<String>>()
                })
                .into_iter()
                .collect::<HashSet<String>>(),
        }
    }
}

Punctuation is mostly the same across languages, so we make it optional and use the vec! macro to initialise the most common characters as a default.

Using the iterator pattern we turn the vectors of stop-words and punctuation into hash sets for easy and efficient lookup.

Tokenizer Word split

The UnicodeSegmentation trait gives strings a very useful method, split_word_bounds(), which breaks our block of text into word tokens separated from most punctuation characters. However, some patterns like "'s", "." and "," still need to be filtered out, so create a function that builds a regex for those characters:

// ...

fn get_special_char_regex() -> Regex {
    Regex::new(r"('s|,|\.)").unwrap()
}

Finally, to process and filter those words we will use the Iterator trait's filter_map method. Create a helper function that returns an Option<String>:

fn process_word(
    w: &str,
    special_char_regex: &Regex,
    stopwords: &HashSet<String>,
    punctuation: &HashSet<String>,
) -> Option<String> {
    let word = special_char_regex.replace_all(w.trim(), "").to_lowercase();

    if word.is_empty()
        || (word.graphemes(true).count() == 1) && punctuation.contains(&word)
        || stopwords.contains(&word)
    {
        return None;
    }

    Some(word)
}

We perform four actions in that function:

  1. Process the word by removing the special characters and turning it into lowercase:

    let word = special_char_regex.replace_all(w.trim(), "").to_lowercase();
    
  2. We filter the word out if it is an empty string:

    if word.is_empty() {
        return None;
    }
    
  3. If the string is one character long, we check whether it is a punctuation character (one that was split off by the split_word_bounds() method):

    if (word.graphemes(true).count() == 1) && punctuation.contains(&word) {
        return None;
    }
    

    Note: the graphemes method is provided by the UnicodeSegmentation trait as well

  4. We check if the word is a stop-word, and filter it out if it is:

    if stopwords.contains(&word) {
        return None;
    }
    

Now we can use this function inside an iterator in the methods of our Tokenizer struct, like the one that splits the text into words:

impl Tokenizer {
    // ...

    pub fn split_into_words(&self) -> Vec<String> {
        // Compile the regex once instead of once per word.
        let special_char_regex = get_special_char_regex();

        self.text
            .split_word_bounds()
            .filter_map(|w| {
                process_word(w, &special_char_regex, &self.stopwords, &self.punctuation)
            })
            .collect::<Vec<String>>()
    }
}

The method is pretty self-explanatory: we call split_word_bounds to get an iterator of word tokens, filter_map them with the process_word function, and collect everything into a Vec<String>.
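
As a quick sanity check, here is roughly how the method behaves on a small hypothetical input, for example inside a test (the stop-word list is just an example):

let stopwords = vec!["the".to_string(), "is".to_string(), "a".to_string()];
let tokenizer = Tokenizer::new("The compiler is strict, but it's fair.", stopwords, None);

// Stop-words and punctuation are stripped, "'s" is removed, and everything is lowercased,
// so this should print: ["compiler", "strict", "but", "it", "fair"]
println!("{:?}", tokenizer.split_into_words());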

Tokenizer Sentence Split

To split our text into documents we need to break the text block into sentences; again, UnicodeSegmentation has a good method for that, unicode_sentences(). But before that we need to clean up sentence endings followed by new lines ("\n" and "\r") or tabs ("\t"), so create a regex for that:

fn get_sentence_space_regex() -> Regex {
    Regex::new(r"^([\.!?])[\n\t\r]").unwrap()
}

And use it to clean the text before splitting it into sentences; like with the words, we still need to remove the stop-words from each sentence:

impl Tokenizer {
    // ...

    pub fn split_into_sentences(&self) -> Vec<String> {
        let special_char_regex = get_special_char_regex();
        get_sentence_space_regex()
            .replace_all(&self.text, ".")
            .unicode_sentences()
            .map(|s| {
                s.split_word_bounds()
                    .filter_map(|w| {
                        process_word(w, &special_char_regex, &self.stopwords, &self.punctuation)
                    })
                    .collect::<Vec<String>>()
                    .join(" ")
            })
            .collect::<Vec<String>>()
    }
}

The code is pretty much the same as in the previous section, but now we first split the text into sentences, and then process the words of each sentence.
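
To make the output concrete, here is roughly what the method produces for a small hypothetical input (again, the stop-word list is just an example):

let stopwords = vec!["the".to_string(), "is".to_string(), "it".to_string(), "a".to_string()];
let tokenizer = Tokenizer::new("The compiler is strict. It is also fair.", stopwords, None);

// Each sentence becomes one cleaned-up string,
// so this should print: ["compiler strict", "also fair"]
println!("{:?}", tokenizer.split_into_sentences());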

Tokenizer Paragraph Split

To split the text into paragraphs we just need to divide it on new-line delimiters, so create a Regex for new lines:

fn get_newline_regex() -> Regex {
    // "\r\n" comes first so Windows line endings match as a single delimiter.
    Regex::new(r"(\r\n|\r|\n)").unwrap()
}

And apply it to the text:

impl Tokenizer {
    // ...

    pub fn split_into_paragraphs(&self) -> Vec<String> {
        // Compile the regex once instead of once per word.
        let special_char_regex = get_special_char_regex();

        get_newline_regex()
            .split(&self.text)
            .filter_map(|paragraph| {
                if paragraph.trim().is_empty() {
                    return None;
                }

                Some(
                    paragraph
                        .unicode_sentences()
                        .map(|sentence| {
                            sentence
                                .split_word_bounds()
                                .filter_map(|w| {
                                    process_word(
                                        w,
                                        &special_char_regex,
                                        &self.stopwords,
                                        &self.punctuation,
                                    )
                                })
                                .collect::<Vec<String>>()
                                .join(" ")
                        })
                        .collect::<Vec<String>>()
                        .join(" "),
                )
            })
            .collect::<Vec<String>>()
    }
}

Note that we now need three nested iterators: one for the paragraphs, one for the sentences of each paragraph, and one for the words of each sentence.

Calculate TF-IDF

At the end of the day, TF-IDF is just a fancy HashMap with the words as keys and a 32-bit float (f32) score as values.

Start by generating a tf_idf library:

$ cargo new tf_idf --lib
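
Just like with the tokenizer, the new crate has to be registered in the workspace. A sketch of the required additions: the root Cargo.toml gains a workspace member and a dependency, and tf_idf needs tokenizer as a path dependency, since we import it below:

# root Cargo.toml
[workspace]
members = [
    ".",
    "tokenizer",
    "tf_idf"
]

[dependencies]
tokenizer = { path = "./tokenizer" }
tf_idf = { path = "./tf_idf" }

# tf_idf/Cargo.toml
[dependencies]
tokenizer = { path = "../tokenizer" }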

In tf_idf/src/lib.rs, import the necessary items, including the Tokenizer from our tokenizer library:

use std::cmp::Ordering;
use std::collections::{HashMap, HashSet};

use tokenizer::Tokenizer;

Create a tuple struct TfIdf that wraps a single HashMap (note the f32 values, matching the functions below):

pub struct TfIdf(HashMap<String, f32>);

Calculate TF

To calculate the TF we have to create two functions.

  1. This function takes the documents and returns a hash map with the number of occurrences of each word across all documents:

    fn generate_word_hashmap(documents: &Vec<String>) -> HashMap<String, f32> {
        documents
            .iter()
            .flat_map(|document| document.split_whitespace().map(|word| word.to_string()))
            .fold(HashMap::new(), |mut acc, word| {
                let count = acc.entry(word).or_insert(0.0);
                *count += 1.0;
                acc
            })
    }
    
  2. While this one takes that hash map and returns each word's TF score (a small worked example follows this list):

    fn calculate_tf(tf: HashMap<String, f32>) -> HashMap<String, f32> {
        let total_words = tf.values().sum::<f32>();
    
        tf.iter()
            .map(|(word, count)| (word.to_string(), count / total_words))
            .collect::<HashMap<String, f32>>()
    }
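
A small worked example of how these two helpers compose, using hypothetical, already-tokenized documents (for example inside a test):

let documents = vec![
    "rust is fast".to_string(),
    "rust is safe".to_string(),
];

// generate_word_hashmap counts every occurrence across the whole corpus:
// {"rust": 2.0, "is": 2.0, "fast": 1.0, "safe": 1.0}
let word_counts = generate_word_hashmap(&documents);

// calculate_tf divides each count by the total number of tokens (6 here),
// so for example tf["rust"] == 2.0 / 6.0
let tf = calculate_tf(word_counts);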
    

Calculate IDF

You need two functions to calculate the IDF score as well:

  1. One takes the documents and returns, for each word, the number of documents in which it appears:

    fn generate_unique_word_hashmap(documents: &Vec<String>) -> HashMap<String, f32> {
        documents
            .iter()
            .map(|document| {
                document
                    .split_whitespace()
                    .map(|s| s.to_owned())
                    .collect::<HashSet<String>>()
            })
            .flat_map(|unique_words| unique_words.into_iter())
            .fold(HashMap::new(), |mut acc, word| {
                let count = acc.entry(word).or_insert(0.0);
                *count += 1.0;
                acc
            })
    }
    
  2. The other takes the number of documents and the unique-word hash map, and returns a hash map of IDF scores. Note that it uses a smoothed formula, adding one inside the logarithm and to the final result, so that words appearing in every document still get a non-zero weight (see the short example after this list):

    fn calculate_idf(docs_len: f32, word_hashmap: HashMap<String, f32>) -> HashMap<String, f32> {
        let one = 1.0_f32;
    
        word_hashmap
            .iter()
            .map(|(word, count)| {
                let documents_with_term = (docs_len + one) / (count + one);
                (word.to_string(), documents_with_term.ln() + one)
            })
            .collect::<HashMap<String, f32>>()
    }
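
Continuing the small hypothetical example from the TF section:

// With 2 documents, "rust" and "is" appear in both, while "fast" and "safe" appear in only one:
// idf["rust"] == ((2.0 + 1.0) / (2.0 + 1.0)).ln() + 1.0 == 1.0
// idf["fast"] == ((2.0 + 1.0) / (1.0 + 1.0)).ln() + 1.0, roughly 1.41
let idf = calculate_idf(documents.len() as f32, generate_unique_word_hashmap(&documents));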
    

Calculate the TF-IDF

The TF-IDF score is just the element-wise product of the previous two hash maps:

fn calculate_tf_idf(tf: HashMap<String, f32>, idf: HashMap<String, f32>) -> HashMap<String, f32> {
    tf.iter()
        .map(|(word, count)| (word.to_string(), count * idf.get(word).unwrap_or(&0.0_f32)))
        .collect::<HashMap<String, f32>>()
}

L2 Normalization

The L2 norm, or Euclidean norm, is the length of a vector within a vector space. It is basically the Pythagorean theorem applied to all the values in the vector:

l2 = sqrt(x*x + y*y + ...)  

So the function that calculates this is something like:

fn l2_normalize(tf_id: HashMap<String, f32>) -> HashMap<String, f32> {
    let l2_norm = tf_id
        .values()
        .map(|value| value * value)
        .sum::<f32>()
        .sqrt();

    tf_id
        .iter()
        .map(|(key, value)| (key.clone(), value / l2_norm))
        .collect::<HashMap<String, f32>>()
}

Putting it all together

To initialise the TfIdf struct implement a new method:

impl TfIdf {
    pub fn new(documents: &Vec<String>) -> Self {
        Self(l2_normalize(calculate_tf_idf(
            calculate_tf(generate_word_hashmap(documents)),
            calculate_idf(
                documents.len() as f32,
                generate_unique_word_hashmap(documents),
            ),
        )))
    }
}

Adding the Final Touches

As you can see, the struct's only field is private, so we need methods to access its data: one to check the score of a single word, and one to get the top N words. A short end-to-end usage sketch follows this list.

  • The get_score method is just a wrapper around the hash map:

    impl TfIdf {
        // ...
    
        pub fn get_score(&self, word: &str) -> f32 {
            *self.0.get(word).unwrap_or(&0.0)
        }
    }
    
  • While get_n_best returns the top n words:

    impl TfIdf {
        // ...
    
        pub fn get_n_best(&self, n: usize) -> Vec<(String, f32)> {
            let mut sorted_tfidf = self.0.iter().collect::<Vec<(&String, &f32)>>();
            sorted_tfidf.sort_by(|a, b| {
                let order = b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal);
    
                if order == Ordering::Equal {
                    return a.0.cmp(&b.0);
                }
    
                order
            });
            sorted_tfidf
                .iter()
                .take(n)
                .map(|(word, score)| (word.to_string(), **score))
                .collect::<Vec<(String, f32)>>()
        }
    }
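
Finally, here is a minimal end-to-end sketch of how the two crates fit together, for example from a small binary or integration test inside the workspace (the text and stop-word list are placeholders):

use tf_idf::TfIdf;
use tokenizer::Tokenizer;

fn main() {
    let stopwords = vec!["the".to_string(), "is".to_string(), "a".to_string()];
    let text = "Rust is a fast language. Rust is also memory safe.";

    // Each sentence acts as one document for the TF-IDF calculation.
    let documents = Tokenizer::new(text, stopwords, None).split_into_sentences();
    let tf_idf = TfIdf::new(&documents);

    // "rust" should rank first, since it appears in both sentences.
    println!("{:?}", tf_idf.get_n_best(3));
    println!("{}", tf_idf.get_score("rust"));
}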
    

Conclusion

In this article we learned how to implement one of the most common algorithms in NLP keyword extraction, the TF-IDF algorithm. It is particularly useful for building text-based data analysis services, such as analysers for articles, job offers, and similar documents.

All the code from this article can be found here.

About the Author

Hey there! I am Afonso Barracha, your go-to econometrician who found his way into the world of back-end development with a soft spot for GraphQL. If you enjoyed reading this article, why not show some love by buying me a coffee?

Lately, I have been diving deep into more advanced subjects. As a result, I have switched from sharing my thoughts every week to posting once or twice a month. This way, I can make sure to bring you the highest quality content possible.

Do not miss out on any of my latest articles – follow me here on dev and LinkedIn to stay updated. I would be thrilled to welcome you to our ever-growing community! See you around!
