Disclosure: The material in this Tutorial has not been reviewed, endorsed, or approved by the Rust Foundation. For more information on the Rust Foundation Trademark Policy, click here.
Series Intro
You know, like many of you, I was blown away by ChatGPT, especially GPT-4, which reignited my love for machine learning after a two-year break. But once I was back, I hit the same issue that made me quit: slow training times.
I tried unsupervised NLP models and keyword extraction, but even those were too slow for a simple VPS. That is when I realised Rust and Actix-Web could be the answer.
Therefore, let's explore Rust for NLP together and dive into the awesome world of text mining and keyword extraction.
What will I cover in this series
I will mainly cover 3 algorithms:
- TF-IDF: Term Frequency-Inverse Document Frequency algorithm converts raw text into insightful numerical data.
- Co-occurrence Matrix: the Co-occurrence Matrix uncovers underlying patterns and associations in a language by examining word pairs and building semantic networks in textual data.
- RAKE: Rapid Automatic Keyword Extraction Algorithm identifies crucial words and phrases to enhance text summarization, topic modeling, and information retrieval in any text-based project.
Intro
Before implementing the TF-IDF algorithm, we have to know what it is and what it does. The Term Frequency-Inverse Document Frequency algorithm weighs the importance of each word in a group of documents based on two components (see the formulas just after this list):
- Term Frequency (TF): this component measures how often a term appears in a document. It is calculated by dividing the number of times the word appears in the document by the total number of words in that document.
- Inverse Document Frequency (IDF): this component measures the importance of a word relative to the whole corpus of documents. It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents in which the word appears.
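In formula form (using the smoothed IDF variant that the implementation later in this article uses; the names count, total_words and documents_containing are just descriptive placeholders):
tf(word, document) = count(word, document) / total_words(document)
idf(word) = ln((total_documents + 1) / (documents_containing(word) + 1)) + 1
tf_idf(word, document) = tf(word, document) * idf(word)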
Steps to implement TF-IDF
Note: if you do not have time to read this entire article, you can find all the code in this repository. While you are at it, consider buying me a coffee if you liked what you saw.
Here are the steps to implement TF-IDF:
- Tokenize the documents by converting them into a vector of words, and then remove punctuation and stop-words. Stop-words are common, language-specific words that do not carry much meaning and can typically be ignored.
- Calculate the Term Frequency (TF) for each word.
- Calculate the Inverse Document Frequency (IDF) for each word.
- Multiply the TF by IDF to get the Term Frequency-Inverse Document Frequency (TF-IDF) for each word.
- L2-normalize the resulting TF-IDF scores (the L2 norm is the Euclidean, Pythagoras-style length of the score vector).
Implementation
Let's start by creating a new Rust project with Cargo:
$ cargo new keyword_extraction --lib
Since we will have several algorithms, we can take advantage of Cargo's workspaces functionality.
Text processing
Tokenizer initialisation
Create a library called tokenizer:
$ cargo new tokenizer --lib
And add it to the main Cargo.toml file:
[package]
name = "keyword_extraction"
version = "0.1.0"
edition = "2021"
license = "GPL-3.0-or-later"
authors = ["Afonso Barracha <barracha.afonso@gmail.com>"]
publish = false
[lib]
name = "keyword_extraction"
path = "src/lib.rs"
[workspace]
members = [
".",
"tokenizer"
]
[dependencies]
tokenizer = { path = "./tokenizer" }
And update the tokenizer's Cargo.toml:
[package]
name = "tokenizer"
version = "0.1.0"
edition = "2021"
publish = false
authors = ["Afonso Barracha <barracha.afonso@gmail.com>"]
description = "A simple tokenizer for the Rust programming language"
license = "GPL-3.0-or-later"
[lib]
name = "tokenizer"
path = "src/lib.rs"
To process the document we will use a commonly used string-processing library, unicode-segmentation. As the library's description states:
This crate provides Grapheme Cluster, Word and Sentence boundaries according to Unicode Standard Annex #29 rules.
So we will use it to break (tokenize) the block of text (article, job offer, etc.) into sentences, paragraphs and words.
Start by adding it, together with the regex library, to the tokenizer's Cargo.toml file:
# ...
[dependencies]
unicode-segmentation = "^1"
regex = "^1"
At the top of the lib.rs file, import the UnicodeSegmentation trait from this crate, the Regex struct from the regex crate, and HashSet from the standard library:
use std::collections::HashSet;
use regex::Regex;
use unicode_segmentation::UnicodeSegmentation;
In the lib.rs file, add a new struct that contains the text, the stop-words and the punctuation used in the text:
pub struct Tokenizer {
text: String,
stopwords: HashSet<String>,
punctuation: HashSet<String>,
}
On the struct, implement a new method for its initialisation:
impl Tokenizer {
pub fn new(text: &str, stopwords: Vec<String>, punctuation: Option<Vec<String>>) -> Self {
Self {
text: text.to_owned(),
stopwords: stopwords
.iter()
.map(|s| s.to_owned())
.collect::<HashSet<String>>(),
punctuation: punctuation
.unwrap_or(
vec![
"!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", ";", ".", "/",
":", ",", "<", "=", ">", "?", "@", "[", "\\", "]", "^", "_", "`", "{", "|",
"}", "~", "-",
]
.iter()
.map(|s| s.to_string())
.collect::<Vec<String>>(),
)
.iter()
.map(|s| s.to_string())
.collect::<HashSet<String>>(),
}
}
}
Punctuation is mostly the same across languages, so we make it optional and use the vec! macro to initialise the most common characters as a default.
Using the iterator pattern we turn the vectors of stop-words and punctuation into hash sets for easy and efficient lookup.
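For illustration, the constructor can then be called like this from a crate that depends on tokenizer (the stop-word list here is made up and far smaller than a real one):
use tokenizer::Tokenizer;

fn main() {
    // A real application would load a full language-specific stop-word list.
    let stopwords = vec!["the".to_string(), "is".to_string(), "a".to_string()];

    // Fall back to the default punctuation set...
    let _default = Tokenizer::new("Rust is a fast language.", stopwords.clone(), None);

    // ...or supply a custom one.
    let _custom = Tokenizer::new(
        "Rust is a fast language!",
        stopwords,
        Some(vec!["!".to_string(), ".".to_string()]),
    );
}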
Tokenizer Word Split
The UnicodeSegmentation trait gives strings a very cool method, split_word_bounds(), which breaks our block of text into word tokens that we can collect into a Vec<String>, with words separated from most punctuation characters. However, some tokens like "'s", "." and "," still need to be filtered out, so create a function that builds a regex to match those characters:
// ...
fn get_special_char_regex() -> Regex {
Regex::new(r"('s|,|\.)").unwrap()
}
Finally, to process and filter those words we will use the iterator trait's filter_map method. Create a helper function that returns the Option<String> monad:
fn process_word(
w: &str,
special_char_regex: &Regex,
stopwords: &HashSet<String>,
punctuation: &HashSet<String>,
) -> Option<String> {
let word = special_char_regex.replace_all(w.trim(), "").to_lowercase();
if word.is_empty()
|| (word.graphemes(true).count() == 1) && punctuation.contains(&word)
|| stopwords.contains(&word)
{
return None;
}
Some(word)
}
We do four things in that function, we:

- Process the word by removing the special characters and turning it into lowercase:

let word = special_char_regex.replace_all(w.trim(), "").to_lowercase();

- Filter the word out if it is an empty string:

if word.is_empty() { return None; }

- If the string is one character long, check whether it is a punctuation character (one that was split off by the split_word_bounds() method):

if (word.graphemes(true).count() == 1) && punctuation.contains(&word) { return None; }

Note: the graphemes method is provided by the UnicodeSegmentation trait as well.

- Check whether the word is a stop-word, and filter it out if it is:

if stopwords.contains(&word) { return None; }
Now we can use this function inside the iterator chains in the methods of our Tokenizer struct, like the one that splits the text into words:
impl Tokenizer {
    // ...

    pub fn split_into_words(&self) -> Vec<String> {
        // Compile the regex once, instead of once per word.
        let special_char_regex = get_special_char_regex();

        self.text
            .split_word_bounds()
            .filter_map(|w| {
                process_word(w, &special_char_regex, &self.stopwords, &self.punctuation)
            })
            .collect::<Vec<String>>()
    }
}
The method is pretty self-explanatory: we call split_word_bounds to get an iterator of word tokens, we filter_map them through the process_word function, and we collect everything into a Vec<String>.
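To see it in action, here is a minimal usage sketch run from a crate that depends on tokenizer (such as the root keyword_extraction crate); the text and the tiny stop-word list are made up for illustration:
use tokenizer::Tokenizer;

fn main() {
    // A real application would load a full stop-word list.
    let stopwords = vec!["is".to_string(), "a".to_string()];
    let tokenizer = Tokenizer::new("Rust is a fast language.", stopwords, None);

    // Stop-words, punctuation and casing are stripped away.
    assert_eq!(
        tokenizer.split_into_words(),
        vec!["rust".to_string(), "fast".to_string(), "language".to_string()]
    );
}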
Tokenizer Sentence Split
To split our text into documents we need to break the text block into sentences; again, UnicodeSegmentation has a good method for that: unicode_sentences(). But before that we need to deal with new lines ("\n" and "\r") and tabs ("\t"), so create a regex for that:
fn get_sentence_space_regex() -> Regex {
Regex::new(r"^([\.!?])[\n\t\r]").unwrap()
}
And use it to clean the text; as with the words, we still need to remove the stop-words:
impl Tokenizer {
// ...
pub fn split_into_sentences(&self) -> Vec<String> {
let special_char_regex = get_special_char_regex();
get_sentence_space_regex()
.replace_all(&self.text, ".")
.unicode_sentences()
.map(|s| {
s.split_word_bounds()
.filter_map(|w| {
process_word(w, &special_char_regex, &self.stopwords, &self.punctuation)
})
.collect::<Vec<String>>()
.join(" ")
})
.collect::<Vec<String>>()
}
}
The code is pretty much the same as in the previous section, but now we first split the text into sentences and then process the words of each sentence.
Tokenizer Paragraph Split
To split the text into paragraphs we just need to divide it on new-line delimiters, so create a Regex for new lines:
fn get_newline_regex() -> Regex {
Regex::new(r"(\r|\n|\r\n)").unwrap()
}
And apply it to the text:
impl Tokenizer {
    // ...

    pub fn split_into_paragraphs(&self) -> Vec<String> {
        // Compile the regex once, instead of once per word.
        let special_char_regex = get_special_char_regex();

        get_newline_regex()
            .split(&self.text)
            .filter_map(|s| {
                if s.trim().is_empty() {
                    return None;
                }

                Some(
                    s.unicode_sentences()
                        .map(|s| {
                            s.split_word_bounds()
                                .filter_map(|w| {
                                    process_word(
                                        w,
                                        &special_char_regex,
                                        &self.stopwords,
                                        &self.punctuation,
                                    )
                                })
                                .collect::<Vec<String>>()
                                .join(" ")
                        })
                        .collect::<Vec<String>>()
                        .join(" "),
                )
            })
            .collect::<Vec<String>>()
    }
}
Note that we now need three nested iterators: one for the paragraphs, one for the sentences of each paragraph, and one for the words of each sentence.
Calculate TF-IDF
At the end of the day, TF-IDF is just a fancy HashMap that has the words as keys and a 32-bit float (f32) score as values.
Start by generating a tf_idf library:
$ cargo new tf_idf --lib
And import the necessary libraries in the new crate's lib.rs, including the Tokenizer library (remember to also add tf_idf to the workspace members in the root Cargo.toml and tokenizer as a dependency of the new crate):
use std::cmp::Ordering;
use std::collections::{HashMap, HashSet};
use tokenizer::Tokenizer;
Create a tuple struct TfIdf that wraps a single HashMap (note that the values are f32, matching the scores we calculate below):
pub struct TfIdf(HashMap<String, f32>);
Calculate TF
To calculate the TF we have to create two functions.
- This function takes the documents and returns a word hash map with the number of occurrences of each word across all documents:

fn generate_word_hashmap(documents: &Vec<String>) -> HashMap<String, f32> {
    documents
        .iter()
        .flat_map(|document| document.split_whitespace().map(|word| word.to_string()))
        .fold(HashMap::new(), |mut acc, word| {
            let count = acc.entry(word).or_insert(0.0);
            *count += 1.0;
            acc
        })
}
- While this one takes that word-count hash map and returns a hash map with each word's TF score:

fn calculate_tf(tf: HashMap<String, f32>) -> HashMap<String, f32> {
    let total_words = tf.values().sum::<f32>();

    tf.iter()
        .map(|(word, count)| (word.to_string(), count / total_words))
        .collect::<HashMap<String, f32>>()
}
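For example, if a document contains 10 words in total and the word "rust" appears twice, calculate_tf gives it a score of:
tf("rust") = 2 / 10 = 0.2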
Calculate IDF
You need two functions to calculate the IDF score as well:
- One takes the documents and returns how many documents each word appears in:

fn generate_unique_word_hashmap(documents: &Vec<String>) -> HashMap<String, f32> {
    documents
        .iter()
        .map(|document| {
            document
                .split_whitespace()
                .map(|s| s.to_owned())
                .collect::<HashSet<String>>()
        })
        .flat_map(|unique_words| unique_words.into_iter())
        .fold(HashMap::new(), |mut acc, word| {
            let count = acc.entry(word).or_insert(0.0);
            *count += 1.0;
            acc
        })
}
- The other takes the number of documents and the unique-word hash map, and returns a hash map of IDF scores:

fn calculate_idf(docs_len: f32, word_hashmap: HashMap<String, f32>) -> HashMap<String, f32> {
    let one = 1.0_f32;

    word_hashmap
        .iter()
        .map(|(word, count)| {
            // Smoothed IDF: ln((N + 1) / (df + 1)) + 1
            let documents_with_term = (docs_len + one) / (count + one);
            (word.to_string(), documents_with_term.ln() + one)
        })
        .collect::<HashMap<String, f32>>()
}
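For example, with 4 documents where the word "rust" appears in 2 of them, calculate_idf produces:
idf("rust") = ln((4 + 1) / (2 + 1)) + 1 ≈ 1.51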
Calculate the TF-IDF
The TF-IDF score is just the word-by-word product of the two previous hash maps:
fn calculate_tf_idf(tf: HashMap<String, f32>, idf: HashMap<String, f32>) -> HashMap<String, f32> {
tf.iter()
.map(|(word, count)| (word.to_string(), count * idf.get(word).unwrap_or(&0.0_f32)))
.collect::<HashMap<String, f32>>()
}
L2 Normalization
The L2 norm, or Euclidean norm, is the length of a vector within a vector space. It is basically the Pythagorean theorem applied to all the values in the vector:
l2 = sqrt(x*x + y*y + ...)
So the function that calculates this is something like:
fn l2_normalize(tf_id: HashMap<String, f32>) -> HashMap<String, f32> {
let l2_norm = tf_id
.values()
.map(|value| value * value)
.sum::<f32>()
.sqrt();
tf_id
.iter()
.map(|(key, value)| (key.clone(), value / l2_norm))
.collect::<HashMap<String, f32>>()
}
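As a quick worked example: if the raw TF-IDF scores were 0.3 for "rust" and 0.4 for "fast", the L2 norm would be sqrt(0.3*0.3 + 0.4*0.4) = 0.5, so the normalized scores become 0.6 and 0.8.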
Putting it all together
To initialise the TfIdf struct, implement a new method:
impl TfIdf {
pub fn new(documents: &Vec<String>) -> Self {
Self(l2_normalize(calculate_tf_idf(
calculate_tf(generate_word_hashmap(documents)),
calculate_idf(
documents.len() as f32,
generate_unique_word_hashmap(documents),
),
)))
}
}
Adding the Final Touches
As you can see, our struct's only field is private, hence we need methods to access its data: one for checking the score of a single word, and one for getting the top N words.
- The get_score method is just a wrapper around the hash map:

impl TfIdf {
    // ...

    pub fn get_score(&self, word: &str) -> f32 {
        *self.0.get(word).unwrap_or(&0.0)
    }
}
- While the get_n_best method gets the top n words:

impl TfIdf {
    // ...

    pub fn get_n_best(&self, n: usize) -> Vec<(String, f32)> {
        let mut sorted_tfidf = self.0.iter().collect::<Vec<(&String, &f32)>>();

        // Sort by score in descending order, breaking ties alphabetically.
        sorted_tfidf.sort_by(|a, b| {
            let order = b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal);

            if order == Ordering::Equal {
                return a.0.cmp(&b.0);
            }

            order
        });

        sorted_tfidf
            .iter()
            .take(n)
            .map(|(word, score)| (word.to_string(), **score))
            .collect::<Vec<(String, f32)>>()
    }
}
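Putting both crates together, a minimal end-to-end sketch (assuming both tokenizer and tf_idf are dependencies of the root crate; the text and stop-word list are purely illustrative) could look like this:
use tf_idf::TfIdf;
use tokenizer::Tokenizer;

fn main() {
    let text = "Rust is a fast language. Rust is also memory safe.";
    let stopwords = vec!["is".to_string(), "a".to_string(), "also".to_string()];

    // Treat each sentence as one document of the corpus.
    let documents = Tokenizer::new(text, stopwords, None).split_into_sentences();
    let tf_idf = TfIdf::new(&documents);

    // Print the three highest scoring words.
    for (word, score) in tf_idf.get_n_best(3) {
        println!("{word}: {score}");
    }
}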
Conclusion
In this article we learned how to implement one of the most common algorithms in NLP keyword extraction, the TF-IDF algorithm. It is particularly useful for building text-based data analysis services, such as analysers for articles, job offers and other documents.
All the code from this article can be found here.
About the Author
Hey there! I am Afonso Barracha, your go-to econometrician who found his way into the world of back-end development with a soft spot for GraphQL. If you enjoyed reading this article, why not show some love by buying me a coffee?
Lately, I have been diving deep into more advanced subjects. As a result, I have switched from sharing my thoughts every week to posting once or twice a month. This way, I can make sure to bring you the highest quality content possible.
Do not miss out on any of my latest articles: follow me here on dev and LinkedIn to stay updated. I would be thrilled to welcome you to our ever-growing community! See you around!