Marc Ryan Riginding

Posted on Aug 21, 2019 • Edited on Aug 28, 2019

Benchmarking or "How to count words really fast"

#rust #optimization #benchmark

A colleague at work showed me this interesting GitHub repository: github.com/juditacs/wordcount.

The readme explains what the word-counting challenge is:

The task is to split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.

@juditacs did a great job of aggregating different solutions in several programming languages, creating a Dockerfile to easily setup and benchmark all programs against a Hungarian Wikipedia text dump. They even wrote two blog posts about the word-counting challenge. 1, 2

The latest benchmark on the readme was done on 2016-06-13:

Rank	Experiment	CPU seconds	User time	Contributor
1	java -Xms2G -Xmx8G -classpath java:java/zah-0.6.jar WordCountOptimized	140.64	191.44
2	java -Xms2G -Xmx8G -classpath java:java/trove-3.0.3.jar WordCountOptimized	183.17	219.39
3	rust/wordcount/wordcount	208.86	193.34	Joshua Holmer

Something I find very interesting is that the java solution was the fastest one submitted. It was faster than the usual suspects of C, C++ and Rust programs.

This piqued my interest! Why is Java in the same ballpark as the rust program? Why does the java baseline solution differ so much from the optimized solution? And how can we make the Rust version as fast or even faster than the Java Solution?

Luckily the authors of the optimized java solution left some helpful comments:

Key tricks:

Faster hashing

Buffer I/O, either using Buffered I/O classes or by explicitly reading chunks to an array

Process recurring tokens separately from singletons (big performance gain)

Work directly with raw bytes (safe for UTF-8), to reduce memory and avoid text decoding cost

Using the same techniques we should be able to boost the rust solution to the top.

Before we start, let's look at the baseline rust implementation first. This is the submitted Rust solution by Joshua Holmer; I added some comments for people who are new to rust.

	use std::io::{self, Read, Write, BufWriter}; // Bring some IO stuff into scope
	use std::collections::HashMap; // Bring Hashmap into scope

	fn main() { // special entry function
	let mut buffer = String::new(); // Initialize a string
	io::stdin().read_to_string(&mut buffer).ok(); // fill string with input from stdin

	let mut words: HashMap<&str, usize> = HashMap::new(); // initialize HashMap
	for word in buffer.split(is_whitespace).filter(is_not_empty) {
	if let Some(count) = words.get_mut(word) {
	*count += 1;
	continue;
	}
	words.insert(word, 1);
	}

	// we gather all entries of the dictionary in a Vector of Tuples
	// the first field is the word, the second the number of occurrences
	let mut sortable: Vec<(&str, usize)> = words
	.iter()
	.map(\|(word, count)\| (word, count))
	.collect();

	sortable.sort_by_key(\|a\| a.0); //sort words
	sortable.sort_by(\|a, b\| b.1.cmp(&a.1)); // sort occurences

	let mut output = BufWriter::new(io::stdout());
	for word in sortable {
	writeln!(output, "{}\t{}", word.0, word.1).ok(); // output to stdout
	}
	}

	#[inline] // this instructs the compiler to inline this function
	fn is_not_empty(s: &&str) -> bool {
	!s.is_empty()
	}

	#[inline]
	fn is_whitespace(c: char) -> bool {
	c == ' ' \|\| c == '\t' \|\| c == '\n'
	}

view raw wordcount.rs hosted with ❤ by GitHub

The rust solution is straight forward and the code is fairly easy to understand. We will use this code as the basis for our optimized rust solution.

Benchmarking

Hyperfine is a benchmarking tool written by the great sharkdp to measure the runtime of different cli commands and to gather benchmark results. The runtime of the baseline rust program against 5M lines of Hungarian Wikipedia is around 28 seconds.

Start improving the rust code

1. Faster hashing

Desired effect:
The programming challenge consists of basically two parts, gathering all words into a HashMap and secondly arranging the words into an array to sort it. Since most of the work happens within the HashMap, using a faster HashMap should give us some speed.

Changes:
The Rust solution was submitted in April 2016, that means the used Rust Version was around 1.10. In Rust 1.36 we got a new default implementation for Hashmap which is based on SwissTable. Just updating the rust version to the newest stable version gives us some performance improvements.

Benchmark:

$ hyperfine -w 1 -m 5 
'cat 5M_wiki_dump.xml | ./wordcount_1.10' 
'cat 5M_wiki_dump.xml | ./wordcount_1.37'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.10
  Time (mean ± σ):     28.131 s ±  0.363 s    [User: 27.292 s, System: 1.661 s]
  Range (min … max):   27.774 s … 28.718 s    5 runs

Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_1.37
  Time (mean ± σ):     23.357 s ±  0.335 s    [User: 22.644 s, System: 1.346 s]
  Range (min … max):   22.910 s … 23.708 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_1.37'
ran 1.20 ± 0.02 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_1.10'

The runtime of the updated rust binary takes around 23 seconds. We already improved by 5 seconds. 20% faster than the original solution just by updating rust!

2. Buffer I/O, either using Buffered I/O classes or by explicitly reading chunks to an array

Desired effect:
We want to buffer our output to minimize the number of syscalls we send.

Changes:
The solution already uses BufWriter. No need to do anything here.

Benchmark:
The runtime stayed the same because we did not change anything 🙂

3. Process recurring tokens separately from singletons (big performance gain)

Desired effect:
All benchmark optimization are walking on a thin line between "smart hack" and cheat. This optimization works well with the test data (Hungarian Wikipedia).
Basically, we are creating two Lists, one with words that have 2 or more occurrences and another one with only single occurrences.
The list with the multiple occurrences needs to be sorted twice, once by the number of occurrences and the second time the words need to be sorted lexically. The singleton list only has to be sorted lexically, because the number of occurrences is the same for the whole list (n=1).
Reducing the amount of sortable data that needs to be sorted twice reduces the runtime.

Changes:

-    sortable_words.sort_by_key(|a| a.0);
-    sortable_words.sort_by(|a, b| b.1.cmp(&a.1));

+    let mut multiple: Vec<(&str, usize)> = vec![];
+    let mut single: Vec<(&str, usize)> = vec![];
+
+    for (word, count) in words.iter() {
+        if *count > 1 {
+            multiple.push((*word, *count));
+        } else {
+            single.push((*word, *count));
+        }
+    }
+
+    multiple.sort_by_key(|a| a.0);
+    multiple.sort_by(|a, b| b.1.cmp(&a.1));
+    single.sort_by_key(|a| a.0);

     let mut output = BufWriter::new(io::stdout());
-    for word in sortable_words {
+    for word in multiple {
+        writeln!(output, "{}\t{}", word.0, word.1).ok();
+    }
+
+    for word in single {
         writeln!(output, "{}\t{}", word.0, word.1).ok();
     }
 }

link to gist

Benchmark:

$ hyperfine -w 1 -m 5 
'cat 5M_wiki_dump.xml | ./wordcount_1.37' 
'cat 5M_wiki_dump.xml | ./wordcount_single'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.37
  Time (mean ± σ):     23.809 s ±  0.278 s    [User: 23.079 s, System: 1.369 s]
  Range (min … max):   23.401 s … 24.152 s    5 runs

Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_single
  Time (mean ± σ):     23.257 s ±  0.398 s    [User: 22.533 s, System: 1.360 s]
  Range (min … max):   22.661 s … 23.729 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_single'
ran 1.02 ± 0.02 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_1.37'

Our new binary finished 2% faster than the previous solution.

4. Work directly with raw bytes (safe for UTF-8), to reduce memory and avoid text decoding cost

Desired effect:
Another performance optimization we can do is using raw bytes instead of UTF-8 Strings. In Rust, a String does check its content for UTF-8 validity, something that can be avoided by using plain byte arrays. The std::string::from_unchecked_utf8 function, which can only be used in an unsafe block, is used to print the byte array. This is a not safe operation, but because we know that the input will be valid UTF-8, we indicate to the compiler that we know what we are doing by putting it inside an unsafe block.

Changes:

	use std::io::{self, Read, Write, BufWriter};
	use std::collections::HashMap;

	fn main() {
	let mut buffer = Vec::<u8>::new();
	io::stdin().read_to_end(&mut buffer).ok();

	let mut words: HashMap<&[u8], usize> = HashMap::new();
	for word in buffer.split(is_whitespace).filter(is_not_empty) {
	if let Some(count) = words.get_mut(word) {
	*count += 1;
	continue;
	}
	words.insert(word, 1);
	}

	let mut multiple: Vec<(&[u8], usize)> = vec![];
	let mut single: Vec<(&[u8], usize)> = vec![];

	for (word, count) in words.iter() {
	if *count > 1 {
	multiple.push((word, count));
	} else {
	single.push((word, count));
	}
	}

	multiple.sort_by_key(\|a\| a.0);
	multiple.sort_by(\|a, b\| b.1.cmp(&a.1));
	single.sort_by_key(\|a\| a.0);

	let mut output = BufWriter::new(io::stdout());
	unsafe {
	for word in multiple {
	writeln!(output, "{}\t{}", std::str::from_utf8_unchecked(word.0), word.1).ok();
	}

	for word in single {
	writeln!(output, "{}\t{}", std::str::from_utf8_unchecked(word.0), word.1).ok();
	}
	}
	}
	#[inline]
	fn is_whitespace(c: &u8) -> bool {
	c == b' ' \|\| c == b'\t' \|\| *c == b'\n'
	}

	#[inline]
	fn is_not_empty(s: &&[u8]) -> bool {
	!s.is_empty()
	}

view raw wordcount.rs hosted with ❤ by GitHub

Benchmark:

$ hyperfine -w 1 -m 5 
'cat 5M_wiki_dump.xml | ./wordcount_single' 
'cat 5M_wiki_dump.xml | ./wordcount_u8'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_single
  Time (mean ± σ):     23.380 s ±  0.296 s    [User: 22.658 s, System: 1.367 s]
  Range (min … max):   23.053 s … 23.745 s    5 runs

Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_u8
  Time (mean ± σ):     22.713 s ±  0.438 s    [User: 21.975 s, System: 1.378 s]
  Range (min … max):   22.154 s … 23.159 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_u8' 
ran 1.03 ± 0.02 times faster than 
'cat 5M_wiki_dump.xml | ./wordcount_single'

Our Improvements gained us another 3%.

Result

Benchmark baseline vs latest improvement:

$ hyperfine -w 1 -m 5 
'cat 5M_wiki_dump.xml | ./wordcount_1.10' 
'cat 5M_wiki_dump.xml | ./wordcount_u8'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.10
  Time (mean ± σ):     27.722 s ±  0.483 s    [User: 26.926 s, System: 1.629 s]
  Range (min … max):   26.909 s … 28.146 s    5 runs

Benchmark #2:cat 5M_wiki_dump.xml | ./wordcount_u8
  Time (mean ± σ):     22.254 s ±  0.286 s    [User: 21.577 s, System: 1.315 s]
  Range (min … max):   21.950 s … 22.559 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_u8' 
ran 1.25 ± 0.03 times faster than 
'cat 5M_wiki_dump.xml | ./wordcount_1.10'

All in all our improvements gained us a 25% faster binary compared to the baseline rust solution.

Global Data Access Unlocked – Reach data across borders without restrictions.

Unlock the power of global data collection with our advanced proxy solutions. Ideal for market research and more.

Unlock Data Now

Top comments (2)

Tim McNamara • Aug 21 '19

Really like that you have broken down all of the steps and taken the time to benchmark each of them.

Some other suggestions:

Maybe stream stdin, rather than reading the whole thing at the start? io::stdin().read_to_end(&mut buffer).ok(); feels expensive.

Perhaps pick a specialized hash function that's happy with short strings.

Maybe use Vec::with_capacity(words.len()) rather than the vec![] macro when you construct multiple and single. That'll avoid several allocations. Although it also will probably waste a lot of memory.

Marc Ryan Riginding • Aug 22 '19

Great ideas, i'll update my code and measure.

Im primarily interested in optimizing for runtime, but I think it would be interesting to see how the implementation differs if you optimize for space.

DEV Community

Benchmarking or "How to count words really fast"

Benchmarking

Start improving the rust code

1. Faster hashing

2. Buffer I/O, either using Buffered I/O classes or by explicitly reading chunks to an array

3. Process recurring tokens separately from singletons (big performance gain)

4. Work directly with raw bytes (safe for UTF-8), to reduce memory and avoid text decoding cost

Result

Global Data Access Unlocked – Reach data across borders without restrictions.

Top comments (2)

Bypass Anti-Scraping Technologies – Access your critical web data without barriers.

Read next

The Rise Of Reasoner Models: Scaling Test-Time Compute

Mastering Pull-to-Refresh in React Native: A Guide to Smooth Data Reloading 🚀

Master AWS Transit Gateway Management with Terraform: A Step-by-Step Guide

Optimize Cloud Expenses with CloudCADI

Okay