A colleague at work showed me this interesting GitHub repository: github.com/juditacs/wordcount.
The readme explains what the word-counting challenge is:
The task is to split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.
@juditacs did a great job of aggregating different solutions in several programming languages, creating a Dockerfile to easily setup and benchmark all programs against a Hungarian Wikipedia text dump. They even wrote two blog posts about the word-counting challenge. 1, 2
The latest benchmark on the readme was done on 2016-06-13:
Rank | Experiment | CPU seconds | User time | Contributor |
---|---|---|---|---|
1 | java -Xms2G -Xmx8G -classpath java:java/zah-0.6.jar WordCountOptimized | 140.64 | 191.44 | |
2 | java -Xms2G -Xmx8G -classpath java:java/trove-3.0.3.jar WordCountOptimized | 183.17 | 219.39 | |
3 | rust/wordcount/wordcount | 208.86 | 193.34 | Joshua Holmer |
Something I find very interesting is that the java solution was the fastest one submitted. It was faster than the usual suspects of C, C++ and Rust programs.
This piqued my interest! Why is Java in the same ballpark as the rust program? Why does the java baseline solution differ so much from the optimized solution? And how can we make the Rust version as fast or even faster than the Java Solution?
Luckily the authors of the optimized java solution left some helpful comments:
Key tricks:
- Faster hashing
- Buffer I/O, either using Buffered I/O classes or by explicitly reading chunks to an array
- Process recurring tokens separately from singletons (big performance gain)
- Work directly with raw bytes (safe for UTF-8), to reduce memory and avoid text decoding cost
Using the same techniques we should be able to boost the rust solution to the top.
Before we start, let's look at the baseline rust implementation first. This is the submitted Rust solution by Joshua Holmer; I added some comments for people who are new to rust.
The rust solution is straight forward and the code is fairly easy to understand. We will use this code as the basis for our optimized rust solution.
Benchmarking
Hyperfine is a benchmarking tool written by the great sharkdp to measure the runtime of different cli commands and to gather benchmark results. The runtime of the baseline rust program against 5M lines of Hungarian Wikipedia is around 28 seconds.
Start improving the rust code
1. Faster hashing
Desired effect:
The programming challenge consists of basically two parts, gathering all words into a HashMap and secondly arranging the words into an array to sort it. Since most of the work happens within the HashMap, using a faster HashMap should give us some speed.
Changes:
The Rust solution was submitted in April 2016, that means the used Rust Version was around 1.10
. In Rust 1.36
we got a new default implementation for Hashmap which is based on SwissTable. Just updating the rust version to the newest stable version gives us some performance improvements.
Benchmark:
$ hyperfine -w 1 -m 5
'cat 5M_wiki_dump.xml | ./wordcount_1.10'
'cat 5M_wiki_dump.xml | ./wordcount_1.37'
Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.10
Time (mean ± σ): 28.131 s ± 0.363 s [User: 27.292 s, System: 1.661 s]
Range (min … max): 27.774 s … 28.718 s 5 runs
Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_1.37
Time (mean ± σ): 23.357 s ± 0.335 s [User: 22.644 s, System: 1.346 s]
Range (min … max): 22.910 s … 23.708 s 5 runs
Summary
'cat 5M_wiki_dump.xml | ./wordcount_1.37'
ran 1.20 ± 0.02 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_1.10'
The runtime of the updated rust binary takes around 23 seconds. We already improved by 5 seconds. 20% faster than the original solution just by updating rust!
2. Buffer I/O, either using Buffered I/O classes or by explicitly reading chunks to an array
Desired effect:
We want to buffer our output to minimize the number of syscalls we send.
Changes:
The solution already uses BufWriter. No need to do anything here.
Benchmark:
The runtime stayed the same because we did not change anything 🙂
3. Process recurring tokens separately from singletons (big performance gain)
Desired effect:
All benchmark optimization are walking on a thin line between "smart hack" and cheat. This optimization works well with the test data (Hungarian Wikipedia).
Basically, we are creating two Lists, one with words that have 2 or more occurrences and another one with only single occurrences.
The list with the multiple occurrences needs to be sorted twice, once by the number of occurrences and the second time the words need to be sorted lexically. The singleton list only has to be sorted lexically, because the number of occurrences is the same for the whole list (n=1).
Reducing the amount of sortable data that needs to be sorted twice reduces the runtime.
Changes:
- sortable_words.sort_by_key(|a| a.0);
- sortable_words.sort_by(|a, b| b.1.cmp(&a.1));
+ let mut multiple: Vec<(&str, usize)> = vec![];
+ let mut single: Vec<(&str, usize)> = vec![];
+
+ for (word, count) in words.iter() {
+ if *count > 1 {
+ multiple.push((*word, *count));
+ } else {
+ single.push((*word, *count));
+ }
+ }
+
+ multiple.sort_by_key(|a| a.0);
+ multiple.sort_by(|a, b| b.1.cmp(&a.1));
+ single.sort_by_key(|a| a.0);
let mut output = BufWriter::new(io::stdout());
- for word in sortable_words {
+ for word in multiple {
+ writeln!(output, "{}\t{}", word.0, word.1).ok();
+ }
+
+ for word in single {
writeln!(output, "{}\t{}", word.0, word.1).ok();
}
}
Benchmark:
$ hyperfine -w 1 -m 5
'cat 5M_wiki_dump.xml | ./wordcount_1.37'
'cat 5M_wiki_dump.xml | ./wordcount_single'
Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.37
Time (mean ± σ): 23.809 s ± 0.278 s [User: 23.079 s, System: 1.369 s]
Range (min … max): 23.401 s … 24.152 s 5 runs
Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_single
Time (mean ± σ): 23.257 s ± 0.398 s [User: 22.533 s, System: 1.360 s]
Range (min … max): 22.661 s … 23.729 s 5 runs
Summary
'cat 5M_wiki_dump.xml | ./wordcount_single'
ran 1.02 ± 0.02 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_1.37'
Our new binary finished 2% faster than the previous solution.
4. Work directly with raw bytes (safe for UTF-8), to reduce memory and avoid text decoding cost
Desired effect:
Another performance optimization we can do is using raw bytes instead of UTF-8 Strings. In Rust, a String does check its content for UTF-8 validity, something that can be avoided by using plain byte arrays. The std::string::from_unchecked_utf8 function, which can only be used in an unsafe block, is used to print the byte array. This is a not safe operation, but because we know that the input will be valid UTF-8, we indicate to the compiler that we know what we are doing by putting it inside an unsafe block.
Changes:
Benchmark:
$ hyperfine -w 1 -m 5
'cat 5M_wiki_dump.xml | ./wordcount_single'
'cat 5M_wiki_dump.xml | ./wordcount_u8'
Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_single
Time (mean ± σ): 23.380 s ± 0.296 s [User: 22.658 s, System: 1.367 s]
Range (min … max): 23.053 s … 23.745 s 5 runs
Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_u8
Time (mean ± σ): 22.713 s ± 0.438 s [User: 21.975 s, System: 1.378 s]
Range (min … max): 22.154 s … 23.159 s 5 runs
Summary
'cat 5M_wiki_dump.xml | ./wordcount_u8'
ran 1.03 ± 0.02 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_single'
Our Improvements gained us another 3%.
Result
Benchmark baseline vs latest improvement:
$ hyperfine -w 1 -m 5
'cat 5M_wiki_dump.xml | ./wordcount_1.10'
'cat 5M_wiki_dump.xml | ./wordcount_u8'
Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.10
Time (mean ± σ): 27.722 s ± 0.483 s [User: 26.926 s, System: 1.629 s]
Range (min … max): 26.909 s … 28.146 s 5 runs
Benchmark #2:cat 5M_wiki_dump.xml | ./wordcount_u8
Time (mean ± σ): 22.254 s ± 0.286 s [User: 21.577 s, System: 1.315 s]
Range (min … max): 21.950 s … 22.559 s 5 runs
Summary
'cat 5M_wiki_dump.xml | ./wordcount_u8'
ran 1.25 ± 0.03 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_1.10'
All in all our improvements gained us a 25% faster binary compared to the baseline rust solution.
Top comments (2)
Really like that you have broken down all of the steps and taken the time to benchmark each of them.
Some other suggestions:
Maybe stream stdin, rather than reading the whole thing at the start?
io::stdin().read_to_end(&mut buffer).ok();
feels expensive.Perhaps pick a specialized hash function that's happy with short strings.
Maybe use
Vec::with_capacity(words.len())
rather than thevec![]
macro when you constructmultiple
andsingle
. That'll avoid several allocations. Although it also will probably waste a lot of memory.Great ideas, i'll update my code and measure.
Im primarily interested in optimizing for runtime, but I think it would be interesting to see how the implementation differs if you optimize for space.