Goal
To efficiently use a large list of known English words to create a spell checker.
NOTE This is a work in progress. Please respond with comments on how it could be done better / if there are existing frameworks or libraries that I should use instead.
Plan
In this article I'll create two versions of a simple spell checker which really will just take a word and see if it's in a large list of known words. I'm using this list of 20,000 words, which is supposed to be a list of the most frequent ones.
I plan to use this in a Phoenix app so I wanted to do some measurements and determine the performance characteristics. My first thought was to use a GenServer so that I could have a single process load up the list and answer check requests, but I wanted to make sure overhead of a GenServer is worth it.
Code Structure
I'm creating two different SpellCheckers - a Naive one and a GenServer one.
I'm going to use a page from Bram Stoker's Dracula as test data.
And then I'll use Benchee
to measure results.
Implementations
Helper:
defmodule FileReader do
def read_words_from_file(file_path) do
file_path
|> File.stream!()
|> Enum.map(&String.trim/1)
|> Enum.reject(&(&1 == ""))
end
end
Naive:
defmodule SpellCheck.Naive do
#@all_words ["these", "are", "some", "words"]
def exists?(word) do
all_words = FileReader.read_words_from_file("/tmp/20k.txt")
word in all_words
end
end
GenServer:
defmodule SpellCheck.GenServer do
use GenServer
#@all_words ["these", "are", "some", "words"]
def start_link(_opts \\ []) do
GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
end
def exists?(word) do
GenServer.call(__MODULE__, {:check_exists, word})
end
@impl true
def init(_) do
all_words = FileReader.read_words_from_file("/tmp/20k.txt")
{:ok, all_words}
end
@impl true
def handle_call({:check_exists, word}, _from, all_words) do
{:reply, word in all_words, all_words}
end
end
SpellCheck.GenServer.start_link()
Test words:
sample_text = """
I soon lost sight and recollection of ghostly fears
<snip>
"""
eval_words = String.split(sample_text, " ", trim: true)
Benchee:
Benchee.run(%{
"Naive" => fn -> Enum.map(eval_words, fn w -> SpellCheck.Naive.exists?(w) end) end,
"GenServer" => fn -> Enum.map(eval_words, fn w -> SpellCheck.GenServer.exists?(w) end) end
},
time: 5,
memory_time: 2
)
Test 1 - Hard coded dictionary
First, I actually tested the code without reading the 20,000 words from a file and just used an array set as module attributes.
Perhaps not surprisingly, when using a hard coded list of words as a dictionary the extra overhead of a GenServer was noticable and the Naive implementation performed better:
Name ips average deviation median 99th %
Naive 14.17 K 70.57 μs ±37.68% 69.79 μs 87.25 μs
GenServer 2.99 K 333.95 μs ±3.30% 330.88 μs 374.93 μs
Comparison:
Naive 14.17 K
GenServer 2.99 K - 4.73x slower +263.38 μs
Memory usage statistics:
Name Memory usage
Naive 171.97 KB
GenServer 246.55 KB - 1.43x memory usage +74.58 KB
Interpreting the results: Naive was 5x faster and used ~60% less memory.
But this is not a viable way to check spelling because we need to load a large set of words.
How will it perform at higher scale?
Test 2 - Reading dictionary from a file
You won't be surprised to see that the GenServer
performs much better, considering that their purpose is maintaining state so we don't need to load the state over and over.
However, I was surprised to see HOW MUCH better it was:
Name ips average deviation median 99th %
GenServer 110.18 0.00908 s ±55.16% 0.00876 s 0.00981 s
Naive 0.68 1.46 s ±6.78% 1.41 s 1.61 s
Comparison:
GenServer 110.18
Naive 0.68 - 160.88x slower +1.45 s
Memory usage statistics:
Name Memory usage
GenServer 0.00024 GB
Naive 1.98 GB - 8419.39x memory usage +1.98 GB
Interpreting the results: GenServer was 160x faster and a tiny fraction of the memory that Naive used. In fact, it looks like GenServer's memory usage didn't change much between tests.
Though the overall throughput between the two tests varies significantly. My method of checking if a word is in the dictionary could certainly be improved with a data structure that supports a binary search instead of a linear search. But this is good enough for my use case and I'll need to explore faster searching another time.
Top comments (0)