Table of Content
- Introduction
- Frequency count
- What is CM Sketch
- Why is CM Sketch is important?
- Super Linear vs Linear vs Sub Linear
- Why not hash tables?
- Problems in CM Sketch ?
- Unveiling the Inner Workings
- From Theory to Action: Implementation Unleashed
- In Python
- Memory Comparison
- Summary of Key Points
- Takeaway
- Applications
- References
Introduction
Count-Min Sketch is a popular probabilistic data structure used for approximate frequency estimation of elements in a data stream. It provides a memory-efficient (sub-linear) way to estimate the frequency of elements without storing the entire dataset. In this blog, we'll explore the concept of Count-Min Sketch and learn how to implement and use it in Python.
Frequency count
- Frequency count, or frequency analysis, is a technique in data analysis that involves determining the number of occurrences of distinct elements in a dataset or data stream.
- It provides valuable insights into the distribution and popularity of elements within the data. By counting how often each element appears, frequency counts help uncover patterns, identify trends, and reveal the most common or significant elements in the dataset.
- This technique finds applications in various domains, including text analysis, data mining, web analytics, market research, network traffic analysis, and social media analysis. It helps in such tasks as identifying popular words, detecting anomalies, understanding customer preferences, and making data-driven decisions.
- In simpler terms, this frequency count helps to gain a deeper understanding and insights into the data we have.
What is CM Sketch?
- As mentioned above CM Sketch is a probabilistic data structure, which will act as a frequency table for the stream of data.
- CM Sketch consists of multiple hash functions and a 2-D array for counters.
- CM Sketch offers O(1) for inserting and retrieval so that we can use it for high-speed data streams such as network monitoring, stock market analysis, traffic analysis, and data mining.
Importance of CM Sketch
Approximate Counting: CM Sketch is designed to estimate the frequency of items in a data stream with limited memory.
Memory Efficiency: The Count-Min Sketch (CM Sketch) demonstrates sub-linear memory usage, as it employs a fixed amount of memory regardless of the super-linear size of the data stream it processes.
Streaming Data Analysis: CM Sketch is specifically well-suited for processing data streams, where data arrives continuously and cannot be stored entirely due to its volume.
Scalability: As CM Sketch's memory requirements are fixed and independent of the data stream size, it offers a scalable solution for processing large and continuously growing data sets without becoming impractical as the data volume increases.
Super Linear vs Linear vs Sub Linear
In algorithmic analysis we use terms like Super Linear, Linear, Sub Linear which are used to describe the recourse consumption(time or space).
Super-linear: A function or algorithm is considered super-linear if its resource consumption grows faster than the size of the input. Like when we do recursion or nested for loops the time complexity increases at the faster rate. We can consider an algorithms which is greater than O(N) (N is size of input) will be Super Linear
Linear: A function or algorithm is linear if its resource consumption grows proportionally with the size of the input. We can consider an algorithms which is equals to O(N).
Sub-linear: A function or algorithm is considered sub-linear if its resource consumption grows at a slower rate than the size of the input. We can consider an algorithm's time complexity is O(log n). Here even if we double the input it will not double the resource allocation.
Why not Hash Tables?
When it comes to frequency count, our natural inclination is to consider Hash Tables, since hash tables are the most commonly used data structure for storing frequency counts. Hash tables could give the exact count of the data, along with this it will provide efficient insert and retrieval of data. However, there are some reasons why CM Sketch can be preferred over Hash tables for the below reasons
Efficient Memory: CM Sketch requires sufficiently less memory(Sub Linear Memory) compared to Hash Tables. We can achieve this by using a fixed 2-D array instead of storing individual k-v pairs in a table.
Streaming Data: CM Sketch is specially designed for handling streaming data, which mean element arrives one by one and may not fit in the memory. We can process the data in a single pass without storing the entire dataset. In streaming data we couldn't look back at the old data, Here in CM sketch we can perform all operations in real-time.
Scalable: CM Sketch can be easily distributed across multiple servers, Each sketch can handle a data stream independently. Also, we can merge the results to find the overall analysis. CM Sketch is well suited for large-scale distributed systems due to its memory efficiency and scalability.
Simple and faster updates: It has efficient update operations. It performs O(1) time for each element to get processed(insert or retrieval). The combination of space and time complexity gives high efficiency for the stream of data
Problems in CM Sketch
Approximation Error: Unlike Hash tables, CM Sketch will not give an exact count. CM Sketch is a probabilistic data structure that majorly focuses on memory efficiency while providing approximate data. This trade-off will be suitable in places where having an approximate frequency is enough, and converging memory is valuable.
False Positive: Count-Min Sketch may produce false positives, where the estimated frequency is higher than the actual frequency. This can occur when two different elements hash to the same counters.
Hash Functions selection: We should select our hash function to be good enough to handle more data, Poorly chosen hash functions will lead to collisions and reduced accuracy.
Set of elements: CM Sketch can process the predefined input data sets, we should know more about the data set we have before processing
Unveiling the Inner Workings
Let's explore the real backend implementations of CM Sketch.
For more accuracy of data, we can use more hash functions.
CM Sketch is a sublinear space since we have a predefined sketch/size for the counter.
High-level components needed.
- Counter Array: This is a 2D array that has a fixed number of rows and columns. The Sketch in CM Sketch is nothing but a 2D array of counters
- Rows: → Number of Hash Functions
- Columns: → 0 to Maximum number which each hash function gives.
- Hash Functions: CM Sketch uses multiple hash functions which independently determine the indices within the counter array for each element.
- Incrementing Counter: When we get an element from the datastream, each hash function will process on go to calculate the respective row and column in the counter array.
- Estimate Frequency: To estimate the frequency of the element, the sketch retrieves the minimum value among the counters in the row mapped by each hash function for the element. Min value gives the lower bound approximation of the element's frequency.
This special data structure is known as Count Min Sketch since we are counting the minimum value of a counter inside a sketch.
Unveiling the Inner Workings
High-Level Working Component
Consider the below data stream which has a defined set of elements.
stream_data → [A, B, K, A, A, K, S, ....∞]
First, we have to build predicted hash function outputs for the elements we have. We can use as many hash functions as possible to avoid collision.
Now initiate the CM Sketch with the Number of Rows as the length of Hash Functions and the Number of columns as the max number which hash returns. And values in the matrix/grid will be 0
In our case we use 4 hash functions out of those hash function's result max hash value is 6
- Rows → 4
- Columns → 7 (0-6)
With 4 rows and 7 columns, we created a counter array that will store the count/frequency.
Steps
1) When we start processing the datastream first we will pass A to the CM Sketch, And for A,
→ We need to increment the counter in the indices where the hash function return value.
→ in our example A pass through hash functions H1, H2, H3, and H4 and get the values 1, 6, 3, 1.
→ So updated those values with the respective hash function's row and its return value.
2) Similarly for B,
→ We need to increment the counter in the indices where the hash function return value.
→ now, B passes through hash functions H1, H2, H3, and H4 and gets the values 1, 2, 4, 6.
→ So updated those values with the respective hash function's row and its return value.
→ Here in H1, for A and B return 1, so we need to add 1 in the index.
3) Similarly for K,
→ We need to increment the counter in the indices where the hash function return value.
→ now, K passes through hash functions H1, H2, H3, and H4 and gets the values 3, 4, 1, 6.
→ So updated those values with the respective hash function's row and its return value.
4) Now again we are getting A,
→ Now we need to add the values from the hash.
→ For A hash functions H1, H2, H3, and H4 will return 1, 6, 3, 1.
→ Now we need to add 1 with those index values already present.
5) Again we are getting A,
→ Repeat step 4.
→ For A hash functions H1, H2, H3, and H4 will return 1, 6, 3, 1.
→ Now we need to add 1 with those index values already present.
6) Now the element will be K,
→ Repeat the same step with K's hash value
→ For K hash functions H1, H2, H3, and H4 will return 3, 4, 1, 6.
→ Now we need to add 1 with those index values already present.
7) Now we are getting a new element S,
→ Repeat the same step with S's hash value
→ For S hash functions H1, H2, H3, and H4 will return 6, 2, 4, 1.
→ Now we need to add 1 with those index values already present.
Thus, we generate the CM Sketch for the data stream. We can retrieve the count at any moment we need.
How to get the frequency
- Consider we need to retrieve the frequency of A.
- We know that A's Hash function will return the index 1, 6, 3, 1.
- Get the hash frequency from the CM Sketch Matrix for those indices.
- Hash Frequency for A is,
A → [4, 3, 3, 4]
- So from the hash frequency we will be taking the minimum value as a count/frequency, which is 3 in our case
- And we have 3 A's in our data stream.
- Similarly, for K, we have the hash frequency.
-
K → [2, 2, 2, 3]
- And minimum of the frequency list are 2, and we have 2 K's in the stream.
⚠️ Inaccuracy !!
- When we count minimum frequency for B
- B → [4, 2, 2, 3]
- min_count = 2 ≠ count of B in the stream
Considerations
- We need to consider minimum value because hash collision by another hash update might hit the accuracy.
- When we increase the number of hash we will get a more accurate frequency.
In Python
In Python we can use the same CM Sketch using the package/library, pyprobables
Prerequisite:
- Python Version > 3.6
-
pyprobables package, and we can install it through pip
pip install pyprobables
We can import CountMinSketch class from pyprobables and can implement this datastructure.
Code Breakthrough
- CMS = CountMinSketch(4, 1000).
- This will create 4 hash functions internally and each hash function will return values in the range 0-999.
- We add datastream to cms (cms.add), and cms internally do the incrementations with respect to the hash functions and elements from the data stream.
- cms.check({element}) will return the count/frequency of the element/
- For this input, we will get the below output.
That's how we can use CM Sketch in Python for streaming data.
Memory Comparison
Now, let try to compare the memory utilization for Hash Tables vs CM Sketch.
Consider we have a large input which will create a random word of length of 103.
And this is the result that we can see, we have almost saved around 97% of memory with the CM sketch.
Summary
- CM Sketch is a probabilistic data structure used for approximate frequency estimation of elements in a stream of data.
- CM Sketch provides an efficient way to estimate frequencies by using limited memory compared to exact counting methods.
- CM Sketch is a 2D matrix/grid which will increment the count with respect to the Hash Functions.
- Each cell in the CM Sketch will store the frequency of the elements.
Takeaway
- CM Sketch is an Approximate Frequency Estimation.
- CM Sketch is highly memory efficient.
- CM Sketch is highly scalable and we can merge the final output with multiple servers output.
- Consider multiple hashes to improve the accuracy and reduce the collision.
- Also it's major point to consider with Time vs Space. So its better to do some research before we use it in the specific use case.
Applications
- Network Traffic Analyze.
- Monitoring.
- Data Streaming Algorithm.
- Compressed Sensing.
- NLP/ML
Reference
- https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
- Graham Cormode's CM Sketch: https://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf
Code Repo:
- https://git.epam.com/ajay_thangavelu/cm-sketch/-/blob/main/src/cm_sketch.py
- https://git.epam.com/ajay_thangavelu/cm-sketch/-/blob/main/src/cm_sketch_comp.py
Disclaimer:
This is a personal blog. The views and opinions expressed here are only those of the author and do not represent those of any organization or any individual with whom the author may be associated, professionally or personally.
Top comments (0)