Have you ever thought about how Instagram, Twitter, Facebook, or any other social media platform tracks who liked your posts? Let's figure it out in this post!
If you're just starting to work with databases, you might want to start off by reading my initial post, Database 101: Data Consistency for Beginners. It captures my own exploration of the many database paradigms out there, as I look far beyond my previous experience with just SQL and MySQL. I'm keeping track of my studies in this Database 101 series.
Table of Contents
- 1. Prologue
- 2. Let's Research
- 3. Researching Data Types
- 4. Properly Modeling
- 5. Final Considerations
1. Prologue
Recently, I was invited to speak at an event called "CityJS." But here's the thing: I'm the PHP guy. I don't do JS at all, but I accepted the challenge. To pull it off, I needed to find a good example to show how a highly scalable and low latency database works.
So, I asked one of my coworkers for examples, and he told me to look for high numbers inside any platform, like counters. At that point I realized that many types of metrics fit this example: likes, views, comments, follows, etc. could all be queried as counters. In this article, you will find my studies on how to do proper data modeling for these using ScyllaDB.
2. Let's Research
First things first, right? After deciding what to cover in my talk, I needed to understand how to build this data model.
We'll need a `posts` table and also a `post_likes` table that relates who liked each post. So far, that seems enough for our likes counter.
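As a sketch, the two tables could look something like this. The column names and key layout here are my assumptions, inferred from the queries shown later in the post:

```cql
-- Hypothetical schema sketch; names inferred from the queries in this post
CREATE TABLE socials.posts (
    id uuid,
    user_id uuid,
    description text,
    image_url text,
    created_at timestamp,
    PRIMARY KEY (id, created_at)
);

CREATE TABLE socials.post_likes (
    post_id uuid,
    user_id uuid,
    liked_at timestamp,
    PRIMARY KEY (post_id, user_id)
);
```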
My first bet for a query to count all likes was something like:

```cql
SELECT count(*) FROM socials.post_likes;
```

That can work, right?

Well, it worked, but it was not as performant as expected when I tested it with a couple thousand likes on a post. As the number of likes grows, the query becomes slower and slower...
"But ScyllaDB can handle thousands of rows easily… why isn’t it performant?" That's probably what you're thinking right now (or maybe not).
ScyllaDB – even as a cool database with cool features – will not solve the problem of bad data modeling. We need to think about how to make things faster.
3. Researching Data Types
Ok, let's think straight: the data needs to be stored, and we need the relation between posts and who liked them, but we can't use it for counting. So what if I create a new `likes` column of type `int` in the `posts` table and increment/decrement it every time?

Well, that seems like a good idea, but there's a problem: we'd need to track every change on the posts table, and if we start to INSERT or UPDATE data there, we'll probably create a bunch of nonsense records in our database.
With ScyllaDB, every time you need to update something, you actually create new data.
```cql
scylla@cqlsh:socials> INSERT INTO socials.posts (id, user_id, description, image_url, created_at, likes) VALUES (4d18bb8c-9c57-44fe-827a-4a2d65f331e5, 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129, 'Such a cool event P99 Conf!', 'https://i.imgur.com/Xp8gi7t.jpg', '2023-04-23 15:02:49', 1);
scylla@cqlsh:socials> INSERT INTO socials.posts (id, user_id, description, image_url, created_at, likes) VALUES (4d18bb8c-9c57-44fe-827a-4a2d65f331e5, 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129, 'Such a cool event P99 Conf!', 'https://i.imgur.com/Xp8gi7t.jpg', '2023-04-23 15:02:50', 2);
scylla@cqlsh:socials> INSERT INTO socials.posts (id, user_id, description, image_url, created_at, likes) VALUES (4d18bb8c-9c57-44fe-827a-4a2d65f331e5, 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129, 'Such a cool event P99 Conf!', 'https://i.imgur.com/Xp8gi7t.jpg', '2023-04-23 15:02:51', 3);
scylla@cqlsh:socials> SELECT * from posts;

 id                                   | user_id                              | created_at                      | description                 | image_url                       | likes
--------------------------------------+--------------------------------------+---------------------------------+-----------------------------+---------------------------------+-------
 4d18bb8c-9c57-44fe-827a-4a2d65f331e5 | 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129 | 2023-04-23 15:02:49.000000+0000 | Such a cool event P99 Conf! | https://i.imgur.com/Xp8gi7t.jpg |     1
 4d18bb8c-9c57-44fe-827a-4a2d65f331e5 | 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129 | 2023-04-23 15:02:50.000000+0000 | Such a cool event P99 Conf! | https://i.imgur.com/Xp8gi7t.jpg |     2
 4d18bb8c-9c57-44fe-827a-4a2d65f331e5 | 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129 | 2023-04-23 15:02:51.000000+0000 | Such a cool event P99 Conf! | https://i.imgur.com/Xp8gi7t.jpg |     3
```
You will have to track everything that changes in your data. So, for each increment there will be one more row, unless you keep the clustering key values unchanged and give up tracking timestamps (a really bad idea).
After that, I went into the ScyllaDB docs and found out that there's a type called `counter` that fits our needs and is also ATOMIC!

Ok, it fits our needs, but not our data modeling. To use this type, we have to follow a few rules, but let's focus on the ones that are causing trouble for us right now:
- The only other columns allowed in a table with a counter column are the columns of the primary key (which cannot be updated).
- No other kinds of columns can be included.
- You need to use UPDATE queries to handle tables that contain a counter column.
- You can only INCREMENT or DECREMENT values; setting a specific value is not permitted.
This limitation safeguards correct handling of counter and non-counter updates by not allowing them in the same operation.
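To make these rules concrete, here is a sketch of what the type allows and rejects. The table and column names are illustrative, not from the original post:

```cql
-- A counter table: only primary key columns plus counter columns
CREATE TABLE socials.example_counters (
    post_id uuid PRIMARY KEY,
    likes counter
);

-- Allowed: increment/decrement through UPDATE
UPDATE socials.example_counters SET likes = likes + 1 WHERE post_id = 4d18bb8c-9c57-44fe-827a-4a2d65f331e5;

-- Rejected: you cannot set a counter to a specific value
-- UPDATE socials.example_counters SET likes = 42 WHERE post_id = 4d18bb8c-9c57-44fe-827a-4a2d65f331e5;

-- Rejected: INSERT is not allowed on counter tables, only UPDATE
-- INSERT INTO socials.example_counters (post_id, likes) VALUES (4d18bb8c-9c57-44fe-827a-4a2d65f331e5, 1);
```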
So, we can use this counter but not on the posts table... Ok then, it seems that we're finding a way to get it done.
4. Properly Modeling
Since the `counter` type should not be "mixed" with other data types in a table, the only option left is to create a NEW TABLE to store this type of data.

So, I made a new table called `post_analytics` that will hold only `counter` columns. For the moment, let's work with only likes, since we already have a many-to-many relation (`post_likes`) created.
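A sketch of that table, assuming we only track likes for now (more counter columns, like views or shares, could be added later):

```cql
-- post_analytics holds only the primary key plus counter columns
CREATE TABLE socials.post_analytics (
    post_id uuid PRIMARY KEY,
    likes counter
);
```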
These next queries are what you will probably run for the example we created:

```cql
-- When a user likes a post
UPDATE socials.post_analytics SET likes = likes + 1 WHERE post_id = 4d18bb8c-9c57-44fe-827a-4a2d65f331e5;
INSERT INTO socials.post_likes (post_id, user_id, liked_at) VALUES (4d18bb8c-9c57-44fe-827a-4a2d65f331e5, 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129, '2023-04-23 15:02:50');

-- When a user unlikes a post
DELETE FROM socials.post_likes WHERE post_id = 4d18bb8c-9c57-44fe-827a-4a2d65f331e5 AND user_id = 3edd5f1d-67e9-4a3e-af1a-9adbb41e2129;
UPDATE socials.post_analytics SET likes = likes - 1 WHERE post_id = 4d18bb8c-9c57-44fe-827a-4a2d65f331e5;
```
Now you might have new unanswered questions in your mind, like: "So every time I need a new counter related to some data, I'll need a new table?" Well, it depends on your use case. In the social media case, if you want to store who saw a post, you will probably need a `post_viewers` table with a session_id and a bunch of other fields.

Having these simple queries that run without joins can be way faster than `count(*)` queries.
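Reading the counter back when rendering a post is then a single-partition key lookup. This is a sketch against the `post_analytics` table described above:

```cql
-- Fetch the like count for one post by its partition key
SELECT likes FROM socials.post_analytics WHERE post_id = 4d18bb8c-9c57-44fe-827a-4a2d65f331e5;
```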
5. Final Considerations
I learned a lot, not only by studying new ways of data modeling, but also by having to learn TypeScript to create the CityJS presentation and build this use case.
As everything is brand new to me, I'll do my best to keep sharing my studies. Please feel free to correct me in the comments! Discussion is the best way to learn new things.
Don't forget to like this post, follow me on the socials and fill your water bottle xD
Twitter DanielHe4rt PT-BR
Twitter DanielHe4rt EN
Twitch Channel
Top comments (64)
Sounds like an interesting approach.
Did you have to limit the number of likes from users? (e.g. each user is only allowed to 'like' once.)
Also, am I right in thinking that only having Increment and Decrement commands safeguards the data against concurrency issues (where one update might be 'lost' due to another overwriting it)?
Totally right.
Since it's just +/-, the current number doesn't matter at all. That's the reason the only non-key column type allowed in the table is counter.
Should be fine with concurrency... You'll still want a separate likes table for post-like and probably a user-like that references the posts liked by a user.
Scylla/Cassandra is not atomic across tables... And query performance isn't always great. So sometimes you'll want your own 1:n mappings both ways.
Ah, great idea - with the user-like table, you'd also be able to see how many times a user has 'liked' a post (if that was a requirement).
It's also nice for finding the posts a user looked at, from the user profile or activity feed.
I'm wondering why not just recount the total likes on each like update, and update the count field accordingly. Not as performant as increments, but feels safer for accuracy in case of issues under the hood. No idea if this way is done, but makes sense to me.
This is discussed in TFA. You wind up creating a new record for each update to a counts field in a posts table.
Having the separate analytics table, you can have atomic counters that effectively do the same. Yes, it's an n+1 request, but with horizontal scaling it's often faster than joins for a single view item.
I think you're misunderstanding. You wouldn't use joins in my case. It would save the value, it would just recalculate it instead of incrementing. After dealing with Firestore, I don't trust increments to be accurate.
Great write up!
I think that this is a good example on where streaming databases like Materialize can really help. You can essentially create a live materialized view and the result will be kept up to date as the data comes.
In traditional DBs, data is stored in tables matching the structure of the writes (inserts, updates) and all the computation work happens on read queries (selects). A streaming database moves the work to the write side: Instead of just storing data on writes, a streaming database asks for the queries upfront (in the form of Materialized Views) and incrementally updates results as input data arrives.
If the database we use doesn't have a counter type, is it possible to just use an integer type instead? I see from your explanation that the counter type is just an alias for the integer type, right? Does it affect the performance of our application if we use an integer type instead?
The idea behind the counter is to always use increment and decrement operations, e.g. item + 1 or item - 1.
Since it doesn't depend on the value stored in the last row, you can easily deal with concurrent connections without losing any data.
If you're using a classic RDBMS like Postgres/MySQL/MSSQL, then yes. There should be an atomic update to increment/decrement. Of course, an index on the likes table should make count(post_id) work well too.
Cassandra and ScyllaDB have differing behaviors for the benefit of horizontal scaling.
Thanks for the fun read! It makes me interested in checking out Scylla.
I do have a question that may not be related to the point of the article, but is there a reason you included the created_at field in the primary key of the posts table? I would think that the id field by itself would be enough for the PK.
At ScyllaDB we have a Partition Key (like a Primary Key) and Clustering Keys (ordering keys).
You should only query using the PK or CK to guarantee that your query will be fast, and both must be present in every row.
Also, I forgot to put user_id as a clustering key in this example :p
I'm puzzled why we don't simply add up the total amount of likes with each update and then edit the count column as needed. Though not as efficient as incrementing, this appears to be a more dependable means of ensuring accuracy, especially if there are unanticipated complications. I'm not sure if this strategy is already in use, but it sounds reasonable to me.
I wonder if it would be even better to only do that every... 30 or so likes. So basically, make a condition where if the count is divisible by 30, recalculate and overwrite the count. Best of both worlds? Or is the condition too expensive itself? At least it doesn't involve reading another table.
Your last model (posts, post_analytics, post_likes and possibly user_likes) reminds me of a presenter/aggregator architecture style from a presentation on event-driven systems. At scale, I would probably lean that way.
That would mean one service, a so-called frontend or an API, would publish a "userLikedPost" event when someone clicks like.
A userAggregator would calculate user stats and store which posts this user liked, while a postAggregator would do the same for posts. Different UIs would then query these two aggregators through an API gateway or similar.
This decoupling ties you less into finding one DB that fits all needs, at the cost of just that: decoupling and maintaining more services and tech stack choices.
But its interesting to think about 👍
Neat. I have never heard of this DB system before. I wonder if anyone has clever ideas for how to do this well in MySQL as it is not always practical to move to a different DB for a single use case, and MySQL doesn't have a good way to do this without table locking, I think 😱
If you have a huge system, you will probably work with multiple databases.
Scylla is a database that can handle monstrous amounts of data easily if you model it the right way.
We have a free cloud environment, give it a try!
Excellent article!
Water bottle filled. Good article, primo!
Thanks! Hope it helped you in some way <3