dev.to is a wonderful blogging platform that emerged a few years ago. I love writing for it and reading content published there. But what I like the most, and I think what everybody likes the most, is the community that was built on the platform.
A community is known to interact a lot with the poster through different kinds of likes and comments. There is no "karma" on dev.to, but one way to measure the popularity, the score, of a post is by looking at the number of interactions this post had with the community:
the number of comments and, of course, the number of likes, which on the platform are divided into 3 categories: Unicorn 🦄, Like ❤️ and Bookmark 🔖.
I recently wondered whether an article posted at a certain time of the day performed better than others and, if so, what the optimal time to publish a post was in order to be read by as many people as possible. I had some intuition, but I wanted proof and facts to work with.
Here is what I did:
Gathering the data:
I will be short here as I'll write a longer post in the future to explain in detail how to efficiently gather this type of data.
I recently noticed, looking at the DOM, that every article had a public id available.
I also knew that there is a public endpoint that allows you to fetch user information and looks like this:
http https://dev.to/api/users/<user_id>
So naturally I tried to do the same with an article and ...
http https://dev.to/api/articles/81371
HTTP/1.1 200 OK
{
"body_html": "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>The other day I was touching up a PR that had been approved and was about to merge and deploy it when, out of habit, I checked the clock. It was 3:45pm, which for me, was past my \"merge before\" time of 3:30pm. I decided to hold off and wait until the next morning. </p>\n\n<p>The whole process got me thinking. Does anyone else have their own personal merge or deploy policies? Is there a time before or after when you say, not today? Is there a day of the week you don't like to merge stuff. A lot of people joke about read-only Fridays, but I have to admit, I kinda follow that rule. Anything remotely high risk I wait until Monday to merge. </p>\n\n<p>What's your personal merge/deploy policy?</p>\n\n</body></html>\n",
"canonical_url": "https://dev.to/molly_struve/whats-your-personal-mergedeploy-policy-30mi",
"comments_count": 6,
"cover_image": "https://res.cloudinary.com/practicaldev/image/fetch/s--o6RV_02d--/c_imagga_scale,f_auto,fl_progressive,h_420,q_auto,w_1000/https://thepracticaldev.s3.amazonaws.com/i/oqg9vv4u2c5orc3y2n6n.png",
"description": "What's your personal merge/deploy policy?",
"id": 81371,
"ltag_script": [],
"ltag_style": [],
"path": "/molly_struve/whats-your-personal-mergedeploy-policy-30mi",
"positive_reactions_count": 13,
"published_at": "2019-03-22T22:19:36.651Z",
"readable_publish_date": "Mar 22",
"slug": "whats-your-personal-mergedeploy-policy-30mi",
"social_image": "https://res.cloudinary.com/practicaldev/image/fetch/s--MJYBx9D---/c_imagga_scale,f_auto,fl_progressive,h_500,q_auto,w_1000/https://thepracticaldev.s3.amazonaws.com/i/oqg9vv4u2c5orc3y2n6n.png",
"tag_list": "discuss",
"title": "What's your personal merge/deploy policy?",
"type_of": "article",
"url": "https://dev.to/molly_struve/whats-your-personal-mergedeploy-policy-30mi",
"user": {
"github_username": "mstruve",
"name": "Molly Struve",
"profile_image": "https://res.cloudinary.com/practicaldev/image/fetch/s--UrIkLrxe--/c_fill,f_auto,fl_progressive,h_640,q_auto,w_640/https://thepracticaldev.s3.amazonaws.com/uploads/user/profile_image/119473/9e74ee0e-f472-4c33-bfb4-79937e51f766.jpg",
"profile_image_90": "https://res.cloudinary.com/practicaldev/image/fetch/s--apWeHy1C--/c_fill,f_auto,fl_progressive,h_90,q_auto,w_90/https://thepracticaldev.s3.amazonaws.com/uploads/user/profile_image/119473/9e74ee0e-f472-4c33-bfb4-79937e51f766.jpg",
"twitter_username": "molly_struve",
"username": "molly_struve",
"website_url": "https://www.mollystruve.com"
}
}
And bingo!
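For readers who prefer a script over the command line, here is a minimal Python sketch of the same request. It assumes the endpoint still answers with the JSON shown above. The helper names (`article_url`, `extract_row`, `fetch_article`) are mine, not part of any dev.to client, and nothing touches the network until you call `fetch_article` yourself.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen

API_ROOT = "https://dev.to/api/articles"


def article_url(article_id):
    """Build the URL of the public article endpoint shown above."""
    return f"{API_ROOT}/{article_id}"


def extract_row(article):
    """Keep only the fields that matter for this analysis."""
    return {
        "id": article["id"],
        "published_at": article["published_at"],
        "positive_reactions_count": article["positive_reactions_count"],
        "comments_count": article["comments_count"],
    }


def fetch_article(article_id, timeout=10):
    """Return the decoded JSON for one article, or None on a 404."""
    try:
        with urlopen(article_url(article_id), timeout=timeout) as response:
            return json.load(response)
    except HTTPError as error:
        if error.code == 404:
            return None
        raise
```

Something like `extract_row(fetch_article(81371))` would then give one flat row for the article above, provided the API still behaves the same way.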
All I had to do now was: 1, find out whether article ids were sequential, and 2, if so, find the most recent article's id.
Both things were easy to check. I just had to open my browser inspector a couple of times on recent articles.
What I did next was call this API 94k times using Scrapy and store the information in a clean .csv file. More on this in a future post.
To do this I used ScrapingBee, a web scraping tool I recently launched.
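To give an idea of the shape of that crawl, here is a rough stdlib-only sketch. It is not the Scrapy/ScrapingBee setup I actually ran. `fetch` is a hypothetical callable that returns the decoded JSON of an article or `None` on a 404, and renaming the reaction field to `positive_reaction_count` is an assumption made to match the column name used later in this post. A real crawl would also need delays and retries.

```python
import csv
import io

FIELDS = ["id", "published_at", "positive_reaction_count"]


def crawl(first_id, last_id, fetch, out_file):
    """Walk the sequential ids and write one CSV row per published article.

    `fetch(article_id)` is a hypothetical callable returning the decoded
    JSON of an article, or None when the API answers 404.
    Returns (number of rows written, number of ids that were missing).
    """
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    written = missing = 0
    for article_id in range(first_id, last_id + 1):
        article = fetch(article_id)
        if article is None:
            missing += 1
            continue
        writer.writerow({
            "id": article["id"],
            "published_at": article["published_at"],
            # Renamed here (assumption) to match the column name used later.
            "positive_reaction_count": article["positive_reactions_count"],
        })
        written += 1
    return written, missing


# Tiny in-memory demo with a fake fetcher: even ids "exist", odd ids 404.
def fake_fetch(article_id):
    if article_id % 2:
        return None
    return {
        "id": article_id,
        "published_at": "2019-03-22T22:19:36.651Z",
        "positive_reactions_count": article_id,
    }


buffer = io.StringIO()
counts = crawl(1, 6, fake_fetch, buffer)
```

Keeping the count of missing ids next to the count of written rows makes it easy to see what share of the id range returned a 404.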
What do we have now?
Out of 94k API calls, almost half of them returned a 404: resource not found. I guess it means that half of the articles created are never published, but I am not sure about it. I still had ~40k data points, which was more than enough to prove my point.
Each row in my csv had multiple useful pieces of information, but for what I was looking for I only needed two things: the number of likes and the date of publishing.
Luckily, those two things were returned by the API: see positive_reactions_count and published_at in the previous snippet (the first one is called positive_reaction_count in the code below).
Enriching data
To work with the data I used pandas, a well-known Python library, and one of the most popular Python packages on GitHub.
I'll show some code snippets here; if you want a more thorough tutorial, please tell me in the comments.
Loading data from a CSV file with pandas is very easy:
import pandas as pd
df = pd.read_csv('./output.csv')
As I wanted to know the best time/day to post on dev.to, I needed to transform the published_at column into 2 other columns: day_of_week ('Mon', 'Tue', ...) and hour.
Pandas makes it easy to add, transform and manipulate data. All I needed were these few lines:
df['hour'] = pd.to_datetime(df['published_at']).dt.hour
days_arr = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
def get_day_of_week(x):
    # weekday() returns 0 for Monday ... 6 for Sunday
    date = pd.to_datetime(x)
    return days_arr[date.weekday()]
df['day_of_week'] = df['published_at'].apply(get_day_of_week)
All my data is now stored in a dataframe, the main data structure used by pandas, hence the name: df.
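As a side note, the same two columns can be derived without a Python-level `apply`, by parsing the column once and using the `.dt` accessor. This is only a sketch on two made-up rows (the first timestamp is the one from the API response above), not the code I ran on the full dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "published_at": [
        "2019-03-22T22:19:36.651Z",  # the article from the API snippet
        "2019-03-25T01:05:00.000Z",
    ]
})

# Parse the column once instead of once per derived column.
published = pd.to_datetime(df["published_at"], utc=True)

df["hour"] = published.dt.hour
# day_name() returns 'Monday', 'Tuesday', ...; keep the first three letters.
df["day_of_week"] = published.dt.day_name().str[:3]
```

On a few tens of thousands of rows the difference is small, but the vectorized version avoids calling `pd.to_datetime` once per row.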
A little bit of data viz
I now had all the information I needed.
Here is what was in my dataframe:
day_of_week | hour | positive_reaction_count |
---|---|---|
Thu | 0 | 4 |
Mon | 1 | 34 |
... | ... | ... |
Sun | 22 | 41 |
Thu | 17 | 9 |
Each line represents one post; I had around 38k lines.
What I naturally did next was sum positive_reaction_count by day and hour.
Here is how to do it in pandas:
aggregated_df = df.groupby(['day_of_week', 'hour'])['positive_reaction_count'].sum()
And now my df looked like this:
| day | hour | positive_reaction_count |
|---|---|---|
| Monday | 0 | 4110 |
| | 1 | 3423 |
| | 2 | 2791 |
| ... | ... | ... |
| | 22 | 4839 |
| | 23 | 3614 |
| ... | ... | ... |
| Sunday | 0 | 110 |
| | 1 | 423 |
| | 2 | 731 |
| ... | ... | ... |
| | 22 | 4123 |
| | 23 | 2791 |
Great. To get the data into exactly the format I need, a little more work is necessary: basically, rotating columns around.
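# keyword arguments: recent pandas versions no longer accept these positionally
pivoted_df = aggregated_df.reset_index().pivot(index='hour', columns='day_of_week', values='positive_reaction_count')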
Now my df looks like this:
hour | Mon | Tue | ... | Sun |
---|---|---|---|---|
0 | 4110 | 5071 | ... | 5208 |
1 | 3423 | 4336 | ... | 3230 |
2 | 2791 | 3056 | ... | 1882 |
... | ... | ... | ... | ... |
23 | 3614 | 4574 | ... | 3149 |
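If you prefer a single step, `pivot_table` can do the grouping and the rotation at once. The sketch below uses a handful of made-up rows just to show the shape of the result; it is an alternative to the groupby + pivot combination above, not what I originally ran.

```python
import pandas as pd

days_arr = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up rows, only to show the shape of the result.
df = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Tue", "Sun", "Mon"],
    "hour": [0, 0, 0, 1, 1],
    "positive_reaction_count": [4, 6, 10, 3, 7],
})

pivoted = df.pivot_table(
    index="hour",
    columns="day_of_week",
    values="positive_reaction_count",
    aggfunc="sum",
)

# pivot_table sorts the day labels alphabetically; reindex restores Mon..Sun
# and adds days that never occur in the data as all-NaN columns.
pivoted = pivoted.reindex(columns=days_arr)
```

Slots without any post end up as NaN rather than 0, so an empty hour stays distinguishable from an hour whose posts got no reactions.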
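And now, finally, I can use the seaborn package to display a nice heatmap.
import seaborn as sns
# pivot orders the day columns alphabetically; put them back in Mon..Sun order
pivoted_sorted = pivoted_df[days_arr]
sns.heatmap(pivoted_sorted, cmap="coolwarm")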
And here is what I got:
Analysis
I find this heatmap very simple and self-explanatory. Two regions stand out from the map: the red one, bottom left, and the dark blue one, top right.
But first, because we are talking about times, we need to know what timezone we are talking about.
If you look carefully at "published_at": "2019-03-22T22:19:36.651Z", you will notice a Z at the end of the time string.
Well, this Z indicates that the time string represents UTC time, or time zone Zero.
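Since every hour on the heatmap is a UTC hour, it can help to convert a timestamp to your readers' local time. Here is a small sketch with pandas, using the timestamp from the API response above. Picking New York as the example zone is my choice.

```python
import pandas as pd

published_at = pd.to_datetime("2019-03-22T22:19:36.651Z")  # tz-aware, UTC

# "America/New_York" is the IANA name for US Eastern time.
east_coast = published_at.tz_convert("America/New_York")

utc_hour = published_at.hour
local_hour = east_coast.hour
```

On that date New York was on daylight saving time (UTC-4), so the 22:19 UTC post went out at 18:19 local time. In winter the offset is 5 hours.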
So, going back to our heatmap, we notice that Monday to Wednesday afternoons (Monday to Wednesday mornings for people on the US East Coast) are the most active zones on the map.
And Saturday and Sunday are two very calm days, especially from midnight to noon.
So, at first sight, you could think that you had better post at those times to maximize your chances of getting many likes. Well, we need to step back a little.
What this heatmap shows is the time of day at which we observe the most likes in total. It does not take into account the fact that more posts automatically mean more likes.
So maybe (right now we can't know for sure) the red zone we see on the heatmap just means that we observe more likes on the platform only because more articles are being posted during those times.
This difference is critical, because what we are trying to find is the best time to post in order to maximize our likes, and this map can't help us with that.
So what we need is to make the same kind of map but, instead of counting the total number of likes during one hour for each day, we have to compute the mean number of likes per post.
We could also compute the median; I did it, and there is not much difference.
Thanks to pandas, we only need to change one small thing in our code:
# sum -> mean
aggregated_df = df.groupby(['day_of_week', 'hour'])['positive_reaction_count'].mean()
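To see why switching from sum to mean matters, here is a toy example with made-up numbers: a busy slot with many average posts, and a quiet slot with two well-received ones. Asking for `count` and `median` in the same call is an extra on my side, handy for checking how many posts sit behind each cell.

```python
import pandas as pd

# Made-up numbers: a busy slot with many average posts,
# and a quiet slot with a couple of well-received ones.
df = pd.DataFrame({
    "day_of_week": ["Mon"] * 5 + ["Sat"] * 2,
    "hour": [15] * 5 + [9] * 2,
    "positive_reaction_count": [10, 12, 8, 11, 9, 20, 24],
})

stats = (
    df.groupby(["day_of_week", "hour"])["positive_reaction_count"]
    .agg(["sum", "count", "mean", "median"])
)
```

Here the Monday slot wins on total likes (50 against 44) only because it holds more posts; per post, the Saturday slot does better (22 against 10).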
And here is the new heatmap:
As you can see, this heatmap is very different and much more useful than the previous one.
We now observe stripe patterns. There is a wide blue one spanning Monday to Sunday, from 4 a.m. to 10 a.m.
We also observe a peak of activity during the UTC afternoon.
What we can now state following this heatmap is that articles posted during the afternoon had, on average, 10~20 more positive interactions than the ones posted very early in the UTC day.
I think it is all about the reader/writer ratio, and what those two heatmaps show is that even though there are far fewer readers during the weekend, there are also proportionally fewer writers. This is why an article published during the weekend gets about the same number of interactions as an article published during the week.
Thank you for reading:
I hope you liked this post.
This series is far from over, I have plenty more information to show you related to this dataset.
Please tell me in the comments if you want a particular aspect of dev.to data analyzed, and don't forget to subscribe to my newsletter, there is more to come (and you'll also get the first chapters of my next ebook for free).
If you want to continue reading about some Python tips, go there, you might like it :).
If you like JS, I've published something you might like.
And if you prefer git, I got you covered.
Top comments (40)
One should check for one more fallacy: what if the same great authors always post at the same time and the author is the deciding factor for number of likes and not time?
Two methods come to mind:
Yes, you are totally right.
I think this kind of fallacy can also be caught by measuring the median number of likes across the day instead of the mean. I did it and did not find much difference in the results.
I'll try your point 1; I hadn't thought of that one.
And I plan to write an article about "famous" posters and their stats that will deal with point 2.
If we assume the opposite of "some people are great writers", namely "most people are bad writers", then there might be a situation where the majority write not-so-good articles that aren't affected by time of day. In such a scenario using the median would be less telling.
That said, most posts here are pretty well-made, so that shouldn't be a problem. If it were the case, then the median would be pretty much even across all times of day, so if you didn't see much difference, then that disproves it.
If we think that "great authors" post at the same time then, even if we cater for numbers of users, we're basically assigning qualities to longitudes. One uncausal correlation from that would tell us we'd get more likes if we moved to a different country.
I think this is a whole lot like the rocket equation. The more people try to target "best times", the more they'll skew the data away from themselves.
Nice catch there...
Very cool! You can probably refactor it to use the "paged" articles. This will give you a set of 30 public article IDs per page so you don't have to scrape 94k IDs directly. You can just loop through the ~500 pages
https://dev.to/api/articles/?page=468
to grab the IDs. A cool thing about dev.to is that, since it's open source, the API may not be documented but is available to view at github.com/thepracticaldev/dev.to/.... Can't wait to see more!
The articles API seems to return only articles with the default tags career, discuss, productivity
github.com/thepracticaldev/dev.to/...
I'd have to double check but I believe that's only during "onboarding" when a user first signs up. If not that's likely a bug in my mind.
I've made a RunKit to check whether the most recent react article was published after the most recent article and, if so, whether it is in the latest 30 articles.
runkit.com/alfredosalzillo/5c9ba0d...
The react article, published after the most recent article (of the dev.to/api/articles/), isn't in the articles list.
It may be a bug in the API.
Thanks, did not know about this endpoint. Will use it for my next one.
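The paged approach suggested in this thread could be sketched roughly as follows. `fetch_page` is a hypothetical callable standing in for a request to `https://dev.to/api/articles/?page=<page>`. Treating page numbering as starting at 1 and an empty page as the end of the listing are assumptions, not documented behaviour.

```python
def collect_ids(fetch_page, max_pages=1000):
    """Collect article ids from the paged listing.

    `fetch_page(page)` is a hypothetical callable that returns the decoded
    JSON list for https://dev.to/api/articles/?page=<page>.
    Stops at the first empty page (assumed to mean "no more articles").
    """
    ids = []
    for page in range(1, max_pages + 1):
        articles = fetch_page(page)
        if not articles:
            break
        ids.extend(article["id"] for article in articles)
    return ids


# Fake listing standing in for the real API: two pages, then nothing.
fake_pages = {
    1: [{"id": 81371}, {"id": 81370}],
    2: [{"id": 81369}],
}
ids = collect_ids(lambda page: fake_pages.get(page, []))
```

The `max_pages` cap is only a safety net so the loop cannot run forever if the API never returns an empty page.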
Very nice meta article!
I did a similar analysis of where dev.to users are coming from:
Where Are DEV Users Coming From?
🇫🇷 Boris Jamot
Wow, I'm definitely going to read this when I get a chance. Skimmed it.
Initial thoughts are that it's definitely affected by when we are awake and working because of how we schedule for Twitter etc. But we're always evolving and modifying the process so this could change.
I can't wait to read through this.
Thank you! I hope you'll find helpful insights.
Love this post! ❤️
Just throwing this out here, but ya might want to switch out one of your tags for #meta as this is definitely an option - dev.to/t/meta ... however, all of your tags are apt descriptions.
Again, totally digging this post!
Thank you very much!
Great tag suggestion, just added it.
Really nice article.
A question though, shouldn't it include the age of articles in the factors to check?
Not sure it has a big impact but an older article would potentially get more likes than a recent one.
While this concern sounds right, after thinking about it I think it is not much of a problem because:
1: By observation, I noticed that articles tend to only gather positive reactions during the first 4 days of posting
2: Even if that would not be the case, if we assume that the pattern had not changed in three years, then there are no problems giving more weight to older posts.
Those two assumptions do not seem very far-fetched.
And again, if we assume old post = more likes, we can always do the same heatmap with a median, which gives almost the same results.
A valid concern.
These tables tell me a different story.
Here is a serious question: Are these results persistent across weeks and across months?
That is an interesting question,
I thought about making a GIF of multiple heatmaps across days / times / months, hoping to discover changes of habit.
I'll post about it if the results show something interesting.
I know this is an old article but I'm wondering: have you got around to doing the comparison? I'm pretty sure there would be a noticeable difference due to daylight saving time shifts, if nothing else.
Hmm.. One further analysis might be the use of tags. You can analyse the average tags on reactions per article and time of publishing.
This will be useful as well cause I based my articles on the tags that I would use which usually is the top 10-20 tags to optimise the numbers of reach and view for me.
Thanks for your idea, I actually wanted to do this for my next one.
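One possible starting point for the tag analysis discussed here, as a sketch on made-up rows. The API response in the post shows `tag_list` as a plain string with a single tag; that several tags come comma-separated is an assumption.

```python
import pandas as pd

# Made-up rows; the comma-separated tag_list format is an assumption.
df = pd.DataFrame({
    "tag_list": ["discuss", "python, pandas", "python, webdev"],
    "positive_reaction_count": [13, 40, 20],
})

# One row per (article, tag) pair, with surrounding spaces removed.
tags = (
    df.assign(tag=df["tag_list"].str.split(","))
    .explode("tag", ignore_index=True)
    .assign(tag=lambda d: d["tag"].str.strip())
)

mean_by_tag = (
    tags.groupby("tag")["positive_reaction_count"]
    .mean()
    .sort_values(ascending=False)
)
```

Adding `hour` or `day_of_week` to the `groupby` keys would give the per-tag version of the heatmaps in the post.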
Fun project, coincidentally I did a heatmap project myself yesterday too.
I've written a discord bot for a server, and it stores the amount of messages sent per hour, so I made a heatmap over when the server is active.
To have it viewable online without using images, I decided to use plotly, which works great, but can be a bit annoying to use.
Write a post about it :) !
Great article!
I'd imagine well-written, popular articles have long tails -- they continue accumulating likes months or years after they are written.
It seems like the time of day matters less for the long-tail likes, as opposed to the initial burst of likes... although maybe time of day could be a catalyst (if the right people are skimming the recent articles and sharing with their social media for example, then time of day could be the difference between a long tail and no tail).