Background
Let's say we have built a CRM application, which keeps track of the various metadata logged by users when they use one of our other apps. For example this data could be that the user is using an iPhone, that they have logged in, or that they are within the age range of 30-40.
If we have millions of users and consequently millions of pieces of metadata - what is an efficient way for us to find groups of our users that meet certain criteria (also called cohorts)? For example, say we want to send a newsletter out to all of our users who are within the age range of 30-40 and who have logged in today - how do we determine this group of users?
One solution that tends to be very memory-efficient is to use Redis bitmaps. The concept of bitmaps, specifically in regards to Redis, has been explained rather thoroughly quite a few times already so I won't go into depth here. If you are hungry for all the juicy details - I suggest the following well-known articles as a good place to start:
The tl;dr
though, is that every time a user registers and lets us know they are in the age range of 30-40, we can set a key-value pair in Redis like so: SETBIT('agerange#3040', user_id, 1)
.
Similarly, once the user logs in, we can set another key-value pair: SETBIT('login:today', user_id, 1)
.
Then, finding the cohort becomes a matter of performing a bitwise AND
operation on the two keys, like so: BITOP('AND', '3040logintoday', 'agerange#3040', 'login:today')
And just like that we have determined all the users we want to send our newsletter to!
At this point, you may be saying something along the lines of, "Ya, that's nice and all but Redis.. Bitmaps.. keys.. values.. Bitwise operations.. my head hurts already!". This is a totally understandable reaction - with a totally awesome rebuttal. Enter Bitmapist.
Bitmapist
Bitmapist is a Python library developed by the folks at Doist that simplifies working with Redis bitmaps. Further, they have even gone as far as also developing their own optimized Bitmapist server that implements a subset of Redis operations used by Bitmapist while relying on compressed bitmaps representation and only keeping hot dataset in memory to improve memory efficiency by 443 times!
What does setting all this up look like? Let's give it a spin!
First, let's install the bitmapist server
$ git clone https://github.com/Doist/bitmapist-server.git
$ cd bitmapist-server
$ go build
go: downloading github.com/artyom/autoflags v1.1.0
go: downloading github.com/artyom/resp v1.0.0
go: downloading github.com/golang/snappy v0.0.1
[...]
Once go
finishes building the server, you should see a bitmapist-server
executable in the same directory. You can run it as it is, or supply an optional -addr
option if you want it to listen somewhere other than the default localhost:6379
. Running the executable should result in some output like so:
$ ./bitmapist_server
2020/01/25 15:22:09 loading data from bitmapist.db
2020/01/25 15:22:09 loaded in 4.922263ms
2020/01/25 15:25:09 periodic saving...
2020/01/25 15:25:09 saved in 11ms
At this point you have your bitmapist server all set up! You can even use redis-cli
to connect to it:
$ redis-cli
127.0.0.1:6379> keys **
(empty list or set)
As we can see, we don't have any data in our server yet - let's change that by installing and running Bitmapist!
$ pip install redis bitmapist4
$ python
>>> import bitmapist4
>>> b = bitmapist4.Bitmapist('redis://localhost:6379', key_prefix='bitmapist__')
The key_prefix
parameter above ensures that every key you set with this session will be prepended with the value you choose, in this case bitmapist__
. We currently use this functionality as a means of compartmentalizing different applications. That is, we set the key_prefix to be the application's uuid
so that way we only deal with that application's data (key-value pairs when we perform our Bitmapist routines.
Let's perform the keys **
equivalent in Bitmapist:
>>> b.get_event_names()
[]
As expected, it's simply an empty list. Let's mark some events!
>>> b.mark_unique('agerange#3040', 1)
>>> b.mark_unique('agerange#3040', 2)
>>> b.mark_unique('agerange#2030', 3)
>>> b.mark_event('login', 1)
>>> b.get_event_names()
['_agerange#2030', '_agerange#3040', '_login']
So now we've told our server that user 1 and 2 is within the age of 30 and 40, user 3 is within the age of 20 and 30, and user 1 has logged in today. On the redis-cli
side, keys **
now returns:
127.0.0.1:6379> keys **
1) "bitmapist__agerange#2030"
2) "bitmapist__agerange#3040"
3) "bitmapist__login_2020-1"
4) "bitmapist__login_W2020-4"
5) "bitmapist__login_2020-1-25"
6) "bitmapist__login"
You may be wondering why there are extra keys in our server and also why we only stored the key login
and not login:today
. The answer to both these questions is part of what makes Bitmapist so awesome. When using mark_event()
, Bitmapist automatically stores a key with just the event name, a key with the event name and with appended date values for the current day, week, month, and year.
This in turn allows you to perform calls such as the following to find out if an event has taken place within a certain time period:
>>> b.YearEvents('login')
YearEvents("login", 2020)
>>> b.MonthEvents('login')
MonthEvents("login", 2020, 1)
>>> b.WeekEvents('login')
weekEvents("login", 2020, 4)
>>> b.DayEvents('login')
DayEvents("login", 2020, 1, 25)
>>> b.HourEvents('login')
HourEvents("login", 2020, 1, 25, 15)
This is great, as it significantly decreases our mental load when creating and managing our keys. Now, going back to our original problem, let's find all our users who are in within the age range of 30 and 40 and who have logged in today:
>>> cohort = b.BitOpAnd(b.UniqueEvents('agerange#3040'), b.DayEvents('login'))
>>> for user in cohort:
... print(user)
1
Okay, so it turns out we will only be sending our newsletter to user 1 today, which is a bit anticlimactic.. BUT the cool thing to note here is that Bitmapist also provides bitwise operation helper functions, such as BitOpAnd
used above. Alternatively, you can also use the operator itself like so:
>>> cohort = b.UniqueEvents('agerange#3040') & b.DayEvents('login')
Pretty cool, eh? That's really just the tip of the iceberg though; check out the project's github page for more examples and features, like the cohort table!
Bonus
What if you've read the gitub page, scoured the source code, and still feel like Bitmapist is missing something? Luckily, the people behind it seem to be pretty into clean and organized code, because customizing it is a breeze.
For example, we have use cases where we need to calculate populations of users like so:
- How many users were using an android device between the dates of 16-01-20 and 19-01-20?
- How many users were using an ios device between February and May?
Bitmapist doesn't provide a class to handle this type of event search, so I added my own! I accomplished this by editing only two of the project's files: bitmapist4/events.py
and bitmapist4/core.py
. First, in events.py
:
class RollingEvents(BaseEvents):
"""
Events over a specified amount of time.
Example:
date = datetime.strptime('19-02-19', '%d-%m-%y')
RollingEvents('event_name', 'day', date, 36)
"""
def __init__(self, event_name, scale, dt=None, delta=0):
self.event_name = event_name
self.scale = scale
self.dt = dt or datetime.datetime.utcnow()
self.delta = delta
events = []
if scale == 'day':
events.append(self.bitmapist.DayEvents.from_date(event_name, dt))
for i in range(delta):
events.append(self.bitmapist.DayEvents.from_date(event_name, dt).delta(-(i + 1)))
elif scale == 'week':
events.append(self.bitmapist.WeekEvents.from_date(event_name, dt))
for i in range(delta):
events.append(self.bitmapist.WeekEvents.from_date(event_name, dt).detla(-(i + 1)))
elif scale == 'month':
events.append(self.bitmapist.MonthEvents.from_date(event_name, dt))
for i in range(delta):
events.append(self.bitmapist.MonthEvents.from_date(event_name, dt).delta(-(i + 1)))
elif scale == 'year':
events.append(self.bitmapist.YearEvents.from_date(event_name, dt))
for i in range(delta):
events.append(self.bitmapist.YearEvents.from_date(event_name, dt).delta(-(i + 1)))
if events:
if len(events) == 1:
self.redis_key = events[0].redis_key
else:
or_op = self.bitmapist.BitOpOr(*events)
self.redis_key = or_op.redis_key
[...]
Then, in core.py
:
class Bitmapist(object):
[...]
def __init__(self,
[...]
kw = {'bitmapist': self}
self.UniqueEvents = type('UniqueEvents', (ev.UniqueEvents, ),
kw) # type: Type[ev.UniqueEvents]
self.RollingEvents = type('RollingEvents', (ev.RollingEvents, ),
kw) # type; Type[ev.RollingEvents]
[...]
With those changes, I could solve the first problem mentioned above by using the RollingEvents
class like I do any other:
>>> len(b.DayEvents('os:android').delta(-6))
66
>>> len(b.DayEvents('os:android').delta(-7))
240
>>> len(b.DayEvents('os:android').delta(-8))
319
>>> len(b.DayEvents('os:android').delta(-9))
255
>>> six_days_ago = datetime.datetime.utcnow() - datetime.timedelta(6)
>>> len(b.RollingEvents('os:android', 'day', six_days_ago, 3))
880
Great success!
Conclusion
Bitmapist and the Bitmapist server both make reasoning about and implementing bitmap-based datasets straight-forward and fun. They also lower the knowledge barrier required to start experimenting with such datasets, as they abstract away all the Redis commands that are typically quite daunting to newcomers.
Further, they are written well and comprise code easy to navigate, which provides for quick customization if needed.
Keep up the great work Doist!
Top comments (0)