At Dataform, we use Google Datastore to store customer data. However, for various reasons, we need to move off Datastore and onto a self-managed database.
We store all of our data in protobuf format; each entity we store corresponds to a single protobuf message. Since we already store structured documents (as opposed to SQL table rows), MongoDB is a great fit for us.
Here's a simple example of a protobuf definition:
message Person {
  string first_name = 1;
  string last_name = 2;
  int64 birth_timestamp_millis = 3;
}
One of the major benefits of using protocol buffers as a storage format is that it's very easy to make changes to our database 'schema'. Renaming a field is as simple as editing the .proto file, and it's (usually, with some caveats) safe to change a field's type. By contrast, renaming a 'field' (column) in a traditional SQL table is usually a lot of work, involving some amount of database migration.
However, these changes are only safe if the data at rest is actually stored in the protobuf wire format, and binary protobuf is effectively impossible to query, since the database engine doesn't speak protobuf.
One solution to this problem is to just store messages in their canonical JSON format. However, we'd then lose the ability to make many kinds of changes to our protobuf definitions. For example, we'd never be able to (easily) rename fields: imagine we stored an instance of Person (as defined above) in JSON format, but then renamed birth_timestamp_millis to birthday_timestamp. The previously stored Person would now have an undecodeable birthTimestampMillis field, and would be missing a value for birthdayTimestamp.
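To see why, here's a minimal sketch of the failure using Go's protojson package, assuming a Person message regenerated from a .proto in which birth_timestamp_millis has been renamed to birthday_timestamp (the personpb import path is hypothetical):

package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protojson"

	pb "example.com/personpb" // hypothetical generated package
)

func main() {
	// JSON written before the rename, keyed by the old field name.
	stored := []byte(`{"firstName": "Ada", "birthTimestampMillis": "123456789"}`)

	var p pb.Person
	// With default options, protojson rejects the now-unknown field outright.
	if err := protojson.Unmarshal(stored, &p); err != nil {
		fmt.Println("decode failed:", err) // unknown field "birthTimestampMillis"
	}

	// With DiscardUnknown, decoding succeeds but the value is silently dropped.
	opts := protojson.UnmarshalOptions{DiscardUnknown: true}
	if err := opts.Unmarshal(stored, &p); err == nil {
		fmt.Println(p.GetBirthdayTimestamp()) // 0: the stored value is gone
	}
}

Either way, the data is effectively lost: we either fail to decode the document at all, or decode it without the renamed field's value.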
What we really want is the best of both worlds: we want to be able to store messages as JSON, so that it's possible to easily query the data; but we want stored data to be agnostic to the various kinds of backwards/forwards-compatible changes we might want to make to the protobuf definition.
Luckily, the MongoDB client libraries include a very helpful feature: they allow the user to define how data is encoded and decoded as it is written to and read from the database, using custom, user-defined codecs.
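To make that concrete, here's a rough sketch of how a codec plugs into v1 of the official Go driver (go.mongodb.org/mongo-driver). ProtoCodec here is a stand-in, not our actual implementation: its bodies just round-trip raw wire bytes as BSON binary, standing in for the tag-number encoding described below.

package main

import (
	"context"
	"fmt"
	"reflect"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/bsoncodec"
	"go.mongodb.org/mongo-driver/bson/bsonrw"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"google.golang.org/protobuf/proto"
)

// ProtoCodec implements bsoncodec.ValueEncoder and bsoncodec.ValueDecoder.
type ProtoCodec struct{}

func (ProtoCodec) EncodeValue(ec bsoncodec.EncodeContext, vw bsonrw.ValueWriter, val reflect.Value) error {
	msg, ok := val.Interface().(proto.Message)
	if !ok {
		return fmt.Errorf("ProtoCodec can only encode proto.Message, got %s", val.Type())
	}
	// Placeholder body: store the raw wire bytes as BSON binary.
	b, err := proto.Marshal(msg)
	if err != nil {
		return err
	}
	return vw.WriteBinary(b)
}

func (ProtoCodec) DecodeValue(dc bsoncodec.DecodeContext, vr bsonrw.ValueReader, val reflect.Value) error {
	b, _, err := vr.ReadBinary()
	if err != nil {
		return err
	}
	msg, ok := val.Interface().(proto.Message)
	if !ok {
		return fmt.Errorf("ProtoCodec can only decode into proto.Message, got %s", val.Type())
	}
	return proto.Unmarshal(b, msg)
}

func main() {
	rb := bson.NewRegistryBuilder()
	// Register the codec for anything implementing proto.Message.
	msgType := reflect.TypeOf((*proto.Message)(nil)).Elem()
	rb.RegisterHookEncoder(msgType, ProtoCodec{})
	rb.RegisterHookDecoder(msgType, ProtoCodec{})

	opts := options.Client().ApplyURI("mongodb://localhost:27017").SetRegistry(rb.Build())
	client, err := mongo.Connect(context.Background(), opts)
	_, _ = client, err
}

Once the registry is set on the client, every read and write of a proto.Message goes through the codec transparently.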
We have used this feature to define our own new codec, written in Go, which solves the protobuf storage problem for us. It encodes protobuf messages using their tag numbers as document keys, and uses standard encoding/decoding for each of the protobuf field values.
For example, given the following protobuf definition:
message Example {
  string string_field = 3;
  ExampleEnum enum_field = 10;
  oneof example_oneof {
    int32 int32_field = 78;
    int64 int64_field = 33;
  }
  NestedMessage nested_message = 107;
}
enum ExampleEnum {
  // proto3 requires the first enum value to be zero.
  EXAMPLE_ENUM_UNSPECIFIED = 0;
  VAL_0 = 1;
  VAL_1 = 573;
}
message NestedMessage {
  string nested_string_field = 2;
  int32 nested_int32_field = 1;
}
And the following instance of Example, in canonical JSON format:
{
  "stringField": "foo",
  "enumField": "VAL_0",
  // Note that this is represented as a string because the JavaScript number type is smaller than an int64.
  "int64Field": "123456789",
  "nestedMessage": {
    "nestedStringField": "bar",
    "nestedInt32Field": 12
  }
}
Our MongoDB codec will encode this instance of Example as the following Mongo BSON document:
{
  "3": "foo",
  "10": 1,
  "33": 123456789,
  "107": {
    "2": "bar",
    "1": 12
  }
}
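The heart of that transformation can be sketched using the reflection API from google.golang.org/protobuf: walk the message's populated fields, and key each value by its tag number. A plain map stands in for the BSON document here (the real codec streams BSON through the driver), and repeated and map fields are elided for brevity.

import (
	"strconv"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
)

// tagKeyed sketches the encoding: one entry per populated field,
// keyed by the field's tag number rather than its name.
func tagKeyed(msg proto.Message) map[string]interface{} {
	doc := map[string]interface{}{}
	msg.ProtoReflect().Range(func(fd protoreflect.FieldDescriptor, v protoreflect.Value) bool {
		key := strconv.Itoa(int(fd.Number())) // e.g. "107" for nested_message
		switch {
		case fd.IsList() || fd.IsMap():
			// elided for brevity
		case fd.Kind() == protoreflect.MessageKind:
			doc[key] = tagKeyed(v.Message().Interface()) // recurse into sub-documents
		case fd.Kind() == protoreflect.EnumKind:
			doc[key] = int32(v.Enum()) // enums stored by number, so renames are safe
		default:
			doc[key] = v.Interface() // scalars use their standard encoding
		}
		return true
	})
	return doc
}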
With this encoding, if we change the name of nested_string_field to something_else, or rename the enum value VAL_0 to BETTER_ENUM_VALUE_NAME, we'll still be able to decode the document without any loss of data.
This does make it slightly harder to query the database, since we now need to specify field numbers as opposed to human-readable field names. However, for production use, we have put a gRPC server in front of MongoDB which knows how to construct correct MongoDB queries, and for ad-hoc queries we plan to write a small translator which can do the same when given queries containing protobuf field names.
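For instance, a hypothetical ad-hoc query against the document above that matches on Example.nested_message.nested_string_field has to address the field by tag numbers ("107.2") rather than by name (coll is assumed to be a *mongo.Collection opened with the codec's registry):

// Equivalent to filtering on nestedMessage.nestedStringField == "bar".
filter := bson.D{{Key: "107.2", Value: "bar"}}
cursor, err := coll.Find(context.Background(), filter)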
The code is open-sourced here (godoc). Examples of how to use it in a MongoDB codec registry are in the tests. Please feel free to use it if it helps you!