Serverless Chats
Episode #44: Data Modeling Strategies from The DynamoDB Book with Alex DeBrie
About Alex DeBrie:
Alex is a trainer and consultant focused on helping people use cutting-edge, cloud-native technologies. He specializes in serverless technologies on AWS, including DynamoDB, Lambda, API Gateway, and more. He's an AWS Data Hero, the recently published author of The DynamoDB Book, and the creator of DynamoDBGuide.com. He previously worked at Serverless, Inc., where he held a variety of roles during his tenure, helped build out a developer community, and architected and built their first commercial product.
- Twitter: @alexbdebrie
- Blog: https://www.alexdebrie.com/
- DynamoDB Book: www.dynamodbbook.com (Discount Code: SERVERLESSCHATS)
- DynamoDB Guide: www.dynamodbguide.com
Transcript:
Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Alex DeBrie. Hey, Alex, thanks for joining me.
Alex: Hey, Jeremy. Thanks for having me.
Jeremy: So you are actually returning to Serverless Chats. You were my first guest, and you are also my first returning guest. So I don't know if that's an honor, but thank you very much for being here.
Alex: Yeah, it was an honor to be here the first time and honored to be back as well.
Jeremy: So a lot has changed since you were with me almost a year ago. You used to be working at Serverless, Inc., where they created the Serverless Framework. You went out, you started doing some consulting, you were named an AWS Data Hero. So why don't you tell the listeners a little bit about yourself and what you've been doing over these last few months?
Alex: Yep, sure. So as you mentioned, I used to work for Serverless Inc, creators of the Serverless Framework. That's how you and I got hooked up initially. Worked for them for about two and a half years. And then last fall I was named an AWS Data Hero, specifically focusing on DynamoDB, which was a big honor for me. And then in January I left Serverless Inc to go on my own to do a few different things, some consulting, some teaching, and also finish up this book I've been working on.
Jeremy: Yeah and so speaking about this book, I'm super excited about this because I remember we were out I think in Seattle at one point several months ago and I looked over your shoulder, I saw you typing and I asked, "What are you doing?" You're like, "Oh, I'm writing a book on DynamoDB, of course."
And obviously you created the DynamoDB Guide at dynamodbguide.com, which is a really great resource for anybody looking to get familiar with DynamoDB. It's much more approachable, I think, than the documentation on AWS. It's really well written, and there were a couple of modeling strategies in there and things like that. But this new book, which I've had a chance to read, is awesome, by the way, so congratulations, really, really well done. And just so you know, it is not DynamoDB Guide repackaged.
This is a whole new thing with tons of strategies, tons of information. So why don't you tell us a little bit about this book?
Alex: Yeah, sure thing. So as you mentioned, I created dynamodbguide.com. That was about two and a half years ago now. Basically, I'd watched Rick Houlihan's re:Invent talk over Christmas one year, and it just blew my mind. I rewatched it so many times, scribbling it out in Notepad to work out how it all worked, and then wanted to share what I learned. So I made this site, dynamodbguide.com.
That did pretty well. And I've stayed in touch with the DynamoDB team since then. But I really wanted to go further than that. Because I think, like you're saying, there's some stuff missing out there. So I've been working on this book for almost a year now.
I started, I think, last June or July. And really, I just wanted to go deep on DynamoDB, not just the basics: really introduce this idea of strategies, introduce some data modeling examples to show that you can really handle some complex access patterns. It's not just about key value storage, you can do complex relational data in DynamoDB.
Jeremy: Yeah, definitely. And so just in case somebody doesn't know what DynamoDB is, let's just give them a quick overview of what exactly that is.
Alex: Yep, sure. So DynamoDB is a NoSQL database offered by AWS. It's a fully managed database. I'd say it got started when Amazon.com's scaling needs were outgrowing their relational databases. So they built this underlying database to replace their relational databases. That was used internally at Amazon, and they released some of the principles behind it in the Dynamo paper.
That eventually became a service in AWS called DynamoDB. Fully managed service, works really well for highly scalable applications. In fact, all the tier one services at Amazon and AWS are required to use it. So if you think about the shopping cart or the inventory system or IAM or EC2, all that stuff that's all using DynamoDB under the hood. But also it's gotten really popular in the serverless ecosystem just because the connection model, the permissions model, the provisioning model, the billing model, it all works really well with everything we like about serverless compute.
So a lot of people have been using it there. And that's how I got introduced to it mostly and just wanted to go deeper on it and really use it correctly.
Jeremy: Yeah, right. And so one thing that's super important to remember is DynamoDB is NoSQL, right? It is not like your traditional RDBMS. There are no joins, right? You're not doing any of that sort of stuff. And there are reasons for it, obviously, it's a speed thing, and I did a whole episode, or actually I did two episodes, with Rick Houlihan himself and he went through a bunch of those things. So if you want to really learn or get a good audio overview, I guess, of DynamoDB, I suggest you go back and listen to those episodes, because I want to use your time today to actually go through a couple of things in the book that I found to be just really helpful, like things that I don't think pop out at you when you read the documentation.
And I've talked to so many people, because I love DynamoDB. I have my DynamoDB Toolbox. I'm working on a new version of it right now that I'm thinking is just going to make my life easier. Hopefully, it makes other people's lives easier. But I just use it so much. And the problem always is, I think, that a lot of people think it's just a key value store, right?
And it is to a certain extent, but there are ways to model data that are just, I mean, they're fascinating. It's absolutely amazing what you can do with some of these things. So, I'd love to point these things out. Because I think like I said, these are things that will not jump off the page at you when it comes to documentation. So the first thing that I think you did a really good job explaining was the importance of item collections. And this is something for me where I always think about them as folders with files in the folders and try to think about it that way. But you probably do a better job explaining it.
Alex: Thanks. I hope so. So, yeah, I introduced the concept of item collections and their importance pretty early on. I think it's in chapter two. And it was actually one of the solutions architects at AWS named Pete Naylor who turned me on to this and really made me key into its importance.
But the idea behind item collections is you're writing all these items into DynamoDB, records are called items in DynamoDB. And all the items that have the same partition key are going to be grouped together in the same partition into what's called an item collection. And you can do different operations on those item collections, including reading a bunch of those items in a single request.
So as you're handling these different access patterns, what you're doing is you're basically just creating these different item collections that handle your access patterns. And that can be a join-like access pattern. If you want to have a parent entity and some related entities in a one-to-many or many-to-many relationship, you can model those into an item collection and fetch all those in one request.
You can also handle different filtering mechanisms within an item collection, you can handle specific sorting requirements within an item collection. But you really need to think about, hey, what I'm doing is I'm building these item collections to handle my access patterns specifically.
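For a rough idea of what reading an item collection looks like in code, here's a minimal sketch in Python with boto3 (the table name, key attribute names, and values are hypothetical, not from the episode):

```python
import boto3

# Hypothetical single-table design: a customer item and its related items
# share the same partition key, so they live in one item collection.
client = boto3.client("dynamodb")

response = client.query(
    TableName="MyTable",  # hypothetical table name
    KeyConditionExpression="PK = :pk",
    ExpressionAttributeValues={":pk": {"S": "CUSTOMER#alexdebrie"}},
)

# One request returns the whole item collection: the customer item
# plus any related items written under the same partition key.
for item in response["Items"]:
    print(item)
```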
Jeremy: Yeah, and I think that is something where people, and I don't want to speak for other people, but these are the questions that I'm getting. And I'm sure you've gotten these similar questions where it's like, well, how do I join data? How do I represent things like, I don't know, one-to-many relationships and things like that? And you explain all that in the book. I actually want to talk to you about that a little bit later.
But I think this comes down to this idea of like you said, it's putting these things into collections, it's knowing which groups of items that you actually need to have available on the same partition, right?
So again, I'm probably not doing a great job explaining it, which is why you need to buy the book and read what you wrote. But I think this brings us to another thing, just this idea of partitions and consistency, right? Because there are, the way that this works is a lot different than your relational database lookup would work. So understanding partitions and consistency, I think is actually really important to properly modeling your data as well.
Alex: I think that's right. I mean, these are more underlying architectural concepts, but they really do help your data modeling and really your mental model of how DynamoDB works and how you should organize your data. So let's start first with partitions. If you're working with a relational database, generally all your data is going to be all together on one node, unless you're doing some pretty complex sharding strategy because you have really huge amounts of data.
But with DynamoDB, what they're going to do is they're going to try and split your data across a lot of different nodes. And each of those nodes are called partitions. And what they do is they use that partition key that we talked about to create these item collections. They use that partition key to find out which node that data should go to. So when you're writing an item to DynamoDB, they're going to look at that partition key, they're going to hash that partition key and then assign it to a particular node based on the value of that hash.
And then when you're reading data out of DynamoDB, what they're going to do is they're going to look at the partition key that you want to read from, they're going to look that up in their little hash table and figure out where you need to go to find that data. And what that does is it transforms things: if you have a 10 terabyte table, instead of having to scan that whole table, it chops it up into these little 10 gig chunks, makes an O(1) lookup in a hash table to figure out which node it belongs to. And now you're working on a much smaller amount of data, maybe 10 gigs or less there. And that makes for a lot more efficient operations down the road within that particular partition.
Jeremy: And then, what about consistency, though? Because I mean, that's the thing, whereas you start writing data across multiple nodes. If you, you know, there's the whole CAP theorem, right? I mean, if you are to access a node that hasn't been replicated to yet, then the data wouldn't be there.
Alex: Yep. Great point. So within each of those storage nodes, or those partitions that I was mentioning, there's actually going to be three copies of your data. So there'll be a primary and two secondaries of that data.
When that data gets written, it's going to write to the primary and one of the secondaries before it actually acknowledges that, that write was successful. And that's just for fault tolerance there to make sure that it's committed to a majority of those nodes.
Then after it's returned to you, they're also going to asynchronously replicate it to that third node, which is that second secondary. Now, when you're doing a read from DynamoDB, by default, it can read from any of those three nodes. So then you get into that situation where, if you've just written an item, and you do a read request, where it hits that third node, that second secondary, where it hasn't asynchronously replicated that yet, you have a chance of getting some stale data.
So this is, by default, DynamoDB gives you eventually consistent read guarantees, which means, that data is going to get there eventually, but you might not see the latest version of that data on a particular read.
Now, if you want to, you can opt into what's called a strongly consistent read, which says, "Hey, give me the most recent version of my data with all the writes included." If you do that, it's going to go straight to that primary node that has all those writes committed and read the data from there. You can guarantee you'll have all the latest writes.
So if you're doing a banking application or something like that, you might want to do that. But in a lot of cases that replication lag is pretty small, a couple hundred milliseconds or less, so usually you can handle that eventual consistency with DynamoDB.
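Opting in is a single flag on the read call. A minimal sketch with boto3 (table and key names hypothetical):

```python
import boto3

client = boto3.client("dynamodb")

# Default: eventually consistent read (may briefly miss the latest write,
# and costs half the read capacity of a strongly consistent read).
eventual = client.get_item(
    TableName="MyTable",
    Key={"PK": {"S": "CUSTOMER#alexdebrie"}, "SK": {"S": "CUSTOMER#alexdebrie"}},
)

# Opt in to a strongly consistent read: served from the primary node,
# guaranteed to reflect all acknowledged writes.
strong = client.get_item(
    TableName="MyTable",
    Key={"PK": {"S": "CUSTOMER#alexdebrie"}, "SK": {"S": "CUSTOMER#alexdebrie"}},
    ConsistentRead=True,
)
```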
Jeremy: Yeah, I think it's even less than 100 milliseconds. I think for GSIs it's something like sub-10 milliseconds or something crazy like that.
Alex: I mean, they're pretty good most of the time, yeah.
Jeremy: Yeah. So besides, like you said, the banking use case, which is interesting. I haven't personally found a lot of use cases for strongly consistent reads. Are there a lot of them out there? Because for me, I almost feel like I'm never just writing data and then immediately reading it back.
Alex: Yeah, I would say in most of my applications, I just use that default, eventually consistent read, which, if you do that, you're going to pay half the read capacity units for actually making that read, so it's cheaper if you opt into eventually consistent reads. And all your writes are going to be strongly consistent, because in that case, it's going to that primary node. So you don't have to worry about writing the wrong data in some sort of way.
So in most cases, for me, eventually consistent has worked, it depends on your application needs, but I think for a lot of people it works, especially given how small the lag is on that replication.
Jeremy: Yeah, makes sense. So the other thing that you talk a lot about in the book is this idea of overloading keys and indexes. And obviously, if you're doing a single table design, and you've got multiple entity types in there, you only have one PK and SK at least for the primary index. So you have to put different, maybe different types of keys or whatever in those or different types of identifiers in those. You might be reusing things like your GSIs, your GSI PKs and SKs and things like that. So what are some of the strategies for maintaining your sanity with that maybe?
Alex: Yep, sure. I mean, one thing I do is I make what I call an entity chart whenever I'm making an application. And what I do is I create my ERD that has my entities and relationships, all that and then I list out all the entities in a chart.
And then I have a column for PK and a column for SK, which are the two elements of my primary key. And as I'm working through modeling, I just write out: this is my customer entity, and it has this PK pattern, it has this SK pattern, and I write that down.
This is my order entity, it has this PK, this SK, and I write that down, all the way down the list. And if I'm adding secondary indexes, I add new columns: I have GSI1PK, GSI1SK. And I'm adding what those patterns are. So then when you get to the end, you have this chart that says, okay, this is the pattern I need for each of these entity types. This is how I'm going to handle all that stuff. And then you actually go into implementation and, for your items, you're just making those patterns and decorating your items with these GSI1PK, GSI1SK values as you need them.
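To make the entity chart idea concrete, here's a hypothetical sketch of one written down as key patterns (the entities and patterns below are illustrative, not a model from the book):

```python
# Hypothetical entity chart: one entry per entity type, one key pattern per column.
ENTITY_CHART = {
    "Customer": {
        "PK": "CUSTOMER#<username>",
        "SK": "CUSTOMER#<username>",
    },
    "Order": {
        "PK": "CUSTOMER#<username>",
        "SK": "ORDER#<order_id>",
        "GSI1PK": "ORDER#<order_id>",
        "GSI1SK": "ORDER#<order_id>",
    },
    "OrderItem": {
        "PK": "ORDER#<order_id>",
        "SK": "ITEM#<item_id>",
        "GSI1PK": "ORDER#<order_id>",
        "GSI1SK": "ITEM#<item_id>",
    },
}
```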
Jeremy: Yeah, so that is another really interesting thing that you point out. And this is something I talked to Rick Houlihan about, this idea of not reusing just any old attribute as secondary indexes or as the PK and the SK for your secondary indexes. And one of the reasons for this is you can just bang your head against the wall when you're trying to model data. And you say, "Okay, well, let's see, if I make my date the SK on GSI1, then that'll fit these patterns, but then I've got a bunch of extra data in there that maybe doesn't need to be there. Or maybe I need it to be different for a different access pattern."
So I really like... I don't think I articulated this well when I was talking to Rick about this, but you do a great job. You basically say just separate out your application attributes from your indexing attributes, I guess.
Alex: Yeah, so I split them into two buckets, like you're saying application attributes. And those are things that are meaningful in your application, in your code. So if you have a customer, the customer name, the date of birth, the address, all those things are what I call customer attributes, or application attributes that are useful in your application. I contrast that with indexing attributes, which are attributes that only exist on your items to index them properly in DynamoDB so you're putting them in the right item collection to handle your access patterns.
So this is going to be the PK and SK for every item, but also GSI1PK, GSI1SK. And for me, I say separate those completely. Don't try and cross the streams, as they say. If you have your PK and SK and even if your PK for that customer has the customer username or whatever in it, and I want to try and parse out the username from that PK and SK, I would just duplicate it, have a username attribute that has that application attribute in there, and don't worry about it.
And then also like you're saying, if the PK and SK pattern for your user is the same as your GSI1PK and SK pattern for that user, don't try and duplicate those across because like you're saying, you're really just going to tie yourself into knots trying to make sure that it works for all your different kinds of items that you're doing there.
Rather, I would just say duplicate those attributes over, even if it is the exact same value somewhere else, just add those over so you're not pulling your hair out trying to make these indexes fit.
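Here's a minimal sketch of what that duplication looks like on a single item (the attribute names and values are hypothetical): the username shows up inside the indexing attributes and again as its own application attribute, rather than being parsed out of the key.

```python
# Hypothetical customer item: indexing attributes are kept separate from
# application attributes, even when the underlying values repeat.
customer_item = {
    # Indexing attributes -- only used to place the item in item collections.
    "PK": {"S": "CUSTOMER#alexdebrie"},
    "SK": {"S": "CUSTOMER#alexdebrie"},
    "GSI1PK": {"S": "CUSTOMER#alexdebrie"},
    "GSI1SK": {"S": "CUSTOMER#alexdebrie"},
    # Application attributes -- the values your code actually reads.
    "Username": {"S": "alexdebrie"},  # duplicated rather than parsed out of PK
    "Name": {"S": "Alex DeBrie"},
}
```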
Jeremy: Yeah, no, I totally agree with this idea because I mean, I've built tables in the past that would use like this fancy nesting structure or hierarchical structure for like the SK. You'd use that hierarchy sort of pattern. And then what you try to do is parse those out when you want to read those with your application. And even if you've built a layer in between there, that does it for you, it's still... There's just a bunch of things that could probably go wrong, right?
So if you have state and county and city and things like that, and you're laying them into a hierarchy, because that's one of your access patterns, and you need to use that for sort. That's great, definitely do that. But I would like you said, store each one of those things separately so that at any point, you can one, easily parse them out of the table without having to worry about trying to figure out what your index pattern is, or what the sort key pattern is. But even more so then if you need to recreate that sort key at any time, you've got all the data right there that you can do that with.
Alex: Yep, absolutely. And one other thing in this vein that I used to do is, I would have these different access patterns, say on orders where I want to fetch orders by date for a particular customer. And what I would do is I'd create a new index based on those application attributes for my order. So I'd make the partition key for that index be the customer ID on the order and I'd make the sort key on that index be the order date, rather than using these more generic attributes like GSI1PK, GSI1SK.
And I would say just lean into those very generic indexing attributes there, put them all in there, and then it becomes very methodical and more scientific in how you're creating these item collections, rather than having all these ad hoc indexes all over the place with weird attributes where it's hard to understand what's going on.
Jeremy: Yeah, definitely. All right, so the next thing that I thought was really interesting, although it does make me question your sources, because apparently you've learned this, or you at least were inspired by something that I said, which is to add a Type field to every item in your table.
Alex: Yep, that's right. So I was actually complaining on Twitter about how difficult it is to export your DynamoDB items into an analytics system, because now you've got this single table that you need to renormalize into different things. And Jeremy's just like, "Hey, why don't you put a type attribute on every item that says, this is a customer, this is an order." And I'm like, "Okay, that makes it a lot easier." So yeah, the advice here is, on every single item that you're writing into that table, just include this type attribute that indicates what kind of item it is.
So whether it's a customer, whether it's an order or something like that. It's going to help you in a few different scenarios. Number one, if you're in the AWS console, just sort of debugging your table, it can be easier, rather than trying to parse your PK and SK pattern and figuring out how it translates to an item type, if you can just see that type there, and you say, "Okay, this is a user, this is an order," whatever that is.
Number two, if you're doing a migration, which is common, especially a migration, where you need to add new attributes to existing items. What you're often going to need to do is scan your table and find particular items that you need to update. And if you have this type attribute on there, you can use a filter expression to just get those types of items. And now you're again, not parsing out some weird PK and SK patterns to handle that. And then finally, I think that the place where it fits best is the one you said where you're exporting it for analytics, because DynamoDB is not great for OLAP type queries. These big aggregations of saying, "Hey, what items sold the most last week," or, "What's our week over week growth."
So something like that you'll export to an external system, whether that's something like Amazon Redshift, or whether that's S3 where you use Athena to query it. But then having that type attribute there is going to be really helpful as you filter down to the particular items you want, or maybe do additional ETL to split them out into different tables where you can then do joins in your relational analytics system.
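For the debugging and migration use cases, the Type attribute pairs naturally with a filter expression. A minimal sketch with boto3 (table name, type value, and attribute names hypothetical):

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical: find only the Order items in a single-table design by
# filtering on the Type attribute instead of parsing PK/SK patterns.
paginator = client.get_paginator("scan")
for page in paginator.paginate(
    TableName="MyTable",
    FilterExpression="#t = :type",
    ExpressionAttributeNames={"#t": "Type"},
    ExpressionAttributeValues={":type": {"S": "Order"}},
):
    for item in page["Items"]:
        ...  # e.g. update the item, or export it for analytics
```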
Jeremy: Yeah, and it's funny, because the way that I actually came up with adding the type thing, and I'm sure other people have thought of it as well. So it's not just me, but when I was building DynamoDB toolbox, I was thinking ahead to single table designs with multiple entity types, and then trying to figure out how to do the parsing of those types when they come back out.
Because again, you have a lot of overlapping attributes, right? You sometimes can't tell just from the PK and the SK exactly what type of item it is. So having that little bit of data is great, even if you build your own data access layer to say, "Hey, I need to take this item. I know it's a certain type. Maybe I want to maybe unmarshal it into some other format that's easier for me to understand." That is just a really, really great strategy to be able to do something like that.
Alex: Absolutely. Yep, totally agree.
Jeremy: Yeah. So speaking of strategies, this is another thing that I thought was really well done in the book, because you can learn everything you need to know about DynamoDB. You can know about consistent reads, you can know about partition keys, you can even understand how, yes, I can put all of these things into the same partition and all that kind of stuff. All the query capabilities, the condition expressions. I mean, there is a lot to learn with DynamoDB, just on the mechanics side of it. But when it comes to modeling, that is where I think there is just not enough information out there that gives you really, really good strategies to be able to do that.
And there's a really helpful document on the AWS site, the DynamoDB best practices guide. It gives a couple of overall points on some of that hierarchical stuff, some of the edge nodes type thing, and those sorts of strategies, but it really doesn't go into deep detail.
You actually like, I mean, you have a whole big long section just on strategies here. So maybe start by what's the importance of having good strategies when approaching modeling?
Alex: Yep, sure. This is something that I evolved to over time, because my original draft of the book was seriously going to be like six chapters of basics, and then like 20 chapters of examples, because I had all these little things that I wanted to show. And I'm just like, that's way too long. And what you see happen is there are all these little patterns that you see a few different times.
And the big thing for me is that this is very different than a relational database. With a relational database, there's generally one way to model your data and you make your ERD and then you actually put your data into your tables. You have your different table for each entity, you structure the relationships between them. And then you write your queries based on this one way that you've written your table.
You might add some indexes to help some things out. But generally, there's one way to do things in a relational database. With DynamoDB, it's different. You're going to model your data very much based on your application needs. So then there are different strategies and patterns for how you actually do it. And I go through that in the book in a few different chapters: there are one-to-many relationships, there are many-to-many relationships. There's sorting, there's filtering, there's migrations, and then just some additional grab bag strategies as well.
Jeremy: Yeah, so you actually outline a number of different strategies. So rather than just saying, "Oh, yeah, strategies are important," you actually go in and write about them. So I think there were five different ways that you outline to handle a one-to-many relationship. So I mean, I would love it if you could just give us a quick overview, and then again, there's a lot of information, so you have to dig into the book if you want to find out more about them, but just maybe an overview of each.
Alex: Yep, sure thing. And I think this is like the clearest way to explain strategies to people. Because there's one way to do strategies in a relational database. And it's that foreign key relationship. But like, when you tell people, there are five different ways, and here are the situations you might want to use them. I think that really opens up people's eyes.
So that first strategy, I call it denormalization plus using a complex attribute. DynamoDB has this notion of complex attributes where you can store a list of data or a map, or a dictionary of data, on an item directly. And so that's denormalizing your data. That gets you away from that normalization principle in relational databases, because now you're storing multiple elements in a single attribute. But the example I give here is imagine you have a customer in an E-commerce store and they are going to have multiple mailing addresses they want to save. They want to have their home address, their business address, their parents' address because they send them a gift sometimes.
What you might do, instead of splitting that out into different items, if that's small enough, you can actually just put that as a complex attribute directly on that item. And then whenever you fetch that user, you'll get all those saved mailing addresses with them that you can show in the context of their user profile, or their order checkout page, or whatever that is.
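A sketch of what that denormalized item might look like (names, values, and shape hypothetical): the saved addresses live as a map attribute directly on the customer item.

```python
# Hypothetical customer item with mailing addresses denormalized into a map attribute.
customer_item = {
    "PK": {"S": "CUSTOMER#alexdebrie"},
    "SK": {"S": "CUSTOMER#alexdebrie"},
    "Username": {"S": "alexdebrie"},
    "Addresses": {
        "M": {
            "Home": {"M": {"Street": {"S": "111 1st St"}, "City": {"S": "Omaha"}}},
            "Business": {"M": {"Street": {"S": "222 2nd St"}, "City": {"S": "Omaha"}}},
        }
    },
}
```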
Jeremy: Right. But that strategy works great for very, very small bits of data. Because one thing is that, again, you pay every time you read data from your table. So if you have 100 addresses stored in there, and all you're trying to do is just get the customer name, then you're loading a lot of extra data and paying for that read that you don't need. But I like that strategy for very small groups of data.
Alex: Yep, yep, absolutely. So that one works, yeah, if you don't have any access patterns on that related data itself, and also the number of related items is going to be small. So like you're saying, if you can limit them and say, "Hey, you can only save 10 addresses," that works well here.
Second pattern that is in the book is, I call it denormalization by duplicating data. And again, this is denormalization because in relational databases, you don't want to be duplicating data. If you do you split that into a different table and sort of refer to it via these foreign key relationships.
But the example I use here is maybe you have movies and actors or books and authors or anything like that, where on each of those book items, maybe you replicate some information about the author, such as the author's name, the author's birthdate, things that especially aren't going to change. Stephen King's not going to get a new birthdate. So rather than referring to, or having to join up with, his author profile every time, you can just store it directly on that book and show that data if you want to.
So again, you're getting away from the normalization that you would have in a relational database, but it's okay in this particular situation.
Jeremy: Yeah, and denormalization works when it is something that is essentially immutable, right? Like you said, the birthday is not going to change.
Alex: Yeah.
Jeremy: Let's say for some reason that somebody changes their name or something that you think isn't going to change. There's still plenty of strategies to go back and clean that up if you need to.
Alex: Yeah, absolutely. If it's one off updates, you know this duplication can work and maybe you just have to handle those one off updates. But if you actually have consistent access patterns where you're going to be needing to update information about that parent item, then you want to use some different strategies.
The two most common strategies I see for one-to-many relationships are the primary key plus the Query operation, and the secondary index plus the Query operation. And both of these rely on that concept of item collections that we were talking about before, where you're assembling different items into a single item collection using that same partition key. And then you can use this Query operation that lets you fetch multiple items in a single request. And you're basically pre-joining your data.
So imagine you have an access pattern where you want to fetch a customer and the customer's most recent orders, which would be a join in your relational database, but joins aren't as efficient once you scale. So DynamoDB doesn't have joins. So what you do is you're pre-joining them for your access pattern into this particular item collection to handle that access pattern that you have.
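A rough sketch of that pre-joined access pattern with boto3 (table name, key patterns, and the Type attribute values are hypothetical): the customer item and its order items share a partition key, so a single Query returns the parent and its children together.

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical pre-join: the customer item and its order items are written
# with the same partition key, so one Query returns parent and children together.
response = client.query(
    TableName="MyTable",
    KeyConditionExpression="PK = :pk",
    ExpressionAttributeValues={":pk": {"S": "CUSTOMER#alexdebrie"}},
)

# Split the item collection by entity type (using the Type attribute from earlier).
customer = [i for i in response["Items"] if i["Type"]["S"] == "Customer"]
orders = [i for i in response["Items"] if i["Type"]["S"] == "Order"]
```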
Jeremy: Yeah, and then what about the fifth strategy?
Alex: Yeah, the fifth strategy is kind of an interesting one. This one's called a composite sort key. So usually you're using what's called a composite primary key where you have a partition key and a sort key making up that composite primary key.
You can actually use this composite sort key pattern where you're encoding multiple values into that sort key to represent a hierarchy. And this works really well if you have multiple levels of hierarchy, and you want to query across those levels. And maybe sometimes at a high level, sometimes at a low level.
The example I always give here is think about store locations, right? Starbucks locations, where there's a hierarchy. Starbucks has locations in different countries, the US versus France versus Mexico, and maybe that's your partition key. But then, within your sort key, they have locations in different states, in different cities, in different zip codes. So maybe you'd encode each of those into that sort key. And now you can search at any level in that hierarchy.
You can search for a particular state, you can find all the Starbucks within New York State. You can find all the Starbucks within New York City or you can find it within a particular zip code to get all those Starbucks as you need it. And then it allows you to be querying at these different sort of levels of granularity as you need it.
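A sketch of that hierarchical query with boto3 (the table name and key shapes are hypothetical): the sort key encodes state, city, and zip, and begins_with narrows the query to whatever level of the hierarchy you need.

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical item shape: PK = "COUNTRY#USA",
# SK = "<state>#<city>#<zip>", e.g. "NY#NEW YORK#10019".
# begins_with on the composite sort key narrows to any level of the hierarchy.
response = client.query(
    TableName="StarbucksLocations",  # hypothetical table
    KeyConditionExpression="PK = :pk AND begins_with(SK, :prefix)",
    ExpressionAttributeValues={
        ":pk": {"S": "COUNTRY#USA"},
        ":prefix": {"S": "NY#NEW YORK"},  # all stores in New York City
    },
)
```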
Jeremy: Yeah. And that's a super powerful use case. And I actually find that works really well sometimes for secondary indexes too, because it's easy to write that data in; you might have a different type of identifier for the primary index, but then for that secondary index, you just copy that in.
And of course, we didn't even talk about projections. That's a whole other thing to talk about. But that's something to think about too when you're building your secondary indexes.
Alright, so those are the one-to-many relationships, but I think we see quite a few many-to-many relationships in different types of data and different models, and you give four strategies to handle those.
Alex: Yep, sure thing. So yeah, like you said, there are four different many-to-many relationship strategies that I outlined. The first one is one that I call shallow duplication. So the example I have here is imagine you have students in classes, where a student can be in multiple classes, but also a class has multiple students that are enrolled. You can think about your access patterns, and for each side of those access patterns think, what do I need to show about these related items?
So if you're showing that class, maybe you're showing information about the class, the name, the teacher, the code, whatever that is, and you also just have a list of students in your class, but you don't need all the information about the students, maybe you just need the student name, and then it has a link to that student, and you can go click on them and go over there if you want to find more information about the student.
If that's the case, we'll do something similar to what we did in that one-to-many relationship section with that denormalization, that complex attribute, and we'll just store a list or a map on that parent entity in that many-to-many relationship. On that class item you just have a students attribute, that could be a list, that could be a map, whatever it is, and it just has the list of students in there.
And you don't need to know their GPA or their graduation date or any of that other stuff that could change. So you don't have to worry about updating all those different items. That's shallow duplication. That works sometimes, again, think about your access patterns and what you're going to need there.
The second one, and the more common one that I see, this is what I see mostly, is one that's called the adjacency list. And basically what you're doing is you're representing each side of that relationship in a different secondary index or a different primary key. So on that main table, in your primary key, you handle one side of the relationship. I think the example I use here is movies and actors, where movies have many actors in them, and actors can be in many different movies.
So I have three different entity types. I have the movie, I have the actor, and then I have a role item that actually represents an actor in a particular movie. And what you do in that primary key is set it up so that you can handle one side of that relationship.
So maybe you put the movie and the role items all together. So it allows you in that single item collection to use that query and say, give me the movie, and give me all the roles in that movie, which is going to give you the actors and actresses that played roles in that movie. And then you have a secondary index where you're flipping that and you're putting all the actor or the actress and all the roles that they've been in, in the same item collection as well.
So now when someone clicks on Toy Story, they can hit that primary key and get Toy Story and all the roles in that one, or when someone clicks on Tom Hanks, and they can go to that secondary index, get Tom Hanks and all the roles he's played as well.
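A sketch of those adjacency-list items (the key patterns and attribute names are hypothetical): the role item carries both sets of keys, so the base table groups it with the movie and GSI1 groups it with the actor.

```python
# Hypothetical adjacency list items for a movies/actors many-to-many relationship.
movie = {
    "PK": {"S": "MOVIE#Toy Story"}, "SK": {"S": "MOVIE#Toy Story"},
    "Type": {"S": "Movie"},
}
actor = {
    "PK": {"S": "ACTOR#Tom Hanks"}, "SK": {"S": "ACTOR#Tom Hanks"},
    "GSI1PK": {"S": "ACTOR#Tom Hanks"}, "GSI1SK": {"S": "ACTOR#Tom Hanks"},
    "Type": {"S": "Actor"},
}
role = {
    # Base table: grouped with the movie -> "get the movie and all its roles"
    "PK": {"S": "MOVIE#Toy Story"}, "SK": {"S": "ACTOR#Tom Hanks"},
    # GSI1: grouped with the actor -> "get the actor and all their roles"
    "GSI1PK": {"S": "ACTOR#Tom Hanks"}, "GSI1SK": {"S": "MOVIE#Toy Story"},
    "Type": {"S": "Role"},
    "RoleName": {"S": "Woody"},
}
```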
Jeremy: And it's one of those things that I think is really hard to visualize in your head, and even hard to visualize if you're trying to do it in Excel or Google Sheets or something like that. And I know you love this tool, and I have been using it quite a bit recently: NoSQL Workbench for DynamoDB, which allows you to create those indexes, put some sample data in there, you can even use facets to define the different entity types, and then be able to flip that data so you can see how those relationships work.
So if you were totally confused by what Alex just said, it's not uncommon, because it is hard to think about switching that. But yeah, so that's awesome. So what else?
Alex: Yep. So the next one is probably the hardest thing for me to wrap my head around in the entire DynamoDB ecosystem. This is called the materialized graph. And this is if you have a lot of different node types or different types of entities and a lot of different relationships between those nodes, and you want to be able to really query across those.
What you do here is, in your primary key, for a particular node, you put a bunch of different items into an item collection, where each item in that item collection represents a relationship that that node has to something else.
So for example with me I might have an item that represents, I have a relation to a particular date, which is my anniversary date to my wife. So that's a particular item.
Jeremy: And you don't want to forget that.
Alex: Exactly, I'll get in trouble if I lose that one. So I also have another node that represents where I live. And I have another node that represents my particular job. And so I've broken me as a person and all these different relationships that I have. And then I have a secondary index, where now maybe I'm grouping together all these different relationships into other things. So I can see the nodes and relationships to them.
So if I look at that date, that anniversary date of mine, I can see all the people that are related to it, and maybe that's someone else's birthday. But also, my wife has an entry in there too, it's her anniversary date as well. And we can see both of those relationships there. Or I can see all the people that are also developers. I can see all the people that live in Omaha.
So it allows you to query against that and say, what is this node, I can get a full look at that node if I want to, but also who has relationships with that node. And if I want to go find information about those nodes, then I go back to that primary key and query and get that full information about that node.
So, it's tough. I'd say if you're deep into graph theory, that can be good. If you have a very highly relational model where you want to represent all these relationships, that can help. But I don't even have a great example for it in the book because it kind of twists my mind so much.
Jeremy: And there's Amazon, was it Neptune?
Alex: Yeah, probably use a graph if you're really going to be going down this road, so.
Jeremy: Right. Was that all the strategies?
Alex: There's one more, there's one called normalization and multiple requests. And this one sounds anathema to DynamoDB. Because DynamoDB is about denormalization. It's about trying to satisfy these access patterns in a single request. But sometimes it's pretty tricky.
And I think especially with many to many relationships, where if you need to fetch all the related items, and those related items can change, it can be hard to handle that in a single request consistently.
So with this, when you sort of normalize your data, you might have pointers back to other items you need to go fetch, but then you need to make a follow-up request to actually fetch those items.
And the example I use here is Twitter and the people that I'm following, right? I can list the people I'm following on Twitter; I can follow multiple people, and a person can be followed by multiple people. But when I'm viewing all the people that I'm following, I need to see their latest profile picture and their latest profile description, and their name and all that stuff, which can change a lot.
And that can be hard to handle if you're duplicating that data a ton. So instead, what I'll just say is, hey, I happen to follow this user; I get that list of people that I'm following, and then I make a follow-up request to Dynamo to get all the information about them. Their profile picture, their name, their profile description, all those things.
Jeremy: Yeah. And that's one of those strategies too where you'd rather not have to use it, because you don't want to make all those different requests. And I think this is another thing that is super important, especially from a read perspective: the idea of caching, right? I mean, if you are fetching data like somebody's profile from Twitter or something like that, and you've got that stored, the more you can cache that kind of data so that you don't have to pay for that round trip to the database to get it, I think, is really, really helpful. And then just having good timeouts, or making sure that the data doesn't get stale, and things like that, are great ways to do it.
So, alright, so those are awesome and definitely, I mean, check them out. Because these are things that are probably really hard to wrap your head around without seeing those examples. You give examples of these things. And it just makes it so much easier to digest. So that's really great stuff.
Alright. So the other thing that you talk a lot about in the book, and this is something where, again, super powerful when it comes to secondary indexes, is this idea of sparse indexes.
Alex: Yep, I think sparse indexes are one of my favorite patterns and strategies to use. So just to add some background: if you create a secondary index, DynamoDB is going to replicate that data from your main table into that secondary index with this reshaped primary key that enables different access patterns, grouping it into different item collections. As it's doing that, Dynamo is only going to replicate an item into that secondary index if the item has all the elements of your secondary index's primary key.
So if you've got this GSI1PK, GSI1SK defined on your secondary index, but then you write an item that doesn't have those attributes, it's not going to replicate it into that index.
So what happens is, you actually end up with sparse indexes by default, because some of your entities won't have additional access patterns. So you won't be copying these entities into those secondary indexes. But sometimes you want to use a sparse index strategically. And what that sparse index is, is you're filtering out items specifically to help you with an access pattern. So you're applying a filter as part of that index. And there are two different strategies here. One is when you're filtering within a particular entity type based on some attributes about that entity. So the example I have in the book is imagine you have an organization with a bunch of users in it, and a small number of those users are administrators for the organization. And say you have an access pattern where you want to get all the administrators.
Now, you could do that query on your primary key and get all users and filter through that, but maybe you have 3000 users, and that's going to take a long time. So instead, what you do is on the users that are admins, you add these attributes, GSI1PK, GSI1SK. Now only those administrators will be copied into that secondary index. And you can use that access pattern to say, "Okay, give me all the admins for this organization." You can do that quickly without getting all the users that aren't admins.
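A sketch of that pattern with boto3 (table name, index name, and key patterns hypothetical): only the admin items get the GSI1 attributes, so only they are replicated into the sparse index, and the query reads admins alone.

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical: regular users get no GSI1 attributes, so they never land in GSI1.
regular_user = {
    "PK": {"S": "ORG#acme"}, "SK": {"S": "USER#jane"},
    "Type": {"S": "User"}, "Role": {"S": "Member"},
}

# Admin users do get GSI1PK/GSI1SK, so GSI1 becomes a sparse "admins only" index.
admin_user = {
    "PK": {"S": "ORG#acme"}, "SK": {"S": "USER#alexdebrie"},
    "Type": {"S": "User"}, "Role": {"S": "Admin"},
    "GSI1PK": {"S": "ORG#acme"}, "GSI1SK": {"S": "USER#alexdebrie"},
}

for user in (regular_user, admin_user):
    client.put_item(TableName="MyTable", Item=user)

# "Give me all the admins for this organization" -- only admin items live in GSI1.
admins = client.query(
    TableName="MyTable",
    IndexName="GSI1",
    KeyConditionExpression="GSI1PK = :pk",
    ExpressionAttributeValues={":pk": {"S": "ORG#acme"}},
)
```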
Jeremy: Right, and I mean, that has so many use cases, right? If you're looking for error conditions, or orders that are in a certain state, or tickets that have risen to some severity level, or things like that. The copying of that data is very inexpensive, and you don't have to copy all of it, right, again, back to the projections thing. But you copy a tiny bit of data over that just gives you quick access to that list. And yeah, I mean, I actually love that strategy.
Alex: Yep, I think it's so cool. The great thing about this, since you're just filtering within a single entity type, you can use this with that overloading keys and indexing strategy. So you can have different entities that you're also copying into that secondary index and handling different access patterns on. So it works really well, it keeps it efficient, and limits the number of secondary indexes that you're using.
Now, the second pattern with sparse indexes is where you're projecting only a single entity type into that index. And the example I use here is imagine you have an E-commerce store. And you have a bunch of item types in your table. You have customers, you have orders, you have inventory items, all that different stuff.
Now your marketing team says, "Hey, occasionally, we want to find all the customers and send them an email because we're doing some cool new sale." And you're like, "I don't want to scan my 10 terabyte table just to find those customers because I have way more orders, I have way more items than I have actual customers, use that sparse index here." So what you're doing on this one is on each of those customer items, you add some attribute. It could just be customer index, or whatever you want to do there, just on those customer items, you make a secondary index that uses that attribute. And now in that secondary index, the only items that are getting put in there are those customers.
So you don't have orders, you don't have inventory items, any of that stuff. You can run a scan directly on that secondary index, you'll get just the customers and you can send them all an email or whatever you want to do, and that works really well. The difference with that previous strategy is that you can't use this with that overloaded keys and indexing strategy because you can't be jumbling up other items in there and handling other access patterns. The point of this index is to only hold those types of items so that you can find all of them as we need to.
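For the read side of that single-entity sparse index, a minimal sketch with boto3 (the index name and table name are hypothetical): a Scan against the customer-only index touches nothing but customer items.

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical: only customer items carry the attribute used as this index's key,
# so the "CustomerIndex" secondary index contains customers and nothing else.
paginator = client.get_paginator("scan")
for page in paginator.paginate(TableName="MyTable", IndexName="CustomerIndex"):
    for customer in page["Items"]:
        ...  # e.g. queue an email for the marketing campaign
```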
Jeremy: Right, and maybe even a better way to handle that use case would be to use DynamoDB streams every time a customer record changes, push that into Marketo, or whatever your marketing software is, and do it that way.
I mean, that's I think another thing too, where we need to think about what are the use cases when you're accessing DynamoDB, right? Because if your use case is I want to get a list of all my customers, how often do you get that list of all your customers, and that's something that you need the power of DynamoDB to do. Especially, you can only bring so much data back, I think it's one megabyte per get request... or sorry, per query. And so you're going to be limited to how much data you can pull back, and then you got to keep looping through it. And that's kind of a pain.
Whereas just dumping that data for internal dashboards that you're going to use, put it in MySQL or put it in Postgres or something like that, and just replicate off of that stream. I love using those strategies, and try not to overthink it. I try to think of, if I have thousands and thousands of users banging up against my table, what do they need to do quickly? And that's what I try to optimize for. But certainly, I mean, again, those strategies you mentioned are useful for user facing things as well.
Alex: Yep, absolutely. Good point.
Jeremy: Alright, so another thing that I think people get hung up on with DynamoDB is that I mean, you are limited with your querying capabilities, right? I mean, you can't really... calling it a query is probably the wrong thing.
I mean, essentially, what you're doing is you're requesting a partition, and then you have some little bit of control over that sort key. I think the best one you have is begins_with, right? That's the search capability. Other than that, though, you're looking for values that are between a certain range, or you're looking at a value that's greater than or less than, or something like that.
So you are a bit limited in your querying capabilities when it comes to that. But one of the things, again, where you see this use case quite a bit is needing to sort by date, right? I mean, that's just an obvious thing. But the problem is that if you just put a date in as your sort key, it's always a tough way to access it, because you're like, "All right, I need to know the date that I put the item in there. Otherwise, I don't have a way to access it."
So it's not as easy as saying, "Oh, I need customer one, two, three," for example. But there are some different ways that you can do that. And you mention in the book this idea of K-sortable unique IDs, or KSUIDs. And so can you explain that a little bit?
Alex: Yep, sure. So KSUID, this was created by the folks at Segment and it's pretty cool. And like you're saying what you need is some sort of random identifier for an item. So this could be an order ID or a comment ID. Ideally, it's something that can be easily put in a URL path as well if you're doing some restful stuff there, but you want some sorting across that data set. Because you might want to say give me a user's most recent five orders or the most recent 10 comments for this Reddit post, anything like that.
So if you're using a generic UUID, you're not going to get that sorting capability. So this KSUID is really interesting. It basically combines the uniqueness of a UUID with some sorting mechanisms, where the prefix on this UUID is an encoded timestamp, basically. So then it's all going to be laid out in order within your particular item collection; you can get the most recent five comments or the first five comments or anything like that. You can get chronological ordering for it. But also you get uniqueness, you get URL friendliness. It works really well that way. So I found that pretty late in writing the book and just fell in love with it and started using it everywhere.
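To illustrate the idea only (this is not Segment's actual KSUID encoding, just a hypothetical sketch of the timestamp-prefix-plus-randomness concept):

```python
import os
import time

def sortable_id() -> str:
    """Illustrative only -- not Segment's actual KSUID format.

    The idea: a fixed-width timestamp prefix makes IDs sort chronologically,
    and a random suffix keeps them unique.
    """
    timestamp = format(int(time.time()), "08x")  # seconds since epoch, hex, fixed width
    randomness = os.urandom(8).hex()             # 16 hex characters of randomness
    return f"{timestamp}{randomness}"

# IDs generated later sort after IDs generated earlier, so using them in a
# sort key (e.g. SK = "ORDER#<id>") gives chronological ordering for free.
print(sortable_id())
```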
Jeremy: And I think that's something that Twitter did with their tweet IDs, as they called it like snowflakes or something like that.
Alex: Yeah.
Jeremy: Yeah. Which I always thought was really fascinating. I remember, I was working on a startup. And I remember spending about a day and a half trying to figure out how to generate the right IDs with ordering and going down that whole snowflake thing thinking that of course, we're going to have as many records as Twitter does. I mean eventually when this thing takes off. Then it lasted for three years, but anyways.
So the other thing that probably, I guess, maybe this is just me. But I know when I'm thinking about building a table, I'm always trying to think of, "Well, what if I need to change this thing?" Like, what if I have a new access pattern that I need to add, or I need to sort the data in a different way. And with a couple of hundred thousand records, or whatever, moving things around or copying the data over isn't that difficult. But if I was to get a terabyte of data or two terabytes of data on a table, and then all of a sudden somebody comes to me and says, "Hey, we need to add this new access pattern."
That is one of those things in the past where people have said, and even Rick had said this in one of his talks, where it's not a flexible data model. But I think that tune has changed quite a bit. I mean, Rick, and I actually talked about this a little bit where there is some flexibility now. But so migrations in general, whether it's migrating from an existing table or adding new access patterns, or even just migrating data from your existing workloads, you spend a lot of time in the book talking about this.
Alex: Yeah, absolutely. And this was actually a late addition to the book, but I just got so many questions about it: I don't want to use Dynamo because what if my access patterns change, or how do I migrate data? Things like that. So I actually went through it, and I think it's not as bad as you think. And I split migrations into two categories, basically. First off, there are just additive migrations, where if you're just adding a new application attribute to existing items, you just change that in your application code, you don't need to change anything in DynamoDB. Or if you're adding a new type of entity that doesn't have any relational access patterns with an existing entity, or if you can put it into an existing item collection of an existing entity, you don't need to do anything. It's just purely an application code change there.
The second type of migration is, I need to do something to existing items, either because I'm changing an access pattern for existing items, or I'm joining two existing items that weren't joined together, or I'm adding a new entity type that needs a relation and there are no existing item collections to use there.
So now you not only need to change your application code, but you need to do something with your existing data. And that's harder. It seems scary, but it's actually not that bad. And once you've gone through one of these processes, these ETL migration processes, they're pretty easy. It's basically a three step process. You're going to have some giant background job that's going to scan your table. You're going to look for the particular items you need to change. So if it's an order, and you need to add GSI2PK and GSI2SK for it, you find your orders in that scan, and for each order that you find, you add these new attributes to it.
And then if there are more pages in your scan, you loop around and do that again. So it's just this giant while loop that operates on your whole table. Depending on how big your table is, it might take a few hours, but it's pretty straightforward, that three step process: scan your table, identify the items you want to change, and change them.
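A sketch of that three-step loop with boto3 (table name, Type values, the OrderDate attribute, and the new GSI2 key patterns are all hypothetical): scan, filter to the items you care about, add the new indexing attributes, and repeat until there are no more pages.

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical migration: add GSI2PK/GSI2SK to every Order item.
scan_kwargs = {
    "TableName": "MyTable",
    "FilterExpression": "#t = :type",
    "ExpressionAttributeNames": {"#t": "Type"},
    "ExpressionAttributeValues": {":type": {"S": "Order"}},
}

while True:
    page = client.scan(**scan_kwargs)
    for item in page["Items"]:
        client.update_item(
            TableName="MyTable",
            Key={"PK": item["PK"], "SK": item["SK"]},
            UpdateExpression="SET GSI2PK = :gsi2pk, GSI2SK = :gsi2sk",
            ExpressionAttributeValues={
                # Hypothetical new key pattern built from an existing attribute.
                ":gsi2pk": {"S": f"ORDERDATE#{item['OrderDate']['S'][:10]}"},
                ":gsi2sk": item["SK"],
            },
        )
    # Loop again if the scan has more pages; otherwise the migration is done.
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```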
Jeremy: Yeah. I mean, and that's one of the things that I've come to realize too, and certainly, where you get a lot of flexibility in adding new indexes or even just making changes to the underlying data in the table, it goes back to the strategy of using separate attributes for the indexing and separate attributes for your application. Because when those two things don't mix, then it's so much easier to just go in and change the shape of certain bits of data.
Alex: Yep, absolutely. I agree completely.
Jeremy: Alright. So then another thing too, you mentioned, it could take a couple of hours. One thing that's great about something like Lambda, for example, is the fact that you can run a very high, you know, a high level of concurrency when you're doing these jobs, and DynamoDB actually supports parallel scans, which you talk about as a way to help with these migrations.
Alex: Yep, parallel scans are awesome. Basically, it's a way for you to chop up your scan into a bunch of different segments as you need to and operate on them in parallel, without you needing to do any state management around that and figuring out which ones you've already done or haven't done.
So when you're doing a scan operation. If you add in these two additional parameters, one is total segment which says, hey, how many different segments do I want to chop this up into? Do I want to have 10 different workers? Do I want to have 100 workers? Do I want 1000? But you're going to, for that total segments parameter, you're going to use the same number across all your different workers.
So let's say you have 10 segments. There's also a Segment parameter, which identifies which segment that particular worker is handling. So if you had 10 different workers, each one of them would get a different value for that Segment, zero through nine, one for each of those 10 segments. And then they can just operate in parallel, and they're just going at it. You can set this up as a Step Functions workflow with Lambda, use a Map state to fan it out across a bunch of workers, and just let it roll. It works really well.
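Here's a rough sketch of what one parallel scan worker might look like with boto3, assuming the same hypothetical MyAppTable and a process() helper standing in for whatever per-item migration logic you'd run. Each worker passes the same TotalSegments value and its own unique Segment:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MyAppTable")  # hypothetical table name

def process(item):
    # placeholder for the per-item migration logic (see previous sketch)
    pass

def scan_segment(segment: int, total_segments: int = 10):
    """Process one segment of a parallel scan, e.g. inside a Lambda
    fanned out by a Step Functions Map state."""
    scan_kwargs = {
        "Segment": segment,            # which slice this worker owns (0..total_segments-1)
        "TotalSegments": total_segments,  # same value across all workers
    }
    while True:
        response = table.scan(**scan_kwargs)
        for item in response["Items"]:
            process(item)

        # Keep paginating within this segment until it's exhausted
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
```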
Jeremy: Yeah, parallel scans are one of those things where, as soon as you figure out how to do them, all of a sudden all these daunting ETL tasks are much, much more approachable, I guess, is a good way to say it.
Alex: True.
Jeremy: Alright, so then another thing that you mentioned in the book, and I thought this was good to mention, because I tend to see this in my own design sometimes where I have maybe like a user ID or a username. Maybe it's the email address, right?
And then I have another item that also has that email address, but it's for a different entity type, like maybe one's for the authentication and one's for their user record or something like that.
And when you want to make sure that you've got uniqueness on both of those, even if one of those records doesn't exist yet, you need to enforce uniqueness across multiple items, and you have some strategies for that.
Alex: Yeah, absolutely. So if you're doing any uniqueness in DynamoDB, you're going to need to build that into your primary key pattern. And then when you're writing that item, you'll use what's called a condition expression that just says, "Hey, make sure there's not an item that already exists with this primary key." So you'll do that, and if you have a user, often that username or something is built into the primary key pattern so you can assert that it doesn't already exist.
But if you have multiple uniqueness requirements like you're saying here, where you want to make sure no user has the same username and also that no user has signed up with this email address, you don't bake both of them into the primary key, because that's not going to work. That's only going to ensure the combination of the two is unique in your table. What you need to do is make two separate items: one which is tracking the username, and one which is tracking the email address, and wrap that in a transaction. Transactions are really cool. You can operate on multiple items in a single request, and if any of those operations fail, the entire request gets rolled back and fails. So that's how you can handle both of those uniqueness requirements if you need to.
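As an illustration, here's a minimal sketch of that two-item uniqueness pattern using a DynamoDB transaction via boto3. The table name and the USER#/USEREMAIL# key formats are assumptions for the example:

```python
import boto3

client = boto3.client("dynamodb")

def create_user(username: str, email: str):
    client.transact_write_items(
        TransactItems=[
            {
                "Put": {
                    "TableName": "MyAppTable",  # hypothetical table name
                    "Item": {
                        "PK": {"S": f"USER#{username}"},
                        "SK": {"S": f"USER#{username}"},
                        "Username": {"S": username},
                        "Email": {"S": email},
                    },
                    # Fail if this username already exists
                    "ConditionExpression": "attribute_not_exists(PK)",
                }
            },
            {
                "Put": {
                    "TableName": "MyAppTable",
                    "Item": {
                        "PK": {"S": f"USEREMAIL#{email}"},
                        "SK": {"S": f"USEREMAIL#{email}"},
                        "Username": {"S": username},
                    },
                    # Fail if this email address is already taken
                    "ConditionExpression": "attribute_not_exists(PK)",
                }
            },
        ]
    )
```

If either condition expression fails, the whole transaction is rolled back, so neither the user item nor the email-tracking item gets written.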
Jeremy: Awesome. Alright. I have one more question for you.
Alex: Yep.
Jeremy: Why can't I just do SELECT COUNT(*) FROM something WHERE x, y, z equals whatever on DynamoDB? I know the answer to that, but it's a question that comes up where people want to be able to count items. I want to count the number of downloads, so I'll just scan the table, or I'll run a query for all my downloads that match a particular condition. That's a terrible idea, right? I think we know that. So how do we do that? How do we get reference counts?
Alex: Yeah, it's a good question. Going back to why you can't do that: DynamoDB is going to strictly enforce that you cannot write a bad query. Unless you really try hard, they're going to make it so you can't do things that won't scale. And an aggregation like that is totally unbounded. What if you had three million related items in this particular thing, or 10 million, or whatever? That count would have to read through 10 million items, each of which could be up to 400 kilobytes, and it would just be a mess in terms of how long that would take.
DynamoDB is going to cut that off and make sure you can't do crazy stuff like that. So instead, what you need to do is maintain reference counts yourself. The common example I use in the book is GitHub repos and the number of people that have starred them, right? I think 150,000 people have starred the React repo. Or if you have a tweet, people can like it or retweet it, and that could have hundreds of thousands or millions of likes.
And what you do is store a reference count on that parent item itself. So again, you'll be using a transaction, like you just said, and doing two operations at once. Number one, you're inserting the item that tracks that the operation happened. So if someone stars a repo, you're inserting an item there to make sure they don't star it multiple times, or if someone likes or retweets a tweet, you insert that item to make sure they don't do that multiple times.
But in that same transaction, you're also incrementing the count on that parent item. So on that repo, on that tweet, whatever it is, you're just incrementing that counter by one. And then when you fetch that repo or that tweet, you can show the number of people that have starred it, or liked it, or retweeted it, or whatever. And that's how you handle reference counts, rather than doing a COUNT(*) where you're grouping by something huge.
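Here's a sketch of that pattern for GitHub stars, again using a boto3 transaction. The table name, key formats, and StarCount attribute are illustrative assumptions:

```python
import boto3

client = boto3.client("dynamodb")

def star_repo(repo_name: str, username: str):
    client.transact_write_items(
        TransactItems=[
            {
                "Put": {
                    "TableName": "MyAppTable",  # hypothetical table name
                    "Item": {
                        "PK": {"S": f"REPO#{repo_name}"},
                        "SK": {"S": f"STAR#{username}"},
                    },
                    # Ensure this user hasn't already starred this repo
                    "ConditionExpression": "attribute_not_exists(PK)",
                }
            },
            {
                "Update": {
                    "TableName": "MyAppTable",
                    "Key": {
                        "PK": {"S": f"REPO#{repo_name}"},
                        "SK": {"S": f"REPO#{repo_name}"},
                    },
                    # Increment the denormalized star count on the parent item;
                    # ADD initializes the attribute to zero if it doesn't exist yet
                    "UpdateExpression": "ADD StarCount :inc",
                    "ExpressionAttributeValues": {":inc": {"N": "1"}},
                }
            },
        ]
    )
```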
Jeremy: Yeah. And I'm always a little leery of transactions in DynamoDB. I know they work really well, but they're not your traditional transactions, and I know they're a little more expensive than normal writes. So even in situations like that, it depends on how important the count is. If the count has to be 100% accurate, then yeah, use transactions, or maybe DynamoDB Streams. But sometimes just writing those in a batch, you know what I mean, will probably be fine. Unless there's some error or something that happens, for the most part that should be a fairly safe operation.
Alex: Yeah. The one thing I'd say there right off the bat is, it depends. If you're worried about slowing down that hot path, then doing it in Streams later on and batching it can be a good way to increment those counts. But if you're trying to use batching to save money on your writes, it's not really going to save you unless you have a small number of items, where in a particular stream batch you're able to group some together, right?
But if you get 25 stream records and it's 25 different items you need to increment counts on, you're still going to pay the same write cost there as if you did it in a transaction.
Jeremy: Yeah, I was just thinking of something like a batch item write, you know what I mean, doing that to do multiple writes or something like that. But anyways, so listen, anything else we should know about this book?
Alex: I mean, I don't think so. I'd say, you know, I think it's been well received. Rick Houlihan, who got me started on this journey and is a hero of mine, wrote the foreword and endorsed it. I think a lot of folks on the DynamoDB team and within AWS are really enjoying it.
I think a lot of people have enjoyed it. So, get it. There's a money back guarantee if you're unhappy, but I think you'll be happy with it. I just think there's nothing else out there like it. There's maybe 25% of it that you could get out there by cobbling things together, and it's easier having it all in one place, but 50 to 75% of it, I think, is really just brand new stuff that I'm pretty happy with.
Jeremy: Yeah, no, I totally agree. And you have all those extra examples. You have videos that go along with it, depending on which level you buy, and things like that. So yeah, definitely. So listen, Alex, thank you again for coming on. And I mean, really, thank you for writing the book, because it's my new go-to reference for DynamoDB.
Rather than searching on Google, I just search the PDF and find it there. It's a little bit easier, and I know the information will be accurate and up to date. So it's a hugely important piece of work that you've done, and I think the community is very much appreciative of it. So if you appreciate Alex's work, go to dynamodbbook.com and reward him by paying him for that work, because I think that is hugely important.
So other than that, if people want to get a hold of you what's the best way to do that?
Alex: Yep, you can hit me up on Twitter, I'm @alexbdebrie there. I also blog at alexdebrie.com, and I have dynamodbguide.com and dynamodbbook.com. So if you google me or look for me, I'm out there. I'm always around in Twitter DMs or email, so anything you want, hit me up. I'm happy to chat with folks.
Jeremy: Awesome. Alright. I will get all that into the show notes. Thanks again, Alex.
Alex: Thanks, Jeremy.