Stop murdering “Agile”, and be agile instead
“Agile Macabre” on techcamp.hamburg, Apr-2020 I’ve been leading projects since 2002, becoming a full-blown agilist around 2011. Not so long...
DynamoDB is Amazon’s managed NoSQL database service. DynamoDB provides a simple, schemaless database structure and very high scalability based on partitioning. It also offers an online management console, which lets you query and edit data and makes the overall developer experience very convenient.
There are two main approaches to designing DynamoDB databases. Multi Table Design strores each database entity in a separate table. Single Table Design stores all entities in one big common table.
This article focuses mostly on the development experience of creating DynamoDB applications. If you’re working with a large scale project, performance and scalability may be more important aspects for you. However, you can’t completely ignore the developer experience. If you apply Single Table Design, the developer experience will be more cumbersome and less intuitive than with Multi Table Design.
DynamoDB is based on individual tables that have no relationships between each other. Despite the limitation, we tend to use them in the same way as SQL database tables. We name a DynamoDB table according to a database entity, and then store instances of that database entity in that table. Each entity gets their own table.
We can call this approach Multi Table Design, because an application usually requires multiple entities. It’s the default way most of us create DynamoDB applications.
Let’s say we have the entities User, Drive, Folder and File. We would typically then have four DynamoDB tables as shown in the database layout below.
The boldface headers are field names, and the numbers are field values organized into table rows. For simplicity, we’re only dealing with numeric identifiers.
USERS UserId(PK) 1
DRIVES UserId(PK) DriveId(SK) 1 1 1 2
FOLDERS UserId(PK) FolderId(SK) ParentDriveId 1 1 1 1 2 2
FILES UserId(PK) FileId(SK) ParentFolderId 1 1 1 1 2 2 1 3 2
Note: PK means Partition Key and SK means Sort Key. Together they are the table’s unique primary key.
It’s pretty easy to understand the structure of this database. Everything is partitioned by UserId. Underneath each User there are Drives which may contain Folders. Folders may contain Files.
The main limitation of Multi Table Design is that you can only retrieve data from one table in one query. If you want to retrieve a User and all their Drives, Folders and Files, you need to make four separate queries. This is particularly inefficient in use cases where you cannot make all the queries in parallel. You need to first look up some data in one table, so that you can find the related data in another table.
Single Table Design is the opposite of Multi Table Design. Amazon has advocated this design pattern in various technical presentations. For an example, see DAT401 Advanced Design Patterns for DynamoDB by Rick Houlihan.
The basic idea is to store all database entities in a single table. You can do this because of DynamoDB’s schemaless design. You can then makes queries that retrieve several kinds of entities at the same time, because they are all in the same table.
The primary key usually contains the entity type as part of it. The table might thus contain an entity called “User-1” and an entity called “Folder-1”. The first one is a User with identifier “1”. The second one is a Folder with identifier “1”. They are separate because of the entity prefix, and can be stored in the same table.
Let’s say we have the entities User, Drive, Folder and File that make up a hierarchy. A table containing a bunch of these entities might look like this:
PK SK HierarchyId User-1 User-1 User-1/ User-1 Drive-1 User-1/Drive-1/ User-1 Folder-1 User-1/Drive-1/Folder-1/ User-1 File-1 User-1/Drive-1/Folder-1/File-1/ User-1 Folder-2 User-1/Drive-1/Folder-2/ User-1 File-2 User-1/Drive-1/Folder-2/File-2/ User-1 File-3 User-1/Drive-1/Folder-2/File-3/
Note: PK means Partition Key and SK means Sort Key. Together they are the table’s unique primary key. We’ll explain HierarchyId in just a moment.
As you can see, all items are in the same table. The partition key is always User-1, so that all of User-1’s data resides in the same partition.
The main advantage that you get from Single Table Design is the ability to retrieve a hierarchy of entities with a single query. You can achieve this by using Secondary Indexes. A Secondary index provides a way to query the items in a table in a specific order.
Let’s say we create a Secondary Index where the partition key is PK and the sort key is HierarchyId. It’s now possible to query all the items whose PK is “User-1” and that have a HierarchyId beginning with “User-1/Drive-1/”. We get all the folders and files that the user has stored on Drive-1, and also the Drive-1 entity itself, as the result.
The same would have been possible with Multi Table Design, just not as efficiently. We would have defined similar Secondary Indexes to implement the relationships. Then we would have separately queried the user’s drives from the Drives table, folders from the Folders table, and files from the Files table, and combined all the results.
Single Table Design can also handle other kinds of access patterns more efficiently than Multi Table Design. Check the YouTube video mentioned in the beginning of this article to learn more about them.
Why would we not always use Single Table Design when creating DynamoDB based applications? Do we lose something significant by applying it to every use case?
The answer is yes. We lose simplicity in database design. When using Single Table Design, the application becomes more complicated and unintuitive to develop. As we add new features and access patterns over time, the complexity keeps growing.
Just managing one huge DynamoDB table is complicated in itself. We have to remember to include the “User-” entity prefix in all queries when working with AWS Console. Simple table scans aren’t possible without specifying a prefix.
We also need to manually maintain the HierarchyId composite key whenever we create or update entities. It’s easy to cause weird bugs by forgetting to update HierarchyId in some edge case or when editing the database manually.
As we start adding sorting and filtering capabilities to our database queries, things get even more complicated.
Now, let’s allow sorting files by their creation date. Extending our example, we might have a table design like this:
PK SK HierarchyId CreatedAt User-1 User-1 User-1/ 2019-07-01 User-1 Drive-1 User-1/Drive-1/ 2019-07-02 User-1 Folder-1 User-1/Drive-1/Folder-1/ 2019-07-03 User-1 File-1 User-1/Drive-1/Folder-1/File-1/ 2019-07-04 User-1 Folder-2 User-1/Drive-1/Folder-2/ 2019-07-05 User-1 File-2 User-1/Drive-1/Folder-2/File-2/ 2019-07-06 User-1 File-3 User-1/Drive-1/Folder-2/File-3/ 2019-07-07
How do we retrieve the contents of Folder-2 ordered by the CreatedAt field? We add a Global Secondary Index for this access pattern, which will consist of GSI1PK and GSI1SK:
PK SK HierarchyId CreatedAt GSI1PK GSI1SK User-1 User-1 User-1/ 2019-07-01 User-1/ ~ User-1 Drive-1 User-1/Drive-1/ 2019-07-02 User-1/ 2019-07-02 User-1 Folder-1 User-1/Drive-1/Folder-1/ 2019-07-03 User-1/Folder-1/ ~ User-1 File-1 User-1/Drive-1/Folder-1/File-1/ 2019-07-04 User-1/Folder-1/ 2019-07-04 User-1 Folder-2 User-1/Drive-1/Folder-2/ 2019-07-05 User-1/Folder-2/ ~ User-1 File-2 User-1/Drive-1/Folder-2/File-2/ 2019-07-06 User-1/Folder-2/ 2019-07-06 User-1 File-3 User-1/Drive-1/Folder-2/File-3/ 2019-07-07 User-1/Folder-2/ 2019-07-07
We’ll get to the semantics of GSI1PK and GSI1SK in just a moment.
But why did we call these fields GSI1PK and GSI1SK instead of something meaningful? Because they will contain different kinds of values depending on the entity stored in each database item. GSI1PK and GSI1SK will be calculated differently depending on whether the item is a User, Drive, Folder or File.
Since it’s not possible to give GSI keys sensible names, we just call them GSI1PK and GSI1SK. These kind of generic field names add cognitive load, because the fields are no longer self-explanatary. Developers need to check development documentation to find out what exactly GSI1PK and GSI1SK mean for some particular entity.
So, why is the GSI1PK field not the same as HierarchyId? Because in DynamoDB you cannot query for a range of partition key values. You have to query for one specific partition key. In this use case, we can query for GSI1PK = “User-1/” to get items under a user, and query for GSI1PK = “User-1/Folder-1” to get items under a user’s folder.
What about the tilde (~) characters in some GS1SK values? They implement reverse date sorting in a way that also allows pagination. Tilde is the last printable character in the ASCII character set and will sort after all other characters. It’s a nice hack, but it also adds even more cognitive load to understanding what’s happening.
When we query for GSI1PK = “User-1/Folder-1/” and sort the results by GSI1SK in descending key order, the first result is Folder-1 (because ~ comes after all other keys) and the following results are File-2 and File-3 in descending date order. Assuming there are lots of files, we could continue this query using the LastEvaluatedKey feature of DynamoDB and retrieve more pages. The parent object (Folder-1) always appears in the first page of items.
You may have noticed that we can now also query a user’s drives in creation date order. The GSI1PK and GSI1SK fields apply to this relationship as well. This works because the relationship between the User and Drive entities does not not overlap with the relationship between the Folder and File entities.
But what happens if we need to query all the Folders under a Drive? Let’s say the results must, again, be in creation date order.
We can’t use the GSI1 index for this query because the GSI1PK and GSI1SK fields already have different semantics. We already use those keys to retrieve items under Users or Folders.
So, we’ll create a new Global Secondary Index called GSI2, where GSI2PK and GSI2SK define a new relationship. The fields are shown in the table below:
PK SK HierarchyId CreatedAt GSI1PK GSI1SK GSI2PK GSI2SK User-1 User-1 User-1/ 2019-07-01 User-1/ ~ User-1 Drive-1 User-1/Drive-1/ 2019-07-02 User-1/ 2019-07-02 User-1/Drive-1/ ~ User-1 Folder-1 User-1/Drive-1/Folder-1/ 2019-07-03 User-1/Folder-1/ ~ User-1/Drive-1/ 2019-07-03 User-1 File-1 User-1/Drive-1/Folder-1/File-1/ 2019-07-04 User-1/Folder-1/ 2019-07-04 User-1/Drive-1/ 2019-07-04 User-1 Folder-2 User-1/Drive-1/Folder-2/ 2019-07-05 User-1/Folder-2/ ~ User-1/Drive-1/ 2019-07-05 User-1 File-2 User-1/Drive-1/Folder-2/File-2/ 2019-07-06 User-1/Folder-2/ 2019-07-06 User-1 File-3 User-1/Drive-1/Folder-2/File-3/ 2019-07-07 User-1/Folder-2/ 2019-07-07
Note: Please scroll the table horizontally if necessary.
Using this new index we can query for GSI2PK = “User-1/Drive-1/” and sort the results by GSI2SK to get the folders in creation date order. Drive-1 has a tilde (~) as the sort key to ensure it comes as the first result on the first page of the query.
At this point it’s becoming increasingly more complicated to keep track of all those GSI fields. Can you still remember what exactly GSI1PK and GSI2SK mean? The cognitive load is increasing because you’re dealing with abstract identifiers instead of meaningful field names.
The bad news is that it only gets worse. As we add more entities and access patterns, we have to add more Global Secondary Indexes. Each of them will have a different meaning in different situations. Your documentation becomes very important. Developers need to check it all the time to find out what each GSI means.
Let’s add a new Status field to Files and Folders. We will now allow querying for Files and Folders based on their Status, which may be VISIBLE, HIDDEN or DELETED. The results must be sorted by creation time.
We end up with a design that requires three new Global Secondary Indexes. GSI3 will contain files that have a VISIBLE status. GSI4 will contain files that have a HIDDEN status. GSI5 will contain files that have a DELETED status. Here’s what the table will look like:
PK SK HierarchyId CreatedAt GSI1PK GSI1SK GSI2PK GSI2SK Status GSI3PK GSI3SK GSI4PK GSI4SK GSI5PK GSI5SK User-1 User-1 User-1/ 2019-07-01 User-1/ ~ User-1 Drive-1 User-1/Drive-1/ 2019-07-02 User-1/ 2019-07-02 User-1/Drive-1/ ~ User-1 Folder-1 User-1/Drive-1/Folder-1/ 2019-07-03 User-1/Folder-1/ ~ User-1/Drive-1/ 2019-07-03 VISIBLE User-1/Folder-1/VISIBLE/ ~ User-1/Folder-1/HIDDEN/ ~ User-1/Folder-1/DELETED/ ~ User-1 File-1 User-1/Drive-1/Folder-1/File-1/ 2019-07-04 User-1/Folder-1/ 2019-07-04 User-1/Drive-1/ 2019-07-04 VISIBLE User-1/Folder-1/VISIBLE/ 2019-07-04 User-1/Folder-1/HIDDEN/ 2019-07-04 User-1/Folder-1/DELETED/ User-1 Folder-2 User-1/Drive-1/Folder-2/ 2019-07-05 User-1/Folder-2/ ~ User-1/Drive-1/ 2019-07-05 VISIBLE User-1/Folder-2/VISIBLE/ ~ User-1/Folder-2/HIDDEN/ ~ User-1/Folder-2/DELETED/ ~ User-1 File-2 User-1/Drive-1/Folder-2/File-2/ 2019-07-06 User-1/Folder-2/ 2019-07-06 HIDDEN User-1/Folder-2/VISIBLE/ User-1/Folder-2/HIDDEN/ 2019-07-06 User-1/Folder-2/DELETED/ User-1 File-3 User-1/Drive-1/Folder-2/File-3/ 2019-07-07 User-1/Folder-2/ 2019-07-07 DELETED User-1/Folder-2/VISIBLE/ User-1/Folder-2/HIDDEN/ User-1/Folder-2/DELETED/ 2019-07-07
Note: Please scroll the table horizontally if necessary.
You may think this is getting a bit too complicated. It’s complicated because we still want to be able to retrieve both a parent item and its children in just one query.
For example, let’s say we want to retrieve all VISIBLE files in Folder-1. We query for GSI3PK = “User-1/Folder-1/VISIBLE/” and again sort the results in descending order as earlier. We get back Folder-1 as the first result and File-1 as the second result. Pagination will also work if there are more results. If there are no VISIBLE files under the folder, we only get a single result, the folder.
That’s nice. But can you now figure out how to retrieve all DELETED files in Folder-2? Which GSI will you use and what do you query for? You probably need to stop your development work for a while and spend some time reading the documentation.
Let’s say we need to add a new Status value called ARCHIVED. This will involve creating a yet another GSI and adding application code in all the places where Files or Folders are created or updated. The new code needs to make sure that GSI6PK and GSI6SK are generated correctly.
That’s a lot of development and testing work. It will happen every time we add a new Status value or some other way to perform conditional queries.
Later we might also want to add new sort fields called ModifiedAt and ArchivedAt. Each new sort field will require its own set of Global Secondary Indexes. We have to create a new GSI for every possible Status value and sort key combination, so we end up with quite a lot of them. In fact, our application will now have GSI1-GSI18, and developers will need to understand what GSI1PK-GSI18PK and GSI1SK-GSI18SK mean.
In fairness, this complexity is not unique to Single Table Design. We would have similar challenges when applying Multi Table Design and implementing many different ways to query data.
What’s different in Multi Table Design is that each entity will live in its own table where the field names don’t have to be overloaded. If you add a feature that involves Folders, you only need to deal with the Folders table. Indexes and keys will have semantically meaningful names like “UserId-Status-CreatedAt-index”. Developers can understand them intuitively without referring to documentation all the time.
We can make compromises between Single Table Design and Multi Table Design to reduce complexity. Here are some suggestions.
First of all, you should think of Single Table Design as an optimization that you might be applying prematurely. If you design all new applications from scratch using Single Table Design, you’re basically optimizing before knowing the real problems and bottlenecks.
You should also consider whether the database entities will truly benefit from Single Table Design or not. If the use case involves retrieving a deep hierarchy of entities, it makes sense to combine those entities into a single table. Other entities can still live in their own tables.
In many real-life use cases the only benefit from Single Table Design is the ability to retrieve a parent entity and its children using a single DynamoDB query. In such cases the benefit is pretty small. You could just as well make two parallel requests. Retrieve the parent using GetItem and the children using a Query. In an API based web application the user interface can perform these requests in parallel and combine the results in the frontend.
Many of the design patterns related to Single Table Design also apply to Multi Table Design. For instance, overloaded composite keys and secondary indexes are sometimes quite helpful in modeling hierarchies and relationships. You can use them in Multi Table Design without paying the full price of complexity that Single Table Design would add.
In summary, you should use your judgment case by case. Don’t make blanket policy to design every application using either Single Table Design or Multi Table Design. Learn the design patterns and apply them where they make sense.
Let’s discuss how we can help with your cloud journey. Our experts are standing by to talk about your migration, modernisation, development and skills challenges.