Pedro H Goncalves
Designing robust and scalable relational databases: A series of best practices.

Throughout my experience as a data engineer and software developer, I've had the pleasure (and displeasure) of creating and maintaining many databases. In this article, I'll list some best practices that I consider essential for relational databases to become truly scalable.

Some practices, such as adding indexes, table partitioning, and studying normalization and denormalization applications, will require a slightly deeper level of knowledge to be used correctly and not generate future or immediate problems. I'll describe some of these points, but it's highly recommended that you study them more in-depth. Additionally, some of the practices I'll list are not possible to implement in existing, consolidated systems that primarily have many interface points (services that consume from them); changes in these cases can mean a lot of headaches. In any case, it's highly recommended that you use these practices in your new projects, even if you're the only developer. Your future self will thank your past self for implementing them.

Although I'll list some practices, concepts, and tools, I don't intend to delve deeply into them. My aim is to provide a basic explanation of each topic so that you can analyze and see if it makes sense to apply them to your projects. Therefore, for some of the topics, I'll leave a link to an article/post/answer that I believe has good quality regarding that theme and goes into more depth.

While this article covers several crucial aspects of database management and design, some advanced topics deserve a more in-depth exploration. Subjects like MPP (massively parallel processing), sharding, and distributed processing and storage are complex enough to warrant their own dedicated article.

I plan to address these topics in a future piece where I can give them the careful consideration and detailed explanation they merit.

What we will see:

  • Monitoring your database
  • Low-level configuration files and variables
  • Naming convention standardization
  • Correct use of data types
  • Normalizing and denormalizing: how far to go
  • Choosing a tool for versioning and testing
  • Documenting your entities and fields
  • Applying indexes correctly
  • Modern backup and restoration pipelines


Monitoring your database

Monitoring your database is one of the tasks that should start in the early stages of your application. It's completely understandable that the product/service development team's focus is on making things work, but monitoring your database and its entities can deliver important insights about your application and infrastructure. Good database monitoring tracks metrics such as:

  • Average query response time.
  • Throughput (number of transactions per second).
  • CPU and memory utilization.
  • Disk I/O (reads/writes per second).
  • Uptime (available time within a range).
  • Error rates.
  • Deadlocks.
  • Locks.
  • Failed login attempts.
  • Backup metrics.

Some of these metrics can't be collected at the table level, but the deeper you can drill down, the easier future investigations will be. By drilling down, I mean the granularity you can reach: from the average execution time of queries across the whole database, down to the average per table, down to the individual query level, where you can see which kinds of queries take longer to execute on that table.

As a recommendation, I suggest a very famous monitoring stack: Prometheus, a monitoring system/time-series database, together with Grafana, a tool for visualizing metrics through dashboards. If for some reason you can't or don't want to use these tools, I'm sorry to say that answering all these questions with a single tool (especially an OSS one) will be difficult. In that case, the ideal approach is to evaluate a set of tools and design your implementation so that their responsibilities don't overlap; overlapping collectors tend to produce ambiguous, and sometimes conflicting, numbers, leading to a big headache.
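Even before a full Prometheus/Grafana setup, you can pull several of these metrics straight out of the DBMS. A minimal sketch for PostgreSQL, assuming version 13+ and the pg_stat_statements extension enabled (column names differ in older versions):

```sql
-- Slowest statements by average execution time (requires pg_stat_statements).
SELECT query,
       calls,
       mean_exec_time  AS avg_ms,
       total_exec_time AS total_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Per-database throughput, cache hit ratio and deadlock count.
SELECT datname,
       xact_commit + xact_rollback AS transactions,
       round(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS cache_hit_ratio,
       deadlocks
FROM pg_stat_database
WHERE datname NOT LIKE 'template%';
```

Queries like these are also exactly what exporters feed into Prometheus, so they are a good way to validate what your dashboards will show.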


Low-level configuration files and variables

I believe that, like me, many people are curious about what configuration files like postgresql.conf or my.cnf can change within the database, whether in performance, authorization, transactions, or anything else that yields a visible improvement. To address part of the topic and answer many people's question: yes, configuration files can and do dictate much of the DBMS's behavior and how it manages transactions and manipulates data. Systems like PostgreSQL and MySQL ship with defaults for basically every tunable parameter, and those defaults are sized for modest local environments (most installations are test environments or small to medium-sized projects). This means that in environments with large data volumes and a dedicated server with high processing capacity, you can (and should) adjust these settings so the hardware is actually put to use.

Using some PostgreSQL configurations as a basis, I'll mention some important variables that can be altered. Ideally, you should read the documentation about these variables, study your needs, and test the changes to understand how they affect performance and whether they create problems elsewhere.

  • shared_buffers: This is the shared memory for data caching. Increasing the memory that the DBMS can consume significantly increases the query speed of frequently accessed data.

    Values:

    • OLTP environments: 25 - 40% RAM
    • OLAP environments: 50 - 70% RAM
  • effective_cache_size: An estimate of how much memory is available for disk caching by the operating system and PostgreSQL combined. It doesn't allocate anything itself; it only helps the query planner decide, for example, whether an index is likely to fit entirely in cache.

    Values:

    • OLTP environments: 50 - 70% RAM
    • OLAP environments: 80 - 90% RAM
  • work_mem: The amount of memory PostgreSQL allocates per sort operation (ORDER BY) and per hash operation such as aggregations and joins; note that a single query can use several of these allocations at once.

    Values:

    • OLTP environments: 4MB - 64MB
    • OLAP environments: 64MB - 1GB
  • max_connections: This is the maximum number of simultaneous connections. It's a great starting point for controlling concurrency; too many connections can overload the server.

    Values:

    • OLTP environments: 200 - 500
    • OLAP environments: 20 - 60
  • max_wal_size: The Write-Ahead-Log (WAL) is the mechanism that ensures data integrity by recording changes before applying them to the database. A larger value decreases the frequency of checkpoints, which increases write performance but also increases recovery times after failure.

    Values:

    • OLTP environments: 1GB - 4GB
    • OLAP environments: 4GB - 16GB

There are numerous other configurations that will make a difference across all aspects of your database, including resource consumption, authentication, encryption, connections, and other areas. I strongly recommend reading the documentation for your DBMS and adapting it to your needs, always taking a backup beforehand so you can roll back in case of unexpected behavior, and watching the effect on performance through your monitoring.
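In PostgreSQL you can apply these settings without editing postgresql.conf by hand, using ALTER SYSTEM. A hedged sketch for a hypothetical dedicated OLTP server with 16 GB of RAM; the values follow the ranges above and must be benchmarked against your own workload:

```sql
-- Values are written to postgresql.auto.conf and override postgresql.conf.
ALTER SYSTEM SET shared_buffers = '4GB';         -- ~25% of RAM
ALTER SYSTEM SET effective_cache_size = '10GB';  -- ~60% of RAM
ALTER SYSTEM SET work_mem = '16MB';              -- per sort/hash operation
ALTER SYSTEM SET max_connections = 300;
ALTER SYSTEM SET max_wal_size = '2GB';

-- shared_buffers and max_connections only take effect after a server restart;
-- the other settings can be picked up with a configuration reload.
SELECT pg_reload_conf();
```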


Naming Convention Standardization

Naming convention standardization is one of the most powerful practices, but it's probably the most difficult to implement. It's easy to implement during the initial database modeling, but it's very easy to "lose" throughout development, and in many cases, it's impossible to implement in existing systems.

With naming convention standardization, we can list a few points:

  • Reduction of redundant names, such as the field client_name in the client table where it could simply be name.
  • Standardization of abbreviations. In one table, you might have the field sale_value, and in another, you have vl_service, where the first could be vl_sale.
  • Singular names for tables and fields. Instead of naming a table clients, you would name it client, and the same rule applies to fields, with the exception of fields that truly represent more than one value. An example could be a tags field of type ARRAY of STRINGS.
  • Schema naming. Not every database supports dividing entities into schemas, but the most popular one today (PostgreSQL) does. One recommendation is to work with short names: the schema that groups finance-related tables can be abbreviated to fin, human resources to hr, marketing to mkt. You might wonder why short names for schemas and not for tables and fields? Schemas are the broadest level of the hierarchy; in most cases the tables inside them already make clear what the schema is about, so there's no real need for a long description. A small DDL example follows this list.
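As an illustration only (the schema and table names are hypothetical), this is roughly what the conventions above look like in DDL:

```sql
-- Short schema name, singular table names, no redundant prefixes,
-- and one standardized abbreviation ("vl_") for monetary values.
CREATE SCHEMA fin;

CREATE TABLE fin.client (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL              -- not client_name: the table already gives the context
);

CREATE TABLE fin.sale (
    id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    client_id BIGINT NOT NULL REFERENCES fin.client (id),
    vl_sale   NUMERIC(12, 2) NOT NULL,
    tags      TEXT[]                -- plural only because it truly holds multiple values
);
```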

Data Dictionary in DBMS - javatpoint


Correct Use of Data Types

The correct use of data types seems quite obvious to mention, and you've surely read or heard about it in more than one course or article. The important point here is: if you have the opportunity to study the data beforehand, take it and define carefully how it should be stored. Besides making inserts and reads more performant, correct data types act as an extra layer of integrity and safety for your data.

It's also important to understand the difference between data types that seem interchangeable, like varchar and char, tinyint and bigint, varchar and text, text and blob, numeric/decimal and float. These choices make no noticeable difference on small amounts of data, but when it comes to billions of rows with numerous fields and indexes, picking the right type can bring a valuable performance gain to the application.

I'll mention some examples of use cases of data types that seemed correct but actually generated future problems.

  • The use of FLOAT to store transaction values. Why would this be a problem? FLOAT (and DOUBLE) use floating-point representation, a storage technique that, roughly speaking, accepts rounding in exchange for faster reads and writes, unlike types such as DECIMAL/NUMERIC, which use fixed-point representation and store the value exactly, without rounding. In financial transactions rounding cannot occur, because at some point the accumulated rounding error becomes significant, and we already know how that story ends.

  • The use of CHAR(355) to store product descriptions. The CHAR type has a fixed size, which means that if a product description is shorter than 355 characters, the database pads the rest with blanks. As the number of stored products grew, the storage difference between CHAR and VARCHAR became substantial, besides hurting the performance of indexes such as FULLTEXT (we'll talk about indexes later).

  • The use of VARCHAR to store negotiation statuses. The field should have only 3 possible values: approved, pending, denied. In this case, the correct type would be ENUM, which blocks the entry of other values, such as misspellings or unmapped statuses. You might argue that only your application interfaces with the database for write operations and that this validation lives in code, but data constraints are never too much: you don't know whether other applications outside your control will appear in the future and end up interfacing with the database.

I strongly recommend that you study the differences between data types and how the database handles them in insertions, selections, updates, and deletions. You will find pertinent information in the documentation of the RDBMS you are using.
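A minimal sketch tying these examples together (the table and column names are hypothetical, PostgreSQL syntax):

```sql
-- ENUM restricts the status to the three mapped values.
CREATE TYPE negotiation_status AS ENUM ('approved', 'pending', 'denied');

CREATE TABLE sale (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    description VARCHAR(355),                     -- variable length: no padding wasted
    vl_sale     NUMERIC(12, 2) NOT NULL,          -- fixed point: exact monetary values
    status      negotiation_status NOT NULL DEFAULT 'pending',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- 19.99 is stored exactly; with FLOAT it could be silently approximated.
INSERT INTO sale (description, vl_sale) VALUES ('example product', 19.99);

-- Rejected by the ENUM: 'aproved' is not one of the mapped statuses.
-- INSERT INTO sale (description, vl_sale, status) VALUES ('typo demo', 10.00, 'aproved');
```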


Normalizing and Denormalizing: How Far to Go

Briefly contextualizing, normalization is a set of rules to be followed in entity modeling. These rules aim to reduce redundancies and spaces for inconsistencies.

Getting ahead of much of the topic and being quite generalist: denormalization doesn't make sense in environments that aren't analytical, such as data warehouses or OLAP workloads. You might ask, "but joins are very costly, wouldn't it be better to reduce the number of joins in the most common queries?" Actually, no. Relational databases are famous precisely because they execute joins efficiently; there are some best practices to make joins run in the best possible way, and I'll list them later.

It's important to remember that with normalization/denormalization you're not just trading performance; you're also trading integrity and consistency. Normalization gives you referential integrity, which prevents inconsistent inserts; when you denormalize, you often end up compensating with triggers or similar mechanisms, which exponentially increases the complexity of your entities.

Does this mean that analytical environments have always been wrong in terms of denormalization? No, analytical environments are designed already knowing the initial demands of what information will be needed, what is and isn't useful data. In these cases, many relationship keys don't make sense and/or little information from another entity is used, so denormalization occurs. Another point is that in the case of analytical environments, they are a set of entities designed for reading; the removal of join operations really makes a difference at large scales but is not a determining factor compared to other common practices in analytical environments.

Points to optimize your join operations (or lack thereof):

  • Primary and foreign keys should have the same type and the same "shape". If the database has to apply a cast or an expression like lower() before it can compare the keys, it can't use the index: it needs to do a full scan, apply the expression/cast to every row, and only then make the comparisons.
  • Very small entities (a few dozen rows). In these cases, scanning the table will probably be more performant than going through an index for the join.
  • Primary and/or foreign keys that are not indexed force the database to fully scan the table, which obviously reduces performance (SIGNIFICANTLY). See the sketch below.
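A short sketch of join-friendly keys (the customer/customer_order tables are hypothetical; note that PostgreSQL does not index foreign keys automatically):

```sql
CREATE TABLE customer (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE customer_order (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customer (id),  -- same type as customer.id
    order_dt    DATE NOT NULL
);

-- Index the foreign key yourself so the join doesn't degenerate into a full scan.
CREATE INDEX idx_customer_order_customer_id ON customer_order (customer_id);

-- With matching types and indexed keys, no cast or expression blocks the index.
EXPLAIN
SELECT c.name, o.order_dt
FROM customer c
JOIN customer_order o ON o.customer_id = c.id
WHERE c.id = 42;
```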

Choosing a Tool for Versioning and Testing

Versioning and testing the creation and modification of entities is not very common, but it makes a HUGE difference in the maintenance and understanding of a system as a whole. It's rare to find codebases where you can see the entire history of the database, with fields being altered, new ones added, and old ones removed, like the evolution of an e-commerce or ERP system. The easiest way to version your database is with migration scripts. There are tools like Flyway for the JVM, which works with .sql scripts such as V0__create_customer_table.sql, or YoYo Migrations for Python, where you can write migrations as Python code or as .sql scripts.

Don't forget to create rollback scripts!
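A minimal sketch of a versioned migration and its rollback, following the Flyway-style naming mentioned above (the customer table, the phone column, and the rollback file name are hypothetical conventions for this example):

```sql
-- V1__add_phone_to_customer.sql (forward migration)
ALTER TABLE customer
    ADD COLUMN phone VARCHAR(20);

-- rollback__add_phone_to_customer.sql (undo script: restores the previous schema)
ALTER TABLE customer
    DROP COLUMN phone;
```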

Migration management tools also let you write migration tests, which are extremely important for verifying that:

  • No data is lost.
  • No inconsistencies are created between data.
  • There are no unnecessary duplications.
  • Prevention of failures due to logical or syntax errors.
  • Allows performance evaluation.

Before choosing your migration tool, it's important to know some of its characteristics. The most important aspect I usually consider is whether more than one application will manipulate entities and run migrations against the database. Why? Tools like Alembic, also from the Python ecosystem, do very simple metadata management, which is perfect for small, single-application projects: you won't have problems when only one project performs migrations. The scenario changes when other applications also run migrations, because it becomes very easy to lose the order and continuity of the migration history. Tools like Flyway and YoYo do more complex, but also more complete, metadata management and usually serve these cases better.

It's important to create rollback scripts so that if a migration runs successfully but doesn't produce the expected result (a bug), it can easily be undone, returning the database to its previous state. It's understandable that not every migration has a rollback script, especially those that affect large quantities of unmapped tuples. In those cases, it's extremely important to have a development database that mirrors the production database at a smaller scale.

Additionally, migration tools are easily integrated into CI/CD environments, which facilitates bringing entity changes to production environments, unifying and exposing the results.

Database Schema Migration: Understand, Optimize, Automate


Documenting Your Entities and Fields

This topic builds on something the Naming Convention Standardization section introduced: a data dictionary that defines the meaning of fields, relationships, and functions, giving sense to elements that might otherwise seem ambiguous.

The creation of metadata (data about data) is extremely important, especially for consumption by other teams. It's a practice that comes from data engineering teams but is also valuable for development teams. First, it's important to briefly describe the cardinality and relationship of the tables. There are numerous tools that allow this type of work, but the most common ones are Lucidchart and the new drawdb. It's a simple task that is usually done even before the materialization of the tables; you can feel free to use any tool you want. There are some tools that generate documentation in HTML/PDF format like Dataedo which I also recommend. The next tool I'll list is a bit more comprehensive and solves the problem more satisfactorily.

Data engineering teams have been frequently using Open Metadata, an OSS tool for creating metadata. Software engineering professionals might think it could be a bit much for documentation, but tools like this are perfect for all types of companies and should be included in the culture as early as possible. Briefly explaining how Open Metadata works: When plugged into the database, the tool "scans" it for all entities, functions, triggers, and so on, and brings this information into a very user-friendly Web UI. With this "recognition" step done, you can put metadata on the fields to say how they are calculated, in which system they originate, and in which services they are used. This, combined with a very useful text search, gives you a complete tool that describes your database from end to end and provides crucial information about the entities and their respective fields.

In general, there are N ways to document your database entities. The point is how you will make it available to stakeholders, what work you will have to do to document/update, and what granularity of information your documentation will have.
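Whatever tool you pick, most DBMSs also let you attach a first layer of metadata directly to the entities; in PostgreSQL this is done with COMMENT ON, as in this hypothetical example (tools that introspect the catalog can typically surface these descriptions):

```sql
COMMENT ON TABLE fin.sale IS
    'One row per closed sale; populated by the checkout service.';

COMMENT ON COLUMN fin.sale.vl_sale IS
    'Total sale value, fixed point NUMERIC(12,2); sum of item prices minus discounts.';
```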

Database Documentation - Lands of Trolls: Why and How? - Simple Talk Metadata: Definition, Examples, Benefits & Use Cases


Applying Indexes Correctly

In your queries, regardless of the data they retrieve or how you execute them, a universal best practice is to always think about reducing the number of I/O operations the database needs to perform. The most common way to do this is with indexes. Indexes, in a quick and rough explanation, are metadata about the data stored in the database that point to where the tuples with certain values are located. A quick practical example: you run select * from orders where order_dt = '2024-11-05', and the order_dt column has a B+ tree index. Instead of traversing all the data pages, the query planner goes directly to the pages that contain the value 2024-11-05, because the index metadata tells it where they are. In general you'll prefer Index Scan queries, where the planner uses the index to reach the desired values, over Table Scan queries, which scan the entire table looking for them.

Indexes also have a metric called selectivity, which measures the index's ability to filter out a large portion of a table's data. The calculation is:

Selectivity = (Number of distinct values) / (Total number of records)

The closer to 1, the more selective the index, and the more selective it is, the smaller the percentage of records a query on it returns. It may sound counterintuitive, but highly selective indexes are usually the most effective ones, because the database can discard most of the table with a quick scan of the index.

After multiple insertions, removals, and updates of the data, the indexes can and probably will become fragmented. This happens because the physical order (where they are stored) does not correspond to the logical order, and this will result in loss of performance in searches. Moreover, unfortunately, there's no magic, and indexes come with a cost that is paid mainly in storage (you have metadata and this needs to be stored somewhere) and also in the performance of data manipulation operations (INSERT, UPDATE, DELETE) (the database needs to rearrange the indexes in these cases). This already gives a clear warning that you can't simply go filling your database with indexes. Their implementation should be studied so that it's not more harmful than beneficial to the database. This also applies to analytical environments where many complex queries are executed and the database will naturally do a table scan, which makes the design of these indexes even more crucial.

It's important to note that there are different types of indexes such as hash, which is excellent for equality searches but not so good for range searches, or fulltext, which is optimized for large volumes of text and supports keyword and phrase searches. If you want to optimize queries on your database entities, I strongly recommend that you deepen your studies on indexes.
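A short sketch, assuming the orders table from the earlier example plus a hypothetical status column:

```sql
-- Default B-tree index: good for equality and range predicates on order_dt.
CREATE INDEX idx_orders_order_dt ON orders (order_dt);

-- Hash index: equality lookups only, useless for range searches.
CREATE INDEX idx_orders_status ON orders USING hash (status);

-- Estimate the selectivity of a candidate column before indexing it.
SELECT count(DISTINCT order_dt)::numeric / count(*) AS selectivity
FROM orders;

-- Check whether the planner actually chose an Index Scan over a Seq Scan.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE order_dt = '2024-11-05';
```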

SQL Indexes - The Definitive Guide - Database Star 
An in-depth look at Database Indexing – eLitmus Blog


Modern Backup and Restoration Pipelines

Those who think backups exist only to restore data after a natural catastrophe or an invasion and data hijacking are mistaken. Of course, those scenarios are important and should be considered when planning your backups, but there are other motivations for making your backup processes more efficient on all fronts and better documented. One entry in our list of "why have modern backup pipelines" is the ease of testing new features and integrations that use the database heavily (large numbers of inserts, heavy data manipulation, new data types). Having a backup of the database state from a previous version of the application can (and probably will) save you in the many scenarios where things didn't go as planned. Another clear use is compliance: in cases of cybercrime or legal disputes, it's extremely important to have the entire history of data movement available for investigations.

It's extremely common to see database backups done as plain-text SQL dumps or as binary formats. The SQL dump is popular mostly because many DBMSs generate it almost automatically with very little configuration, but it's far from ideal once you have a medium or large amount of data. The format is inefficient in storage (it takes up a lot of space), slow to restore (medium/large volumes can take hours, if not days), takes a significant amount of time to write, and doesn't integrate with other tools for reading (analysis/queries). For these reasons, we can disqualify the SQL dump as a backup option for robust databases. Binary formats solve part of these problems, such as storage size and write time, which are good reasons to use them, even though their integrability is worse than SQL dumps, since binaries are usually specific to a particular DBMS. Binaries are a bit harder to manage because they usually require more technical knowledge, but the gains are noticeable and in most cases worth it. Now, though, let's talk about another type of architecture.

With the modernization of areas such as data engineering, machine learning engineering, data science and related fields, new file formats and new tools for manipulating these files have emerged. Although they are directed to analytical environments and not exactly for backup purposes, some of these tools can be applied to this context to extract great value. Something that analytical environments and backups have in common is the low, or no, amount of data change operations, only large volumes of insertion. With this, we can use columnar format files that have advanced compression techniques that significantly reduce the size and time used to read and reinsert back into a transactional environment. 
Read my article Different file formats, a benchmark doing basic operations.

The parquet format is especially storage-efficient for transactional data with many duplicate values (parquet uses encodings that compress away duplication, reducing size). Writing parquet files is also not time-consuming; write times are usually similar to binary dumps, although it requires a bit more computational resource. Parquet files can be queried without a full restoration, which helps in scenarios where, with a binary dump, you would have to restore an entire version just to look up one specific piece of information. Although parquet stores the schema in its metadata, that metadata isn't compatible with most DBMSs, which is a problem if your backups aren't kept in sync with your migrations. This style of architecture, called the "modern data stack" in data engineering, requires more specific knowledge of certain tools and concepts, but it will probably produce great savings in the resources your backups consume and improve the quality of the process. In any case, changes that touch something as delicate as compliance should be widely discussed and studied before being adopted, no matter how much the numbers point to great improvements.
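One possible sketch of such an export, using DuckDB's postgres extension to snapshot a table into a compressed parquet file and query it later without restoring anything (the connection string, table, and file names are hypothetical):

```sql
-- Run inside DuckDB.
INSTALL postgres;
LOAD postgres;

-- Attach the transactional database in read-only mode.
ATTACH 'dbname=shop host=localhost user=backup_user' AS pg (TYPE POSTGRES, READ_ONLY);

-- Snapshot a table into a columnar, compressed backup file.
COPY (SELECT * FROM pg.public.customer_order)
TO 'customer_order_2024_11_05.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- Later: query the backup directly, no restore needed.
SELECT count(*) FROM read_parquet('customer_order_2024_11_05.parquet');
```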


Conclusion

Many of these topics could easily become an article of this size, or even larger, on their own, but the idea here is not to explain them in depth, dissect their anatomy, or detail how they work internally, but rather to give a reliable and solid introduction. The deeper knowledge needed for future implementations or changes can come from the documentation of your DBMS and supporting tools, from books on software architecture and application design, or from more specific articles.

If you want to know more about databases, I recommend reading my other article about the ACID principle. Transactions and the ACID principle, going a little deeper.

Thank you for your attention.
