Let's dive into the topic of data migration and database versioning in coding projects. When it comes to handling these aspects, there are various approaches and best practices. Share your perspective on how you tackle data migration and handle database versioning, considering factors like scalability, data integrity, and minimizing disruptions.
Top comments (3)
I'm the current maintainer of the Roundup Issue Tracker. This is a somewhat unique application in that it supports both NoSQL (key/value store) and SQL (SQLite, PostgreSQL, MariaDB/MySQL) primary databases. Also, the database is designed to be modified via a schema written in Python code.
The user can modify the database at will, adding/removing tables and relationships, and Roundup migrates the backend database to the new schema automatically. The current user-defined schema is stored in serialized form in the database; it is up to the tracker admin to track the addition/removal of fields using their favorite VCS on the tracker's `schema.py` file. The serialized schema also records the database version, which describes what relationships are defined and what data types are used under the hood. This changes with the Roundup Tracker software release. For example, a `String()` property defined in the schema could be a varchar, blob, or text column in the backend.
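For context, the `schema.py` file is plain Python that declares classes and their properties. The sketch below is simplified from the stock demo tracker; real trackers define more classes, so treat the exact properties as illustrative:

```python
# schema.py -- an illustrative Roundup-style schema (simplified).
# Roundup injects db, Class, IssueClass, String, Number, Link,
# Multilink, etc. into this file's namespace; no imports needed.

pri = Class(db, "priority",
            name=String(),   # backed by varchar/blob/text per backend
            order=Number())
pri.setkey("name")

user = Class(db, "user",
             username=String(),
             address=String())
user.setkey("username")

issue = IssueClass(db, "issue",
                   assignedto=Link("user"),   # link to a single user
                   nosy=Multilink("user"),    # many-to-many relationship
                   priority=Link("priority"))
```

Add or remove a property here and, as described above, Roundup alters the backend tables to match automatically.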
The CLI for the issue tracker includes a `migrate` command that changes the schema version when an upgrade is installed. It is responsible for running the ALTER commands on the db. If the database version does not match the version expected by the software, Roundup exits with an error. Next month's 2.3.0 release changes the schema: the underlying type used for storing an expiration timestamp changed to prevent truncation, which bumps the schema version from 7 to 8. A prior release bumped the version because it added indexes that were present in new installations but not in older, upgraded databases.
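Not Roundup's actual code, but the version gate it describes boils down to a pattern like this sketch (the version numbers, table, and ALTER statement are hypothetical stand-ins):

```python
EXPECTED_VERSION = 8  # e.g. bumped from 7 when the timestamp type changed

def check_db_version(stored_version: int) -> None:
    """At startup: refuse to run against an unexpected schema version."""
    if stored_version != EXPECTED_VERSION:
        raise SystemExit(
            f"database schema is version {stored_version}, expected "
            f"{EXPECTED_VERSION}; run the migrate command first")

def migrate(cursor, stored_version: int) -> None:
    """Apply each upgrade step in order, then record the new version."""
    if stored_version < 8:
        # hypothetical upgrade step: widen a timestamp to avoid truncation
        cursor.execute(
            "ALTER TABLE sessions ALTER COLUMN expires TYPE DOUBLE PRECISION")
    cursor.execute("UPDATE schema_meta SET version = ?", (EXPECTED_VERSION,))
```

The point is that the version check is cheap and runs on every start, while the ALTERs run exactly once, under an explicit admin command.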
The supported migration path is one-way (upgrade). You can manually downgrade the schema version if you need to go back to a prior version of the software. (The tests for the last three migrations roll back the migration changes as part of testing the forward migration.)
Data integrity is maintained by never deleting data by default. Instead, data is retired, so it's still available for record keeping, auditing, etc., but is hidden so it doesn't get reused. You can make the changes required for GDPR to remove users while still maintaining the referential integrity of the database, at the cost of losing auditing info.
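The retire-instead-of-delete idea is easy to picture as a flag column. A generic SQLite sketch (not Roundup's internals; the column name is chosen for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    __retired__ INTEGER NOT NULL DEFAULT 0)""")
con.execute("INSERT INTO user (name) VALUES ('alice')")

# "Delete" = retire: the row survives for record keeping and auditing,
# and the UNIQUE constraint still prevents the name from being reused.
con.execute("UPDATE user SET __retired__ = 1 WHERE name = 'alice'")

# Normal queries filter retired rows out, so the data is hidden.
print(con.execute("SELECT name FROM user WHERE __retired__ = 0").fetchall())
# -> []  (alice is retired, not gone)
```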
I've had a decent amount of pain over the years with this topic; it's not particularly easy when you choose tech stack layers that don't account for it as a first-class citizen in your application.
Database Migration - Be careful how you architect your database, as you can impede future efforts to migrate data between instances of your application. I'm stuck with a legacy app that is a significant revenue generator, but its data structure is so complex that we can't justify the resources (time/labor) required to build a migration tool.
Data Versioning - I once dealt with a database versioning situation where there was no data point indicating which version the database was actually at, and at that time we had nearly a hundred versioning scripts for the project. If you're dealing with versioning on your database, make sure it also tells you which version you're currently at.
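A minimal sketch of that advice: store the version in the database itself and apply only the scripts newer than it (all names here are hypothetical):

```python
import sqlite3

# Numbered migration scripts, applied in order exactly once each.
MIGRATIONS = {
    1: "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    2: "ALTER TABLE users ADD COLUMN email TEXT",
}

def migrate(con: sqlite3.Connection) -> None:
    con.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = con.execute(
        "SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version in sorted(MIGRATIONS):
        if version > current:
            con.execute(MIGRATIONS[version])
            con.execute("INSERT INTO schema_version VALUES (?)", (version,))
            con.commit()

con = sqlite3.connect(":memory:")
migrate(con)  # applies 1 and 2, recording each
migrate(con)  # no-op: the database knows it's already at version 2
```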
Scalability - Full Stack Developers should learn database normalization, even if they're working with NoSQL datasources. I've honestly been shocked at the lack of knowledge among developers in this area over the last decade; it's something I took for granted as part of my college courses in the 2000s. These fundamentals matter because neglecting them will impede scalability - likely after the engineers who wrote the non-optimized code are gone.
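For anyone who skipped that course, the core idea fits in a couple of tables; a hypothetical before/after in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Denormalized: customer name/email repeated on every order row.
# Updating an email means touching every order, and a typo quietly
# creates "two customers" who are really one.
con.execute("""CREATE TABLE orders_denormalized (
    id INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_email TEXT,
    total_cents INTEGER)""")

# Normalized: each customer fact lives in exactly one row, referenced
# by key, so orders scale without duplicating (or corrupting) customer data.
con.execute("""CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT UNIQUE)""")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_cents INTEGER)""")
```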
Data Integrity - Privacy regulations like GDPR have a direct impact on data integrity, as you may not be able to simply purge data depending on the context of your application. This will likely be more prevalent in B2B or team-based software, where a data point may be primarily owned by one user while other users also have a right to it. Thankfully, anonymizing PII is a legal approach to addressing regulatory requirements, so you don't have to destroy relational integrity in your system.
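In practice that can be as simple as overwriting the PII columns in place, so every foreign key to the user row still resolves. A sketch with hypothetical table/column names:

```python
import sqlite3

def anonymize_user(con: sqlite3.Connection, user_id: int) -> None:
    """Scrub PII but keep the row, preserving referential integrity."""
    con.execute(
        """UPDATE users
           SET name = 'deleted-user-' || id, email = NULL, phone = NULL
           WHERE id = ?""",
        (user_id,))
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, "
            "name TEXT, email TEXT, phone TEXT)")
con.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com', '555-0100')")
anonymize_user(con, 1)
print(con.execute("SELECT * FROM users").fetchone())
# -> (1, 'deleted-user-1', None, None)
```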
Minimizing disruptions - Test, test, and test some more, and do your best to keep your prod instance as similar to your non-prod instances as possible; they can drift if your team isn't on top of it, which will create periodic disruptions.
For those who need to migrate single users between complex database schemas - you can try my UpRooted project. It lets you extract the data tree starting from a given row in a given table and feed it into another database or save it to a file. Currently MySQL and PostgreSQL are supported, but any other relational engine can be added easily.
It's very useful on major DB upgrades, to avoid big-bang deploys of all users at once, and for shard balancing after hardware purchases.
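UpRooted's own API aside, the underlying idea can be sketched in a few lines of Python: start from one row and recursively follow foreign keys to collect its whole tree. This is a conceptual illustration only (it assumes every table has an integer `id` primary key), not UpRooted's code:

```python
import sqlite3

def extract_tree(con, table, pk, seen=None):
    """Collect (table, row) pairs for a row and everything referencing it."""
    seen = seen if seen is not None else set()
    if (table, pk) in seen:
        return []
    seen.add((table, pk))
    row = con.execute(f"SELECT * FROM {table} WHERE id = ?", (pk,)).fetchone()
    rows = [(table, row)]
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for child in tables:
        # PRAGMA row format: (id, seq, referenced_table, from_col, to_col, ...)
        for fk in con.execute(f"PRAGMA foreign_key_list({child})").fetchall():
            if fk[2] == table:
                ids = con.execute(
                    f"SELECT id FROM {child} WHERE {fk[3]} = ?", (pk,)).fetchall()
                for (child_id,) in ids:
                    rows += extract_tree(con, child, child_id, seen)
    return rows

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id));
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1);
""")
print(extract_tree(con, "users", 1))
# -> [('users', (1, 'alice')), ('orders', (10, 1))]
```

The rows collected this way can then be replayed as INSERTs into another database or serialized to a file, which is the per-user migration workflow described above.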