Prerequisites
We have a Rails application and with time some tables in our Database (DB) became too big. For example, one of them is 2.7Tb, another one which is in use with high frequency has a size 5.5Tb. This fact with some not optimal structure of DB brought us problems with operating data. We had implemented the process for deleting old data from the DB but when it runs it is impossible to do everyday work with data — the application is struggling a lot.
We had decided to cut some tables totally (if be more precisely leave just data but 3 months) and rest copy totally. The main question was how can we do it without long downtime. Even if we cut the biggest tables it would be around 500Gb.
Initial idea
The difficult part of the process was partially copying data from the same tables. The simple way which goes to mind is to stop the application, delete data, run optimize table
. But with our size, it takes too long. So another brave idea was nominated: create a replica, initialize it only with tables which we need fully, and start replicating from scratch those which we want to reduce. “Hm… sounds cool”: thought we, the man who had met replication only in the format press button create a new replica in AWS
.
So we’ve created a new instance in AWS RDS, copied the schema of the database, took Percona pt-table-sync (https://www.percona.com/doc/percona-toolkit/LATEST/pt-table-sync.html) for copying the tables which we wanted to have fully, and setup external master for replication with https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/mysql_rds_set_external_master.html this function (we used this because the standard way for making replication on AWS RDS makes snapshots).
We’d started with a staging environment and it looked like a working solution. However, Percona recommends using replication for copying data we used direct mode because pt-table-sync
tries to set up replication mode to STATEMENT
but RDS doesn’t allow it from SQL command.
Problems
It’s time to try the same on the production. We’ve made all the same steps, run the replication but at the same time got the replication error: 1032 HA_ERR_KEY_NOT_FOUND. Googling and reading documentation gave us the view that a table has no record which should be updated. You can say: but you’ve said that worked for staging. Yes, it worked sometimes but we didn’t pay attention to this after a long time and also environments have some differences (not all background jobs work on staging and loading much less).
But these errors happened on tables which we wanted to reduce, thus we can ignore it — we thought. MySQL allows you to ignore any errors by simply setting a list of these errors in the config file:
slave-skip-errors=1032,1062,1146
Cool, let’s do it. But wait, we are using RDS and don’t have the ability to change MySQL settings directly. Even more, you cannot do it through a parameters group. What AWS allows you to do is skip just one error at a time — we had thousands.
Game is up? No! Engineers never give up we’ve made from steel.
Final solution
We appreciate the care of AWS about us through limitations. If we cannot do it on RDS what if set up our own MySQL server — no problem. We’ve run a new EC2 instance, installed the MySQL — here we can do all that we want!
So the idea was: set up a replication process to the main DB and ignore all those errors that interrupted us on the RDS instance. After 3 months of replication, we will be ready for using this database. But we don’t want to use a standalone server, we want to continue to use RDS. For this step we considered two possibilities: 1) use mysqldump and setup replication on a new RDS instance from the same point at which we make dump 2) use AWS DMS service.
Look to the process more details:
- Setup standalone MySQL server
- Make a clone of the source database through the process of creating a Read replica on RDS
- Stop replication on this replica (remember the point in binary log when replication will stop)
- Dump database with specific tables
- Load dump to the database on the standalone server.
- Setup the replication process from the point at which we stop it on step 3.
- When we have enough data to switch to the new DB — stop replication on a standalone server (also need to remember the point of the replication process).
- Initialize a new RDS instance with data from the standalone server.
- Setup replication at the same point but for the main server for catching up on new data which can be written while we load data to a new RDS instance.
- Switch the application to the new database.
Conclusion
It is not a proper way of using replication, thus it can bring some troubles during the process. Some additional steps were involved because of the limitations of MySQL.
But the main outcome: try to not allow DB to go in this direction when you cannot work with it smoothly.
Top comments (0)