DEV Community

Adnan
Adnan

Posted on • Edited on

The Aftermath of a Database Catastrophe

Preface

It was just another day at work. We are a team of a dozen developers and content creators. As usual, we were working on new features. It's been a few months into my role. Me and my team lead (let's call him Rami) became pretty comfortable with each other's coding quality, I was a good developer. Until this happened ...

What Happened

I created a new Pull Request for a new version of our software, as usual. I wrote the changelog and the important changes.

Rami checked the Pull Request, and approved it after some manual testing.
Then, the Bitbucket Pipeline kicked in, and the database migration ran. Great, so much time saved with Continuous Deployment!

Fast forward a few minutes, we get reports of missing texts in some of our content. As an example, imagine all Dev posts chopped to the first 256 letters only ⚱️. Yikes.

Recovery

We used the most recent database backup to recover the data. The overall cost was around an hour of downtime.
The backup saved us, and the downtime was minimal. However, such downtime and data recovery process wouldn't be very easy when we reach real production scenarios where an hour of downtime could cause public outrage.

Prevention

Having a database backup is good, but recovering production data is not fun. The solution was simple and obvious: prevent it from happening.

Preventive Measures

To prevent this mistake from happening again, we made sure to add to our automated tests an assertion that makes sure that the target maximum text length can be reached after running the migration (on a test database!).

public function test_create_post() {
    $maxTitleLength = 100;
    $maxContentLength = 1000;
    $title = $faker->text($maxTitleLength);
    $content = $faker->text($maxContentLength);

    $post = Post::create([
        "title" => $title,
        "content" => $content
    ]);

    $this->assertTrue($post->title == $title);
    $this->assertTrue($post->content == $content);
}
Enter fullscreen mode Exit fullscreen mode

Thanks to this test case, should any future migration tamper with the varchar length, the test would fail and the problem won't propagate to production 🌈.

Finally, we also added this step as part of our PR review checklist:

  • ✅ Verify that proper test cases were written to prevent accidental data loss in migration scripts.

Top comments (0)