The end of the replication nightmare!

I’m pleased to report that our nightmare of finding/reconstructing the missing replication packets is finally over!

Through many heroic hours of work, Bitmap and Chirlu have reconstructed the missing replication packets. All clients should now be on their way to being up to date. We’ve learned a number of lessons (some good, some bad — that’s life, right?) in this ordeal and we hope to avoid these issues in the future.

An integral part of this recovery process were a number of people from our community who helped us: Users mbcz, rembo10 and xeam sent us their complete DB dumps! Bitmap used these to sanity check and diff several other database to finally extract the missing packets. Thank you for dropping what you were doing and sending us a few GB of data over blazingly fast connections. Without you this would not have been possible; and this is not an exaggeration. Thank you!

After some more rest we’re going to continue to put out smaller fires that remain from the move to NewHost, but for now, the big fires are put out. Just in time for the weekend!

In the 11 year history of the replication stream we’ve had to have users restart their stream about 3-4 times because of problems on our end. Zero would’ve been nicer, but I’m proud that we’ve been able to make this system work for so long. On a daily basis we seem to have about 400 replicated copies of MusicBrainz running all over the world. Clearly this part of our service is well used and I sleep a little better at night knowing that our most critical data is backed up across the globe.

Just for fun, here is a graph of the replication API usage over the last 6 months:

hourly-api-usage.png

Towards the end the graph shows the week plus long break, then a small blip as some of our replicas got unstuck yesterday and the much larger spike shows the rest of the replicas getting unstuck. Now, as to what caused the blip in mid-October — I have no idea.

Anyways, please accept my apologies for the replication stream outage and keep replicating!

Thanks!

3 thoughts on “The end of the replication nightmare!

  1. InvisibleMan78

    Keep up the great work! Thank you.
    Would you mind to share your “lessons learned”?

  2. ruaok Post author

    Mine personally:

    1. Review every step of the game plan with every team member.
    2. Identify very important steps.
    3. Add steps to verify the important steps
    4. I should’ve kept at least one drive for safe-keeping off each server, rather than immediately recycling them.

    Quite possibly #4 may never ever apply again. But still…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s