Category Archives: Hosting

The end of the replication nightmare!

I’m pleased to report that our nightmare of finding/reconstructing the missing replication packets is finally over!

Through many heroic hours of work, Bitmap and Chirlu have reconstructed the missing replication packets. All clients should now be on their way to being up to date. We’ve learned a number of lessons (some good, some bad — that’s life, right?) in this ordeal and we hope to avoid these issues in the future.

An integral part of this recovery process were a number of people from our community who helped us: Users mbcz, rembo10 and xeam sent us their complete DB dumps! Bitmap used these to sanity check and diff several other database to finally extract the missing packets. Thank you for dropping what you were doing and sending us a few GB of data over blazingly fast connections. Without you this would not have been possible; and this is not an exaggeration. Thank you!

After some more rest we’re going to continue to put out smaller fires that remain from the move to NewHost, but for now, the big fires are put out. Just in time for the weekend!

In the 11 year history of the replication stream we’ve had to have users restart their stream about 3-4 times because of problems on our end. Zero would’ve been nicer, but I’m proud that we’ve been able to make this system work for so long. On a daily basis we seem to have about 400 replicated copies of MusicBrainz running all over the world. Clearly this part of our service is well used and I sleep a little better at night knowing that our most critical data is backed up across the globe.

Just for fun, here is a graph of the replication API usage over the last 6 months:

hourly-api-usage.png

Towards the end the graph shows the week plus long break, then a small blip as some of our replicas got unstuck yesterday and the much larger spike shows the rest of the replicas getting unstuck. Now, as to what caused the blip in mid-October — I have no idea.

Anyways, please accept my apologies for the replication stream outage and keep replicating!

Thanks!

Replication update: Do you have a DB that could help us?

In my post from yesterday I talked about our continuing struggle to fix our replication stream. Overnight we learned two new things:

  1. The lengthy hard drive recovery process has failed and yielded no useful results.😦
  2. We have a working DB diff program in place that allows us to create missing replication packets from two DBs at known packet numbers.

#2 is a great step forward in fixing our replication stream, but we’re missing a specific replicated database at a specific point in time. If you have a replicated database, please read on and see if you can help us:

  1. Is your database at replication packet 99847? The way to find out your current replication sequence is to look in slave.log in your server directory or to issue the query
    select current_replication_sequence from replication_control;

    at the SQL prompt.

  2. Do you have a complete replicated MusicBrainz database, including the cover_art_archive schema?
  3. Are you willing to make a dump of the DB and send it to us?
  4. Do you have a fast internet connection to make #3 possible?

If you’ve answered yes to all of the above, please send email to support at metabrainz dot org.

Thanks!

 

Move to NewHost and Replication Update

It has been a long week since our move to the new hosting provider in Germany. Our move across the Atlantic worked out fairly well in the grand scheme of things. The new servers are performing well, the site is more stable and we have a modern infrastructure for most of our projects.

However, such moves are not without problems. While we didn’t encounter many problems, the most significant one we did encounter was the failure to copy two small replication packets off the old servers. We didn’t notice this problem until after the server in question had been decommissioned. Ooops.

And thus began a recovery effort that is almost worthy of a bad Hollywood B-movie plot. Between myself traveling and the team finishing the most critical migration bits, it took 2 days for us to realize the problem and find a volunteer to fetch the drives from the broken server. Only in a small and wealthy place such as San Luis Obispo, could a stack of recycled servers sit in an open container for 2 days and not be touched at all. My friend collected the drives and immediately noticed that the drives were damaged in the recycling process, which isn’t surprising. And we can consider ourselves really lucky that this drive didn’t contain private data — those drives have been physically destroyed!

Since then, my friend has been working with Linux disk recovery tools to try and recover the two replication packets off the drive. Given that he is working with a 1TB drive, this recovery process takes a while and must be fully completed before attempting to pull data off the drive. For now we wait.

At the same time, we’re actively cobbling together a method to regenerate the lost packets. In theory it is possible, but it involves heroic efforts of stupidity. And we’re expending that effort, but so far, it bears no fruit.

In the meantime, for all of the people who use our replicated (Live Data) feed — you have the following choices:

  1. If you need data updates flowing again as soon as possible, we strongly recommend importing a new data set. We have a new data dump and fresh replication packets being put out, so you can do this at any time you’re ready.
  2. If the need for updates is not urgent yet and you’d rather not reload the data, sit tight. We’re continuing our stupidly heroic efforts to recover the replication packets.
  3. Chocolate: It really makes everything better. It may not help with your data problems, but at least it takes the edge off.

We’re terribly sorry for the hassle in all of this! Our geek pride has been sufficiently dinged that our chocolate coping mechanisms will surely cause us to put on a pound or two.

Stay tuned!

UPDATE 1: The first recovery examination has not located the files, but my friend will do a second pass tomorrow and turn over file fragments to us that might allow us to recover files. But that won’t be for another 8 hours or so.

Final sprint towards NewHost

So tomorrow is the day when the old servers at Digital West will be put to rest. Our developers and system administrator have been hard at work over the last few weeks (and especially the last few days!) getting everything ready for a hopefully smooth transition to hosting everything at Hetzner in Germany.

The vast majority of our sites and services are in fact already being served from Germany, but the largest one—musicbrainz.org—remains in the US for another night.

Our transition of the final services, such as musicbrainz.org, has already started, but the very last sprint of moving the remaining bits over will start tomorrow, November 8th, at around 6 AM EST / noon UTC / 13:00 CET and will follow the plan laid out in this Google document:
https://docs.google.com/document/d/1MgqZ4hiKC0MZJ400ZCOD8JlcjBjh4GNpakauf51YKQQ/edit?usp=sharing

Expect some downtime, plenty of read-only time, and general wonkiness as we get the last gears set in place to hopefully keep us running smoothly for the next several years to come!

Massive connectivity issues

As you are probably aware, we’ve been having lots of network connectivity issues with all services hosted at Digital West in California (all of our projects, except ListenBrainz and AcousticBrainz).

Today we spent all morning trying to replace what we thought to be a faulty switch. That process didn’t go very well at all – we hit every conceivable issue that we could’ve hit. And a few more.

But, in this process we connected our gateway machines directly to our uplink (not through our switch) and the network issues persisted! After testing this setup with both of our machines, we’ve now conclusively eliminated all of our equipment as the possible source of trouble.

At this point our troubles lie in the hands of Digital West to fix. Thankfully the day staff will return to work in a few hours and hopefully we will make some progress on this issue then.

Sorry for all of this hassle.😦

Server capacity update

Zas and I have been working hard to improve the capacity and stability of the site. In the last week, we’ve identified and fixed at least 3 problems with the search servers and we’ve added a timeout function that times out queries that take longer than 3 seconds. We think that the main cause of trouble was that queries were piling up after a slow query ran too long and that the servers never recovered from that and consequently crashed.

We won’t go as far as saying that the search servers are fixed — every time we have a smidgen of hope that things are improving, they crash again. Seemingly out of spite! So, the search servers are better.😉

Zas has also made a number of changes to the gateways and how we rate limit our incoming traffic. The rate limiting is now being done in a smarter way that reduces the overall traffic on our web servers. Well done!

We’ve also increased our bandwidth budget by 4mbits per second, which makes the site feel considerably more responsive.

Let me put these improvement into numbers: About a week ago were were struggling to keep up 250 requests per second and the site felt very sluggish. Now we can handle 500 requests a second and the site feels considerably faster. For large chunks of the day we are managing to handle all the traffic we should handle. And, the search servers haven’t crashed in 4 days!

We hope that this will give us a solid base from which to release the scheme upgrade tomorrow. Then once that is complete, we will start work on moving to the new hosting company.

Thanks for being patient with us!

We’re actually really going to take the HTTPS plunge!

Closing in on three years after stating that “We’re going to take the HTTPS plunge!”, we’re actually really going to do it now.🙂

Most of our sites have forced HTTPS for some time (metabrainz.org, critiquebrainz.org, bookbrainz.org, listenbrainz.org), but there are still a couple of stragglers, notably musicbrainz.org and acousticbrainz.org.

For MusicBrainz, our beta site is now all HTTPS, web service and all. The main, non-beta musicbrainz.org will be going HTTPS-only except for what’s under /ws/ (ie., the web service) to allow taggers and other programs not currently using HTTPS some transition time. We do not currently have an ETA for when we will make the final jump to HTTPS-only on the MusicBrainz web service, as that partly depends on feedback from our web service users, which leads me to:

If you’re currently using the MusicBrainz web service, please try and switch your program to using beta.musicbrainz.org and see whether your program breaks or not and let us know the status of it. We are aware that some Python versions and MusicBrainz libraries do not support our setup, so while your program may fail now, it might simply be because of dependencies of your program not being updated yet and you might not need to do anything specifically on your end – however, some programs/libraries might need some updates, so the more people test and report back, the better we’ll be able to judge when we can go all-HTTPS-only on musicbrainz.org.

For AcousticBrainz, we now have a shiny new Let’s Encrypt certificate on https://acousticbrainz.org thanks to our systems administrator Zas! As a result, we are going to start redirecting all HTTP traffic to HTTPS on the AcousticBrainz website, including API queries.

In order to give everyone time to verify that their scripts correctly recognise and validate our Let’s Encrypt certificate, we are going to delay the redirect until July 1, 2016. On this date, any HTTP query will automatically be redirected to HTTPS. We will also enable HSTS, so that compliant browsers will redirect to HTTPS on the client-side.

If you have any questions about either the MusicBrainz or the AcousticBrainz transition, please ask.