Category Archives: Hardware

Massive connectivity issues

As you are probably aware, we’ve been having lots of network connectivity issues with all services hosted at Digital West in California (all of our projects, except ListenBrainz and AcousticBrainz).

Today we spent all morning trying to replace what we thought to be a faulty switch. That process didn’t go very well at all: we hit every conceivable issue that we could’ve hit. And a few more.

But, in this process we connected our gateway machines directly to our uplink (not through our switch) and the network issues persisted! After testing this setup with both of our machines, we’ve now conclusively eliminated all of our equipment as the possible source of trouble.

At this point our troubles lie in the hands of Digital West to fix. Thankfully the day staff will return to work in a few hours and hopefully we will make some progress on this issue then.

Sorry for all of this hassle. :(

State of the Onion: MetaBrainz

In the past few weeks we’ve been hit with several traffic increases to MusicBrainz, which are putting considerably more strain on our aging infrastructure than we’re happy with. If it seems that we’re not doing anything about it, that is because we’ve been busy behind the scenes trying to keep things moving forward. This sometimes doesn’t leave us a lot of time to keep the public informed about our work. Hopefully this blog post will fix that, at least in the short term:

In 2011 we started to make plans to move MusicBrainz hosting into the cloud, but then out of the blue we were donated a pile of machines. There were so many machines that I postponed the cloud plans and prepared the donated machines for service. That has carried us for 4+ years with almost no hardware cost, which was really great. The plan was to move to the cloud sometime around 2015, but then I spent most of 2014/2015 dealing with conflicts in the team, putting us seriously behind schedule while our hardware decayed.

On top of that, we’ve recently had some “bad luck”. Some disrespectful commercial customers hit us really hard and we had to find and block them. We have also had unexpected traffic spikes, and while trying to address them, two more machines failed on us. These were the donated machines that we kept in reserve for just this moment. The loss of two machines caught us short on capacity to handle the increased demands on our servers.

So, now we face the tough question: Do we buy expensive hardware that we might use for 6 months (~$5000) or do we try and save the money and tough it out? I’d rather not spend so much money on such short term use if we can avoid it. We’re going to try and move to a new hosting facility somewhere in the EU, since that is where most of our users are.

Moving to a new hosting facility has an incredible number of dependencies that Christina (our Biz Dev manager), Zas and I have been working through. It may not seem like we have a plan, but we do, and we’re incredibly busy trying to make the plan happen. To give you a taste of what we’re up against:

  1. We want to move our hosting to Europe and have a business presence in Europe in order to reduce the costs and inefficiencies of being a solely US based business. A lot of our traffic, customers and contractors are in the EU and it simply makes sense to have a presence here.
  2. To establish a presence in the EU I needed someone local to handle the business matters as well as to research and establish an EU organization. So I needed to find a Biz Dev manager, and that person is Christina.
  3. Once Christina was on board she researched which options suited us best. Getting that process moving involved getting certified documents from California, board approval for spending funds to establish the organization, EU labor law research (and we needed to swap a board member, too!), hiring help to establish the org., and generally navigating the Spanish bureaucracy. (See this only slightly exaggerated short film for some clues of our ordeal.)
  4. Once the org. had been established we needed to convince the bank to open a bank account for us. The draconian US banking laws extend worldwide and the local bank had to ensure that they were not opening themselves up to thousands of $$$ in accounting hassles just to allow a tiny non-profit to open a bank account. We finally have a bank account and have started paying our contractors with it!
  5. At the same time we’re also working to set up an office for the growing team here in Barcelona. That required a byzantine process, and signing the lease was only the start of it. Getting power, internet and water set up has taken a frustratingly long time. Had I known how long it would take, I would have stayed at my co-working space a while longer while addressing the hosting issues.
  6. While Christina has been focused on the hardcore paperwork, Zas is keeping the site running, which itself requires many heroics. Zas and I have started planning the move to the EU hosting provider. We’ve got a 5-page document that collects some of the open questions and requirements around this process. Right now Zas and Bitmap are here in Barcelona and we’re going to work on establishing a formal plan for moving to the new hosting company. We’re currently comparing hosting company offerings; see what we’ve collected so far if you care to follow along. The amount of work required to make this happen is making my head hurt. (A special shoutout to KodeStar, lead developer of, for providing a lot of useful feedback about our various options.)
  7. While Christina, Zas and I have our hands full, Bitmap and Gentlecat continue to release new features and work on the schema change. Not to mention all the contributions from Freso and Reosarevok to keep the community happy and polite while we deal with less than optimal site conditions. I am really happy with, and proud of, my team for keeping things running in these sub-optimal conditions.

This is just a snapshot of everything that is happening behind the scenes, all of which will culminate in moving to a new hosting company and being set up in the EU. And mind you, we’re doing this on a minuscule budget, trying to be careful of how we spend our money.

Postgres troubles resolved

I am glad to report that our problems are fixed and that our server is back to humming along nicely. The following is posted here so that, if some other souls find themselves in our situation, they may learn from our experience:

What we changed:

  1. It was pointed out that our max_connections of 500 was in fact insanely high, especially in light of using PgBouncer. Before we used PgBouncer we needed a lot more connections, and when we started using PgBouncer we never reduced this number.
  2. Our PgBouncer server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering that to 5 minutes.
  3. We reduced the number of active PgBouncer connections to the DB. (A query for keeping an eye on backend counts is sketched after this list.)
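
For anyone who wants to watch the same numbers on their own setup, a query along these lines shows how many backends exist and how many are idle, next to max_connections. This is an illustrative sketch (using the Postgres 9.1 pg_stat_activity column names), not part of our own tooling:

-- Count total and idle backends, alongside the configured connection limit.
SELECT count(*) AS total_backends,
       sum(CASE WHEN current_query = '<IDLE>' THEN 1 ELSE 0 END) AS idle_backends,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;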

What we learned:

  1. We had too many backends.
  2. The backends were being kept around for too long by PgBouncer.
  3. This caused too many idle backends to kick around. Once we exhausted physical RAM, we started swapping.
  4. Linux 3.2 apparently has some less than desirable swap behaviours. Once we started swapping, everything went nuts.

Going forward we’re going to upgrade our kernel the next time we have downtime for our site, and the rest should be sorted now.

Finally a word about Postgres itself:

Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not Postgres’ fault. In over 10 years of using Postgres, problems with our site have never been Postgres’ fault. Not once.

Thanks to everyone who helped us through this tough time!

Postgres troubles

(Regular readers of this blog, please ignore this post. We’re casting a wide net to try and find help for our problems.)

UPDATE: This problem has been resolved and all of our services are returning to their normally dubious service levels. For a technical explanation of what went wrong, see here.

Dear Postgres gurus:

We at MusicBrainz have been very happy Postgres users for over a decade now, and Postgres is something that gives us very few headaches compared to all the other things that we run. But last week we started having some really vexing issues with our server. Here is some back-story:

When our load spiked, we did the normal set of things that you do:

  • Checked for missing indexes and made some new ones; no change. (See details below.)
  • Checked for new traffic; none of our web front-end servers showed an increase in traffic.
  • Eliminated non-mission-critical uses of the DB server: stopped building indexes for search, turned off lower-priority sites. No change.
  • Reviewed the performance settings of the server, debating each setting as a team and tuning it. Tuning shared_buffers and work_mem has made the server better able to recover from spikes, but we still get massive periodic spikes. (A query for inspecting these settings is sketched after this list.)
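
For readers following along at home, the effective values of those settings can be checked from psql with something like the following. This is a generic sketch, not the exact query we used:

-- Show the current values of the settings we debated, and where each value comes from.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'max_connections', 'effective_cache_size');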

From a restart, everything is happy and working well. Postgres will use all available RAM for a while but stay out of swap, exactly what we want it to do. But then it tips the scales, digs into swap, and everything goes to hell. We’ve studied this post for quite some time and ran queries to understand how Postgres manages its RAM.

And sure enough, RAM usage just keeps increasing, and once we go beyond physical RAM, it goes into swap. Not rocket science. We’ve noticed that our backends keep growing in size. According to top, once we start having processes that are 10+% of RAM, we’re nearly on the cusp of entering swap. It happens predictably, time and time again. Selective use of pg_terminate_backend() on these large backends can keep us out of swap: a new, smaller backend gets created and RAM usage goes down. However, this is hardly a viable solution.
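
For the curious, that stop-gap looks roughly like this. It is purely illustrative (the PIDs we actually killed were picked by eyeballing top), and it uses the Postgres 9.1 column names, where the backend PID column is still called procpid:

-- List long-idle backends that are candidates for termination.
SELECT procpid, usename, client_addr, now() - backend_start AS connection_age
FROM pg_stat_activity
WHERE current_query = '<IDLE>'
  AND now() - query_start > interval '10 minutes';

-- Then, for a PID chosen from that list (or from top), e.g. a hypothetical 12345:
-- SELECT pg_terminate_backend(12345);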

We’re now on Postgres 9.1.15, and we have a lot of downstream users who also need to upgrade when we do, so this is something that we need to coordinate months in advance. Going to 9.4 is out in the short term. :( Ideally we can figure out what might be going wrong so we can fix it post-haste. MusicBrainz has been barely usable for the past few days. :(

One final thought: We have several tables from a previous version of the DB sitting in the public schema, not being used at all. We keep meaning to drop those tables but haven’t gotten around to it yet. Since these tables are not used at all, we assume that they should not impact the performance of Postgres. Might this be a problem?
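
(For completeness, a quick way to see how much space those leftover tables occupy, assuming they all live in the public schema as described, is a query along these lines:)

-- Size of tables in the public schema, largest first.
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;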

So, any tips or words of advice you have for us, would be deeply appreciated. And now for way too much information about our setup:


Postgres version: 9.1.15 (from Ubuntu packages)


Server:

  • Linux totoro 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • 48GB RAM
  • RAID 1,0 disks
  • PgBouncer in use
  • Running Postgres is its only task


postgresql.conf:

archive_command = '/bin/true'
archive_mode = 'on'
autovacuum = 'on'
checkpoint_segments = '128'
datestyle = 'iso, mdy'
default_statistics_target = '300'
default_text_search_config = 'pg_catalog.english'
data_directory = '/home/postgres/postgres9'
effective_cache_size = '30GB'
hot_standby = 'on'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
listen_addresses = '*'
log_destination = 'syslog'
log_line_prefix = '<%r %a %p>'
log_lock_waits = 'on'
log_min_duration_statement = '1000'
maintenance_work_mem = '64MB'
max_connections = '500'
max_prepared_transactions = '25'
max_wal_senders = '3'
custom_variable_classes = 'pg_stat_statements'
pg_stat_statements.max = '1000'
pg_stat_statements.save = 'off'
pg_stat_statements.track = 'top'
pg_stat_statements.track_utility = 'off'
shared_preload_libraries = 'pg_stat_statements,pg_amqp'
shared_buffers = '12GB'
silent_mode = 'on'
temp_buffers = '8MB'
track_activities = 'on'
track_counts = 'on'
wal_buffers = '16MB'
wal_keep_segments = '128'
wal_level = 'hot_standby'
wal_sync_method = 'fdatasync'
work_mem = '64MB'


PgBouncer database entry:

musicbrainz_db_20110516 = host= dbname=musicbrainz_db_20110516



Monitoring graphs (to see these, enter the anti-spam credentials: user “musicbrainz”, password “musicbrainz”):


Disk IO:;Template=1196376086.1393;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_diskstats-sda-count.rrd

RAM Use:;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-physicalmemory.rrd

Swap use:;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-swapspace.rrd



We ran the query from this suggestion to identify possible missing indexes.

This is our result:

Most of these tables are tiny and kept in RAM, and Postgres opts not to use any indexes we create on them, so no change.
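
The query itself came from the linked suggestion and isn’t reproduced here, but heuristics of this general shape, based on pg_stat_user_tables, are typical for spotting missing indexes:

-- Tables with many sequential scans and few index scans may be missing an index.
SELECT relname, seq_scan, idx_scan,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE seq_scan > coalesce(idx_scan, 0)
ORDER BY seq_scan DESC
LIMIT 20;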


  • Five months ago we doubled the RAM from 24GB to 48GB, but our traffic has not increased.
  • We’ve set vm.swappiness to 0 with no real change.
  • free -m:
             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470

OT: Toshiba USA service sucks. Don't buy their products!

Sorry for the off-topic post, but I feel that I need to speak up about the atrocious customer service I’ve gotten from Toshiba.

About a year ago we purchased three new portable hard drives that we use for backing up the MusicBrainz servers. These are used for off-site backups: every Monday when I am in town, I pedal to Digital West and swap out the backup disk. Should a bomb hit Digital West, we have an off-site backup that we can use to restore MusicBrainz. After about 3 months the first drive failed, and I promptly attempted to return it, but the site where you request an RMA number refused to recognize the drives as valid products that Toshiba supports. I periodically checked back to see if they would finally give me an RMA number. About 3 months ago, the system did give me an RMA number and I sent the drive in. Two weeks later nothing had happened; no replacement drive appeared.

I called Toshiba and no one knew where my drive was. Finally I got an email saying that I had sent the drive to a place that was no longer accepting drives, and that my drive was going to be returned to me. What? I filled out the forms and used their mailing label to send the package; how could this go wrong? I called them back asking them to *not* return the drive, but to actually forward it to the RMA place. Of course, no one could actually tell me what was going on. After three days of being escalated and talking to clueless idiots, I finally got someone who had a clue about what was going on. But that person was wholly unwilling to make an effort to make things right. My drive was sent back to me, but the address was written unclearly and it took an extra 2 weeks for UPS to actually get the drive back to me. (This person also said I should cut Toshiba USA some slack since they were a tiny organization. Ha! They have no idea what tiny really means!)

I finally have my broken drive back, and I’m out many hours of time and $12 in shipping costs. And this week the second drive died, and I have no patience for dealing with Toshiba again. I am going to recycle all of these drives, replace them with new ones, and be done with this. I want nothing to do with Toshiba again. Thanks, Toshiba: we’re out $300 and you suck.

This post serves as a public notice that Toshiba sucks and that you should not purchase anything from them, since they refuse to properly support their products.

Open source projects: Do you have servers that need RAM?

As part of our recent server donation, we’ve got a pile of 20 1GB ECC server RAM modules kicking around that we won’t put to use. Rather than let them go to waste, I would much rather send them to *your* open source project and have you use them. Before too long we will have some servers to donate as well. If you need ECC RAM or a server for your open source project, please leave a comment with the following information:

  • The name and URL of the project
  • The exact type of RAM you need, or what kind of server you need, and what you plan to use it for.

I’ll make a list of people interested in servers and I will attempt to match them up with servers as they become available.