Category Archives: General

Laurent Monin joins the team as a part-time sysadmin

For the first time in a number of years, we have a person responsible for system administration! Over the past few years we’ve been trying to spread the duties of maintaining our servers among our developers, but that only worked so well: duties were piling up and not being attended to.

With the introduction of our new MetaBrainz site in May, we finally have an increasing revenue stream, which allows us to hire a paid sysadmin. Hopefully we can now start working through our backlog of tasks.

Laurent Monin (aka Zas) is no stranger to our project — he has been hacking on Picard for a number of years and he attended last year’s summit in Copenhagen. I’m quite happy to have found a community member and long-standing contributor to take on this task.

Some of Laurent’s first tasks come directly from feedback on our blog series about community improvements. We’re hoping to consolidate our mailing lists and forums into a Discourse instance, then provide single sign-on for Discourse, our wiki and Jira: things we’ve talked about for years but never made any progress on.

I’m quite excited to have Zas on board! Welcome!

A positive outlook going forward

My next installment on MusicBrainz management changes focuses on how we should frame our discussions going forward. Currently there is a lot of animosity and finger-pointing in our community; neither is constructive for moving forward, so I will aim to cut both short and focus on fixing rather than blaming.

I’d like to offer an analogy to start this discussion: when two people are in a personal relationship and that relationship starts falling apart, a lot of negative feelings come up. The two people will often blame each other, each convinced that the other is the reason for all of their troubles. If you’ve ever had a chance to talk to two people in a failing relationship, you’ll probably have seen that the failure is usually the fault of both. I’ve yet to find a relationship that failed solely through the actions of one person; both people were involved, both had a hand in it.

That said, I’ll step forward and say it: I am guilty. I am partially to blame for what is going on. Go ahead, feel free to blame me for the troubles we’re facing.

But that is it. Basta! We’re not going to engage in finding every little thing that was done wrong and by whom, working hard to lay blame. That is pointless and it stirs up unnecessary emotions. Instead of laying blame, we’re going to find solutions to our problems and we’re going to move forward.

As part of restructuring MusicBrainz, I’m going to be asking everyone what problems they perceive with the project right now. I will listen to the problems, catalog them and attempt to build a plan for tackling them in the future. However, I will insist that problems are stated without aggressive communication (including passive-aggressive communication) and without value judgements. If you cannot state your issue without being aggressive or disrespectful, you can count on me calling you on your behaviour, and I will not address problems that are stated in an aggressive or disrespectful manner.

For instance, it is not acceptable to say: “I don’t think that anyone is going to listen to me anyway, but I think that because of Joe’s idiotic decision to not allow white space in code, all of our code is a freaking mess; this was the worst idea ever!” This statement uses passive-aggressive communication, lays blame and contains a value judgement. One way to express the same concern constructively could be: “The decision to exclude whitespace from our code has made it harder for people to follow our code. We should reconsider this decision.”

This way of expressing problems, ideas and solutions allows us to focus our energy on moving forward and improving the project, and it avoids painful discussions that won’t give us much insight. As we work to mend our community, I will be relying on these communication tools heavily. If you run afoul of these new communication guidelines, expect me to remind you of this blog post. :)

Postgres troubles resolved

I am glad to report that our problems are fixed and that our server is back to humming along nicely. The following is posted here so that other souls who find themselves in our situation may learn from our experience:

What we changed:

  1. It was pointed out that our max_connections setting of 500 was in fact insanely high, especially in light of using PgBouncer. Before PgBouncer we needed far more connections, and when we started using it we never reduced this number.
  2. Our PgBouncer server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering it to 5 minutes.
  3. We reduced the number of PgBouncer active connections to the DB. (A sketch of these settings follows this list.)
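
For anyone wanting to replicate this, the knobs in question live in pgbouncer.ini and postgresql.conf. The values below are illustrative, not our exact production numbers:

; pgbouncer.ini (illustrative values)
[pgbouncer]
server_lifetime = 300       ; close server connections after 5 minutes (was 3600)
default_pool_size = 20      ; fewer active server connections per database/user pair

# postgresql.conf: with PgBouncer multiplexing clients,
# max_connections no longer needs to cover every client
max_connections = 100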

What we learned:

  1. We had too many backends.
  2. The backends were being kept around for too long by PgBouncer.
  3. This caused too many idle backends to hang around (a query for spotting them is sketched after this list). Once we exhausted physical RAM, we started swapping.
  4. Linux 3.2 apparently has some less-than-desirable swap behaviours. Once we started swapping, everything went nuts.
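
For anyone in a similar spot, counting idle backends is straightforward. A sketch for Postgres 9.1, where idle backends report '<IDLE>' in current_query (on 9.2 and later, filter on the state column instead):

-- count idle backends per database (9.1 syntax)
SELECT datname, count(*) AS idle_backends
FROM pg_stat_activity
WHERE current_query = '<IDLE>'
GROUP BY datname
ORDER BY idle_backends DESC;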

Going forward, we’re going to upgrade our kernel the next time we have downtime for our site; the rest should be sorted now.

Finally a word about Postgres itself:

Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not Postgres’ fault. In over 10 years of using Postgres, problems with our site have never been Postgres’ fault. Not once.

Thanks to everyone who helped us through this tough time!

Postgres troubles

(Regular readers of this blog, please ignore this post. We’re casting a wide net to try and find help for our problems.)

UPDATE: This problem has been resolved and all of our services are returning to their normally dubious service levels. For a technical explanation of what went wrong, see here.

Dear Postgres gurus:

We at MusicBrainz have been very happy Postgres users for over a decade now, and Postgres gives us very few headaches compared to all the other things that we run. But last week we started having some really vexing issues with our server. Here is some back-story:

When our load spiked, we did the normal set of things that you do:

  • Checked for missing indexes and made some new ones; no change. (See details below.)
  • Examined traffic; none of our web front-end servers showed an increase.
  • Eliminated non-mission-critical uses of the DB server: stopped building indexes for search, turned off lower-priority sites. No change.
  • Reviewed the performance settings of the server, debating each setting as a team and tuning. shared_buffers and work_mem tuning has made the server more resilient to spikes, but we still get massive periodic ones.

From a restart, everything is happy and working well. Postgres will use all available RAM for a while but stay out of swap, which is exactly what we want it to do. But then it tips the scales, digs into swap and everything goes to hell. We’ve studied this post for quite some time and have run queries to understand how Postgres manages its RAM:
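
The queries were along these lines; a sketch that assumes the pg_buffercache extension is installed, not our exact query:

-- which relations occupy the most space in shared_buffers
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;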

And sure enough, RAM usage just keeps increasing, and once we go beyond physical RAM, we go into swap. Not rocket science. We’ve noticed that our backends keep growing in size: according to top, once we have processes using 10+% of RAM, we’re nearly on the cusp of entering swap. It happens predictably, time and time again. Selective use of pg_terminate_backend() on these large backends can keep us out of swap; a new, smaller backend gets created and RAM usage goes down. However, this is hardly a viable solution.
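
For the record, that stop-gap looks something like this: spot the bloated backend’s PID in top, then, as a superuser, terminate it (the PID below is a placeholder):

-- terminate one oversized backend; 12345 is a placeholder PID taken from top
SELECT pg_terminate_backend(12345);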

We’re now on Postgres 9.1.15, and we have a lot of downstream users who also need to upgrade when we do, so this is something that we need to coordinate months in advance. Going to 9.4 is out in the short term. :( Ideally we can figure out what might be going wrong so we can fix it post-haste. MusicBrainz has been barely usable for the past few days. :(

One final thought: we have several tables from a previous version of the DB sitting in the public schema, not being used at all. We keep meaning to drop those tables, but haven’t gotten around to it yet. Since the tables are not being used at all, we assume they should not impact the performance of Postgres. Might this be a problem?
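
In case it helps anyone advising us, this is the kind of sanity check we’d run on those tables (a sketch) to see their on-disk size and whether they are ever touched:

-- size and scan counts for leftover tables; zero scans supports
-- the assumption that they are unused
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;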

So, any tips or words of advice you have for us would be deeply appreciated. And now for way too much information about our setup:


Postgres version: 9.1.15 (from Ubuntu packages)


  • Linux totoro 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • 48GB RAM
  • RAID 1,0 disks
  • PgBouncer in use
  • Running Postgres is its only task


postgresql.conf:

archive_command = '/bin/true'
archive_mode = 'on'
autovacuum = 'on'
checkpoint_segments = '128'
datestyle = 'iso, mdy'
default_statistics_target = '300'
default_text_search_config = 'pg_catalog.english'
data_directory = '/home/postgres/postgres9'
effective_cache_size = '30GB'
hot_standby = 'on'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
listen_addresses = '*'
log_destination = 'syslog'
log_line_prefix = '<%r %a %p>'
log_lock_waits = 'on'
log_min_duration_statement = '1000'
maintenance_work_mem = '64MB'
max_connections = '500'
max_prepared_transactions = '25'
max_wal_senders = '3'
custom_variable_classes = 'pg_stat_statements'
pg_stat_statements.max = '1000'
pg_stat_statements.save = 'off'
pg_stat_statements.track = 'top'
pg_stat_statements.track_utility = 'off'
shared_preload_libraries = 'pg_stat_statements,pg_amqp'
shared_buffers = '12GB'
silent_mode = 'on'
temp_buffers = '8MB'
track_activities = 'on'
track_counts = 'on'
wal_buffers = '16MB'
wal_keep_segments = '128'
wal_level = 'hot_standby'
wal_sync_method = 'fdatasync'
work_mem = '64MB'
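
A back-of-the-envelope check on these numbers (our own rough arithmetic, not a measured figure): work_mem can be allocated per sort or hash operation per backend, so 500 connections at 64MB each could in the worst case claim on the order of 500 × 64MB = 32GB, on top of 12GB of shared_buffers, on a 48GB machine. That lines up with max_connections being called out as far too high in the resolution post above.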


PgBouncer config:

musicbrainz_db_20110516 = host= dbname=musicbrainz_db_20110516



To see these graphs, enter these anti-spam credentials: user “musicbrainz”, password “musicbrainz”.

Disk IO: [MRTG graph: totoro diskstats, sda count]

RAM use: [MRTG graph: totoro physical memory]

Swap use: [MRTG graph: totoro swap space]



We ran the query from this suggestion to identify possible missing indexes; this is our result. Most of these tables are tiny and kept in RAM, and Postgres opts not to use any indexes we create on them, so no change.
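
For reference, the suggestion boils down to a heuristic along these lines (a common formulation, not necessarily the exact query we ran):

-- tables with heavy sequential-scan traffic relative to index scans,
-- a rough signal for missing indexes
SELECT relname, seq_scan, seq_tup_read, idx_scan,
       seq_tup_read / seq_scan AS avg_tuples_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 25;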


  • Five months ago we doubled the RAM from 24GB to 48GB, but our traffic has not increased.
  • We’ve set vm.swappiness to 0 with no real change.
  • free -m:
             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470

Service downtime to fix some database issues

This Friday we’re going to need to take 15-20 minutes of downtime to fix a few leftover issues from our recent schema change. We tried to do this without taking the site down, but the service got progressively slower, so we’re electing to take some downtime.

We’ll be down shortly after Noon PST, 3PM EST, 20:00 UK, 21:00 CET for about 15-20 minutes.

Sorry for the hassles this causes.

Downtime for fall schema change

Our next schema change version will be released on Monday, 17 November, 2014 around Noon PST/3pm EST/20:00 GMT/21:00 CET. We expect that MusicBrainz will be unavailable for 30 – 60 minutes during this time. We will put up the downtime notification on the site and tweet from @musicbrainz right before the release.

Sadly, our backup database server suffered a hardware failure and we ran out of time to get a replicated database set up after the hardware was fixed. This means that we won’t be able to put the site into read-only mode, and we will have to take full downtime instead.

It sucks and we’re not happy about it either, but there is only so much we can accomplish with our limited resources. :(

Sorry for any troubles this may cause you.