Category Archives: Hardware

Final sprint towards NewHost

So tomorrow is the day when the old servers at Digital West will be put to rest. Our developers and system administrator have been hard at work over the last few weeks (and especially the last few days!) getting everything ready for a hopefully smooth transition to hosting everything at Hetzner in Germany.

The vast majority of our sites and services are in fact already being served from Germany, but the largest one—musicbrainz.org—remains in the US for another night.

Our transition of the final services, such as musicbrainz.org, has already started, but the very last sprint of moving the remaining bits over will start tomorrow, November 8th, at around 7 AM EST / noon UTC / 13:00 CET, and will follow the plan laid out in this Google document:
https://docs.google.com/document/d/1MgqZ4hiKC0MZJ400ZCOD8JlcjBjh4GNpakauf51YKQQ/edit?usp=sharing

Expect some downtime, plenty of read-only time, and general wonkiness as we get the last gears set in place to hopefully keep us running smoothly for the next several years to come!

Massive connectivity issues

As you are probably aware, we’ve been having lots of network connectivity issues with all services hosted at Digital West in California (all of our projects, except ListenBrainz and AcousticBrainz).

Today we spent all morning trying to replace what we thought to be a faulty switch. That process didn’t go very well at all – we hit every conceivable issue that we could’ve hit. And a few more.

But, in this process we connected our gateway machines directly to our uplink (not through our switch) and the network issues persisted! After testing this setup with both of our machines, we’ve now conclusively eliminated all of our equipment as the possible source of trouble.

At this point our troubles lie in the hands of Digital West to fix. Thankfully the day staff will return to work in a few hours and hopefully we will make some progress on this issue then.

Sorry for all of this hassle. 😦

State of the Onion: MetaBrainz

In the past few weeks we’ve been hit with several traffic increases to MusicBrainz which is putting considerably more strain on our aging infrastructure than we’re happy with. If it seems that we’re not doing anything about it, that is because we’ve been busy behind the scenes trying to keep things moving forward. This sometimes doesn’t leave us a lot of time to keep the public informed on our work. Hopefully this blog post will fix this in the short term:

In 2011 we started to make plans to move MusicBrainz hosting into the cloud, but then out of the blue we were donated a pile of machines. There were so many machines that I postponed the cloud plans and prepared the donated machines for service. That has carried us for 4+ years with almost no hardware cost, which was really great. The plan was to move to the cloud sometime around 2015, but then I spent most of 2014/2015 dealing with conflicts in the team, putting us seriously behind schedule while our hardware decayed.

On top of that, we’ve recently had some “bad luck”. Some disrespectful commercial customers hit us really hard and we had to find and block them. We have also had unexpected traffic spikes, and while trying to address them, two more machines failed on us. These were the donated machines that we kept in reserve for just this moment. The loss of two machines caught us short on capacity to handle the increased demands on our servers.

So, now we face the tough question: Do we buy expensive hardware that we might use for only 6 months (~$5000), or do we try to save the money and tough it out? I’d rather not spend so much money on such short-term use if we can avoid it. We’re going to try to move to a new hosting facility somewhere in the EU, since that is where most of our users are.

Moving to a new hosting facility has an incredible number of dependencies that Christina (our Biz Dev manager), Zas and I have been working through. It may not seem like we have a plan, but we do, and we’re incredibly busy trying to make the plan happen. To give you a taste of what we’re up against:

  1. We want to move our hosting to Europe and have a business presence in Europe in order to reduce the costs and inefficiencies of being a solely US-based business. A lot of our traffic, customers and contractors are in the EU, and it simply makes sense to have a presence here.
  2. To establish a presence in the EU I needed local help with the business matters as well as with researching and establishing an EU organization. So I needed to find a Biz Dev manager, and that person is Christina.
  3. Once Christina was on board, she researched our options to find what was best suited for us. Getting that process moving involved getting certified documents from California, board approval for spending funds to establish the organization, EU labor law research (and we needed to swap a board member, too!), hiring help to establish the org. and generally navigating the Spanish bureaucracy. (See this only slightly exaggerated short film for some clues about our ordeal.)
  4. Once the org. had been established we needed to convince the bank to open a bank account for us. The draconian US banking laws extend worldwide and the local bank had to ensure that they were not opening themselves up to thousands of $$$ in accounting hassles just to allow a tiny non-profit to open a bank account. We finally have a bank account and have started paying our contractors with it!
  5. At the same time we’re also working to set up an office for the growing team here in Barcelona. That requires a byzantine process that has barely started by the time you sign the lease. Getting power, internet and water set up has taken a frustratingly long time. Had I known how long it would take, I would have stayed at my co-working space for a while longer while addressing the hosting issues.
  6. While Christina has been focused on the hardcore paperwork, Zas is keeping the site running, which itself requires many heroics. Zas and I have started planning the move to the EU hosting provider. We’ve got a 5-page document that collects some of the open questions and requirements around this process: https://docs.google.com/document/d/16KNm4KksNwz29Opk1aILOMtCmPIeXFuxxUjMoPT3th0/edit#heading=h.dpfvoz1idcro. Right now Zas and Bitmap are here in Barcelona and we’re going to work on establishing a formal plan for moving to the new hosting company. We’re currently comparing hosting company offerings – see what we’ve collected so far if you care to follow along. The amount of work required to make this happen is making my head hurt. (A special shoutout to KodeStar, lead developer of FanArt.tv, for providing a lot of useful feedback about our various options.)
  7. While Christina, Zas and I have our hands full, Bitmap and Gentlecat continue to release new features and work on the schema change. Not to mention all the contributions from Freso and Reosarevok to keep the community happy and polite while we deal with less than optimal site conditions. I am really happy with, and proud of, my team for keeping things running under these conditions.

This is just a snapshot of everything happening behind the scenes, all of which will culminate in moving to a new hosting company and being set up in the EU. And mind you, we’re doing this on a minuscule budget, trying to be careful about how we spend our money.

Postgres troubles resolved

I am glad to report that our problems are fixed and that our server is back to humming along nicely. The following is posted here so that if some other souls find themselves in our situation, they may learn from our experience:

What we changed:

  1. It was pointed out that a max_connections of 500 was in fact insanely high, especially in light of using PgBouncer. Before we used PgBouncer we needed a lot more connections, and when we started using PgBouncer we never reduced this number.
  2. Our PgBouncer server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering it to 5 minutes.
  3. We reduced the number of active PgBouncer connections to the DB (see the sketch just after this list).
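
For anyone making a similar change, a quick way to sanity-check it from the Postgres side is to count how many backends PgBouncer is actually holding open and how many of them are sitting idle. This is only a minimal sketch for the 9.1 series we run (9.2 and later renamed these columns):

-- Count total and idle backends; on 9.1 an idle session reports
-- '<IDLE>' in current_query (9.2+ exposes a separate state column).
SELECT count(*) AS total_backends,
       sum(CASE WHEN current_query = '<IDLE>' THEN 1 ELSE 0 END) AS idle_backends
  FROM pg_stat_activity;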

What we learned:

  1. We had too many backends.
  2. The backends were being kept around for too long by PgBouncer.
  3. This caused too many idle backends to kick around. Once we exhausted physical RAM, we started swapping.
  4. Linux 3.2 apparently has some less than desirable swap behaviours. Once we started swapping, everything went nuts.

Going forward, we’re going to upgrade our kernel the next time we have downtime for the site; the rest should be sorted now.

Finally a word about Postgres itself:

Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not Postgres’ fault. In over 10 years of using Postgres, problems with our site have never been Postgres’ fault. Not once.

Thanks to everyone who helped us through this tough time!

Postgres troubles

(Regular readers of this blog, please ignore this post. We’re casting a wide net to try and find help for our problems.)

UPDATE: This problem has been resolved and all of our services are returning to their normally dubious service levels. For a technical explanation of what went wrong, see here.

Dear Postgres gurus:

We at MusicBrainz have been very happy Postgres users for over a decade now, and Postgres is something that gives us very few headaches compared to all the other things that we run. But last week we started having some really vexing issues with our server. Here is some back-story:

https://blog.musicbrainz.org/2015/03/14/hosting-issues-downtime-tonight/

When our load spiked, we did the normal set of things that you do:

  • Checked for missing indexes and made some new ones; no change. (See details below.)
  • Checked for new traffic; none of our web front end servers showed an increase in traffic.
  • Eliminated non-mission-critical uses of the DB server: stopped building indexes for search, turned off lower-priority sites. No change.
  • Reviewed the performance settings of the server, debating each setting as a team and tuning (see the sketch just after this list). Tuning shared_buffers and work_mem has made the server more resilient and better at recovering from spikes, but still, we get massive periodic spikes.
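
To give a concrete idea of what “review the performance settings” looks like, a query like the following shows the live values while we debate them. A minimal sketch, with the setting names limited to the ones mentioned in this post:

-- Show the live value, unit and origin of the settings under debate.
SELECT name, setting, unit, source
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'work_mem', 'max_connections',
                'effective_cache_size');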

From a restart, everything is happy and working well. Postgres will use all available RAM for a while but stay out of swap, exactly what we want it to do. But then it tips the scales, digs into swap, and everything goes to hell. We’ve studied this post for quite some time and run queries to understand how Postgres manages its RAM:

http://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/

And sure enough, RAM usage just keeps increasing, and once we go beyond physical RAM, it goes into swap. Not rocket science. We’ve noticed that our backends keep growing in size. According to top, once we start having processes that are 10+% of RAM, we’re on the cusp of entering swap. It happens predictably, time and time again. Selective use of pg_terminate_backend() on these large backends can keep us out of swap: a new, smaller backend gets created and RAM usage goes down. However, this is hardly a viable solution.
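
For the record, that stopgap looks roughly like this. A sketch only, written for 9.1 (where the pid column is still called procpid and idle sessions show '<IDLE>'); the pid in the second statement is just a placeholder:

-- List the oldest idle backends, which tend to be the bloated ones.
SELECT procpid, usename, backend_start, current_query
  FROM pg_stat_activity
 WHERE current_query = '<IDLE>'
 ORDER BY backend_start
 LIMIT 10;

-- Terminate one of them by pid (stopgap only; a new, smaller backend
-- gets created the next time a connection needs it).
SELECT pg_terminate_backend(12345);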

We’re now on Postgres 9.1.15, and we have a lot of downstream users who also need to upgrade when we do, so an upgrade is something we need to coordinate months in advance. Going to 9.4 is out in the short term. 😦 Ideally we can figure out what might be going wrong so we can fix it post-haste. MusicBrainz has been barely usable for the past few days. 😦

One final thought: We have several tables from a previous version of the DB sitting in the public schema, not being used at all. We keep meaning to drop those tables but haven’t gotten around to it yet. Since the tables are never touched, we assume they should not impact the performance of Postgres. Might this be a problem?
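
For reference, a query like the following would confirm that the leftover tables really are never scanned and show how much space they occupy. Just a sketch against the standard statistics views:

-- Size and scan counts for tables in the public schema; a table that
-- shows no sequential or index scans since the last stats reset is
-- effectively unused.
SELECT relname,
       seq_scan,
       idx_scan,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_stat_user_tables
 WHERE schemaname = 'public'
 ORDER BY pg_total_relation_size(relid) DESC
 LIMIT 20;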

So, any tips or words of advice you have for us would be deeply appreciated. And now for way too much information about our setup:

Postgres:

9.1.15 (from ubuntu packages)

Host:

  • Linux totoro 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • 48GB ram
  • RAID 1+0 disks
  • PgBouncer in use.
  • Running Postgres is its only task

postgresql.conf:

archive_command = '/bin/true'
archive_mode = 'on'
autovacuum = 'on'
checkpoint_segments = '128'
datestyle = 'iso, mdy'
default_statistics_target = '300'
default_text_search_config = 'pg_catalog.english'
data_directory = '/home/postgres/postgres9'
effective_cache_size = '30GB'
hot_standby = 'on'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
listen_addresses = '*'
log_destination = 'syslog'
log_line_prefix = '<%r %a %p>'
log_lock_waits = 'on'
log_min_duration_statement = '1000'
maintenance_work_mem = '64MB'
max_connections = '500'
max_prepared_transactions = '25'
max_wal_senders = '3'
custom_variable_classes = 'pg_stat_statements'
pg_stat_statements.max = '1000'
pg_stat_statements.save = 'off'
pg_stat_statements.track = 'top'
pg_stat_statements.track_utility = 'off'
shared_preload_libraries = 'pg_stat_statements,pg_amqp'
shared_buffers = '12GB'
silent_mode = 'on'
temp_buffers = '8MB'
track_activities = 'on'
track_counts = 'on'
wal_buffers = '16MB'
wal_keep_segments = '128'
wal_level = 'hot_standby'
wal_sync_method = 'fdatasync'
work_mem = '64MB'

pgbouncer.ini:

[databases]
musicbrainz_db_20110516 = host=127.0.0.1 dbname=musicbrainz_db_20110516

[pgbouncer]
pidfile=/home/postgres/postgres9/pgbouncer.pid
listen_addr=*
listen_port=6899
user=postgres
auth_file=/etc/pgbouncer/userlist.txt
auth_type=trust
pool_mode=session
min_pool_size=10
default_pool_size=320
reserve_pool_size=10
reserve_pool_timeout=1.0
idle_transaction_timeout=0
max_client_conn=400
log_connections=0
log_disconnections=0
stats_period=3600
stats_users=postgres
admin_users=musicbrainz_user

Monitoring:

To see these, enter the anti-spam credentials: user “musicbrainz”, password “musicbrainz”.

Load: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196205794.8081;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_load.rrd

Disk IO: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376086.1393;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_diskstats-sda-count.rrd

RAM Use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-physicalmemory.rrd

Swap use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-swapspace.rrd

Processes: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376477.1968;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_processes.rrd

Indexes:

We ran the query from this suggestion to identify possible missing indexes:

http://stackoverflow.com/questions/3318727/postgresql-index-usage-analysis

This is our result:

https://gist.github.com/mayhem/423b084043235fb78642

Most of these tables are tiny and kept in RAM. Postgres opts not to use any of the indexes we create for them, so no change.
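
Related to that, a query like the following can confirm whether a newly added index is ever touched; a sketch over the standard statistics views, where an index whose idx_scan never moves is one the planner is ignoring:

-- Per-index usage counts and sizes; idx_scan stays at 0 for indexes
-- the planner never uses.
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
 ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC
 LIMIT 20;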

UPDATES:

  • Five months ago we doubled the RAM from 24GB to 48GB, but our traffic has not increased.
  • We’ve set vm.swappiness to 0 with no real change.
  • free -m:
             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470

OT: Toshiba USA service sucks. Don't buy their products!

Sorry for the off-topic post, but I feel that I need to speak up about the atrocious customer service I’ve gotten from Toshiba.

About a year ago we purchased three new portable hard drives that we use for backing up the MusicBrainz servers. These are used for off-site back-ups; every Monday when I am in town, I pedal to Digital West and swap out the back-up disk. Should a bomb hit Digital West, we have an off-site backup that we can use to restore MusicBrainz. After about 3 months, the first drive failed and I promptly attempted to return it, but the site where you request an RMA number refused to recognize the drives as valid products that Toshiba supports. I periodically checked back to see if they would finally give me an RMA number. About 3 months ago, the system did give me an RMA number and I sent the drive in. Two weeks later nothing had happened; no replacement drive appeared.

I called Toshiba and no one knew where my drive was. Finally I got an email saying that I had sent the drive to a place that was no longer accepting drives and that my drive was going to be returned to me. What? I filled out the forms and used their mailing label to send the package; how could this go wrong? I called them back asking them not to return the drive, but to actually forward it to the RMA place. Of course, no one could actually tell me what was going on. After three days of being escalated and talking to clueless idiots, I finally got someone who had a clue about what was going on. But that person was wholly unwilling to make an effort to make things right. My drive was sent back to me, but the address was written unclearly and it took an extra 2 weeks for UPS to actually get the drive back to me. (This person also said I should cut Toshiba USA some slack since they were a tiny organization. Ha! They have no idea what tiny really means!)

I finally have my broken drive back and I’m out many hours of time and $12 in shipping costs. And this week the second drive died and I have no patience for dealing with Toshiba again. I am going to recycle all of these drives and replace them with new drives and be done with this. I want nothing to do with Toshiba again. Thanks Toshiba, we’re out $300 — you suck.

This post serves as a public notice that Toshiba sucks and that you should not purchase anything from them, since they refuse to properly support their products.