Category Archives: Hardware

Postgres troubles resolved

I am glad to report that our problems are fixed and that our server is back to humming along nicely. The following is posted here so that if other souls find themselves in our situation, they may learn from our experience:

What we changed:

  1. It was pointed out that a max_connections of 500 was in fact insanely high, especially in light of using PgBouncer. Before we used PgBouncer we needed a lot more connections, and when we started using PgBouncer we never reduced this number.
  2. Our PgBouncer server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering that to 5 minutes.
  3. We reduced the number of active PgBouncer connections to the DB (sketched below).
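
For anyone comparing notes, here is a minimal sketch of what those changes amount to in the two config files. The 300-second server_lifetime is the 5-minute value Josh suggested; the other two numbers are purely illustrative, since the right values depend on your workload and pooling setup:

# postgresql.conf: illustrative value, well below the old 500
max_connections = '150'

; pgbouncer.ini: recycle server connections after 5 minutes instead of an hour
server_lifetime=300
; pgbouncer.ini: hold fewer server connections open to the DB (illustrative; ours was 320)
default_pool_size=160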

What we learned:

  1. We had too many backends.
  2. The backends were being kept around for too long by PgBouncer.
  3. This caused too many idle backends to kick around. Once we exhausted physical RAM, we started swapping.
  4. Linux 3.2 apparently has some less-than-desirable swap behaviours. Once we started swapping, everything went nuts.
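
To make the idle-backend point concrete, a query along these lines (written for 9.1, where idle sessions report '<IDLE>' as their current_query) is a rough sketch of how to watch them pile up; it is not the exact query we ran:

-- Count idle backends and find the oldest one (PostgreSQL 9.1 flavour)
SELECT count(*)           AS idle_backends,
       min(backend_start) AS oldest_backend
FROM pg_stat_activity
WHERE current_query = '<IDLE>';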

Going forward, we're going to upgrade our kernel the next time we have downtime for our site; the rest should be sorted now.

Finally a word about Postgres itself:

Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not Postgres’ fault. In over 10 years of using Postgres, problems with our site have never been Postgres’ fault. Not once.

Thanks to everyone who helped us through this tough time!

Postgres troubles

(Regular readers of this blog, please ignore this post. We’re casting a wide net to try and find help for our problems.)

UPDATE: This problem has been resolved and all of our services are returning to their normally dubious service levels. For a technical explanation of what went wrong, see here.

Dear Postgres gurus:

We at MusicBrainz have been very happy Postgres users for over a decade now, and Postgres is something that gives us very few headaches compared to all the other things that we run. But last week we started having some really vexing issues with our server. Here is some back-story:

http://blog.musicbrainz.org/2015/03/14/hosting-issues-downtime-tonight/

When our load spiked, we did the normal set of things that you do:

  • Checked for missing indexes and made some new ones; no change. (See details below.)
  • Looked for new traffic; none of our web front-end servers showed an increase.
  • Eliminated non-mission-critical uses of the DB server: stopped building indexes for search, turned off lower-priority sites. No change.
  • Reviewed the performance settings of the server, debating each setting as a team and tuning as we went. Tuning shared_buffers and work_mem has made the server more resilient and quicker to recover from spikes, but we still get massive periodic spikes.

From a restart, everything is happy and working well. Postgres will use all available RAM for a while but stay out of swap, which is exactly what we want it to do. But then it tips the scales, digs into swap and everything goes to hell. We've studied this post for quite some time and have run queries to understand how Postgres manages its RAM:

http://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/

And sure enough, RAM usage just keeps increasing, and once we go beyond physical RAM, it goes into swap. Not rocket science. We've noticed that our backends keep growing in size. According to top, once we start having processes that are 10+% of RAM, we're on the cusp of entering swap. It happens predictably, time and time again. Selective use of pg_terminate_backend() on these large backends can keep us out of swap: a new, smaller backend gets created and RAM usage goes down. However, this is hardly a viable solution.
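
For the curious, the selective termination is roughly this dance: grab the PID of a bloated backend from top's RES column, double-check what it is in pg_stat_activity, and then terminate it. The PID below is made up for illustration:

-- Inspect the backend behind a PID seen in top (12345 is a made-up example)
SELECT procpid, usename, backend_start, current_query
FROM pg_stat_activity
WHERE procpid = 12345;

-- Terminate it; a fresh, smaller backend takes its place
SELECT pg_terminate_backend(12345);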

We’re now on Postgres 9.1.15, and we have a lot of downstream users who also need to upgrade when we do, so this is something that we need to coordinate months in advance. Going to 9.4 is out in the short term. :( Ideally we can figure out what might be going wrong so we can fix it post-haste. MusicBrainz has been barely usable for the past few days. :(

One final thought: we have several tables from a previous version of the DB sitting in the public schema, not being used at all. We keep meaning to drop those tables but haven't gotten around to it yet. Since they are never touched, we assume they should not impact the performance of Postgres. Might this be a problem?
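
For anyone wondering the same thing about their own setup, a query of this shape lists the ordinary tables in the public schema with their on-disk size, which at least shows how much disk (as opposed to RAM) they occupy:

-- On-disk size of ordinary tables in the public schema, largest first
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC;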

So, any tips or words of advice you have for us would be deeply appreciated. And now for way too much information about our setup:

Postgres:

9.1.15 (from Ubuntu packages)

Host:

  • Linux totoro 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • 48GB ram
  • RAID 1+0 disks
  • PgBouncer in use.
  • Running Postgres is its only task

postgresql.conf:

archive_command = '/bin/true'
archive_mode = 'on'
autovacuum = 'on'
checkpoint_segments = '128'
datestyle = 'iso, mdy'
default_statistics_target = '300'
default_text_search_config = 'pg_catalog.english'
data_directory = '/home/postgres/postgres9'
effective_cache_size = '30GB'
hot_standby = 'on'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
listen_addresses = '*'
log_destination = 'syslog'
log_line_prefix = '<%r %a %p>'
log_lock_waits = 'on'
log_min_duration_statement = '1000'
maintenance_work_mem = '64MB'
max_connections = '500'
max_prepared_transactions = '25'
max_wal_senders = '3'
custom_variable_classes = 'pg_stat_statements'
pg_stat_statements.max = '1000'
pg_stat_statements.save = 'off'
pg_stat_statements.track = 'top'
pg_stat_statements.track_utility = 'off'
shared_preload_libraries = 'pg_stat_statements,pg_amqp'
shared_buffers = '12GB'
silent_mode = 'on'
temp_buffers = '8MB'
track_activities = 'on'
track_counts = 'on'
wal_buffers = '16MB'
wal_keep_segments = '128'
wal_level = 'hot_standby'
wal_sync_method = 'fdatasync'
work_mem = '64MB'

pgbouncer.ini:

[databases]
musicbrainz_db_20110516 = host=127.0.0.1 dbname=musicbrainz_db_20110516

[pgbouncer]
pidfile=/home/postgres/postgres9/pgbouncer.pid
listen_addr=*
listen_port=6899
user=postgres
auth_file=/etc/pgbouncer/userlist.txt
auth_type=trust
pool_mode=session
min_pool_size=10
default_pool_size=320
reserve_pool_size=10
reserve_pool_timeout=1.0
idle_transaction_timeout=0
max_client_conn=400
log_connections=0
log_disconnections=0
stats_period=3600
stats_users=postgres
admin_users=musicbrainz_user

Monitoring:

To see these, enter the anti-spam credentials: user “musicbrainz”, password “musicbrainz”

Load: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196205794.8081;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_load.rrd

Disk IO: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376086.1393;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_diskstats-sda-count.rrd

RAM Use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-physicalmemory.rrd

Swap use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-swapspace.rrd

Processes: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376477.1968;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_processes.rrd

Indexes:

We ran the query from this suggestion to identify possible missing indexes:

http://stackoverflow.com/questions/3318727/postgresql-index-usage-analysis

This is our result:

https://gist.github.com/mayhem/423b084043235fb78642

Most of these tables are tiny and kept in RAM. Postgres opts not to use any of the indexes we created, so there was no change.
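
For readers who don't want to click through, the query linked above is essentially of this flavour: it surfaces tables that see heavy sequential scans relative to their index use. This is a paraphrase of the idea, not the exact query from that answer:

-- Tables with heavy sequential scan activity relative to index scans
SELECT relname,
       seq_scan,
       idx_scan,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 25;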

UPDATES:

  • Five months ago we doubled the RAM from 24GB to 48GB, but our traffic has not increased.
  • We've set the kernel's swappiness to 0 with no real change.
  • free -m:
             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470

OT: Toshiba USA service sucks. Don't buy their products!

Sorry for the off-topic post, but I feel that I need to speak up about the atrocious customer service I’ve gotten from Toshiba.

About a year ago we purchased three new portable hard drives that we use for backing up the MusicBrainz servers. These are used for off-site backups; every Monday when I am in town, I pedal to Digital West and swap out the backup disk. Should a bomb hit Digital West, we have an off-site backup that we can use to restore MusicBrainz. After about 3 months, the first drive failed and I promptly attempted to return it, but the site where you request an RMA number refused to recognize the drives as valid products that Toshiba supports. I periodically checked back to see if they would finally give me an RMA number. About 3 months ago, the system did give me an RMA number and I sent the drive in. Two weeks later nothing had happened; no replacement drive had appeared.

I called Toshiba and no one knew where my drive was. Finally I got an email saying that I had sent the drive to a place that was no longer accepting drives and that my drive was going to be returned to me. What? I filled out the forms and used their mailing label to send the package; how could this go wrong? I called them back asking them *not* to return the drive, but to actually forward it to the RMA place. Of course, no one could actually tell me what was going on. After three days of being escalated and talking to clueless idiots, I finally got someone who actually had a clue about the situation. But that person was wholly unwilling to make an effort to make things right. My drive was sent back to me, but the address was written so unclearly that it took an extra 2 weeks for UPS to actually get the drive back to me. (This person also said I should cut Toshiba USA some slack since they were a tiny organization. Ha! They have no idea what tiny really means!)

I finally have my broken drive back and I’m out many hours of time and $12 in shipping costs. And this week the second drive died and I have no patience for dealing with Toshiba again. I am going to recycle all of these drives and replace them with new drives and be done with this. I want nothing to do with Toshiba again. Thanks Toshiba, we’re out $300 — you suck.

This post serves as a public notice that Toshiba sucks and that you should not purchase anything from them, since they refuse to properly support their products.

Open source projects: Do you have servers that need RAM?

As part of our recent server donation, we've got piles of 20 1GB ECC server modules kicking around that we won't put to use. Rather than let them go to waste, I would much rather send them to *your* open source project and have you use them. Before too long we will have some servers to donate as well. If you need ECC RAM or a server for your open source project, please leave a comment with the following information:

  • The name and URL of the project
  • The exact type of RAM you need, or what kind of server you need and what you plan to use it for.

I’ll make a list of people interested in servers and I will attempt to match them up with servers as they become available.

Last.fm donates a 48-port gig switch!

Our friends at Last.fm just donated a 48-port gig switch to MusicBrainz! This new (to us) switch will allow us to shorten the update cycle for our indexed searches. Currently it takes about 3 hours to generate the indexes and push them out to the search servers. With this new switch, we should be able to push the indexes quite a bit faster, which should shave 60-90 minutes off our update cycle.

Thanks so much for the donation! Thanks to Adrian for making this happen in next to no time!

Server donation and finances for 2011

I'm pleased to announce that an anonymous company decided to donate a pile of 20 Supermicro servers to us!! I've tallied up and estimated the value of all of these servers, and it comes to $49,480!

Out of that pile of servers, I’ve built 10 servers that are very close to the servers that we purchased during our fundraiser last year. Two servers are nearly ready for use and a bunch of other servers decorate (read: fill up) my tiny office. Thanks so much to our anonymous donors — you’re going to help us grow over the next couple of years! With the rate at which our traffic is growing, the timing of this donation couldn’t be better!

If anyone wants to send dark chocolate to our donors as a thank you, please do! Just send them to MetaBrainz and I'll pass them on to our friends.

The timing of this donation was especially spot-on since it put us into the black for 2011. With a scant excess revenue (retained earnings) of $4,166.91 on revenues of $239,756.07, we barely made it. Phew. See our financial reports for all of 2011, if you care to get more details!

That closes an exciting year for MusicBrainz/MetaBrainz! We hope to have the annual report published before the month is out.

Thanks for an amazing 2011 everyone!