I am glad to report that our problems are fixed and that our server is back to humming along nicely. The following is posted here so that other souls who find themselves in our situation may learn from our experience:
What we changed:
- It was pointed out that max_connections of 500 was in fact insanely high, especially in light of using PGbouncer. Before we used PGbouncer we needed a lot more connections and when we started using PGbouncer, we never reduced this number.
- Our server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering that to 5 minutes.
- We reduced the number of PGbouncer active connections to the DB.
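For anyone wanting to picture the PGbouncer side of these changes, here is a rough sketch of the relevant part of a pgbouncer.ini. Only the 5-minute server_lifetime comes from our actual change; the pool size numbers are made-up placeholders, and which of these pool settings fits your setup is an assumption you will need to check against your own workload.

```ini
[pgbouncer]
; Close server connections that have been open longer than this many seconds.
; We had this at 3600 (1 hour); 300 is the 5 minutes Josh Berkus suggested.
server_lifetime = 300

; Placeholder value, not our real number: server connections allowed
; per user/database pair.
default_pool_size = 20

; Placeholder value, not our real number: cap on server connections
; per database, across all pools.
max_db_connections = 50
```

The max_connections setting itself lives in postgresql.conf rather than in pgbouncer.ini, and it should stay comfortably above the total number of connections PGbouncer is allowed to open.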
What we learned:
- We had too many backends.
- The backends were being kept around for too long by PGbouncer.
- This caused too many idle backends to kick around. Once we exhausted physical RAM, we started swapping.
- Linux 3.2 apparently has some less than desirable swap behaviours. Once we started swapping, everything went nuts.
Going forward, we’re going to upgrade our kernel the next time we have downtime for our site. The rest should be sorted now.
Finally, a word about Postgres itself:
Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not Postgres’ fault. In over 10 years of using Postgres, problems with our site have never been Postgres’ fault. Not once.
Thanks to everyone who helped us through this tough time!