There are two web servers running on the main web server machine. The first is a light server that handles all the simple content, such as static pages and images. Anything that requires more intelligence, such as talking to the DB, gets passed to the second web server, which is dedicated to these heavy requests.
The light server will wait a set amount of time for the heavy server to finish its job: currently 120 seconds. If the heavy server hasn’t finished by then, the light server gives up and returns the dreaded 502 error. Unfortunately, the DB server will continue to chug on the query and finish executing it as requested. Cancelling a query that is already running is hard to do, and oftentimes it’s better to just let the server run its course.
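To make the timeout behaviour concrete, here is a minimal Python sketch using asyncio. It is not how our servers are actually implemented: `heavy_backend`, `light_server` and `run_request` are hypothetical names, and the heavy server is faked with a sleep. `asyncio.shield` keeps the backend task running after the light server stops waiting, which mirrors the DB carrying on with the query after the 502 has already gone out.

```python
import asyncio

# Value from the post; the demo below uses much smaller timeouts so it runs quickly.
HEAVY_TIMEOUT_SECONDS = 120.0


async def heavy_backend(work_seconds: float) -> str:
    """Hypothetical stand-in for the heavy server: sleeps to mimic a slow DB query."""
    await asyncio.sleep(work_seconds)
    return "200 OK"


async def light_server(backend: asyncio.Task, timeout: float = HEAVY_TIMEOUT_SECONDS) -> str:
    """Wait up to `timeout` seconds for the heavy server, then give up with a 502."""
    try:
        # shield() means a timeout cancels only our wait, not the backend task itself.
        return await asyncio.wait_for(asyncio.shield(backend), timeout=timeout)
    except asyncio.TimeoutError:
        return "502 Bad Gateway"


async def run_request(work_seconds: float, timeout: float) -> tuple[str, str]:
    """Return (what the user sees, what the heavy server eventually produced)."""
    backend = asyncio.create_task(heavy_backend(work_seconds))
    user_sees = await light_server(backend, timeout)
    backend_result = await backend  # the heavy work still runs to completion
    return user_sees, backend_result


if __name__ == "__main__":
    print(asyncio.run(run_request(0.01, timeout=0.5)))  # ('200 OK', '200 OK')
    print(asyncio.run(run_request(0.2, timeout=0.05)))  # ('502 Bad Gateway', '200 OK')
```

The second call shows the uncomfortable part: the user already has a 502, yet the backend still spends its full time on the job.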
The gut reaction might be to say: “Why not stick around longer and wait for the results, if the DB is going to crank them out anyway?” The problem is that if we do this, the light web server sits idle while waiting for the DB/heavy server to finish its job. It is better for the light server to give up and spend its time on things it can accomplish in a reasonable amount of time, like serving smaller requests for other people. With this setup, the overall system favors the less intensive requests and thereby increases the total number of requests that are handled successfully. If we waited for the DB/heavy server to finish every job, the light web server would quickly clog up with connections that sit idle, doing nothing. That clog would then block any new connections to the web server, and the whole site would come to a halt.
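The trade-off can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the hypothetical `simulate` function, 4 connection slots, one request arriving per tick, every 10th request being heavy, a heavy job taking 50 ticks, and a 12-tick timeout standing in for our 120 seconds. It models the argument; it does not measure our servers.

```python
from typing import Optional


def simulate(slots: int, arrivals: list[str], heavy_ticks: int,
             timeout: Optional[int] = None) -> dict[str, int]:
    """Toy model of the light server's connection slots.

    One request arrives per tick. A light request occupies a slot for 1 tick.
    A heavy request occupies a slot until the heavy server finishes
    (heavy_ticks) or, if a timeout is set, until the light server gives up.
    """
    busy_until: list[int] = []  # tick at which each occupied slot frees up
    counts = {"ok": 0, "502": 0, "refused": 0}
    for tick, kind in enumerate(arrivals):
        # Release slots whose request has finished (or been abandoned).
        busy_until = [t for t in busy_until if t > tick]
        if len(busy_until) >= slots:
            counts["refused"] += 1  # server clogged: no free slot
            continue
        if kind == "light":
            busy_until.append(tick + 1)
            counts["ok"] += 1
        elif timeout is not None and heavy_ticks > timeout:
            busy_until.append(tick + timeout)  # give up after the timeout
            counts["502"] += 1
        else:
            busy_until.append(tick + heavy_ticks)  # wait for the heavy server
            counts["ok"] += 1
    return counts


# Hypothetical workload: every 10th request is heavy.
ARRIVALS = ["heavy" if i % 10 == 0 else "light" for i in range(100)]

print(simulate(4, ARRIVALS, heavy_ticks=50))              # {'ok': 35, '502': 0, 'refused': 65}
print(simulate(4, ARRIVALS, heavy_ticks=50, timeout=12))  # {'ok': 90, '502': 10, 'refused': 0}
```

With these numbers, waiting for every heavy job lets four heavy requests fill all the slots, and 65 of the 100 requests are turned away. With the timeout, the 10 heavy requests get a 502, but all 90 light requests are served.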
If you want a visual representation of what is going on, check the load graphs for dexter, our DB server. Whenever the load goes above 4.0, the DB server is no longer running optimally. We’re fine right this second, but in 10 minutes’ time?
So, what are we doing about this?
- Optimize the server code so that users cannot make these intensive requests.
- Spread the DB load across multiple replicated slave servers.
- Partition the database so that we can have multiple master servers. For instance, we could have one DB server that handles all the edits, one that handles the data, and maybe one that handles TRMs and PUIDs. This way each machine does less work, but it means a lot of coding work for the mb-server devels.
- Find someone to give us a beefy database server with 12–16 GB of RAM.
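The second and third options both come down to deciding, for each query, which database host should run it. Here is a minimal Python sketch, assuming a hypothetical partition map (edits, data, TRMs/PUIDs), hypothetical host names and a hypothetical `QueryRouter` class. Writes always go to the partition's master, while reads are spread round-robin across that master's replicated slaves, falling back to the master when it has none. A real setup would also have to cope with replication lag, since a slave can briefly return data older than what was just written to the master.

```python
import itertools

# Hypothetical partition map: which master owns which kind of data.
PARTITIONS = {
    "edits": "db-edits-master",
    "data": "db-data-master",
    "trm_puid": "db-trm-master",
}

# Hypothetical replicated slaves for each master (may be empty).
REPLICAS = {
    "db-edits-master": ["db-edits-slave1"],
    "db-data-master": ["db-data-slave1", "db-data-slave2"],
    "db-trm-master": [],
}


class QueryRouter:
    """Pick a database host for a query based on its data area and type."""

    def __init__(self, partitions: dict[str, str], replicas: dict[str, list[str]]):
        self.partitions = partitions
        # One round-robin iterator per master that actually has slaves.
        self._slave_cycles = {
            master: itertools.cycle(slaves)
            for master, slaves in replicas.items()
            if slaves
        }

    def host_for(self, area: str, is_write: bool) -> str:
        master = self.partitions[area]
        # Writes must go to the master; reads use the master only if it has no slaves.
        if is_write or master not in self._slave_cycles:
            return master
        return next(self._slave_cycles[master])


router = QueryRouter(PARTITIONS, REPLICAS)
print(router.host_for("data", is_write=True))       # db-data-master
print(router.host_for("data", is_write=False))      # db-data-slave1
print(router.host_for("data", is_write=False))      # db-data-slave2
print(router.host_for("trm_puid", is_write=False))  # db-trm-master
```

Partitioning is what makes this cost developer time: every query in the server code has to be tagged with the area it touches before a router like this can send it anywhere.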
So, next time you’re aching for new mb-server features, please keep in mind that we’re spending a lot of time just keeping the service running smoothly. Our income isn’t large enough yet for us to hire people to maintain the site AND hire people to hack on new features. In the meantime, Dave Evans and I will focus on keeping things running, while hard-working folks like Keschte work on new features for the server. Overall we’re still moving forward, just a lot slower than we’d like.
What can you do?
- Help us solve our DB issues if you’re a DB person.
- Help us write more mb-server code.
- Most important of all, make a donation!!
- Bug your rich friends to donate to MusicBrainz so we can buy a beefy database server. 🙂