Monthly Archives: April 2003

Server Updates

CHANGE LOG for mb_server

WEB SITE

Search Facility

The search facility has been rewritten. Previously, searching was quite slow, especially for tracks, and especially if your query contained several (say, five or more) words. For example, if you did a track search for “The house of the rising sun”, that could easily have taken ten minutes to run – but your web browser would time out long before that, of course.

Accents, apostrophes and other punctuation are now handled much better. You can search for numbers too.

You can now deliberately search for a word more than once within a query – for example, an artist search for “The The” will only match artists containing at least two “The”s.
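The repeated-word rule can be pictured as a multiset comparison. The following is only a sketch under assumptions: the `tokenize` and `matches_all_words` names and the whitespace tokenizer are hypothetical, and this is not the server's actual search code.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lower-case the text and split on whitespace.
    return text.lower().split()


def matches_all_words(query, name):
    """Return True if each query word occurs in `name` at least as often as in `query`."""
    needed = Counter(tokenize(query))
    available = Counter(tokenize(name))
    return all(available[word] >= count for word, count in needed.items())
```

With this rule, a query of “The The” matches a name containing “the” twice, but not a name that contains it only once.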

The ranking of search results is now much better, and is based on a “similarity” algorithm. Try a track search for “love to hate you” and you’ll see what I mean.

If your search takes longer than 30 seconds, your query will be aborted and the web server will return a “Your search has been cancelled” message. Hence you get told what’s going on, and the server doesn’t get bogged down running a query which you’re not going to see the results of anyway.

Finally, support for an “any of these words” search has been removed, at least for now. You can now only search for “all of these words” (and because that’s the only option, that field has been removed from the search form).

Moderation Enhancements

The “artist filter” pages now won’t show certain artists when it’s appropriate – for example, if you’re merging artist A into something, then artist A won’t be shown as a possible “target” to merge into.

The “change track” page (changetrack.html) now uses the right default values in the form, and includes a “use guesses” as well as a “use current” button.

When entering album data (via /cdi/enter.html etc), some extra checks are made:

  • If all (or most) of the track names seem to start with the track number, then you are warned about this, and given the option to automatically fix the problem, go back and manually fix it, or continue regardless.
  • If you’re entering a single-artist album, but it looks (according to the track names) like you’re entering a multiple-artist album – or vice versa – then you are warned about the possible discrepancy, and given the choice between going back and fixing it, or continuing regardless.
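As an illustration of the first check, here is a minimal sketch of how track names that start with their own track number might be detected and fixed. The function names, the regular expression and the 75% threshold are all assumptions made for the example, not the logic used by enter.html.

```python
import re

# A leading number, optional punctuation, then whitespace: "01. ", "3 - ", "12 ".
_PREFIX = re.compile(r"^\s*(\d+)\s*[-.:)]?\s+(.*)$")


def _split_number(name, position):
    """Return the name without its prefix if the prefix equals `position`, else None."""
    m = _PREFIX.match(name)
    if m and int(m.group(1)) == position:
        return m.group(2)
    return None


def looks_numbered(track_names, threshold=0.75):
    """True if at least `threshold` of the names start with their own track number."""
    if not track_names:
        return False
    hits = sum(
        1
        for position, name in enumerate(track_names, start=1)
        if _split_number(name, position) is not None
    )
    return hits / len(track_names) >= threshold


def strip_numbers(track_names):
    """Remove the leading track number from each name that has one."""
    fixed = []
    for position, name in enumerate(track_names, start=1):
        stripped = _split_number(name, position)
        fixed.append(name if stripped is None else stripped)
    return fixed
```

A name such as “3 Little Birds” in position 3 would be a false positive here, which is a good reason to warn and offer a choice rather than fix silently.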

More “moderation suggestion” reports: possibly duplicate artists, albums which need converting to multiple artists, and tracks named with their own sequence (track) numbers. There’s also a pair of reports of TRMs with many tracks, and tracks with many TRMs.

Other Miscellaneous Changes

Many pages now perform much better validation of their inputs, e.g. data provided on the “query string”. For example, numbers must be numeric, etc. For “artist id” inputs, we reject the id of the “deleted artist” on most occasions.

Many web pages have had minor usability enhancements, e.g. moving the input focus using Javascript (preferences permitting). Some of the visual layout has been reworked, the aim being a clearer, simpler appearance.

For the first time, many of the pages are now (finally!) valid HTML 4.0! There have been many minor HTML and CSS improvements.

There is now an artistinfo.html page (like showtrack.html and albumdetail.html) which shows the database internal ID and the MusicBrainz UUID/GID values.

The various download links now redirect you via a “pick a mirror” page. Your preferred mirror is remembered via a cookie, if possible.

The “bio” page (bio.html) has been expanded to include many more people.

Some links which used to only work with Javascript enabled now also work without it.

BACK END COMPONENTS

We’ve moved to Postgres 7.3 (specifically, 7.3.2 at the moment)! This affects a few things, such as the “create” SQL scripts.

Support has been added for having the database on a remote host. Did we mention that we’ve got a new, dedicated database box now? :-)

Database Export / Import

The export / import facility has been re-written. Well, the export bit has anyway. The principal change, and a really important one, is that the nightly database dumps that appear on our download sites are now a consistent point-in-time snapshot. (Previously the dumps could be, and sometimes were, inconsistent, and therefore could not be re-imported.)

As part of that rewrite, a couple of other changes came about too:

  • The export format is now Postgres “copy” format, which is basically a tab-separated text format. Since this is much easier to manipulate than the old Postgres “dump” format we had before, it’s now much easier to munge the exported data into non-Postgres formats. Hence, because the exported data is now much more database-neutral, there is no longer any explicit support for dumping in MySQL or ExcelCSV format.
  • The handling of the “moderator_sanitised” pseudo-table is now much cleaner; it’s all done via a simple temporary table, with no need for the “fill_moderator” function we used to have.
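To give an idea of how easy the “copy” format is to munge, here is a rough sketch of a reader for one exported line. It assumes the usual conventions of the text form of COPY (tab-separated fields, `\N` for NULL, backslash escapes) and handles only the common escapes; the function names are made up for the example.

```python
# Backslash escapes in the text form of COPY (subset: octal and hex escapes are not handled).
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}


def unescape(field):
    """Turn backslash escapes in one field back into the characters they stand for."""
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            # Any other backslashed character stands for itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def parse_copy_line(line):
    """Split one line into a list of field values, with None for NULL."""
    fields = line.rstrip("\n").split("\t")
    return [None if f == r"\N" else unescape(f) for f in fields]
```

Rows parsed like this can then be written out in whatever format the target database wants, for instance with Python's `csv` module.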

Also, the “SetSequences” script forgot to update the “wordlist” sequence – this is now fixed.

When loading data using “InitDB.pl --createdb --import”, foreign keys are now added /after/ the data has been loaded. This should make for a faster import, and it certainly makes it a whole load easier to diagnose consistency problems if you’re loading from an inconsistent dump for some reason.

Finally, the script to rebuild the search engine metadata (build_words.pl) is now *much* faster – something of the order of 40 times faster according to my tests. It does have the very minor drawback that you can now only use it on an “offline” system, because the metadata will be unusable during the rebuild.

Other Miscellaneous Changes

Deferred Updates

Two types of update are now done using the “DeferredUpdate” module – which simply means that some data describing the update to be done later is appended to a text log file, and something else will come along later and actually do the update. The updates in question are TRM lookup count and artist alias usage count.

Clearly this means that the main request now goes significantly faster. There are two caveats, however:

  • the process to subsequently apply the updates in the log file isn’t yet written
  • the other day we managed to lose most of the log file (we’re not sure why this happened exactly) – so that’s about two weeks’ worth of TRM lookup counts / artist alias usage counts lost. Well, it’s only usage data.
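The deferred-update idea can be sketched roughly as follows. This is not the DeferredUpdate module itself: the record layout (one tab-separated line per update), the function names and the summarising step are assumptions, the last one standing in for the not-yet-written process that applies the log.

```python
from collections import Counter


def defer_update(log_path, kind, key):
    """Append one record describing an update; no database work happens here."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{kind}\t{key}\n")


def summarise_log(log_path):
    """Count how many times each (kind, key) pair was logged."""
    totals = Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            kind, key = line.rstrip("\n").split("\t", 1)
            totals[(kind, key)] += 1
    return totals
```

A later job could take the totals and issue one counter update per pair, instead of one per request.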

Other changes

$artist->LoadByName has had a simple but effective speed-up added. The old behaviour was just to do a case-insensitive search straight away, but that doesn’t use an index; so now we try a case-sensitive search on some obvious variations (e.g. all lower case, all upper case, Title Case etc) – which does use the index – before resorting to a full table scan.
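The lookup order described above might look something like this sketch. It uses Python with an SQLite connection standing in for the real Perl and Postgres code, and the `artist` table layout and function names are assumptions for illustration only.

```python
import sqlite3


def name_variants(name):
    """Obvious case variations of `name`, de-duplicated while keeping order."""
    candidates = [name, name.lower(), name.upper(), name.title()]
    return list(dict.fromkeys(candidates))


def load_by_name(conn, name):
    """Try exact (index-friendly) matches first, then a case-insensitive scan."""
    variants = name_variants(name)
    placeholders = ", ".join("?" for _ in variants)
    row = conn.execute(
        f"SELECT id, name FROM artist WHERE name IN ({placeholders}) LIMIT 1",
        variants,
    ).fetchone()
    if row is not None:
        return row
    # Fallback: comparing lower(name) cannot use a plain index on name.
    return conn.execute(
        "SELECT id, name FROM artist WHERE lower(name) = lower(?) LIMIT 1",
        (name,),
    ).fetchone()
```

The slow path is only reached for names whose stored capitalisation is unusual, so most lookups should be served by the indexed query.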

RDFDump.pl used to be /really/ verbose. It’s now only that verbose if standard input is a terminal.

Moves have been made towards using separate HTTP vhosts for web pages / RDF requests.

Removed pointless and slow code when inserting “add album” moderations.

Added a little mod_perl magic to write the moderator name into the Apache access logs, and the RDF query name too (for mq_2_1.pl).

Fixed support for running an HTTP proxy in front of the mod_perl server.

MISCELLANY

All references to lyrics/synctext support have been removed.

Dave Evans