This general update is way overdue. A lot of things have been happening behind the scenes, and it’s time to let everyone know where things in the MusicBrainz world are headed. I’ll start off with TRM, since that is a hot discussion topic on the musicbrainz-users mailing list right now.
The TRM server (TRMs are acoustic fingerprints that MusicBrainz uses to identify music tracks) is constantly overloaded and can only handle a database size of about 2.2GB before it crashes. To prevent crashes, we prune the database by throwing out the least-used TRMs, which implicitly discards work that our users have done. Not good. For the TRM server to perform at a reasonable level, the entire database needs to be kept in RAM. That is why our server has 5GB of RAM, and it still can’t keep up. The fact that this problem hasn’t reared its ugly head in public is a testament to Dave Evans’ skill in keeping the TRM server ticking.
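To make the cost of pruning concrete, here is a minimal sketch of "throw out the least-used entries". It is an illustration only and not the TRM server's actual code: the function name `prune_least_used`, the `max_entries` cap, and the sample lookup counts are all made up for the example.

```python
def prune_least_used(lookup_counts, max_entries):
    """Keep the max_entries most-looked-up fingerprints, drop the rest.

    lookup_counts maps a fingerprint id to how often it was looked up.
    Both the data shape and the cap are assumptions for this sketch.
    """
    # Rank fingerprint ids by lookup count, most used first.
    ranked = sorted(lookup_counts, key=lookup_counts.get, reverse=True)
    kept = {fp: lookup_counts[fp] for fp in ranked[:max_entries]}
    dropped = ranked[max_entries:]
    return kept, dropped


# Hypothetical sample data.
counts = {"trm-a": 120, "trm-b": 3, "trm-c": 45, "trm-d": 1}
kept, dropped = prune_least_used(counts, max_entries=2)
print(kept)
print(dropped)
```

Every id that ends up in `dropped` is a fingerprint that some user submitted, which is exactly the lost work described above.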
Furthermore, TRMs have shown themselves not to be as unique as we would’ve liked. For example, take a look at the “TRMs with at least 5 tracks” report: 4400 pages (!) of TRMs that I would consider to be sub-optimal. One example TRM (a non-silence one on page 2) has 104 tracks associated with it. Given this, TRM is not some sort of magical solution that tells the tagger with great authority what metadata to apply to a track. Instead, it’s best to think of TRM as a system that lets you guess which few dozen tracks a file could be matched to. There is a lot of logic in the tagger that makes up for the shortcomings of TRM.
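The "fingerprint gives candidates, other logic picks the winner" idea can be sketched roughly as follows. This is not the MusicBrainz Tagger's real matching code; the function `best_candidate`, the track titles, and the use of Python's `difflib` for string similarity are assumptions chosen to keep the example short.

```python
from difflib import SequenceMatcher


def best_candidate(candidates, file_title):
    """Pick the candidate whose title is most similar to the file's title."""
    def similarity(track):
        # ratio() returns a value between 0.0 (no match) and 1.0 (identical).
        return SequenceMatcher(
            None, track["title"].lower(), file_title.lower()
        ).ratio()

    return max(candidates, key=similarity)


# Hypothetical tracks that all share one fingerprint.
tracks_for_one_trm = [
    {"title": "Sunrise"},
    {"title": "Sunset Boulevard"},
    {"title": "Midnight"},
]
match = best_candidate(tracks_for_one_trm, "sunset boulevard live")
print(match["title"])
```

With 104 tracks behind a single TRM, almost all of the real decision is made by this second step rather than by the fingerprint.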
Thus, TRM has two major problems: it’s not accurate enough, and it doesn’t scale well to the size that MusicBrainz has grown to. The system still functions, but I expect it to start breaking down and become less useful over time. We have the following options:
1. Find a replacement for TRM: Relatable doesn’t seem to be in business anymore, or at least they are in deep hibernation. No other companies that I have approached were interested in sharing their technology with MusicBrainz. (For the record, I’ve tried with three companies, including a couple of on-site visits in Europe.)
2. Create our own TRM solution: This is a very large endeavour: at least a year, if not two, of hard work. I’d rather work on improving MusicBrainz itself than hack on acoustic fingerprint software.
3. Throw more resources at TRM: We’re still lacking the funds for more resources, and the same argument in #2 still applies.
4. Do something else: Find some technology that can replace TRM.
Given my babbling about Lucene, I think it’s a foregone conclusion that #4 is the way to go. Sometime this fall, I will release a Picard tagger with a Lucene text indexing engine to replace the current MusicBrainz Tagger. The benefits of this new tagger will be:
- It will distribute the load on the server, since currently a large chunk of the server load goes to supporting tagger users, and a large chunk of tagger users never really contribute data to MusicBrainz or make cash donations to support the project. So, moving that traffic off the main server will allow people who want to edit/vote on the data to focus on their work.
- Given that most files in the wild nowadays have some metadata, a text index will work well. Lucene is great at taking crappy data input and coming up with something useful. If TRM gets us into the ballpark and then additional heuristics do the final legwork, Lucene will give us a much better guess to start with than TRM ever did. Thus, overall tagging quality will improve greatly.
- A Lucene tagger will work much faster than the TRM-based tagger ever did. 2-5 seconds per track was not unusual with TRM; with Lucene we’ll see 2-5 tracks per second, if not much faster.
- Since we will no longer have to decode files to identify them, it will be easier for us to support new formats. It’s less work overall.
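To give a feel for why a text index copes with messy input, here is a toy inverted index in plain Python. It is not Lucene and not Picard's code: Lucene's analysis and scoring are far more sophisticated, and the names `tokenize`, `build_index`, `search`, and the sample track strings are all invented for this sketch.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(tracks):
    """Map each token to the set of track ids whose metadata contains it."""
    index = defaultdict(set)
    for track_id, text in tracks.items():
        for token in tokenize(text):
            index[token].add(track_id)
    return index


def search(index, query, limit=3):
    """Rank track ids by how many distinct query tokens they share."""
    hits = Counter()
    for token in set(tokenize(query)):
        for track_id in index.get(token, ()):
            hits[track_id] += 1
    return hits.most_common(limit)


# Hypothetical catalogue entries: "artist - album - title".
tracks = {
    1: "Example Artist - First Album - Opening Song",
    2: "Example Artist - First Album - Closing Song",
    3: "Another Band - Other Record - Opening Theme",
}
index = build_index(tracks)

# A messy file name still shares enough tokens to rank the right track first.
results = search(index, "example_artist-opening_song(128kbps).mp3")
print(results)
```

The lookup never opens or decodes the audio file, which is where both the speed gain and the easier format support come from. The same property is why a file with no usable name or tags gives the index nothing to work with.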
This approach also has the following downsides:
- It will no longer support identifying completely anonymous files. Files that have no ID3 tags and are named test1.mp3, test2.mp3 will simply not stand a chance of being identified. I realize that there is great romance associated with this concept, but in reality most people have files that have some metadata in them, and thus will stand a good chance of being identified.
- You will need to download a 250MB Lucene index to tag your collection. This is a pretty big hurdle, but if BitTorrent can routinely help people download 650MB movies off the net, it should help us distribute our search indexes. After the first release of a Lucene-enabled Picard, we will investigate P2P searching methods that will allow people who have no index to use other people’s indexes (if they allow that).
So, the roadmap for this looks like this:
- Release Picard 0.5.0 in the next few weeks and start putting it on the main page as an alternative to the MB tagger.
- Release Picard 0.6.0 with full Lucene support and offer that as the main tagging solution for MB.
- When the TRM usage drops because of adoption of Picard 0.6.0, we will start phasing out TRM.
There you have it: that’s the current state of TRM and how we hope to solve the problems that it presents us with.