General update: What's up with TRM??


This general update is way overdue — a lot of things have been happening behind the scenes and it’s time to let everyone know where things in the MusicBrainz world are headed. I’ll start off with TRM, since that is a hot discussion topic on the musicbrainz-users mailing list right now.

The TRM server (TRMs are acoustic fingerprints that MusicBrainz uses to identify music tracks) is constantly overloaded and can only handle a database of about 2.2GB before it crashes. To prevent crashes, we prune the database, throwing out the least-used TRMs, which implicitly discards work that our users have done. Not good. To make the TRM server perform at a reasonable level, the entire database needs to be kept in RAM. Thus our server has 5GB of RAM and it still can’t keep up. The fact that this problem hasn’t reared its ugly head in public is a testament to Dave Evans’ skill in keeping the TRM server ticking.

Furthermore, TRMs have shown themselves not to be as unique as we would’ve liked. For example, take a look at the TRMs with at least 5 tracks report: 4,400 pages (!) of TRMs that I would consider sub-optimal. One example TRM (non-silence, on page 2) has 104 tracks associated with a single TRM. Given this, TRM is not some sort of magical solution that tells the tagger with great authority what metadata to apply to a track. Instead, it’s best to think of TRM as a system that lets you guess which few dozen tracks a file could be matched to — there is a lot of logic in the tagger that makes up for the shortcomings of TRM.

Thus, TRM has two major problems: it’s not accurate enough and it doesn’t scale to the size that MusicBrainz has grown to. The system still functions, but I expect it to start breaking down and becoming less useful over time. We have the following options:

  1. Find a replacement for TRM: Relatable doesn’t seem to be in business anymore, or at least they are in deep hibernation. No other companies that I have approached were interested in sharing their technology with MusicBrainz. (For the record, I’ve tried with 3 companies, including a couple of on-site visits in Europe).
  2. Create our own TRM solution: This is a very large endeavour — at least a year, if not two, of hard work. I’d rather work on improving MusicBrainz itself than hack on acoustic fingerprint software.
  3. Throw more resources at TRM: We’re still lacking the funds for more resources, and the same argument in #2 still applies.
  4. Do something else: Find some technology that can replace TRM.

Given my babbling about Lucene, I think it’s a foregone conclusion that #4 is the way to go. Sometime this fall, I will release a Picard tagger with a Lucene text indexing engine to replace the current MusicBrainz Tagger. The benefits of this new tagger will be:

  1. It will distribute the load on the server, since currently a large chunk of the server load goes to supporting tagger users. And a large chunk of tagger users never really contribute data to MusicBrainz or make cash donations to support the project. So, moving that traffic off the main server will allow people who want to edit/vote on the data to focus on their work.

    Given that most files in the wild nowadays have some metadata, a text index will work well. Lucene is great at taking crappy data input and coming up with something useful. Where TRM only got us into the ballpark and additional heuristics did the final leg work, Lucene will give us a much better guess to start with than TRM ever did. Thus, overall tagging quality will improve greatly (a rough sketch of the idea follows this list).
  2. A Lucene tagger will be much faster than the TRM-based tagger ever was. 2-5 seconds per track was not unusual with TRM — with Lucene we’ll see 2-5 tracks per second, if not much faster.
  3. Since we will no longer have to decode files to identify them, it will be easier for us to support new formats. It’s less work overall.
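
For illustration only (this is not Picard’s actual matching logic or the Lucene API), here is a minimal sketch of the general idea: score a file’s existing, possibly messy tags against candidate track records and keep the best guess. All field names and data below are made up.

```python
# Hypothetical sketch of metadata-based matching: normalize whatever tags a file
# already carries, score them against candidate records by token overlap, and
# keep the best-scoring candidate. Picard's real heuristics are more involved.
import re

def tokens(s):
    """Lowercase a tag value and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def score(file_tags, candidate):
    """Fraction of the file's tokens that also appear in the candidate record."""
    have = tokens(file_tags.get("artist", "")) | tokens(file_tags.get("title", ""))
    want = tokens(candidate["artist"]) | tokens(candidate["title"])
    return len(have & want) / len(have) if have else 0.0

candidates = [
    {"artist": "Portishead", "title": "Glory Box"},
    {"artist": "Portishead", "title": "Sour Times"},
]
file_tags = {"artist": "portis head", "title": "glory box (live)"}
print(max(candidates, key=lambda c: score(file_tags, c)))
# -> {'artist': 'Portishead', 'title': 'Glory Box'}
```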

This approach also has the following downsides:

  1. It will no longer support identifying completely anonymous files. Files that have no ID3 tags and are named test1.mp3, test2.mp3 will simply not stand a chance at identification. I realize that there is great romance associated with this concept, but in reality most people have files that have some metadata in them, and thus will stand a good chance of being identified.
  2. You will need to download a 250MB Lucene index to tag your collection. This is a pretty big hurdle, but if BitTorrent can routinely help people download 650MB movies off the net, it should help us distribute our search indexes. After the first release of a Lucene-enabled Picard, we will investigate P2P searching methods that will allow people who have no index to use other people’s indexes (if they allow that).

So, the roadmap looks like this:

  1. Release Picard 0.5.0 in the next few weeks and start putting it on the main page as an alternative to the MB tagger.
  2. Release Picard 0.6.0 with full Lucene support and offer that as the main tagging solution for MB.
  3. When the TRM usage drops because of adoption of Picard 0.6.0, we will start phasing out TRM.

There you have it — that’s the current state of TRM and how we hope to solve the problems that it presents us with.

27 thoughts on “General update: What's up with TRM??”

  1. Acoustic fingerprinting isn’t dead. I use it to feed a Markov chainer to generate playlists.

    I subscribe to Jonathan Foote’s basic model for feature extraction:

    1) extract features
    2) feed them to a decision tree
    3) count the number of samples that land on each leaf and make a histogram
    4) store the histogram as a vector in a high-dimension metric tree

    Matching is then done with a nearest-neighbors query against the metric tree; the Euclidean distance serves as the quality of the match.
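
    As a rough, self-contained sketch of that pipeline (not jack’s actual code: the MFCC extraction and the trained decision tree are stubbed out with random data and a crude quantizer), assuming per-frame feature vectors are already available:

    ```python
    # Toy version of the Foote-style fingerprint described above: assign each
    # frame's feature vector to a "leaf" (a crude quantizer stands in for the
    # trained decision tree), histogram the leaf counts, and compare fingerprints
    # by Euclidean distance with a brute-force nearest-neighbor search.
    import numpy as np

    N_LEAVES = 64

    def leaf_index(frame):
        """Stand-in for the decision tree: hash a coarsely quantized frame to a leaf."""
        return hash(tuple(np.round(frame, 1))) % N_LEAVES

    def fingerprint(frames):
        """Histogram of leaf counts over all frames, normalized to unit length."""
        hist = np.bincount([leaf_index(f) for f in frames], minlength=N_LEAVES).astype(float)
        return hist / (np.linalg.norm(hist) or 1.0)

    def nearest(query_fp, database):
        """Brute-force nearest neighbor; a metric tree would replace this at scale."""
        dists = [np.linalg.norm(query_fp - fp) for _, fp in database]
        return database[int(np.argmin(dists))][0], min(dists)

    # Fake "MFCC" frames for two tracks, plus a copy of track A with 10% of its
    # frames corrupted (standing in for a slightly different encode of the same audio).
    rng = np.random.default_rng(0)
    track_a = rng.normal(size=(500, 13))
    track_b = rng.normal(loc=2.0, size=(500, 13))
    damaged_a = track_a.copy()
    damaged_a[:50] = rng.normal(size=(50, 13))

    db = [("track_a", fingerprint(track_a)), ("track_b", fingerprint(track_b))]
    print(nearest(fingerprint(damaged_a), db))  # expect ("track_a", <small distance>)
    ```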

    There are problems with this approach:

    1) I currently extract features (MFCCs) with the HTK speech toolkit. The toolkit is not open source.

    2) The decision tree creation is sensitive to corpus selection. I haven’t done much research into this.

    3) Most high-dimension tree-like data structures (R-Trees, TV-Trees, X-Trees, …) suffer from the curse of dimensionality — a polynomial complexity in dimension. Quick testing of a uniformly distributed 100k database with 201 dimensions shows that it takes about 10s per query on my old Athlon 750. This means that I either need a data structure (like the pyramid technique) that does not suffer from the curse of dimensionality, or need to investigate using fewer than 200 unique features.

    4) I have poor quality of match metrics.

    I think acoustic fingerprinting still has value, so I’m going to continue to research it..

    Cheers,

    –jack

  2. Might I make a small, stupid suggestion?
    What about having TRM as a sort of “last resort”?
    Normal lookup is done using text search; if one gets less than, e.g., a 10% match, the tracks go to unidentified. Now the user has the possibility to do a “manual” TRM lookup on these tracks. If TRM really is only useful in 1% of all cases, your TRM load should go down by a huge amount.

    Other than that: I do agree that completely removing “acoustic identification” will be a big loss for MB.

  3. Lucene sounds like a great interim measure – I’ve used it with success for indexing images for Reuters feeds. Very fast.

    The only struggle was keeping the index in sync when moving data between servers (production, staging, disaster recovery, load balancing, etc.). Last I checked there were no timestamps in Lucene logs, which caused no end of blasphemy when the thing died.

    What database system are you using with this 2GB limit? We’ve got 15GB+ PostgreSQL databases chugging along quite nicely. Or is this a filesystem limit you’re hitting? FreeBSD might be a solution there.

  4. Really, the effectiveness of TRM fingerprinting is the only reason I use musicbrainz at all…

    Without the ability to tell the difference between two similarly labeled songs by different performers (eg covers of popular songs labeled as the original performer) the effectiveness of the software is nil.

    Tagger is useful precisely for its ability to correct misnamed files, which a text engine wouldn’t help with at all, I think. Unless I’m missing something…

  5. Brenda:

    Our Lucene indexes will always be generated from the MB data, so if something gets corrupted, we just rebuild it. I’m not too worried about this. Postgres is the one in charge of making sure that everything stays sane. Lucene is just another way of finding MB data.

    As for the TRM server… It’s totally custom software that needs to reside in memory, with no database underneath — and while FreeBSD might be able to help, we can’t contact the company who wrote it, so getting them to move it has proven impossible.

    😦

  6. Although I have read up on quite a bit of the MusicBrainz philosophies and technologies,
    perhaps I should’ve done some more research before commenting, but…

    So, about TRM software that runs on the server: it sounds like MusicBrainz can’t change the fact that it must run straight out of memory. Is this correct? Is this a proprietary piece of software?

    This may be a silly question, but why has musicbrainz not attempted to use old-fashioned checksums (MD5 or such) to uniquely identify files?

    I have been quite interested in MusicBrainz for some time now, and, as ‘Jag’ has also expressed, I would not see nearly as much value in MusicBrainz after such a move.

    Also, the whole Lucene ‘index’ thing… That sounds pretty raunchy. I can’t see a whole lot of users taking the time to download such a thing. I know dedicated people will, but… Why can’t the tagging application send tag/filename information to the database, which in turn uses some Lucene stuff to find possible results? Or is that what we’re already talking about (then what’s the index needed for)?

    One thing that I’ve been longing for, for some time now, is a way to tag/mark music files as “corrupt” or as having “skips”, etc. I’m assuming that TRM, as it is, is not great for that itself. Perhaps a checksum would help this situation?

    Another thing: does anyone know if this Bitzi thing (http://bitzi.com/about/) is at all valuable?? Have you guys considered such a technology?

    I am hoping to use MusicBrainz or some similar technology in a project of my own, so I’m trying to gather up some more information; I’m still not sure of what all MusicBrainz can actually do.

    Thanks — sorry if this was a bit off-topic.

  7. The TRM server is closed source. There is nothing we can change about it.

    MD5 hashes are totally useless for tagging music. One little bit off and you get a different MD5 — you could not identify anything useful with it.
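
    To illustrate the point with a couple of lines of Python (the byte strings below are just stand-ins for audio data):

    ```python
    # Flipping a single bit in the input yields a completely unrelated MD5 digest,
    # so two slightly different encodes of the same track will practically never match.
    import hashlib

    original = bytes(16)                     # 16 zero bytes, standing in for audio data
    one_bit_off = bytes([0x01]) + bytes(15)  # the same data with a single bit flipped

    print(hashlib.md5(original).hexdigest())
    print(hashlib.md5(one_bit_off).hexdigest())  # shares nothing recognizable with the first
    ```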

  8. Is having more than one track resulting in the same TRM really a show stopper?

    Most of my TRMs show up with one matching track – there’s some huge value I’m getting from TRMs, and the occasional track that returns 10 or so is okay with me.

  9. I guess I should join the mailing list. However, I too use this to ensure the integrity (TRM fingerprinting) of a song. There are a lot of songs that have been correctly identified, and it’s helped immensely with organizing a wreck of an mp3 collection that got messed up by a tagging rename program (even iTunes wasn’t very good at that). I have sworn by your software and have tried to contribute by giving you more TRMs for the harder-to-(finger)print albums which I’ve purchased.

    I don’t know much about the science of it all, but I think that with the variance in the downloads, and say incomplete songs and such, it does tend to make for even more fingerprints than there may need to be? Is there a way to edit out the potentially messed-up songs (like those that end before the allotted time)? Maybe that would let you prune down the TRM information enough to help with the load on the server?

    Beth aka Nyght

  10. Perhaps we should move to (one of) the mailing lists… I’m about to join one/some of them.

    Also, I’m assuming there’ve been dozens of such conversations about these things within the musicbrainz community already… I’m snooping around on the lists/etc, looking for stuff, but in the meantime…

    So, the TRM is like a check-sum that can “give a little” for slight differences in audio encoding formats; plus, you don’t just throw the entire audio file into the check-summer, you only throw *actual* audio data. That’s nice, but it sounds like it doesn’t work out so great, if only because of the fact that it’s a proprietary piece of software that can’t be fixed — and currently it’s sort-of “broken” because *all* TRM ID’s must be stored in the server’s RAM (did I understand that right?).

    Now, imagine you just do an old-fashioned MD5 hash, only using the *audio* portions of the file, stripping out tags and format-specific data. Of course, every time someone makes a new encoding, you’re going to get a new hash, but hear me out…

    When the tagger goes through a collection, it can send the server a bunch of information: each track’s filename (and perhaps directory information could be useful as well), tag information, and a hash of the audio. The first thing the server does is check for a matching hash. If there’s no match, it can use the filename/path and tag information, with Lucene, to find a “best match”.

    So, why bother with the hashing part?

    People are going to be ripping their own CDs *ALL THE TIME*, and every time we’re going to get a new hash. True, but usually when you rip your own music, two important details can be considered:
    a) you properly label everything, and
    b) the tracks aren’t going to come through gnarled (half cut-off, skips, pops, etc — at least mine usually don’t).
    Note that people who do not properly label their albums when THEY RIP THEM are not going to give a crap about labeling things and using MusicBrainz, anyhow.

    Soooo, most of the tracks, that are *unique* to your collection, will not need much identifying; ~90% of the time, the tracks will probably be recognizable by information placed in them by the ripper. This leaves tracks that you’ve gathered from outside sources. These tracks that you’ve obtained from somewhere else are “out there”, so other people have them, perhaps 100’s of people. So long as someone else has musicbrainz’ed the track, we should be good. Remember, we will only hash the *audio* portion of the file (the WAV output of a decoder), so as users change their ID3 tags (etc) around, it won’t affect the hash.

    (Just thought of a possible problem: Different decoders can output different audio data, right? Well, it shouldn’t be a big problem, so long as the tagger(s) all use the same set of algorithms.)

    Aside from removing the need to use a proprietary piece of software on the server, and solving the current TRM problem, this would also help to, slowly, rid us of all the crappy MP3s (etc) that are floating about out there. Eventually, outside applications (players, etc) could integrate, and tell you information such as “this track skips” or “this track is cut off halfway through”.

    Could you please poke holes in this, and tell me why this is not reasonable?

  11. Acoustic fingerprinting is literally the only reason I use MusicBrainz. For relational purposes it’s not much more useful than Google. If fingerprinting is eventually going to be abandoned I’m going to start looking elsewhere for a less mature but more Free solution.

    I really should have checked myself when migrating from Moodlogic. That’s what I get for choosing software that’s only “Free enough”.

    – Chris

  12. Ah yes. The moral high road. Very good.

    Just to remind you, the only portion of MusicBrainz that is NOT free is the acoustic fingerprinting. The rest is painfully open.

    So, if you must invoke the moral high road, I guess you better stop using MusicBrainz right away.

  13. On Chris’s suggestion.

    Like you yourself point out, different decoders output different audio data. This creates the requirement to use the same decoder for the data, or at least to specify the algorithms exactly. This also means that an external decoder probably cannot be used, since a simple version upgrade might break compatibility. And this means that the tagger application has to specifically support all the audio formats, instead of just invoking some external application to produce the data.

    But, if there is specific support for audio formats, why not just calculate the hash from the encoded data itself? That is, just take the actual mp3 data from an mp3 file, disregarding any variable data. This is faster, as the file doesn’t need to be decoded, and more stable, as the decoding algorithms do not matter.

    As for how the variable data is disregarded, there are a few different methods. A canonical representation for a file could be defined – one without tags and such – and tagging applications could just dynamically create this canonical representation as they need the checksum. Or the checksummed representation could be totally abstract, for example containing only the raw data from inside any valid MPEG-1 frames for mp3.
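
    As a minimal sketch of the “strip the tags, hash what’s left” idea for mp3 files (purely illustrative, not anything MusicBrainz actually does; it only handles ID3v1/ID3v2 blocks, ignores other variable data such as APE tags, and the file name is hypothetical):

    ```python
    # Hash an MP3's encoded audio bytes after stripping the ID3v2 block at the start
    # of the file and the 128-byte ID3v1 block at the end, so that retagging the
    # file does not change the checksum.
    import hashlib

    def audio_checksum(path):
        with open(path, "rb") as f:
            data = f.read()

        # ID3v2: "ID3", 2 version bytes, 1 flag byte, then a 4-byte syncsafe size
        # (7 bits per byte) giving the length of the tag body after the 10-byte header.
        if data[:3] == b"ID3" and len(data) >= 10:
            size = 0
            for b in data[6:10]:
                size = (size << 7) | (b & 0x7F)
            data = data[10 + size:]

        # ID3v1: a fixed 128-byte block starting with "TAG" at the very end of the file.
        if len(data) >= 128 and data[-128:-125] == b"TAG":
            data = data[:-128]

        return hashlib.md5(data).hexdigest()

    print(audio_checksum("example.mp3"))  # hypothetical file; same digest before and after retagging
    ```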

    In any case, you are not the only one to think of something like this. Hope something comes out of it.

  14. In response to Nuutti’s reply…

    Thanks, first of all.

    So, do you know why MusicBrainz has not considered this approach (or why they’ve dismissed it)? Do you know of others that are working on, or have at least considered such an approach?

    I would be willing to contribute to such an effort. I’ve been longing for a clean way to organize music for a while; I thought MusicBrainz was my answer, but never got around to messing with it (mainly because I’m not using Windows/Mac, so I cannot use the tagger). My eventual plan was to mess around with the MB libraries, and try to integrate them into a portion of my application, but at this point, it doesn’t sound like this would be all that valuable.

    Also, it doesn’t seem, to me, that it would be that horrendous to make plug-ins that analyze the various audio formats out there, coming up with a checksum/hash that’s not dependent on tag info, etc. In most cases, we could probably just strip off the tags, in which case we could probably find code that already does this (I’m using a PHP library, getid3, which could probably do this — just port it to…).

    If anyone’s interested in such an implementation, please let me know. And if I’m nuts, let me know that too — again, why have we not taken this approach?

  15. I just want to echo the comments of a few users above, namely:

    1. The acoustic fingerprinting (okay, and support from last.fm) are the only real reasons I use MusicBrainz. As jag says, the ability to differentiate similarly tagged songs by different artists, and poorly tagged ones, is key.

    2. I also have no interest in keeping a 250+MB index file on my hard drive. I was perfectly happy (and meticulous) manually tagging and using allmusic as a reference. You will lose me instantly if you go this way. In fact, you should lose just about everyone: for all intents and purposes, every user will have their own ‘private database’, and the point of sharing will be lost. So my strong suggestion is that if you go this direction, DO NOT roll out until you’ve developed the distribution solution. But better not to go this way at all.

    3. Finally on the TRMs. I have no technology knowledge, so no suggestions beyond the layman’s thought I have had several times recently when seeing one TRM come up with many different suggestions. I’m not sure what you mean by “least-used” — least accessed? least submitted? — but wouldn’t the best approach be to delete TRMs from those 4400 pages? On some level, those are the “least useful” TRMs, because they require more manual intervention. And perhaps some of the multiple TRMs are errors (eg., you accidentally tag track 4 as track 5 and track 5 as 4 and then quickly hit save and then submit out of habit. it happens).

  16. So, if you must invoke the moral high road, I guess you better stop using MusicBrainz right away.

    Unfortunately the alternative doesn’t really exist at the moment.

    I’d like to apologise for the way I put my last comment. Obviously you’ve done the best you could to keep the project open, but when (what I consider to be) the core component is proprietary you’re just as stuck as everyone else. It’s a bit disappointing to hear that looking for an alternative fingerprinting solution has been discounted though.

    – Chris

  17. Okay. So, there are two problems:

    1. TRMs aren’t as unique as hoped.
    2. The server is closed source and doesn’t scale to the size musicbrainz has grown to.

    If you’re looking for more volunteers to write/build/pull together an open-source server, then here I am.

  18. I posted this on a seemingly dead thread about TRM stats, but I guess here is where it belongs anyways.

    Summary: basically, screw the TRM server; don’t hash files, but hash TRMs and integrate them into your own, much higher-performance server. Now the old post:

    ==========================================================================

    Disclaimer: I have no idea about musicbrainz’s internal structure. And I also don’t know how to insert line breaks in this message – Oh well.

    ==========================================================================

    Why is the total TRM count a problem? I imagine 2.0 gigs is either a limit imposed by the system memory amount, or the VM size (if it’s a 2/2 32-bit kernel/user split). But why would you need to store the entire TRM in the first place in such a high-speed, high-cost medium?

    ==========================================================================

    You could, for example, make do with the CRC-64 of the TRM for all intents and purposes: according to my methodologically imprecise back-of-the-hand calculations, the chance of a single collision would be at most approx. 1 in 10 million. The data isn’t that good; this is definitely an option which would hardly impact data quality. (Methodology: 2^64/(2000000^2/2), which is a slight overestimation, but n! isn’t exactly calculable here and I’ve forgotten the continuous equiv. atm.) If you’re still worried about collisions, just up the number of bits in your hash… with 128 bits the chance of a collision is so minuscule you really should start worrying about world peace instead :-).
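
    For what it’s worth, that back-of-the-envelope figure checks out. A quick sketch of the standard birthday-bound approximation (the chance of at least one collision among n items hashed to b bits is roughly n²/2 divided by 2^b):

    ```python
    def collision_probability(n, bits):
        """Birthday-bound approximation: P(any collision) ~ n^2 / (2 * 2^bits), valid for small p."""
        return (n * n / 2) / 2 ** bits

    n = 2_000_000  # roughly the number of TRMs in the database
    print(collision_probability(n, 64))   # ~1.1e-07, i.e. about 1 in 9 million
    print(collision_probability(n, 128))  # ~5.9e-27, utterly negligible
    print(collision_probability(n, 32))   # >> 1: with a 32-bit hash alone, collisions are certain
    ```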

    ==========================================================================

    You could also choose to go with a disambiguation scheme: work with the CRC-64 normally, but use the full TRM if that hash is marked as colliding; that check path will be so infrequent that it’s not an issue if it’s gotta check the HDD. You can even work with a CRC-32 check only by default without too much problem – though then you’ll need to use a disambiguation scheme.

    ==========================================================================

    Frankly, it’s gotta be possible to get this to run efficiently on a much smaller setup, with even more meta-information, rather than less :-).

  19. I would like to comment that I have tried to use the tagger at least twice, but I have never been able to submit TRMs to the server (always too busy?). As I have some pretty obscure music, much of which does not have TRMs at all yet, I thought I could make a contribution to the database, but I have never been able to do so and therefore have never really used the tagger. I agree with what has been said before though – the TRMs seem to be a good way to find music that otherwise would be pretty difficult. If you could find a way around the server/memory problems, I think it should be kept as a backup for when metadata doesn’t make a match at the very least. Maybe that would help the server load.

  20. 1. Lucene is more effective – go for it
    2. TRM is the best feature of MusicBrainz!!! – keep it in case Lucene fails
    3. TRM is still unique for whole albums (even if every TRM had 1000 tracks associated with it, the set of TRMs for Track1.mp3, Track2.mp3 … Track10.mp3 will identify the correct album)
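
    A toy illustration of that point, with entirely made-up TRM and release IDs: even when each individual TRM is ambiguous, intersecting the candidate releases across a directory of tracks usually leaves a single album standing.

    ```python
    # Each TRM maps to several candidate releases, but the intersection across all
    # tracks in a directory narrows the choice down to one album.
    trm_to_releases = {
        "trm-1": {"album-A", "album-B", "album-C"},
        "trm-2": {"album-A", "album-C", "album-D"},
        "trm-3": {"album-A", "album-E"},
    }

    print(set.intersection(*trm_to_releases.values()))  # {'album-A'}
    ```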

  21. Excuse my stupidity, but I don’t understand why it’s necessary to download a 250MB index to my PC. Why can’t I send all relevant info (basically, any details I already have such as existing tag, file name, folder name, track length, etc) from my machine to your server? Can’t you then do the look-up for me? MusicBrainz is superb but will die the moment you make a 250MB download obligatory.

  22. While it may be faster, the main reason I used MB was to tag files I didn’t know the exact names of. I can’t count the number of times I’ve used it to find the proper tag information of HORRIBLY mistagged or untagged song files. Also having to keep a 250MB index on my computer is kind of bothersome. I’m pressed for space on my laptop as it is.

    While lucene might be more efficient, I think the downs may outweigh the ups on this…

  23. Nooo! TRM is the main reason I use MusicBrainz! *I did* have lots of AudioTrack10.mp3 files and I just love that MusicBrainz helped me to decipher these.

    I’ve also had a fair amount of improperly tagged files. If I had a wrong title, I didn’t care; I could have a wrong title and a wrong album and a wrong track number, etc…

    Please keep TRM!

