In the last couple of days I’ve stuffed Lucene into Picard, and it has given me some amazing results. I opened a collection of untagged files and watched it find the right albums and populate them with tags automatically. Mind you, none of the files were previously tagged with MB IDs. Plain amazing!
I have this hip-hop compilation that a friend put together, and it’s utter crap: duplicates, many files without tags, misspellings, and tracks mostly from greatest-hits albums. Ick. The original tagger identified less than 15% of the tracks; the new tagger identifies 50–60% of them, which is a really good rate for this crappy collection.
The only downside is that I am playing with a complete Lucene index of the MusicBrainz database. It takes over 650MB of disk space and over an hour to build on my 2GHz Linux box; a compressed version is about 250MB. I’m trying to think of how I can make these search indexes available to people. I have a number of thoughts:
1. Build indexes daily and put them up via BitTorrent, and let people help with the distribution of the search indexes.
2. RJ of AudioScrobbler suggested making the DB accessible via P2P, which is a great idea! Any idea what kind of P2P engine one could use for this? I’d like to check this out; in our case the P2P use is legitimate, so the MB server could play the role of the host cache for the P2P system. Feedback solicited!
3. Have the tagger cache metadata (a good idea anyway) and build the index incrementally. This is the most transparent option for the user and may well be the default. A user could also download a full index via #1 and then maintain it incrementally, which would complement this approach nicely.
4. Partition the search space (by artist or genre) and have the client download incremental chunks of the index as they are needed. This could be cumbersome and give bad search performance.
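To make idea #3 a bit more concrete, here’s a minimal sketch of what an incremental metadata cache might look like. Everything here is hypothetical: the `MetadataCache` class, the JSON on-disk layout, and the toy inverted index are stand-ins for what a real Lucene-backed implementation would do. The point is only to show that new records can be indexed one at a time, with no full rebuild.

```python
import json
import os
import re


class MetadataCache:
    """Hypothetical local metadata cache with an incrementally built index."""

    def __init__(self, path):
        self.path = path
        self.docs = {}   # mbid -> metadata dict
        self.index = {}  # term -> set of mbids (toy inverted index)
        if os.path.exists(path):
            with open(path) as f:
                self.docs = json.load(f)
            for mbid, meta in self.docs.items():
                self._index(mbid, meta)

    def _index(self, mbid, meta):
        # Index lowercase word terms from a few common fields.
        for field in ("artist", "title", "album"):
            for term in re.findall(r"\w+", meta.get(field, "").lower()):
                self.index.setdefault(term, set()).add(mbid)

    def add(self, mbid, meta):
        # Incremental update: index only the new record, never rebuild.
        if mbid not in self.docs:
            self.docs[mbid] = meta
            self._index(mbid, meta)

    def search(self, query):
        # Simple AND search: return mbids matching every query term.
        sets = [self.index.get(t, set())
                for t in re.findall(r"\w+", query.lower())]
        return set.intersection(*sets) if sets else set()

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.docs, f)
```

A tagger using this would call `add()` for every metadata record it sees while tagging, so the local index grows with the user’s own collection instead of requiring the full 650MB download up front.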
Any other suggestions on how to save bandwidth to maintain these local search indexes would be appreciated.
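For idea #4, the partitioning itself could be as simple as hashing a normalized artist name to a chunk file. The chunk count and file-naming scheme below are made up purely for illustration; the real cost, as noted above, is in fetching chunks on demand and the search performance that results.

```python
import hashlib

NUM_CHUNKS = 256  # assumed partition count, chosen arbitrarily


def chunk_for_artist(artist: str) -> str:
    """Map an artist name to the (hypothetical) index chunk file covering it."""
    # Normalize before hashing so trivial spelling/case variants
    # land in the same chunk.
    normalized = artist.strip().lower()
    h = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return "index-chunk-%02x.idx" % (int(h, 16) % NUM_CHUNKS)
```

The client would download and cache only the chunks its files hash to. Note that this scheme only helps if the query artist is spelled close enough to normalize to the right chunk, which is exactly the weakness for a collection with crappy spelling.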