Lucene enabled Picard

In the last couple of days I’ve stuffed Lucene into Picard and it has given me some quite amazing results. I’ve opened a collection of untagged files and watched it open the right albums and populate them with tags automatically. Mind you, none of the files were previously tagged with MB ids. Plain amazing!

I have this hip-hop compilation that my friend put together and it’s utter crap: duplicates, many files without tags, crappy spelling and mostly from greatest hits albums. Ick. The original tagger identified less than 15% of the tracks. The new tagger identifies 50% – 60% of the tracks, which is a really good rate for this crappy collection.

The only downside is that I am playing with a complete Lucene index of the MusicBrainz database. It takes over 650 MB of disk space and over an hour to build on my 2 GHz Linux box. A compressed version is about 250 MB. I’m trying to think of how I can make these databases available to people. I have a number of thoughts:

1. Build indexes daily and put them up via BitTorrent, and let people help with the distribution of the search indexes.

2. RJ of AudioScrobbler suggested making the DB accessible via P2P — great idea! Any idea what kind of P2P engine one could use for this? I’d like to check this out — and in our case, the P2P use is legit so the MB server could play the role of the host cache for the P2P system. Feedback solicited!

3. Have the tagger cache metadata (good idea anyway) and build the index incrementally. This is the most transparent for the user and may just be the default. If the user would like to do #1 to download a full index and maintain it incrementally, that could work well as a complement for this idea.

4. Partition the search space (by artist or genre) and have the client download incremental chunks of the index as they are needed. This could be cumbersome and give bad search performance.
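To make option 3 a bit more concrete, here is a rough Python sketch of a tagger that caches every piece of metadata it fetches and folds it into a local index as it goes. This is not Picard's code and it does not use Lucene: a toy in-memory inverted index stands in for the real thing, and the `IncrementalIndex` class, the track ids and the metadata fields are all made up for illustration.

```python
import re
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index standing in for Lucene (hypothetical, not Picard's code)."""

    def __init__(self):
        self.cache = {}                   # track id -> cached metadata dict
        self.postings = defaultdict(set)  # token -> ids of tracks containing it

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, track_id, metadata):
        """Cache one server response and index it; re-adding replaces the old entry."""
        if track_id in self.cache:
            self.remove(track_id)
        self.cache[track_id] = metadata
        for value in metadata.values():
            for token in self.tokenize(value):
                self.postings[token].add(track_id)

    def remove(self, track_id):
        """Drop a cached entry and its postings, e.g. before refreshing stale data."""
        old = self.cache.pop(track_id)
        for value in old.values():
            for token in self.tokenize(value):
                self.postings[token].discard(track_id)

    def search(self, query):
        """Return cached track ids, best match first, scored by shared tokens."""
        scores = defaultdict(int)
        for token in set(self.tokenize(query)):
            for track_id in self.postings.get(token, ()):
                scores[track_id] += 1
        return sorted(scores, key=lambda t: (-scores[t], t))


index = IncrementalIndex()
index.add("t1", {"artist": "Public Enemy", "title": "Fight the Power"})
index.add("t2", {"artist": "Public Enemy", "title": "Bring the Noise"})
print(index.search("public enemy fight"))
```

The index only ever knows about what the tagger has already seen, so the first pass over a collection would still hit the server. A full index downloaded as in option 1 could seed the same structure and then be kept current through `add`.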

Any other suggestions on how to save bandwidth to maintain these local search indexes would be appreciated.

9 thoughts on “Lucene enabled Picard”

  1. donredman

    This is awesome! Spread the MB data all over the world. This is giving power to the users. Jay!

    As an alternative:
    Couldn’t you make a protocol between the tagger and lucene and leave lucene on the server? I suppose the data volume would be less if you just send the requests and results and not the whole index.

    Or is there a reason you need the full index local?

    If the only problem is server load (which might be huge if it has to answer thousands of search requests), then you could offer options:

    (a) download the full index via P2P,
    (b) let the server do the searches for you for a small fee that covers the server cost.

  2. Mayhem & Chaos

    The only reason why I am using the index locally is to relieve the searching stress from the central server. I’m not keen on having a boatload of servers doing tons of searches.

    However your suggestion of ‘express’ servers that users have to pay to access makes a ton of sense too.

    There are so many pros and cons in this debate — I’m not close to making up my mind yet. Maybe I should post this question to mb-devel and see what happens. Are you on mb-devel?

  3. azertus

    Great idea!
    Maybe also an option for downloading the index for all your subscribed artists? This might spur on people to subscribe to their fav. artists…

  4. azertus

    I’d just like to add that if you want to release a version with this ability (thus more automatic loading of and sorting to MB-albums) there should be an option for “reloading” the MB-album info… (to reflect e.g. an added release date, or a changed capitalisation)

  5. rjmunro

    I’d love you to sneak a bittorrent client into the Helix DNA platform. That would be perfect for watching pre-recorded video. RealPlayer is probably popular and user recognised enough that people could host a torrent for any purpose, and know that most people will be able to download it.

  6. Mayhem & Chaos

    azertus: wrt the updated data needing to be pulled from the server — yes, you’re right. That needs to be done anyway, especially in light of people tagging against mirror servers.

    rjmunro: I don’t have control over any of the Helix sources other than Picard, so that’s not going to be likely. 😦

  7. JackandJohn

    I think a massive torrent and incremental update is a great idea..

    I don’t have money available to donate right now (unless you take Canadian cash and are in my city 😉), and I feel bad about tagging my 15,000-song collection, especially when doing batches of 50-100 taxes the server.

    I would love to be able to download and seed (Make sure it’s usable right from the torrent download) the main database, run all my songs against it, then either update or run the remainder against the remote servers.

    Of course, I would need some sort of persistent tracking of the files that didn’t get tagged properly. I wouldn’t want to re-scan 15k songs to find the ones I need to work on 🙂
