
Inviting testers for MusicBrainz live search

Hello everyone!

So as you might know, I recently joined the MetaBrainz team and my first project was the completion of our long-standing Solr search project to provide live search indexing for the MusicBrainz database.

I am happy to announce that we are finally rolling out an alpha release for you to test. You can try it at https://test.musicbrainz.org/search or use the webservice endpoint at https://test.musicbrainz.org/ws/2/

What this means –

  1. You can now search for updated entities almost instantly. There should be a delay of at most 15 seconds between a database update and the change showing up in search results.
  2. This implies that once we have ironed out the Solr search we can finally retire the direct database search on the main site and use Solr with its advanced search syntax. For details on the new syntax features you can refer to the Lucene query parser documentation. For details on field types you can refer to our Search Syntax guide.
  3. As I said, the Solr search is still in its alpha stage, so it can be unstable and have bugs. Please do not depend on it for critical applications.
  4. Speaking of bugs, here’s where we need your help the most! We want you to use the Solr search as extensively as possible and file any bugs you encounter in our Solr issue tracker. You may encounter bugs like –
    • Missing fields in the API output for the webservice.
    • Certain types of queries not working in Solr search that happen to work on the main website.
    • Missing data/edits/updates not being indexed.
  5. Since we haven’t ported our search analyzers in their entirety, Solr might have worse search results than our main search.
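If you want to poke at the webservice from code, here is a minimal sketch of building a search URL with a Lucene-syntax query. The base URL is the one given above. The `artist` entity, the `artist` and `country` field names, and the `query` and `fmt=json` parameters are assumptions carried over from how the main /ws/2/ search works, so check them against the Search Syntax guide before relying on them. The class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for testers; not part of any MusicBrainz client library.
class TestSearchUrl {
    // Base URL of the alpha webservice, as given in this post.
    static final String BASE = "https://test.musicbrainz.org/ws/2/";

    /** Builds a search URI for one entity type from a Lucene-syntax query. */
    static URI searchUri(String entity, String luceneQuery) {
        // Percent-encode the query so quotes, colons and spaces survive the trip.
        String q = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        // Assumption: "query" and "fmt=json" behave as on the main site's /ws/2/.
        return URI.create(BASE + entity + "?query=" + q + "&fmt=json");
    }

    public static void main(String[] args) {
        // Fielded query with a boolean operator (field names are assumptions).
        System.out.println(searchUri("artist", "artist:\"Nirvana\" AND country:US"));
    }
}
```

Running the same query against the main site and comparing the two responses is probably the quickest way to spot missing fields or queries that only work on one side.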

I would like to reiterate – Solr is still in alpha and not everything is perfect. We need your help to make it so.

 

Updated search jar/war files

Given the utter slackers we are, we haven’t yet finished updating the search server to output the new MBIDs that were added to some entities in our last release. We’ll try and get that done soonish.

However, we did update the search code to fix this error in the search indexer:

ERROR: type "earth" does not exist

I’ve put both of these jar/war files on our FTP site:

If you would like to try and build these from source, you’ll need commit 4f677727 from mmd-schema and the latest master commit from search-server. To build them, please follow these instructions.

UPDATE: The build from the current master for search-server appears unable to load indexes on startup. Please use the old war (we still use this in production) until we can release a fix.

Help! Is there a Lucene doctor in the house?

UPDATE: Thanks to user selckin in the #lucene IRC channel for quickly solving this for us! Hopefully we can put this fix into production later today!

As our regular readers may know, we’ve been having lots of trouble with our Lucene-based search servers. Over the past few days we’ve spent a fair amount of time tuning, debugging and otherwise troubleshooting our setup. We’ve identified and fixed a number of problems, but most importantly we feel that we’ve identified the core issue: our servers are simply overloaded.

Under normal conditions we find our servers loaded to about 25% – 35% CPU — things look good and we don’t think we have a capacity problem with our servers. Then a slow query comes in that starts to slow things down. Much like a traffic jam that evolves out of thin air, one slow query can make a giant mess for everyone.

We’ve started timing our queries and most of the time, they can be measured in milliseconds. However, when things get bad, they may take up to 7-8 seconds. Our upstream web servers time out on the search request after about 5 seconds in order to prevent traffic from getting backed up. What we need to do next is limit how long a Lucene query can run and terminate it once the timeout is reached.

I’ve started looking at this and quickly realized that this is much more of a job than adding a simple timeout parameter to the search call. We’re currently using this search function from IndexSearcher:

  public TopDocs search(Query query, int n);

Ideally I would like to add a way to time out queries after 3 seconds. So far, I’ve discovered that we could use

  public void search(Query query, Collector results);

with a TimeLimitingCollector. The old call returns TopDocs and our code assumes that we have a TopDocs object from which to cull our search results. Having stared at the Lucene docs for a while, I haven’t found a way to convert the data in a TimeLimitingCollector to TopDocs. It doesn’t make sense to me. 😦

How does one do this? Sadly, we have no Java programmers on our team, so we’re quite a bit out of our league here. Is there an easier way to do this? Would someone be willing to write this code for us and submit a PR? We’d find some really good chocolate and send it to you if you do!
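One possible way to read the Collector API is that nothing needs converting. The time-limiting collector wraps a second collector that gathers the top hits, and the results are read from that inner collector whether or not the timeout fired. In Lucene 4.x the inner collector would presumably be a TopScoreDocCollector, whose topDocs() method returns a TopDocs. The sketch below shows only the pattern, using plain JDK code. Every type in it is a hypothetical stand-in rather than a Lucene class, and the clock is simulated so the behaviour is repeatable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.LongSupplier;

// All types here are hypothetical stand-ins, NOT Lucene classes.
class TimeoutSearchSketch {

    interface HitCollector {
        void collect(int docId, float score);
    }

    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    /** Keeps the n best-scoring hits seen so far. */
    static final class TopHitsCollector implements HitCollector {
        private final int n;
        // Min-heap on score, so the weakest kept hit is evicted first.
        private final PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

        TopHitsCollector(int n) { this.n = n; }

        @Override
        public void collect(int docId, float score) {
            heap.add(new Hit(docId, score));
            if (heap.size() > n) heap.poll();
        }

        /** Best hits first; usable even if collection stopped early. */
        List<Hit> topHits() {
            List<Hit> out = new ArrayList<>(heap);
            out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return out;
        }
    }

    static final class TimeExceeded extends RuntimeException {}

    /** Delegates to another collector until the time budget is used up. */
    static final class TimeLimitedWrapper implements HitCollector {
        private final HitCollector inner;
        private final LongSupplier clockMillis;
        private final long deadline;

        TimeLimitedWrapper(HitCollector inner, LongSupplier clockMillis, long budgetMillis) {
            this.inner = inner;
            this.clockMillis = clockMillis;
            this.deadline = clockMillis.getAsLong() + budgetMillis;
        }

        @Override
        public void collect(int docId, float score) {
            if (clockMillis.getAsLong() > deadline) throw new TimeExceeded();
            inner.collect(docId, score);
        }
    }

    /** Simulated search: each collected hit advances a fake clock by costMillis. */
    static List<Hit> search(float[] scores, int n, long budgetMillis, long costMillis) {
        long[] now = {0};
        TopHitsCollector top = new TopHitsCollector(n);
        HitCollector limited = new TimeLimitedWrapper(top, () -> now[0], budgetMillis);
        try {
            for (int doc = 0; doc < scores.length; doc++) {
                limited.collect(doc, scores[doc]);
                now[0] += costMillis;
            }
        } catch (TimeExceeded e) {
            // Timed out: keep whatever the inner collector gathered so far.
        }
        return top.topHits(); // results always come from the inner collector
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.5f, 0.7f, 0.3f};
        for (Hit h : search(scores, 2, 2500, 1000)) {
            System.out.println("doc " + h.docId + " score " + h.score);
        }
    }
}
```

With these made-up inputs the 2500 ms budget runs out before the fourth hit is collected, so the result holds the best two of the first three hits. Whether partial results are acceptable, or a timed-out query should return an error instead, is a separate decision for the calling code.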

More info on our project:

We are using Lucene 4.10.4 on a custom codebase that pre-dates Solr. We have a new Solr project to replace this one, but it isn’t quite done yet. (Again, not having Java programmers is a bit of a problem for us).

Any tips, explanations or pull requests would be deeply appreciated! Chocolate reward offered!

Thank you!

New search server build deployed

Today we’ve deployed a new build of the search server code. As you may know, we’ve been having loads of issues with our search servers recently.

In an effort to figure out what causes the bizarre behaviour we’ve observed, we compiled a new version of the codebase with a more recent version of Lucene. In theory this makes no feature changes to the codebase, but you never know if that is actually the case.

We hope that this build will be more stable, but we’ll need to observe over the next few days to see if that will be the case. If you spot any problems, please report them in our SEARCH bug tracker.

Thanks!