Hi, I’m Kartikeya Sharma, a postgrad student at National Institute of Technology, Hamirpur. I’ve worked on the project MessyBrainz as a student developer for GSoC 2018. Robert Kaye mentored me during this GSoC programme. The goal of my project is associating MBIDs to MSIDs and clustering together the MSIDs which represents the same MBID. The MBIDs represent MusicBrainz Identifier. It is an Universally Unique Identifier that is permanently assigned to each entity in the MusicBrainz database, MSID represents MessyBrainz Identifier which is associated with each unique recording, artist_credit and release in MessyBrainz database. In simple words MSIDs represents unclean metadata whereas MBIDs represent clean metadata.
This blog post summarizes the work that I did in my project, which was divided into three parts.
Processing the data already in MessyBrainz database
The first part involves creating clusters using the MBIDs already present in the MessyBrainz database. This involves creating clusters for recordings, artists, and releases. To implement this part I created the following three PRs #37, #41, and #44.
After that, I began to work on the second part of this which involves creating clusters using the artist MBIDs and release MBIDs and names fetched from MusicBrainz database. I needed to access MusicBrainz database, for that, I first had to work on BrainzUtils to have methods to access MusicBrainz database to fetch artist MBIDs using recording MBIDs and release name and release MBIDs using recording MBIDs. The part to fetch artist MBIDs was done during the community bonding period in PR #14 at BrainzUtils and to fetch releases I created PR #18 at BrainzUtils during GSoC coding period. After that, I created a PR to create clusters using the fetched artist MBIDs #47 and another one to create clusters using releases fetched #49.
I did write around 60 tests which proved to be vital in making sure that the code does what it’s supposed to do.
Processing the data as it is inserted into the MessyBrainz database
Creating clusters for the data inside the database requires a lot of resources. So, it was better to create clusters as recordings are inserted into the database but, even this type of clustering is not efficient. So, to cluster these recordings first these recordings are sent to rabbitMQ server and from that, these are sent to a clustering script which runs in a different container and runs continuously and clusters the incoming recordings. That way it does not slow down the process of submitting recordings to the database. For this I created PR #50.
Create endpoints to access MSIDs and MBIDs
I created two API endpoints in PR #51.One endpoint is to fetch MBIDs and MSID using an MSID. Another endpoint is to fetch MSIDs using an MBID. This way end users can access MBIDs and MSIDs which may be used for calculating different stats.
Apart from that with the help of my mentor, I did setup a VM to test the above code on the MsB datadump. This task had some challenges: first I had to create indexes for various fields to speed up the process of clustering. Without indexing, it would have taken approximately 37 days but after creating indexes on various fields It just took 3 hours. I found out that PostgreSQL does allow to create indexes on functions too which came into use while creating artist_credit clusters for which I created a custom function. Indexes were created in PR #53. When I ran the clustering code on a VM on which the whole MessyBrainz datadump was present I found out that we have fields in
recording_json table which are supposed to store MBIDs but were pointing to empty strings. This was not supposed to happen initially as ListenBrainz is the only source of data for MessyBrainz currently. Submissions to MessyBrainz are restricted from users directly and ListenBrainz does validate listens for that. So, those recordings must have been inserted before that validation was present. To solve the problem I created PR #52.
The summer was a great learning experience for me. I started slowly as things were messy at the start. As at the start everything wasn’t crystal clear to me, I wasn’t sure on how exactly to write scripts that manipulate database and did write the scripts in the most trivial way possible. Here I was doing a query for every single MBID to first check if it’s present in the
recording_cluster tables and if not then cluster the recording. Which is conceptually correct but not efficient by any means. And this could be done by executing a single query on the
recording_json table to fetch only those recording MBIDs that are not present in
recording_redirect table as those are unclustered. That way we don’t have to process the recording MBIDs that have been already processed making the process of clustering efficient.
With time I got an understanding of how clusters are created and how to handle anomalies. Such as James Morrison. In the end, the definition of anomaly can be put as an MSID represents an anomaly if it points to different MBIDs in
entity_redirect table (entity can be artist_credit, recording, and release).
Work to be done ahead
The project is still in its initial stages and requires a lot of work to be done before moving it into production. We still need to write integration tests for ClusterWriter and API endpoints. After that, we can work on the Additional Ideas that I proposed in my proposal. We need to figure out some way to associate MBIDs to MSIDs for the artists, recordings, and releases where no MBIDs are present. This does not seem like a trivial task with so many anomalies to take care of.
Last three months have been a great experience for me. I would like to thank Robert Kaye, Param Singh, and Alastair Porter who helped me to solve a lot of problems that I encountered during the entire period. Working on their suggestions and reviews I was able to write good quality code which was efficient as well. The work culture at MetaBrainz inspired me a lot. At MetaBrainz we have weekly IRC meetings where we get to know what others are doing at the organization and also get a place to tell what we did in our past week. I would like to thank MetaBrainz and Google for giving me this chance to get involved in open source on such a cool project. The association of MSIDs to MBIDs can be used by ListenBrainz as stats are calculated on MSIDs which can then be mapped onto MBIDs which represents clean metadata. I would like to work on the project further because of the learning opportunities that are present in the project.