Summer of Code log analysis project: May we share our data with our GSoC student?

UPDATE: This is clearly going to be a major hassle, so we’ll spend the extra time coding a program that will sanitize the data before it goes into Splunk.

Last week Google’s Summer of Code program started, and my student Dániel Bali is ready to get busy combing through our massive logs to see what sorts of information he can mine from them.

We have only one minor problem: our logs contain the IP addresses of our users, and some requests contain the user name of the person making the request. Removing this private information from the logs before Dániel sees them is quite a pain to do well.

I would like to propose that we:

  1. Consider Dániel part of our core team for the summer and allow him to see IP addresses and all the requests in full.
  2. Have Dániel sign a short statement stating that he will not divulge any private information.
  3. Fail him in his GSoC project if he does divulge any private information.

If this is not acceptable to you, please speak up soon. I would like to make this happen early next week so Dániel can continue his GSoC work.

UPDATE: The final output of Dániel’s work will not contain any private information. If we end up using any private data as input, we will sanitize it and remove private information before we publish the output.

11 thoughts on “Summer of Code log analysis project: May we share our data with our GSoC student?”

  1. ianmcorvidae

    +1 from me.

    I assume it would go without saying, but this presumably also covers keeping private those reports that contain private information, after said reports exist.

  2. Hawke

    -1

    I think this is violating the letter of the privacy policy, specifically “Any personal information you choose to provide will not be revealed to anyone else.”

    That said, I don’t personally mind the use of my own data like this and would agree to a more liberal privacy policy anyway. But it seems to me that by asking here you are asking for an exception to the privacy policy which should be at the same level of communication as an update to the privacy policy.

  3. Brandon LeBlanc

    I’m indifferent, but let my indifference move towards support (mainly because I don’t believe IP addresses are or should qualify as personally identifiable information).

    @Hawke: The first item in Rob’s proposal moves against your argument. As part of the core team, his access would be exempt by the privacy policy.

    Also, if the community moves in favor of not allowing him access to IP addresses, may I suggest uniquely hashing (SHAxxx or MD5 or some form of UUID) each address before turning the logs over. This can be done quickly and easily with some clever scripting, and it leaves a unique identifier in the logs for each IP address while still providing anonymity.

    I also suggest that the unique hashing of the address be done anyways. As I understand, you plan on using Splunk as part of the processing of the logs. This would be a direct violation of the privacy policy that Hawke mentioned above.

  4. Brandon LeBlanc

    I take back what I said about Splunk; I misunderstood how that process works. Splunk has no contact with the data whatsoever.

  5. Hawke

    It seems like cheating to just declare one person to be “part of musicbrainz” like this though. IMO either GSoC students working with MB are part of MB as a group or none of them are and you’re asking for an exception to the privacy policy.

    Declaring “this guy is part of MB so we can give him access to the private data” defeats the purpose of the privacy policy. Using that logic you could give it to *anyone*.

    It would sit better with me if you made a blanket declaration that any GSoC student working with MB was considered part of MB, but would be required to sign an NDA as you mention above, if they needed access to private data in their work.

  6. BrianFreud

    I kind of agree with Hawke. I’m not against my own info being used, but others might be. While I don’t see IP alone as being really “personal information”, it is “private information”, as it is known only by MB, and not made public.

    I don’t know exactly what types of data he might have access to, but I’d assume it could include POSTs sent during editing… whose datestamps would be relatively easy to cross-reference with the (public info) data dumps, which do include edit data – and now you have IP+editor nick, and it’s no longer anonymous.

    For Dániel’s analysis to be fully public and confirmable, such that his results can be checked by others, presumably the dataset also will eventually need to be made public? (Else the results are not checkable.) If so, then doesn’t the data need to be sanitized anyhow? In which case, the pain is there at some point; it seems to me a lot better to have it sanitized by someone who is a full time core team member, rather than someone who is, temporarily, only a “core team” member due to a technicality in how he is temporarily defined, no offense to Dániel intended.

    (It also seems odd that a GSoC student – essentially an “apprentice” in any other field – could be considered a “core team” member, even if only for a short term).

  7. Invisibleman78

    If the “IP addresses and some requests contain the user names of the person making the request” are not important for the log analysis project, why don’t you blank/delete/randomize these parts? I’m sure the members of the core team all have the skills to use ‘sed’ or ‘grep’ for such a task?
    Or am I being simplistic?

  8. ZaphodBeeblebrox

    I’m sorry to say I’ll echo everyone else on this. I personally can’t give a bigger damn about if anyone sees *my* ip/edits/whatever.
    But it’s true that it is getting into murky waters with the privacy policy.
    And MusicBrainz is better than that.

  9. Mayhem

    OK, never mind then. We’ll spend the extra time doing the sanitizing, because I don’t want to have this giant hassle of changing the privacy policy just for this one change that no one actually objects to.
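
The hashing idea raised in comment 3 could look roughly like the sketch below. It is only an illustration, not the sanitizer that was actually written. It assumes the logs are plain text with dotted-quad IPv4 addresses, and it swaps in a keyed hash (HMAC-SHA256) rather than a bare SHA or MD5, since the IPv4 space is small enough that unkeyed hashes can be reversed by hashing every possible address. The names `SECRET_KEY`, `pseudonymize_ip` and `sanitize_line` are made up for this example, and user names in request parameters are not handled here.

```python
import hashlib
import hmac
import re

# Hypothetical secret; it would need to be generated once and kept private.
SECRET_KEY = b"replace-with-a-random-secret"

# Simple IPv4 matcher; assumes the logs only contain dotted-quad addresses.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Map an IP address to a stable token that does not reveal the address."""
    digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
    # Truncated for readability; the same IP and key always give the same token.
    return "ip-" + digest[:16]


def sanitize_line(line: str, key: bytes = SECRET_KEY) -> str:
    """Replace every IPv4 address in a log line with its token."""
    return IPV4_RE.sub(lambda m: pseudonymize_ip(m.group(0), key), line)


if __name__ == "__main__":
    # Made-up log line using a documentation-range address.
    print(sanitize_line('192.0.2.10 - - "GET /example HTTP/1.1" 200'))
```

Because each address always maps to the same token, per-client statistics still work on the sanitized logs, while anyone without the key cannot easily recover the original addresses.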
