Issues and potential processes for releasing crowdsourced data


February 4, 2015

Sara Terp, Director of Data Projects at Ushahidi, at opensource.org:

‘[Ushahidi is] thinking about what it means to balance the potential social good of wider dataset release with the potential risks that come with making any data public…

Many of the datasets managed by Ushahidi users contain information that is personal, often gathered under extreme circumstances, and potentially dangerous to its subjects, collectors, or managers. Sharing data from these platforms isn’t just about clicking on a share button. If you make a dataset public, you have a responsibility, to the best of your knowledge, skills, and advice, to do no harm to the people connected to that dataset….

Make things better

As a crisismapper, I often go through the ethical process. I generally do a manual investigation first (or supervise someone who already has access to the deployment dataset doing this), weeding out all the obvious personally identifiable information (PII) and worrisome data points. I then ask someone local to the deployment area to do a manual investigation for problems that aren’t obvious to people outside the area (for example, in Homs, the location of a bakery was dangerous information to release, because of targeted bombing).

Some of the things I look for on a first pass include:

  1. Identification of reports and subjects: Phone numbers, email addresses, names, personal addresses.
  2. Military information: actions, activities, equipment.
  3. Uncorroborated crime reports: violence, corruption etc that aren’t also supported by local media reports.
  4. Inflammatory statements (these might re-ignite local tensions).
  5. Veracity: Are these reports true – or at least, are they supported by external information?

Things that make this difficult include untranslated sections of text (you’ll need a native speaker or good auto-translate software), codes (e.g. what does “41” mean as a message?) and the amount of time it takes to check through every report by hand. This can be hard work, but if you don’t do these things, you’re not doing due diligence on your data, and that can be desperately important.

Please open up as much social-good data as possible, but do it responsibly too. We’ve seen too many instances of datasets that should have been kept private making it into the public domain—as well as datasets that should have become public but weren’t, and carefully pruned datasets criticized for their release.’
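As a rough illustration, the first-pass checks Sara lists could be partly automated as a pre-filter that flags reports for human review. Everything here is an assumption for the sketch — the report structure, field names and regex patterns are not Ushahidi's actual schema, and names, codes (like the “41” example) and context-dependent risks still need a human, ideally local, reviewer:

```python
import re

# Hypothetical first-pass PII scan: flags reports containing phone numbers
# or email addresses so a human reviewer can inspect them before release.
# The patterns and report fields below are illustrative assumptions.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii(reports):
    """Return (report_id, field, pii_type) for every pattern match found."""
    flags = []
    for report in reports:
        for field in ("description", "contact"):
            text = report.get(field, "")
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(text):
                    flags.append((report["id"], field, pii_type))
    return flags

reports = [
    {"id": 1, "description": "Clinic open, call +254 712 345 678", "contact": ""},
    {"id": 2, "description": "Road blocked near market", "contact": "jane@example.org"},
    {"id": 3, "description": "41", "contact": ""},  # coded message: only a human can judge
]
print(flag_pii(reports))  # reports 1 and 2 are flagged; report 3 passes silently
```

Note that a filter like this only catches the mechanically obvious categories (item 1 in the list); military information, uncorroborated crime reports, inflammatory statements and veracity checks are exactly the things that resist automation.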

Are any of the processes or considerations Sara describes useful for your work? What other issues should people dealing with crowdsourced information be thinking about?

Photo: opensource.com

About the contributor

Tom started out writing and editing for newspapers, consultancies and think tanks on topics including politics and corruption in sub-Saharan Africa and Asia, then moved into designing and managing election-related projects in countries including Myanmar, Bangladesh, Rwanda and Bolivia. After getting interested in what data and technology could add in those areas and elsewhere, he made a beeline for The Engine Room. Tom is trying to read all of the Internet, but mostly spends his time picking out useful resources and trends for organisations using technology in their work.

