Summary of our discussion on k-anonymity and other de-identification frameworks

Analysis / November 13, 2015

On November 6, we hosted the third Responsible Data Forum (RDF) online discussion in our series on de-identifying data. This discussion was an introduction to k-anonymity and other de-identification frameworks and techniques with Max Shron, founder of Polynumeral.

Earlier discussions in the de-identification series introduced us to the context, concepts and background of de-identification and how it relates to social change work. This session, by contrast, was an opportunity to explore the more technical side of de-identifying data. We all came away with a better understanding of what the different frameworks are, why they matter and what the potential caveats are with each.

Max focused his presentation on k-anonymity: a property of an anonymized dataset in which each combination of quasi-identifier values (e.g. zip code, birthdate, sex) occurs at least k times. For example, a dataset is considered 2-anonymous when, for any combination of potentially identifying attributes found in any row of the table, there are always at least 2 rows with those exact attributes. Max explained how to strengthen the anonymity of a dataset using k-anonymity techniques while maintaining much of the dataset’s accuracy.
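The definition above can be checked mechanically. The following is a minimal Python sketch, not something shown in the presentation: the function name `k_anonymity`, the column names and the sample rows are all made up for illustration. It groups rows by their quasi-identifier values and reports the size of the smallest group, which is the largest k for which the table is k-anonymous.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    Hypothetical helper: `rows` is a list of dicts and `quasi_identifiers`
    is the list of column names treated as potentially identifying.
    """
    # Count how many rows share each combination of quasi-identifier values.
    group_sizes = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    # The smallest group determines k; an empty table has no groups.
    return min(group_sizes.values()) if group_sizes else 0


# Illustrative, already-generalized records (invented data).
rows = [
    {"zip": "021**", "birth_decade": "1980s", "sex": "F", "diagnosis": "flu"},
    {"zip": "021**", "birth_decade": "1980s", "sex": "F", "diagnosis": "asthma"},
    {"zip": "100**", "birth_decade": "1990s", "sex": "M", "diagnosis": "flu"},
    {"zip": "100**", "birth_decade": "1990s", "sex": "M", "diagnosis": "diabetes"},
]

quasi_identifiers = ["zip", "birth_decade", "sex"]
print(k_anonymity(rows, quasi_identifiers))
```

Adding a single row with a unique combination of zip code, birth decade and sex would drop the result to 1, which is why values are typically generalized or rows suppressed until every group reaches the desired size.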

However, there are situations in which k-anonymity does not protect individuals from re-identification. It is vulnerable to a range of threats, including homogeneity, skewness, similarity and background knowledge attacks. To address these threats, Max briefly explained l-diversity (each group of rows sharing the same combination of quasi-identifiers contains at least l different sensitive values) and t-closeness (the distribution of sensitive values within each such group differs from their distribution across the whole table by no more than a threshold t).
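To make these two properties concrete, here is a rough Python sketch under stated assumptions. The function names, column names and sample rows are hypothetical and were not part of the presentation. The distance used for t-closeness here is total variation distance, chosen as a simple stand-in for a categorical sensitive attribute; other distance measures can be used.

```python
from collections import Counter, defaultdict


def sensitive_values_by_group(rows, quasi_identifiers, sensitive):
    """Collect the sensitive values of rows sharing the same quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in quasi_identifiers)
        groups[key].append(row[sensitive])
    return groups


def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any group."""
    groups = sensitive_values_by_group(rows, quasi_identifiers, sensitive)
    return min((len(set(values)) for values in groups.values()), default=0)


def distribution(values):
    """Turn a list of values into a dict of relative frequencies."""
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}


def t_closeness(rows, quasi_identifiers, sensitive):
    """Return the largest distance between any group's distribution of
    sensitive values and the whole table's distribution.

    Assumption: total variation distance is used as the distance measure.
    """
    overall = distribution([row[sensitive] for row in rows])
    groups = sensitive_values_by_group(rows, quasi_identifiers, sensitive)
    worst = 0.0
    for values in groups.values():
        local = distribution(values)
        # Every value in a group also appears in the overall distribution.
        distance = 0.5 * sum(
            abs(local.get(value, 0.0) - share) for value, share in overall.items()
        )
        worst = max(worst, distance)
    return worst


# Invented data: 2-anonymous, but the first group is homogeneous.
rows = [
    {"zip": "021**", "sex": "F", "diagnosis": "flu"},
    {"zip": "021**", "sex": "F", "diagnosis": "flu"},
    {"zip": "100**", "sex": "M", "diagnosis": "flu"},
    {"zip": "100**", "sex": "M", "diagnosis": "asthma"},
]

print(l_diversity(rows, ["zip", "sex"], "diagnosis"))
print(t_closeness(rows, ["zip", "sex"], "diagnosis"))
```

In this invented table every combination of quasi-identifiers appears twice, yet anyone who knows a person is in the first group learns their diagnosis. That is the homogeneity problem, and it shows up as an l-diversity of 1.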

We’re so grateful to Max Shron for sharing his time and expertise on this topic with the Responsible Data Forum! A recording of the presentation is embedded below:

More information about this presenter:

Max runs Polynumeral, a data science consultancy based in lower Manhattan. While at Polynumeral, he has worked with BRAC, the award-winning Bangladeshi NGO, and the World Bank, in addition to numerous media, education, and technology companies. Prior to founding Polynumeral, he was the data scientist at OkCupid, where he worked on the popular OkTrends blog. He is the author of Thinking with Data from O’Reilly Media. His work has been featured in the New York Times, Chicago Tribune, WNYC, Huffington Post, and more.

More information on our discussion series on de-identifying data:

Our first discussion was an introduction for advocacy organisations with Mark Elliot of the UK Anonymisation Network. We have a summary and a recording of the entire conversation on our website.
Our second discussion with Sara-Jayne Terp of Thoughtworks focused on risk analysis and mitigation strategies. We have a recording of the presentation and a summary of the discussion on our website.
Our third discussion featured Max Shron, of Polynumeral, who presented an introduction to k-anonymity and other de-identification frameworks.
Upcoming: We’re thrilled to be facilitating a working session with Amy O’Donnell of Oxfam on developing a data-deposit decision-making framework on November 19 at 10am EST. More info here.