%0 Journal Article
%@ 1438-8871
%I Gunther Eysenbach
%V 14
%N 1
%P e33
%T De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset
%A El Emam,Khaled
%A Arbuckle,Luk
%A Koru,Gunes
%A Eze,Benjamin
%A Gaudette,Lisa
%A Neri,Emilio
%A Rose,Sean
%A Howard,Jeremy
%A Gluck,Jonathan
%+ Electronic Health Information Laboratory, CHEO Research Institute, Inc., 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada, 1 738 4181, kelemam@uottawa.ca
%K Open data
%K de-identification
%K privacy
%D 2012
%7 27.02.2012
%9 Original Paper
%J J Med Internet Res
%G English
%X Background: There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013. Objective: To de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Methods: We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack. Results: An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was .0084, below the .05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions. Conclusions: It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.
%M 22370452
%R 10.2196/jmir.2001
%U http://www.jmir.org/2012/1/e33/
%U https://doi.org/10.2196/jmir.2001
%U http://www.ncbi.nlm.nih.gov/pubmed/22370452