%0 Journal Article
%@ 1438-8871
%I Gunther Eysenbach
%V 14
%N 1
%P e33
%T De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset
%A El Emam,Khaled
%A Arbuckle,Luk
%A Koru,Gunes
%A Eze,Benjamin
%A Gaudette,Lisa
%A Neri,Emilio
%A Rose,Sean
%A Howard,Jeremy
%A Gluck,Jonathan
%+ Electronic Health Information Laboratory, CHEO Research Institute, Inc., 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada, 1 738 4181, kelemam@uottawa.ca
%K Open data
%K de-identification
%K privacy
%D 2012
%7 27.02.2012
%9 Original Paper
%J J Med Internet Res
%G English
%X Background: There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013. Objective: To de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Methods: We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack. Results: An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was .0084, below the .05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions. Conclusions: It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.
%M 22370452
%R 10.2196/jmir.2001
%U http://www.jmir.org/2012/1/e33/
%U https://doi.org/10.2196/jmir.2001
%U http://www.ncbi.nlm.nih.gov/pubmed/22370452