Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

Background There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. Objective The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. Methods A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. Results The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. Conclusions We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.


A.1 Matching with Generalization
Generalizations on the quasi-identifiers can be expressed in terms of hierarchies, as illustrated for three example quasi-identifiers in Figure 1. The more generalizations that are applied to the quasi-identifiers in the synthetic and real samples, the greater the chance of a match between the synthetic sample and the real sample. For example, if we try to match the synthetic sample with the real sample on the exact BMI, then we may not match any patients, but if we generalize BMI to the b1 categories in panel (b) of Figure 1, then the match rate increases. At the same time, as the data are generalized, the risk of identification of the real data will likely decrease. We can represent all possible generalizations on the quasi-identifiers as a generalization lattice (see Figure 2). In a generalization lattice, the least generalized version of the quasi-identifiers is at the bottom (i.e., the highest granularity, which is the original data), and the most generalized version of the quasi-identifiers is at the top. Each node represents a further generalization of a single quasi-identifier compared to the nodes below it in the lattice. The lattice represents all possible generalizations that an adversary can attempt on the synthetic data in order to identify individuals in it. Because the match rate and the identification risk move in opposite directions, the value in equation (11) in the main body of the paper will not necessarily change monotonically in either direction as one moves vertically through the lattice.
As we move up the lattice the likelihood of a match between the real sample and synthetic sample will by definition stay the same or increase. Also, as we move up the lattice, the risk of identification (the probability of matching a real sample record with a real person in the population or vice versa) will by definition stay the same or decrease.
The objective of navigating the lattice is to find the node that has the highest risk of meaningful identity disclosure; that node represents the risk for the synthetic dataset. This means we assume that the adversary will be able to identify the configuration of generalizations that maximizes identity disclosure risk, and we focus on that configuration. This is a somewhat conservative assumption.
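The exhaustive search over the generalization lattice can be sketched as follows. This is a minimal illustration, not the authors' implementation: the level counts in `levels` are hypothetical (Figure 1 is not reproduced here), and `toy_risk` is a made-up stand-in for equation (11), chosen only because it is non-monotonic in the amount of generalization.

```python
from itertools import product


def lattice_nodes(levels):
    """Yield every node of the generalization lattice.

    `levels` maps each quasi-identifier name to its number of
    generalization levels. A node assigns one level index to each
    quasi-identifier; index 0 is the original (least generalized) data.
    """
    names = list(levels)
    for combo in product(*(range(levels[name]) for name in names)):
        yield dict(zip(names, combo))


def highest_risk_node(levels, risk_fn):
    """Return (node, risk) for the node with the largest risk value."""
    best_node, best_risk = None, float("-inf")
    for node in lattice_nodes(levels):
        risk = risk_fn(node)
        if risk > best_risk:
            best_node, best_risk = node, risk
    return best_node, best_risk


# Hypothetical level counts for three quasi-identifiers.
levels = {"t": 3, "b": 2, "s": 5}


def toy_risk(node):
    # Placeholder for equation (11): a match rate that grows with
    # generalization times an identification risk that shrinks with it.
    total = sum(node.values())
    return total / (1 + total) ** 2


best_node, best_risk = highest_risk_node(levels, toy_risk)
print(best_node, best_risk)
```

In a real assessment `risk_fn` would evaluate the meaningful identity disclosure risk on the real and synthetic samples generalized to the given node.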

A.2 Matching with Variable Subsets
If the number of quasi-identifiers is denoted by k, an adversary may also try to match on fewer than k variables. By definition, the fewer the variables used for matching, the more likely it is that synthetic records match real records. At the same time, the identification risk decreases as fewer quasi-identifiers are considered.
The different combinations of quasi-identifiers can be represented as a subsets lattice, as illustrated in Figure 3. At the bottom of the lattice are all of the quasi-identifiers, and as we go further up the lattice the number of variables decreases. We need to evaluate the risk in equation (11) in the main body of the paper for every node in the subsets lattice as well.
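The nodes of a subsets lattice can be enumerated with a few lines of Python. This is a sketch under two assumptions that the text does not state: the empty subset is excluded, and the quasi-identifier names `t`, `b`, and `s` are used purely as labels.

```python
from itertools import combinations


def subset_lattice(quasi_identifiers):
    """Yield subsets from the top of the lattice to the bottom.

    The top holds the smallest subsets (fewest variables) and the
    bottom holds the full set of k quasi-identifiers.
    Assumption: the empty subset is not part of the lattice.
    """
    k = len(quasi_identifiers)
    for size in range(1, k + 1):
        for subset in combinations(quasi_identifiers, size):
            yield subset


subsets = list(subset_lattice(["t", "b", "s"]))
for subset in subsets:
    print(subset)
```

With k = 3 this yields 7 subsets, ending with the full set at the bottom of the lattice.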

A.3 Reducing Complexity
If we evaluate every node in each lattice, then the number of nodes for which we have to compute risk is determined by $Q_i$, the number of generalization levels in quasi-identifier $i$. For example, the three quasi-identifiers with the hierarchies in Figure 1 would give us 240 nodes for which the meaningful identity disclosure risk must be computed in order to find the node with the highest risk value. However, the datasets that we are computing on become smaller as we move down the lattice, as explained below.
Initially, the $R_s$ value can be computed for all records, and the real sample records with $R_s = 0$ eliminated from further consideration.
We can start our search from the top node in the generalization lattice, and as we move down, remove records from the real sample that were not matched. This means that the dataset size that is being processed to perform the calculation in equation (2) will gradually decrease as we move from top to bottom, speeding up the computations.
To take advantage of that pattern to reduce the amount of computation, every node need only consider the patients who have matched in nodes higher up in the lattice hierarchy along the defined generalization paths, as illustrated in Figure 4. In this case, the intersection of real sample patients that matched (i.e., $I_s = 1$) in nodes <t1,b0,s4>, <t1,b1,s3> and <t2,b0,s3> will be used to perform the computations in <t1,b0,s3> (see Figure 4). In practice, the lattice navigation would start with the variable subsets lattice, and for every node there perform the computations on the generalization lattice for that subset of quasi-identifiers. Also, computations on the subsets lattice should start from the top and move down.
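The top-down pruning can be sketched as follows. This is an illustration only: the quasi-identifiers (`age`, `bmi`), the bin widths in `WIDTHS`, and the toy records are all hypothetical, and generalization is modeled as integer binning with nested widths so that a match at a node implies a match at every node above it.

```python
from itertools import product

# Hypothetical bin widths per generalization level (level 0 = original
# values). Each width is a multiple of the previous one, so the
# hierarchy is nested.
WIDTHS = {"age": [1, 5, 10], "bmi": [1, 5]}
NAMES = list(WIDTHS)


def generalize(record, node):
    """Map a record to its generalized values at the given lattice node."""
    return tuple(record[i] // WIDTHS[name][node[i]]
                 for i, name in enumerate(NAMES))


def parents(node):
    """Nodes directly above `node`: one quasi-identifier, one level up."""
    result = []
    for i, name in enumerate(NAMES):
        if node[i] + 1 < len(WIDTHS[name]):
            result.append(node[:i] + (node[i] + 1,) + node[i + 1:])
    return result


def matched_per_node(real, synthetic):
    """Top-down pass: indices of matched real records at every node."""
    # Sorting by total level, descending, guarantees that every parent
    # is processed before its children.
    nodes = sorted(product(*(range(len(WIDTHS[n])) for n in NAMES)),
                   key=sum, reverse=True)
    matched = {}
    for node in nodes:
        # Only records matched in all parent nodes can match here.
        candidates = set(range(len(real)))
        for parent in parents(node):
            candidates &= matched[parent]
        synth_keys = {generalize(s, node) for s in synthetic}
        matched[node] = {i for i in candidates
                         if generalize(real[i], node) in synth_keys}
    return matched


real = [(34, 27), (41, 22), (67, 31)]   # toy (age, bmi) records
synthetic = [(36, 29), (41, 22)]
matched = matched_per_node(real, synthetic)
print(matched[(2, 1)], matched[(0, 0)])
```

The candidate set shrinks as the pass moves down, so the matching step at lower nodes touches fewer real sample records.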
If the risk at the highest risk node is within a threshold τ reflecting acceptable risk, then the synthetic dataset is considered to have acceptably low meaningful identity disclosure risk. How to set this threshold is discussed in Appendix B.
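The acceptance check can be sketched as a one-line comparison. The non-strict comparison is an assumption, since the exact form of the inequality is not shown here; the risk values and the 0.09 threshold are the ones reported in the abstract.

```python
def is_acceptable(max_node_risk, tau):
    """Return True if the highest node risk is within the threshold.

    Assumption: the comparison is non-strict (risk <= tau).
    """
    return max_node_risk <= tau


TAU = 0.09  # threshold quoted in the abstract

# Meaningful identity disclosure risk values reported in the abstract.
reported = {
    "Washington State hospital discharge sample": 0.0198,
    "Canadian COVID-19 cases sample": 0.0086,
}

for name, risk in reported.items():
    print(name, is_acceptable(risk, TAU))
```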

Appendix B: Defining Acceptable Risk
For the purposes of our risk assessment model, the threshold τ is an important parameter because it is used to determine if the dataset is high risk or not. Below we provide some precedent-based guidance for choosing a value.
The general idea of there being a threshold for what is deemed acceptable identification risk is reflected qualitatively in various statutes. For example, under the HIPAA Privacy Rule, the risk needs to be "very small" that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient, to identify an individual [1]. Recent guidance from the Ontario Information and Privacy Commissioner's Office indicates that the risk of identification of individuals in data should be determined to be "very low" or "very small" prior to the data being released [2].
In Canada and the European Union, the reasonableness standard is widely used to judge whether or not information is identifiable [3]-[6]. For instance, in the Federal Court of Canada case of Gordon v. Canada (Health), Gibson, J. adopted the "serious possibility" test proposed by the Federal Privacy Commissioner to determine if information constitutes personal information as defined in the Privacy Act (i.e., whether it is information about an identifiable individual).

Recital 26 of the European General Data Protection Regulation states, "To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly." [5] The EU Article 29 Working Party Opinion on Anonymization Techniques indicates that anonymized data must meet three specific criteria (must not allow for singling out, linkability or inference) or a re-identification risk analysis must be performed to demonstrate that the risk is "acceptably small" [7].
A key question then is how one defines subjective terms such as "reasonable", "reasonably likely", "serious possibility", "very low", "very small", or "acceptably small" risk, and how to do so in a manner that can be practically translated into a probability value that sets an identifiability threshold against which precise measurements can be compared. Translating these subjective terms into quantitative values has proven difficult, with large variation in how they are interpreted, according to recent surveys [8].
However, there are precedents for setting thresholds in the area of statistical disclosure control. The field of statistical disclosure control has historically concerned itself with the potential disclosure from small "cell sizes" when data is aggregated. That is, a cell in a table in which data is aggregated is considered sensitive if the number of individuals represented in the cell is below a threshold. This threshold rule is often called a minimum cell size. We therefore review existing precedents for setting "minimum cell sizes" since that is the dominant approach that has been applied thus far for deciding acceptable identifiability risk.
The threshold probability of identification is one divided by the minimum cell size. For example, if the minimum cell size is 5, then the threshold probability of re-identification is 1/5 or 0.2.
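The conversion from a minimum cell size to a threshold probability is a simple reciprocal, sketched below. The function name is illustrative, and the cell sizes in the loop are examples of values used in practice.

```python
def threshold_from_cell_size(min_cell_size):
    """Threshold probability of identification for a minimum cell size."""
    if min_cell_size < 1:
        raise ValueError("minimum cell size must be at least 1")
    return 1 / min_cell_size


for size in (3, 5, 11, 20):
    print(size, round(threshold_from_cell_size(size), 2))
```

A minimum cell size of 5 gives a threshold of 0.2, and a minimum cell size of 11 gives approximately 0.09.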
When governments and other organizations release statistical data, cell size thresholds are often used to ensure that the risk of identification remains at an acceptably low level. Historically, data custodians have used the "cell size of five" rule as a threshold for deciding whether to de-identify data [9]-[23]. This rule was originally applied to count data in tables. Count data, however, can be easily converted to individual-level data; therefore, these two representations are in effect the same thing. A minimum "cell size of five" rule would translate into a maximum probability of identifying a single record of 0.2. Some custodians use a cell size of 3 [24]-[28], which is equivalent to a probability of identifying a single record of 0.33. For the release of data, a cell size of 11 has been used in the US [29]-[33], and a cell size of 20 for public Canadian and US patient data [34]-[36]. Cell sizes from 5 to 30 have been used across the US to protect students' personally identifying information, with the cell size of 10 being used most commonly (by 39 states) [37]. Other cell sizes such as 4 [38], 6 [39]-[43], 10 [36], [44]-[46], and 16 [47] have been used in different scenarios in various countries.
The European Medicines Agency (EMA) has recently established a policy on the publication of clinical data for medicinal products [48] which requires applicants/sponsors to openly share clinical trial data. The guidelines accompanying the policy recommend a maximum risk threshold of 0.09, which is equivalent to a minimal cell size of 11. Health Canada implemented the same threshold for the sharing of clinical trial data [49].
In commentary about the de-identification standard in the HIPAA Privacy Rule, the US Department of Health and Human Services notes in the Federal Register that the identity disclosure risk is largely from unique records [18], [50]. Unique records are those that make up a cell count of 1. It is clear that they considered records that are not unique to have an acceptably low risk of being identified. This translates to a minimal cell size of 2, although cell sizes less than three are not recommended in the disclosure control literature [51].
One would assume that there is a higher minimal cell size for more sensitive information. However, in such cases the thresholds used have not been so consistent. For example, a minimal count of 3 has been recommended for HIV/AIDS data [27], [52], a cell size of 5 for abortion data [13], and a cell size of 6 for at-risk children [53]. Recognizing that the type of data can differ, NHS Scotland has set a minimum cell size of 3 for non-sensitive data and a minimum cell size of 5 or 10 for sensitive information [54].
Therefore, there are two thresholds that can be used and that are based on strong precedents. The first is 0.2, since it is the most commonly used in practice. The second, which is more specific to health data, is 0.09, since it has been explicitly proposed by two health regulators. The latter value is more conservative and provides a buffer for increasing adversary knowledge over time, and therefore it is the value that we recommend.