Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

Background: Developers and users of synthetic data generation (SDG) methods regularly need to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general, nor for the specific purpose of comparing SDG methods.

Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to their performance on a specific analytic workload: building logistic regression prediction models from synthetic data, which is a very frequent workload in health research.

Methods: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). Each metric was computed by averaging across 20 synthetic data sets generated from the same generative model. The metrics were then tested on their ability to rank the SDG methods by prediction performance, defined as the difference in the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) between logistic regression prediction models built on synthetic data and models built on real data.

Results: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of the real and synthetic joint distributions.

Conclusions: This study validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set and to evaluate and compare alternative SDG methods.


Adult: The census income data from the 1994 Census database (44,842 records; 13 variables; 1.193).

BankNote: Data from images that were taken for the evaluation of an authentication procedure for banknotes.

Cardiotocography: Fetal cardiotocograms (CTGs) automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a consensus classification label was assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). The dataset can therefore be used for either 10-class or 3-class experiments.

Multivariate Hellinger Distance
The Bhattacharyya distance measures the degree of dissimilarity between two probability distributions [3]. It is widely used in areas such as image segmentation and feature extraction [4], [5].

Consider probability distributions p and q over the same domain X. The Bhattacharyya distance is defined as

$$D_B(p, q) = -\ln BC(p, q)$$

where the Bhattacharyya coefficient $BC$ can be computed for discrete distributions as

$$BC(p, q) = \sum_{x \in X} \sqrt{p(x)\, q(x)}$$

and for continuous probability distributions as

$$BC(p, q) = \int_X \sqrt{p(x)\, q(x)}\, dx$$

It is not limited to computing the similarity between two univariate distributions. For multivariate normal distributions $p \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $q \sim \mathcal{N}(\mu_2, \Sigma_2)$, the Bhattacharyya distance is

$$D_B = \frac{1}{8}(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2}\ln\left(\frac{\det \Sigma}{\sqrt{\det \Sigma_1 \det \Sigma_2}}\right), \qquad \Sigma = \frac{\Sigma_1 + \Sigma_2}{2}$$

The Hellinger distance can be derived from the Bhattacharyya coefficient,

$$H(p, q) = \sqrt{1 - BC(p, q)}$$

and has the advantage that it is bounded between zero and one, and hence is more interpretable [6]. We therefore use the Hellinger distance computed on the multivariate normal distribution.
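The Hellinger distance between two multivariate normal distributions can be computed in closed form from the Bhattacharyya distance. The following is a minimal sketch using NumPy; the function names and the simulated "real" and "synthetic" samples are illustrative, not from the paper:

```python
import numpy as np

def bhattacharyya_mvn(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate normal distributions."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # slogdet is used for numerical stability with near-singular covariances
    _, ld = np.linalg.slogdet(cov)
    _, ld1 = np.linalg.slogdet(cov1)
    _, ld2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (ld - 0.5 * (ld1 + ld2))
    return term1 + term2

def hellinger_mvn(mu1, cov1, mu2, cov2):
    """Hellinger distance via the Bhattacharyya coefficient BC = exp(-D_B)."""
    db = bhattacharyya_mvn(mu1, cov1, mu2, cov2)
    return np.sqrt(1.0 - np.exp(-db))

# Illustrative usage: estimate the distance between two samples
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=1000)
synth = rng.multivariate_normal([0.5, 0.0], np.eye(2), size=1000)
h = hellinger_mvn(real.mean(axis=0), np.cov(real.T),
                  synth.mean(axis=0), np.cov(synth.T))
```

The result is bounded between zero (identical distributions) and one (disjoint support), which is what makes the metric interpretable.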

Wasserstein Distance
In the context of optimal transport planning, the Wasserstein distance evaluates the effort needed to transport a distribution of mass $\mu$ on a space to another distribution $\nu$ on the same space. Given a transport plan $\gamma(x, y)$ for moving mass from point x to point y and a cost function $c(x, y)$, the optimal cost can be defined as

$$C = \inf_{\gamma \in \Gamma(\mu, \nu)} \int c(x, y)\, d\gamma(x, y)$$

where $\Gamma(\mu, \nu)$ denotes the collection of all transport plans.

The optimal cost is the same as the definition of the $W_1$ distance if the cost function is equivalent to the distance between points x and y. The $p$-th Wasserstein distance between two probability measures $\mu$ and $\nu$ in $P_p(M)$ is defined as

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p\, d\gamma(x, y) \right)^{1/p}$$

where $(M, d)$ is a metric space, $P_p(M)$ denotes the collection of all probability measures on $M$, and $\Gamma(\mu, \nu)$ denotes the collection of all measures on $M \times M$ with marginals $\mu$ and $\nu$ respectively.
In the current paper the W1 metric is used.
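As an illustration, the univariate $W_1$ distance between two empirical samples can be computed with SciPy's `wasserstein_distance`; the simulated normal samples below are illustrative, not from the paper, and for joint multivariate distributions an optimal transport solver would be needed instead:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=1000)
synthetic = rng.normal(loc=0.5, scale=1.0, size=1000)

# W1 between the two empirical (univariate) distributions;
# for a pure mean shift of 0.5 this is approximately 0.5.
w1 = wasserstein_distance(real, synthetic)
```

In practice a per-variable $W_1$ can be computed for each column of the real and synthetic datasets and then aggregated, which is one common way of applying this metric to tabular data.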

Distinguishability
Training a binary classifier on the stacked real and synthetic datasets has been proposed as an approach for comparing two multivariate distributions [7], [8]. The probabilities estimated by the classifier across all observations are then used to compute a score. In our context, the two datasets are the real and synthetic datasets.
Adopting a perspective from the propensity score matching literature [9], a propensity mean square error metric has been proposed to evaluate the similarity of real and synthetic datasets [10], [11], which we will refer to as propensityMSE. To calculate propensityMSE, a classifier is trained on a stacked dataset consisting of real observations labelled 1 and synthetic observations labelled 0. The propensityMSE score is the mean squared difference between each estimated probability and the average prediction expected when it is not possible to distinguish between the two datasets. If the datasets are of the same size and indistinguishable, which is the assumption we make here, the average estimate will be 0.5:

$$\text{propensityMSE} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - 0.5\right)^2$$

where $N$ is the size of the stacked dataset and $p_i$ is the propensity score for observation $i$. Note that the training set is also used to compute the propensity score for each observation in the stacked dataset.
If the multivariate distributions of the two datasets are the same, then the probability will hover around 0.5, indicating that the classifier is not able to distinguish between them and the propensityMSE approaches zero. If the two datasets are completely different, then the classifier will be able to distinguish between them. In such a case the propensity score will be either zero or one, with propensityMSE approaching 0.25.
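A minimal sketch of the propensityMSE calculation follows, assuming a logistic regression classifier (the choice of classifier and the simulated data are illustrative, not prescribed by the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_mse(real, synthetic):
    """propensityMSE: train a classifier to separate real (label 1) from
    synthetic (label 0) records, then measure how far its propensity
    scores deviate from 0.5. Scores are computed on the training set
    itself, matching the description in the text."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(X)[:, 1]
    return np.mean((p - 0.5) ** 2)

# Illustrative usage with simulated data
rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(500, 3))
m_same = propensity_mse(real, rng.normal(0.0, 1.0, size=(500, 3)))  # near 0
m_diff = propensity_mse(real, rng.normal(5.0, 1.0, size=(500, 3)))  # near 0.25
```

When the two samples come from the same distribution the classifier cannot separate them and the score approaches zero; when they are easily separable the score approaches 0.25.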
To make the metric more easily interpretable, we can scale it to be between zero and one as follows (for the case when the two datasets are the same size):

$$\text{propensityMSE}_{\text{scaled}} = \frac{1}{0.25} \times \frac{1}{N}\sum_{i=1}^{N}\left(p_i - 0.5\right)^2$$

Another related approach that has been used to evaluate the utility of synthetic data is to take a prediction perspective rather than a propensity perspective. This has been applied with "human discriminators" by asking domain experts to manually classify sample records as real or synthetic [12]-[14]. A sample of real records and a sample of synthetic records are drawn and shuffled together, and the shuffled records are presented to clinicians with expertise in the domain, who are asked to subjectively discriminate between them by indicating which records are real and which are synthetic. The classification of every record is either correct or incorrect. A correct classification is not a good outcome in this case because it indicates that the real and synthetic data are distinguishable. A real record classified as synthetic and a synthetic record classified as real are both good outcomes because the clinician was not able to tell the difference. High distinguishability occurs only when the human discriminator can correctly classify real and synthetic records.
The use of human discriminators is not scalable, and therefore we can instead use a machine learning classifier that is trained on a training dataset and makes predictions on a hold-out test dataset. This approach mimics the subjective evaluations described above. We will refer to this metric as predictionMSE. Note that this differs from the calculation of propensityMSE, where the training dataset is also used to compute the probabilities.
The predictionMSE calculation needs to be adjusted so as not to penalize incorrect classification. For example, if a real record has a predicted probability less than 0.5, it would be penalized under propensityMSE; under the prediction approach, however, this is an indicator that the discriminator is unable to distinguish between real and synthetic records. We therefore define the adjusted propensity score as

$$\tilde{p}_i = \begin{cases} 0.5 & \text{if } y_i = 1 \text{ and } p_i \le 0.5, \text{ or } y_i = 0 \text{ and } p_i \ge 0.5 \\ p_i & \text{otherwise} \end{cases}$$

where $y_i$ is the real (1) versus synthetic (0) label, and the full adjusted metric is

$$\text{predictionMSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\tilde{p}_i - 0.5\right)^2$$

This formulation does not penalize synthetic data that looks more like real data than it does like synthetic data. The concept of using prediction error for the stacked dataset has been considered before, but the AUROC was used rather than the squared error [15].

3 We took the log of the $U_c$ value to be consistent with the original article [16].

4 Because the variables need to be converted to a binary representation, some of the less frequent categories are not generated in the synthetic datasets, which causes the calculation to be invalid.

5 For nominal variables, the integer encoding can result in inflated distances.
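A sketch of the predictionMSE described above, under illustrative assumptions (a logistic regression discriminator, a 70/30 train/test split, and clamping misclassified predictions so they contribute zero error; none of these specifics are prescribed by the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def prediction_mse(real, synthetic, seed=0):
    """predictionMSE sketch: fit a discriminator on a training split,
    score a held-out split, and do not penalize 'crossed-over'
    predictions (a real record predicted synthetic, or vice versa)."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # Clamp misclassified predictions to 0.5 so they contribute zero error
    crossed = ((y_te == 1) & (p <= 0.5)) | ((y_te == 0) & (p >= 0.5))
    p_adj = np.where(crossed, 0.5, p)
    return np.mean((p_adj - 0.5) ** 2)

# Illustrative usage with simulated data
rng = np.random.default_rng(2)
real = rng.normal(0.0, 1.0, size=(500, 3))
m_low = prediction_mse(real, rng.normal(0.0, 1.0, size=(500, 3)))   # near 0
m_high = prediction_mse(real, rng.normal(5.0, 1.0, size=(500, 3)))  # near 0.25
```

The hold-out split is the key difference from propensityMSE: the discriminator is evaluated on records it has not seen, mimicking the human discriminator setting.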

Appendix S4: Prediction Accuracy
The following synthetic data values are averages across 20 synthetic datasets. They were used to compute the AUROC and AUPRC differences. Note that for each SDG method the real data LR model was re-estimated, and therefore the values differ slightly because the cross-validation partitions are different.
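The AUROC and AUPRC difference computation can be sketched as follows. The generative process, model choices, and single train/test split here are illustrative stand-ins for the paper's cross-validated procedure averaged over 20 synthetic datasets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical data generator: a binary outcome driven by one feature,
# with `noise` controlling label noise (purely illustrative)
rng = np.random.default_rng(0)
def make(n, noise):
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + noise * rng.normal(size=n) > 0).astype(int)
    return X, y

X_tr, y_tr = make(800, 0.5)    # stand-in for real training data
X_te, y_te = make(400, 0.5)    # stand-in for real held-out test data
X_syn, y_syn = make(800, 0.8)  # stand-in for one synthetic training set

real_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
syn_model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

# Both models are evaluated on the same real test data; the differences
# measure how much performance is lost by training on synthetic data
p_real = real_model.predict_proba(X_te)[:, 1]
p_syn = syn_model.predict_proba(X_te)[:, 1]
auroc_diff = roc_auc_score(y_te, p_real) - roc_auc_score(y_te, p_syn)
auprc_diff = average_precision_score(y_te, p_real) - average_precision_score(y_te, p_syn)
```

Small differences indicate that the synthetic data supports the prediction workload nearly as well as the real data.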

Appendix S5: Results Plots
The following are the plots showing the prediction performance rank for the utility metrics not shown in the main body of the paper. For all the plots the three SDG methods were ordered based on their relative utility metric values into the "H", "M", and "L" groups.