Differential Privacy: an Extensive Evaluation of Open-Source Tools for eHealth Applications
Master's Thesis
Computer systems and networks (MPCSN), MSc
Medical data is fundamentally important for improving healthcare and generating new insights. However, personal health data is especially sensitive, and statistics computed over it can leak information that compromises individual integrity. To protect the data of individuals, or of smaller groups of individuals, formal techniques can be applied in software to modify the data and decrease the risk of unauthorized disclosure. One such formal method that has received increasing attention in the last couple of years is differential privacy (DP). DP adds carefully calibrated noise to a statistical query, during the training of machine learning models, or during synthetic data generation, to guarantee the privacy of the individual participants in the dataset. However, there is a gap between fundamental and applied research on DP. In particular, little is known about which software tools can be leveraged in privacy application development, and what level of performance developers can expect from such tools. In this thesis, we review and evaluate a set of DP tools to gain insights into how they perform in practice. To compare the tools' performance fairly, we categorize them into three domains: statistical queries, machine learning, and synthetic data release. Specifically, for statistical queries we look at Google DP and SmartNoise; for machine learning, TensorFlow Privacy, Diffprivlib, and Opacus; and for synthetic data release, SmartNoise and Gretel. All of these tools are open-source, real-world deployments created in collaboration with, and vetted by, a community of engineers, scientists, and experts. In our evaluation, we measure how the tools affect data analysis, using the metrics of accuracy and overhead, by comparing the results of differentially private analysis with analysis without privacy protection.
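To illustrate the noise-adding step described above, the following is a minimal sketch of the classic Laplace mechanism for a counting query. It is our own illustrative code, not taken from any of the evaluated tools, and the function name is hypothetical:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return a differentially private estimate of true_value.

    Noise is drawn from Laplace(0, sensitivity / epsilon), which satisfies
    epsilon-differential privacy for a query whose output changes by at
    most `sensitivity` when one individual's record is added or removed.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# Example: a counting query (sensitivity 1) over a toy dataset.
ages = [34, 29, 57, 41, 63]
true_count = sum(1 for a in ages if a > 40)  # 3 individuals without privacy
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

A smaller ε gives stronger privacy but larger expected noise, which is the privacy–utility trade-off the evaluation quantifies.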
The evaluation employs two datasets for generating analysis results: Parkinson Telemonitoring and the 2018 Massachusetts Health Reform Survey. To implement the evaluation, we develop a framework in which the system overhead and loss of accuracy can be quantified for any tool and any dataset, and which can be reused for further tests. The full source code for the framework is openly released at: https://github.com/anthager/dp-evaluation. This evaluation allows us to examine how these tools perform on real data and how they compare. Based on the evaluation results, we provide guidance on how these tools can be applied to ultimately help improve privacy for individuals. Our work provides extensive testing results on the considered tools under a variety of settings, showing how the tools trade off privacy and utility under different conditions. From the comparison of tool performance we draw some general observations and trends; for example, conducting statistical queries and ML tasks on DP synthetic data has a significantly larger impact on data accuracy than running statistical query and ML tools, with DP integrated into the tools themselves, on non-privacy-protected data. The evaluation also reveals that configuring the tools is not trivial, as the configuration depends heavily on the data, the tool being used, and the use case. The results from our evaluation can serve as guidelines for optimally configuring the tools, e.g., how the value of ε can be configured for different data sizes, and what data utility can be expected in those circumstances. We detail a summary of our results in Section 5.1.
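The accuracy and overhead metrics described above could be quantified roughly as follows. This is our own simplified sketch of the measurement idea, not the released framework; the function and its signature are hypothetical:

```python
import time

def measure(baseline_query, private_query, data):
    """Compare a private query against its non-private baseline.

    Returns (relative accuracy loss, relative runtime overhead).
    """
    t0 = time.perf_counter()
    baseline = baseline_query(data)
    t_base = time.perf_counter() - t0

    t0 = time.perf_counter()
    private = private_query(data)
    t_priv = time.perf_counter() - t0

    # Relative error of the private result w.r.t. the true result.
    accuracy_loss = abs(private - baseline) / abs(baseline) if baseline else float("nan")
    # Runtime of the private query relative to the baseline.
    overhead = t_priv / t_base
    return accuracy_loss, overhead
```

In practice such measurements would be repeated many times per setting (per tool, per ε, per data size) and averaged, since DP results are randomized.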
Differential Privacy, Privacy Tools, Statistical Queries, Machine Learning, Synthetic Data, eHealth