Dr. Balázs Pejó

Assistant Professor

pejo (at) crysys.hu

web: www.crysys.hu/~pejo/
office: I.E. 430
tel: +36 1 463 2080

Current courses | Student projects | Publications

Short Bio

Balázs Pejó was born in 1989 in Budapest, Hungary. He received a B.Sc. degree in Mathematics from the Budapest University of Technology and Economics (BME, Hungary) in 2012, and two M.Sc. degrees in Computer Science in 2014 through the Security and Privacy program of EIT Digital, from the University of Trento (UNITN, Italy) and Eötvös Loránd University (ELTE, Hungary). He earned his Ph.D. degree in Informatics from the University of Luxembourg (UNILU, Luxembourg) in 2019. Currently, he is a member of the Laboratory of Cryptography and System Security (CrySyS Lab).

Current Courses

IT Security (VIHIAC01)

This BSc course gives an overview of the different areas of IT security with the aim of increasing the security awareness of computer science students and shaping their attitude towards designing and using secure computing systems. The course prepares BSc students for security challenges that they may encounter during their professional career, and at the same time, it provides a basis for those students who want to continue their studies at MSc level (taking, for instance, our IT Security major specialization). We put special emphasis on software security and the practical aspects of developing secure programs.

IT Security (in English) (VIHIAC01)

This is the English version of the IT Security (VIHIAC01) course.

Privacy-Preserving Technologies (VIHIAV35)

This course provides a detailed overview of data privacy. It focuses on different privacy problems of web tracking, data sharing, and machine learning, as well as their mitigation techniques. The aim is to give the essential (technical) background knowledge needed to identify and protect personal data. These skills are becoming a must for every data/software engineer and data protection officer dealing with personal and sensitive data, and they are also required by the European General Data Protection Regulation.

Student Project Proposals

Privacy & Anonymization

The word privacy is derived from the Latin word "privatus", which means set apart from what is public; personal and belonging to oneself, and not to the state. There are multiple angles to privacy and multiple techniques to improve it to varying extents. Students can work on the following topics:

Required skills: none
Preferred skills: basic programming skills (e.g., python)

Machine Learning & Security & Privacy

Machine Learning (Artificial Intelligence) has become indisputably popular in recent years. The number of security-critical applications of machine learning has been steadily increasing over the years (self-driving cars, user authentication, decision support, profiling, risk assessment, etc.). However, machine learning still poses many open security problems. Students can work on the following topics:

Required skills: none
Preferred skills: basic programming skills (e.g., python), machine learning (not required)

Federated Learning - Security & Privacy & Contribution Scores

Federated learning enables multiple actors to build a common, robust machine learning model without sharing data, thus addressing critical issues such as data privacy, data security, data access rights, and access to heterogeneous data (a minimal illustrative sketch is given after the skill list below). Its applications span a number of industries, including defense, telecommunications, IoT, and pharmaceuticals. Students can work on the following topics:

Required skills: none
Preferred skills: basic programming skills (e.g., python), machine learning (not required)
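
For concreteness, here is a minimal FedAvg-style sketch in Python (the preferred skill above). The regression task, client data, and learning rate are all synthetic, invented purely for illustration:

# Toy FedAvg: clients fit a shared linear model without sharing raw data.
# Everything below (task, data, parameters) is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Each client holds its own private dataset.
clients = []
for _ in range(5):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

w = np.zeros(2)                        # global model
for _ in range(50):                    # communication rounds
    local_models = []
    for X, y in clients:               # one local gradient step on private data
        grad = 2 * X.T @ (X @ w - y) / len(y)
        local_models.append(w - 0.1 * grad)
    w = np.mean(local_models, axis=0)  # the server only sees and averages models

print("estimated weights:", w)         # converges towards true_w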

Economics of (cyber)security and (data)privacy

As evidenced over the last 10-15 years, cybersecurity is not a purely technical discipline. Decision-makers, whether sitting at security providers (IT companies), security demanders (everyone using IT), or the security industry, are mostly driven by economic incentives. Understanding these incentives is vital for designing systems that are secure in real-life scenarios. In parallel, data privacy has shown the same characteristics: proper economic incentives and controls are needed to design systems where sharing data is beneficial to both data subject and data controller. An extreme example of a flawed attempt at such a design is the Cambridge Analytica case.
The prospective student will identify a cybersecurity or data privacy economics problem, and use elements of game theory and other domain-specific techniques and software tools to transform the problem into a model and propose a solution (a minimal game-theoretic sketch is given after the skill list below). Potential topics include:

Required skills: model thinking, good command of English
Preferred skills: basic knowledge of game theory, basic programming skills (e.g., python, matlab, NetLogo)
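
As a flavor of the modeling involved, the sketch below enumerates the pure Nash equilibria of a hypothetical two-firm security-investment game; the payoff numbers are invented for illustration only:

# Toy interdependent-security game: two firms choose Invest / NotInvest.
# Payoff numbers are hypothetical, chosen only to illustrate the method.
from itertools import product

ACTIONS = ("Invest", "NotInvest")
# payoff[(a1, a2)] = (payoff to firm 1, payoff to firm 2)
payoff = {
    ("Invest",    "Invest"):    (-2, -2),   # both pay the security cost, low risk
    ("Invest",    "NotInvest"): (-4, -3),   # firm 2 free-rides, risk spills over
    ("NotInvest", "Invest"):    (-3, -4),
    ("NotInvest", "NotInvest"): (-6, -6),   # high breach probability for both
}

def pure_nash(payoff):
    eqs = []
    for a1, a2 in product(ACTIONS, ACTIONS):
        u1, u2 = payoff[(a1, a2)]
        best1 = all(u1 >= payoff[(b, a2)][0] for b in ACTIONS)
        best2 = all(u2 >= payoff[(a1, b)][1] for b in ACTIONS)
        if best1 and best2:
            eqs.append((a1, a2))
    return eqs

print(pure_nash(payoff))   # -> [('Invest', 'Invest')] with these numbers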

Publications

2023

Industry-Scale Orchestrated Federated Learning for Drug Discovery

M. Oldenhof and G. Ács and B. Pejo and A. Schuffenhauer and N. Holway and N. Sturm and A. Dieckmann and O. Fortmeier and E. Boniface and C. Mayer and A. Gohier and P. Schmidtke and R. Niwayama and D. Kopecky and L. Mervin and P. C. Rathi and L. Friedrich and A. Formanek and P. Antal and J. Rahaman and A. Zalewski and W. Heyndrickx and E. Oluoch and M. Stößel and M. Vančo and D. Endico and F. Gelus and T. de Boisfossé and A. Darbier and A. Nicollet and M. Blottière and M. Telenczuk and V. T. Nguyen and T. Martinez and C. Boillet and K. Moutet and A. Picosson and A. Gasser and I. Djafar and A. Simon and Ádám Arany and J. Simm and Y. Moreau and O. Engkvist and H. Ceulemans and C. Marini and M. Galtier

Proceedings of the AAAI Conference on Artificial Intelligence, 2023.

Bibtex | Abstract | PDF | Link

@inproceedings {
   author = {Martijn Oldenhof and Gergely Ács and Balazs Pejo and A. Schuffenhauer and N. Holway and N. Sturm and A. Dieckmann and O. Fortmeier and E. Boniface and C. Mayer and A. Gohier and P. Schmidtke and R. Niwayama and D. Kopecky and L. Mervin and P. C. Rathi and L. Friedrich and A. Formanek and P. Antal and J. Rahaman and A. Zalewski and W. Heyndrickx and E. Oluoch and M. Stößel and M. Vančo and D. Endico and F. Gelus and T. de Boisfossé and A. Darbier and A. Nicollet and M. Blottière and M. Telenczuk and V. T. Nguyen and T. Martinez and C. Boillet and K. Moutet and A. Picosson and A. Gasser and I. Djafar and A. Simon and Ádám Arany and J. Simm and Y. Moreau and O. Engkvist and H. Ceulemans and C. Marini and M. Galtier},
   title = {Industry-Scale Orchestrated Federated Learning for Drug Discovery},
   booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
   year = {2023},
   howpublished = "\url{https://ojs.aaai.org/index.php/AAAI/article/view/26847}"
}

Keywords

Federated Learning, Drug Discovery, Privacy Preserving, Industry-scale

Abstract

To apply federated learning to drug discovery, we developed a novel platform in the context of the European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which comprised 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.
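
The gradient-aggregation step above can be illustrated with a toy pairwise-masking scheme in the spirit of secure aggregation. This is a didactic sketch with made-up dimensions and seeds, not the MELLODDY platform's actual protocol:

# Toy pairwise-masking secure aggregation: the server learns only the SUM of
# the clients' gradient vectors, never an individual one. Didactic sketch only.
import numpy as np

DIM, N = 4, 3
rng = np.random.default_rng(42)
grads = [rng.normal(size=DIM) for _ in range(N)]    # private local gradients

def masked_update(i, grad):
    # Each pair (i, j) derives a mask from a shared seed (in practice via a
    # key exchange); the lower-indexed party adds it, the other subtracts it.
    masked = grad.copy()
    for j in range(N):
        if j == i:
            continue
        pair_seed = 1000 * min(i, j) + max(i, j)    # stand-in for a shared key
        mask = np.random.default_rng(pair_seed).normal(size=DIM)
        masked += mask if i < j else -mask
    return masked

aggregate = sum(masked_update(i, g) for i, g in enumerate(grads))
assert np.allclose(aggregate, sum(grads))           # masks cancel pairwise
print(aggregate)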

MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information

W. Heyndrickx and L. Mervin and T. Morawietz and N. Sturm and L. Friedrich and A. Zalewski and A. Pentina and L. Humbeck and M. Oldenhof and R. Niwayama and P. Schmidtke and N. Fechner and J. Simm and A. Arany and N. Drizard and R. Jabal and A. Afanasyeva and R. Loeb and S. Verma and S. Harnqvist and M. Holmes and B. Pejo and M. Telenczuk and N. Holway and A. Dieckmann and N. Rieke and F. Zumsande and D.-A. Clevert and M. Krug and C. Luscombe and D. Green and P. Ertl and P. Antal and D. Marcus and N. D. Huu and H. Fuji and S. Pickett and G. Ács and E. Boniface and B. Beck and Y. Sun and A. Gohier and F. Rippmann and O. Engkvist and A. H. Göller and Y. Moreau and M. N. Galtier and A. Schuffenhauer and H. Ceulemans

Machine Learning in Bio-cheminformatics, 2023.

Bibtex | Abstract | PDF | Link

@article {
   author = {Wouter Heyndrickx and Lewis Mervin and Tobias Morawietz and Noé Sturm and Lukas Friedrich and Adam Zalewski and Anastasia Pentina and Lina Humbeck and Martijn Oldenhof and Ritsuya Niwayama and Peter Schmidtke and Nikolas Fechner and Jaak Simm and Adam Arany and Nicolas Drizard and Rama Jabal and Arina Afanasyeva and Regis Loeb and Shlok Verma and Simon Harnqvist and Matthew Holmes and Balazs Pejo and Maria Telenczuk and Nicholas Holway and Arne Dieckmann and Nicola Rieke and Friederike Zumsande and Djork-Arné Clevert and Michael Krug and Christopher Luscombe and Darren Green and Peter Ertl and Peter Antal and David Marcus and Nicolas Do Huu and Hideyoshi Fuji and Stephen Pickett and Gergely Ács and Eric Boniface and Bernd Beck and Yax Sun and Arnaud Gohier and Friedrich Rippmann and Ola Engkvist and Andreas H. Göller and Yves Moreau and Mathieu N. Galtier and Ansgar Schuffenhauer and Hugo Ceulemans},
   title = {MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information},
   journal = {Machine Learning in Bio-cheminformatics},
   year = {2023},
   howpublished = "\url{https://pubs.acs.org/doi/10.1021/acs.jcim.3c00799}"
}

Abstract

Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.

Privacy-Preserving Federated Singular Value Decomposition

B. Liu and B. Pejo and Q. Tang

Advanced Technologies for Data Privacy and Security, 2023.

Bibtex | Abstract | PDF | Link

@article {
   author = {Bowen Liu and Balazs Pejo and Qiang Tang},
   title = {Privacy-Preserving Federated Singular Value Decomposition},
   journal = {Advanced Technologies for Data Privacy and Security},
   year = {2023},
   howpublished = "\url{https://www.mdpi.com/2076-3417/13/13/7373}"
}

Keywords

singular value decomposition; federated learning; secure aggregation; differential privacy

Abstract

Singular value decomposition (SVD) is a fundamental technique widely used in various applications, such as recommendation systems and principal component analyses. In recent years, the need for privacy-preserving computations has been increasing constantly, which concerns SVD as well. Federated SVD has emerged as a promising approach that enables collaborative SVD computation without sharing raw data. However, existing federated approaches still need improvements regarding privacy guarantees and utility preservation. This paper moves a step further towards these directions: we propose two enhanced federated SVD schemes focusing on utility and privacy, respectively. Using a recommendation system use-case with real-world data, we demonstrate that our schemes outperform the state-of-the-art federated SVD solution. Our utility-enhanced scheme (utilizing secure aggregation) improves the final utility and the convergence speed by more than 2.5 times compared with the existing state-of-the-art approach. In contrast, our privacy-enhancing scheme (utilizing differential privacy) provides more robust privacy protection while improving the same aspect by more than 25%.
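
For intuition, here is a toy federated power iteration recovering the top right singular vector of the stacked data matrix, with a hook for Gaussian noise in the spirit of the differentially private variant. All parameters are illustrative; this is not the paper's exact scheme:

# Toy federated power iteration for the top right singular vector of
# [A_1; ...; A_k]. Illustrative only; sigma > 0 mimics a DP-style variant.
import numpy as np

rng = np.random.default_rng(1)
parts = [rng.normal(size=(50, 8)) for _ in range(4)]   # each party's private rows

def federated_top_singular_vector(parts, iters=100, sigma=0.0):
    v = rng.normal(size=parts[0].shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        # Each party computes A_i^T A_i v locally; only these products are
        # combined (e.g., via secure aggregation), never the raw rows.
        s = sum(A.T @ (A @ v) for A in parts)
        s += rng.normal(scale=sigma, size=len(s))      # optional DP-style noise
        v = s / np.linalg.norm(s)
    return v

v = federated_top_singular_vector(parts)
_, _, Vt = np.linalg.svd(np.vstack(parts))
print(abs(v @ Vt[0]))   # ~1.0: matches the centralized top singular vector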

Quality Inference in Federated Learning with Secure Aggregation

B. Pejo and G. Biczók

IEEE Transactions on Big Data, 2023.

Bibtex | Abstract | PDF | Link

@article {
   author = {Balazs Pejo and Gergely Biczók},
   title = {Quality Inference in Federated Learning with Secure Aggregation},
   journal = {IEEE Transactions on Big Data},
   year = {2023},
   howpublished = "\url{https://ieeexplore.ieee.org/document/10138056}"
}

Keywords

Quality Inference, Federated Learning, Secure Aggregation, Misbehavior Detection, Contribution Score

Abstract

Federated learning algorithms are developed both for efficiency reasons and to ensure the privacy and confidentiality of personal and business data, respectively. Despite no data being shared explicitly, recent studies showed that the mechanism could still leak sensitive information. Hence, secure aggregation is utilized in many real-world scenarios to prevent attribution to specific participants. In this paper, we focus on the quality (i.e., the ratio of correct labels) of individual training datasets and show that such quality information could be inferred and attributed to specific participants even when secure aggregation is applied. Specifically, through a series of image recognition experiments, we infer the relative quality ordering of participants. Moreover, we apply the inferred quality information to stabilize training performance, measure the individual contribution of participants, and detect misbehavior.
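
The idea admits a toy simulation: secure aggregation hides individual updates, but when the per-round participant set varies, per-round scores can be attributed back to the participants of each round. The cohort size, noise level, and scoring rule below are made up and do not reproduce the paper's procedure:

# Toy quality inference under secure aggregation: score each round by its
# (aggregate-only) gain, attribute the score to that round's participants.
import numpy as np

rng = np.random.default_rng(7)
N, ROUNDS = 10, 500
quality = rng.uniform(0, 1, N)          # hidden per-client label quality

scores, counts = np.zeros(N), np.zeros(N)
for _ in range(ROUNDS):
    part = rng.choice(N, size=5, replace=False)                 # round cohort
    round_gain = quality[part].mean() + rng.normal(scale=0.1)   # aggregate only
    scores[part] += round_gain
    counts[part] += 1

inferred = scores / counts              # average gain when present
r_true = np.argsort(np.argsort(quality))
r_inf = np.argsort(np.argsort(inferred))
print("rank correlation:", np.corrcoef(r_true, r_inf)[0, 1])    # close to 1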

SQLi Detection with ML: A Data-Source Perspective

B. Pejo and N. Kapui

Proceedings of the 20th International Conference on Security and Cryptography, 2023.

Bibtex | Abstract | PDF

@inproceedings {
   author = {Balazs Pejo and Nikolett Kapui},
   title = {SQLi Detection with ML: A Data-Source Perspective},
   booktitle = {Proceedings of the 20th International Conference on Security and Cryptography},
   year = {2023}
}

Abstract

Almost 50 years after the invention of SQL, injection attacks are still top-tier vulnerabilities of today’s ICT systems. Consequently, SQLi detection is still an active area of research, where the most recent works incorporate machine learning techniques into the proposed solutions. In this work, we highlight the shortcomings of the previous ML-based results focusing on four aspects: the evaluation methods, the optimization of the model parameters, the distribution of utilized datasets, and the feature selection. Since no single work explored all of these aspects satisfactorily, we fill this gap and provide an in-depth and comprehensive empirical analysis. Moreover, we cross-validate the trained models by using data from other distributions. This aspect of ML models (trained for SQLi detection) was never studied. Yet, the sensitivity of the model’s performance to this is crucial for any real-life deployment. Finally, we validate our findings on a real-world industrial SQLi dataset.
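
The cross-distribution concern can be illustrated with a minimal character-n-gram classifier trained on one "data source" and scored on another. The queries below are invented toy examples, not the datasets used in the paper:

# Toy cross-source check for an ML-based SQLi detector:
# train on source A, evaluate on source B (different style/casing).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_q = ["SELECT name FROM users WHERE id = 4",
           "SELECT * FROM items WHERE price < 10",
           "' OR '1'='1' --",
           "1; DROP TABLE users --",
           "admin' --",
           "SELECT email FROM customers WHERE city = 'Oslo'"]
train_y = [0, 0, 1, 1, 1, 0]

test_q = ["select id from orders where total > 100",
          '" OR ""="',
          "UNION SELECT password FROM accounts",
          "select street from addresses where zip = '1111'"]
test_y = [0, 1, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(train_q, train_y)
print("cross-source accuracy:", clf.score(test_q, test_y))

With such tiny toy data the printed number itself is meaningless; the point is the train-on-A, test-on-B pattern that the paper argues any real-life deployment must evaluate.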

2022

Collaborative Drug Discovery: Inference-level Privacy Perspective

B. Pejo and M. Remeli and Á. Arany and M. Galtier and G. Ács

Transactions on Data Privacy (TDP), vol. 15, 2022.

Bibtex | Abstract | PDF | Link

@article {
   author = {Balazs Pejo and Mina Remeli and Ádám Arany and Mathieu Galtier and Gergely Ács},
   title = {Collaborative Drug Discovery: Inference-level Privacy Perspective},
   journal = {Transactions on Data Privacy (TDP)},
   volume = {15},
   year = {2022},
   howpublished = "\url{http://www.tdp.cat/issues21/abs.a449a21.php}"
}

Abstract

The pharmaceutical industry can better leverage its data assets to virtualize drug discovery through a collaborative machine learning platform. On the other hand, there are non-negligible risks stemming from the unintended leakage of participants' training data; hence, it is essential for such a platform to be secure and privacy-preserving. This paper describes a privacy risk assessment for collaborative modeling in the preclinical phase of drug discovery to accelerate the selection of promising drug candidates. After a short taxonomy of state-of-the-art inference attacks, we adopt and customize several to the underlying scenario. Finally, we describe and experiment with a handful of relevant privacy protection techniques to mitigate such attacks.
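
A minimal instance of the inference-attack family discussed above is a loss-threshold membership inference test; the model, data, and threshold below are illustrative assumptions, not the paper's customized attacks:

# Toy loss-threshold membership inference: records with unusually low loss
# under the trained model are guessed to be training members.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
X_in, y_in = X[:200], y[:200]            # members (training set)
X_out, y_out = X[200:], y[200:]          # non-members

model = LogisticRegression(max_iter=1000).fit(X_in, y_in)

def per_sample_loss(model, X, y):
    p = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, None))

loss_in = per_sample_loss(model, X_in, y_in)
loss_out = per_sample_loss(model, X_out, y_out)
tau = np.median(np.concatenate([loss_in, loss_out]))   # attack threshold
acc = ((loss_in < tau).sum() + (loss_out >= tau).sum()) / 400
print(f"membership inference accuracy: {acc:.2f}")     # > 0.5 signals leakage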

Games in the Time of COVID-19: Promoting Mechanism Design for Pandemic Response

B. Pejo and G. Biczók

ACM Transactions on Spatial Algorithms and Systems (TSAS), 2022.

Bibtex | Link

@article {
   author = {Balazs Pejo and Gergely Biczók},
   title = {Games in the Time of COVID-19: Promoting Mechanism Design for Pandemic Response},
   journal = {ACM Transactions on Spatial Algorithms and Systems (TSAS)},
   year = {2022},
   howpublished = "\url{https://dl.acm.org/doi/abs/10.1145/3503155}"
}

Guide to Differential Privacy Modifications

B. Pejo and D. Desfontaines

Springer International Publishing (SpringerBriefs), 2022.

Bibtex | Link

@book {
   author = {Balazs Pejo and Damien Desfontaines},
   title = {Guide to Differential Privacy Modifications},
   publisher = {Springer International Publishing (SpringerBriefs)},
   year = {2022},
   howpublished = "\url{https://link.springer.com/book/10.1007/978-3-030-96398-9}"
}

Incentives for Individual Compliance with Pandemic Response Measures

B. Pejo and G. Biczók

Enabling Technologies for Social Distancing: Fundamentals, Concepts and Solutions (IET), 2022.

Bibtex | PDF | Link

@inproceedings {
   author = {Balazs Pejo and Gergely Biczók},
   title = {Incentives for Individual Compliance with Pandemic Response Measures},
   booktitle = {Enabling Technologies for Social Distancing: Fundamentals, Concepts and Solutions (IET)},
   year = {2022},
   howpublished = "\url{https://digital-library.theiet.org/content/books/te/pbte104e}"
}

Revenue Attribution on iOS 14 using Conversion Values in F2P Games

F. Ayala-Gómez and I. Horppu and E. Gülbenkoglu and V. Siivola and B. Pejo

AdKDD Workshop at 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (AdKDD), 2022.

Bibtex | Abstract | PDF | Link

@inproceedings {
   author = {Frederick Ayala-Gómez and Ismo Horppu and Erlin Gülbenkoglu and Vesa Siivola and Balazs Pejo},
   title = {Revenue Attribution on iOS 14 using Conversion Values in F2P Games},
   booktitle = {AdKDD Workshop at 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (AdKDD)},
   year = {2022},
   howpublished = "\url{https://www.adkdd.org/Papers/Show-me-the-Money%3A-Measuring-Marketing-Performance-in-F2P-Games-using-Apple's-App-Tracking-Transparency-Framework/2022}"
}

Keywords

conversion value, revenue attribution, mobile advertising, privacy

Abstract

Mobile app developers use paid advertising campaigns to acquire new users. Based on the campaigns' performance, marketing managers decide where and how much to spend. Apple's new privacy mechanisms profoundly impact how performance marketing is measured. Starting iOS 14.5, all apps must get system permission for tracking explicitly via the new App Tracking Transparency Framework. Instead of relying on individual identifiers, Apple proposed a new performance mechanism called conversion value, an integer set by the apps for each user. The conversion value follows a set of rules and a schema that defines the integers based on the user's in-app behavior. The developers can get the number of installs per conversion value for each campaign. For conversion values to be helpful, we need a method that translates them to revenue. This paper investigates the task of attributing revenue to advertising campaigns using their reported conversion values. Our contributions are to formalize the problem, find the theoretically optimal revenue attribution function for any conversion value schema and show empirical results on past data of a free-to-play mobile game using different conversion value schemas.
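
As a sketch of the attribution task: assuming squared error is the criterion, a natural attribution function is the historical mean revenue conditioned on the conversion value (the paper derives the optimal function for arbitrary schemas). The schema and numbers below are entirely synthetic:

# Toy revenue attribution from conversion values, on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
revenue = rng.exponential(scale=2.0, size=10_000)    # past per-user revenue

# A hypothetical schema: conversion value = integer revenue bucket, capped
# at 63 (conversion values are 6-bit integers, 0..63).
conv_value = np.minimum(revenue.astype(int), 63)

# Attribution function: conversion value -> expected revenue.
attribution = {v: revenue[conv_value == v].mean() for v in np.unique(conv_value)}

# A campaign reports only install counts per conversion value.
campaign_counts = {0: 120, 1: 45, 2: 20, 5: 3}
est = sum(cnt * attribution[v] for v, cnt in campaign_counts.items())
print(f"estimated campaign revenue: {est:.1f}")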

Why Fuzzy Message Detection Leads to Fuzzy Privacy Guarantees

I. Seres and B. Pejo and P. Burcsi

22nd Financial Cryptography and Data Security Conference (FC), 2022.

Bibtex | Abstract | Link

@conference {
   author = {Istvan Andras Seres and Balazs Pejo and Peter Burcsi},
   title = {Why Fuzzy Message Detection Leads to Fuzzy Privacy Guarantees},
   booktitle = {22nd Financial Cryptography and Data Security Conference (FC)},
   year = {2022},
   howpublished = "\url{https://fc22.ifca.ai/preproceedings/9.pdf}"
}

Keywords

Fuzzy Message Detection, unlinkability, anonymity, differential privacy, game theory

Abstract

Fuzzy Message Detection (FMD) is a recent cryptographic primitive invented by Beck et al. (CCS’21) where an untrusted server performs coarse message filtering for its clients in a recipient-anonymous way. In FMD — besides the true positive messages — the clients download from the server their cover messages determined by their false-positive detection rates. What is more, within FMD, the server cannot distinguish between genuine and cover traffic. In this paper, we formally analyze the privacy guarantees of FMD from three different angles. First, we analyze three privacy provisions offered by FMD: recipient unlinkability, relationship anonymity, and temporal detection ambiguity. Second, we perform a differential privacy analysis and coin a relaxed definition to capture the privacy guarantees FMD yields. Finally, we simulate FMD on real-world communication data. Our theoretical and empirical results assist FMD users in adequately selecting their false-positive detection rates for various applications with given privacy requirements.
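
A back-of-the-envelope simulation shows how the false-positive detection rate p = 2^-k trades bandwidth (cover traffic) for privacy; the message counts below are arbitrary:

# Toy FMD simulation: the server flags every message truly addressed to a
# client, plus each other message independently with probability p = 2^-k.
import numpy as np

rng = np.random.default_rng(5)
total_msgs = 100_000
true_msgs = 50                       # messages actually addressed to the client

for k in (2, 4, 8):
    p = 2.0 ** -k
    false_flags = rng.random(total_msgs - true_msgs) < p
    downloaded = true_msgs + false_flags.sum()
    print(f"k={k}: p={p:.4f}, downloads {downloaded} msgs "
          f"({true_msgs / downloaded:.1%} genuine)")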

2021

Measuring Contributions in Privacy-Preserving Federated Learning

G. Ács and G. Biczók and B. Pejo

ERCIM NEWS, vol. 126, 2021, pp. 35-36.

Bibtex | Abstract | Link

@article {
   author = {Gergely Ács and Gergely Biczók and Balazs Pejo},
   title = {Measuring Contributions in Privacy-Preserving Federated Learning},
   journal = {ERCIM NEWS},
   volume = {126},
   year = {2021},
   pages = {35-36},
   howpublished = "\url{https://ercim-news.ercim.eu/en126/special/measuring-contributions-in-privacy-preserving-federated-learning}"
}

Abstract

How vital is each participant’s contribution to a collaboratively trained machine learning model? This is a challenging question to answer, especially if the learning is carried out in a privacy-preserving manner with the aim of concealing individual actions.
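
One generic way to make "contribution" precise is the Shapley value. The sketch below estimates it by Monte Carlo sampling over permutations for a made-up coalition utility; it is a generic illustration, not the specific method of the article:

# Toy contribution scoring: Monte Carlo Shapley values over a hypothetical
# coalition utility with diminishing returns.
import numpy as np

rng = np.random.default_rng(11)
data_value = np.array([5.0, 3.0, 1.0, 8.0, 2.0])   # hidden per-client value

def utility(coalition):
    # More data helps, but sub-linearly.
    return np.sqrt(sum(data_value[i] for i in coalition))

def shapley(n, samples=20_000):
    phi = np.zeros(n)
    for _ in range(samples):
        coalition, prev = [], 0.0
        for i in rng.permutation(n):
            coalition.append(i)
            cur = utility(coalition)
            phi[i] += cur - prev                   # marginal contribution
            prev = cur
    return phi / samples

print(np.round(shapley(len(data_value)), 3))       # ordering mirrors data_value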

Property Inference Attacks on Convolutional Neural Networks: Influence and Implications of Target Model's Complexity

M. Parisot and B. Pejo and D. Spagnuelo

18th International Conference on Security and Cryptography (SECRYPT), 2021.

Bibtex | Link

@conference {
   author = {Mathias Parisot and Balazs Pejo and Dayana Spagnuelo},
   title = {Property Inference Attacks on Convolutional Neural Networks: Influence and Implications of Target Model's Complexity},
   booktitle = {18th International Conference on Security and Cryptography (SECRYPT)},
   year = {2021},
   howpublished = "\url{https://www.scitepress.org/Link.aspx?doi=10.5220/0010555607150721}"
}

Abstract

2020

Corona Games: Masks, Social Distancing and Mechanism Design

B. Pejo and G. Biczók

Proc. of ACM SIGSPATIAL Workshop on COVID, ACM, 2020.

Bibtex | Abstract | PDF

@inproceedings {
   author = {Balazs Pejo and Gergely Biczók},
   title = {Corona Games: Masks, Social Distancing and Mechanism Design},
   booktitle = {Proc. of ACM SIGSPATIAL Workshop on COVID},
   publisher = {ACM},
   year = {2020}
}

Abstract

Pandemic response is a complex affair. Most governments employ a set of quasi-standard measures to fight COVID-19 including wearing masks, social distancing, virus testing and contact tracing. We argue that some non-trivial factors behind the varying effectiveness of these measures are selfish decision-making and the differing national implementations of the response mechanism. In this paper, through simple games, we show the effect of individual incentives on the decisions made with respect to wearing masks and social distancing, and how these may result in a sub-optimal outcome. We also demonstrate the responsibility of national authorities in designing these games properly regarding the chosen policies and their influence on the preferred outcome. We promote a mechanism design approach: it is in the best interest of every government to carefully balance social good and response costs when implementing their respective pandemic response mechanism.
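
The flavor of these games can be reproduced in a few lines: a two-player mask game in which a fine f for not masking is the mechanism-design lever. All payoff numbers below are hypothetical:

# Toy "mask game": two players choose Mask / NoMask; a fine f for not
# masking shifts the equilibrium. Hypothetical payoffs for illustration.
from itertools import product

ACTIONS = ("Mask", "NoMask")

def payoffs(f):
    # Masking costs 1; infection cost 4 is borne when the OTHER is unmasked.
    return {("Mask",   "Mask"):   (-1, -1),
            ("Mask",   "NoMask"): (-5, -f),
            ("NoMask", "Mask"):   (-f, -5),
            ("NoMask", "NoMask"): (-4 - f, -4 - f)}

def pure_nash(u):
    eqs = []
    for a, b in product(ACTIONS, repeat=2):
        if (u[(a, b)][0] >= max(u[(x, b)][0] for x in ACTIONS)
                and u[(a, b)][1] >= max(u[(a, y)][1] for y in ACTIONS)):
            eqs.append((a, b))
    return eqs

for f in (0, 2):
    print(f"fine={f}:", pure_nash(payoffs(f)))

With f=0 the unique equilibrium is (NoMask, NoMask) even though mutual masking is better for both players; a modest fine flips the equilibrium to (Mask, Mask) — the mechanism-design point of the paper in miniature.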

SoK: Differential Privacies

D. Desfontaines and B. Pejo

Proceedings on Privacy Enhancing Technologies, 2020, pp. 288-313.

Bibtex | Abstract | Link

@inproceedings {
   author = {Damien Desfontaines and Balazs Pejo},
   title = {SoK: Differential Privacies},
   booktitle = {Proceedings on Privacy Enhancing Technologies},
   year = {2020},
   pages = {288-313},
   howpublished = "\url{https://arxiv.org/abs/1906.01337}"
}

Abstract

Shortly after it was first introduced in 2006, differential privacy became the flagship data privacy definition. Since then, numerous variants and extensions were proposed to adapt it to different scenarios and attacker models. In this work, we propose a systematic taxonomy of these variants and extensions. We list all data privacy definitions based on differential privacy, and partition them into seven categories, depending on which aspect of the original definition is modified. These categories act like dimensions: variants from the same category cannot be combined, but variants from different categories can be combined to form new definitions. We also establish a partial ordering of relative strength between these notions by summarizing existing results. Furthermore, we list which of these definitions satisfy some desirable properties, like composition, post-processing, and convexity by either providing a novel proof or collecting existing ones.
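
For reference, the original guarantee that all these variants and extensions modify (Dwork et al., 2006): a randomized mechanism \(\mathcal{M}\) is \(\varepsilon\)-differentially private if, for every pair of neighboring datasets \(D, D'\) (differing in a single record) and every set of outputs \(S\),

\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S].
\]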

2019

Together or Alone: The Price of Privacy in Collaborative Learning

B. Pejo and Q. Tang and G. Biczók

Proceedings on Privacy Enhancing Technologies (PETS 2019), De Gruyter, 2019.

Bibtex | Abstract

@inproceedings {
   author = {Balazs Pejo and Q. Tang and Gergely Biczók},
   title = {Together or Alone: The Price of Privacy in Collaborative Learning},
   booktitle = {Proceedings on Privacy Enhancing Technologies (PETS 2019)},
   publisher = {De Gruyter},
   year = {2019}
}

Abstract

Machine learning algorithms have reached mainstream status and are widely deployed in many applications. The accuracy of such algorithms depends significantly on the size of the underlying training dataset; in reality a small or medium sized organization often does not have the necessary data to train a reasonably accurate model. For such organizations, a realistic solution is to train their machine learning models based on their joint dataset (which is a union of the individual ones). Unfortunately, privacy concerns prevent them from straightforwardly doing so. While a number of privacy-preserving solutions exist for collaborating organizations to securely aggregate the parameters in the process of training the models, we are not aware of any work that provides a rational framework for the participants to precisely balance the privacy loss and accuracy gain in their collaboration. In this paper, by focusing on a two-player setting, we model the collaborative training process as a two-player game where each player aims to achieve higher accuracy while preserving the privacy of its own dataset. We introduce the notion of Price of Privacy, a novel approach for measuring the impact of privacy protection on the accuracy in the proposed framework. Furthermore, we develop a game-theoretical model for different player types, and then either find or prove the existence of a Nash Equilibrium with regard to the strength of privacy protection for each player. Using recommendation systems as our main use case, we demonstrate how two players can make practical use of the proposed theoretical framework, including setting up the parameters and approximating the non-trivial Nash Equilibrium.
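
The Price of Privacy notion can be summarized schematically. The formula below is an illustrative reconstruction, not necessarily the paper's exact normalization: writing \(\alpha(p_1, p_2)\) for the model accuracy when the two players apply privacy protection of strengths \(p_1\) and \(p_2\) (with 0 meaning no protection),

\[
\mathrm{PoP}(p_1, p_2) \;=\; 1 - \frac{\alpha(p_1, p_2)}{\alpha(0, 0)},
\]

so a value of 0 means privacy protection costs nothing in accuracy, while values near 1 mean it wipes out the benefit of the collaboration.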

2018

POSTER: The Price of Privacy in Collaborative Learning

B. Pejo and Q. Tang and G. Biczók

CCS 2018 Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, ACM, 2018.

Bibtex | Abstract

@inproceedings {
   author = {Balazs Pejo and Q. Tang and Gergely Biczók},
   title = {POSTER: The Price of Privacy in Collaborative Learning},
   booktitle = {CCS 2018 Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security},
   publisher = {ACM},
   year = {2018}
}

Abstract

Machine learning algorithms have reached mainstream status and are widely deployed in many applications. The accuracy of such algorithms depends significantly on the size of the underlying training dataset; in reality a small or medium sized organization often does not have enough data to train a reasonably accurate model. For such organizations, a realistic solution is to train machine learning models based on a joint dataset (which is a union of the individual ones). Unfortunately, privacy concerns prevent them from straightforwardly doing so. While a number of privacy-preserving solutions exist for collaborating organizations to securely aggregate the parameters in the process of training the models, we are not aware of any work that provides a rational framework for the participants to precisely balance the privacy loss and accuracy gain in their collaboration. In this paper, we model the collaborative training process as a two-player game where each player aims to achieve higher accuracy while preserving the privacy of its own dataset. We introduce the notion of Price of Privacy, a novel approach for measuring the impact of privacy protection on the accuracy in the proposed framework. Furthermore, we develop a game-theoretical model for different player types, and then either find or prove the existence of a Nash Equilibrium with regard to the strength of privacy protection for each player.