Weighted Ensemble Distillation in Federated Learning with Non-IID Data Leveraging auxiliary data to efficiently aggregate models in heterogeneous settings Master’s thesis in Complex Adaptive Systems OSCAR ERIKSSON DEPARTMENT OF MATHEMATICAL SCIENCES CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2022 www.chalmers.se www.chalmers.se Master’s thesis 2022 Weighted Ensemble Distillation in Federated Learning with Non-IID Data Leveraging auxiliary data to efficiently aggregate models in heterogeneous settings OSCAR ERIKSSON Department of Mathematical Sciences Division of Applied Mathematics and Statistics Chalmers University of Technology Gothenburg, Sweden 2022 Weighted Ensemble Distillation in Federated Learning with Non-IID Data Leveraging auxiliary data to efficiently aggregate models in heterogeneous settings OSCAR ERIKSSON © OSCAR ERIKSSON, 2022. Supervisors at company: Mattias Åkesson, Data Scientist, Scaleout Systems Fredrik Wrede, Data Scientist, Scaleout Systems Supervisor and examiner: Tobias Gebäck, Associate Professor, Department of Mathematical Sciences Master’s Thesis 2022 Department of Mathematical Sciences Division of Applied Mathematics and Statistics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2022 iv Weighted Ensemble Distillation in Federated Learning with Non-IID Data Leveraging auxiliary data to efficiently aggregate models in heterogeneous settings OSCAR ERIKSSON Department of Mathematical Sciences Chalmers University of Technology Abstract Federated distillation (FD) is a novel algorithmic idea for federated learning (FL) that allows clients to use heterogeneous model architectures. This is achieved by distilling aggregated local model predictions on an unlabeled auxiliary dataset into a global model. While standard FL algorithms are often based on averaging local parameter updates over multiple communication rounds, FD can be performed with only one communication round, giving favorable communication properties when local models are large and the auxiliary dataset is small. However, both FD and standard FL algorithms experience a significant performance loss when training data is not independently and identically distributed (non-IID) over the clients. This thesis investigates weighting schemes to improve the performance with FD in non-IID scenarios. In particular, the sample-wise weighting scheme FedED- w2 is proposed, where client predictions on auxiliary data are weighted based on the similarity with local data. Data similarity is measured with the reconstruction loss on auxiliary samples when passed through an autoencoder (AE) model that is trained on local data. Image classification experiments with convolutional neural networks performed in this study show that FedED-w2 exceeds the test accuracy of FL baseline algorithms with up to 15 % on the MNIST and EMNIST datasets for varying degrees of non-IID data over 10 clients. The performance of FedED-w2 is lower than FL baselines on the CIFAR-10 dataset, where the experiments display up to 5 % lower test accuracy. Keywords: federated learning, federated distillation, knowledge distillation, weighted ensembles, artificial intelligence, privacy, image classification. v Acknowledgements I would like to thank everyone at Scaleout Systems for giving me the opportunity to explore federated learning and its challenges. 
A special thanks to Mattias Åkesson and Fredrik Wrede for your feedback and help to initiate the project. I would also like to thank AI Sweden for providing computational hardware and a welcoming community for us master thesis students. Finally, I want to thank Tobias Gebäck, my supervisor and examiner at Chalmers University of Technology, for the feedback and guidance. Oscar Eriksson, Enköping, May 2022 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order. AE Autoencoder AI Artificial Intelligence ANN Artificial Neural Network CE Cross Entropy CNN Convolutional Neural Network FD Federated Distillation FL Federated Learning GDPR General Data Protection Regulation IID Independently and Identically Distributed IoT Internet of Things ML Machine Learning MSE Mean Squared Error ReLU Rectified Linear Unit ix Nomenclature Below is the nomenclature of indices, sets, functions, parameters and variables that have been used throughout this thesis. Indices k Index for client i,j Indices for data samples c Index for class l Index for image pixel t Index for communication round Sets D Set of traning data Dk Set of traning data for client k D0 Set of auxiliary data Ik Set of indices for data samples at client k K Set of clients B Set of batches from local data Functions fk Function representing the local model at client k fS Function representing the student model Lk Local objective function for client k L Global objective function LS Student loss function LKL Kullback-Leibler divergence loss function xi LMSE Mean squared error loss function LREC Reconstruction loss function σ Activation function Parameters N Number of data samples N Number of data samples at client k N c k Number of data samples at client k in class c K Number of clients C Number of classes E Number of epochs B Batch size η Learning rate α Dirichlet concentration parameter µ Regularization parameter T Number of communication rounds γ Fraction of participating clients W Input width of feature map O Output width of feature map F Kernel size P Padding size S Stride size τ Temperature parameter Variables θ0 Initial global model weights θt+1 Global weights at round t+ 1 θkt+1 Local weights for client k at round t+ 1 θS Student model weights x0 Batch of auxiliary data zk Model output from client k wk Ensemble weight for client k wjk Ensemble weight for client k at auxiliary sample with index j xii x(i) Data sample i y(i) True label for data sample i zT Aggregated predictions from clients (teacher predictions) zS Student predictions zk Predictions from client k ȳ One-hot encoded true label b Bias term in neural network model q,v Temperature weighted class probabilities b Communication cost h Latent variable xiii xiv Contents List of Acronyms ix Nomenclature xi List of Figures xvii List of Tables xxi 1 Introduction 1 1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Purpose and research questions . . . . . . . . . . . . . . . . . . . . . 3 1.3 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Theory and related work 5 2.1 Supervised image classification . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Fully connected neural network . . . . . . . . . . . . . . . . . 
6 2.2.2 2D convolutional layer . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3 Pooling layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Federated learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 Challenges in federated learning . . . . . . . . . . . . . . . . . 10 2.3.2 Federated averaging . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Federated ensemble distillation . . . . . . . . . . . . . . . . . . . . . 12 2.4.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.2 Distillation vs. averaging . . . . . . . . . . . . . . . . . . . . . 14 2.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3 Method 17 3.1 Weighting schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Generation of non-IID data . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4 Results 25 4.1 MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 xv Contents 4.2 EMNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.3 CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.4 Communication costs . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5 Discussion 31 5.1 Performance analysis of weighting schemes . . . . . . . . . . . . . . . 31 5.2 Using federated distillation in practice . . . . . . . . . . . . . . . . . 32 6 Conclusion 33 Bibliography 35 A Model architectures I A.1 CNN architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Autoencoder architectures . . . . . . . . . . . . . . . . . . . . . . . . II B Training curves V B.1 MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V B.2 EMNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X B.3 CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XIV xvi List of Figures 1.1 A schematic illustration of FD compared to the standard FedAvg algorithm. FedAvg exchange model updates based on local training on private data, whereas FD transmits local model predictions on unlabeled auxiliary data. . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 Schematic view of a neuron with n input variables. . . . . . . . . . . 7 2.2 Schematic view of an ANN with 3 hidden layers. . . . . . . . . . . . . 7 2.3 Output values from a 2D-convolution of input I and kernel K with W = 5, F = 3, P = 0 and S = 1. . . . . . . . . . . . . . . . . . . . . . 8 2.4 Output values from a max pooling operation with W = 4, F = 2 and S = 2 and P = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.1 The AE network architecture. . . . . . . . . . . . . . . . . . . . . . . 18 3.2 An example showing how data similarity can be measured with the reconstruction loss from an AE, using the MNIST dataset. The aux- iliary data is here taken as a subset of the complete dataset. . . . . . 19 3.3 Sample images from the MNIST dataset, corresponding to handwrit- ten digits 0-9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 Sample images from the EMNIST letters dataset, corresponding to handwritten letters a-z and A-Z. . . . . . . . . . . . . . . . . . . . . . 21 3.5 Sample images from the CIFAR-10 dataset, corresponding to objects in the categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck . . . . . . . . 
. . . . . . . . . . . . . . . . . . . 21 3.6 Illustration of the Dirichlet splitting procedure used in this project. The dot size is relative to the total number of samples over the clients. The data splits are more non-IID for lower values of α. . . . . . . . . 22 4.1 Comparison of test accuracy on MNIST for the implemented weight- ing schemes with varying data size |D0|. The student model is CNN3 and loss function is MSE. . . . . . . . . . . . . . . . . . . . . . . . . 26 4.2 Comparison of test accuracy on MNIST for different student models with α = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.3 Comparison of test accuracy on EMNIST for the implemented weight- ing schemes with varying data size |D0|. The student model is CNN3 and loss function is MSE. . . . . . . . . . . . . . . . . . . . . . . . . 28 4.4 Comparison of test accuracy on EMNIST for different student models with α = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 xvii List of Figures 4.5 Comparison of test accuracy for the implemented weighting schemes with varying data size |D0|. The student model is Resnet-18 and loss function is MSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 B.1 MNIST: FedAvg local training accuracies with the CNN1 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V B.2 MNIST: FedProx local training accuracies with the CNN1 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . VI B.3 MNIST: FedAvg global test accuracy with the CNN1 model for 5 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI B.4 MNIST: FedProx global test accuracy with the CNN1 model for 5 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI B.5 MNIST: FedED local training accuracies with the CNN1 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VII B.6 MNIST: FedED-w0 student training accuracy with the CNN3 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . . . . . . VII B.7 MNIST: FedED-w1 student training accuracy with the CNN3 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . . . . . . VIII B.8 MNIST: FedED-w2 student training accuracy with the CNN3 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . . . . . . VIII B.9 MNIST: FedED-w0 student training accuracy with the CNN3 model and CE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . . . . . . . IX B.11 EMNIST: FedProx local training accuracies with the CNN1 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . X B.10 EMNIST: FedAvg local training accuracies with the CNN1 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . X B.12 EMNIST: FedAvg global test accuracy with the CNN1 model for 5 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XI B.13 EMNIST: FedProx global test accuracy with the CNN1 model for 5 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XI B.14 EMNIST: FedED local training accuracies with the CNN1 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XI B.15 EMNIST: FedED-w0 student training accuracy with the CNN3 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . . . . . . XII B.16 EMNIST: FedED-w1 student training accuracy with the CNN3 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . 
. . . . . . . XII B.17 EMNIST: FedED-w2 student training accuracy with the CNN3 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . . . . . . XIII B.18 EMNIST: FedED-w0 student training accuracy with the CNN3 model and CE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . . . . . . . XIII B.20 CIFAR-10: FedProx local training accuracies with the Resnet-18 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . XIV B.19 CIFAR-10: FedAvg local training accuracies with the Resnet-18 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . XIV B.21 CIFAR-10: FedAvg global test accuracy with the Resnet-18 model for 5 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . XV xviii List of Figures B.22 CIFAR-10: FedProx global test accuracy with the Resnet-18 model for 5 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . XV B.23 CIFAR-10: FedED local training accuracies with the Resnet-18 model for 3 different seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . XV B.24 CIFAR-10: FedED-w0 student training accuracy with the Resnet- 18 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . XVI B.25 CIFAR-10: FedED-w1 student training accuracy with the Resnet- 18 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . XVI B.26 CIFAR-10: FedED-w2 student training accuracy with the Resnet- 18 model and MSE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . XVII B.27 CIFAR-10: FedED-w0 student training accuracy with the Resnet- 18 model and CE loss for 1 seed. . . . . . . . . . . . . . . . . . . . . XVII xix List of Figures xx List of Tables 3.1 Sizes of used models, assuming a float is 24 bytes. . . . . . . . . . . . 22 4.1 Mean test accuracy and standard deviation on MNIST for different α with |D0| = 30000 and the CNN3 student model. . . . . . . . . . . 26 4.2 Mean test accuracy and standard deviation on EMNIST for different α with |D0| = 60000 and the CNN3 student model. . . . . . . . . . . 27 4.3 Test accuracy for different α with |D0| = 25000 and the Resnet-18 student model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.4 Communication costs for the experiments conducted, measured in MB transferred between clients and central server and assuming a float is 24 bytes. The auxiliary data size |D0| is here set to the largest size used for each dataset. . . . . . . . . . . . . . . . . . . . . . . . . 30 A.1 Architecture used for CNN1, CNN2 and CNN3 in the MNIST experi- ments. CNN1: NCh = 2, 1042 trainable parameters. CNN2: NCh = 8, 4138 trainable parameters. CNN3: NCh = 16, 8266 trainable param- eters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Architecture used for CNN1 and CNN2 in the EMNIST experiments. CNN1: NCh = 2, 2626 trainable parameters. CNN2: NCh = 16, 20826 trainable parameters. . . . . . . . . . . . . . . . . . . . . . . . I A.3 Architecture used for CNN3 in the EMNIST experiments. 23834 trainable parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . II A.4 Autoencoder architecture used for MNIST and EMNIST. . . . . . . . II A.5 Autoencoder architecture used for CIFAR-10. . . . . . . . . . . . . . III xxi List of Tables xxii 1 Introduction In the last decade, significant progress has been made in the field of machine learning and AI. 
Increased computing capabilities, improved algorithms and larger datasets have made it possible to perform tasks such as image classification, object detection and speech recognition with models that learn from data. Models used for such tasks often thrive on big datasets, which might need to be built by collecting data at different locations and user devices. In the standard, centralized case of machine learning, this distributed data is sent to a central server where the machine learning model is trained. However, with increasing privacy concerns among the public, it has become undesirable and in some cases even illicit to move users' private data to a central location. A recent study shows that 9 out of 10 Americans are concerned about their online privacy, which clearly demonstrates the hesitation toward sharing private information [1]. As the integrity of private data becomes a more important question, new regulations have been introduced to protect users' privacy. An example of such a privacy act is GDPR [2], which regulates how companies and organizations are allowed to manage user data. A technique proposed to overcome this privacy challenge in machine learning is federated learning (FL) [3]. Since the data cannot be stored centrally, FL moves the model training to the client where the private data resides. A global model is then formed by combining information from locally trained models, without accessing the private data. Aggregating the local models into a global model is a crucial step for the end performance, and one of the many open challenges in FL [4]. The standard algorithm for this is FedAvg [3], which iteratively averages the local model parameters into a global model. This is a simple approach to aggregating the local models, but it has some drawbacks. Previous work has shown that FedAvg converges to the centrally trained model when sampled client data is independently and identically distributed (IID) over the clients [3, 5]. Unfortunately, the IID assumption does not always hold in practice, and when client data is non-IID, the performance of FedAvg decreases drastically [6]. Furthermore, FedAvg requires all clients to train identical model architectures, making it less flexible under system heterogeneity. The main constraint in FL is that client data needs to be kept private. However, in many real-world applications, publicly available auxiliary data with distributions similar to the client data does exist. For instance, many FL solutions for computer vision and natural language processing could obtain auxiliary data from the public databases ImageNet [7] or WikiText [8]. These databases contain millions of data samples but might lack the necessary label information for the specific FL problem. Figure 1.1: A schematic illustration of FD compared to the standard FedAvg algorithm. FedAvg exchanges model updates based on local training on private data, whereas FD transmits local model predictions on unlabeled auxiliary data. In recent years, a novel algorithmic method called federated ensemble distillation, or federated distillation (FD), has been proposed for FL problems where such auxiliary data is available. In contrast to the parameter averaging algorithm FedAvg and its recently modified versions [9–12], FD allows clients to train heterogeneous model architectures, since clients only share predictions on the auxiliary data, which are used to train a separate model on the central server (Figure 1.1), as sketched below.
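To make the contrast in Figure 1.1 concrete, the following minimal Python sketch shows what a single client would transmit under each scheme: parameter arrays in FedAvg-style training versus class probabilities on the shared auxiliary data in FD. The client object, its weights() and predict_proba() methods, and the aggregation function are illustrative assumptions, not part of the thesis implementation:

import numpy as np

def fedavg_payload(client):
    # FedAvg-style: the client uploads its full parameter vector
    # (size grows with the model, and the upload repeats every round).
    return np.concatenate([w.ravel() for w in client.weights()])

def fd_payload(client, x_aux):
    # FD-style: the client uploads class probabilities on the shared,
    # unlabeled auxiliary batch (size grows with |D0| * C, one round).
    return client.predict_proba(x_aux)          # shape (|D0|, C)

def server_aggregate_fd(payloads):
    # Unweighted teacher predictions: average the client outputs.
    return np.mean(np.stack(payloads), axis=0)  # shape (|D0|, C)

The payload sizes hint at the communication trade-off discussed later in Section 2.4.2: FedAvg scales with the model size and the number of rounds, while FD scales with the auxiliary dataset size and the number of classes.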
This flexibility is beneficial for cases when clients are running heterogeneous systems, making it possible to choose a model that suits the clients specific system or computational capabilities. FD also has the potential to reduce the overall communication in FL, since communication scale with the size of the auxiliary dataset rather than the model size. Lastly, since it has been shown that local data can be partially reconstructed based on parameter or gradient updates [13], FD could be a more privacy-preserving alternative. 1.1 Problem formulation FD-based algorithms have potential advantages in regards to system heterogeneity, communication and privacy properties compared to algorithms based on parameter averaging. However, as with FedAvg, an FD-based algorithm is not immune to performance drops in the non-IID scenario. One way to approach this problem is 2 1. Introduction to use a weighting scheme so the impact of each local model is weighted differently towards the global model. The problem is the following: Consider a set of clients K, where each client k ∈ K holds a private dataset Dk, a shared auxiliary dataset D0 and a machine learning model fk(θk) with parameters θk. Each private dataset consist of feature-label pairs (x(i), y(i)) and the local parameters have been adjusted to minimize the local objective function Lk on this data. Let zk = fk(x0;θk) be the output from client k on a batch of samples x0 ∈ D0. To perform distillation, the aggregated predictions, zT = ∑ k∈K wkzk, (1.1) referred to as teacher predictions, are used to train a student model fS on the data pairs (x0, zT) with the goal of capturing the collective behavior of the local models. The weights wk allow to control the impact of each client and is typically chosen to 1/|K| when data is assumed IID. In the non-IID scenario, wk would need to be chosen differently due to the varying performance of local models. This leads to the main goal of this project, which is to investigate different weighting schemes wk to improve performance in non-IID scenarios with FD. 1.2 Purpose and research questions This study intends to implement a test framework to compare FD against standard FL algorithms based on parameter averaging in non-IID settings. Both FD and parameter averaging algorithms suffer significant performance loss when client data is non-IID, but it is not established whether one method outperforms the other in this scenario. The first research question for this project is therefore: How does the performance of FD compare to standard FL algorithms in non-IID settings? Using weighting schemes is a way to directly control the impact of each client, which motivates the investigation of suitable weighting schemes to improve FD in non-IID scenarios. The second research question is hence: What is a suitable weighting scheme to improve FD in non-IID settings? Implementing FD also introduces the problem of choosing the student model and having enough auxiliary data available. This leads to the final research question: How do the student model and quantity of auxiliary data affect the per- formance of FD? 1.3 Delimitations The primary focus of this study is to investigate the performance of FD compared to standard FL algorithms on different distributions of client data. It is therefore the performance relative to these chosen baselines that are of interest and not the 3 1. Introduction absolute performance. 
For this reason, local models have relatively low complexity for most experiments and hyperparameters have only been tuned to some extent. In the real case FL scenario, local models are trained in parallel at client devices. Since the purpose of this project is to evaluate the algorithmic performances, imple- mentations have been simplified to execute model training sequentially on a single device where the data is pseudo-distributed. This means that system heterogeneity based on different bandwidths and computational power among the clients is not considered. The algorithms covered in this study are only designed and tested for supervised image classification. It is therefore not certain that the result presented in this study would translate to other machine learning tasks. 1.4 Contributions This project contributes with extended knowledge on the limits of FD and its relative performance to standard FL algorithms in non-IID settings. Insights on how to improve FD in the presence of data heterogeneity is given by introducing a sample- wise weighting scheme that achieves good performance compared to FL baselines. 4 2 Theory and related work This chapter will explain the theory and current research related to the problem addressed in this thesis. The first section will cover the essentials of image clas- sification with supervised machine learning and what kind of models will be used in this project. This is then followed by an overview of FL where the main appli- cation areas and challenges are discussed. The algorithmic paradigm FD is then introduced, which sets the foundation for the algorithm that is investigated in this project. Lastly, a literature review is given on related FD studies. 2.1 Supervised image classification Machine learning (ML) is a branch of artificial intelligence (AI) that covers the use of data and algorithms to enable computers to perform certain tasks without being explicitly programmed. This project will consider the task of image classification, using supervised learning. In this case, data is available in the form (x(1), y(1)), (x(2), y(2)), . . . , (x(N), y(N)), (2.1) where y(i) is the label (or class) for image x(i). The goal is to learn a function f(x;θ) ≈ y with parameters θ that predicts a label y based on input image x as accurately as possible. This is typically done by tuning the parameters θ for a chosen function f(x;θ), using the optimization problem min θ L(θ) with L(θ) = 1 N N∑ i=1 l Ä f(x(i);θ), y(i) ä , (2.2) where l ( f(x(i);θ), y(i)) is the loss function evaluated on (x(i), y(i)). The loss function should be defined such that correct predictions by f(x(i);θ) results in a low value, and a high value for incorrect predictions. The predictions from f(x;θ) are usually given as an output vector z, where each element zc corresponds to the probability of sample x belonging to class c ∈ {1, . . . , C}. Similarly, the true labels y are represented with one-hot encoding ȳ, where all elements are zero except element y, which is one. This label representation is used in the cross-entropy (CE) loss 5 2. Theory and related work l Ä f(x(i);θ), y(i) ä = C∑ c=1 −ȳ(i) c log z(i) c , (2.3) which is the default choice of loss function for classification problems in supervised ML, and will be used throughout this thesis. 2.1.1 Gradient descent With a loss function defined, the next question is how to minimize the loss function in (2.2). In the ML context, the method chosen to minimize the loss function is called an optimizer. 
A common technique to find the minimum or maximum of a function is to find where the derivative of the function is zero. In the gradient descent optimizer, a minimum is searched for by taking steps in the negative direction of the derivative ∇L with respect to θ. This is done by calculating ∇L for a number of samples, and then updating the parameters according to θ := θ − η∇L where ∇L = Å ∂L ∂θ1 , . . . , ∇L ∂θ|θ| ã (2.4) Here, η is the chosen step size, known as the learning rate in ML. With large datasets it can be computationally infeasible to perform gradient descent on the complete dataset in one iteration. This can be solved by doing gradient descent on one batch of data at a time. This is referred to as batch gradient descent, or mini-batch gradient descent when the batch size is fairly small (roughly between 10 and 103). This is typically done by shuffling the data samples before splitting it into batches and proceeding with gradient descent. Many optimizers have been developed to further accelerate the training with gradient descent. A popular optimizer is Adam [14], which is one of many improved optimizers that introduce an adaptive learning rate. 2.2 Convolutional neural networks What remains to be defined in the optimization objective (2.2) is the model f(x;θ). A popular choice of model for image classification is a convolutional neural network (CNN), which will be explained in this section. 2.2.1 Fully connected neural network CNN’s are special types of artificial neural networks (ANN’s) that have been devel- oped in particular for data with spatial dimensions. The main building block in an ANN is the neuron, which applies a non-linear activation function σ(·) on the dot product between input x and parameters θ, plus a bias term b, as f(x;θ) = σ(x · θ + b). (2.5) 6 2. Theory and related work The neuron computation is illustrated in Figure 2.1. x1 x2 xn ŷ θ1θ1 θ2θ2 θnθn ... = σ (x · θ + b) Figure 2.1: Schematic view of a neuron with n input variables. A fully connected ANN can be formed by stacking neurons in multiple layers, where each neuron has its specific weights connected to each neuron from the previous layer. With more neurons and layers, the complexity of the network increases, which gives the model a higher ability to learn non-linear patterns. However, this is only possible if non-linear activation functions are used in the network. A popular choice of activation function for the intermediate layers is the rectified linear unit (ReLU), σ(x) = max(0, x). For the final layer, a softmax function σ(x)c = exc/ ∑C j=1 e xj for c = 1, . . . , C is the default choice for classification problems. Intermediate layers in an ANN are usually called hidden layers, as indicated in Figure 2.2 where an example of a fully connected ANN is presented. x1 x2 xn ... a (1) 1 a (1) 2 a (1) 3 a(1) m ... a (2) 1 a (2) 2 a (2) 3 a(2) m ... a (3) 1 a (3) 2 a (3) 3 a(3) m ... z1 z2 zC ... Input layer Hidden layers Output layer Figure 2.2: Schematic view of an ANN with 3 hidden layers. 2.2.2 2D convolutional layer When using a fully connected ANN with images as input data, each pixel in an image would have a weight connected to each neuron in the next layer. This can result in lots of parameters and lead to inefficient training of the network. This is one motivation for the introduction of CNN’s where the number of parameters are reduced by using a convolutional kernel operation. This operation can be done for various dimensions but this project only considers the 2-dimensional case. 
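Before the convolutional layer is described in more detail, the building blocks introduced so far (the neuron computation (2.5), the ReLU and softmax activations, and the gradient descent update (2.4)) can be collected in a short NumPy sketch. Layer shapes and the gradient computation are placeholders; this is an illustration, not the networks used in the experiments:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=-1, keepdims=True)

def dense(x, theta, b, activation):
    # The neuron computation (2.5) applied to a whole layer:
    # each column of theta holds the weights of one neuron.
    return activation(x @ theta + b)

def forward(x, params):
    # One hidden ReLU layer followed by a softmax output layer.
    h = dense(x, params["W1"], params["b1"], relu)
    return dense(h, params["W2"], params["b2"], softmax)

def sgd_step(params, grads, lr=1e-3):
    # Mini-batch gradient descent update (2.4): theta <- theta - eta * grad.
    return {k: params[k] - lr * grads[k] for k in params}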
The kernel holds a number of weights, determined by the kernel sizeK, and iteratively do 7 2. Theory and related work the neuron computation (2.5) on aK×K window of the input data. Since the kernel weights are shared over all input data, the number of parameters are significantly reduced. The complexity of a 2D convolutional layer can then be increased by having multiple kernels with separate weights. The output from a convolutional layer is usually called a feature map. The spatial width O of the feature map is determined by the width W of input data, kernel size F , padding P and stride S, according to O = (W − F + 2P )/S + 1. This formula assumes that all sizes spans equally in both dimensions. An example of a convolutional kernel computation is illustrated in Figure 2.3. 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 I ∗ 1 0 1 0 1 0 1 0 1 K = 1 4 3 4 1 1 2 4 3 3 1 2 3 4 1 1 3 3 1 1 3 3 1 1 0 I ∗K ×1 ×0 ×1 ×0 ×1 ×0 ×1 ×0 ×1 Figure 2.3: Output values from a 2D-convolution of input I and kernel K with W = 5, F = 3, P = 0 and S = 1. 2.2.3 Pooling layer In CNN’s, convolutional layers are usually followed by a pooling layer. A pooling layer is used to reduce the spatial dimension of the feature map. This means that the computational cost of training the network is reduced since fewer parameters are required in subsequent layers. A pooling layer also allows the following convolutional layer to extract features at a different resolution than previous layers. A commonly used pooling layer is max pooling, where the output is maximum values for patches of the preceding feature map, see Figure 2.4. Pooling layers do not contain any parameters and are therefore not trainable layers. 1 5 0 6 4 7 5 8 3 1 2 2 1 3 0 4 6 8 3 4 Max pooling Figure 2.4: Output values from a max pooling operation with W = 4, F = 2 and S = 2 and P = 0. 8 2. Theory and related work 2.3 Federated learning Due to the growing computational power on our devices, such as mobile phones, wearable technology and autonomous vehicles, it has become more attractive to move computations previously done on a central server to the edge device. In the case of training ML models, the rising public concern for privacy makes it undesirable to move client data to the central server, which also motivates the interest in moving the model training to the devices. These two motives have together led to a growing interest in a decentralized and collaborative approach to machine learning, called federated learning. The term federated learning was coined in 2016 by McMahan et al. [3] and given a broader definition by Kairouz et al. [4] as: Federated learning is a machine learning setting where multiple en- tities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider. Each client’s raw data is stored locally and not exchanged or transferred; instead, fo- cused updates intended for immediate aggregation are used to achieve the learning objective. FL was initially introduced with a focus on applications for mobile phones and edge devices, referred to as the cross-device case. Since then, interest has also been shown for applications with a smaller number of clients involved, for example, a collaboration between multiple companies or organizations, referred to as the cross- silo case. Both of these cases have many possible applications, some of which are discussed below. Mobile phones. 
Machine learning models can be trained for next-word prediction, face detection and voice recognition by utilizing data from several mobile phones [15– 17]. Due to privacy concerns, users may not want to share their data. FL has the potential to transform the development of these personalized and assistive features on mobile phones to be more privacy-preserving, which has already been applied in consumer digital products. One example of this is Google’s keyboard for mobile phones (Gboard mobile keyboard) [15, 16], which uses FL to build models for next word suggestions. The same technique is also integrated with Android Messages to give better and more personalized smart features, such as Smart Reply and other assistive suggestions [18]. Apple has also introduced cross-device FL in iOS 13 when training models for personalized suggestions in the QuickType keyboard and voice classification in the application “Hey Siri” [17]. Organizations. A client, or device, can also be an organization or institution in the FL context. This case has gained great interest in healthcare, since hospitals have big collections of patient data that could aid ML models for predictive healthcare. Hospitals have strict privacy rules that require patient data to remain inside the hospital. FL has the potential to solve this problem for predictive healthcare, as it would allow hospitals to collaboratively train ML models, without exposing sensitive patient data. This is an ongoing field of research in FL, and applications have been proposed for medical data segmentation [19, 20] to predicting clinical outcomes in patients with COVID-19 [21]. 9 2. Theory and related work Internet of things. Autonomous vehicles, smart homes and wearable devices are all examples of modern IoT systems. These systems can have multiple sensors that allow them to adjust their behavior based on incoming data. This adjustment could be an updated model of traffic or construction for a fleet of autonomous vehicles, which is needed for the vehicles to continue operating safely. Due to data privacy and limited connectivity of devices, it might be a problem to safely aggregate models in these scenarios. FL could be used in this scenario to train models that adjusts to system changes and keeps user data private [22]. Another possible advantage of using FL for autonomous vehicles is that the heavy communication of transmitting the big data from the vehicles to the central server can be avoided. 2.3.1 Challenges in federated learning FL faces many practical challenges, where the main ones are how to secure the privacy of client data, how to address heterogeneities among clients and how to handle expensive communication. These challenges are discussed shortly below. See Kairouz et al. [4] and Li et al. [23] for a more in-depth discussion. Privacy. FL has partially emerged by the motivation to solve the privacy problem in training ML models, where the solution is to only communicate information from locally trained models. In many FL algorithms, the information shared is model parameters or gradients. However, these algorithms alone cannot guarantee complete protection of client data, since its been shown that input data can be partially reconstructed based on the observed model updates [13]. To this extent, tools like multiparty computation [24] and differential privacy [25] have emerged to enhance the privacy of federated learning. Unfortunately, these methods often come at the cost of reduced model performance. Data heterogeneity. 
Each client connected to a federated network has a dataset that is generated by a specific user or organization. In both cases, clients may observe or generate different data points based on, e.g., their location, time zone or user preferences. These differences can appear in the local label or feature distributions, as well as in the quantity of data [4, Sec. 3.1]. All of these scenarios are referred to as cases with non-IID data in FL. It is also possible for the set of active clients to change over time, which introduces another dimension of non-IID behavior. The original goal in FL, to train a single global model on the union of client data, becomes harder in the non-IID scenario. This is a widely researched topic in FL, and proposed solutions range from balancing the local data [26] and weighting client contributions [27] to modifying existing algorithms, e.g. using probabilistic ML or ensemble methods [28]. However, in some cases training a single global model might not be the right goal. Clients could benefit from having a personalized model for their specific data, e.g. for next-word prediction on mobile phones. In this case, non-IID data could be more of a feature than a problem. Systems heterogeneity. The devices participating in a federated network could use different hardware and have different network connectivity or power available, leading to varying capabilities in regard to storage, computation and communication. Devices can also be unreliable and drop out from the FL process due to bad connection or low energy. This problem is most relevant for the cross-device scenario, and algorithms developed for this case must therefore be robust to low participation, heterogeneous hardware and drop-out clients. In the cross-silo case, system heterogeneity might be less of a problem, since a participating device at a company or organization will probably not face the same limited resources. Communication. Federated networks can potentially consist of millions of devices, e.g. a network of smartphones, and communication in the network might be much slower than the local computations. It is therefore necessary to develop communication-efficient algorithms to enable training a model that fits the data generated by the devices. This means that the messages or model updates from the local training process, and the total number of communication rounds, should be minimized. 2.3.2 Federated averaging The problem in FL is to learn a global model based on data that is partitioned over multiple clients. This global model is learned under the constraint that client data cannot be transmitted to a central server. Instead, local model parameters are communicated. What has become the standard algorithm for this problem is federated averaging (FedAvg) [3], where McMahan et al. rewrite the ML objective (2.2) as

\min_{\theta} L(\theta) = \sum_{k=1}^{K} \frac{N_k}{N} L_k(\theta) \quad \text{with} \quad L_k(\theta) = \frac{1}{N_k} \sum_{i \in I_k} l_i(\theta), \qquad (2.6)

where K is the number of clients, I_k is the set of indices of data examples on client k and N_k = |I_k|. Here, l_i(\theta) is an abbreviation for l(f(x^{(i)}; \theta), y^{(i)}). In FedAvg, a selected fraction γ of the K clients start from the same initial model and perform E epochs of gradient descent (2.4) on their private data. The central server then takes a weighted average over the resulting local model parameters, which is sent to the clients once again for another training round. The details of FedAvg are outlined in Algorithm 1.
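As a complement to Algorithm 1, the server-side aggregation step can be written in a few lines of Python. This is a minimal sketch under the assumption that each client returns its updated parameters as a list of NumPy arrays together with its local data size N_k; client-side training and client sampling are omitted:

import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Weighted parameter average, theta_{t+1} = sum_k (N_k / N) * theta^k_{t+1}.

    client_params: list over clients, each a list of NumPy arrays (one per layer).
    client_sizes:  list of local dataset sizes N_k.
    """
    total = float(sum(client_sizes))
    new_params = []
    for layer_idx in range(len(client_params[0])):
        layer = sum((n / total) * params[layer_idx]
                    for params, n in zip(client_params, client_sizes))
        new_params.append(layer)
    return new_params

Calling this function once per communication round, with the returned parameters broadcast back to the clients, reproduces the weighted average carried out by the server in Algorithm 1.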
If the local data distributions Dk were formed by distributing data examples uni- formly at random over the clients, it would hold that EDk [Lk(θ)] = L(θ), meaning that minimizing any local objective Lk would be the same as minimizing the global objective L. This IID assumption is typically a criterion for algorithms in dis- tributed optimization, where the partitioned data is initially contained in a central dataset. In contrast, FL scenarios gives no control over the distribution of client data, which could make the IID assumption invalid and complicate the minimization of the global objective (2.6). Previous studies [6, 29] have shown that local weights tend to diverge in the non- IID setting with FedAvg, which makes the global objective hard to reach when aggregating the local models. One idea to counteract the weight divergence for each client is to add a regularization term towards the latest global weights. This is done in FedProx [11], where the local minimization objective is modified to 11 2. Theory and related work Algorithm 1: The FedAvg [3] algorithm. γ is the fraction of participating clients, K is the total number of clients in the federated network, η is the learning rate and Dk is the local dataset at client k with Nk = |Dk| and N = ∑ k∈Kt Nk. Server executes: initialize θ0 for each round t = 1, . . . , T do Kt ← random subset (γ fraction) from K local models for each client k ∈ Kt do θkt+1 ← ClientUpdate(k,θt) end θt+1 ← ∑K k=1 Nk N θkt+1 end ClientUpdate(k,θ): B ← (split Dk into batches of size B) for each local epoch i = 1, . . . , E do for each batch (x,y) ∈ B do ŷ = fk(x;θk) θ ← θ − η∇Lk(ŷ,y) end end return θ to server min θ Hk(θ,θt) = Lk(θ) + µ 2‖θ − θt‖ 2, (2.7) which only changes the weight update to θ ← θ−η∇(Lk(θ)+ µ 2‖θ−θ t‖2) compared to FedAvg (Algorithm 1). Here, µ is hyperparameter that determines the impact of the regularization term. FedProx has been shown to improve the convergence behavior of FL in non-IID settings. 2.4 Federated ensemble distillation In recent years, a novel algorithmic paradigm for FL problems has been proposed for dealing with many of the challenges described in Section 2.3.1. This method is called federated ensemble distillation, or federated distillation (FD), which ap- plies the knowledge distillation (KD) procedure [30, 31] in a federated setting. KD is a technique that uses an auxiliary dataset to transfer the knowledge of one or multiple ML classifiers into one single classifier. In the centralized setting, this can be done before deployment to reduce model size and complexity. This process is often explained in terms of a teacher-student analogy, where the goal is to transfer knowledge from the teacher model to a student model. 12 2. Theory and related work 2.4.1 The algorithm There exist many variations on FD algorithms in the literature. In this project, the simplest approach is considered where only one communication round of local predictions is needed, referred to as FedED, which is then used to perform KD at the central server. This procedure starts by letting each client train a local model fk(θk) on its local dataDk. The clients then generate predictions zk on the unlabeled auxiliary dataset D0. These predictions are then aggregated at the central server to form the teacher predictions zT, which are used as targets to train a student model fS(θS) with a chosen loss function LS for T distillation steps. The details of FedED are outlined in Algorithm 2. Algorithm 2: The FedED algorithm. 
Local training: train each local model f_k(θ_k) on D_k.
for each distillation step t = 1, ..., T do
    K_t ← random subset (fraction γ) of the K local models
    x_0 ← a batch of public data from D_0 with size B
    for each client k ∈ K_t do
        z_k ← f_k(x_0; θ_k)
    end
    z_T ← aggregate {z_k | k ∈ K_t}
    z_S ← f_S(x_0; θ_S)
    Update: θ_S ← θ_S − η ∇L_S(z_S, z_T)
end
When choosing the loss function L_S, it is common to use a measure based on the softmax outputs, i.e. class probabilities, of the teacher and student models, since valuable information can be retrieved from the ratios of these probabilities. In the standard distillation procedure, Hinton et al. [31] soften the class probabilities as

q_T^c = \frac{e^{z_T^c/\tau}}{\sum_c e^{z_T^c/\tau}}, \qquad (2.8)

with a temperature parameter τ, before applying the Kullback-Leibler divergence

L_{KL}(q_T, q_S) = \sum_c q_T^c \log \frac{q_T^c}{q_S^c}, \qquad (2.9)

as loss function for training the student model. Here, q_S denotes the temperature-weighted class probabilities from the student model. When using a high temperature τ, minimizing (2.9) simplifies to minimizing the mean squared error (MSE) [31]

L_{MSE}(z_S, z_T) = \frac{1}{C} \sum_c (z_S^c - z_T^c)^2, \qquad (2.10)

which can also be used in combination with L_{KL} [32] by summing the losses. The simplest approach to aggregating the teacher predictions is by averaging:

z_T = \frac{1}{|K_t|} \sum_{k \in K_t} z_k, \qquad (2.11)

where K_t is the set of participating clients in distillation round t. However, this approach deteriorates when local models have varying performance, for example due to data heterogeneity. Ideas on how to improve this aggregation in non-IID scenarios are presented in Section 3.1. 2.4.2 Distillation vs. averaging Using FD has three major benefits compared to parameter sharing algorithms. These benefits are summarized below. Model agnostic. In contrast to FedAvg, the FD approach is independent of the model architecture used at each client and therefore allows clients to train models of different types and sizes. This gives additional flexibility, which can be particularly beneficial when clients are running on heterogeneous systems. Low communication cost. The distillation process does not require the raw model parameters to be communicated to a central server. Instead, the only information leaving the clients is the softmax output on the auxiliary dataset. This changes the communication cost, which now scales with the size of the auxiliary dataset |D_0| and the number of classes C as

b \in \mathcal{O}(|D_0| C), \qquad (2.12)

instead of with the shared model parameters,

b \in \mathcal{O}(T |\theta|), \qquad (2.13)

as in FedAvg. This can give significantly reduced communication when the auxiliary dataset is smaller than the local models, especially since FD can be executed with only one communication round, whereas parameter sharing algorithms often need multiple rounds. Robustness and privacy. By sharing model parameters, FL algorithms based on parameter sharing leave the models completely exposed to adversarial or malicious clients that could influence the training. This is a severe privacy vulnerability, as the model parameters contain all information a model has about the data it has been trained on [33, 34]. FD is potentially a better choice in this respect, since it is protected against attacks directly on the global model parameters. Furthermore, it would not be possible to reconstruct local data from model predictions, as is possible to some extent when sharing intermediate model updates [13].
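Before turning to related work, the pieces of Section 2.4.1 can be tied together in a short PyTorch sketch of the server-side distillation step: the unweighted teacher aggregation (2.11), the temperature-softened KL loss (2.8)-(2.9) and the MSE alternative (2.10). The sketch assumes the clients share raw model outputs (logits) and that a student model and optimizer already exist; it is illustrative only and not the exact training loop used in the experiments (e.g. tau = 3 is an arbitrary example value):

import torch
import torch.nn.functional as F

def aggregate_teacher(client_outputs):
    # Eq. (2.11): unweighted average over the participating clients.
    # client_outputs: list of tensors of shape (batch, C).
    return torch.stack(client_outputs).mean(dim=0)

def kl_distillation_loss(student_out, teacher_out, tau=3.0):
    # Eqs. (2.8)-(2.9): soften both distributions with temperature tau
    # and compute KL(q_T || q_S).
    log_q_s = F.log_softmax(student_out / tau, dim=-1)
    q_t = F.softmax(teacher_out / tau, dim=-1)
    return F.kl_div(log_q_s, q_t, reduction="batchmean")

def mse_distillation_loss(student_out, teacher_out):
    # Eq. (2.10): mean squared error between student and teacher outputs.
    return F.mse_loss(student_out, teacher_out)

def distillation_step(student, optimizer, x_aux, client_outputs, loss_fn):
    # One update of the student parameters: theta_S <- theta_S - eta * grad L_S.
    z_t = aggregate_teacher(client_outputs)
    z_s = student(x_aux)
    loss = loss_fn(z_s, z_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()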
Theory and related work 2.5 Related work The most common differences among currently published FD algorithms are the assumptions on auxiliary data, the communication procedure and what information is being sent from the clients to the central server. This project will consider a one- shot FD approach where only one round of communication is used between clients and the central server. It is assumed that an unlabeled auxiliary dataset is available, on which clients share its predictions after local training is finished. This approach is also considered by Guha et al. [35] and Gong et al. [36]. Guha et al. compare one-shot FD against different ensemble methods and show that performance of FD approach the ensemble performance as the size of the auxiliary data is increased. However, there is no comparison against FedAvg and no experiments are presented for non-IID settings. Different approaches have been proposed for dealing with non-IID data using FD algorithms. This project will specifically consider the approach of using a weighting scheme to control each clients’ impact on the aggregated predictions. While weighted ensembles have been widely studied for centralized ML [37–39], the weights are often based on each models’ performance on a shared validation set, which consequently need to be labeled. Among the published FD algorithms that have been found during this literature review, only two studies considers the use of a weighting scheme to handle non-IID data. In the first one, FedAD by Gong et al. [36], a class-wise weighting scheme is proposed, such that the predictions from client k in class c is weighted by N c k/ ∑ kN c k . Their method is shown to outperform FedAvg in non-IID settings, but it is uncertain if this is due to the weighting scheme or the additional sharing of attention maps. This weighting scheme also cause additional privacy concerns, since the label distribution for all clients would need to be stored at each client or the central server. The second study that considers a weighting scheme in FD is by Sattler et al. [40], who propose the algorithm FedAux where weighting is performed sample-wise by using certainty scores. The certainty scores are based on the loss from a logistic regression classifier that is trained to separate the local data from a subset of the auxiliary data. Their method is shown to outperform FedAvg in the considered experiments and to maintain stable performance when the level of non-IID is in- creased. In contrast to Algorithm 2 considered in this study, the communication procedure in FedAux is customized to run for multiple rounds. However, FedAux was also shown to outperform the considered baselines by a large margin when only using one round of communication. As mentioned before, many variations on FD algorithms have been published. FedMD, proposed by Li and Wang [41], assumes a labeled auxiliary dataset and start each round by letting the clients train on both auxiliary and private data and then trans- mits predictions on the auxiliary data to the central server for aggregation. In the subsequent rounds, the aggregated predictions are used as targets on the auxiliary data during training. Lin et al. [42] propose FedDF, where ensemble distillation is performed on top of FedAvg to refine the global model. This is shown to result in fewer communications rounds compared to federated averaging methods, but it 15 2. Theory and related work is still based on the same communication protocol as FedAvg. 
A probabilistic ap- proach is considered in FedBE by Chen and Chao [28], where Bayesian tools are used to select a subset of all local models as teacher models. Distillation is then per- formed with unlabeled auxiliary data to transfer the Bayesian teachers’ knowledge into a global student model, which then acts as an initial model for the next round of local training. This approach is also more related to FedAvg than FD since model parameters are communicated between clients and central server, making it impossible to have heterogeneous models. 16 3 Method This chapter will first introduce and motivate the weighting schemes integrated with FD in this project. The following sections will then cover the datasets used and how non-IID distributions of data have been generated. Lastly, some details are given on the experimental setup and hyper parameters used for training the ML models. 3.1 Weighting schemes In this study, three different weighting schemes have been considered for improving the one-shot FD algorithm (Algorithm 2) in non-IID settings. The first one considers the case where clients are weighted based on their local data sizes zT = ∑ k wkzk, wk = Nk N , (3.1) which is the same weighting that is used in the original FedAvg algorithm. Since clients share their softmax output on a batch of auxiliary data, FD also allows to weight client contributions class-wise, as zcT = ∑ k wckz c k, wck = N c k∑ kN c k , (3.2) where N c k is the number of samples client k has in class c. This weighting scheme has earlier been tested by Gong et al. [36], but its properties have not been fully investigated in different non-IID scenarios. This weighting scheme has therefore also been included in the experiments of this study. Another possibility is to weight client contributions sample-wise, as previously in- vestigated by Sattler et al. [40], where a client prediction on a sample is weighted based on the similarity of that sample with the local dataset. This means that each client k would have a weight wjk for each sample xj ∈ D0. With this approach, local data sizes or class distributions do not have to be shared directly, as is the case with weighting schemes (3.1) and (3.2). To measure the data similarity, the naive approach would be to model the local data distributions Pk(x) and use some metric (i.e distance to mean of Pk(x)) to determine how much a sample xj ∈ D0 deviates from Pk(x). However, when using high 17 3. Method dimensional data such as images, defining a distribution that models the complex correlations between all pixels is a challenging task. From the studies in this thesis, it is proposed to simplify this problem by introducing a latent variable h, and instead define a conditional distribution Pk(x|h) for the local data Dk. The latent variable is a low-level representation of the high-dimensional input data and can be used to provide an abstract view of the data distribution. This thesis has specifically investigated the possibility of modeling the conditional distribution Pk(x|h) using ML with an ANN architecture called autoencoder (AE). This network has a bottleneck structure and is trained to minimize the reconstruc- tion loss LREC when samples are compressed and then expanded, see Figure 3.1. A common choice of reconstruction loss is the mean squared error LREC(x,x′) = 1 n n∑ l=1 (xl − x′l)2, (3.3) where x′l is the l’th pixel of the autoencoder output x′. x1 x2 x3 xn ... ... ... x′1 x′2 x′3 x′n ... 
Figure 3.1: The AE network architecture, mapping an input x through a bottleneck with latent variable h to a reconstruction x′. When the AE has learned a conditional distribution P_k(x|h) by minimizing the MSE (3.3) on the data D_k, the loss for samples x_0 ∈ D_0 will be low if x_0 is similar to samples in D_k. For the AE, this means that these samples map to the same location in latent space. This is illustrated with an example in Figure 3.2. To apply this in a federated setting, each client k would train an AE model on local data D_k and calculate the losses l_k(x_j) := L_{REC}(x_j, x′_j) for all auxiliary samples x_j ∈ D_0. Since a low reconstruction loss corresponds to similarity between sample x_j and the local data D_k, the weight w_k^j should be high when l_k(x_j) is low. This can be achieved with the relation w_k^j = 1/l_k(x_j)^β, where β ∈ R_+. This concludes the third weighting scheme implemented in this project, defined as

z_T^j = \sum_k w_k^j z_k^j, \quad w_k^j = \frac{1}{l_k(x_j)^6}, \qquad (3.4)

where β = 6 was chosen by experimentation, tuning over the integers 1-7. (A code sketch of this weighting scheme is given after the dataset descriptions below.) (a) Example distribution of client data D. (b) The AE maps training data D into clusters in latent space. (c) Similar samples from D_0 are mapped to the same clusters. (d) The local reconstruction loss is low for the most frequent digits. (e) Similar samples from D_0 will also have a low reconstruction loss. Figure 3.2: An example showing how data similarity can be measured with the reconstruction loss from an AE, using the MNIST dataset. The auxiliary data is here taken as a subset of the complete dataset. 3.2 Datasets This project considers the task of image classification. For this purpose, three common datasets for computer vision problems have been used in the experiments. These datasets are summarized below. MNIST. The MNIST [43] dataset is a collection of grayscale images of handwritten digits that is commonly used for training and testing models for computer vision tasks. Since its introduction, it has become a standard benchmark for image classification models. Each image has one color channel and a size of 28x28 pixels. The dataset is divided into a training set and a test set of 60000 and 10000 examples, respectively, both of which are balanced over the 10 classes. Figure 3.3: Sample images from the MNIST dataset, corresponding to handwritten digits 0-9. EMNIST. EMNIST is an extension of the MNIST dataset that contains handwritten upper- and lower-case letters in addition to handwritten digits. The images have the same format as MNIST and there are 6 different splits available for the dataset. This project uses the letters split, which consists of 125600 training examples and 20000 test examples. The training and test sets are both balanced over the 26 available classes.
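As referenced in Section 3.1, the sample-wise scheme (3.4) can be sketched in a few lines of PyTorch. The sketch assumes each client's trained autoencoder is available as a torch module and that the clients' outputs on the auxiliary batch have already been collected; the small epsilon and the normalization of the weights over clients are implementation assumptions that (3.4) leaves implicit:

import torch

def reconstruction_losses(autoencoder, x_aux):
    # Per-sample MSE reconstruction loss (3.3) on the auxiliary batch.
    with torch.no_grad():
        x_rec = autoencoder(x_aux)
    return ((x_aux - x_rec) ** 2).flatten(start_dim=1).mean(dim=1)   # shape (|D0|,)

def sample_wise_teacher(client_outputs, client_losses, beta=6.0, eps=1e-8):
    # Eq. (3.4): weight client k's prediction on sample j by 1 / l_k(x_j)^beta,
    # then normalize the weights over clients for each sample (assumed here).
    z = torch.stack(client_outputs)                                   # (K, |D0|, C)
    w = torch.stack([1.0 / (l + eps) ** beta for l in client_losses]) # (K, |D0|)
    w = w / w.sum(dim=0, keepdim=True)
    return (w.unsqueeze(-1) * z).sum(dim=0)                           # (|D0|, C)

In the federated setting, reconstruction_losses would be evaluated at each client on its own autoencoder, and only the resulting loss vector (or the weights derived from it) would be sent to the server together with the predictions.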
3.2 Datasets

This project considers the task of image classification. For this purpose, three common datasets for computer vision problems have been used in the experiments. These datasets are summarized below.

MNIST. The MNIST [43] dataset is a collection of grayscale images of handwritten digits that is commonly used for training and testing models for computer vision tasks. Since its introduction, it has become a standard benchmark for image classification models. Each image has one color channel and a size of 28x28 pixels. The dataset is divided into a training set and a test set of 60000 and 10000 examples, respectively, both of which are balanced over the 10 classes.

Figure 3.3: Sample images from the MNIST dataset, corresponding to handwritten digits 0-9.

EMNIST. EMNIST is an extension of the MNIST dataset, which contains handwritten upper- and lower-case letters in addition to handwritten digits. The images have the same format as MNIST and there are 6 different splits available for the dataset. This project uses the letters split of the dataset, which consists of 125600 training examples and 20000 test examples. The training and test sets are both balanced over the 26 available classes.

Figure 3.4: Sample images from the EMNIST letters dataset, corresponding to handwritten letters a-z and A-Z.

CIFAR-10. A more advanced dataset that is also a very common benchmark for image classification is the CIFAR-10 [44] dataset. It consists of 32x32 pixel color images of objects in the categories airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The data is divided into a training set of 50000 examples and a test set of 10000 examples, both of which are balanced over the classes.

Figure 3.5: Sample images from the CIFAR-10 dataset, corresponding to objects in the categories airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.

3.3 Generation of non-IID data

Local training data has been generated with the commonly used Dirichlet sampling procedure proposed by Hsu et al. [29]. This strategy involves a concentration parameter α that makes it easy to adjust the level of heterogeneity among the clients. The data is generated by sampling a matrix

    P = [p_1, \ldots, p_C] \in \mathbb{R}^{K \times C}, \qquad p_1, \ldots, p_C \sim \mathrm{Dir}(\alpha),        (3.5)

from the symmetric K-dimensional Dirichlet distribution, where K is the number of clients and C the number of classes. It then holds that \|p_i\|_1 = 1 for i = 1, \ldots, C. Each client i is then assigned P_{ij} N^j non-overlapping samples from class j, where N^j is the total number of training samples in class j. The data splitting procedure is illustrated in Figure 3.6 with K = 10 and C = 10.

Figure 3.6: Illustration of the Dirichlet splitting procedure used in this project. The dot size is relative to the total number of samples over the clients. The data splits are more non-IID for lower values of α.
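A minimal sketch of this partitioning step is given below, assuming the labels are available as a NumPy array. The function and variable names are hypothetical, and the rounding of the fractional counts P_{ij} N^j to integers is one of several possible conventions.

```python
import numpy as np

def dirichlet_partition(labels, n_clients=10, alpha=0.1, seed=0):
    """Split sample indices over clients class-by-class according to Dir(alpha)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx_c = np.where(labels == c)[0]
        rng.shuffle(idx_c)
        # p_c ~ Dir(alpha): one proportion per client for this class
        p_c = rng.dirichlet(alpha * np.ones(n_clients))
        # cut points so that client i receives roughly P_ic * N^c samples
        cuts = (np.cumsum(p_c)[:-1] * len(idx_c)).astype(int)
        for i, part in enumerate(np.split(idx_c, cuts)):
            client_indices[i].extend(part.tolist())
    return [np.array(idx) for idx in client_indices]

# Example: client_sets = dirichlet_partition(train_labels, n_clients=10, alpha=0.1)
```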
3.4 Experimental setup

This section explains the details of the setup used in the performed experiments. The code implementation is available at https://github.com/oscareriksson/FedML-master-thesis.

Datasets and models. The FedED algorithm (Algorithm 2) has been evaluated on MNIST, EMNIST and CIFAR-10 with the weighting schemes (3.1), (3.2) and (3.4), referred to as FedED-w0, FedED-w1 and FedED-w2, respectively. For each dataset, one half is distributed as local data and the other half is used as auxiliary data. For MNIST and EMNIST, three different CNNs have been tested in the experiments, referred to as CNN1, CNN2 and CNN3. The local model is CNN1 in all MNIST and EMNIST experiments, while all three architectures have been tested as the student model. More details on these models are presented in Appendix A. For CIFAR-10, a Resnet-18 [45] architecture was used for both the local models and the student model. The autoencoder architectures used for FedED-w2 are presented in Appendix A.2.

Table 3.1: Sizes of the used models, assuming a float is 24 bytes.

Name        CNN1            CNN2            CNN3            Resnet18
Dataset     MNIST  EMNIST   MNIST  EMNIST   MNIST  EMNIST   CIFAR-10
Size (MB)   0.03   0.06     0.10   0.50     0.20   0.57     28.36

Federated environment. Each experiment was conducted with K = 10 clients and client data generated with α values 10, 1, 0.1 and 0.01, according to the Dirichlet sampling procedure explained in Section 3.3. The fraction γ of participating clients in each round was set to 1.

Distillation settings. The number of local epochs was set to E = 20 for all distillation experiments. The weighting schemes have been tested for five different sizes |D_0| of auxiliary data for each dataset. The student loss function has been tested as both MSE (2.10) and CE (2.3).

Optimization. Local models were trained with the SGD (stochastic gradient descent) optimizer with learning rate 1e-3 and momentum 0.9. Student models were trained with the Adam optimizer with learning rate 1e-3 for FedED-w0 and FedED-w1. For FedED-w2, performance improved with a lower learning rate, which was therefore set to 1e-5. The AE models in FedED-w2 were also trained with the Adam optimizer with learning rate 1e-3.

Baselines. The performance is compared against the state-of-the-art FL algorithms FedAvg and FedProx. The auxiliary dataset is assumed to be unlabeled in this project and is therefore not included as training data for FedAvg and FedProx. For these algorithms, the number of local epochs was set to E = 1 and the number of communication rounds to T = 100. These settings were observed to be enough for converged performance and were therefore considered sufficient for a fair comparison. The proximal parameter µ was set to 0.1 for all datasets and distributions after tuning among the values [0.01, 0.1, 0.5, 1]. Also included in the comparison is the ensemble performance, referred to as Ensemble-w0, Ensemble-w1 and Ensemble-w2, which is obtained from the aggregated predictions of all local models.

Hardware. Experiments have been executed on a virtual machine provided by AI Sweden, using an Nvidia RTX5000 GPU, 4 CPU cores, 32 GB RAM and 1 TB disk space.
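To make the distillation step concrete, the sketch below trains a student model on auxiliary data against precomputed aggregated teacher predictions, using the optimizer settings above. It is a simplified, hypothetical rendering rather than the project implementation; in particular, the exact forms of the MSE (2.10) and CE (2.3) student losses are defined earlier in the thesis.

```python
import torch

def distill_student(student, aux_loader, teacher_probs, loss_type="mse",
                    epochs=20, lr=1e-3):
    """
    One-shot FedED distillation: fit the student to the (weighted) ensemble
    predictions z_T on the auxiliary data.

    aux_loader yields (batch_of_aux_images, batch_indices) so that the matching
    rows of teacher_probs (a tensor of shape (N0, C)) can be looked up.
    """
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x0, idx in aux_loader:
            z_t = teacher_probs[idx]                    # aggregated teacher predictions
            z_s = torch.softmax(student(x0), dim=1)     # student predictions
            if loss_type == "mse":
                loss = ((z_s - z_t) ** 2).mean()
            else:                                       # CE against soft targets
                loss = -(z_t * torch.log(z_s + 1e-12)).sum(dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```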
4 Results

The results are divided into three parts, based on the different datasets explored. All algorithms are evaluated and compared based on the accuracy on the test dataset for different values of the concentration parameter α. Each weighting scheme is tested for different auxiliary data sizes |D_0|. MNIST and EMNIST results are presented for different student models of varying complexity. For further details, see Section 3.4. The loss function used when training the student model is MSE, unless stated otherwise. See model details in Appendix A. The results have been generated and averaged over 10 different seeds, such that each algorithm has been tested on the same distributions of local data for a given α. For more detailed results on training progress and global test performance, the reader is referred to Appendix B, where training curves are presented for all considered datasets and algorithms.

4.1 MNIST

Table 4.1 shows how the implemented FD weighting schemes compare to the chosen baselines for different values of α. The auxiliary data size is here set to |D_0| = 30000 samples and the student model is CNN3, which is the largest model considered in the experiments, but still smaller than all client models combined. For a low degree of data heterogeneity (α = 10 and α = 1), FD displays similar performance to the baselines, although FedED-w2 performs slightly worse for these α values. For increased data heterogeneity (α = 0.1 and α = 0.01), the performance drop is greatest for FedED-w0 and FedED-w1, while FedED-w2 maintains high accuracy and performs better than the baselines. FedED-w2 with CE loss is the only case where the test accuracy seems to increase with a higher degree of non-IID data.

Figure 4.1 shows how the auxiliary data size |D_0| affects the performance of the considered weighting schemes. In general, FedED-w2 needs more auxiliary samples to reach high accuracy, while the performance of FedED-w0 and FedED-w1 does not change much when |D_0| is increased. As presented in Table 4.1, FedED-w2 achieves significantly higher accuracy than the other weighting schemes for α = 0.1 and α = 0.01.

Table 4.1: Mean test accuracy and standard deviation on MNIST for different α with |D_0| = 30000 and the CNN3 student model.

Algorithm        α = 10        α = 1         α = 0.1       α = 0.01
FedAvg           94.14±0.75    93.96±0.92    91.83±1.16    88.37±1.39
FedProx          94.11±0.74    93.99±0.91    91.93±1.14    88.30±1.47
FedED-w0         94.08±0.65    93.78±0.62    76.43±6.53    42.23±6.65
FedED-w1         94.27±0.73    94.19±0.62    80.94±9.60    37.12±3.58
FedED-w2         93.05±0.58    93.02±0.46    90.98±1.88    90.36±1.94
FedED-w2 (CE)    92.14±1.20    92.73±0.86    93.28±1.10    93.87±0.92
Ensemble-w0      90.70±1.47    91.11±1.14    75.52±6.79    39.58±5.89
Ensemble-w1      90.78±1.45    91.66±1.09    88.46±2.86    38.06±13.27
Ensemble-w2      90.87±1.43    92.02±0.96    93.08±1.16    92.73±1.48

Figure 4.1: Comparison of test accuracy on MNIST for the implemented weighting schemes with varying data size |D_0|. The student model is CNN3 and the loss function is MSE.

Figure 4.2 compares the test accuracy of the weighting schemes for different student models at α = 10. The choice of student model mostly affects the performance of FedED-w2. As mentioned in Section 3.4, the local model for each client is CNN1. In this case, CNN1 can also be used as the student model and achieve good performance. However, the performance increases with model complexity, as can be seen from the results with CNN2 and CNN3.

Figure 4.2: Comparison of test accuracy on MNIST for different student models with α = 10.

4.2 EMNIST

As with MNIST, Table 4.2 shows that with enough auxiliary data and student model complexity, FD is comparable with the baselines for lower levels of non-IID data. Overall, FedED-w2 with CE loss is the best performing algorithm on EMNIST, and its performance increases with lower α.

Table 4.2: Mean test accuracy and standard deviation on EMNIST for different α with |D_0| = 60000 and the CNN3 student model.

Algorithm        α = 10        α = 1         α = 0.1       α = 0.01
FedAvg           78.85±1.14    78.29±1.11    72.36±1.51    65.95±1.92
FedProx          78.79±1.15    78.30±1.13    72.39±1.45    66.00±1.90
FedED-w0         77.11±0.64    76.11±0.65    62.64±3.78    37.91±4.23
FedED-w1         77.29±0.48    77.81±0.26    63.26±8.92    31.22±6.78
FedED-w2         75.10±0.74    74.98±0.49    71.80±1.33    70.37±1.82
FedED-w2 (CE)    78.47±1.25    78.92±1.06    78.58±1.32    80.12±1.20
Ensemble-w0      78.09±1.05    78.23±1.07    64.79±3.52    34.93±4.36
Ensemble-w1      78.18±1.08    79.08±0.98    70.59±4.53    38.34±14.52
Ensemble-w2      78.26±1.10    79.74±0.89    80.48±1.12    81.18±1.06

Similar to MNIST, Figure 4.3 shows that FedED-w2 needs more auxiliary samples than FedED-w0 and FedED-w1 to reach high accuracy also on EMNIST. For α = 0.1 and α = 0.01, the test accuracy is significantly higher with FedED-w2.

Figure 4.3: Comparison of test accuracy on EMNIST for the implemented weighting schemes with varying data size |D_0|. The student model is CNN3 and the loss function is MSE.

In Figure 4.4, the results show that the complexity of the student model affects the performance considerably for all weighting schemes. In this case, the local model architecture CNN1 does not achieve performance comparable to the baselines when used as the student model; the larger models, CNN2 and CNN3, are needed to achieve comparable performance.

Figure 4.4: Comparison of test accuracy on EMNIST for different student models with α = 10.

4.3 CIFAR-10

CIFAR-10 is a harder image classification task than the previously considered MNIST and EMNIST, which can be seen from the overall lower accuracy in Table 4.3. For this dataset, the FD algorithms do not achieve higher test accuracy than the baselines for any α with the considered student model and auxiliary data. However, the performance of the ensemble model Ensemble-w2 is close to the baselines for α = 0.1 and α = 0.01.
Table 4.3: Mean test accuracy and standard deviation on CIFAR-10 for different α with |D_0| = 25000 and the Resnet-18 student model.

Algorithm        α = 10        α = 1         α = 0.1       α = 0.01
FedAvg           51.36±0.46    50.17±0.51    40.42±2.33    29.99±2.62
FedProx          51.58±0.59    49.83±0.75    40.27±2.96    29.91±2.69
FedED-w0         48.36±1.29    47.77±0.63    37.41±2.51    24.16±5.56
FedED-w1         48.27±0.99    46.88±1.98    33.81±4.69    15.87±3.93
FedED-w2         47.37±0.73    46.31±0.87    35.38±1.45    24.17±7.47
FedED-w2 (CE)    45.05±0.82    44.03±0.73    35.32±1.53    24.82±6.68
Ensemble-w0      50.65±0.59    49.98±0.58    38.04±3.34    24.52±5.84
Ensemble-w1      50.61±0.58    49.96±0.49    38.14±5.07    18.55±5.75
Ensemble-w2      50.53±0.51    49.79±0.68    40.14±1.73    28.43±7.89

Figure 4.5: Comparison of test accuracy on CIFAR-10 for the implemented weighting schemes with varying data size |D_0|. The student model is Resnet-18 and the loss function is MSE.

4.4 Communication costs

The communication costs presented in Table 4.4 show that FedED transmits less data than FedAvg and FedProx in all of the considered experiments. The difference in communication is especially large on the CIFAR-10 dataset, which has the smallest auxiliary dataset and the largest model size among the experiments.

Table 4.4: Communication costs for the conducted experiments, measured in MB transferred between clients and the central server and assuming a float is 24 bytes. The auxiliary data size |D_0| is here set to the largest size used for each dataset.

Algorithm         Cost formula    MNIST   EMNIST   CIFAR-10
FedED             |D_0| K C       7.2     37.4     6.0
FedAvg/FedProx    2 |θ| K T       25.0    63.0     28359.4
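The cost formulas in Table 4.4 can be expressed directly in code. The sketch below only encodes the formulas as stated, with the bytes-per-value factor and the variable names left as assumptions, and it does not attempt to reproduce the exact table entries.

```python
def fd_cost(n_aux, n_clients, n_classes, bytes_per_value):
    """One-shot FedED upload: each client sends one C-dimensional prediction
    per auxiliary sample, i.e. |D_0| * K * C values in total."""
    return n_aux * n_clients * n_classes * bytes_per_value

def avg_cost(model_bytes, n_clients, n_rounds):
    """Parameter averaging: every round each client downloads and uploads the
    full model, i.e. 2 * |theta| * K * T."""
    return 2 * model_bytes * n_clients * n_rounds

# Example usage (hypothetical values):
# fd_cost(n_aux=30000, n_clients=10, n_classes=10, bytes_per_value=4)
# avg_cost(model_bytes=25_000, n_clients=10, n_rounds=100)
```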
5 Discussion

This chapter provides an analysis of the results obtained in this study. A short discussion is also given on the practical use of FD.

5.1 Performance analysis of weighting schemes

The results presented in the previous chapter indicate that the one-shot FD algorithm can outperform standard FL algorithms on MNIST and EMNIST by using the sample-wise weighting scheme FedED-w2.

The weighting schemes FedED-w0 and FedED-w1 achieve baseline performance when data is close to IID (α = 10). However, their performance is much lower than the baselines for lower values of α. A possible reason for this is the loss of information in the class probabilities when clients do not have enough data in all classes. In this case, local models only generate class probabilities for the classes they have been trained on, independent of the input sample from the auxiliary dataset. This makes the aggregated class probabilities less useful for training a student model, since they are more likely to represent the global class distribution than to provide a valid prediction on an auxiliary sample.

As an example, assume a client only has data in classes 0 and 1 out of 10 total classes. The local model then only learns to distinguish between classes 0 and 1, and hence only generates class probabilities for these classes when predicting on auxiliary data. In this case, FedED-w0 and FedED-w1 can only reduce these class probabilities if the local data size is small relative to other clients. If the local data size is large relative to other clients, these class probabilities will instead be amplified, and the aggregated class probabilities will be heavily weighted towards classes 0 and 1, which makes them less informative for training the student model.

This thesis introduces the novel weighting scheme FedED-w2, which is shown to obtain high performance with non-IID distributions of MNIST and EMNIST. The best performance is achieved when using CE as the loss function for training the student model, where the results show an increase in test accuracy when data heterogeneity is increased. This is a reasonable outcome, since it is easier for the local AE to achieve a low reconstruction loss if the training samples have few and similar features. This is the case when the level of non-IID is high, since clients then only have data in a few classes, resulting in fewer features that the AE needs to reconstruct with low loss.

The performance of FedED-w2 is improved when using CE loss compared to MSE. A possible reason for this improvement is that when the reconstruction loss is low for a certain sample, the local model has likely seen many similar samples. This in turn means that the model probably has a confident prediction for this sample, which receives a large weight when forming the aggregated predictions z_T due to the low reconstruction loss. In this case, the MSE loss diminishes faster than the CE loss as the student predictions z_S improve, and therefore halts the training of the student model at an earlier point. This can be understood by observing that the MSE loss (z_S^c - z_T^c)^2 converges to 0 faster than the CE loss -log z_S^c when z_S^c approaches 1, assuming that the aggregated class probabilities produce a confident prediction z_T^c ≈ 1.
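A small numerical check illustrates this saturation effect. The snippet below, which is not part of the thesis experiments, simply evaluates the two losses and the magnitudes of their gradients with respect to z_S^c for a confident target z_T^c = 1.

```python
import math

# Loss and gradient magnitude w.r.t. the student probability z, with target 1.
for z in (0.5, 0.9, 0.99, 0.999):
    mse, d_mse = (z - 1.0) ** 2, abs(2 * (z - 1.0))
    ce, d_ce = -math.log(z), abs(1.0 / z)
    print(f"z_S={z:6.3f}  MSE={mse:.6f} (grad {d_mse:.4f})  CE={ce:.6f} (grad {d_ce:.4f})")

# As z_S -> 1, the MSE gradient 2(z_S - 1) goes to zero while the CE gradient
# 1/z_S stays close to 1, so CE keeps providing a training signal for longer.
```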
On CIFAR-10, the overall test accuracy is lower with FD than with the considered baselines. However, Table 4.3 shows that the ensemble performance is close to the baselines in some cases, e.g. for Ensemble-w2 with α = 0.1. Since the results on MNIST and EMNIST show that the student model can reach the ensemble test accuracy in many cases, given enough complexity and auxiliary data, some performance gain could be obtained on CIFAR-10 by improving the distillation procedure. This could be done by changing the student model architecture, using more auxiliary data, or trying different distillation techniques, such as softening the class probabilities as explained in Section 2.4.1.

5.2 Using federated distillation in practice

FD introduces some additional challenges compared to standard FL algorithms that need to be addressed. First, an unlabeled auxiliary dataset needs to be available. This is not a problem for general use cases where public datasets from the same domain may already be available, but it might be a problem for more specific tasks. An advantage of the FD algorithm considered in this study is that the auxiliary data can be unlabeled. The algorithm is thus suitable for cases where auxiliary data is unlimited but labeling is expensive. One example is image segmentation or object detection in video material: video can in some cases be an unlimited resource, but the labeling might be time consuming or require an expert.

Second, the student model architecture and distillation procedure need to be selected. The results from this study indicate that the choice of student model can have a large impact on the final performance, as can the choice of loss function for training the student model. However, this problem can be addressed separately from local training when only one communication round is used in FD, meaning that tuning the student model training does not impose additional communication costs.

6 Conclusion

The experiments in this study indicate that the considered FD algorithm can achieve performance similar to standard parameter averaging algorithms when data is close to IID on less complex image classification tasks. When local data is non-IID, the experiments show that FD experiences a greater performance loss than parameter averaging algorithms, and that this is not improved by weighting schemes based on the local data size or label distribution. This study has shown that the sample-wise weighting scheme FedED-w2 can withstand this performance loss and outperforms the considered baselines on two of the datasets used in the experiments.

The FD approach introduces many challenging problems, of which this study considers the problem of aggregating local model predictions on auxiliary data when local training data is non-IID. An important aspect of this method is to ensure the quantity and quality of the auxiliary data. This study has shown that the size of the auxiliary dataset is important for achieving high performance, especially with FedED-w2, which also demands a certain similarity between the local and auxiliary datasets. Future work therefore includes investigating FD algorithms that are less dependent on auxiliary data.

Bibliography

[1] Forbes. (2019). "Rethinking privacy for the AI era," [Online]. Available: https://www.forbes.com/sites/insights-intelai/2019/03/27/rethinking-privacy-for-the-ai-era/?sh=479bcdb67f0a (visited on 05/13/2022).
[2] B. Wolford. (2022). "What is GDPR, the EU's new data protection law?" [Online]. Available: https://gdpr.eu/what-is-gdpr/ (visited on 05/13/2022).
[3] H. B. McMahan, E. Moore, D. Ramage, et al., "Communication-efficient learning of deep networks from decentralized data," 2017. arXiv: 1602.05629 [cs.LG].
[4] P. Kairouz, H. B. McMahan, B. Avent, et al., "Advances and open problems in federated learning," 2021. arXiv: 1912.04977 [cs.LG].
[5] F. Zhou and G. Cong, "On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization," Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
[6] Y. Zhao, M. Li, L. Lai, et al., "Federated learning with non-IID data," 2018. arXiv: 1806.00582 [cs.LG].
[7] J. Deng, W. Dong, R. Socher, et al., "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
[8] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," CoRR, vol. abs/1609.07843, 2016. arXiv: 1609.07843.
[9] M. Mohri, G. Sivek, and A. T. Suresh, "Agnostic federated learning," 2019. arXiv: 1902.00146 [cs.LG].
[10] T. Li, M. Sanjabi, A. Beirami, and V. Smith, "Fair resource allocation in federated learning," 2020. arXiv: 1905.10497 [cs.LG].
[11] T. Li, A. K. Sahu, M. Zaheer, et al., "Federated optimization in heterogeneous networks," 2020. arXiv: 1812.06127 [cs.LG].
[12] S. P. Karimireddy, S. Kale, M. Mohri, et al., "SCAFFOLD: Stochastic controlled averaging for federated learning," 2021. arXiv: 1910.06378 [cs.LG].
[13] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, "Inverting gradients - how easy is it to break privacy in federated learning?" CoRR, vol. abs/2003.14053, 2020. arXiv: 2003.14053.
[14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014.
[15] A. Hard, K. Rao, R. Mathews, et al., "Federated learning for mobile keyboard prediction," 2018.
[16] T. Yang, G. Andrew, H. Eichner, et al., "Applied federated learning: Improving Google keyboard query suggestions," 2018.
[17] Apple. (2019). "Designing for privacy (video and slides)," [Online]. Available: https://developer.apple.com/videos/play/wwdc2019/708 (visited on 05/13/2022).
[18] support.google. (2019). "Your chats stay private while Messages improves suggestions," [Online]. Available: https://support.google.com/messages/answer/9327902 (visited on 05/13/2022).
[19] Intel PR. (2020). "Intel works with University of Pennsylvania in using privacy-preserving AI to identify brain tumors," [Online]. Available: https://newsroom.intel.com/news/intel-works-university-pennsylvania-using-privacy-preserving-ai-identify-brain-tumors/#gs.vkzoux (visited on 05/13/2022).
[20] P. Courtiol, C. Maussion, M. Moarii, et al., "Deep learning-based classification of mesothelioma improves prediction of patient outcome," Nature Medicine, vol. 25, pp. 1519-+, 2019.
[21] I. Dayan, H. Roth, A. Zhong, et al., "Federated learning for predicting clinical outcomes in patients with COVID-19," Nature Medicine, vol. 27, pp. 1735-1743, 2021.
[22] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, "Federated learning for ultra-reliable low-latency V2V communications," CoRR, vol. abs/1805.09253, 2018. arXiv: 1805.09253.
[23] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50-60, 2020.
[24] B. Jayaraman, L. Wang, D. Evans, and Q. Gu, "Distributed learning without distress: Privacy-preserving empirical risk minimization," in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, et al., Eds., vol. 31, Curran Associates, Inc., 2018.
[25] Y. Li, T.-H. Chang, and C.-Y. Chi, "Secure federated averaging algorithm with differential privacy," in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), 2020, pp. 1-6.
[26] W. Hao, M. El-Khamy, J. Lee, et al., "Towards fair federated learning with zero-shot data augmentation," CoRR, vol. abs/2104.13417, 2021. arXiv: 2104.13417.
[27] J. Reyes, L. Di-Jorio, C. Low-Kam, and M. Kersten-Oertel, "Precision-weighted federated learning," CoRR, vol. abs/2107.09627, 2021. arXiv: 2107.09627.
[28] H.-Y. Chen and W.-L. Chao, "FedBE: Making Bayesian model ensemble applicable to federated learning," 2021. arXiv: 2009.01974 [cs.LG].
[29] T.-M. H. Hsu, H. Qi, and M. Brown, "Measuring the effects of non-identical data distribution for federated visual classification," 2019. arXiv: 1909.06335 [cs.LG].
[30] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: Association for Computing Machinery, 2006, pp. 535-541.
[31] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015. arXiv: 1503.02531 [stat.ML].
[32] U. Asif, J. Tang, and S. Harrer, "Ensemble knowledge distillation for learning improved and efficient networks," CoRR, vol. abs/1909.08097, 2019. arXiv: 1909.08097.
[33] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, "Exploiting unintended feature leakage in collaborative learning," 2018.
[34] M. Nasr, R. Shokri, and A. Houmansadr, "Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning," in 2019 IEEE Symposium on Security and Privacy (SP), IEEE, 2019.
[35] N. Guha, A. Talwalkar, and V. Smith, "One-shot federated learning," 2019.
[36] X. Gong, A. Sharma, S. Karanam, et al., "Ensemble attention distillation for privacy-preserving federated learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 15076-15086.
[37] D. Jimenez, "Dynamically weighted ensemble neural networks for classification," in 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227), vol. 1, 1998, pp. 753-756.
[38] P. Sollich and A. Krogh, "Learning with ensembles: How over-fitting can be useful," in Proceedings of the 8th International Conference on Neural Information Processing Systems, Cambridge, MA, USA: MIT Press, 1995, pp. 190-196.
[39] M. P. Perrone and L. N. Cooper, "When networks disagree: Ensemble methods for hybrid neural networks," Chapman and Hall, 1993, pp. 126-142.
[40] F. Sattler, T. Korjakow, R. Rischke, and W. Samek, "FedAUX: Leveraging unlabeled auxiliary data in federated learning," 2021.
[41] D. Li and J. Wang, "FedMD: Heterogenous federated learning via model distillation," 2019. arXiv: 1910.03581 [cs.LG].
[42] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, "Ensemble distillation for robust model fusion in federated learning," 2021. arXiv: 2006.07242 [cs.LG].
[43] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010.
[44] A. Krizhevsky, "Learning multiple layers of features from tiny images," Tech. Rep., 2009.
[45] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. arXiv: 1512.03385.

A Model architectures

A.1 CNN architectures

Table A.1: Architecture used for CNN1, CNN2 and CNN3 in the MNIST experiments. CNN1: N_Ch = 2, 1042 trainable parameters. CNN2: N_Ch = 8, 4138 trainable parameters. CNN3: N_Ch = 16, 8266 trainable parameters.

Layer             #Channels  F  S  P  Activation  Output size
Input layer       -          -  -  -  -           1x28x28
Conv              N_Ch       5  1  2  ReLU        N_Ch x28x28
Max Pooling       -          4  2  -  -           N_Ch x7x7
Fully Connected   -          -  -  -  Linear      10

Table A.2: Architecture used for CNN1 and CNN2 in the EMNIST experiments. CNN1: N_Ch = 2, 2626 trainable parameters. CNN2: N_Ch = 16, 20826 trainable parameters.

Layer             #Channels  F  S  P  Activation  Output size
Input layer       -          -  -  -  -           1x28x28
Conv              N_Ch       5  1  2  ReLU        N_Ch x28x28
Max Pooling       -          4  2  -  -           N_Ch x7x7
Fully Connected   -          -  -  -  Linear      26

Table A.3: Architecture used for CNN3 in the EMNIST experiments. 23834 trainable parameters.

Layer             #Channels  F  S  P  Activation  Output size
Input layer       -          -  -  -  -           1x28x28
Conv              8          5  1  2  ReLU        8x28x28
Max Pooling       -          2  2  -  -           8x14x14
Conv              16         5  1  2  ReLU        16x14x14
Max Pooling       -          2  2  -  -           16x7x7
Fully Connected   -          -  -  -  Linear      26

A.2 Autoencoder architectures

Table A.4: Autoencoder architecture used for MNIST and EMNIST.

Layer             #Channels  F  S  P  Activation  Output size
Input layer       -          -  -  -  -           1x28x28
Conv              8          3  2  1  ReLU        8x14x14
Batch norm        -          -  -  -  -           -
Conv              16         3  2  1  ReLU        16x7x7
Batch norm        -          -  -  -  -           -
Conv              32         3  2  0  ReLU        32x3x3
Flatten           -          -  -  -  -           288
Fully Connected   -          -  -  -  ReLU        128
Fully Connected   -          -  -  -  Linear      4
Fully Connected   -          -  -  -  ReLU        128
Fully Connected   -          -  -  -  ReLU        288
Unflatten         -          -  -  -  -           32x3x3
ConvTranspose     16         3  2  0  ReLU        16x7x7
Batch norm        -          -  -  -  -           -
ConvTranspose     8          3  2  1  ReLU        8x14x14
Batch norm        -          -  -  -  -           -
ConvTranspose     1          3  2  1  ReLU        1x28x28

Table A.5: Autoencoder architecture used for CIFAR-10.

Layer             #Channels  F  S  P  Activation  Output size
Input layer       -          -  -  -  -           3x32x32
Conv              32         3  2  1  ReLU        32x16x16
Batch norm        -          -  -  -  -           -
Conv              64         3  2  1  ReLU        64x8x8
Batch norm        -          -  -  -  -           -
Conv              128        3  2  0  ReLU        128x3x3
Flatten           -          -  -  -  -           1152
Fully Connected   -          -  -  -  ReLU        128
Fully Connected   -          -  -  -  Linear      64
Fully Connected   -          -  -  -  ReLU        128
Fully Connected   -          -  -  -  ReLU        1152
Unflatten         -          -  -  -  -           128x3x3
ConvTranspose     64         3  2  0  ReLU        64x8x8
Batch norm        -          -  -  -  -           -
ConvTranspose     32         3  2  0  ReLU        32x16x16
Batch norm        -          -  -  -  -           -
ConvTranspose     3          3  2  0  ReLU        3x32x32
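For reference, the CNN3 architecture for EMNIST in Table A.3 could be written in PyTorch roughly as below. This rendering is reconstructed from the table rather than taken from the project code, so details such as the bias terms are assumptions; with the listed settings it reproduces the stated 23834 trainable parameters.

```python
import torch.nn as nn

class CNN3EMNIST(nn.Module):
    """CNN3 for EMNIST as listed in Table A.3 (23834 trainable parameters)."""
    def __init__(self, n_classes=26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, stride=1, padding=2),    # 8x28x28
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                   # 8x14x14
            nn.Conv2d(8, 16, kernel_size=5, stride=1, padding=2),    # 16x14x14
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                   # 16x7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, n_classes)           # linear output layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

# Sanity check of the parameter count:
# sum(p.numel() for p in CNN3EMNIST().parameters())  ->  23834
```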
B Training curves

B.1 MNIST

Figure B.1: MNIST: FedAvg local training accuracies with the CNN1 model for 3 different seeds.
Figure B.2: MNIST: FedProx local training accuracies with the CNN1 model for 3 different seeds.
Figure B.3: MNIST: FedAvg global test accuracy with the CNN1 model for 5 different seeds.
Figure B.4: MNIST: FedProx global test accuracy with the CNN1 model for 5 different seeds.
Figure B.5: MNIST: FedED local training accuracies with the CNN1 model for 3 different seeds.
Figure B.6: MNIST: FedED-w0 student training accuracy with the CNN3 model and MSE loss for 1 seed.
Figure B.7: MNIST: FedED-w1 student training accuracy with the CNN3 model and MSE loss for 1 seed.
Figure B.8: MNIST: FedED-w2 student training accuracy with the CNN3 model and MSE loss for 1 seed.
Figure B.9: MNIST: FedED-w0 student training accuracy with the CNN3 model and CE loss for 1 seed.

B.2 EMNIST

Figure B.10: EMNIST: FedAvg local training accuracies with the CNN1 model for 3 different seeds.
Figure B.11: EMNIST: FedProx local training accuracies with the CNN1 model for 3 different seeds.
Figure B.12: EMNIST: FedAvg global test accuracy with the CNN1 model for 5 different seeds.
Figure B.13: EMNIST: FedProx global test accuracy with the CNN1 model for 5 different seeds.
Figure B.14: EMNIST: FedED local training accuracies with the CNN1 model for 3 different seeds.
Figure B.15: EMNIST: FedED-w0 student training accuracy with the CNN3 model and MSE loss for 1 seed.
Figure B.16: EMNIST: FedED-w1 student training accuracy with the CNN3 model and MSE loss for 1 seed.
Figure B.17: EMNIST: FedED-w2 student training accuracy with the CNN3 model and MSE loss for 1 seed.
Figure B.18: EMNIST: FedED-w0 student training accuracy with the CNN3 model and CE loss for 1 seed.

B.3 CIFAR-10

Figure B.19: CIFAR-10: FedAvg local training accuracies with the Resnet-18 model for 3 different seeds.
Figure B.20: CIFAR-10: FedProx local training accuracies with the Resnet-18 model for 3 different seeds.
Figure B.21: CIFAR-10: FedAvg global test accuracy with the Resnet-18 model for 5 different seeds.
Figure B.22: CIFAR-10: FedProx global test accuracy with the Resnet-18 model for 5 different seeds.
Figure B.23: CIFAR-10: FedED local training accuracies with the Resnet-18 model for 3 different seeds.
Figure B.24: CIFAR-10: FedED-w0 student training accuracy with the Resnet-18 model and MSE loss for 1 seed.
Figure B.25: CIFAR-10: FedED-w1 student training accuracy with the Resnet-18 model and MSE loss for 1 seed.
Figure B.26: CIFAR-10: FedED-w2 student training accuracy with the Resnet-18 model and MSE loss for 1 seed.
Figure B.27: CIFAR-10: FedED-w0 student training accuracy with the Resnet-18 model and CE loss for 1 seed.