Distributed Semi-Supervised Learning and Audio Recognition of Road Surfaces Master’s thesis in Computer science and engineering Adam Davidsson Simon Larsson Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2022 Master’s thesis 2022 Distributed Semi-Supervised Learning and Audio Recognition of Road Surfaces Adam Davidsson Simon Larsson Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2022 Distributed Semi-Supervised Learning and Audio Recognition of Road Surfaces Adam Davidsson Simon Larsson © Adam Davidsson, 2022. © Simon Larsson, 2022. Supervisor: Yehia Abd Alrahman, Computer Science and Engineering Advisor: Tomas Carlfalk, WirelessCar Examiner: Krasimir Angelov, Computer Science and Engineering Master’s Thesis 2022 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Gothenburg, Sweden 2022 iv Distributed Semi-Supervised Learning and Audio Recognition of Road Surfaces Adam Davidsson Simon Larsson Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract The automotive industry is a promising environment for machine learning. However, current machine learning techniques do not meet all the requirements of many possi- ble applications. Requirement such as privacy preservation, limited communication and semi-supervision. To satisfy these requirements, this thesis proposes a simple distributed semi-supervised algorithm (distributed FixMatch). Furthermore, we ap- ply this algorithm to a real-world problem, detecting road surface types from audio. In applying the semi-supervised algorithm to this problem, we also propose a simple augmentation technique for audio features. The proposed algorithm was tested on two real datasets, where the algorithm was compared to a supervised training algo- rithm. The results suggest that the algorithm successfully leveraged unlabeled data. Furthermore, a theoretical analysis and a simulation show that the communication cost of the proposed algorithm was lower than federated or centralized alternatives. Keywords: Machine learning, Federated learning, Semi-Supervised learning, Dis- tributed learning, Audio Recognition, Road Surface detection, Neural Network. v Acknowledgements We want to thank our supervisor, Yehia Abd Alrahman for his contributions to this thesis. We also want to thank Jens Andersson, Tomas Carlfalk, Anna Gunlycke, and Jeffrey Spång at Wirelesscar for their help during the project. Finally, we want to thank Linn Alholt at New Minds for connecting us with WirelessCar. Adam Davidsson, Gothenburg, June 2022 Simon Larsson, Gothenburg, June 2022 vii Contents List of Figures xi List of Tables xiii 1 Introduction 1 1.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.1 Federated learning and distributed learning . . . . . . . . . . 3 1.3.2 Semi-supervised learning . . . . . . . . . . . . . . . . . . . . . 4 1.3.3 Audio classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Theory 7 2.1 Audio recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 7 2.1.1 Signal processing & Feature extraction . . . . . . . . . . . . . 7 2.1.2 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . 9 2.2 Federated learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 Semi-supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1 FixMatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Distributed algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.1 Model of Communication . . . . . . . . . . . . . . . . . . . . 20 3 Methods 23 3.1 Data Gathering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.1 Road surface noise dataset . . . . . . . . . . . . . . . . . . . . 24 3.2.2 UrbanSound8K . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.1 Auditory Spectral Features . . . . . . . . . . . . . . . . . . . . 26 3.3.2 Feature set evaluation . . . . . . . . . . . . . . . . . . . . . . 27 3.3.3 eGeMAPSv02 . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Model Architecture & Training . . . . . . . . . . . . . . . . . . . . . 29 3.4.1 Distributed FixMatch . . . . . . . . . . . . . . . . . . . . . . 30 3.5 Hyperparameter optimization . . . . . . . . . . . . . . . . . . . . . . 33 3.6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 ix Contents 4 Results 35 4.1 Communication overhead . . . . . . . . . . . . . . . . . . . . . . . . . 39 5 Concluding remarks 43 5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Bibliography 47 A Appendix 1 I x List of Figures 2.1 A short audio recording of a wheel driving on asphalt, represented in the time domain (a) and in the frequency domain (b). . . . . . . . . . 8 2.2 A fully connected neural network. . . . . . . . . . . . . . . . . . . . 9 2.3 Two types of federated learning. 2.3b illustrates a bank and a shop using a coordinator to train together without revealing private infor- mation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 A semi-supervised example. The stars represent labeled data. The black line represents the decision boundary for only training on the labeled data. The grey line represents the decision boundary for train- ing on the labeled data and the unlabeled data . . . . . . . . . . . . . 15 2.5 An example of consistency loss. The arrows represent data augmen- tations. Two random augmentations are applied to each data point. After the augmentations, the two new points should likely remain in the same class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.6 A diagram illustrating the difference between a centralized, decen- tralized, and distributed network topology. . . . . . . . . . . . . . . . 20 3.1 The placement of the microphone when collecting road noise for the road surface noise dataset. . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 State transitions of distributed FixMatch . . . . . . . . . . . . . . . . 30 3.3 The distribution of feature mfcc1_sma3_amean for all road surface data points visualized for each class. . . . . . . . . . . . . . . . . 
. . 32 4.1 The accuracy difference of distributed FixMatch vs a central super- vised model for the Road Surface dataset using a fixed number of labeled training examples for each class. . . . . . . . . . . . . . . . . 36 4.2 The accuracy difference of distributed FixMatch vs a central super- vised model for the UrbanSound8K dataset using a fixed number of labeled training examples for each class. . . . . . . . . . . . . . . . . 36 4.3 The distribution of feature mfcc1_sma3_amean for all UrbanSound8K data points visualized for each class. . . . . . . . . . . . . . . . . . . 37 4.4 Accuracy development of the nodes during training on the Urban- Sound8K dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.5 Accuracy of the different road surfaces. Trained using distributed Fix- Match. Road surface noise accuracy was taken when a client reached at least 87% accuracy. An average of 5 runs was taken . . . . . . . . 39 xi List of Figures 4.6 The communication overhead of the most strained node for Dis- tributed FixMatch vs Federated training . . . . . . . . . . . . . . . . 42 xii List of Tables 3.1 Parameters of the cars used for recording . . . . . . . . . . . . . . . . 23 3.2 Accuracy of different feature sets when used to train a neural network to classify the Road surface noise dataset . . . . . . . . . . . . . . . . 27 3.3 Accuracy of a distributed FixMatch when trained on the Urban- Sound8K dataset for different combinations of hyperparameters. . . . 33 4.1 Number of clients affect on accuracy . . . . . . . . . . . . . . . . . . 38 4.2 Communication complexity. Where n is the number of nodes partic- ipating in the training, d is the client dataset size, r is the number of server rounds during training and m is the model size. . . . . . . . . . 40 4.3 Number of communication messages sent for distributed FixMatch vs federated training until achieving a threshold accuracy . . . . . . . . 41 A.1 Accuracies depicted in Figure 4.2 and Figure 4.1 . . . . . . . . . . . . I xiii List of Tables xiv 1 Introduction Recent years have witnessed an increasing interest in machine learning. This trend has been especially strong in the automotive industry, with use cases such as au- tonomous vehicles and automatic braking systems. This trend can be shown by the fact that around 95% of all new vehicles in the next ten years may have some form of AI [18]. This is a significant increase compared to 2015, where the number was as low as 5-10%[18]. One of the potential use cases for machine learning in the automotive industry is road surface detection. The idea is to detect which type of road surface the vehicle is driving on, e.g., asphalt, gravel, or icy asphalt. In addition to detecting road surface, the roughness of different roads could be detected. Road surfaces and road roughness can be used to determine the friction forces between the road and the wheels, which is important when developing safety measures. Another use case could be the prediction of energy consumption in electric vehicles, which is also dependent on the road material [45]. Machine learning is a resource-intensive process. The effectiveness of a model largely depends on the amount and the quality of data used in the training process [20]. Massive amounts of data are constantly collected by vehicles. Modern vehicles have around 60-100 sensors and 30-50 computers [24]. However, sending massive amounts of data to a centralized location is expensive and scales badly[33]. 
One way to solve this problem is to distribute the learning process among different vehicles. A vehicle in a federated system could send generalized knowledge, instead of raw data, to a central server. The server can then share the knowledge with other vehicles. How- ever, data would still be sent to a centralized location which could cause problems with scaling. The alternative would be a fully distributed system. Where every par- ticipant takes turns in acting as client and server. Two other concerns strengthen the case for a distributed system. Firstly, the collection of private data from vehi- cles. A 2019 survey found that 79% of Americans were concerned with how data was collected by companies [8]. This concern could be alleviated by only collecting the data important for learning instead of collecting all raw data. Secondly, renting or buying computers to train the model in a centralized way is expensive. One way to avoid this would be to use the already existing computers in vehicles. 1 1. Introduction 1.1 Problem Our research problem is focused on studying if federated machine learning and semi- supervised machine learning techniques can be adapted to the requirements of the automotive industry. We will focus on one automotive industry machine learning problem, classifying the road surface a car is driving on, through audio recognition. Classifying road noise in the automotive setting comes with a few problems. Firstly, each vehicle has the possibility to gather large amounts of unlabeled data. Unlabeled data lacks any information describing it. In contrast, labeled data has been tagged with some information that explains it. For instance, an audio recording tagged with the information that it is a recording of a vehicle driving on asphalt. Labeling the data gathered by the cars is resource-intensive. Therefore the amount of labeled data is restricted. Labeled data is much more useful for training classification mod- els. With labeled data, we can train a model to connect the features X to the Label Y. However, with unlabeled data we have no such labels to connect the features with. This makes classification harder. Secondly, the amount of information that can be communicated between the vehicles and a potential server is limited. The communication between the vehicles should be minimized since it is expensive. Fur- thermore, the amount of data a server can receive is limited because of bandwidth. This causes scaling problems when the number of vehicles and updates increases. The third problem is how to preserve privacy. Generalized knowledge should be extracted from the raw data without sending it to a central location, i.e., the data should be kept on the vehicle. To address these problems, a distributed algorithm, based on previous federated machine learning and semi-supervised machine learning techniques, is proposed. Federated learning techniques allow for learning without sending raw data to a central location. Semi-supervised learning techniques allow for both unlabeled and labeled data to be used in training. These techniques are combined in a distributed algorithm. The distribution lowers the communication overhead. 1.2 Our Contribution The high-level goal of this project is to determine whether federated learning can effectively be applied to problems in the automotive industry. To be useful to the automotive industry it needs to meet the following design requirements[33][8][24]: • The communication costs between the participants (cars) should be minimized. 
• The training needs to work on unlabeled data. • The previous two goals need to be accomplished without loss of privacy or accuracy To fulfill these goals this project will combine state-of-the-art federated learning, semi-supervised learning, and audio classification techniques. To detect the road surface, a neural network will be trained on features extracted from road noise. To 2 1. Introduction minimize communication costs, the training will be performed with our proposed distributed algorithm. The algorithm is based on federated learning techniques to avoid uploading personal data and semi-supervised learning to leverage unlabeled and labeled data. To our knowledge, we are the first to do this. So our contribution will be: • Proposing a novel distributed algorithm that combines FixMatch (a state-of- the-art semi-supervised algorithm) and FedAvg (a state-of-the-art federated learning algorithm) • Proposing a simple augmentation technique for audio features to be used in semi-supervised machine learning • Classifying road surfaces based on road noise with a neural network 1.3 Related works To solve the problems described in the previous section, techniques from federated learning, semi-supervised learning, and audio classification will be used. This section will introduce some previous work that has been done in these fields. 1.3.1 Federated learning and distributed learning One proposed method for solving the problem of distributing the learning process is federated learning (FL) [28]. The method usually consists of loosely connected devices (called clients) and a central server. The server coordinates the sharing of knowledge between the clients. FedAvg [28] is one such federated learning method. In FedAvg, the server sends its model to the clients. The clients then train the model on their local data and send the resulting model back. The server then repeats the process with an average of all the models it received from the clients. FedAVG has been empirically shown to work. It could achieve 85% accuracy on the Cifar-10 dataset after 2000 communication rounds. Another proposed algorithm for performing federated learning is FedMatch [22]. It operates much in the same way as FedAvg with clients training their own models and a central server aggregating the knowledge learned by the models. However, unlike traditional FL strategies, FedMatch can be used in scenarios where only a small amount of labeled data is available at the central server. The remaining data is unlabeled and distributed among many clients. The main ideas of FedMatch are pa- rameter decomposition and inter-client consistency loss. FedMatch decomposes the parameters of the model to allow disjoint learning on the labeled and the unlabeled data. One set of parameters is trained on the labeled data and one set is trained on the unlabeled data. Separating the parameters prevents for loss of knowledge learned on the labeled data which improves accuracy [22]. Inter-client consistency loss regularizes the models learned at multiple clients to output the same prediction [22]. This allows for improved model consistency across the multiple models and improved accuracy. 3 1. Introduction Another approach was taken by [46]. They introduced a supervised algorithm (DADA) without a central server. Instead, the collaboration is done through peer- to-peer updates. Each peer exchanges updates with only a few neighbors. This exchange can be described by a collaboration graph. 
This graph is also optimized such that peers with similar tasks share knowledge with each other. The algorithm is based on Frank-Wolfe i.e. conditional gradient descent, and peer sampling service. Tests on synthetic data showed that the Dada successfully achieved high accuracy while keeping the collaboration graph sparse. Several distributed semi-supervised algorithms have been developed [16][47][36]. However, these algorithms tend to be too complicated to implement [16][47]. Fur- thermore, they have been implemented with image recognition or general classifi- cation tasks in mind [16][47]. Additionally, one of the algorithms was proposed to solve the binary classification problem [36]. 1.3.2 Semi-supervised learning MixMatch was proposed to solve the challenge of using both labeled and unlabeled data to train an image classifier [7]. The method is based on three techniques: entropy minimization, which is used to make confident predictions on unlabeled data; consistency regularization, which means that the same output should be produced by the model when the input data is slightly altered; and generic regularization, which aims to generalize the model and avoid overfitting. The three techniques of MixMatch are built around image augmentation. The unlabeled data is augmented k times. The average class distribution of the k augmentations is then used as a label in training. Tests on well-known datasets showed that MixMatch achieved a significantly improved performance over other semi-supervised methods. Another proposed method is Unsupervised Data Augmentation (UDA) [44]. UDA is very similar to MixMatch. It also uses image augmentation to perform semi- supervised learning. What differentiates UDA from MixMatch is the way data aug- mentation is performed. MixMatch only uses weak augmentation, both on labeled and unlabeled data. UDA, however, uses strong augmentation for the unlabeled data. The stronger data augmentations are used to get more diverse input data. UDA also introduces a classification threshold. This threshold is used to filter out unlabeled examples that the model can not classify with high confidence. UDA achieves great results for well-known image classification tasks. 1.3.3 Audio classifiers Several different papers have been published in the field of audio classifiers. A paper published in 2013 introduces a method for classifying road surface wetness from audio [3]. The article extracts auditory spectral features by applying the short-time Fourier transform. The features are converted to Log-Mel spectrograms. These transformed features are then used to train a recurrent neural network (RNN). The network can predict road wetness with an Unweighted Average Recall of 93.2% 4 1. Introduction Another article published in 2021 uses a similar method of auditory feature extrac- tion [39]. The paper introduces a method for training convolutional neural networks in the supervised federated setting. They evaluate their model on three different data sets: Speech Commands [43], Ambient Context, [32] and VoxForge [42]. They achieve an accuracy of 96.5%, 73.0%, and 79.6% respectively in the centralized set- ting. And 96.9%, 71.9%, and 79.1% in the federated setting. A different classifier was tried by [4]. Their goal was to classify the road surface roughness using audio. They also used a short-time Fourier transform to extract auditory features and then converted them to Log-Mel spectrograms. 
However, instead of using an RNN for the classification, they used a Convolutional Neural Network (CNN). The trained model was able to classify the audio into two classes, rough and smooth, with an F1-score of 86%. This approach was further worked on [19]. Instead of using a CNN, a Siamese Neural Network (SNN) was used. The SNN was used since it has good generalization properties on unbalanced data. The new approach improved the performance of the classification to an F1-score of 95.6%. An audio classifier was proposed by [31]. They used a semi-supervised algorithm called FixMatch [38] to perform the training. FixMatch was originally proposed to solve image recognition tasks. However, [31] evaluated the performance of FixMatch on audio recognition. To get FixMatch to work with audio signals, they converted the audio signals to spectrograms. The spectrograms were then used as input im- ages to train a CNN using FixMatch. Their results showed that FixMatch doesn’t necessarily work out-of-the-box on audio data. 1.4 Overview In the theory chapter, we present the background knowledge needed to follow the report. We begin by explaining the audio recognition techniques. We then describe the federated learning knowledge needed to understand our algorithm. We also explain how semi-supervised learning works and present an algorithm that our im- plementation builds upon. Lastly, general distributed algorithms are explained to understand how an algorithm can be decentralized. The methodology chapter describes the approach which is taken to fulfill the goal presented in the introduction chapter. First, we describe how road noise data will be gathered. Then, we present the feature extraction method and why it was chosen. The following section describes the method by which those features will be classified. The following two sections go into how we optimize and simulate the training. Lastly. we describe how the goals will be evaluated. In the results chapter, we present the result of the evaluations. We demonstrate the trained classifier accuracy and how it compares to other approaches. We also eval- uate the communication costs of our approach compared to alternative approaches. The Discussion chapter mentions challenges we faced during the thesis project. We 5 1. Introduction also discuss the results and analyze what was achieved. Lastly, we mention future work that can be done to make improvements. 6 2 Theory This section will give a detailed background to the techniques used to solve the problems described in the previous chapter. The first section will give a basic intro- duction to audio recognition and neural networks since it is crucial to understanding the evaluation method. The second section will give an introduction to federated learning. Additionally, we will explain different strategies to perform federated learn- ing. We will also give a description of an algorithm, FedAvg, that will be used as part of our method. These three topics are important for understanding the federated component of distributed FixMatch and the context it was developed in. The third section will give an overview of different semi-supervised techniques. Furthermore, it will give a detailed description of the FixMatch algorithm. This section is also included to give context and basic background for the semi-supervised component of distributed FixMatch. The final section will give a short explanation of distributed methods to understand how a centralized algorithm can be distributed. 
2.1 Audio recognition Audio recognition is a technology used to identify and classify audio data. It can be used to classify speech, music, emotions, etc. There are many different approaches to classifying audio. However, the methods can usually be broken down into three steps: signal processing, feature extraction, and classification. Signal processing converts the input audio from the time domain into the frequency domain. Feature extraction extracts relevant feature vectors from the processed data. Lastly, a classifier uses the extracted features to make classifications. 2.1.1 Signal processing & Feature extraction Signal processing and feature extraction is the process of extracting numerical fea- tures from audio signals. Audio signals come in two forms: analog and digital. Analog sound is a replica of the sound wave, while digital sound is built up from samples of the original sound wave. Audio signals can be represented in two domains. In the time domain and in the frequency domain. The time domain shows the amplitude of the audio signal. With analog sound, it shows a continuous amplitude curve while digital sound shows the 7 2. Theory (a) Time domain (b) Frequency domain Figure 2.1: A short audio recording of a wheel driving on asphalt, represented in the time domain (a) and in the frequency domain (b). amplitude at each sample point. In the frequency domain, the amplitude is shown along a frequency axis instead of the time axis. To transform the signal from the time domain to the frequency domain, a Fourier Transform is applied. Applying a Fourier Transform splits a signal into its constituent frequencies. These frequencies can then be shown in the frequency to amplitude representation. In Fig. 2.1a and Fig. 2.1b we show the differences between the two representations. In Fig. 2.1a, an audio recording of a wheel driving on asphalt is represented in the time domain. In Fig. 2.1b the same audio clip is represented but in the frequency domain. From the time domain and the frequency domain, low-level descriptors (LLD) are extracted. Low-level descriptors are sound parameters and are closely related to the sound signal. Examples of LLD are pitch, loudness, and spectral slope. Low-level descriptors are represented as floating-point values in a 1-dimensional array. The LLD are fed to functions to obtain a single value from each LLD. Two examples of such functions are arithmetic mean and standard deviation. The function values of all LLD are then combined to form a resulting feature extraction set. To extract LLD and apply functionals, a toolkit called openSMILE [15] will be used. Feature extraction is used to reduce the dimensionality of the input data, i.e., the number of features used by the model to make a prediction. Reducing the number of features has several benefits. First, it reduces the number of computations required in training and it can reduce the size of the data. Fewer computations mean faster training speed and therefore possibly better results. A second benefit is the removal of multicollinearity, i.e., several independent variables that are correlated, which can be a problem in regression models. However, multicollinearity is less of a problem for neural networks[41]. A third benefit of feature extraction is that it can help with the curse of dimensionality. The curse of dimensionality refers to various problems occurring when analyzing data with many dimensions, that do not occur in lower dimensions. 
One such problem is that more features require more training examples [9].

2.1.2 Artificial neural networks

Artificial neural networks (ANNs) consist of groups of connected nodes and weights. An ANN acts as a function: given a certain input it produces a certain output. Furthermore, an ANN can be trained to produce a desired output for a given input through the backpropagation algorithm.

Layers

Figure 2.2: A fully connected neural network.

The nodes in an ANN are connected to each other through weights and are divided into layers. There are three types of layers: the input layer, the hidden layers, and the output layer. The number of layers, weights, and nodes depends on the application. In a fully connected neural network, each node is connected with weights to all the nodes in the layer below it, as shown in Fig. 2.2.

The basic idea of the ANN is that each node should activate on some input. When one node activates, it causes nodes in later layers to activate, producing a hierarchical activation of nodes. Take image classification as an example: the nodes in the input layer may activate when a certain pixel has a specific value. A group of nodes in the input layer could represent an edge. When they activate, they could cause a node in the next hidden layer to activate. That node could be part of a group that represents a dog ear or a cat eye, which in turn could cause a node in the output layer to activate, indicating whether the image depicts a dog or a cat.

Gradient descent

ANNs are trained using gradient descent. The goal of gradient descent is to decrease a loss function. When the slope of the loss function is 0, a minimum of the loss has been found, so the gradient descent algorithm repeatedly steps towards a minimum of the loss function. To achieve this, the loss function is first applied to the model prediction and the real label. The loss function calculates some distance between the predicted and real values. A commonly used loss function is cross-entropy loss. For n classes, cross entropy is defined as:

L_CE = −Σ_{i=1}^{n} y_i log ŷ_i    (2.1)

where y_i is the real value and ŷ_i is the predicted value fed through an activation function. The activation function used with cross-entropy loss is usually softmax. Softmax transforms the output values into an array of probabilities that indicate how likely the input is to belong to each class. Secondly, to decrease the loss function, its gradient is calculated using backpropagation. Finally, the parameters of the model are updated in the direction of the negative gradient to minimize the loss function:

w_{t+1} = w_t − λ ∂L/∂w_t    (2.2)

where w_t are the parameters at time t, L is some loss function, and λ is the step size, which determines how large a step the parameters take along the calculated gradient. A commonly used gradient descent algorithm is Stochastic Gradient Descent (SGD) [34]. SGD is very similar to gradient descent; the difference is that SGD stochastically chooses one training sample, or a small subset, to perform the gradient step on, whereas gradient descent uses all the training samples for each update. This makes each SGD update considerably cheaper. However, an SGD update generally does not move exactly along the true gradient of the loss function, so SGD is computationally faster per step but requires more rounds to converge.
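To make the update rule concrete, the following minimal NumPy sketch computes a softmax cross-entropy loss and performs one SGD step for a single linear layer. It is an illustrative example only; the names (W, x, y_true, lr) are ours and not taken from the thesis implementation.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, y_true):
    # Equation 2.1: L = -sum_i y_i * log(p_i), with y_true one-hot.
    return -np.sum(y_true * np.log(p + 1e-12))

# Toy single-layer classifier: logits = W @ x
rng = np.random.default_rng(0)
x = rng.normal(size=4)            # one feature vector
y_true = np.array([0., 1., 0.])   # one-hot label, 3 classes
W = rng.normal(size=(3, 4))       # model parameters

p = softmax(W @ x)
loss = cross_entropy(p, y_true)

# The gradient of softmax cross-entropy w.r.t. the logits is (p - y_true),
# so dL/dW = outer(p - y_true, x). One SGD step (Equation 2.2):
grad_W = np.outer(p - y_true, x)
lr = 0.1
W = W - lr * grad_W
print(loss)
```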
Backpropagation

Backpropagation is the algorithm that determines how the weights and biases should be updated to minimize the loss function. For a single training example, it calculates how the weights and biases should be changed to decrease the loss function the most. Backpropagation is performed by computing the partial derivatives of the loss function with respect to the weights and biases, ∂L/∂w^l_{jk} and ∂L/∂b^l_j, for all layers. To compute the partial derivatives, backpropagation first computes the error, δ^l_j, of the jth node in layer l:

δ^l_j = ∂L/∂z^l_j    (2.3)

where z^l_j is the weighted input of node j in layer l. To get the partial derivatives from the error, backpropagation relates δ^l_j to ∂L/∂w^l_{jk} and ∂L/∂b^l_j using four equations [30]. Combining those four equations makes it possible to calculate the gradient of the loss function.

The first equation calculates the error in the output layer l_o:

δ^{l_o}_j = (∂L/∂a^{l_o}_j) σ′(z^{l_o}_j)    (2.4)

Here, the term ∂L/∂a^{l_o}_j shows how the loss changes with respect to the output activation of neuron j. The term σ′(z^{l_o}_j) shows at which rate the activation function σ changes with the weighted input z^{l_o}_j.

The second equation uses the error in the next layer, δ^{l+1}, to calculate the error in the current layer, δ^l:

δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)    (2.5)

Thus, by combining Equation 2.4 and Equation 2.5, the error δ^l for any layer can be computed by starting from the error at the output layer and propagating backward through the layers.

The third equation shows how the loss changes in relation to the bias of any node in any layer:

∂L/∂b^l_j = δ^l_j    (2.6)

The partial derivative with respect to the bias, ∂L/∂b^l_j, is exactly equal to the error δ^l_j. Since combining Equation 2.4 and Equation 2.5 yields every component δ^l_j of δ^l, one of the two partial derivatives has now been linked to the error.

The fourth equation shows how the loss changes in relation to an individual weight in any layer:

∂L/∂w^l_{jk} = a^{l−1}_k δ^l_j    (2.7)

The term a^{l−1}_k is the activation output of node k in the previous layer, and δ^l_j is the node error in the current layer. Both terms can already be computed, so the second partial derivative has also been linked to the error. Combining the four equations gives the full backpropagation method. Gradient descent can then be used together with backpropagation to update the weights and biases of the network.

Overfitting

One problem which can occur when training a model is overfitting. An overfitted model has been trained to fit the training data too closely: it models noise in the training data that does not appear in other data, which makes the model worse at classifying new data. There are several ways to reduce the risk of overfitting. One method is to decrease the number of features the model is trained on, removing unnecessary input that acts as noise. Another method is regularization, which penalizes large model parameters and therefore produces a less complex model with less opportunity to overfit. A third method is dropout. Dropout ignores some nodes in the hidden layers; ignoring a node means that the node, and all its connected edges, are not used in that training round. More specifically, for each node there is a probability p of that node being ignored.
Dropout helps reduce overfitting by reducing the probability of co-dependency between nodes. 2.2 Federated learning Federated learning is a technique to leverage private data without sending it to a central server. The term was first used by a paper from google [28]. Their goal was to "present a practical method for the federated learning of deep networks based on iterative model averaging". Federated learning is defined to solve problems with the following four properties[28]: • The data belonging to one participant may not be representative of the data belonging to the other participants • Some participants will have more training time than others • The number of participants will be larger than the average amount of data per participant • The participants are limited in their communication Federated learning, in its basic form, consists of a central server and loosely con- nected participants (called clients). The server coordinates the sharing of knowledge between the clients. The sharing can be done through many different types of strate- gies. However, it is done without sharing the actual data contained on the clients. Instead, the clients send their model parameters to the server which combines them. This combination is then returned to the clients. The clients then continue train- ing on the combined model and again return the results to the server. This is the approach of the original FedAvg strategy [28]. Since federated learning first was introduced, many different approaches have been proposed. A survey on federated learning [29] classified those approaches in 5 ways: network topology, data partition, data availability, aggregation and optimization algorithm, and open-source frameworks. Network topology refers to how the training and aggregation of models are per- formed. In the original centralized approach, as previously mentioned, the server simply manages and aggregates the training from the asynchronous updates it re- ceives from the clients. Another approach is user clustering, where users are grouped based on how similar their data distribution is. This clustering is used to reach con- vergence faster. The third approach is distributed (fully decentralized) federated 12 2. Theory (a) A distributed learning topology (b) Vertical learning Figure 2.3: Two types of federated learning. 2.3b illustrates a bank and a shop using a coordinator to train together without revealing private information learning. Here, each participant shares the information in a peer-to-peer fashion, i.e., there is no central server. The distributed topology is illustrated in Fig. 2.3a. Data partition refers to the way the data is partitioned between different partic- ipants. The division can be made broadly into horizontal, vertical, and transfer learning. Horizontal learning is defined as when each participant’s dataset shares the same features, but the instances are different. This is the classical federate learning, e.g., when many different phones are used to train text prediction. Ver- tical federated learning is defined by the use of different feature spaces to jointly train a model. For instance, when a bank and an internet shop try to train a model together. The bank has credit data while the internet shop has order data. This data could be used to jointly train a purchase prediction model. Vertical learning is illustrated in Fig. 2.3b. The last approach is federated transfer learning. 
Here, traditional ML transfer learning techniques (i.e., knowledge from solving one problem is used to solve another) are applied in the federated domain.

Data availability refers to the availability of the participants to train. There are two categories: cross-silo FL and cross-device FL. In cross-silo FL the number of participants is usually in the range of 2-100, the devices are indexed, and they are usually always available. In contrast, the cross-device approach usually has a very large number of participants, the connections are less reliable, and the participation is more random.

Aggregation and optimization strategy controls the training flow. In centralized federated learning, the strategy chooses which clients to use for training and then combines the parameters returned by the clients at the end of the round. A distributed federated learning strategy instead manages the interaction of the participants, for instance by assigning the participants temporary roles as servers and clients. The first proposed centralized strategy was FedAvg, described in Algorithm 1, in which the server acts as an orchestrator by sending out its parameters to a random subset of clients. These clients train on the parameters and then return them to the server. The server computes an average and repeats the process. Some other variants are:

• SMC-AVG: A strategy based on Secure Multiparty Computation. It allows distrustful participants to collaboratively calculate sums of their private values without revealing their own private values to the other participants. This lets the clients share knowledge without revealing their gradients, which makes it even harder to derive private information.
• FedMA: proposed for creating CNN and LSTM (Long Short-Term Memory) updates in an FL environment.
• FedMatch: a semi-supervised federated algorithm.

Federated learning can be implemented using many different frameworks. Some examples are TensorFlow Federated, PySyft, Flower, and FATE. However, most are catered to the centralized topology.

ALGORITHM 1: FedAvg
RunServer():
    initialize w_0
    for each round t = 1, 2, ... do
        m ← max(C · K, 1)
        S_t ← (random set of m clients)
        for each client k ∈ S_t in parallel do
            w^k_{t+1} ← ClientUpdate(k, w_t)
        end
        w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w^k_{t+1}
    end
RunClient(k, w):
    B ← (split P_k into minibatches of size B)
    for each local epoch e from 1 to E do
        for batch b ∈ B do
            w ← w − η∇ℓ(w; b)
        end
    end
    return w to the server
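To illustrate the server-side aggregation step of Algorithm 1, the sketch below averages client models weighted by their local dataset sizes. It is a minimal NumPy illustration of the idea, not the implementation used in this thesis; the names (federated_average, client_weights, client_sizes) are our own.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg aggregation).

    client_weights: list of 1-D arrays, one flattened parameter vector per client.
    client_sizes:   list of local dataset sizes n_k, one per client.
    """
    total = sum(client_sizes)
    avg = np.zeros_like(client_weights[0])
    for w_k, n_k in zip(client_weights, client_sizes):
        avg += (n_k / total) * w_k   # w_{t+1} = sum_k (n_k / n) * w^k_{t+1}
    return avg

# Toy round with three clients holding different amounts of data.
clients = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 1.0])]
sizes = [100, 50, 50]
print(federated_average(clients, sizes))  # -> [1.25 1.25]
```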
2.3 Semi-supervised learning

Semi-supervised learning (SSL) tries to solve machine learning problems with a small amount of labeled data and a larger amount of unlabeled data. The idea is to leverage the knowledge from the labeled data together with some underlying structure of the unlabeled data. For instance, suppose we create a text classifier that classifies what sort of food a recipe describes, e.g., Italian, French, or Chinese. We may have a lot of pizza recipes labeled as Italian. These recipes most often include tomato sauce as an ingredient. Therefore, our classifier will find that tomato sauce is a strong indicator of Italian food. This is where semi-supervised learning can be leveraged. In the unlabeled data, there may be a lot of recipes that use both the words olive oil and tomato sauce. These two words might also often be connected with pasta. This could guide the classifier toward labeling pasta recipes as Italian, despite not getting any such labeled examples. A two-dimensional classification problem on synthetic data is shown in Fig. 2.4.

Figure 2.4: A semi-supervised example. The stars represent labeled data. The black line represents the decision boundary when training only on the labeled data. The grey line represents the decision boundary when training on both the labeled and the unlabeled data.

A necessary condition for semi-supervised learning to be useful is that the underlying marginal data distribution, p(x), over the input space contains some information about the posterior distribution, p(y|x) [40], i.e., the input data must contain some information about the desired output. This condition has been shown to hold in most cases, by the successful application of semi-supervised learning to many learning problems. However, the interaction between p(x) and p(y|x) is not always the same. These interactions have been formalized using three assumptions: points that are close to each other should share a label, clusters often share a label, and the data lies roughly on a manifold (an object that looks flat to an ant in a lower dimension, e.g., a sphere in 3 dimensions) with a lower dimension than the input space.

According to [40], semi-supervised algorithms can be either inductive or transductive. The goal of inductive methods is to create a classifier that can make predictions for any object in the input space. In contrast, transductive methods do not create a classifier for the entire input space; their predictive power is limited to the exact data experienced during training.

The inductive methods can be further divided into three categories. Firstly, wrapper methods first let the model be trained in a supervised fashion. The classifier is then used to construct pseudo-labeled data, which is in turn used to train the model. The second category is unsupervised preprocessing, which can be done in three ways: useful features are extracted, the data is pre-clustered, or the initial parameters for the supervised training are chosen with an unsupervised method. The last category directly includes unlabeled data in the objective function or optimization procedure of the learning method.

All transductive methods are either explicitly graph-based or can be understood in such a manner. This is caused by the fact that there exists no model of the input space; instead, information is propagated via connections between the data points. Transductive graph-based methods can be generalized into three steps. First, a set of objects is used to construct a graph where similar points are connected. The second step adds weights to the edges to capture how similar the nodes are. In the third step, some chosen method is used to assign labels to the unlabeled data points.

One wrapper method is consistency regularization, first proposed by Bachman et al. [5]. Consistency regularization is based on the assumption that a model should output similar predictions when fed slightly augmented inputs. Consistency regularization combines supervised training on labeled data with training on unlabeled data using the loss function:

Σ_{b=1}^{μB} ‖p_m(y|α(u_b)) − p_m(y|α(u_b))‖²₂    (2.8)

where α and p_m are stochastic functions and therefore give the two terms different values. This loss function encourages smoothness of the model around the unlabeled data points. An example of how augmentation functions affect synthetic data in a two-dimensional classification problem is shown in Fig. 2.5.

Figure 2.5: An example of consistency loss. The arrows represent data augmentations. Two random augmentations are applied to each data point. After the augmentations, the two new points should likely remain in the same class.
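As an illustration of Equation 2.8, the sketch below computes a consistency loss between a model's predictions on two randomly augmented copies of the same unlabeled feature vector. The augmentation (additive Gaussian noise) and all names (model, augment, consistency_loss) are our own illustrative choices, not the augmentation proposed later in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, scale=0.05):
    # A stochastic augmentation alpha(.): small additive Gaussian noise.
    return x + rng.normal(scale=scale, size=x.shape)

def model(x, W):
    # p_m(y|x): a toy linear classifier followed by softmax.
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def consistency_loss(x_unlabeled, W):
    # Equation 2.8 for a single unlabeled example: squared L2 distance
    # between predictions on two independently augmented copies.
    p1 = model(augment(x_unlabeled), W)
    p2 = model(augment(x_unlabeled), W)
    return np.sum((p1 - p2) ** 2)

W = rng.normal(size=(3, 4))
u = rng.normal(size=4)
print(consistency_loss(u, W))
```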
[12] made some empirical observations motivating why consistency regularization would improve model accuracy. The first observation was that networks trained on noisy data demonstrate a generally lower consistency than those trained on clean data. The second observation was that consistency decreases more significantly around noisy-labeled data points than around correctly-labeled ones. The noise in these labels eventually becomes modeled in training, i.e., the model becomes overfitted and loses accuracy. These two observations suggest that maximizing consistency of prediction could improve robustness to noise. [12] also observed a correlation between validation accuracy and training consistency.

Another wrapper method is pseudo-labeling. Pseudo-labeling uses the model itself to obtain artificial labels for unlabeled data. These pseudo-labels can then be used to train the model. However, a pseudo-label is only retained if the confidence of the model is higher than some predefined threshold. The loss function for pseudo-labeling can be defined in the following way:

(1/μB) Σ_{b=1}^{μB} 1(max(q_b) ≥ τ) H(q̂_b, q_b)    (2.9)

where q_b = p_m(y|u_b), q̂_b = argmax(q_b), τ is the threshold, and argmax is applied to produce a "one-hot" probability distribution. Pseudo-labeling can be theoretically motivated by the clustering assumption, i.e., that data points of the same class are located near each other. When a model is trained on only a few data points, some points become classified with low confidence spread over several classes instead of high confidence in one class. Pseudo-labeling uses the clustering assumption to encourage unlabeled data to receive one label with high confidence.

2.3.1 FixMatch

FixMatch is a state-of-the-art SSL algorithm [38]. It is based on two SSL approaches: consistency regularization and pseudo-labeling. FixMatch is described in Algorithm 2. The loss function in FixMatch consists of two terms, a supervised loss ℓ_s and an unsupervised loss ℓ_u. The supervised loss is a standard cross-entropy loss on weakly augmented input:

ℓ_s = (1/B) Σ_{b=1}^{B} H(p_b, p_m(y|α(x_b)))    (2.10)

The unsupervised part of FixMatch computes a pseudo-label for each unlabeled data point, which is then used in a standard cross-entropy loss. The pseudo-label is produced by first making a prediction on weakly augmented data, q_b = p_m(y|α(u_b)); in the case of image classification, the weak augmentation could be a rotation. The pseudo-label then becomes q̂_b = argmax(q_b). The pseudo-label is used together with a strongly augmented version of u_b in the cross-entropy loss:

ℓ_u = (1/μB) Σ_{b=1}^{μB} 1(max(q_b) ≥ τ) H(q̂_b, p_m(y|A(u_b)))    (2.11)

where τ is a confidence threshold that denotes when to use a pseudo-label. The full loss is ℓ_s + λ_u ℓ_u.
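The following sketch illustrates the unsupervised FixMatch loss (Equation 2.11) for one batch of unlabeled feature vectors: a pseudo-label is computed from a weakly augmented input and only kept if the model's confidence exceeds the threshold τ. The weak and strong augmentations here are simply small and large additive noise, stand-ins chosen for illustration (the augmentation actually proposed for audio features is described in Chapter 3), and all function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, W):
    # p_m(y|x): toy linear model followed by softmax.
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def weak_aug(x):   return x + rng.normal(scale=0.01, size=x.shape)
def strong_aug(x): return x + rng.normal(scale=0.20, size=x.shape)

def fixmatch_unlabeled_loss(U, W, tau=0.95):
    # Equation 2.11: cross-entropy between the pseudo-label (argmax of the
    # weakly augmented prediction) and the strongly augmented prediction,
    # masked by the confidence threshold tau.
    losses = []
    for u in U:
        q = predict(weak_aug(u), W)          # q_b
        if q.max() < tau:                    # 1(max(q_b) >= tau)
            losses.append(0.0)
            continue
        pseudo = q.argmax()                  # pseudo-label
        p_strong = predict(strong_aug(u), W)
        losses.append(-np.log(p_strong[pseudo] + 1e-12))
    return float(np.mean(losses))

W = rng.normal(size=(3, 4))
U = rng.normal(size=(8, 4))                  # a batch of unlabeled examples
print(fixmatch_unlabeled_loss(U, W))
```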
As strong augmentation functions, A, [38] used RandAugment [10] and CTAugment (control theory augment) [6]. RandAugment was developed as an augmentation function for image data. RandAugment randomly applies some N transformations sequentially to an image. The transformations used are:

• Rotate, rotates the image by some random number of degrees
• Shear-x, shears the image along the horizontal axis
• Shear-y, shears the image along the vertical axis
• Sharpness, randomly adjusts the sharpness of the image
• Contrast, adjusts the contrast
• Translate-x, translates the image horizontally
• Translate-y, translates the image vertically
• Solarize, inverts all pixels above a threshold
• Posterize, reduces the number of bits for each color channel
• Color, randomly changes the color
• Brightness, adjusts the brightness
• AutoContrast, maximizes the image contrast by setting the darkest pixels to black and the lightest pixels to white

Each transformation in RandAugment is linearly scaled with the hyperparameter M. Using only one hyperparameter to scale all transformations was motivated experimentally [10]. Since RandAugment only has two hyperparameters (M and N), they can be optimized by grid search.

Unlike RandAugment, CTAugment does not need hyperparameter tuning [6], which makes it easier to use with a small number of labeled examples. CTAugment divides each parameter of each transformation into bins of distortion magnitudes. Let m be a vector of bin weights for some parameter. The bin weights are initialized to 1 and are used to choose which magnitude bin to apply to an image. At each training step, two transformations are sampled uniformly at random for each image. To augment an image, each parameter of the two sampled transformations gets a modified set of bin weights, m̂, where m̂_i = m_i if m_i ≥ 0.8 and m̂_i = 0 otherwise. Magnitudes are then sampled from Categorical(Normalize(m̂)). To update the weights of the two transformations, the magnitude bins are instead sampled uniformly at random, and the resulting transformations are applied to a labeled example x with label p to obtain an augmented version, x̂. The correctness of the model's prediction on x̂ is then measured as:

ω = 1 − (1/2L) Σ |p_model(y|x̂; θ) − p|    (2.12)

The weight of each sampled magnitude bin is then updated as:

m_i = ρ m_i + (1 − ρ) ω    (2.13)

where ρ = 0.99 is a fixed exponential decay hyperparameter.

ALGORITHM 2: FixMatch algorithm
Input: labeled batch X = {(x_b, p_b) : b ∈ (1, ..., B)}, unlabeled batch U = {u_b : b ∈ (1, ..., μB)}, confidence threshold τ, unlabeled data ratio μ, unlabeled loss weight λ_u
    ℓ_s = (1/B) Σ_{b=1}^{B} H(p_b, p_m(y|α(x_b)))
    for b = 1 to μB do
        q_b = p_m(y|α(u_b); θ)
    end
    ℓ_u = (1/μB) Σ_{b=1}^{μB} 1(max(q_b) ≥ τ) H(argmax(q_b), p_m(y|A(u_b)))
    return ℓ_s + λ_u ℓ_u

2.4 Distributed algorithms

Distributing an algorithm is a technique to enable computation on many independent nodes without the need for a central coordinator. The nodes can range from cores in a multi-core processor to geographically distributed computation nodes connected in a network. Each node runs a different part of the algorithm while having only limited knowledge about the other nodes.

A network of nodes can be built with different system architectures. Three different types of systems are centralized, decentralized, and distributed. As shown in Fig. 2.6, the difference lies in how the nodes are connected. In a centralized system, there exists one server to which all other nodes are connected, and communication is always done between a single node and the server. This system has one big limitation: it is prone to failure. If the server goes down, the whole system crashes. A decentralized system works similarly. However, in this system, there exists more than one server.
The nodes can communicate with any of the servers, and the servers communicate with each other. By having many servers, the network is better protected against failure. In a distributed setting, all nodes are connected to each other. Each node in the distributed network has the same authority; there exists no server to handle coordination. Thus, specific distributed algorithms have to be used to handle coordination in a distributed system.

Figure 2.6: A diagram illustrating the difference between a centralized, decentralized, and distributed network topology.

To perform the coordination between nodes in a distributed system, there exist many algorithms. One common approach among distributed algorithms is to choose one node as coordinator [11]. Which node is selected as coordinator generally does not matter. However, the selection of the coordinator has to be performed in agreement with all nodes, and every node in the network should be aware of who is currently selected as coordinator. One such technique for choosing a coordinator is leader election. The challenge leader election tries to address is how to make a uniform decision on a coordinator among all the nodes.

Another way to model a distributed algorithm is as a transition system. A transition system is used to define the behavior of discrete systems, such as finite-state machines. A finite-state machine usually consists of a graph with nodes and edges, where each node represents a state and each edge represents an action. There are various types of transition systems that can model different types of algorithms. One such system is the Channelled Transition System (CTS), which was created to model the use of communication channels [2]. While a CTS makes the abilities of a single agent clear, it may not represent the interaction of agents clearly. ReCiPe was developed to give better semantics for such interactions [2, 1].

2.4.1 Model of Communication

We present a modified version of the ReCiPe communication formalism [2, 1]. We start by specifying agents (or programs) and their local behaviours, and we show how to compose these local behaviours to generate a global (or system) behaviour. We assume that agents rely on a set of data variables d and a set of multicast channels ch.

Definition 1. An agent is A_i = ⟨V_i, g^r_i, T^s_i, T^r_i, θ_i⟩, where:

– V_i is a finite set of typed local variables, each ranging over a finite domain. We use V′ to denote the primed copy of V (to store the next assignment to the variables in V) and Id_i to denote the assertion ⋀_{v ∈ V_i} v = v′.
– g^r_i(V_i, ch) is a receive guard describing the connection of an agent to channel ch.
– T^s_i(V_i, V′_i, N, d, ch) is an assertion describing the send transition relation, where N specifies the number of required receiving agents. That is, given the current assignment to V, the agent can send a message d to N agents on some channel in ch, and consequently the next assignment to V is stored in V′.
– T^r_i(V_i, V′_i, d, ch) is an assertion describing the receive transition relation. That is, given the current assignment to V, an agent can receive a message d on some channel in ch and evolve to V′.
– θ_i is an assertion on V_i describing the initial states, i.e., a state is initial if it satisfies θ_i.

Agents exchange messages. A message is defined by the channel it is sent on and the data it carries. A set of agents agreeing on the data variables d and the channels ch defines a system.
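As a rough illustration of Definition 1, the sketch below models an agent as a record of local variables, a receive guard, and send/receive transition functions. This is our own simplified rendering for intuition only, not part of the ReCiPe formalism; all names and types are invented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # an assignment to the agent's local variables V_i

@dataclass
class Agent:
    # Simplified rendering of Definition 1: <V_i, g_i^r, T_i^s, T_i^r, theta_i>.
    state: State                                      # current assignment to V_i
    receive_guard: Callable[[State, str], bool]       # g_i^r(V_i, ch)
    send: Callable[[State, str], tuple]               # T_i^s: (next_state, data, n_receivers)
    receive: Callable[[State, str, Any], State]       # T_i^r: next_state after receiving data
    is_initial: Callable[[State], bool]               # theta_i

# A toy agent that broadcasts a counter on channel "sync" and accepts updates.
agent = Agent(
    state={"count": 0},
    receive_guard=lambda s, ch: ch == "sync",
    send=lambda s, ch: ({"count": s["count"] + 1}, s["count"] + 1, 1),
    receive=lambda s, ch, d: {"count": max(s["count"], d)},
    is_initial=lambda s: s["count"] == 0,
)
print(agent.is_initial(agent.state), agent.send(agent.state, "sync"))
```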
We informally define the transition relation of the system, ρ, as follows. The transition relation ρ relates a system state s to its successors s′ given a message m = (ch, d, k) sent by agent k. Namely, an agent k can send a message with data d (an assignment to d) on channel ch to n other agents if and only if n other agents are connected and participate in the interaction. That is, the agents that are connected to channel ch (i.e., each agent j in state s_j with g^r_j(s_j, ch) satisfied) get the message and perform a receive transition. As a result of the interaction, the state variables of the sender and of these receivers might be updated. This means that the interaction is blocking both for the sender and the receivers: a sender is not able to send without having the required number of receivers available.

3 Methods

This chapter explains the methods used in this project. It begins by explaining how data was gathered for the road surface noise dataset. In the following section, both datasets used in the evaluation are described. Section three explains how and what features are extracted from the audio data. The following section describes the model architecture of the classifier and the proposed distributed training algorithm, distributed FixMatch. Furthermore, we describe the proposed augmentation function for audio features. The last two sections describe the simulation environment and the evaluation methods.

3.1 Data Gathering

The data was recorded through several short road trips. The road trips were performed on the road surfaces asphalt, gravel, oil gravel, cobblestone, and snowy roads. To record the sound of the wheel against each road surface, a microphone was used. The microphone was connected to the vehicle. The placement of the microphone was tested at different positions, both inside and outside the car. The placement chosen was close to the wheel on the outside of the car. To minimize wind noise, the microphone was placed inside the wheel well. The back right wheel well was chosen to also minimize engine noise, as seen in Fig. 3.1.

Every car produces different sounds when driving. This is due to a number of parameters that vary between cars, for example the engine type, the airflow, the tire type, the tire wear, and motor defects. Thus, when training a neural network on sounds from just one car, the network model could adapt to the sounds of that specific car and then perform poorly when applied to different cars. To minimize this effect and make the model more general, the data used in training should be as generic as possible. To make the data more generic, the car parameters should change between the recordings. To achieve data generality, we collected recordings with several different car models. The car models used for the recordings are presented in Table 3.1.

Table 3.1: Parameters of the cars used for recording

Model           | Year | Type     | Form          | Motor        | Tires
Nissan Qashqai  | 2011 | Diesel   | Cross-over    | 1.6-dCi      | Stud-free winter tires
Volvo V70       | 2013 | Diesel   | Station wagon | D4           | Stud-free winter tires
Polestar 2      | 2022 | Electric | Hatchback     | Single motor | Stud-free winter tires

There are other factors that can vary apart from the car itself; the weather, road quality, and car speed are such factors. To minimize the effect of these parameters, recordings were collected on wet and dry roads, on roads of different quality, and at different car speeds. The car speeds varied from 10 km/h to 80 km/h.
Figure 3.1: The placement of the microphone when collecting road noise for the road surface noise dataset.

3.2 Datasets

Two datasets were used when testing distributed FixMatch, the proposed distributed semi-supervised algorithm. Firstly, the road surface noise dataset, which was recorded during the project. Secondly, UrbanSound8K, which is an open-source dataset used to evaluate audio classifiers.

3.2.1 Road surface noise dataset

The first dataset contains audio recordings of road noise that were recorded during the project. The dataset contains 5 different classes, which represent audio from 5 road surfaces:

• 4688 clips of gravel road noise
• 4249 clips of asphalt road noise
• 1130 clips of cobblestone road noise
• 2010 clips of oil gravel road noise
• 667 clips of snow road noise

Each clip is one second long and there are 12744 clips in total, cut from about 212 minutes of audio recordings. When listening to the clips, we found that it was often clear what type of road surface the sound had been recorded on. However, some of the clips that were recorded when driving faster than 50 km/h contained audio clipping, possibly caused by wind. These clips were removed and are not included in the dataset.

3.2.2 UrbanSound8K

The second dataset, UrbanSound8K [35], is a dataset of urban sounds. UrbanSound8K was chosen as an additional dataset for two reasons. It has been used in previous research, which makes comparisons in performance easier, and it is a harder classification problem than the road surface noise dataset, which could show that distributed FixMatch works on harder problems. The aim of UrbanSound8K was to create a dataset consisting of the most frequently occurring urban noise complaints. The authors had three conditions for the dataset: the sounds should occur in an urban environment, all recordings should be real (i.e., no artificially made sounds), and the dataset should be large enough for training algorithms that can be used on real sensor networks. To achieve this, they utilized the Freesound [17] database, where they searched for and downloaded recordings associated with each class. These recordings were then manually checked and cut to only include the relevant sounds. The final dataset consists of 10 classes distributed as follows:

• 1000 clips of street noise
• 1000 clips of child noise
• 1000 clips of dog noise
• 1000 clips of AC noise
• 1000 clips of drill noise
• 1000 clips of jackhammer noise
• 1000 clips of engine noise
• 929 clips of siren noise
• 427 clips of car horn noise
• 374 clips of gunshot noise

In total there are 8732 clips with a maximum length of 4 s, and the total length of the recordings is 8.5 h. However, two clips were shorter than the minimum clip size (60 ms) for the feature set used in openSMILE. These clips were removed before training.

3.3 Feature Extraction

We extract features from the audio recordings to feed as input to the neural network. The features are extracted by applying filters to the audio recordings. The filters are applied using the audio feature extraction toolkit openSMILE [15]. We chose openSMILE for its ease of use and to ensure reproducibility. The filters extract different low-level descriptors (LLD). The low-level descriptors include audio features such as:

• Frame Energy
• Frame Intensity / Loudness
• Mel spectrograms
• Auditory Spectra

The choice of openSMILE was based on previous state-of-the-art audio classification papers [3], [4].
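In practice, extracting openSMILE functionals for a directory of one-second clips can be scripted. The sketch below uses the opensmile Python package; the thesis does not state which openSMILE front end was used, and the directory layout, file names, and the feature set chosen here (eGeMAPSv02, one of the sets compared in Section 3.3.2) are illustrative assumptions.

# Hedged sketch: extracting openSMILE functionals for one-second clips with the
# opensmile Python package. File names and the chosen feature set are illustrative.
import glob
import opensmile
import pandas as pd

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # one of the sets evaluated in Sec. 3.3
    feature_level=opensmile.FeatureLevel.Functionals,   # 88 functionals per clip
)

rows = []
for path in sorted(glob.glob("clips/gravel/*.wav")):    # hypothetical directory layout
    feats = smile.process_file(path)                    # 1 x 88 DataFrame per clip
    feats["label"] = "gravel"
    rows.append(feats)

features = pd.concat(rows)
print(features.shape)    # (num_clips, 89): 88 features + label column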
In [3] the authors tried to detect whether the road surface the car was driving on was wet or dry. To detect road surface wetness, they extracted a set of features, Auditory Spectral Features (ASF) [27], described in Section 3.3.1. Since road surface wetness detection and road surface detection are similar problems, we will try their set of features. However, openSMILE also has its own predefined feature sets that were made for speech recognition. The feature set from [3] will be compared to the predefined sets in openSMILE, and the best one will be used.

3.3.1 Auditory Spectral Features

The first step in computing the feature set from [3] is to compute a short-time Fourier transform (STFT). Unlike [3], we use a 100 ms frame size and a 100 ms step size. This is done to reduce the dimensionality of the features. The STFT output is passed through 26 triangular filters to derive the Mel spectrogram $M^{100}(n, m)$. The Mel spectrogram is then transformed to log-scale according to Eq. (3.1). This is done to match the human perception of loudness, since humans can hear the type of road surface they are driving on.

$M^{100}_{\log}(n, m) = \log\left(M^{100}(n, m) + 1.0\right)$   (3.1)

The positive first-order differences can then be calculated from each log Mel spectrogram:

$D^{100}(n, m) = M^{100}_{\log}(n, m) - M^{100}_{\log}(n - 1, m)$   (3.2)

The positive first-order differences are used to emphasize sudden changes in the audio, specifically note onsets. The log Mel spectrogram and the first-order differences give 52 features. Finally, we include the frame energy and its derivative to get the full feature vector.

Feature set  ComParE_2016  GeMAPSv01b  eGeMAPSv02  ASF
Accuracy     81.62%        81.11%      84.51%      80.60%

Table 3.2: Accuracy of different feature sets when used to train a neural network to classify the road surface noise dataset.

3.3.2 Feature set evaluation

As mentioned earlier, there exist several premade feature sets in openSMILE. The ones we chose to evaluate are ComParE_2016 [37], GeMAPSv01b [13], and eGeMAPSv02 [13]. The difference between these feature sets lies in the Low-Level Descriptors (LLD) extracted, as well as the functionals applied to the LLD.

The largest of these premade feature sets is ComParE_2016. It extracts 65 LLD and applies various functionals to these LLD to obtain 6373 features. The set was created as a tool to solve four challenges within speech recognition: detecting non-linguistic events such as a sigh or laughter, recognizing conflict in group discussions, recognizing the emotion of a speaker, and determining the pathology of a speaker.

The smallest feature set we evaluated was GeMAPSv01b [13]. To obtain this feature set, various functionals were applied to 18 extracted LLD, resulting in a feature set of 62 features. This set was also created to solve speech recognition tasks. However, the aim of GeMAPSv01b was to create a minimalistic set, compared to ComParE_2016, that could perform comparably in emotion detection. GeMAPSv01b performed similarly to ComParE_2016, and even better for certain emotions [13].

The final feature set we evaluated was eGeMAPSv02 [13]. This feature set is an extension of GeMAPSv01b. It uses the same 62 features as a base. In addition to these features, it extracts 7 further LLD and applies functionals to these LLD to obtain the final feature set of 88 features.

The different feature sets were evaluated empirically to determine which set would work best with our classification problem.
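As a concrete illustration of Eqs. (3.1)–(3.2), the following sketch computes the log Mel spectrogram and its positive first-order differences for one clip. It assumes librosa and NumPy; the file name, the use of magnitude (rather than power) spectra, and the clipping of negative differences are assumptions, since only the 100 ms frame/step sizes and the 26 filters are specified above.

# Sketch of the ASF computation in Eqs. (3.1)-(3.2), assuming librosa/numpy.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=None)
frame = int(0.100 * sr)           # 100 ms frame size
hop = int(0.100 * sr)             # 100 ms step size

# 26 triangular Mel filters applied to the STFT magnitudes: M^100(n, m)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=frame, hop_length=hop, n_mels=26, power=1.0
)

mel_log = np.log(mel + 1.0)                    # Eq. (3.1): log-scale Mel spectrogram
diff = np.diff(mel_log, axis=1)                # Eq. (3.2): first-order differences
diff = np.maximum(diff, 0.0)                   # keep only the positive differences

# Frame energy and its first-order difference complete the feature vector.
energy = np.array([np.sum(y[i:i + frame] ** 2)
                   for i in range(0, len(y) - frame + 1, hop)])
energy_diff = np.diff(energy)

print(mel_log.shape, diff.shape, energy.shape)  # 26 + 26 bands -> 52 spectral features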
The road surface noise dataset was used to evaluate the feature sets. One by one, each feature set was used to extract features from the dataset. These features were then used as inputs to a central supervised model. The model was trained with the same hyperparameters for all the feature sets. For each feature set, the model was trained for 800 training epochs. The training was performed for ten runs per set, and the highest accuracy achieved among these 10 runs was collected. The highest accuracy for every feature set is shown in Table 3.2. There we can see that all premade openSMILE feature sets achieved higher accuracy than the feature set from [3]. Among the premade sets, eGeMAPSv02 performed best. Thus, we used eGeMAPSv02 as our feature extraction set.

3.3.3 eGeMAPSv02

eGeMAPSv02 [14] consists of 88 features in total. The first step in the extraction of those features is to extract the 25 LLDs. All of these LLDs are smoothed over 3 frames (for pitch, jitter, and shimmer, the smoothing is only performed within regions where the fundamental frequency F0 > 0). The 25 LLD can be sorted into three groups: frequency-related parameters, energy/amplitude-related parameters, and spectral parameters.

Frequency-related parameters:
• Pitch, logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz (semitone 0).
• Jitter, deviations in individual consecutive F0 period lengths.
• Formant 1, 2, and 3 frequency, centre frequency of the first, second, and third formant.
• Formant 2–3 bandwidth, added for completeness of the Formant 1–3 parameters.

Energy/amplitude-related parameters:
• Shimmer, difference of the peak amplitudes of consecutive F0 periods.
• Loudness, estimate of perceived signal intensity from an auditory spectrum.
• Harmonics-to-Noise Ratio (HNR), relation of energy in harmonic components to energy in noise-like components.

Spectral parameters:
• Alpha Ratio, ratio of the summed energy from 50–1000 Hz and 1–5 kHz.
• Hammarberg Index, ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz region.
• Spectral Slope 0–500 Hz and 500–1500 Hz, linear regression slope of the logarithmic power spectrum within the two given bands.
• Formant 1, 2, and 3 relative energy, the ratio of the energy of the spectral harmonic peak at the first, second, and third formant's centre frequency to the energy of the spectral peak at F0.
• Harmonic difference H1–H2, ratio of the energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2).
• Harmonic difference H1–A3, ratio of the energy of the first F0 harmonic (H1) to the energy of the highest harmonic in the third formant range (A3).
• Formant 1 bandwidth, bandwidth of the first formant.
• MFCC 1–4, Mel-Frequency Cepstral Coefficients 1–4.
• Spectral flux, difference of the spectra of two consecutive frames.

The second step is to use the LLD to compute the 88 features. Firstly, the arithmetic mean and the coefficient of variation functionals are applied to all 25 LLD. This results in 50 features. Furthermore, 6 temporal features are added:

• The rate of loudness peaks, i.e., the number of loudness peaks per second.
• The mean length and the standard deviation of regions where F0 > 0.
• The mean length and the standard deviation of regions where F0 = 0 (approximating pauses).
• The number of continuous regions where F0 > 0, per second (pseudo syllable rate).

This results in 56 features so far. Another 26 features are added, computed only where F0 is non-zero.
16 of those features are added by applying the following functionals to pitch and loudness:

• The 20th, 50th, and 80th percentiles.
• The range from the 20th to the 80th percentile.
• The mean and standard deviation of the slope of rising/falling signal parts.

The remaining 10 are computed by taking the arithmetic mean and coefficient of variation of the spectral flux and MFCC 1–4. This results in 82 features in total. 5 additional features are added by applying the following functionals where F0 = 0:

• The arithmetic mean of the Alpha Ratio.
• The Hammarberg Index.
• The spectral slopes from 0–500 Hz and 500–1500 Hz.
• The arithmetic mean of the spectral flux.

This gives 87 features. Lastly, the equivalent sound level is added, resulting in 88 features.

3.4 Model Architecture & Training

The features are fed to a fully connected neural network (FNN). An FNN was chosen instead of a CNN since the feature space is small. The FNN consists of an input layer of 88 nodes, three hidden layers of 200 nodes each, and an output layer of 5 or 10 nodes depending on the classification task. The model was implemented in PyTorch. PyTorch was chosen because of its ease of use, the flexibility given by its dynamic computational graphs, and because it is often used in research. The training on each node is performed by stochastic gradient descent (SGD) with the loss function described in Algorithm 2. Both labeled and unlabeled data were used. Furthermore, cosine learning rate decay was used. Lastly, a dropout of 50% was applied to the first layer during training.

Figure 3.2: State transitions of distributed FixMatch.

3.4.1 Distributed FixMatch

Distributed FixMatch combines ideas from FixMatch and FedAvg and distributes them. A key difference between distributed FixMatch and FedAvg is that FedAvg has a server that coordinates the supervised training on the clients, whereas in distributed FixMatch every node takes turns performing the server role. The semi-supervised training is inspired by FixMatch, but the augmentation functions have been modified to work with audio features extracted by openSMILE.

Each agent $A_i$ in distributed FixMatch has its own set of local variables, that is, its parameters $w_i$, its own set of unlabeled data $U_i$, and an identical copy of the labeled data L. The training in distributed FixMatch is performed in rounds R. Each client begins in an initial state, where it attempts to select some N of its neighbors to begin a training round. If an agent $A_i$ finds N neighbors in the initial state, it makes a state transition and sends its parameters to the other models. At the same time, the N neighbors make a state transition and begin training with the average of agent $A_i$'s parameters and their own parameters. The agent $A_i$ acts as a server in this instance. When the neighbors have finished training, they return the new parameters to $A_i$ and return to the initial state. $A_i$ then averages the parameters it received from its neighbors and returns to the initial state. The distributed FixMatch algorithm is presented in Algorithm 3. A visualization of the state transitions is shown in Fig. 3.2. The state transitions are performed according to the modified version of ReCiPe introduced in the Model of Communication section in the theory chapter.

ALGORITHM 3: Distributed FixMatch
SelectN is a function that queries N neighbouring clients and returns them if they are in the initial (idle) state.
R is the number of server rounds, w is the model parameters, $U_i$ is the unlabeled data, and L is the labeled data.

Initialize w; State ← initial
Function Step():
    if State = initial then
        nodes ← SelectN()
        if |nodes| = N and r ≤ R then
            for node ∈ nodes do
                Transition node to the client state and send w to node
            end
            State ← server
            r ← r + 1
        end
    end
    if State = server then
        ŵ ← Get received parameters from clients
        if |ŵ| = N then
            w ← (1/N) Σ_{n=1}^{N} ŵ_n
            State ← initial
        end
    end
    if State = client then
        w_s ← Get received parameters from server
        w ← (w + w_s) / 2
        for each local epoch e from 1 to E do
            for batch b ∈ (U_i, L) do
                w ← w − η∇ℓ(w; b)
            end
        end
        Send w to server
        State ← initial
    end

The training is performed by a slightly modified version of FixMatch. In [38], FixMatch is implemented to train an image classification model. The loss function contains one strong augmentation function and one weak augmentation function. The strong augmentation function uses RandAugment [10], which randomly chooses between a set of image transformations. These transformations can be flipping the image, changing its colors, or rotating it. The weak augmentation function only uses a subset of these transformations, such as flipping the image.

The features we use to classify the audio are 88 scalar features extracted with the eGeMAPSv02 feature set by openSMILE. Therefore, the augmentation functions used in [38] are incompatible with classifying audio features; the scalar features need to be augmented by a different method. One such feature is illustrated in Fig. 3.3. In the figure we can clearly see that some classes tend to cluster around a specific value of the feature. The idea is that modifying these features is analogous to image transformations. Changing the loudness or frequency of the audio is similar to rotating or changing the color of an image. Audio clips where many features are similar but slightly different should belong to the same class. We use this clustering assumption to create our augmentation functions. If we add a small random value drawn from a Gaussian distribution to each feature, then the augmented features should likely remain in the same class:

$\alpha(u) = u + x$   (3.3)

where $x = (x_0, x_1, \ldots, x_{87})$, $x_i \sim \mathcal{N}(\mu, \sigma^2)$, and $u \in U$ is some unlabeled data point. The augmentation function $\alpha$ can easily be turned into a strong augmentation function and a weak augmentation function by using different $\sigma$ values.

Figure 3.3: The distribution of the feature mfcc1_sma3_amean for all road surface data points, visualized for each class.

3.5 Hyperparameter optimization

Distributed FixMatch uses several hyperparameters. To achieve the best result, we decided to search for their optimal values. This optimization was done using grid search, since it is simple to implement. The search was made over four hyperparameters: the weak augmentation σ, the strong augmentation σ, µ, and the threshold τ. For each of these hyperparameters an interval was chosen, and every combination of values in these intervals was evaluated (144 combinations in total). Each combination was run for 400 iterations. The five best combinations of hyperparameters in terms of accuracy are presented in Table 3.3.

acc    augw  augs  threshold  mu
66.72  0.2   0.6   0.95       3
66.72  0.4   0.5   0.95       3
66.55  0.3   0.6   0.95       3
66.38  0.2   0.5   0.95       3
66.20  0.1   0.6   0.99       3

Table 3.3: Accuracy of distributed FixMatch when trained on the UrbanSound8K dataset for different combinations of hyperparameters.
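To make the augmentation and the semi-supervised loss concrete, the following PyTorch sketch implements the FNN described in Section 3.4, the Gaussian augmentation of Eq. (3.3), and a standard FixMatch-style pseudo-labelling loss. The thesis defines its exact loss in Algorithm 2, which is not reproduced here; the function and parameter names below (augment, fixmatch_loss, the default σ and τ values taken from Table 3.3) are illustrative assumptions, not the thesis implementation.

# Hedged PyTorch sketch of the pieces described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FNN(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(88, 200), nn.ReLU(), nn.Dropout(0.5),  # 50% dropout on the first layer
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, n_classes),
        )

    def forward(self, x):
        return self.net(x)

def augment(u: torch.Tensor, sigma: float, mu: float = 0.0) -> torch.Tensor:
    """Eq. (3.3): add N(mu, sigma^2) noise to each of the 88 scalar features."""
    return u + torch.randn_like(u) * sigma + mu

def fixmatch_loss(model, xb, yb, ub, sigma_w=0.2, sigma_s=0.6, tau=0.95):
    # Supervised term on the labelled batch.
    sup = F.cross_entropy(model(xb), yb)

    # Pseudo-label from the weakly augmented unlabelled batch.
    with torch.no_grad():
        probs = F.softmax(model(augment(ub, sigma_w)), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= tau).float()           # keep only confident pseudo-labels

    # Consistency term on the strongly augmented unlabelled batch.
    unsup = (F.cross_entropy(model(augment(ub, sigma_s)), pseudo,
                             reduction="none") * mask).mean()
    return sup + unsup

A weak and a strong augmentation are obtained simply by calling augment with two different σ values, e.g. the augw and augs columns of Table 3.3.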
3.6 Simulation

To examine how the algorithm would perform in practice, a simulation environment was implemented. The simulation consists of a simulation handler and nodes. Each node is implemented as a class with its own model and datasets. Furthermore, each node has a step function that attempts a state transition. The step function is shown in Algorithm 3. The choice of which client gets to run its step function is made uniformly at random by the simulation handler. The simulation is implemented in this way so that it can simulate the parallelism of the nodes in real-world applications. In a parallel system, it is unknown in what order the nodes try to make a state transition. The random choice gives our simulation the same property.

To distribute the data between the nodes, the datasets were split into labeled and unlabeled data. An equal share of data from each class was chosen randomly to be used as labeled data. The labeled data were split into a validation set and a training set. The split was made such that all classes were equally represented in the training set. A copy of the validation and training set was distributed to all the nodes. The remaining data were used as unlabeled data. This data was spread out to the nodes in an IID manner [21], i.e., each node was given the same amount of unlabeled data per class and all data on each node were distinct.

3.7 Evaluation

To evaluate the performance of the distributed FixMatch algorithm we will perform several tests. Firstly, distributed FixMatch will be compared to a supervised, centrally trained model. The supervised model is trained using stochastic gradient descent with cross-entropy loss. The supervised model and the semi-supervised distributed FixMatch will both have access to the same set of labeled data. However, each participating node in distributed FixMatch will also have access to its own set of unlabeled data. One test will be performed with the UrbanSound8K dataset and one test will be performed with the road surface noise dataset. The training will then be performed five times for 800 rounds to maximize the probability that the model achieves its highest accuracy. From these 5 runs, the highest accuracy will be collected. The two methods will then be compared by their classification accuracy. These tests are performed to evaluate the performance of distributed FixMatch compared to only using labeled data with a central model. The tests will also be performed with several different amounts of labeled data. This is done to evaluate how the amount of labeled data affects the performance of distributed FixMatch with respect to prediction accuracy. The results of these tests are shown in Fig. 4.1 and Fig. 4.2. The results show that distributed FixMatch can, in some settings, utilize the unlabeled data to improve the accuracy.

Another test evaluates how the accuracy of distributed FixMatch is affected when increasing the number of clients. This is done to see how well the algorithm scales. These tests will be performed with 5, 10, and 25 clients. The number of clients is, however, somewhat limited, since our dataset is too small for tests with a realistic number of clients. The results of these tests are presented in Table 4.1.

We will also conduct tests to evaluate the communication efficiency of the distributed FixMatch algorithm.
These tests will be performed by counting how many communication messages are sent by each node in the network. The number of messages sent will be compared when using the distributed FixMatch algorithm versus when using the federated learning algorithm. To measure the messages, a counter will be incremented whenever a message is sent between two nodes. Each node will have an individual counter, so the distribution of messages can be seen. The tests will compare the total number of messages sent in the network before reaching a certain accuracy, as well as the maximum number of messages sent by one node. This accuracy threshold was set to 59%. The tests will be performed with different numbers of clients to evaluate the scalability of the distributed FixMatch algorithm. These tests are presented in Table 4.3. They show that the communication overhead of distributed FixMatch is more evenly distributed than in federated learning. They also show that distributed FixMatch scales better than federated learning.

4 Results

To evaluate our distributed FixMatch algorithm, several tests were performed on the road surface noise dataset and UrbanSound8K. In the first set of tests, the performance of the semi-supervised component of distributed FixMatch is tested. A model trained by distributed FixMatch is compared to a model trained using supervised learning. Both models train on the same labeled data, but distributed FixMatch also trains on unlabeled data. In these tests, we can see that distributed FixMatch successfully leverages the unlabeled data. For instance, with 50 labels per class on UrbanSound8K, distributed FixMatch achieved an accuracy of 68.51% compared to the supervised training, which achieved 63.12%. The second set of tests evaluates how the size of the network impacts the performance of distributed FixMatch. Each test uses the same data and model; however, the number of clients varies. These tests show that the network size has little impact on performance, as seen in Table 4.1. Lastly, we demonstrate the individual accuracies of the clients during training.

For the road surface noise dataset, the model trained using distributed FixMatch outperforms the supervised model, as seen in Fig. 4.1. This is most prevalent in the scenarios with fewer labels. The accuracy with one label for every class is 85.04% for distributed FixMatch and 68.35% for the supervised model. This may seem like a very high accuracy for so few labels. However, this is because the features of the different classes in the road surface noise dataset are quite distinct, e.g. the feature seen in Fig. 3.3, where we can see a clear distinction among most classes. With only one label per class, the supervised model has some problems classifying the data. A possible explanation of these results is that the supervised model has difficulties learning the real distribution with just one example. This is easier for distributed FixMatch, since its unlabeled data will tend to cluster around the labeled data points. Therefore, it can derive which points likely belong to which distribution, giving it a better idea of the distributions of the features. Nevertheless, the accuracies of the distributed FixMatch model and the supervised model converge with an increased number of labels per class. At 50 labels per class they are the same. This is not surprising, since 50 labels per class give the supervised model a good idea of the real distributions of the features.
Hence, the advantage distributed FixMatch got from its unlabeled data is strongly diminished.

Figure 4.1: The accuracy of distributed FixMatch versus a central supervised model for the road surface noise dataset, using a fixed number of labeled training examples for each class.

Figure 4.2: The accuracy of distributed FixMatch versus a central supervised model for the UrbanSound8K dataset, using a fixed number of labeled training examples for each class.

For the UrbanSound8K dataset, we can see that the trend (Fig. 4.2) is different from the previous test. Here, the distributed FixMatch model performs slightly worse than the supervised model in the tests with few labels. This could be because UrbanSound8K is harder to classify than the road surface noise dataset. The feature distributions in UrbanSound8K are less clustered (e.g., the feature shown in Fig. 4.3), which could cause distributed FixMatch to mislabel some of the unlabeled data. For instance, the model could receive labeled data with an outlier feature. This could cause distributed FixMatch to wrongly assign data from other categories to the same category, which would cause the model to be trained on incorrect labels. In contrast, the supervised model could be more conservative, in that it does not assume that clusters near labeled data belong to the same class. When the amount of data increases, the risk of only receiving outliers decreases and the chance of getting labeled data points near the real class clusters increases. Therefore, as the number of labels increases, the performance difference between the supervised model and distributed FixMatch is reversed. With more than 5 labels per class, distributed FixMatch outperforms the supervised model. Finally, with 50 labels per class, distributed FixMatch outperforms the supervised model by 5.39 percentage points.

Figure 4.3: The distribution of the feature mfcc1_sma3_amean for all UrbanSound8K data points, visualized for each class.

In Table 4.1, the test results when training with different numbers of clients are shown. For both datasets, the difference in accuracies is small. The accuracies for UrbanSound8K differ by 2.26 percentage points. The accuracies for the road surface noise dataset differ by 2.24 percentage points. The accuracy for UrbanSound8K slowly decreases with the number of clients, suggesting a scalability problem. However, the accuracy for the road surface noise dataset first increases and then decreases with the number of clients. This suggests that the variation could be caused by noise. Therefore, the algorithm seems to scale, but more testing could be needed to confirm this.

Dataset                                 acc 5 clients  acc 10 clients  acc 25 clients
UrbanSound8K, 30 labels                 66.72%         65.85%          64.46%
Road surface noise dataset, 5 labels    88.72%         90.56%          88.32%

Table 4.1: The effect of the number of clients on accuracy.

Fig. 4.4 shows the average accuracy of the clients when training on the UrbanSound8K dataset. It also shows the individual accuracy of each client. The accuracy is recorded at each round of the simulation, i.e., each time a client tries to make a state transition. As the rounds progress, the average accuracy slowly increases. In contrast, the increase in accuracy of the individual clients is much more unstable, even if all of them eventually stabilize.
This can be explained partly by high accuracy increases during training and partly by the parameter exchanges between the clients. A trained client that exchanges parameters with an untrained client will lose some accuracy due to the nature of the aggregation. This explanation is supported by the convergence of all individual accuracies towards the average accuracy during the last rounds of training.

Figure 4.4: Accuracy development of the nodes during training on the UrbanSound8K dataset.

In Fig. 4.5 we can see how well a model trained using distributed FixMatch could classify the different road surfaces. We can see that the model can classify snow and cobblestone with nearly 100% accuracy. This outcome is somewhat predictable, because the noise coming from snow and cobblestone is easier to distinguish compared to the other road surfaces. Oil gravel and gravel were predicted with fairly high accuracy. The model had the most difficulty classifying noise coming from an asphalt road; here the model only classified it correctly about 78% of the time. The lower accuracy could have been caused by the more diverse data. For example, unlike the other classes, the asphalt recordings were collected with all the different vehicles used. The asphalt recordings were also recorded on many different roads of varying quality, making the data more scattered. Another possible cause is that asphalt does not have an equally distinct sound compared to the other road surfaces.

Figure 4.5: Accuracy per road surface for a model trained using distributed FixMatch. The accuracy was measured when a client reached at least 87% overall accuracy; an average of 5 runs was taken.

4.1 Communication overhead

In this section, we will first make a theoretical comparison of the communication overhead of centralized training (FixMatch using a central dataset), federated training (FedAvg combined with FixMatch using a distributed dataset), and distributed FixMatch (using a distributed dataset). Then we will empirically demonstrate the communication overhead of the different methods. In the theoretical comparison, we will first derive the complexity of the total communication overhead, i.e., all communication that takes place on the network. Secondly, we will derive the complexity of the maximum local cost, i.e., the highest cost of communication for any single node in the network. The results of both comparisons are shown in Table 4.2.

                       Total cost                        Maximum node cost
Centralised            O(n · d)                          O(n · d)
Federated              O(n · r · m)                      O(n · r · m)
Distributed FixMatch   O(n · r · m · degree(network))    O(degree(network) · r · m)

Table 4.2: Communication complexity, where n is the number of nodes participating in the training, d is the client dataset size, r is the number of server rounds during training, and m is the model size.

To train on all the data with a centralized approach, it would be required to upload the audio data from each of the participants. This would result in a communication overhead of O(n · d), where n is the number of participants and d is the average client dataset size. In contrast, the total communication overhead of federated training and distributed FixMatch would be O(n · r · m), where n is the number of participants, r is the number of communication rounds, and m is the model size.
In the federated algorithm, the server sends its model m to and receives it from some fraction of the n clients every round, until round r, i.e., a communication of O(n · r · m). In the distributed FixMatch algorithm, every node can act as a server for r rounds. Each round it acts as a server, it sends its model m to and receives it from degree(network) clients. Thus, the communication of each node is O(r · m · degree(network)) and the communication of the entire network is O(r · m · degree(network) · n).

Comparing O(n · d) and O(r · m · n), d tends to be larger than m · r. For instance, one second of a recorded wav file takes 172 KB. The average daily drive time is 51 minutes [23]. Therefore, the average amount of data would be 51 · 60 · 172 KB ≈ 526 MB daily. This can be compared to the model size used in this project of 0.4 MB. The average number of rounds needed for convergence on the UrbanSound8K (6.6 GB) dataset is 24.24, giving a total communication overhead per client of 24.24 · 0.4 MB ≈ 9.7 MB.

Another way to evaluate the communication overhead is to look at the maximum communication overhead of each participant. This is where distributed FixMatch has a large advantage over the other methods. In the centralized approach, the server communicates with n clients. From each client it receives some amount of data d. This gives it a communication overhead of O(n · d). A client only communicates with the server, by uploading its data d, which gives a communication overhead of O(d). In the federated approach, the maximum local communication overhead also occurs on the server. The server receives communications from a fraction of the n clients every round, i.e., O(m · n) data every round, where m is the model size. This gives a communication overhead of O(n · r · m). The clients only communicate with the server, by receiving and sending the model m during r rounds, which gives a communication overhead of O(r · m). In distributed FixMatch, the maximum communication overhead of a node occurs when it acts as a server. The temporary server will send and receive m from each of the temporary clients during r rounds. The number of temporary clients is limited to the number of neighbors a client has, i.e., degree(network). This gives a communication overhead of O(degree(network) · r · m). In the worst case, a node acting as a client can communicate with all of its neighbours during all of their r rounds. In this case, it sends and receives the model m to and from each of the neighbours r times. This gives a worst-case communication overhead of O(degree(network) · r · m), which is less than or equal to the communication overhead when acting as a server. To summarize, the maximum local communication overhead increases linearly with the number of nodes in the network for the federated and centralized approaches, whereas for distributed FixMatch it increases only with the degree of the network, i.e., the number of neighbors.

To measure the communication overhead, we performed a test that evaluated the number of communication messages sent before convergence. We tested on the UrbanSound8K dataset using federated training (FedAvg combined with FixMatch using a distributed dataset) and distributed FixMatch. To measure the number of communications needed for both algorithms, we set a cutoff accuracy that determined when the algorithm had converged. For federated training, the accuracy was examined on the server. For distributed FixMatch, the accuracy was examined on every node, since every node can act as a server.
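The measurement just described only requires per-node counters that are incremented whenever parameters are transmitted. A minimal sketch of such instrumentation is shown below; the names (MessageCounter, record_send) are illustrative, since the thesis only states that an individual counter per node is increased whenever a message is sent between two nodes.

# Minimal sketch of the message-counting instrumentation described above.
from collections import defaultdict

class MessageCounter:
    def __init__(self):
        self.sent = defaultdict(int)        # messages sent, keyed by node id

    def record_send(self, sender_id: int) -> None:
        self.sent[sender_id] += 1

    def total(self) -> int:
        return sum(self.sent.values())                  # "Total communications" column

    def most_strained(self) -> int:
        return max(self.sent.values(), default=0)       # "Most strained" column

# Usage inside the simulation: call record_send(node_id) every time a node
# transmits its parameters, then read total() and most_strained() once the
# cutoff accuracy (59%) has been reached.
counter = MessageCounter()
counter.record_send(0); counter.record_send(0); counter.record_send(3)
print(counter.total(), counter.most_strained())   # 3 2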
The results of the tests are presented in Table 4.3. In the communication tests, there are several things to note. For distributed FixMatch, the average node communicates an amount similar to that of the most strained node in the network. When using 10 nodes, the average node sent 242.4/10 ≈ 24.2 messages, while the most strained node sent 32.6 messages. This also holds when increasing the number of nodes in the network. On the contrary, for federated training, the most strained node is not close to the average communication overhead. The average overhead with 10 nodes is 158.0/10 = 15.8 messages, compared to the server, which has a communication overhead of 80 messages. Additionally, the communication overhead of the most strained node for distributed FixMatch is fairly constant, independently of the network size, as seen in Fig. 4.6. The maximum number of communications is connected to the degree of the network, which does not change when the total number of nodes increases. The total communication cost of distributed FixMatch is almost the same for 25 and 50 nodes. This is not in line with the theoretical prediction. This might be because in the theoretical analysis the algorithm stops after a certain number of rounds, while in these tests it stops at an achieved accuracy. The federated algorithm shows a linear increase in both the total communication cost and the maximum local cost with respect to the number of nodes.

Algorithm                         Total communications  Most strained
Distributed FixMatch, 10 nodes    242.4                 32.6
Distributed FixMatch, 25 nodes    396.4                 25.6
Distributed FixMatch, 50 nodes    387.8                 15.0
Federated training, 10 nodes      158.0                 80.0
Federated training, 25 nodes      607.0                 306.0
Federated training, 50 nodes      1670.0                840.0

Table 4.3: Number of communication messages sent for distributed FixMatch versus federated training until achieving a threshold accuracy.

Figure 4.6: The communication overhead of the most strained node for distributed FixMatch versus federated training.

5 Concluding remarks

In this chapter, we will first discuss the results from the previous chapter, especially in relation to the goals and challenges stated in the introduction chapter. After that, we will state the conclusions we think can be drawn from the results and the project in its entirety. Lastly, we will speculate about improvements that could be made to the method and what further research could be of interest.

5.1 Discussion

As shown in the results chapter, distributed FixMatch outperformed the supervised approach in the audio classification tasks. Furthermore, distributed FixMatch performs similarly to other state-of-the-art semi-supervised approaches. For instance, the performance of distributed FixMatch was 68.5% for 50 labels per class (500 labels in total) on the UrbanSound8K dataset, while semi-supervised learning on the same dataset achieved a maximum of 75.17% ± 1.52 [26] when using 600 labels in total. These two results indicate that distributed FixMatch was successful in leveraging unlabeled distributed data on many different clients. Furthermore, the accuracies of distributed FixMatch were achieved without sending any raw data, thus preserving the privacy of the participants. The comparable accuracy also suggests that the openSMILE feature set eGeMAPSv02 can be used for more problems than its original use case of speech recognition.
These results also imply that the simple augmentation of features with Gaussian noise could be a useful technique for pseudo-labeling. However, the accuracy advantage over the centrally trained supervised method varied with the dataset and the amount of labeled data. Therefore, the pseudo-labeling method may be more or less useful in different situations.

The theoretical and empirical analysis of the communication costs of distributed FixMatch shows that it scales better than its centralized alternatives. The theoretical analysis states that the maximum node cost should increase with O(degree(network) · r · m), where r is the number of rounds and m is the model size. This theoretical cost is in line with the test results (Fig. 4.6): the maximum local communication overhead does not increase with the network size. Furthermore, we can see in Table 4.1 that the size of the network has little impact on the accuracy of the models. In comparison, the federated algorithm (FedAvg combined with FixMatch) has a theoretical maximum local communication overhead of O(n · r · m), where n is the number of nodes, r is the number of rounds, and m is the model size. This theoretical analysis is also confirmed by the tests (Fig. 4.6). The centralized algorithm's (FixMatch) maximum node communication overhead also increases linearly with the number of clients, O(n · d); however, it depends on the data size instead of the model size. Therefore, which approach