Disturbance Detection and Classification in Large Microwave Networks Time series classification using deep convolutional neural net- works Master’s thesis in Computer Science TOBIAS OLAUSSON VICTOR SANDELL Department of Electrical Engineering CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2017 Master’s thesis 2017:105 Disturbance Detection and Classification in Large Microwave Networks Time series classification using deep convolutional neural networks Tobias Olausson Victor Sandell Department of Electrical Engineering Division of Communications and Antenna Systems Chalmers University of Technology Gothenburg, Sweden 2017 Disturbance Detection and Classification in Large Microwave Networks Time series classification using deep convolutional neural networks Tobias Olausson & Victor Sandell © Tobias Olausson, 2017. © Victor Sandell, 2017. Supervisor: Jonas Hansryd, Ericsson Supervisor: Alireza Sheikh, Department of Electrical Engineering, Chalmers Uni- versity of Technology Examiner: Alexandre Graell i Amat, Department of Electrical Engineering, Chalmers University of Technology Master’s Thesis 2017:105 Department of Electrical Engineering Division of Communications and Antenna Systems Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: A map showing Ericsson’s microwave network in Gothenburg. Typeset in LATEX Printed by Chalmers Reproservice Gothenburg, Sweden 2017 iv Disturbance Detection and Classification in Large Microwave Networks Time series classification using deep convolutional neural networks Tobias Olausson Victor Sandell Department of Electrical Engineering Chalmers University of Technology Abstract This thesis explores how disturbances in a microwave network can be detected and classified using neural networks. The data was segmented into chunks consisting of one day measurements in one link. Each segment was then classified as normal behavior, weather disturbances, or disturbances caused by the construction cranes. Additionally, a general class for other disturbances was also used. Two convolutional neural network structures were evaluated. One structure has a single link input, while the other uses the data of the nearby links. The networks were able to achieve an accuracy of 100% and 98%, respectively. This confirms that convolutional neural networks can be used to classify disturbances in a microwave network. Keywords: Machine Learning, Neural Network, Classification, Time Series v Acknowledgements First and foremost, we would like to thank our supervisor at Ericsson, Jonas Han- sryd, for his guidance throughout the project. We would like to thank our supervisor at Chalmers, Alireza Sheikh, for his insight surrounding the thesis. We would also like to acknowledge Lei Bao and Sima Shahsavari for their support and expertise. A thanks goes out to Olof Mogren, from the Computing Science division at Chalmers, for his support regarding machine learning. We would like to thank the owners of the microwave network, Hi3G Access AB, for providing the data for this thesis. Last but not least, we would like to express our gratitude toward Ericsson and also everyone else involved with the project. Tobias Olausson & Victor Sandell, Gothenburg, June 2017 vii Contents List of Figures xi List of Tables xiii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Theory 5 2.1 Microwave Transmission . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Quadrature Amplitude Modulation . . . . . . . . . . . . . . . 5 2.1.2 Attenuation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2 Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4.1 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5.2 Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5.3 Cross Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.6 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3 Methods 13 3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.1 Expected Behaviour . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.2 Common Disturbances . . . . . . . . . . . . . . . . . . . . . . 14 3.1.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1.4 Labelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . 15 ix Contents 3.2.1 TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.2 Neural Network Design . . . . . . . . . . . . . . . . . . . . . . 16 3.2.3 Parameter Settings . . . . . . . . . . . . . . . . . . . . . . . . 17 4 Results 19 4.1 Classification Performance . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.2 Network Comparison . . . . . . . . . . . . . . . . . . . . . . . 19 5 Conclusion 25 6 Future Work 27 6.1 Worldwide Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . 27 6.1.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 6.1.2 Location Dependence . . . . . . . . . . . . . . . . . . . . . . . 28 6.2 Network Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . 28 6.2.1 Semi-supervised . . . . . . . . . . . . . . . . . . . . . . . . . . 28 6.2.2 Additional Data . . . . . . . . . . . . . . . . . . . . . . . . . . 29 x List of Figures 2.1 The binary points of 16-QAM. . . . . . . . . . . . . . . . . . . . . . . 6 2.2 An example of a feed forward neural network. The circles represent neurons and the lines between the neurons are the weighted connec- tions. This network consists of three input neurons, one hidden layer with four neurons and the output layer with three neurons. . . . . . . 7 2.3 A segment, defined by the filter size, is extracted and given as an input to a neuron in the feature map. Subsampling using pooling is then performed and followed by inputting the results of the feature maps to a fully connected layer. . . . . . . . . . . . . . . . . . . . . . 8 2.4 An example of maximum pooling. . . . . . . . . . . . . . . . . . . . . 9 3.1 The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. . . . . . . . . . . . . . . . . . . . . . 13 3.2 An example of a weather disturbance. The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. 14 3.3 An example of a crane disturbances. The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. 15 3.4 The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. . . . . . . . . . . . . . . . . . . . . . 16 4.1 Comparison of zero-mean normalization (green line), with the zero- mean unit-variance normalization (blue line) for the input data of the single link structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.2 Comparison of the zero-mean normalization (green line), with the zero-mean unit-variance normalization (turquoise line) for the nearby link structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.3 The orange lines show the best obtained accuracy and loss when using the single link structure. The turquoise lines show the result of using the same structure, but with the addition of nearby links. . . . . . . . 21 4.4 The neural network structures with the highest obtained accuracy and lowest loss. The turquoise line represents the structure using the nearby links, and the orange line represents the structure using a single line link as an input. . . . . . . . . . . . . . . . . . . . . . . . . 22 xi List of Figures 4.5 This figure shows the structure of a convolutional neural network using time series data of the main link and its nearby links as an input data. The neural network consists of three convolutional layers, followed by dropout, and then supplied to a fully connected layer which returns a classification. . . . . . . . . . . . . . . . . . . . . . . 23 xii List of Tables 4.1 The results obtained from the best convolutional neural network struc- tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 The parameters for the best performing neural network using a single link as an input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.3 The parameters for the best performing neural network by addition of the nearby links as an input. . . . . . . . . . . . . . . . . . . . . . 22 xiii List of Tables xiv Acronyms 1-NN 1-Nearest Neighbour Adam Adaptive moment estimation ANN Artifical Neural Network API Application Programming Interface BIDMC Beth Israel Deaconess Medical Center CIFAR-10 Canadian Institute for Advanced Research 10 CNN Convolutinal Neural Network DNN Deep Neural Network GHz Gigahertz MC-DCNN Multi-Channels Deep Convolution Neural Networks MLP Multilayer Perceptron MNIST Modified National Institute of Standards and Technology PAMAP2 Physical Activity Monitoring Data QAM Quadrature Amplitude Modulation ReLU Rectified Linear Unit SGD Stochastic Gradient Descent xv 1 Introduction In recent years there has been a rapid increase in the amount of collected and stored data. The analysis and manipulation of big data is one of the important aspects of the ever growing computer industry. Companies collect large amounts of data from various services potentially, for further analyzing. Machine learning algorithms provide solutions which are efficient at analyzing and finding hidden data patterns. Machine learning techniques such as neural networks is used for a large variety of tasks related to data analysis such as image and speech recognition. Developing and applying machine learning algorithms to automate and improve data analytics is beneficial for companies, as it unlocks the potential use of the unused data. The continuing increase in computing power combined with the progress in the field of machine learning extends these possibilities further. There are currently (2016), approximately 4 million operating microwave links all over the world. These microwave links connect radio sites, and is used in addition to fiber and copper links [1]. The use of fiber is expected to increase, while copper is being phased out. Microwave links will still connect 65% of radio sites in 2021. Due to the size of the microwave networks located all over the world, large amounts of data is generated and machine learning solutions can potentially be beneficial to the network operators. 1.1 Background Ericsson collects data from a network of microwave links located in Gothenburg. The data includes the transmitted and received power for each link. The attenua- tion, deviation from the expected received power, is currently used to approximate precipitation in Gothenburg. When the received power is lower than the expected power, it is expected that the power attenuation was due to the rain or snow. While the changes in the received power are often caused by the weather, it is not always the case. Some of these disturbances are instead caused by construction cranes, trees, and other unknown disturbances. At this time, Ericsson does not have a way of knowing what is the cause of fluctuations and drops in the received power without manual analysis of the link’s data. In order to find any potential problems in the network, large amounts of data must be analyzed manually which is time-consuming. Time series analysis extracts patterns in time series data, in order to obtain useful features. The data gathered by Ericsson falls into the field of time series analysis since the data for the received power is directly linked to a point in time, forming 1 1. Introduction time series data. This thesis explores the application of machine learning algorithms on time series data in order to classify and separate anomalies from normal behaviour within a large microwave network. 1.2 Purpose The primary aim of this thesis is to explore the respective benefits of different machine learning algorithms for anomaly detection and classification for a single- variable time series. Furthermore, the findings will be used to simplify and improve efficiency of fault detection and handling in a microwave network. Our thesis will seek to answer the following questions: • How can problems be detected in a microwave network using the collected data about the received power over an extended period of time? • Which kind of input data and neural network structure produces the best classification results? 1.3 Limitations The data used for this thesis is only gathered for Gothenburg. As such, the program will only be required to work for this limited area. While some thought may be given to the future scalability, it will not be a high priority. Predicting faults may be even more useful than detecting them in real-time. How- ever, due to the limitations in time and data quality, this would be out of the scope of this thesis. The thesis will examine ways to differentiate between normal disturbances caused by precipitation and more significant disturbances. This thesis will be limited to identifying a few prominent categories among these significant disturbances. 1.4 Related Work There are a wide range of research topics surrounding the applications of machine learning for time series analysis. The main machine learning approach that will be explored in this thesis is the convolutional neural networks (CNNs), an extension of neural networks. Classification tasks usually require knowledge about the data in order for feature extraction by human experts. CNNs have the advantage of extracting and learning features, thus not requiring human experts [2]. This feature extraction capability is useful for all kinds of data. Since the data used for this thesis likely contains patterns and behaviour only found in microwave data, the automatic feature extraction of CNNs is applicable. CNNs have successfully been applied to time series data within a range of fields, including speech recognition and audio classification. The results obtained by [3] 2 1. Introduction show that the CNN, compared to a corresponding deep neural network (DNN), performed much better for the same speech recognition task. Furthermore, CNNs have been proven to be successful in image classification tasks. To make use of these results when classifying time series, the data can be transformed and represented as an image in order to treat the problem as an image classification task [4]. Another widely used method for time series classification is 1-nearest neighbour (1- NN) with dynamic time warping as a similarity measurement. This approach has been found to produce similar results to a multi-layer perceptron (MLP) with fully connected layers. However, in comparison to the convolutional networks, they have been shown to achieve lower classification accuracy when tested on 44 different time series data sets [5]. Multi-Channels Deep Convolution Neural Networks (MC-DCNN) [6], divides a multi- variate time series into univariate time series and performs feature learning on each time series separately using CNNs. The feature learning for each time series is then concatenated into a MLP, in order to carry out classification. MC-DCNN outper- form both 1-NN and MLP used on health care data. The health care data sets used are Physical Activity Monitoring Data (PAMAP2) and the Beth Israel Deaconess Medical Center (BIDMC) data set [7], containing multi-variate time series data. 1.5 Outline In section 2, the reader is introduced to the general concepts of microwave transmis- sion. This chapter presents the theory behind artificial neural networks and more specifically convolutional neural networks. Section 2 also presents the concepts used for preprocessing, and finally how training and validation is performed in an arti- ficial neural network. Section 3 introduces the methodology. This section begins with a description of the analyzed data, including classification and preprocessing. Section 3.2 presents the machine learning approach, the utilized machine learning library and how the network was set up. Section 4 presents the results of this thesis. The different network structures are compared with regards to accuracy, confidence, and time consumption. Section 5 includes a discussion and conclusion of the results. Finally, in section 6, the potential future work is discussed. 3 1. Introduction 4 2 Theory In this section the theory behind this thesis is presented, focusing on artificial neural networks and in particular convolutional neural networks. Additionally, the theory behind data preprocessing used for this thesis is introduced. 2.1 Microwave Transmission Microwave transmission is used to transfer information between two points, and is used in communication systems such as cellular and telecommunications. Mi- crowave frequencies range from 3 Gigahertz (GHz) to 300 GHz, which is equivalent to wavelengths between 10 and 0.1 centimeters [8]. Antennas are used to direct and transmit the microwave signal. A microwave network is constructed by plac- ing antennas, forming point-to-point communication links. Careful placement of the antennas allow for several links to use the same frequency. This is due to the fact that the microwave beams are highly narrow and precisely oriented, thus not interfering with other beams. Microwave transmission is done using high frequency which in turn leads to have a high bandwidth. However, due to the high frequency transmission, the signals have difficulties passing through mountains and other ter- rains. Therefore, it is essential to place the antennas strategically in order to avoid obstacles. 2.1.1 Quadrature Amplitude Modulation Quadrature amplitude modulation (QAM) is a method of encoding information in a microwave signal. Using QAM the microwave signal is composed of two underlying signals, cosin and sinusoidal functions, given in (2.1). The symbol ω denotes the angular velocity, t is the time, φ is the phase delay, Q, I, and A are the amplitudes. The resulting signal is a sinusoidal function with varying phase and amplitude based on the underlying values of Q and I. Either using the phase and amplitude of the final signal or by decoding the values of Q and I, one can infer the actual value of the signal as can be seen in Fig. 2.1. Q ∗ cos(ωt) + I ∗ sin(ωt) = A ∗ sin(ωt+ φ) (2.1) However, much higher modulations are possible. Since microwave signals are analog, they can encode an arbitrary number of values. This is, however, only true in a theoretical environment. In reality, there will always be noise in a signal, and as the modulation increases, the difference between signals become smaller. As such, 5 2. Theory a higher order of QAM demands a higher signal-to-noise ratio. Nowadays, some microwave transmissions are based on the 2048-QAM. The quality of microwave transmission varies significantly over time, especially when environmental factors such as rain is taken into account. Therefore, one can vary the modulation order to combat these factors and maintain the low error rate. Figure 2.1: The binary points of 16-QAM. 2.1.2 Attenuation Attenuation is the loss in power of a signal transmitted through a medium. Com- pared to lower radio frequencies, signals in the microwave range suffer more from attenuation caused by the rain. This becomes especially apparent at frequencies above 10 GHz [9]. Terrain, buildings and trees are obstacles that can cause attenu- ation, thus the path between the transmitting and receiving antennas must remain unobstructed. 2.2 Artificial Neural Networks Artificial neural networks (ANNs) is a prominent computational model in machine learning. ANNs are inspired by the network of neurons in a biological brain. An ANN consists of an input layer, an output layer and several hidden layers. The neurons are connected between the layers by weights that determine the output of each layer to the next. ANNs are trained instead of explicitly programmed, using training examples to achieve a desired output corresponding to the input by updating the weights between the neurons in the network. The method used to update the weights and biases between neurons is called backpropagation, which is explained 6 2. Theory further in section 2.5.1. An ANN with a sufficient amount of neurons and layers can be a exceptional tool for feature detection and classification. An example of an ANN can be seen in 2.2. Figure 2.2: An example of a feed forward neural network. The circles represent neurons and the lines between the neurons are the weighted connections. This network consists of three input neurons, one hidden layer with four neurons and the output layer with three neurons. There are different types of ANNs, e.g. feed forward neural networks and recurrent neural networks. Another type of ANN is the convolutional neural network, which will be described further in the next section. 2.2.1 Supervised Learning Training a neural network requires selection of the input and output format. One of these training schemes is called supervised learning. Training examples with a corresponding label are required in order to train a neural network using a super- vised approach. The goal is to achieve an inferred function from the training set, which produces accurate labels for unfamiliar data. This is the same as training the network to generalize. Supervised learning can in turn be grouped into further categories, classification and regression. In the classification approach, each training example is assigned a class, and the neural network is trained to output an integer corresponding to a class. In a regression network, each training example is instead labeled with a real value, and the output is also a real value. 2.3 Convolutional Neural Networks Conventional artificial neural networks have several downsides. Each of the neurons of a layer in a fully connected neural network, is connected to all of the neurons in the next layer, making the amount of neurons required for the large data infea- sible. CNNs handle large input by reducing the spatial size of the input. CNNs have performed exceptionally well in complex classification tasks, particularly in image classification [10, 11]. This section will describe the most commonly used components of a CNN, convolutional layers and pooling layers. 7 2. Theory 2.3.1 Convolutional Layer The primary component in a CNN is the convolutional layer. This layer consists of a set of filters. The filters in each layer is moved across the input in order to produce an activation map. This process is known as the convolutional step in the network. CNNs utilize sparse connectivity, meaning that only a portion of the input is given as an input to a neuron, as seen in Fig. 2.3. The spatial connectivity between the filter and the neuron is known as the receptive field. The stride of the filters decides how much a filter is moved in each dimension across the data. A stride of 1 results in a filter moving one data point in each step. The stride determines the dimensions of the output, since a smaller filter size leads to less overlapping of the receptive fields. Sometimes it is beneficial to obtain the same input and output dimensions, which can be achieved using padding. Padding is done by adding data, commonly zeroes, to the edges of the input. In order to adjust the amount of parameters, a filter in a convolutional layer uses the same weights and biases across the entire input space. The weight sharing makes the filters in the convolutional layer behave as feature maps. These feature maps achieve spatial translation. Since features can be located everywhere on the input, weight sharing is done. If spatial translation was to be removed, the training set would have to be significantly larger and generalization to new data would be worse. Figure 2.3: A segment, defined by the filter size, is extracted and given as an input to a neuron in the feature map. Subsampling using pooling is then performed and followed by inputting the results of the feature maps to a fully connected layer. After the convolution has been performed, an activation function is applied. Recti- fied linear unit (ReLU) (2.2) is one of the most commonly used activation functions. Compared to other activation functions, such as the hyperbolic tangent or the sig- moid function, ReLU achieves shorter training time [10]. f(x) = max(0, x) (2.2) 8 2. Theory 2.3.2 Pooling Layer The next major component in a convolutional network is the pooling layer. Pooling is performed in order to reduce the overall complexity, and is usually applied after a convolutional layer. In order to perform pooling, a filter is moved across the data, downsampling over each filter by discarding data. One of the most common pooling methods is max pooling. In each region, the maximum value is extracted, as shown in Fig. 2.4. Although pooling reduces the complexity of the data, [12] suggests that simply applying pooling after each convolutional layer reduces performance due to the spatial reduction. In [13], it is proposed to remove the pooling layer in favor of either increasing the stride of the convolutional layers, or replacing the pooling layer with a convolutional layer where the stride is larger than one. Figure 2.4: An example of maximum pooling. 2.4 Preprocessing Data used for machine learning purposes can be noisy and not always entirely re- liable. In order to achieve valuable results, some degree of data preprocessing is usually needed. The data used for this thesis is time series data. 2.4.1 Time Series A time series is a set of data points, each indexed at a point in time. The data points in a time series are collected at a fixed interval, where the frequency in which the data points are collected determines the resolution of the time series. Time series data can be represented as a function of time, f(t), where t is a point in time and f(t) is the corresponding value. 2.4.2 Normalization Normalization of the input data for the ANNs has been shown to make the network perform better generalization and achieve faster network training convergence [14]. Normalizing the input from one layer to another within the network during training enables the use of higher learning rates, and has the potential to eliminate the need for dropout [15]. There are different methods used for normalization, one being zero-mean and unit-variance, shown in (2.3). The normalized data is calculated by subtracting the mean and dividing by the standard deviation. 9 2. Theory x′ = x− x̄ σ (2.3) 2.5 Training In terms of neural networks, training is the process of iteratively updating the weights in the network to minimize the cost function. Common cost functions include 0-1 loss, mean squared error and cross-entropy, but the cost function can be defined in many different ways. During training of a neural network, the task is to fit the network to a training set. One of the problems which can arise while training a neural network is overfitting. Overfitting commonly occurs when the parameter size exceeds the number of training examples, or due to excessive training. This can result in the neural network learning features which are not found outside of the training data. Overfitting leads to noise in the data becoming more prominent while training, and the underlying pattern which the neural network should learn is instead obscured by the noise. Various methods exist to avoid overfitting, one example of such a method is dropout, which is described further in section 2.5.4. 2.5.1 Backpropagation A loss function is used to determine how well a model is performing. In the case of an ANN, the loss function is a value of how accurate the predictions are. In backpropagation the derivatives of the network weights are calculated with respect to the loss function. Based on the derivatives, the weights are then updated according to the chosen optimization method. Stochastic Optimization Gradient descent is a method used in conjunction with backpropagation to minimize an objective function, usually also referred to as the loss function, cost function or error function. Minimization is done stepping in the steepest direction using the derivative of the objective function, updated by iteration over each value in the data set. The new gradient is calculated according to (2.4). The goal is to find the value of θ which minimizes the objective function J(θ). η represents the step size, determining how much each iteration should affect the value of θ. θ := θ − η d dθ J(θ). (2.4) Stochastic gradient descent (SGD) is an extension of gradient descent which suc- cessfully handles large data sets [16]. In order to handle large data sets, the training data is randomly arranged, and each training sample is used to update the value of θ. (2.5) shows the update step when using SGD. Training item x and corresponding label y is a randomly chosen pair from the training set. Computing the gradient based on a high number of data points can lead to the incorrect steps, resulting in a higher value of J(θ). However, setting a low value for the step size achieves convergence by computing the gradient a large number of times. The fluctuations 10 2. Theory of the SGD algorithm enables it to potentially jump and find a better local minima. SGD in combination with backpropagation is the standard algorithm used for the ANNs. θ := θ − η d dθ J(θ;x(i), y(i)). (2.5) Adam Adaptive moment estimation (Adam), is an adaptation of stochastic gradient descent [17]. The previous updates contain information about the direction of the minimum. Adam utilizes the previous updates, adding momentum when calculating the next update. The direction of a minimum builds momentum, increasing the rate in which a local minima is reached. Adding a momentum component prevents heavy fluctuations due to constantly updating the direction. Another feature of Adam is the adaptive parameter updates. Parameters benefit differently for a set learning rate. Therefore, using separate learning rates for each individual parameter, the learning process becomes faster. Compared to the other optimization methods, Adam performs well in practice due to being computationally efficient and having low memory requirements. 2.5.2 Learning Rate The learning rate of a network is a measure of how aggressively it adapts according to the value of the loss function. A higher learning rate means that the network is quicker to adapt, but less capable of fine-tuning. If a network has too low learning rate, it will take longer to converge and increases the risk of converging to a local optima. Introducing a decay rate, causing the learning rate to be gradually lowered, can potentially allow the network to converge quickly and be tuned finely in the later stages of training. 2.5.3 Cross Entropy The squared loss function can be used in cases where a large deviation from the target should be punished more heavily than several smaller ones. This is due to the fact that a very large deviation will cause the learning rate to slow down. The cross-entropy cost function targets this weakness by making the learning rate be controlled by the error in the output [18]. 2.5.4 Dropout Dropout is the process of randomly discarding a portion of the hidden and input units. This is done to punish the network for over-reliance on a few units. In each training step, a neuron and its connections are dropped with a probability of 1 − p and retained with probability of p. After training, the resulting neural network can be seen as a combination of several smaller neural networks with the shared 11 2. Theory weights. The resulting neural network has less risk of overfitting and better ability to generalize [19]. 2.6 Validation Even with steps taken to avoid overfitting, there are no guarantees that it will not happen. Therefore, the accuracy and loss metrics from training can be misleading. It is therefore common to divide the data into training and validation sets. The validation set is not fed to the network during the training phase, but tested on after to give more reliable performance metrics. Feeding the validation set to the network outputs performance metrics about accuracy and loss. 12 3 Methods This chapter presents the methodology used in this thesis. The classifications used and the structure for the neural network is described. 3.1 Data The data used in the thesis consists of the received power from around 700 links in the area surrounding Gothenburg. The data has been logged every 10 seconds, beginning in 2015-04-28. The data was split into 24 hour segments, consisting of 8640 samples. 3.1.1 Expected Behaviour Most of the time the received power is expected to remain level. Variations tend to lie within 2dB of the mean, however exceptions to this exist for links in certain environments such as over water. An example of a link with received power in the expected range can be seen in Fig. 3.1. Figure 3.1: The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. 13 3. Methods The received power is impacted by environmental effects such as rain and snow. While this can cause significant drops in performance, the effects can only be miti- gated, not avoided completely. Regardless of whether it is raining or not, the links are expected to behave similarly to those nearby. Fig. 3.2 shows an example of a link affected by rain. The rain occurred at the same time as the drop in the received power, and the nearby links showed the same pattern. Figure 3.2: An example of a weather disturbance. The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. 3.1.2 Common Disturbances In urban environments, such as Gothenburg, construction sites and in particular the cranes used in the construction can interfere with the transmissions. These distur- bances cause a very distinct pattern in the received power as the cranes normally only move during the standard working hours. In fact, during the working hours the received power can fluctuate heavily in very short time spans. Fig. 3.3 shows a link with a crane disturbance. 3.1.3 Normalization The mean received power can vary greatly, but the deviations from the mean appear similarly for the same situations. As such, normalizing the data to a zero-mean makes a lot of sense. It is common to also normalize to unit-variance, however it leads to the information loss about the severity of the power fluctuations. 14 3. Methods Figure 3.3: An example of a crane disturbances. The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. 3.1.4 Labelling To use a supervised approach, the data should be labeled. While automatic labelling to some extent is possible, high number of the underlying parameters make it com- plex and unreliable. Therefore, human evaluation was used, where an expert user was given rain and snow data and received power measurements for both the link in question and links near it. The data was then classified into one of 4 different categories, normal behaviour, weather disturbance, crane disturbance and unknown disturbance. Fig. 3.4 shows the received power for a main link and the 4 closest links. Rain does not seem to affect the main link similar to the nearby links, and the pattern is very distinct, thus this example was labelled as an unknown disturbance. 3.2 Machine Learning Approach This section describes the machine learning library that was used to construct the neural network. The design of the neural network is presented, including the various parameters used. 3.2.1 TensorFlow The neural networks was designed using TensorFlow. TensorFlow is an open-source interface which allows for implementation and experimentation of machine learning algorithms [20]. The usage of both Central Processing Units (CPU) and Graphics Processing Units (GPU) is supported by TensorFlow, which is a necessity when handling large data sets. In addition to the TensorFlow interface, the high level Ap- 15 3. Methods Figure 3.4: The blue line in the top graph depicts the data for the main link over one day, and the red lines are the nearby links. The bottom graph shows the rain for the same day. plication Programming Interface (API) called TFLearn was used, which builds upon the basic functionality of TensorFlow and allows for more agile experimentation. 3.2.2 Neural Network Design Selecting a model which perform well, both in classification performance and time consumption, was crucial. A convolutional neural network (CNN) approach was chosen, since CNNs perform well in classification tasks with large input. The input size was structured to provide sufficient data, in order to improve feature extraction. The initial approach was a neural network structure which used a single link as the input. An advantage of this structure is the time consumption compared to a structure using a larger input size. When constructing the neural network, several data characteristics were taken into consideration. Since a link is expected to behave similarly to its nearby links, having the nearby links as the input data could potentially lead to better feature extraction. However, due to the input size of this neural network structure, the potential increase in classification performance comes at the cost of time consumption. The depth and the number of hidden layers in the neural network, affect the perfor- mance. Several hidden layers provide additional convolution, which in turn leads to improve feature extraction and classification performance. However, a neural net- work can be unnecessarily large, leading to increased time consumption. Between one and five convolutional layers were tested. There does not exist one correct architecture for convolutional neural networks in general. However, there are existing widely tested architectures for some specific data sets. While these architectures might be applicable to similar data sets, the data used for this thesis likely contains behaviour only found in microwave data. Thus, 16 3. Methods an architecture that achieves good results for microwave data had to be created. Several combinations of parameters were tested by trial and error. A high number of tests were performed with combinations of the parameters listed in the next section. Apart from the trial and error approach, existing CNN architectures were taken into consideration. 3.2.3 Parameter Settings In order to achieve a satisfactory result, a range of parameters for the neural network had to be considered. Several neural network structures with different combinations of learning rate, filters and pooling layers were tested to find out how to achieve the highest accuracy. Learning Rate The learning rate was set to 0.005 with an exponential decay of 0.99 for the first moment estimates and 0.999 for the second moment estimates. Filters A higher amount of filters increases training time, but does not necessarily improve the accuracy of the neural network. An arbitrary number of filters was chosen for each layer, ranging between 20 and 50 filters. Different filter sizes were used for each layer. Too small filter sizes can lead to an excessive amount of parameters within the network, and possibly also overfitting. Filter sizes between 5 and 100 were chosen as a baseline. Pooling Pooling can be useful due to reducing the complexity of the input data between layers. However, this is not always the case since pooling has a negative effect on classification performance due to the spatial reduction. Different neural network structures were tested, with and without pooling, in order to find out whether pooling improved the accuracy in conjunction with other parameter values. 17 3. Methods 18 4 Results This chapter presents the results from the convolutional neural network structures from chapter 3. A comparison of neural network structures with regards to accuracy and loss, is presented. 4.1 Classification Performance The results were gathered by running the neural networks for a set amount of epochs. The single link structure typically converged rather fast, and was thereby only run for 100-250 epochs. Training the structure using nearby links was more time consuming due to a larger input size and slower convergence. We found that adjusting the learning rate for different network structures did not seem to have a significant effect on the classification. The best obtained results can be seen in Table 4.1. Results CNN structure Validation Accuracy (%) Validation Loss Single link 100 0.02 Nearby link with single link parameters 98.2 0.1 Nearby link 98.2 0.07 Table 4.1: The results obtained from the best convolutional neural network struc- tures. 4.1.1 Normalization In Fig. 4.1, the comparison between the results obtained by normalizing the data to the zero-mean and unit-variance data and normalizing to only zero-mean data is shown. As can be seen, the results based on variance normalization yielded both a higher loss and lower accuracy. Similar results were obtained when comparing the same normalization methods in the neural network structure by addition of the nearby links as an input, in Fig. 4.2. 4.1.2 Network Comparison Two neural network structures were considered in this thesis. The first one is the convolutional networks with a single time series for a link as an input, while the 19 4. Results (a) Validation accuracy (b) Validation loss Figure 4.1: Comparison of zero-mean normalization (green line), with the zero- mean unit-variance normalization (blue line) for the input data of the single link structure. (a) Validation accuracy (b) Validation loss Figure 4.2: Comparison of the zero-mean normalization (green line), with the zero- mean unit-variance normalization (turquoise line) for the nearby link structure. second one used a main link’s time series data combined with the four nearest links’ time series data, requiring a two-dimensional convolutional neural network. The structure of the neural network using nearby links can be seen in Fig. 4.5. The networks consisted of three sequential convolutional layers. All convolutional layers used a ReLU activation function. After these layers, dropout was performed and ultimately there was a fully connected layer with four outputs, corresponding to the four different classes. Cross entropy was the loss function used for each of the neural network structures. We remark that Adam, see section 2.5.1, was selected as the optimizer. The best result produced by the neural network with a single link as an input, was obtained with the parameters listed in Table 4.2. Pooling layers were not utilized in conjunction with the listed parameters, since it resulted in higher loss and lower accuracy. We also found that training without pooling did not have a significant impact on the time consumption. The results obtained by adding nearby links as an input, compared to feeding the 20 4. Results Single Link Structure Number of filters Filter size Stride Convolutional layer 1 20 5 5 Convolutional layer 2 20 25 5 Convolutional layer 3 20 25 5 Table 4.2: The parameters for the best performing neural network using a single link as an input. main link can be seen in Fig. 4.3. The structure for the network without nearby links was the best one obtained through our experimentation. Using the same network structure, the network using nearby nodes takes significantly longer time to converge and results in a slightly lower accuracy and loss. (a) Validation accuracy (b) Validation loss Figure 4.3: The orange lines show the best obtained accuracy and loss when using the single link structure. The turquoise lines show the result of using the same structure, but with the addition of nearby links. However, the network using nearby nodes was able to perform significantly better using a different structure and setup parameters. The comparison of the two ap- proaches using the best tested structure can be seen in Fig. 4.4. This shows a much more similar performance to the neural network using a single time series as an input. Even with the improved structure, the network using nearby links takes longer time to converge. Furthermore, the increased input size causes training to be significantly slower. The best classification accuracy and loss for the network structure using the nearby links was achieved when applying larger filter sizes, compared to the structure using a single link. Pooling was also used in the neural network structure with the nearby links, as it showed improvements in both time consumption and classification per- formance. Furthermore, small filter strides for each convolutional layer produced the best result. The parameters used can be seen in Table 4.3. 21 4. Results (a) Validation accuracy (b) Validation loss Figure 4.4: The neural network structures with the highest obtained accuracy and lowest loss. The turquoise line represents the structure using the nearby links, and the orange line represents the structure using a single line link as an input. Nearby Links Structure Number of filters Filter size Stride Convolutional layer 1 20 100 5 Convolutional layer 2 20 100 5 Convolutional layer 3 20 100 5 Table 4.3: The parameters for the best performing neural network by addition of the nearby links as an input. 22 4. Results Figure 4.5: This figure shows the structure of a convolutional neural network using time series data of the main link and its nearby links as an input data. The neural network consists of three convolutional layers, followed by dropout, and then supplied to a fully connected layer which returns a classification. 23 4. Results 24 5 Conclusion Immense amounts of data is gathered by companies such as Ericsson. This data might be useful for other purposes, potentially providing a monetary benefit. How- ever, due to the sheer size of the available data, efficient and precise methods have to be devised. Machine learning provides the capability to manage, and find patterns in large amounts of data. In this thesis, we have evaluated large amounts of data which Ericsson has gathered from a microwave network in Gothenburg. A model, using a convolutional neu- ral network, which detects and classifies disturbances in a microwave network was developed. The neural network structure, achieving the highest classification performance was trained on 24-hour segments of the received power, each segment containing the data for one link. This neural network achieved a very high classification performance. A second structure was developed, which was trained based on the data of a main link and the data of the nearby links, for the same 24-hour segment. Although this neural network did not perform as we had hoped for our classifications, it could potentially perform better generalization for other disturbances. The obtained results are promising, and it is apparent that machine learning can be used to detect and classify disturbances in a microwave network. 25 5. Conclusion 26 6 Future Work In this chapter, possible future work is presented. Additional implementation and improvements for this project are discussed. 6.1 Worldwide Deployment Since the current results are promising, there is a possibility of deploying the program across the rest of world. With this, there will come many new possibilities and challenges. 6.1.1 Scalability One effect of including more areas will be a vast increase in the amount of the stored data. Therefore, scalability is an important issue. Training Time For every classified data sample, there will be one iteration required for each epoch in the training. As such, the training time for the network scales linearly with the amount of training data. This is only true excluding hardware limitations, however, and for example some changes may need to be made with respect to what data is held in random-access memory (RAM). The amount of labeled data required, however, will not be expected to increase linearly with the number of microwave links. Urban areas similar to Gothenburg will require few or no additional labels. Even a rural environment in a similar climate is expected to at least share the behaviour during precipitation. Overall, the training time would be expected to scale significantly better than linear scaling with respect to the number of microwave links, assuming no other types of labels are needed. Classification Time While training only need to be done when a significant amount of new labels have been added, classification need to be done every day for continuous updates. There- fore, the efficiency of the updates is an important issue. Each segment consists of one days measurements for one link, and each of these segments will have to be classified independently of the rest. Therefore, the time required for classifying the segments for one day can, at best, scale linearly with the number of links. 27 6. Future Work 6.1.2 Location Dependence The data used was only from Gothenburg. Since this is a homogeneous urban area, it might not be possible to simply deploy the same solution across the world. Other regions may have new types of disturbances that are not present elsewhere. One might, for example, have to take sand storms into account in desert areas, in which another classification is required to be added to the problem. In the areas where sand storms are not present, however, this means a potential classification which can only be wrong, never correct, has been added. This could reduce the classification performance. Another possible scenario is that the same type of disturbance could have different characteristics in different environments. A solution might be decentralizing the classifications. This means that we split the world into similar regions and consider independent classifications for each of them. This would require more classified examples but could potentially improve the performance. Another approach is to include metadata to the neural network, i.e., include data about the type of the environment, the link is located in. 6.2 Network Modifications Performing an exhaustive search for all possible network structures and parameters is not possible, so there is no guarantee that the currently highest performing structure is optimal. Tweaking parameters may allow for slight performance improvement but changing the type of neural network could unlock entirely new possibilities. 6.2.1 Semi-supervised One can change the approach to the semi-supervised one. This allows the network to make use of the unlabelled data, in addition to what has already been labelled. Self-classifying One of the simplest semi-supervised approaches is to let the network be self classified. This starts in the usual way by training the network and classifying all data points. However, afterwards, the most confident classifications are converted into real labels. The process is then repeated many times, each time giving the network more labelled data to train on. It has been shown that self-classifying can produce better performance [21], but these results may not be applied to our algorithm and dataset. Ladder Networks The semi-supervised ladder network structure has shown a lot of promise. It has achieved state of the art performance for classification in both semi-supervised Mod- ified National Institute of Standards and Technology (MNIST) and Canadian Insti- 28 6. Future Work tute for Advanced Research 10 (CIFAR-10) [22], two datasets for object recognition in images. Any conventional neural network can be converted to a ladder network. The initial structure is considered the clean encoder path. Another path, a corrupted encoder, is then added in which also adds noise to the signal in every layer. A denoising function is then used to reconstruct the original output of the activation function. The difference between the reconstructed output and the clean one is the denoising cost for each layer. The supervised cost is based on the difference in output between the corrupted encoder and the actual target. The unsupervised cost is simply the sum of the denoising costs for each layer. For the supervised data the final cost is the sum of the unsupervised and supervised costs. The ladder network can also be trained on unlabeled data, using only the unsu- pervised cost. While the ability to incorporate unlabeled data is a key feature, the ladder network architecture achieves good performance even in a fully supervised scenario [22, 23]. 6.2.2 Additional Data In the neural network structures that were evaluated in this thesis, only data of the received power was used as an input. The dataset that was used contained both transmitted and received power, but it is possible to store additional variables with configuration of the data collection. In [6], an approach which uses multi-variate time series is proposed. Such a method could potentially be used, since additional variables create new time series. This variables from the microwave network might provide useful features for the neural network. Furthermore, metadata can possibly be given as an input to a separate fully connected layer in the neural network. Metadata such as transmission frequency or the hop length of the links, could provide additional information concerning the behaviour of a link. The number of classes used for the neural networks were limited to four, but it is possible to add additional categories. Overwater links are located across a pool of water such as a lake or the ocean. These links show a pattern of high fluctuations due to surface reflection and movement of the water. Multipath, which is due to receiving the transmitted signal from several paths, causing interference, could be possibly added. Additionally, during maintenance, a link’s received power drops significantly, showing a clear pattern since maintenance is usually scheduled for evenings and nights. It would also potentially be possible for maintenance workers to report a link which is undergoing maintenance, in order to be classified and used as the training data. These are a few examples of behaviours in the microwave network which can be categorized and classified. 29 6. Future Work 30 Bibliography [1] Ericsson, “Ericsson microwave outlook, trends and needs in the microwave industry,” [Online] Available: https://www.ericsson.com/assets/local/ microwave-outlook/documents/ericsson-microwave-outlook-report-2016.pdf, 2016. [2] M. Längkvist, Modeling time-series with deep networks. PhD thesis, Örebro university, 2014. [3] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transac- tions on audio, speech, and language processing, vol. 22, pp. 1533–1545, Jul. 2014. [4] Z. Wang and T. Oates, “Encoding time series as images for visual inspection and classification using tiled convolutional neural networks,” in Proc. Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, Jan. 2015. [5] Z. Wang, W. Yan, and T. Oates, “Time series classification from scratch with deep neural networks: A strong baseline,” CoRR, vol. abs/1611.06455, Dec. 2016. [6] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao, “Time series classifica- tion using multi-channels deep convolutional neural networks,” in International Conference on Web-Age Information Management, pp. 298–310, Springer, Jun. 2014. [7] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000. [8] D. M. Pozar, Microwave Engineering, vol. 4. Wiley, April 2012. [9] S. Das, A. Maitra, and A. K. Shukla, “Rain attenuation modeling in the 10- 100 ghz frequency using drop size distributions for different climatic zones in tropical india,” Progress In Electromagnetics Research B, vol. 25, pp. 211–224, Sep. 2010. [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Pro- cessing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Wein- berger, eds.), pp. 1097–1105, Curran Associates, Inc., Dec. 2012. [11] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proc. Computer Vision and Pattern Recognition (CVPR), pp. 3642–3649, Feb. 2012. [12] B. Graham, “Fractional max-pooling,” CoRR, vol. abs/1412.6071, Dec. 2014. 31 Bibliography [13] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for simplicity: The all convolutional net,” CoRR, vol. abs/1412.6806, Mar. 2014. [14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train- ing by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [15] S. Wiesler, A. Richard, R. Schluter, and H. Ney, “Mean-normalized stochastic gradient for large-scale deep learning,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), pp. 180–184, May 2014. [16] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, Jun. 2010. [17] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [Online] Available: https://arxiv.org/abs/1412.6980, 2014. [18] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015. [19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, Jan. 2014. [20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Ten- sorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Soft- ware available from tensorflow.org. [21] N. Fazakis, S. Karlos, S. Kotsiantis, and K. Sgarbas, “Self-trained lmt for semisupervised learning,” Computational intelligence and neuroscience, vol. 2016, p. 10, Nov. 2016. [22] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko, “Semi- supervised learning with ladder network,” CoRR, vol. abs/1507.02672, Nov. 2015. [23] M. Pezeshki, L. Fan, P. Brakel, A. Courville, and Y. Bengio, “Deconstruct- ing the ladder network architecture,” in International Conference on Machine Learning, pp. 2368–2376, May 2016. 32 List of Figures List of Tables Introduction Background Purpose Limitations Related Work Outline Theory Microwave Transmission Quadrature Amplitude Modulation Attenuation Artificial Neural Networks Supervised Learning Convolutional Neural Networks Convolutional Layer Pooling Layer Preprocessing Time Series Normalization Training Backpropagation Learning Rate Cross Entropy Dropout Validation Methods Data Expected Behaviour Common Disturbances Normalization Labelling Machine Learning Approach TensorFlow Neural Network Design Parameter Settings Results Classification Performance Normalization Network Comparison Conclusion Future Work Worldwide Deployment Scalability Location Dependence Network Modifications Semi-supervised Additional Data