Human activity classification using simulated micro-Dopplers and time-frequency analysis in conjunction with machine learning algorithms: a comparative study for automotive use

Master's thesis in Communication Engineering
Fredrik Axelsson & Pavel Gueorguiev

Department of Electrical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2017
Master's Thesis 2017:08

© Fredrik Axelsson & Pavel Gueorguiev, 2017.

Supervisor: Amer Nezirovic, Volvo Cars Company
Examiner: Lars Hammarstrand, Signals and Systems

Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Smoothed Pseudo Wigner-Ville Distribution highlighting micro-Dopplers generated by a walking human.

Abstract

As vehicle automation becomes more common and encompasses more vehicle functionality, early detection of vulnerable road users (VRUs) becomes a greater concern. One part of addressing this is to use on-board radars to detect micro-Doppler (µ-D) signatures associated with VRUs and classify them to give early warning. With time-frequency analysis, µ-D signatures can be extracted, and in conjunction with machine learning algorithms (MLAs) they can also be classified. In this thesis work, done at Volvo Cars Company (VCC), different combinations of algorithms for time-frequency analysis and machine learning are compared to determine what is suitable in an automotive context.

A µ-D radar return was simulated using data from the Motion Capture database available at Carnegie Mellon University's graphics lab. Three different human activities were available for classification: walking, running and boxing. The µ-D signatures were generated using four time-frequency analysis algorithms: Short-Time Fourier Transform (STFT), Continuous Wavelet Transform (CWT), Smoothed Pseudo Wigner-Ville Distribution (SPWVD) and Empirical Mode Decomposition (EMD). The signatures extracted using STFT, CWT and SPWVD were in image format and were classified using two machine learning algorithms: Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN).

The algorithms were applied to both noisy and noiseless data. The accuracy of classification on noisy data varied from 69.23% to 100% depending on what combination of algorithms was used. CWT in combination with an ANN classified perfectly, likely because the data set was too small for any errors to appear. STFT and SPWVD in conjunction with CNNs were found to perform very similarly to each other. EMD coupled with a CNN proved to be promising, with a classification accuracy of 97.50% on noisy data.
The SPWVD algorithm was found to be unsuitable for on-board vehicular use due to extensive computation times without any major performance gain. Other algorithms performed within more reasonable time frames, but only the EMD was fast enough to work in a live traffic situation, with an average computation time of 0.05 seconds. At this speed a complete classification was possible, using a CNN, in less than 0.075 seconds.

Keywords: micro-Doppler, STFT, EMD, Time-Frequency Analysis, CWT, SPWVD, ANN, CNN, Automotive, Vulnerable Road Users.

Acknowledgements

I would like to thank my sister and parents for their help and support in this endeavour. I am also grateful to our examiner Lars Hammarstrand for his advice during the tough times of this thesis.

Fredrik Axelsson, Gothenburg, August 2017

I would like to give thanks to my brother, mother and father. With my friends I would feast. Lars, my advisor. And the queen in the villa. The die is cast.

Pavel Gueorguiev, Gothenburg, August 2017

Contents

1 Introduction
   1.1 Thesis Objectives
   1.2 Related Work
   1.3 Scope and Limitations
   1.4 Contributions
   1.5 Thesis Outline
2 Theory
   2.1 Doppler and µ-Doppler
   2.2 Radar Basics
      2.2.1 Simulating a radar response
   2.3 Time-Frequency analysis: Doppler Extraction
      2.3.1 The Short-Time Fourier Transform
      2.3.2 Continuous Wavelet Transform
      2.3.3 Smoothed Pseudo Wigner-Ville Distribution
      2.3.4 Empirical Mode Decomposition
   2.4 Machine Learning algorithms
      2.4.1 Deep learning and Artificial Neural Networks
      2.4.2 Initializing and Training
      2.4.3 Backpropagation and the Loss function
      2.4.4 Optimization, Dropout and Testing
      2.4.5 Testing NN performance
      2.4.6 Convolutional Neural Networks
         2.4.6.1 Running the network
         2.4.6.2 Max-pooling, FC Layer and Readout Layer
3 Methods
   3.1 Data
      3.1.1 Choosing MOCAP data
      3.1.2 Preparing MOCAP data
      3.1.3 Generating a radar response
   3.2 Doppler Extraction configurations
      3.2.1 The STFT algorithm
      3.2.2 CWT using the Morlet mother wavelet
      3.2.3 SPWVD
      3.2.4 EMD
   3.3 ML Classification
      3.3.1 Artificial Neural Networks
      3.3.2 Convolutional Neural Networks
      3.3.3 CNN Design
4 Results
   4.1 DE algorithm computation times
   4.2 ANN performance
   4.3 CNN performance
5 Discussion
   5.1 Limitations with MOCAP data and simulated radar returns
   5.2 Choice of window functions and mother wavelet
   5.3 Choices regarding EMD
   5.4 Regarding noise
   5.5 Computation times - C vs MATLAB
   5.6 Comments on classification
6 Conclusion
   6.1 Future work
Bibliography
1 Introduction

Early detection and reliable classification of humans is key to avoiding collisions between motor vehicles and exposed humans. Experiments with radar within the automotive industry started as early as the late 1950s, and the technology has since matured significantly. As radar technology evolved, different frequency ranges and radar types became relevant for automotive use. Collision avoidance has been a powerful driver in the development of such technology, and with the evolution of Monolithic Microwave Integrated Circuits and micro-controller processing power, automotive radars have become much more commercially viable. Cheap modern sensors and electronics combined with modern communication technology have resulted in vehicle manufacturers including several sensors in their products, some of which are radar units. As car accidents are a source of fatalities and serious injuries across the globe, widespread use of such technology could save many lives each year. One example of a successful application is when the Greyhound Lines bus company installed 24 GHz radars on 1600 buses in 1993, resulting in a 21% reduction in accidents during their first year of use [1].

A radar detects movements of its targets in terms of Doppler shifts.
Pedestrians, bicyclists and strollers are all examples of Vulnerable Road Users (VRUs) whose movements give rise to so-called micro-Doppler (µD) signatures. These µD signatures can be used to not only detect but also classify VRUs, and quick, reliable classification would be a powerful tool for accident avoidance. Since the original observation of µD signatures by Victor C. Chen in his defining work [2], their use has been evaluated for a plethora of applications. In addition to the aforementioned automotive case, applications include distinguishing civilians from combatants, reconnaissance and security.

Separating µD signatures into their different frequency components has sparked interest in the signal processing community in the search for good algorithms for the problem. The process of generating these time-frequency maps of µD signatures is called Doppler Extraction (DE). Time-frequency analysis algorithms such as the Short-Time Fourier Transform (STFT), Continuous Wavelet Transform (CWT) and Wigner-Ville Distribution (WVD) have been used for DE. Different algorithms excel at different things: the STFT has been shown to do well on parameter estimation [3], while the CWT, for example, has been shown to do well on classification. The WVD has performed very well for classification at the cost of introducing additional complexity, caused by frequential cross-terms complicating the analysis. Added complexity may be acceptable depending on the application, but in an automotive context computation power is a limited resource and may require a simpler approach. DE has also been done using Independent Component Analysis (ICA) and Principal Component Analysis (PCA) [4], where simulated data with µD signatures was generated, extracted and used for classification. Lastly, Empirical Mode Decomposition (EMD) has been shown to be a potentially powerful method to extract the time-frequency information in a signal by separating it into its frequency components [5].

Classification also requires the use of Machine Learning (ML) algorithms, where different algorithms may perform differently. This is an open topic for research, but examples of algorithms put forward are Support Vector Machines (SVMs), Dynamic Time Warping (DTW) [7], k-Nearest Neighbour (k-NN) [4], Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs).

An ANN has been tested in combination with spectrograms generated by an STFT as the DE algorithm, where 12 human subjects performed seven different activities to be classified; performance varied between 82.7% and 87.8% depending on the validation scenario [8]. An SVM using six features has been shown to successfully classify different human activities in a controlled environment with an accuracy of over 90% in a similar study using the same data set [9]. Lastly, the author of the above-mentioned papers did a study using a CNN with spectrograms as input; it was able to distinguish humans from animals and cars, classifying successfully with 97.6% accuracy [10].

There are plenty of algorithms to choose from, with regard to both DE and ML, and there is no obvious answer as to which combinations perform best in an automotive context. This is due to the trade-offs between complexity, observation duration and the size of the data set needed for training and testing. Both the choice of DE algorithm and the choice of ML algorithm impact results, meaning there are numerous possible combinations.
1.1 Thesis Objectives

The main goal of this thesis is to evaluate combinations of DE and ML algorithms in order to determine their performance regarding VRU classification. As some DE and ML algorithm combinations may perform better than others in the automotive context, this could be useful information for future work on VRU classification. The objectives of this work can be divided into the following:

• Simulate a radar response given measured motion-capture data.
• Apply a selection of DE algorithms, namely STFT, CWT, WVD and EMD, to the data set.
• Use the output from the DE algorithms as input to the selected ML algorithms: ANN and CNN.
• Evaluate and compare performance in terms of classification accuracy, computation cost and observation time.

1.2 Related Work

The data set used in this thesis originates from Carnegie Mellon University. It is called the CMU Graphics Lab Motion Capture Database [11] and is freely available on their website. Several subjects have been recorded performing different activities while wearing 41 markers, saved as movements in space over time in the form of the Cartesian coordinates of the markers. How this data is processed and used is discussed in depth in Chapter 3.

A crucial part of the thesis builds on the simulation of a radar response given the Cartesian coordinates over time. A model for radar response simulation was proposed in a paper [12] using the Boulic model [13] as input data. The same simulation model was also shown to work with a Kinect sensor generating the data set [14]. The same study also compares CMU Motion Capture (MOCAP) data with the data from the Kinect sensor. This radar response model is discussed in depth in Chapter 3.

In the present work the Time-Frequency Toolbox (TFTB) [15] is used. It was developed primarily by François Auger and Patrick Flandrin in association with Centre National de la Recherche Scientifique (CNRS) and is an extensive toolbox available for MATLAB and GNU Octave. It contains many time-frequency analysis tools and algorithms, among them some used in this thesis; specifically, the STFT, CWT and SPWVD are all available in the toolbox. The framework has been subject to modifications, but the algorithms themselves have remained unchanged.

1.3 Scope and Limitations

There is an obvious limit to how many DE algorithms can be used. There are numerous possible algorithms and only a few can be evaluated. The four DE algorithms mentioned above are the focus of the thesis, based on how promising they have proven in previous studies. Overly complex algorithms can be excluded out of hand despite having shown promise, as they would be practically unfeasible for use in a vehicle with very limited computation power.

The thesis was originally intended to use measured data of multiple subjects walking, jogging, running and bicycling. Measurements were both planned and conducted, but extensive data pre-processing was needed for which software was unavailable. Because of this the thesis uses MOCAP data recorded by CMU, which limits the number of possible subjects and the activities they perform. There needs to be sufficient variation in subjects and numerous repetitions for the training, validation and testing of ML algorithms to work properly; otherwise they may not achieve their real potential. The choice of activities is dictated by what data is available. This work has limited itself to "Walking", "Running" and "Boxing".
The choice of ML algorithms is limited by their performance in previous studies and the fact that designing, training and testing can take a significant amount of time. Any method that requires a library to be present, such as k-NN or DTW, can be excluded, as it is not realistic for every vehicle to carry a large and ever-growing database with it. SVMs require features to be extracted, which means that as more classes arise, more features are required. ANNs and CNNs have both been shown to work in combination with µD signatures in previous studies, and these algorithms also have plenty of tools available due to their popularity. Because of this the thesis limits itself to examining these algorithms.

1.4 Contributions

This thesis work has resulted in the following contributions:

• Radar responses were simulated given MOCAP data and used as input to the STFT, CWT, WVD and EMD algorithms.
• An ANN and two CNNs were designed and implemented for comparison, to classify human activities with the help of the above time-frequency analysis algorithms.
• ANN and CNN performance in conjunction with STFT, CWT, WVD and EMD was evaluated. Different observation times were part of the evaluation.

The combination of CNN and EMD was a novel method of human activity classification. EMD showed much promise even with a small network. CWT together with the ANN also proved to be a strong classifier.

1.5 Thesis Outline

This report describes the work done during the thesis. The structure is as follows. The Theory section, Chapter 2, starts with relevant radar basics. This is followed by a presentation of the time-frequency analysis algorithms used, along with the ML algorithms that have been part of the work. The Methods section, Chapter 3, follows, where the process behind data acquisition and refinement is explained in detail. The intricacies of how the different time-frequency analysis algorithms were applied are also presented, and lastly the methodology behind the design of the ANN and CNNs is described. Finally, Chapters 4, 5 and 6 cover the Results, Discussion and Conclusion respectively. In the Results the performance of different combinations of time-frequency analysis algorithms and ML algorithms is presented for different kinds of input data. In the Discussion the data, models and algorithms chosen are discussed along with reflections on the results. The conclusions of the work and suggestions for future work make up the last part of the report.

2 Theory

This section presents the theory behind µ-Doppler signatures, the simulation model used, and the DE algorithms. Finally, how ANNs and CNNs work is covered.

2.1 Doppler and µ-Doppler

The frequency shift caused by the relative motion between a transmitter sending a wave and the object reflecting it is called a Doppler shift, and the phenomenon is called the Doppler effect. Mathematically, the frequency of the observed reflected waveform, f, can be described as follows, with f_0 being the transmitted frequency, c the speed of light through the medium and ∆v the relative speed between the transmitter and reflector:

$$f = \left(1 + \frac{\Delta v}{c}\right) f_0 \tag{2.1}$$

where the Doppler shift ∆f is

$$\Delta f = \frac{\Delta v}{c} f_0 \tag{2.2}$$

and the observed frequency, transmitted frequency and Doppler shift are all related through $\Delta f = f - f_0$.
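As a quick numerical illustration of equations (2.1) and (2.2), the MATLAB snippet below evaluates the Doppler shift for a 77 GHz carrier (a common automotive band, mentioned in Section 2.2) and an assumed relative speed of 2 m/s; both values are illustrative, not parameters used later in the thesis.

```matlab
% Doppler shift for an assumed 77 GHz carrier and 2 m/s relative speed.
f0 = 77e9;          % transmitted frequency f0 [Hz], assumed
c  = 3e8;           % propagation speed c [m/s]
dv = 2;             % relative speed between transmitter and reflector [m/s]

df = dv / c * f0;   % Doppler shift, equation (2.2)
f  = f0 + df;       % observed frequency, equation (2.1)
fprintf('Doppler shift: %.1f Hz\n', df);   % prints about 513.3 Hz
```

Even a slowly moving pedestrian thus shifts the return by several hundred hertz at automotive carrier frequencies.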
If the object reflecting the wave has several parts moving at different speeds, this gives rise to what are called micro-Doppler (µD) signatures. The different parts of the object generate different Doppler frequencies on top of the main Doppler created by the torso, and the combination of these different frequencies is called a signature. An example of this would be a pedestrian, where arms, legs, torso and head all move at different relative velocities with regard to the transmitter.

Different human activities give rise to different µD signatures, and those signatures will often be distinct. However, they are subject to change as the orientation and velocity of the observed objects change, or as the angle of the main lobe of the array changes. This is most easily exemplified by a human walking on a flat surface. The arms, head, torso and legs all give rise to sinusoidal-like µDs with varying frequencies and amplitudes. The ensemble of these sinusoids forms the signature. This is illustrated in figure 2.1, where different body parts are marked at different Doppler frequencies. In the example the subject is walking.

Figure 2.1: Body parts are marked in red, with the torso and head in the center. Leg 1 is still while leg 2 moves, and vice versa. The same relationship can be seen for the arms. Color represents an amplitude related to the size of the target cross section, defined in Section 2.2. The image is a spectrogram generated by the movements of subject 16 during take 8 in combination with the STFT algorithm. Details regarding this can be found in Section 3.1 and Section 3.2.1.

Examples of spectrograms of subjects running and boxing can be seen in figure 2.2 and figure 2.3 respectively.

Figure 2.2: Spectrogram of a subject running. The signature has several abrupt changes.

Figure 2.3: Spectrogram of a subject boxing. The single bulge in the image is the punch being performed.

2.2 Radar Basics

Radars are powerful tools for determining the distances and velocities of targets, but if they are to perform well they must be designed with particular tasks in mind. The relationship between radar and target properties can be expressed with the radar equation,

$$P_r = \frac{P_t G_t G_r \lambda^2 \sigma}{(4\pi)^3 R^4} \tag{2.3}$$

where P_r is the received power, P_t is the transmitted power, G_t and G_r are the transmitter and receiver antenna gains respectively, and λ is the transmitted wavelength. σ is a property of the target called the cross section, which indicates how large the target is in the sense of reflectivity. Lastly, R is the distance between the radar and the target.

The radars used on vehicles are usually phased array radars, meaning they consist of an array of antenna elements where each element has a phase shifter. Manipulating the phase of different elements allows for transmission in different directions over up to 120° without any moving parts. The fact that nothing is moving internally within the radar units is essential if they are to be mounted on vehicles. In addition to this, carrier frequency greatly impacts radar size and the size of targets a radar can resolve; higher frequencies allow for smaller targets. A phased array radar using a high carrier frequency, for example 77 GHz, can be physically very small. This allows multiple radar units, each no larger than a few square centimeters, to be mounted on vehicles.
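To give a feel for the magnitudes in equation (2.3), the following sketch evaluates the radar equation for a set of assumed values; none of these numbers are taken from the thesis hardware, they are order-of-magnitude placeholders only.

```matlab
% Order-of-magnitude evaluation of the radar equation (2.3).
Pt     = 1;             % transmitted power [W], assumed
Gt     = 10^(25/10);    % transmitter antenna gain (25 dBi, assumed)
Gr     = Gt;            % receiver antenna gain, assumed equal
lambda = 3e8 / 77e9;    % wavelength at a 77 GHz carrier [m]
sigma  = 1;             % target cross section [m^2], roughly human-sized
R      = 50;            % radar-to-target distance [m]

Pr = Pt*Gt*Gr*lambda^2*sigma / ((4*pi)^3 * R^4);   % received power [W]
fprintf('Received power: %.3g W\n', Pr);
```

The R^4 term dominates: doubling the distance to the target cuts the received power by a factor of 16.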
To understand the radar data structure, the concepts of fast-time, slow-time and channels are essential. Fast-time corresponds to the sampling rate of the radar and is large compared to slow-time. Slow-time is the pulse number, with one pulse being a single instance of the Pulse Repetition Interval (PRI). The channels are the antenna elements, with each element being associated with a specific channel. The relationship between these three dimensions is typically illustrated with a radar cube, as seen in figure 2.4, which gives an idea of how radar data is stored.

Figure 2.4: A radar cube showing the relationship between the channels, slow-time and fast-time.

The radar units used on vehicles are also usually Frequency-Modulated Continuous-Wave (FMCW) radars. This is also the case for the ones used on Volvo Cars Company (VCC) vehicles. FMCW radars have the advantage that they transmit constantly with a sliding frequency, tracing out a swept waveform. This means the radar is not subject to the periods of waiting, called dead time, that other radar types are. An example of how this sort of radar works is shown in figure 2.5, where the magenta-coloured transmitted signal starts at f_1 and linearly increases to f_2, creating a saw waveform. The time delay ∆t between transmitting and receiving an echo of the same frequency is marked by the blue arrows. The measured frequency difference ∆f is marked by the red arrows. The green arrows highlight T, which is the PRI.

Figure 2.5: Illustration of a linear FMCW radar waveform from f_1 to f_2 with the transmitted signal coloured in magenta and the received echo coloured in cyan. The time delay ∆t is highlighted by the blue arrows, the frequency difference ∆f by the red arrows and the PRI T by the green arrows.

The distance R between the radar and the reflecting target can be calculated using the expression

$$R = \frac{c_0 |\Delta t|}{2} = \frac{c_0 |\Delta f|}{2\,(df/dt)} \tag{2.4}$$

where df/dt is the slope of the waveform. This slope, the frequency shift per unit of time, is also sometimes called the runtime frequency. The range resolution is directly related to the bandwidth and can be expressed as

$$\Delta R = \frac{c_0}{2(f_2 - f_1)}. \tag{2.5}$$

If the target is moving, information relating to its velocity is contained in the Doppler shift of the signal. This Doppler shift is acquired by comparing the transmitted and received frequencies and compensating for the runtime frequency.

2.2.1 Simulating a radar response

Assuming a pulsed Doppler linear FMCW radar with a single channel, the total return can be expressed as the sum over the m individual points observed [12][14],

$$s_h(n, t) = \sum_{i=1}^{m} a_{t,i}\,\mathrm{rect}\!\left(\frac{\hat{t} - t_{d,i}}{\tau}\right) e^{j\left[-2\pi f_c t_{d,i} + \pi\gamma(\hat{t} - t_{d,i})^2\right]} \tag{2.6}$$

where t̂ is the time relative to the start of each PRI T. The time t is related to t̂, the pulse number n and T through t = T(n − 1) + t̂. The amplitude is represented by a_{t,i}, meaning there is a specific amplitude for each point i observed at any given time. The time delay is represented by t_{d,i}. τ is the pulse width, c the speed of light, γ the chirp slope and f_c the center frequency of the transmitted signal. rect() refers to the rectangle function. The amplitude for some point i at some time t is [12]

$$a_{t,i} = \frac{G \lambda \sqrt{P_t \sigma_i \sigma_n}}{(4\pi)^{1.5} R_i^2 \sqrt{L_s}\sqrt{L_a}\sqrt{T_{sys}}} \tag{2.7}$$

where G is the antenna gain, P_t the transmit power, λ the transmitted center wavelength, σ_i the cross section associated with a given point, σ_n the noise standard deviation and R_i the distance of point i. √L_s and √L_a are the system losses and atmospheric losses respectively, and finally T_sys is the system temperature. For the purposes of this work, the radar response data needed to be over slow-time at the range bin of the peak power output. To achieve this, the two equations above can be combined into the final expression [14]

$$x_p[n] = \sum_{i=1}^{m} a_{t,i}\,\tau\, e^{-j \frac{4\pi f_c}{c_0} R_{d,i}} \tag{2.8}$$

The simulated radar response, the complex signal x_p[n], is then ready for further processing.
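A minimal sketch of how equation (2.8) can be evaluated over slow-time is given below. The per-point ranges and amplitudes are stand-ins: in the actual method they come from the MOCAP-driven motion model and equation (2.7), and the system constants shown here are assumptions rather than the parameters listed later in Table 3.1.

```matlab
% Sketch of equation (2.8): complex slow-time return from m scatterers.
fc  = 77e9;    % center frequency [Hz], assumed
c0  = 3e8;     % speed of light [m/s]
tau = 1e-6;    % pulse width [s], assumed
N   = 1024;    % number of pulses (slow-time samples)
m   = 31;      % number of body scattering points, e.g. one per marker

% Placeholder kinematics: replace with per-point ranges from MOCAP data.
R = 10 + 0.5*rand(m,1) + 0.01*randn(m,N);   % range of point i at pulse n [m]
a = ones(m,N);                              % amplitudes a_{t,i}, eq. (2.7)

xp = zeros(1,N);                            % simulated radar response
for n = 1:N
    xp(n) = sum(a(:,n) * tau .* exp(-1j*4*pi*fc/c0 .* R(:,n)));
end
```

The complex vector xp is what the DE algorithms of the next section take as input.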
2.3 Time-Frequency analysis: Doppler Extraction

To emphasize a µD signature in a signal, it is digitally processed using time-frequency analysis. In the context of µD signatures this procedure is known as Doppler Extraction. There are numerous algorithms for this purpose, and they all have their strengths and weaknesses. While all DE algorithms used are explained in more detail later in this section, it should be noted that they have some fundamental things in common. They all aim to extract time-frequency information from complex, or sometimes real, input signals. There is also a trade-off between time resolution, frequency resolution and complexity. In accordance with the Heisenberg-Gabor limit, it is not possible to simultaneously have high resolution in time and frequency [16]. It is, however, possible to work around this limitation with some of the methods presented in this section.

Different names are used to describe the resulting time-frequency representations, also known as time-frequency maps, generated by DE algorithms. One common name is spectrogram. In order to avoid confusion, that term will be used here to describe the output images from the STFT and SPWVD methods. The reader should, however, be aware that the term spectrogram, outside the context of this document, is sometimes used as a general name for time-frequency representations generated by any method presenting all the information in the same image. The results generated by wavelet transforms are called scalograms and will be referred to as such. In both spectrograms and scalograms the color axis indicates an amplitude related to the cross sections of the target observed.

2.3.1 The Short-Time Fourier Transform

The Short-Time Fourier Transform, STFT, is the simplest extraction algorithm considered and is essentially a windowed Fourier transform. It can be expressed as

$$X_s(\omega, \tau) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, w(t - \tau)\, dt \tag{2.9}$$

where w(t − τ) is a window function and τ is the center of the window. The signal x(t) is slid through the window with some overlap from one windowing to the next. This procedure brings forward the spectral content related to a given time interval. Along with the overlap, the size and shape of the window determine the resolution in time and frequency. For example, a regular Fourier transform can be thought of as an STFT with an infinite window: it has the best possible spectral resolution but no temporal resolution. The frequency and time resolutions ∆ω and ∆t are related through ∆ω∆t = C, where C is a constant, meaning there is a trade-off between the resolutions.

Consider a signal containing two frequency components that abruptly change at some point in time. Such a signal can be seen in figure 2.6, a time domain signal used as input to an STFT using a Kaiser window function. The result is the spectrogram seen in figure 2.7. An abrupt change in frequency over time is clearly visible in both figures.

Figure 2.6: An example signal containing two frequency components that abruptly change to two other frequency components at 25 000 samples while the amplitude remains unchanged.

Figure 2.7: The spectrogram resulting from the input signal in figure 2.6. The abrupt frequency change at 25 000 samples can clearly be seen at sample 400 of the spectrogram. The number of samples in the spectrogram differs from the original signal as a result of the size of the chosen window function.
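The following MATLAB sketch reproduces the idea behind figures 2.6 and 2.7 using the built-in spectrogram function (the thesis itself uses the TFTB implementation). The sampling rate, tone frequencies, window length and overlap are all illustrative choices; only the Kaiser window follows the text.

```matlab
% A two-component signal whose frequencies change abruptly, cf. fig. 2.6,
% analysed with a windowed Fourier transform, cf. fig. 2.7.
fs = 10e3;                              % sampling rate [Hz], assumed
t1 = (0:24999)/fs;   t2 = (25000:49999)/fs;
x  = [sin(2*pi*300*t1) + sin(2*pi*800*t1), ...
      sin(2*pi*150*t2) + sin(2*pi*1200*t2)];

win = kaiser(256, 5);                   % Kaiser window, as in the text
spectrogram(x, win, 200, 512, fs, 'yaxis');   % plot STFT magnitude
```

A longer window sharpens the frequency axis at the cost of smearing the abrupt change in time, which is the ∆ω∆t = C trade-off described above.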
2.3.2 Continuous Wavelet Transform

The Continuous Wavelet Transform is a popular algorithm for image processing and time-frequency representation. It adapts its shape to give good resolution in time and frequency as needed. Wavelet transforms use a function ψ called the mother wavelet, an oscillation of finite length with a shape chosen in relation to the shape of the input signal. The CWT F(a, b) for some wavelet ψ can be described as

$$F(a, b) = \frac{1}{\sqrt{|a|}} \int_{\mathbb{R}} \psi^*\!\left(\frac{t - b}{a}\right) f(t)\, dt \tag{2.10}$$

where f(t) is the input signal, a the scaling factor and b the translation. The scaling factor compresses or dilates the mother wavelet while the translation factor shifts it along the time axis. This ability to manipulate the wavelet through two parameters allows for powerful time-frequency analysis when the input signal has rapid and abrupt changes.

The performance of the transform is strongly tied to the choice of mother wavelet, as its shape should correspond to the shape of the signal analysed. This is both a strength and a weakness: it allows for much versatility when analysing a specific signal, but may suffer if input signals differ too much in their behavior. In this work the chosen mother wavelet is the Morlet wavelet, a shape that is popular in the field [17]. As mentioned in Section 2.1, the expected shape is sinusoidal or partially sinusoidal. The Morlet wavelet is mathematically described as [30]

$$\Psi(t) = e^{j2\pi\omega_0 t}\, e^{-t^2/\sigma} \tag{2.11}$$

where ω_0 is the central frequency and σ is the bandwidth parameter. A Morlet wavelet is easily generated in MATLAB and has the shape seen in figure 2.8.

Figure 2.8: The shape of a Morlet wavelet, where the X-axis and Y-axis are generic indications of time and amplitude respectively.
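Below is a bare-bones Morlet-based CWT following equations (2.10) and (2.11), implemented as a correlation of the signal with the scaled wavelet at each scale. The test signal, scale grid and wavelet parameters are illustrative assumptions; the thesis itself uses the TFTB toolbox rather than this hand-rolled version.

```matlab
% Hand-rolled Morlet CWT: correlate the signal with scaled wavelets.
fs  = 1e3;                      % sampling rate [Hz], assumed
t   = -2:1/fs:2;                % wavelet support [s]
w0  = 5;                        % central frequency of the mother wavelet
sig = 0.2;                      % bandwidth parameter

x      = chirp(0:1/fs:2, 20, 2, 120);     % assumed test signal
scales = 2.^linspace(-2, 2, 32);          % scale grid a

F = zeros(numel(scales), numel(x));
for k = 1:numel(scales)
    a   = scales(k);
    psi = exp(1j*2*pi*w0*(t/a)) .* exp(-(t/a).^2/sig);  % eq. (2.11), scaled
    % Correlating with the shifted, scaled wavelet implements eq. (2.10):
    F(k,:) = conv(x, conj(fliplr(psi)), 'same') / sqrt(abs(a));
end
imagesc(abs(F));                % scalogram: scale versus translation b
```

Small scales compress the wavelet and resolve high frequencies sharply in time; large scales dilate it and resolve low frequencies sharply in frequency, which is the adaptive behavior described above.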
In this work the Smoothed Pseudo Wigner-Ville Distribution (SPWVD) is used in order to minimize the cross-terms that arise from bilinear transforms. This obviously adds complexity to the method but also improves its quality. The mathematical expression for the SPWVD is [19], SPWVD(t, f) = ∫ ∞ −∞ ∫ ∞ −∞ g(u)h(τ)x(t− u+ τ 2)x∗(t− u− τ 2)e−j2πftdudτ (2.14) where g(u) and h(τ) are real-valued window functions and x(t) is the input signal. One window function is chosen in regards to performance in the time domain and the other in regards to performance in the frequency domain. 2.3.4 Empirical Mode Decomposition Empirical mode decomposition (EMD) is an algorithmic approach to analyze non- stationary and non-linear time varying signals. A non-stationary process refers to one that does not change when shifted in time. In other words the mean and vari- ance, if defined, remain the same. EMD is a integral part of the Hilbert-Huang transform and breaks down signals into several components allowing for good reso- lution for each particular component [20]. The purpose of the EMD is to break down the signal into Intrinsic Mode Functions (IMFs) [21]. The idea of the algorithm is simple. Given an observation x(t) a transform is applied to get a representation in the form: x(t) = K∑ k=1 ak(t)φk(t) (2.15) 14 2. Theory The algorithm performs best when a signal is composed of a fast oscillation on top of a slow oscillation. The algorithm locally identifies the fastest oscillation, removes it from the original signal, and then continues iterating. Pseudo-code for this algorithm is as follows: 1. Identify local maxima and minima of the signal 2. Form an upper and lower envelope by interpolation, usually cubic splines (a) subtract the average of the two envelopes from the signal. This is called an iteration (b) iterate until: number of extrema is equal to number of zeros ±1 3. Now the IMF is obtained, saved and subtracted from the signal 4. The remaining signal is called the residual. The process is continued, gener- ating IMFs until a certain number of iterations has been completed. Consider the signal x(t) below, x(t) = sin(πt) + sin(2πt) + sin(6πt) + sin(13πt) + sin(17πt) (2.16) The signal above can be seen at the top of figure 2.9 after which three IMFs follow. The residual is seen in figure 2.10 at the very bottom of the image. Figure 2.9: The signal described by equation 2.16 can be seen at the very top with the three extracted IMFs following. 15 2. Theory Figure 2.10: Residue remaining after employing the EMD algorithm on the original signal. The EMD can be expanded to take complex arguments by the use of the bivariate EMD algorithm, where the signal is no longer contained in a fast varying envelope but rather a rotating cylinder travelling through three-dimensional complex space. 2.4 Machine Learning algorithms Machine Learning (ML) algorithms are a class of algorithms aiming to give com- puters the ability to perform specific tasks without explicitly being programmed to do so. More concretely ML algorithms have been summarized by Tom M. Mitchell with the succinct quote [22]: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." Experience refers to the data-driven approach where the program is taught to "learn" through examples. Generally speaking as more examples are fed into a ML algorithm the performance improves. 
2.4 Machine Learning algorithms

Machine Learning (ML) algorithms are a class of algorithms aiming to give computers the ability to perform specific tasks without explicitly being programmed to do so. More concretely, ML algorithms have been summarized by Tom M. Mitchell with the succinct quote [22]:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

Experience refers to the data-driven approach, where the program is taught to "learn" through examples. Generally speaking, as more examples are fed into an ML algorithm, its performance improves. The task could be anything from classifying images into categories to identifying a speaker from a voice recording. Performance is measured by how well the ML algorithm performs this categorization.

ML algorithms can be divided into the categories supervised and unsupervised. Supervised learning refers to a training process in which the examples are coupled with the correct answer, called a label or tag, which the algorithm uses to alter its state and improve as more training examples are supplied. A simple example: if the task is to classify animals, the training examples will consist of images of an assortment of animals while the labels will be the corresponding names, e.g. an image of a cat paired with the label "cat", an image of a dog paired with the label "dog" and so on.

Unsupervised learning refers to training sets where the labels are not provided, such as a cluster of points on a two-dimensional Cartesian plane. The task of the machine is then to find its own way to separate the data set.

There is a lot of variation when it comes to ML algorithms, and the choice depends on the task at hand, computation time, memory constraints, and the nature and size of the available data set. In this work only supervised algorithms, specifically the ANN and CNN, are used.

2.4.1 Deep learning and Artificial Neural Networks

Neural networks, for the purposes of this thesis, comprise ANNs and CNNs. Deep learning commonly refers to neural networks (NNs) with many layers, and the CNN can be thought of as an extension of the ANN.

The ANN is an ML algorithm featuring a network of neurons, also called nodes, interconnected by vertices. A diagram of a typical ANN topology with a single hidden layer is shown in figure 2.11, where an ANN with a size-N input layer, a size-P hidden layer and a size-K output layer can be seen. The size of the input layer is determined by the size of the input data. Input to ANNs is usually in the form of images, and that is also the case in this work. Image format is quantified by height and width in pixels along with a colour channel usually ranging from 1 (for grey scale) to 3 (for RGB images). The input layer size is calculated as N = a × b × c given an image with a, b and c pixels in height, width and depth. The output layer size is based on the number of classes the data is going to be split into. The hidden layer is designed according to the classification problem to be solved; one common design choice is the depth, in other words the number of hidden layers, and another is the number of neurons in each layer. Neurons are simply a set of memory locations holding the computation values at each step of the program. Each neuron is connected to the next layer by vertices (also called axons) which function as the weights. Both the input layer and the hidden layer can have a bias, which helps with learning by making the network more adaptive. The weights and biases are determined by training the network, giving the network the ability to learn and make decisions. The network is inspired by the biological neural networks in animal brains, from which it derives its name.

2.4.2 Initializing and Training

When a network is first initialized the weights must be set to some values. Often a network is initialized with randomly generated numbers according to some distribution. This avoids the network getting stuck in local minima, which might happen if it were initialized with zeros, and is commonly referred to as symmetry breaking.

As NNs require training to function, suitable data must be selected and labelled according to the class it belongs to. The labelled data is then divided into a standard split of training set, validation set and testing set, usually in 70%, 15% and 15% proportions respectively. The training set is then used as input to the network, with the training process consisting of forward and backward propagation. During this process the training weights are adjusted on each iteration according to some pre-selected optimization routine until some convergence criterion has been reached.
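A sketch of the 70/15/15 split described above is given below; the placeholder examples and labels arrays are stand-ins for the labelled spectrogram images used in this work.

```matlab
% Random 70/15/15 split into training, validation and test sets.
examples = randn(200, 64);          % placeholder: one example per row
labels   = randi(3, 200, 1);        % placeholder: class tags 1..3

n   = size(examples, 1);
idx = randperm(n);                  % shuffle before splitting
nTr = round(0.70*n);  nVa = round(0.15*n);

trainIdx = idx(1:nTr);
valIdx   = idx(nTr+1:nTr+nVa);
testIdx  = idx(nTr+nVa+1:end);

Xtrain = examples(trainIdx,:);  ytrain = labels(trainIdx);
Xval   = examples(valIdx,:);    yval   = labels(valIdx);
Xtest  = examples(testIdx,:);   ytest  = labels(testIdx);
```

Shuffling before splitting prevents any ordering in the recorded takes from leaking into the evaluation.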
As NNs require training to function, suitable data must be selected and labelled according to what class it belongs to. The labelled data is then further divided into a standard split of training set, validation set and testing set. This is usually in 70%, 15% and 15% proportions for training, validation and testing respectively. The training set is then used as input to the network with the training process consisting of forward and backward propagation. During this process the training weights are 17 2. Theory Figure 2.11: The layout of an ANN with the N input layers coloured in green, P hidden layers coloured in blue and K output layers coloured in red. X0 and H0 are bias neurons. The connecting black arrows are known as vertices and function as weights. 18 2. Theory adjusted on each iteration according to some pre-selected optimization routine until some convergence criteria has been reached. The update procedure is done by the forward and backward propagation routines. In plain terms the forward propagation will take an input image and propagate it through all the layers of the network eventually producing an signal at the output layer. This signal is a class the network predicted using the current weights. Upon matching of this result to the true value provided by the labelled data an error term will be produced. Now using backpropagation this error term will be propagated through the network towards the input layer, at each stage producing a possible update to the weights in order to minimize this prediction error. Mathematically forward propagation is comprised of matrix multiplications with the weights then passed through an activation neuron. The Sigmoid function is a common choice as an activation function. The Sigmoid decision function is defined as [23], g(z) = 1 1 + e−z (2.17) and then a hidden layer computation can then be expressed as, a(l+1) = g(W Tx+ b) (2.18) where a refers to activation, l is the current layer, x is the input, W represents the adjustable weights and b is the bias. Letting the parametersW and b be represented by Θ, the cost function is calculated as below, J(θ) = 1 m m∑ i=1 K∑ k=1 [−y(i) k log((hθ(x(i))k)−(1−y(i) k )log(1−(hθ(x(i))k)]+ λ 2m L−1∑ l=1 sl∑ i=1 sl+1∑ j=1 (θ(l) ji )2 (2.19) where m is the total number of training examples, K is the number of output classes and hθ is the hypothesis of the network. In the second term λ is the weight decay, L is the number of layers, i is the marking of the destination node and j is the origin node. Note that the bias terms are not summed over as summations start at 1. The first term is the cross-entropy of the loss, summing over all the training examples and classes. The second is called a regularization used on the weights to prevent overfitting on the training data. In mathematical terms, backpropagation is the process of back-tracking and observ- ing how the network reacts to a certain input with its current weights and biases. As an example, let a network have three layers. An input layer, a single hidden layer and an output layer. Starting from the output nodes the result is compared to the labels as below, δ (3) k = (a(3) k − yk) (2.20) where a(3) k is the result of the 3rd layer and yk is a one-hot-encoded output vector. Next the result is backpropagated using the derivative of the Sigmoid function as follows, δ(2) = (θ(2))T δ(3) · g′(z(2)) (2.21) 19 2. Theory where g′ is the derivative of the Sigmoid function. 
In mathematical terms, backpropagation is the process of back-tracking and observing how the network reacts to a certain input with its current weights and biases. As an example, let a network have three layers: an input layer, a single hidden layer and an output layer. Starting from the output nodes, the result is compared to the labels,

$$\delta_k^{(3)} = a_k^{(3)} - y_k \tag{2.20}$$

where a_k^{(3)} is the result of the 3rd layer and y_k is a one-hot-encoded output vector. Next the result is backpropagated using the derivative of the Sigmoid function,

$$\delta^{(2)} = (\theta^{(2)})^T \delta^{(3)} \cdot g'(z^{(2)}) \tag{2.21}$$

where g′ is the derivative of the Sigmoid function. Equation 2.21 calculates the error values at the hidden layer, δ^{(2)}, by reversing the operations. The gradient is accumulated for θ^{(1)} and θ^{(2)} in the temporary variables ∆^{(l)}. To simplify understanding, the accumulation is first given in non-vectorized form,

$$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + \delta_i^{(l+1)}\, a_j^{(l)} \tag{2.22}$$

and then in the vectorized form used in efficient programming implementations,

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left(a^{(l)}\right)^T \tag{2.23}$$

Yet again l is the layer, j is the node in the current layer and i indexes the error in the targeted layer. The ∆ terms accumulate a weighted average of the errors by iterating through all training examples. From this it is possible to normalize and calculate the partial derivatives for all nodes,

$$\frac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta) = \frac{1}{m} \Delta_{ij}^{(l)} \tag{2.24}$$

The cost function and its partial derivatives can now be passed on to the optimization algorithm.

2.4.3 Backpropagation and the Loss function

The Loss function, also known as the Cost function, is typically defined with the softmax function. During training, the network has the goal of minimizing the loss by adjusting the weights and biases such that the softmax outputs are as close to 1 as possible for the true class and as close to 0 as possible for the other classes. A probabilistic interpretation of the softmax can be seen below,

$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \tag{2.25}$$

where y_i is the true label and x_i is the image, parameterized by W. The relation is such that softmax treats f_{y_i} as a logarithmic probability which, when exponentiated, gives unnormalized probabilities. These are then normalized by the division by the sum, ensuring a total summation of 1. This can be seen as a maximum likelihood estimation.

To find the best values of the parameters W and b, the backpropagation algorithm is necessary. Backpropagation is a way to compute gradients of expressions using repeated applications of the Chain rule. The gradients are required in the optimization since the optimizer needs them to calculate how to change W and b such that the softmax classifier outputs a higher value for the true class. Numerical calculation of the gradient is not practical when the network consists of hundreds of parameters or more. Backpropagation is best explained with an example:

$$f(x, y, z) = z(x + y) \tag{2.26}$$

Let s = x + y, such that f(s, z) = sz. To find the change of f in relation to x one must find ∂f/∂x = (∂f/∂s)(∂s/∂x) with the Chain rule. As such it is trivial to find the relations of all three variables to the output f:

$$\frac{\partial f}{\partial x} = z, \qquad \frac{\partial f}{\partial y} = z, \qquad \frac{\partial f}{\partial z} = x + y \tag{2.27}$$

Gradients can be calculated locally and chained back all the way to the input. The only prerequisite is knowing the gradients of the individual functions. This is sufficient information to know how the parameters must be altered in order to change f in the right direction. Good examples of this can be found at [28].
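The worked example of equations (2.26) and (2.27) can be checked numerically; the snippet below compares the analytic gradients with finite differences, which is also the standard sanity check for a backpropagation implementation.

```matlab
% Gradient of f(x,y,z) = z*(x+y), analytic vs. finite differences.
f = @(x,y,z) z.*(x + y);
x = -2;  y = 5;  z = -4;

analytic = [z, z, x+y];                    % df/dx, df/dy, df/dz, eq. (2.27)
h = 1e-6;
numeric  = [f(x+h,y,z) - f(x,y,z), ...
            f(x,y+h,z) - f(x,y,z), ...
            f(x,y,z+h) - f(x,y,z)] / h;    % one-sided finite differences

disp(max(abs(analytic - numeric)));        % agreement to roughly 1e-6
```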
2.4.4 Optimization, Dropout and Testing

With the gradients, the network can be guided in the right direction by the optimizer, which descends along the slope of greatest descent. There are different optimizers available, such as Momentum, Nesterov Momentum and Adam; they differ in the way they update their learning rates.

A recent technique used in training deep NNs is called Dropout, in which a certain percentage of the neurons are turned off during a step of the training. This forces neurons to develop redundant features in the network. As the network learns that it cannot depend on one specific neuron for information, since that neuron might not be there on the next iteration, the neurons become more self-sufficient. This has been shown to improve results and prevent overfitting [25].

After the training process is done and the network has converged on some minimized loss, it is ready for the test set. The network then makes predictions on this data set and the accuracy is recorded. A possible set of images called the validation images can also be used, on which different hyper-parameters such as the learning rate can be adjusted prior to using the test set.

2.4.5 Testing NN performance

As the training process concludes, the network will have converged on some minimized loss. A possible set of tagged data called the validation set can be used, on which different hyper-parameters such as the learning rate and regularization rate can be adjusted before the test set is used. Finally, a new set of data and its corresponding tags are fed into the network. This is called the test set. The network makes predictions on these images and the accuracy is recorded. That way the network is not indirectly trained on the data, and the test set remains a true set of inputs that the network has never seen before.

2.4.6 Convolutional Neural Networks

Convolutional Neural Networks (CNNs or ConvNets) are a specific type of NN. CNNs are commonly referred to as deep artificial neural networks containing many layers controlled in height, width and depth. The term "deep" refers to their number of layers compared with their predecessors, the ANNs. Although their structure is different from that of ANNs, their goal is the same. The main differences between the ANN and the CNN are the number of nodes and layers in the network and the global connectivity in the ANN versus the local connectivity in the CNN. CNNs are currently the state-of-the-art image classification algorithm, with top-performing entries in the Large Scale Visual Recognition Challenge (LSVRC) such as AlexNet (2012), VGGNet (2014), GoogLeNet (2014) and ResNet (2015) all being CNNs [24]. The GoogLeNet team achieved an image classification error of 6.7% versus the 5.1% achieved by the human annotator [27].

CNNs are able to make strong and correct assumptions about the nature of images, such as the locality of pixel dependencies and the stationarity of statistics [25]. The general idea of CNNs is to exploit the spatial similarity between neighbouring pixels (for image analysis) by generating small features that are matched to different parts of the image and then combined. The layered network is generally understood to provide a structure where each part of the network becomes capable of solving a simple image recognition task, with each part then voting towards a final decision about the image. A typical layout for a CNN can be seen in figure 2.12.

Figure 2.12: The layout of a CNN.

The network's layers have activation functions designed to introduce competitiveness into the network. A commonly used activation function is the Rectified Linear Unit (ReLU), which facilitates competition between input sums. Another example is the max-pool layer, which provides an explicit competition between the input neurons [26]. These functions are explained in the following sections.

Just as for ANNs, the input to CNNs is usually in the form of images. Typical image sizes vary from 28x28 pixels to 100x100 pixels or more. Larger images rapidly grow the network down the line, and as such computation power sets the limit. An image of size 100x100x3 results in a 30 000 neuron input layer.
2.4.6 Convolutional Neural Networks

Convolutional Neural Networks (CNNs or ConvNets) are a specific type of NN. CNNs are commonly referred to as deep artificial neural networks, containing many layers controlled in height, width and depth. The term "deep" refers to their number of layers compared with their predecessor, the ANN. Although their structure differs from that of ANNs their goal is the same. The main differences between the ANN and the CNN are the number of nodes and layers in the network and the global connectivity in the ANN versus the local connectivity in the CNN. They are currently the state-of-the-art image classification algorithm, with the Large Scale Visual Recognition Challenge (LSVRC) winners AlexNet (2012), VGGNet (2013), GoogLeNet (2014) and ResNet (2015) all being CNNs [24]. The GoogLeNet team achieved an image classification error of 6.7% versus the 5.1% achieved by a human annotator [27]. CNNs are able to make strong and correct assumptions about the nature of images, such as the locality of pixel dependencies and the stationarity of statistics [25].

The general idea of CNNs is to exploit the spatial similarity between neighbouring pixels (for image analysis) by generating small features that are matched to different parts of the image and then combined. The layered network is generally understood to provide a structure where each part of the network becomes capable of solving a simple image recognition task, with each part then voting to make a final decision about the image. A typical layout for a CNN can be seen in figure 2.12.

Figure 2.12: The layout of a CNN.

The network's layers have activation functions designed to introduce competitiveness in the network. A commonly used activation function is the Rectified Linear Unit (ReLU), which facilitates competition between input sums. Another example is the max-pool layer, which provides explicit competition between the input neurons [26]. These functions are explained in the following sections. Just as for ANNs, the input to a CNN is usually in the form of images. Typical image sizes vary from 28x28 pixels to 100x100 pixels or more. Larger images grow the network exponentially down the line, and as such computation power sets the limit. An image of size 100x100x3 results in a 30 000 neuron input layer.

2.4.6.1 Running the network

Just like for the ANN, weights and biases are usually initialized randomly to avoid local minima. Unlike the ANN, however, the CNN has a more intricate hidden layer. The first part of a CNN is the convolutional layer. True to its name, the convolutional layer applies a convolution operator to a part of the input image. The convolution is done using a filter on the image, which is small in comparison. In a CNN context a convolution is mathematically described as

w^T x + b    (2.28)

where w are the weights of the filter, x is the subsection of the input image that the filter covers and b is the bias. w and b are parameters to be determined later by the optimization routine. The result of this convolution gives a single point in the next layer, called the activation map. After a single convolution the filter is slid a certain distance, called the stride, and the operation is repeated. The filters, also referred to as kernels, are typically 1x1 to 20x20 pixels in size, although they can be larger. The filter depth always matches the previous layer's depth. As an example, consider a filter of size 11x11x3, where the 3 is inherited from the previous layer's depth and the 11 is an adjustable hyper-parameter. Since the filter generates a single point on the activation map it has to be moved by a certain number of pixels before the convolution is repeated. This goes on until the entire image has been covered. See figure 2.13 for an illustration of this process.

Figure 2.13: A detailed description of the convolution operation. The filter, here called a kernel, is 2x2 and operates on a 3x4 input, resulting in a 2x3 activation map.

The resulting activation map will have a size according to the expression

A = (N − F + 2P)/S + 1    (2.29)

where A is the activation map size, N is the input size, F is the filter size, P is the pad and S is the stride. Padding means zero-padding around the image to keep its size consistent. This is best clarified with a numerical example. For a 100x100 input, a stride of 1 and a filter size of 13, no padding means the output map will be of size 88x88; after a few successive convolutions the image would shrink by an unacceptable amount. Another example is applying 10 filters of shape 13x13x3 with stride 1 and pad 6. This results in 100x100x10 output neurons with (13x13x3 + 1) × 10 parameters (the +1 is for the bias). For a visual illustration see figure 2.14.

A number of filters is applied at each convolutional layer. The filters generally develop simple shapes such as lines and circles; this usually occurs in the first layers, while more complex shapes form in the following layers.

To introduce non-linearity into the model an activation function is used. Previously the sigmoid function was a common choice, but since the introduction of the ReLU nearly all CNNs use this activation unit, due to its fast computation time and because it does not have a vanishing gradient. The ReLU function is described by

f(x) = max(0, x)    (2.30)

Figure 2.14: A 5x5x3 filter is applied to a 32x32x3 image. As it is slid over the entire input image it generates one activation map (orange) of size 28x28x1. Another filter generates the second activation map (red).
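Equation (2.29) is easy to check programmatically. The helper below (an illustrative sketch, not thesis code) reproduces both numerical examples above for a 100x100 input.

```python
def activation_map_size(n, f, p, s):
    """Output size A = (N - F + 2P) / S + 1 from equation (2.29)."""
    a, rem = divmod(n - f + 2 * p, s)
    assert rem == 0, "filter does not tile the padded input evenly"
    return a + 1

print(activation_map_size(n=100, f=13, p=0, s=1))  # 88  -> the 88x88 map
print(activation_map_size(n=100, f=13, p=6, s=1))  # 100 -> input size preserved
```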
2.4.6.2 Max-pooling, FC Layer and Readout Layer

The pooling layer down-samples the input image. There are many variations of this layer, but a very popular variant is the max-pool layer. It operates on the input by taking the highest number out of each grid cell and moving on by one stride. The output size is described by

w_2 = (w_1 − f)/s + 1    (2.31)
h_2 = (h_1 − f)/s + 1    (2.32)
d_2 = d_1    (2.33)

where w is the width, h the height and d the depth of the output, f is the filter size and s the stride. The subscripts indicate the layer the variables belong to. A 2x2 filter with a stride of two would reduce a 100x100x10 volume to 50x50x10, effectively taking the maximum of each neighbouring 2x2 block and reducing it to one number. In total such a reduction is to a quarter of the original size. For a clear illustration of the max-pooling operation see figure 2.15, where the maximum value in each quadrant is picked and placed into a new grid.

Figure 2.15: The maximum value in each quadrant of the Input square is selected and separated into a new quadrant called the Pooled Output.

The Fully Connected (FC) layer resembles an ANN in that there is no local connectivity: all the nodes in the current FC layer are connected to all nodes in the previous layer. This can be seen in figure 2.12 at the right end of the network. In effect this combines all of the previous layer's decisions into several hundred neurons. The previous layer is flattened, and a matrix multiplication between the weights of the FC layer and the input from the previous layer completes the operation.

The Readout Layer is typically the last layer of the network. It is where the FC layer is funnelled into the class outputs. After this a softmax function can assign probabilities to each class according to

σ(z)_j = e^{z_j} / Σ^{K}_{k=1} e^{z_k}    (2.34)

which acts as a normalization function so that all class scores fall in the range [0, 1] and can be interpreted as probabilities.
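Both operations are small enough to express directly; the sketch below (illustrative only) implements equation (2.34) in a numerically stable form together with the 2x2 max-pool reduction described above.

```python
import numpy as np

def softmax(z):
    """Equation (2.34); subtracting max(z) avoids overflow without
    changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on an (h, w, d) volume: every
    neighbouring 2x2 block is reduced to its maximum, as in the text."""
    h, w, d = x.shape
    x = x[:h - h % 2, :w - w % 2]                      # crop odd edges
    return x.reshape(h // 2, 2, w // 2, 2, d).max(axis=(1, 3))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))                                 # probabilities summing to 1
print(max_pool_2x2(np.random.rand(100, 100, 10)).shape)  # (50, 50, 10)
```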
"Boxing" was selected as the third category, in addition to "Walking" and "Running". It was selected despite being an uncommon pedestrian activity since there was some variation in subjects and a fair number of repetitions available. The "Walking" category had the largest number of subjects with 18 individuals. The category "Running" was limited to 5 individuals and "Boxing" had 4. 3.1.2 Preparing MOCAP data The MOCAP data was available in multiple formats. The format used to gener- ate Cartesian time-space coordinates was the .c3d format. There were also filmed versions (.mpg) and animated versions (.avi) of each recording. These were used to verify the usefulness of a given recording. As previously mentioned, the subjects carried 41 markers spread evenly across their bodies. On the CMD website some support functions were provided to translate the .c3d files into a structure containing the desired data. It should be noted that even though that all .c3d appear to be similar some are in the format VAX-D (also called DEC) and first need to be converted into PC format requiring an additional function. Whether or not this step was required varied on a subject by subject basis. Once converted the structure consisted of coordinates (x, y, z) over the time t for each of the 41 markers. Six of these markers were selected, spread across the head, torso, arms and legs. The motivation behind this was the fact that no realistic automotive radar would have the resolution to pick up µDs from markers more densely packed. The recordings used 120 frames per second which was significantly lower than what was needed to generate realistic µD signatures. Because of this the data from the six selected markers was interpolated by a factor of 80 turning 120 data points per second into 9600 points per second for each spatial direction. The interpolation factor 80 was kept as low as possible in order to keep down computation cost while still giving smooth curves. The spatial coordinates were also shifted in space so that they always moved towards or away from the point of observation rather than past it. In order for the coordinates to resemble what a radar unit would observe they were converted from the Cartesian plane (x, y, z) into the Spherical plane (R, φ, θ). The last step was to assign each marker with a cross section σ. The cross sections were assumed to be constant with the legs and torso having twice the size of the arms 28 3. Methods and head. They were also assumed to be spherical in shape allowing the φ and θ dimensions to be discarded. This choice is discussed further in Chapter 5. 3.1.3 Generating a radar response The radar response simulation started with equation 2.7. To calculate the amplitude the different constants were set to values seen in table 3.1. Choice for antenna gain, pulse width, carrier frequency and transmit power were based on the radar unit used by VCC. System losses, atmospheric losses and noise standard deviation were for simplicity chosen so they would have minimal impact. The same was true regarding system temperature. Antenna Gain G 6 dB Transmit Power Pt 10 mW Carrier Frequency fc 77 GHz Pulse Width τ 5 µs Noise Standard Deviation σn 0 dB System Losses Ls 0 dB Atmospheric Losses La 0 dB System Temperature Tsys 290 K Table 3.1: Table listing the constants and system parameters used for simulating the radar return. Once the amplitude was calculated it was used in equation 2.8. The complex output xp[n] was then ready to be used as input to the different DE algorithms. 
3.2 Doppler Extraction configurations

Once the radar responses were simulated they were used as input to the STFT, CWT, SPWVD and EMD algorithms. The ML algorithms used different formats of information, and thus additional steps, such as gray-scaling or squaring images, were sometimes required. To illustrate the process and show how results could look, subject 16 take 8 from the CMD database was selected. The inputs to the DE algorithms consisted of complex signals, but for illustration the absolute value of such a signal can be seen in figure 3.2.

Figure 3.2: The absolute value of the simulated radar return from subject 16 take 8 from the CMD database.

The algorithms were run both with and without noise for all cases. Where noise was added it was in the form of white noise added to the spherical R-coordinates, meaning the positions of the markers themselves were shifted. Attempts were initially made to add the noise to the simulated radar response instead, as this is where noise would realistically appear. This was however found to be very unstable and unpredictable, with the SNR varying wildly, sometimes completely destroying the signal and sometimes being unnoticeable. The reason for this instability could not be determined, so the noise level was chosen empirically; the level chosen resulted in an SNR of about 30 dB. This level was however chosen for coloured images, and upon visual inspection the gray scale images seemed to have significantly lower SNR than that. Properly quantifying noise levels from gray-scale images proved very difficult and was ultimately unsuccessful. Noise is discussed further in Chapter 5.

To evaluate the effect of observation time the DE algorithms were run on the same data twice: first with the signal divided into half second snapshots and again with one second snapshots. This was important as a shorter observation time could mean a lot for avoiding accidents in a real-life traffic situation, where a reaction half a second earlier could have a large impact on the outcome. It was also important because less observation time resulted in fewer data points, meaning the algorithms would run faster. The complexity of these algorithms was important to quantify due to the limited computation power available on a vehicle.

All images were saved in gray-scale (146x110) .png format for the ANN and gray-scale square (110x110) .png format for the CNN. Monochrome images were used to speed up ML algorithm run times. In addition the DE algorithms were timed and the timings saved for evaluation.

3.2.1 The STFT algorithm

The STFT used in this thesis used 512 frequency bins and a Dolph–Chebyshev window function of length 71, applied to the complex input signal mentioned in the previous section. In figure 2.1 a spectrogram of the complex signal using the STFT is seen. Note that the image is in colour only to make it easier for the reader; the images used for the ML algorithms were gray-scaled versions. The peaks with the highest frequencies represent the legs, moving at a Doppler frequency of 4.2 kHz, meaning over 16 m/s, with the torso centered around 1.8 kHz, meaning around 7 m/s. These speeds are clearly too high; the torso should be centered somewhere around 500–800 Hz with the rest of the movements shifted accordingly. The reason for the offset was unknown but suspected to be a result of some of the time-frequency analysis tools not being designed to handle negative frequencies. Fortunately this was not important for the ML algorithms, because the images had their axes removed to compress them; the axes were considered redundant data, identical for all images. The shift was also consistent across different subjects and repetitions. Converted into gray scale, figure 2.1 becomes figure 3.3, which is the sort of image used as input to the ANNs.

Figure 3.3: Figure 2.1 transformed into gray scale.

A noisy version of the same image can be seen in figure 3.4, with the gray-scaled version seen in figure 3.5.

Figure 3.4: The spectrogram generated by the movements of subject 16 during take 8 when white noise has been added.

Figure 3.5: Figure 3.4 in gray scale. The signal is less clear in relation to the noise compared with the coloured version of the image.
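A sketch of this configuration using SciPy (the thesis tooling itself is not specified here, so the hop size and the 100 dB sidelobe attenuation of the Chebyshev window are assumptions):

```python
import numpy as np
from scipy.signal import stft
from scipy.signal.windows import chebwin

FS = 9600  # samples per second after interpolation

def spectrogram_stft(x):
    """STFT configuration described above: 512 frequency bins and a
    length-71 Dolph-Chebyshev window.  The overlap and the 100 dB sidelobe
    attenuation are illustrative assumptions.  return_onesided=False keeps
    the negative Doppler frequencies of the complex input."""
    f, t, Z = stft(x, fs=FS, window=chebwin(71, at=100), nperseg=71,
                   noverlap=70, nfft=512, return_onesided=False)
    return f, t, 20 * np.log10(np.abs(Z) + 1e-12)  # log-magnitude in dB
```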
3.2.2 CWT using the Morlet mother wavelet

The CWT took the same complex input as the STFT. As previously mentioned a Morlet wavelet was used, and it proved very difficult to tune the settings for acceptable performance at both very high and very low frequencies. High resolution for low frequencies can be seen in figure 3.6, where the upper part of the scalogram is fuzzy while the lower parts are sharp. The opposite is true in figure 3.7, where the lower part of the scalogram is fuzzy while the high frequencies are distinct. This fuzziness was clearly undesirable, and because of it the CWT was run twice: once for high frequencies and once for low frequencies. The resulting images were then cut and merged so that there was high resolution across the entire frequency interval. In all scalograms negative frequencies were folded into the zero frequency. This folding meant a loss of information compared with the STFT algorithm, which correctly represented negative frequencies.

The result of cutting two images together as described above can be seen in figure 3.8. This third scalogram can be compared to the STFT spectrogram in figure 2.1 and is noticeably sharper, with frequencies more well-defined over time. One obvious downside to using the CWT was the smearing around the zero frequency limit. Another was that running two CWTs took more time, but this was required to get an image of quality comparable to the STFT method. The final scalograms were turned into gray scale and used as input to the ANNs and CNNs.

Figure 3.6: A scalogram of the signal generated by the movements of subject 16 during take 8. High frequencies are smeared while low frequencies are shown with high resolution.

Figure 3.7: A scalogram of the signal generated by the movements of subject 16 during take 8. Low frequencies are smeared while high frequencies are shown with high resolution.

Figure 3.8: A scalogram resulting from the merge of the two above to get resolution in both high and low frequencies simultaneously. Negative frequencies are folded into the zero frequency resulting in a loss of information.
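The two-pass strategy can be sketched as below; PyWavelets and its complex Morlet wavelet 'cmor1.5-1.0' stand in for the actual tooling used in the thesis, and the scale ranges and split point are illustrative assumptions.

```python
import numpy as np
import pywt

FS = 9600

def merged_scalogram(x, split=120):
    """Two-pass CWT as described above: one run tuned for high frequencies
    (small scales) and one for low frequencies (large scales), cut and merged
    into a single scalogram covering the whole frequency interval."""
    high_scales = np.arange(1, split)        # fine scales: sharp high frequencies
    low_scales = np.arange(split, 512)       # coarse scales: sharp low frequencies
    c_hi, _ = pywt.cwt(x, high_scales, 'cmor1.5-1.0', sampling_period=1 / FS)
    c_lo, _ = pywt.cwt(x, low_scales, 'cmor1.5-1.0', sampling_period=1 / FS)
    return np.abs(np.vstack([c_hi, c_lo]))   # keep the sharp half of each run
```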
3.2.3 SPWVD

The SPWVD required two window functions. The time smoothing window was a Dolph–Chebyshev of length 61; for frequency smoothing another Dolph–Chebyshev window of length 71 was used. Several different window functions and window lengths were tested to find a good trade-off between time resolution, frequency resolution and computation time. Longer windows gave better results at the cost of much longer computation times. The result can be seen in figure 3.9, where the highest frequencies ended up in the negative frequency domain when using the same complex input as for the STFT and CWT methods. The reason for this could not be determined, but it was obviously undesirable. The signal frequencies were therefore shifted so they ended up in the right place, using the STFT and CWT algorithms as points of reference. The shift was achieved by a division inside the exponential in equation 2.8, resulting in

x_p[n] = Σ^{m}_{i=1} a_{t,i} τ e^{−j 4πf_c R_{d,i} / (1.85 c_0)}    (3.1)

with the particular number 1.85 chosen because it aligned the SPWVD spectrogram frequencies with the STFT spectrogram frequencies.

Figure 3.9: A spectrogram generated with the SPWVD algorithm. High frequencies have folded into the negative frequency domain, destroying locality.

The image generated after this change is seen in figure 3.10, where the high frequencies no longer appear in the negative frequency domain. This came at the cost of a worse time-frequency representation, with a loss of sharpness and some cross-terms (marked in red) appearing. Locality was important for the CNNs, meaning related information in an image had to be found in nearby pixels; after this change that was the case. These images were saved in the same fashion as the outputs from the STFT and CWT algorithms.

Figure 3.10: A spectrogram generated using the SPWVD method with shifted frequencies, representative of the quality of image used. High frequencies are now where they should be, but the spectrogram has less sharpness than in the previous figure. Cross-terms, marked with red, have also been introduced.
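In code, the empirical shift of equation (3.1) amounts to scaling the carrier term in the toy return model from section 3.1.3 (again a sketch, with the same assumptions as before):

```python
import numpy as np

C0, FC, TAU = 3e8, 77e9, 5e-6

def shifted_radar_return(ranges, amplitudes, shift=1.85):
    """Radar return with the empirical frequency shift of equation (3.1):
    the carrier term is divided by `shift` (1.85 in the thesis) so that the
    SPWVD frequencies line up with the STFT/CWT reference spectrograms."""
    phase = np.exp(-1j * 4 * np.pi * FC / (shift * C0) * ranges)
    return (amplitudes[:, None] * TAU * phase).sum(axis=0)
```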
3.2.4 EMD

The output from the EMD algorithm was treated a bit differently. In figure 3.11 the IMFs extracted from the complex signal of subject 16 take 8 are shown, with the real part of the IMFs to the left and the imaginary part to the right. The real and imaginary parts are very similar, with only a slight time shift and amplitude change between them. The IMFs in the upper part of the image contain mostly high frequency components, with lower frequency components appearing in higher order IMFs. The resulting images are inappropriate as direct image input to ANNs or CNNs: the images would need very high resolution to retain their level of detail, and images that large would take excessively long to train and test on.

Figure 3.11: The first five IMFs generated after EMD has been applied to the signal generated by subject 16 take 8. A complex signal generates complex IMFs, so for visualization the real and imaginary parts have been separated. The real part on the left side of the image is coloured blue and the imaginary part on the right side is coloured red.

To turn the IMFs into a potentially suitable input they are instead first split into their real and imaginary parts. The real parts are discarded and the imaginary parts are downsampled by a factor of two. After this up to six IMFs are concatenated into one long vector. The reason only the imaginary-part IMFs are used is that they need to fit into the 146x110 (ANN) or 110x110 (CNN) format. If the EMD of a signal converges early and the generated IMFs are insufficient to fill the vector of length 16060 (ANN) or 12100 (CNN), the vector is zero-padded to the right length. This zero-padding is motivated by the fact that a lack of further IMFs indicates the signal was encapsulated in the first few IMFs, with very little to no residue. The input from EMD to the ANN was in the form of matrices and the input from EMD to the CNN was in the form of images.

There is a loss of information when part of the complex IMFs is discarded, but to make sure the signal was not destroyed, images like the one in figure 3.13 were generated using an STFT. In that particular image all the real parts of the IMFs from a signal were added together and used as input to a spectrogram, with the imaginary parts discarded. The same was done for the imaginary parts, discarding the real parts, giving the same kind of result. This can be compared to an image where the complex IMFs were added together and then used as input to an STFT, seen in figure 3.12. While there is clearly some loss of information when using only the real or imaginary parts as opposed to the complex IMFs, the µ-D signature is mostly intact.

Figure 3.12: Reconstruction of the original signal from complex IMFs. A spectrogram was generated with an STFT from a signal made by summing all complex IMFs together.

Figure 3.13: Reconstruction of the original signal using only the real parts of the IMFs. A spectrogram was generated with an STFT by summing the real parts of all IMFs together. A spectrogram made from only the imaginary parts looks nearly identical.

To illustrate what the output images from the EMD look like after downsampling, concatenation and reshaping, subject 16 take 8 features yet again in figure 3.14. The lower order IMFs contain the high frequency components and can be seen in the upper part of the image.

Figure 3.14: A noiseless image of subject 16 take 8 generated by adding the imaginary parts of six IMFs. In the upper part of the image high frequencies can be seen as oscillations in shades of grey.
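The vectorization just described can be sketched as follows, assuming the complex IMFs have already been produced by an EMD routine:

```python
import numpy as np

def emd_to_vector(imfs, out_len=12100, max_imfs=6):
    """Turn complex IMFs into the fixed-length ML input described above:
    keep the imaginary parts, downsample each by a factor of two,
    concatenate up to six IMFs and zero-pad to the target length
    (16060 for the ANN format, 12100 for the CNN format).  `imfs` is a
    list of complex 1-D arrays from an EMD routine."""
    parts = [imf.imag[::2] for imf in imfs[:max_imfs]]
    v = np.concatenate(parts) if parts else np.zeros(0)
    v = v[:out_len]                             # truncate if too long
    return np.pad(v, (0, out_len - len(v)))     # zero-pad if EMD converged early
```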
3.3 ML Classification

The data generated by the DE algorithms served as input to the ML algorithms. The idea was to create networks classifying the three human activities "Walking", "Running" and "Boxing". The particular algorithms were selected for their excellent results in image recognition (CNN) and for their simplicity and ability to work with small data sets (ANN). Both ML algorithms were run on a PC with the specifications seen in table 3.2.

CPU   Intel Core i5-3570K @ 3.40 GHz
RAM   16384 MB DDR3 1333 MHz
GPU   NVIDIA GeForce GTX 680
OS    Windows 10 Pro 64-bit

Table 3.2: Specifications for the computer on which the ML algorithms were run.

Inputs to the CNNs and ANNs were the same images but re-scaled to different sizes. The images were tagged according to their classes, with the output of the ANN and CNN being a 3-class vector in one-hot-encoding format, i.e. a vector where all entries are 0 except the true class, which is 1. For example, if walking were the second class its vector would be [0, 1, 0].

The images were divided into training sets (70%), validation sets (15%) and test sets (15%). The training sets consisted of 188 or 378 images depending on observation time, while the test sets consisted of 41 or 81 images. CNNs generally work on training sets consisting of millions of images. Without this luxury, images had to be augmented in order to raise the number available. In this case images in the "Running" category were horizontally flipped. This was justified since a horizontal flip of a spectrogram or scalogram simply means the target has reversed direction: if it was moving towards the radar unit before the flip, it is an image of a target moving away after the flip.

All ML algorithms were evaluated for one second and half second observation times for the three activities "Walking", "Running" and "Boxing". The limit on the number of available images was set by the "Running" category. With the help of image augmentation there were 90 images with good µ-D signatures available when using one second observations and 180 images when using half second observations. Three Cases were investigated:

• Case One - one second observations and 90 images/vectors per category
• Case Two - half second observations and 90 images/vectors per category
• Case Three - half second observations and 180 images/vectors per category

Cases One and Two were chosen for comparison with each other and Case Three to exploit the larger number of images available.

3.3.1 Artificial Neural Networks

Images were in 110x146 pixel gray-scale format. The network had one hidden layer made up of 45 neurons and, as previously mentioned, three output neurons. Images of the above size resulted in an input layer of size 16060. The designed ANN had 722 700 weights from the input layer to the hidden layer and 135 from the hidden layer to the output layer. These weights were randomly initialized using a normal distribution.

The optimizer chosen was a nonlinear conjugate gradient method of the Polak–Ribière flavour. The maximum number of iterations was chosen to be 200. The learning rate step λANN that gave the best results was 0.001: at this value the network did not show any oscillatory behaviour, which is indicative of too-large steps, while convergence remained adequately fast, since slow convergence indicates too-small steps. Hidden layer size, number of iterations and λANN were all found by training and running the ANN on the validation set. The hyper-parameters that gave the best results were then used for the ANNs on the test sets. A visualization of trained weights can be seen in figure 3.15. These weights act like templates for the incoming test images.

Figure 3.15: A visualization of a subset of the trained weights. What can be seen are features captured by the network after training. There are units that are clearly looking for "boxing-like" features (the bottom row), "running-like" features (top row middle) and "walking-like" features (middle right). These units are activated by input images matching their respective class. The bias unit is not visualized.

The network was implemented in MATLAB. The forward- and backpropagation code was written in vectorized form, avoiding loops due to their costly computation time in MATLAB. Despite this, the network was not suitable for expansion; greater depth would result in greatly increased training times. Using 188 images the network took on average 602.38 seconds to train, while testing with 41 images took on average 0.13 seconds. When using 378 images for training the average training time was 1141.84 seconds, and testing with 81 images took on average 0.284 seconds.
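To make the dimensions above concrete, the sketch below builds the 16060-45-3 forward pass in NumPy (the thesis implementation was in MATLAB; the sigmoid hidden activation and the initialization scale are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the text: 16060 inputs, 45 hidden neurons, 3 classes,
# giving 16060*45 = 722,700 and 45*3 = 135 weights.  Normally distributed
# initialization as described above.
W1 = rng.normal(scale=0.01, size=(45, 16060))
b1 = np.zeros(45)
W2 = rng.normal(scale=0.01, size=(3, 45))
b2 = np.zeros(3)

def forward(x):
    """One forward pass of the ANN: sigmoid hidden layer, softmax readout."""
    a1 = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden activations
    z2 = W2 @ a1 + b2                            # class scores
    e = np.exp(z2 - z2.max())
    return e / e.sum()                           # class probabilities

x = rng.random(16060)            # a flattened 110x146 gray-scale image
print(forward(x))                # e.g. [P(walk), P(run), P(box)]
```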
3.3.2 Convolutional Neural Networks

Images used as input to the CNNs were simply reshaped from 110x146 to 110x110, following the convention of CNNs using square inputs. Like before the images were in gray-scale, with the same output classes, and the data set was split exactly as for the ANN.

In total two CNNs were designed. CNN One was designed for the STFT, CWT and SPWVD algorithms, whose images were similar to one another. Images generated by the EMD looked very different and performed extremely poorly on CNN One; because of this, CNN Two was designed specifically for the EMD images. The design parameters for both networks are presented later in this section.

The main tool used in the design and execution of the CNNs was Google's TensorFlow (TF) in conjunction with Python v3.5.2. TF provides an Application Programming Interface (API) which includes efficient convolutional operations as well as ReLUs, max-pool functions and other useful tools. It also allows for GPU computation, which significantly sped up the training process. NVIDIA has developed support for their more modern cards, and the card used, the NVIDIA GTX 680, was the first generation to benefit from this support.

The Modified National Institute of Standards and Technology (MNIST) database was also used. It is a collection of handwritten images with about 60 000 training images and 10 000 test images. The database is commonly used to compare algorithms on a relatively simple problem, with state-of-the-art networks reaching around 99.8% accuracy. This served as a sanity check and aided in the calibration of the networks. CNN One achieved a test accuracy of 95.42%, which was not exceptional but did prove that the network was not incorrectly designed. CNN Two achieved an accuracy of 98.52% in the same test. The design was a process of trial and error, with the networks constructed, tested using MNIST, trained and validated on the data of interest, and modified to achieve better performance. Once a good accuracy was achieved on the validation sets the CNNs were fed the test sets.

3.3.3 CNN Design

The CNN design had three convolutional layers; in CNN One the first layer uses two filters, the second three filters and the third three filters. Figure 2.12 from Section 2.4.6, which shows two convolutional layers, is a useful tool for visualizing this. Recall that the input images were 110x110x1 pixels. Two 11x11x1 filters were applied to a padded version of the input image, generating two activation maps of size 110x110. Next the ReLU activation function was applied to introduce non-linearities in order to aid training. The image was then pooled by a 2x2 max-pool operator with stride [2,2] (vertical and horizontal), resulting in two downsampled images of size 55x55. Next, three 5x5x2 filters convolve the padded image; ReLU and max-pooling (with the same size and stride) are applied again, resulting in a 28x28x3 volume. In a similar fashion the last layer outputs a 14x14x3 volume. The fully connected layer flattens this volume and maps it, through a matrix multiplication, to 8 neurons. In addition, the dropout layer randomly deactivates half of the neurons during training. Finally the output layer returns a 1x3 vector as the output of the network. The design parameters used for CNN One and CNN Two can be seen in Table 3.3; CNN One is a simpler network than CNN Two.
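As a cross-check of the walk-through above, the sketch below expresses CNN One in today's tf.keras API rather than the 2016-era low-level TF calls used in the thesis. 'Same' padding on the convolutions and pools reproduces the 55x55x2 → 28x28x3 → 14x14x3 progression, and the Adam optimizer is an assumption (the text names it only as one of several options).

```python
import tensorflow as tf

# CNN One from Table 3.3, written in the modern tf.keras API as a sketch.
cnn_one = tf.keras.Sequential([
    tf.keras.layers.Conv2D(2, 11, padding='same', activation='relu',
                           input_shape=(110, 110, 1)),
    tf.keras.layers.MaxPooling2D(2, padding='same'),   # -> 55x55x2
    tf.keras.layers.Conv2D(3, 5, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(2, padding='same'),   # -> 28x28x3
    tf.keras.layers.Conv2D(3, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(2, padding='same'),   # -> 14x14x3
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation='relu'),       # FC layer, size 8
    tf.keras.layers.Dropout(0.5),                      # dropout rate, Table 3.3
    tf.keras.layers.Dense(3, activation='softmax'),    # readout: 3 classes
])
cnn_one.compile(optimizer=tf.keras.optimizers.Adam(1e-4),  # training step, Table 3.3
                loss='categorical_crossentropy', metrics=['accuracy'])
```

With these choices the per-layer parameter counts match Table 3.4: (11·11·1+1)·2 = 244, (5·5·2+1)·3 = 153, (3·3·3+1)·3 = 84 and 588·8+8 = 4712.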
The number of neurons in a layer is calculated as

N = whd    (3.2)

where N is the total number of neurons in the layer and w, h and d are the width, height and depth of the activation map respectively. The number of parameters in a filter is calculated as

F_p = (whc + 1)d    (3.3)

where F_p is the number of parameters in the filter, w and h are its width and height, c is the filter depth, the 1 accounts for the bias unit and d is the number of filters.

Parameter            CNN One     CNN Two
Filter One Size      11x11x1     11x11x1
Filter One Depth     2           5
Filter One Stride    1           1
Pool One             2x2         2x2
Pool Stride One      [2,2]       [2,2]
Filter Two Size      5x5x2       5x5x2
Filter Two Depth     3           10
Filter Two Stride    1           1
Pool Two             2x2         2x2
Pool Stride Two      [2,2]       [2,2]
Filter Three Size    3x3x3       3x3x3
Filter Three Depth   3           20
Filter Three Stride  1           1
Pool Three           2x2         2x2
Pool Stride Three    [2,2]       [2,2]
Input Nodes          110x110x1   110x110x1
FC Layer Size        8           25
Output Nodes         3           3
Batch Size           30          30
No. Iterations       3500        3500
Training Step        1e-4        0.5e-4
Dropout rate         50%         50%

Table 3.3: Design parameters used for CNN One and CNN Two.

Using 188 images CNN One took on average 65.62 seconds to train, while testing with 41 images took on average 1.09 seconds. When using 378 images for training the average training time was 122.82 seconds, and testing with 81 images took on average 1.96 seconds. CNN Two took on average 106.21 seconds to train on 188 images, while testing with 41 images took on average 1.32 seconds. With 378 training images the average training time was 300.52 seconds, and testing with 81 images took on average 2.46 seconds.

Parameter             CNN One   CNN Two
Layer 1 Neurons       6050      15125
Layer 2 Neurons       2352      7840
Layer 3 Neurons       588       3920
Layer FC Neurons      8         25
Total Neurons         8998      26910
Layer 1 Parameters    244       610
Layer 2 Parameters    153       510
Layer 3 Parameters    84        560
Layer FC Parameters   4712      98025
Total Parameters      5193      99705

Table 3.4: Comparison of the number of neurons and parameters for CNN One and CNN Two.

4 Results

This chapter presents the performance of the ANN and CNN in combination with the STFT, CWT, SPWVD and EMD algorithms, as well as the computation times for the DE algorithms.

4.1 DE algorithm computation times

The computation times, averaged across all categories, are presented in table 4.1. The column marked Case A gives the computation time for one second observations and the column marked Case B for half second observations; Case A thus uses twice as many data points per calculation as Case B. Not surprisingly, computation times for Case A are longer than for Case B. For Case B the EMD has the best performance by far at 0.050 seconds, while the SPWVD takes significantly longer at 3.585 seconds. For Case A the best performance is again the EMD at 0.110 seconds, and the SPWVD is the slowest by far at 7.098 seconds.

Algorithm   Case A computation time [s]   Case B computation time [s]
STFT        0.441                         0.280
CWT         1.398                         0.785
SPWVD       7.098                         3.585
EMD         0.110                         0.050

Table 4.1: Computation times for the STFT, CWT, SPWVD and EMD algorithms for one second (Case A) and half second (Case B) observation times.

4.2 ANN performance

The results using noiseless images as input in conjunction with the ANNs can be seen in table 4.2. With the exception of the EMD, there is no difference in performance for any DE algorithm between Cases 1 and 2. Performance does however improve for both the STFT and SPWVD algorithms in Case 3, while the CWT performed perfectly in all Cases. 100% accuracy was unexpected and would certainly drop somewhat with a larger data set.
It is also notable that there is no gain in accuracy between Case 1 and Case 2, indicating that half a second of observation is sufficient for the µ-D signatures to be classified correctly. EMD performance was on par with pure guessing for all Cases.

Algorithm   Accuracy Case 1   Accuracy Case 2   Accuracy Case 3
STFT        87.18 %           87.18 %           98.72 %
CWT         100 %             100 %             100 %
SPWVD       84.61 %           84.61 %           94.87 %
EMD         33.33 %           33.33 %           33.33 %

Table 4.2: ANN performance for the different Cases and DE algorithms.

The results using noisy images as input can be seen in table 4.3, where the STFT algorithm performs much worse for all Cases compared with the noiseless scenario. The STFT accuracy peaks for Case 3 at 80.77% while the SPWVD peaks at 89.74%, meaning the SPWVD performs better than the STFT when noise is added. The CWT again performs best with 100% accuracy. With noisy input, performance does drop from Case 1 to Case 2. This is expected, as one second snapshots contain more information than half second snapshots. The EMD performance for Cases 1 and 2, at 33.33%, was on par with pure guessing, and for Case 3 it was even worse at 32.5%.

Algorithm   Accuracy Case 1   Accuracy Case 2   Accuracy Case 3
STFT        76.92 %           69.23 %           80.77 %
CWT         100 %             100 %             100 %
SPWVD       87.18 %           84.61 %           89.74 %
EMD         33.33 %           33.33 %           32.5 %

Table 4.3: ANN performance for the different Cases and DE algorithms when noise has been added.

4.3 CNN performance

The results using noiseless images as input in conjunction with the CNNs can be seen in table 4.4. For all DE algorithms, one second observations outperform or match half second observations when the same number of images is used, meaning there is a significant gain in performance with increased observation time. The CWT performs best for Case 1 but poorly for Case 3, where the SPWVD and STFT share the best performance at 97.44% accuracy. The benefit of more training data seems to outweigh the loss of information from reducing the observation time from one second in Case 1 to half a second in Case 3. EMD performance was not the best for any of the Cases but was on par with the CWT for Case 2.

Algorithm   Accuracy Case 1   Accuracy Case 2   Accuracy Case 3
STFT        94.87 %           94.87 %           97.44 %
CWT         97.79 %           87.18 %           84.61 %
SPWVD       94.87 %           92.31 %           97.44 %
EMD         92.31 %           87.18 %           88.75 %

Table 4.4: CNN performance for the different Cases and DE algorithms with noiseless input.

The results using noisy images as input in conjunction with the CNNs can be seen in table 4.5. Again Case 1 outperforms Case 2, with the largest difference found for the CWT at 97.44% compared with 83.33%. Performance decreases for Case 3, where the SPWVD slightly outperforms both the STFT and CWT algorithms. Yet again the results indicate that one second observations are superior to half second observations, and with noisy input there was generally no gain in Case 3 from the larger number of images. The one exception is the EMD, which has the best performance of all possible combinations in Case 3, at 97.50%, but also the worst performance for Cases 1 and 2.

Algorithm   Accuracy Case 1   Accuracy Case 2   Accuracy Case 3
STFT        92.31 %           87.18 %           87.18 %
CWT         97.44 %           83.33 %           87.18 %
SPWVD       92.31 %           82.05 %           89.74 %
EMD         89.74 %           79.49 %           97.50 %

Table 4.5: CNN performance for the different Cases and DE algorithms when noise has been added.

5 Discussion

In this chapter the data, the simulation, the choices regarding DE and ML algorithms, as well as the results are discussed.
5.1 Limitations with MOCAP data and simulated radar returns

As mentioned in Chapter 3, the original plan was to use data from real-world measurements. These contained subjects walking, jogging, running, biking and multiple persons walking next to each other. When it became clear that the gathered data would be unusable, the choice was made to turn to another solution. The best alternative was the MOCAP library, but it still came with a number of limitations. There were only three classes with an acceptable balance between subjects and repetitions, and there was obviously no way of controlling what the recordings contained in terms of duration, distance or angle. Without a doubt the greatest limitation was the number of repetitions, with the "Running" category setting the limit at 90/180 high quality images with distinct µ-D signatures if a balance between the number of images from each category was to be achieved. In contrast, there were over 300/600 images available from the "Boxing" category and around 250/500 images from the "Walking" category. This could not be addressed given the fixed data set, but it is a weakness that undoubtedly impacted the performance of all M