Exploring the feasibility of using ultrasonic sensors and cameras for human gesture recognition to activate trunk opening in vehicles

Master’s thesis in Complex Adaptive Systems and Systems, Control and Mechatronics

Tim Johansson and Krister Mattsson

Department of Electrical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2024
www.chalmers.se

© Tim Johansson, Krister Mattsson, 2024.

Supervisor: Pratish Ray, Volvo Cars Exterior Vision & Ultrasonics
Supervisor: Jonas Fredriksson, Department of Electrical Engineering
Examiner: Jonas Fredriksson, Department of Electrical Engineering

Master’s Thesis 2024
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Illustration of the general simplified logic where all networks are shown as boxes with an input and output signal. The yellow arrow illustrates the initiation of the time window used for the USS model. Both the vision model and the USS model classification outputs are weighed using a factor α to determine the total classification output.
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2024

Abstract

The integration of new advanced technologies plays a crucial role in the industrial market. The automotive industry is no different. With the introduction of ultrasonic parking sensors and high-resolution cameras in new vehicles, combined with the integration of high-performance computing power, it is possible to implement machine learning and classical methods to process real-time sensor information. This thesis focuses on recognizing human gestures using the combined information from the ultrasonic sensors and visual camera data for functional actuation. In particular, the thesis serves as a feasibility study for using gesture recognition as an input for activating the automatic opening of the trunk. Several approaches to this problem have been investigated through literature studies, and the most suitable method has been determined to be a combination of machine learning neural networks and sensor fusion from classical methods. Two different machine learning methods are implemented and analyzed for the visual input: one model that classifies static images and one model that classifies a series of images to capture information from dynamic movement. Another model is built for the parking sensor input, which, similarly to the latter vision model, utilizes a series of measurements in time for the classification. Together, these models form a logical pipeline that utilizes classical ultrasonic sensor input as an indicator for activating the models. The models are evaluated both for binary outputs, meaning classifying gesture or no gesture, and for multi-class outputs, meaning classifying several different gestures.
Separately, the vision models achieved close to perfect test accuracy for both the binary and the multi-class implementations, while the model for the ultrasonic sensors achieved a test accuracy of around 70 %. Using sensor fusion, the combined model achieved perfect test accuracy for both the static and the dynamic implementation, which supports the feasibility of the proposed solution. However, one should note that the results are all based on a small data pool collected during the thesis. Furthermore, the data lacks diversity. Implementing the solution on a greater scale would likely yield some changes in the results. In conclusion, it is possible to reliably use human gesture recognition for functional actuation from ultrasonic and visual data.

Keywords: Human gesture recognition, machine learning, neural networks, sensor fusion.

Acknowledgements

This thesis was conducted in collaboration with Volvo Cars in Torslanda, Göteborg, within the department of Safe Vehicle Automation. We would like to extend our deepest gratitude to the team members of USS Enterprise. A special thank you goes to Pratish Ray, our supervisor at Volvo, for his unwavering support throughout the project. We are also grateful to Venu Gopal Puripanda and Simon Rudh for their technical support related to test vehicles. We would like to express our appreciation to Khadija Dallah and Srinath Shanmugam for their guidance in decoding recorded data. Finally, we thank Jonas Fredriksson for taking on the roles of supervisor and examiner at Chalmers University of Technology. Your collective expertise, guidance, and support have been invaluable to the success of this project.
Tim Johansson, Krister Mattsson, Gothenburg, June 2024

List of Acronyms

Below is the list of acronyms that have been used throughout this thesis, listed in alphabetical order:

ANN   Artificial Neural Network
CNN   Convolutional Neural Network
FN    False Negatives
FP    False Positives
HMR   Human Motion Recognition
ML    Machine Learning
NN    Neural Network
PoC   Proof of Concept
SGCM  Static Gesture Classification Model
TN    True Negatives
TP    True Positives
USS   Ultrasonic Sensors

Nomenclature

Below is the nomenclature of indices, sets, parameters, and variables that have been used throughout this thesis.

Indices
i, j, k  Indices in tensors/matrices
t        Index for time step

Parameters
η    Learning rate
λ    Scaling parameter
n_h  Number of hidden layers
n_s  Number of samples in training set
n_i  Number of input neurons
n_o  Number of output neurons

Variables
O_i   Outputs
g(x)  General activation function
w_ij  Weights
x_j   Nodes
θ_i   Biases
Q(w)  Loss function
∆t    Time step (time interval)

Contents

List of Acronyms
Nomenclature
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Objectives
  1.3 Delimitations
  1.4 Ethical and Sustainability aspects
  1.5 Outline of thesis
2 Theory
  2.1 Human motion recognition
  2.2 Deep learning
    2.2.1 Activation functions
    2.2.2 Convolutional neural networks
    2.2.3 CNN-architecture, Residual Network
    2.2.4 Spatial-temporal data and deep learning models
    2.2.5 Balance of data
    2.2.6 Overfitting
    2.2.7 Transfer learning
    2.2.8 Cross entropy loss
    2.2.9 Stochastic gradient descent
  2.3 Containing information in descaled images
  2.4 Evaluating network models
    2.4.1 Network certainty
  2.5 Ultrasonic sensor systems
3 Method
  3.1 Gestures representation
  3.2 Approach and general idea
  3.3 Combined model
  3.4 Data acquisition
    3.4.1 Collected USS data
  3.5 Preprocessing: Decoding
  3.6 USS model network
    3.6.1 Preprocessing USS classification data
    3.6.2 Build USS classification model
    3.6.3 Training and validation
  3.7 Static vision model networks
    3.7.1 Preprocessing static vision classification data
    3.7.2 Static Vision classification model
    3.7.3 Training and validation
  3.8 Dynamic vision model network
    3.8.1 Preprocessing dynamic vision classification data
    3.8.2 Dynamic Vision classification model
    3.8.3 Training and validation
4 Results
  4.1 USS model
    4.1.1 Model evaluation
  4.2 Static vision models
    4.2.1 Binary gesture classification
    4.2.2 Multiclass gesture classification
  4.3 Dynamic vision models
    4.3.1 Binary gesture classification
      4.3.1.1 Collected and preprocessed data
      4.3.1.2 Model evaluation
    4.3.2 Multi-class gesture classification
      4.3.2.1 Model evaluation
    4.3.3 Extended multi-class gesture classification
      4.3.3.1 Model evaluation
  4.4 Combined model
    4.4.1 Model evaluation
    4.4.2 Combined model using dynamic vision model
5 Discussion
  5.1 Non neural network based approach
  5.2 USS model and data
  5.3 Static vision model
  5.4 Dynamic vision model
  5.5 Combined model
  5.6 Compared to current solution
  5.7 Data distribution
6 Conclusion
7 Future work
  7.1 Dataset expansion
  7.2 USS model
  7.3 Static vision models
  7.4 Improved performance of ResNet
  7.5 Improved approach for videos with arbitrary size and length
  7.6 Technological Scope
  7.7 Combined model
Bibliography
A Appendix 1
  A.1 Preprocessing: Decoding
    A.1.1 Decoding recorded files
  A.2 Python code
    A.2.1 USS model
    A.2.2 Static vision model
  A.3 Preprocessing dynamic vision model
  A.4 Dynamic Vision Model
  A.5 R(2+1)D Architecture

List of Figures

2.1 Illustration of a deep neural network consisting of an input layer, two hidden layers, and a single output.
2.2 Illustration of the Residual block.
2.3 Illustration of the firing sequence and how neighboring sensors listen to their own and each other’s echoes. The yellow circles indicate the positions of the ultrasonic sensors, and the blue indicates where the camera is located.
3.1 Illustration of the leg ’kick’ gesture. Note that the distance between the starting position and the vehicle was roughly one meter.
3.2 Illustration of the ’hand’ swipe gesture.
3.3 Illustration of the general simplified logic, where all networks are shown as boxes with an input and output signal. The yellow arrow illustrates the initiation of the time window used for the USS model. The vision and USS model classification outputs are weighed using a factor α to determine the total classification output.
3.4 A snippet of the measurement data logbook. This file connects the data files to the measurements and was used to label the datasets.
3.5 Illustration of a frame sequence from a single recording, depicting an individual standing in an open area without performing any gestures.
3.6 Illustration of the general distance measured over a time span of around 5 seconds. Note how the detected distance is shorter around time step 150. This is the indication of the kick gesture.
3.7 The left figure shows an example of a measurement series where all points of interest were lost in noise such that no kick gesture could be distinguished. The right figure shows an example of a clear gesture profile. The green points are the preprocessed and merged points, as explained in the methods chapter. The red and blue points are from echo1 in RIL and RIR, respectively.
3.8 Illustration of several (30) measurement sequences put together.
3.9 Illustration of the USS network architecture.
3.10 Raw frame extracted from one of the mp4-files to the left and the down-scaled version of the same frame to the right, using the scale factor γ = 0.1.
3.11 Illustration of the static vision network architecture.
3.12 Figure 3.12a shows the original resolution of an mp4-file, and figure 3.12b shows the down-scaled version of the same mp4-file; the scale factor is approximately γ = 0.1.
3.13 Illustration of the dynamic vision network architecture based on [24]. The final fully connected layer is adjusted for binary classification.
3.14 Overview of the entire model during the backward pass [24].
4.1 Illustration of the validation accuracy over each epoch using a batch size of 12 (blue) together with the validation loss scaled up by a factor of three (orange).
4.2 Illustration of the number of TPs and TNs together with the FPs and FNs.
4.3 Illustration of the validation accuracy (blue line) over epochs and the validation loss scaled by a factor of three (orange).
4.4 Illustration of the confusion matrix for the binary static vision model.
4.5 Illustration of the total validation accuracy over all gestures together with the separate validation accuracies for each gesture and the validation loss.
4.6 Illustration of the video lengths, in number of frames, per class in the case of a binary classification task.
4.7 Illustration of the training and validation loss together with validation accuracy and F1 score over 20 epochs.
4.8 Illustration of the confusion matrix from the test evaluation.
4.9 Illustration of the video lengths, in number of frames, per class in the case of the multi-class classification task.
4.10 Illustration of the training and validation loss together with validation accuracy and F1 score over 20 epochs.
4.11 Illustration of the confusion matrix from the test evaluation.
4.12 Illustration of the video lengths, in number of frames, per class in the case of the extended multi-class classification task.
4.13 Illustration of the training and validation loss together with validation accuracy and F1 score over epochs.
4.14 Illustration of the confusion matrix from the test evaluation.
7.1 Illustration of the leg swipe motion.
A.1 Overview of the system environment configuration for preprocessing of USS and vision data.

List of Tables

3.1 Overview of labeling functions and their corresponding labels.
3.2 Configuration parameters and preprocessing operations for the pretrained R(2+1)D model [24].
4.1 The number of true positives, true negatives, false positives, and false negatives and their relative mean certainty for the USS model.
4.2 The number of true positives, true negatives, false positives, and false negatives and their relative mean certainty.
4.3 Number of videos per ’kick’ - ’no kick’ gesture from the acquired dataset.
4.4 Number of videos for the ’hand’ - ’no hand’ gesture from the acquired dataset.
4.5 The mean certainty and count for true positives, true negatives, false positives, and false negatives for the ’kick’ - ’no kick’ gesture.
4.6 The mean certainty and count for true positives, true negatives, false positives, and false negatives for the ’hand’ - ’no hand’ gesture.
4.7 Test metrics of dynamic vision model for binary classification of ’kick’ - ’no kick’ gesture.
4.8 Test metrics of dynamic vision model for binary classification of ’hand’ - ’no hand’ gesture.
4.9 The number of videos per class in each of the datasets.
4.10 The mean certainty and count for true positives, true negatives, false positives, and false negatives.
4.11 Test metrics of dynamic vision model for multi-classification.
4.12 The number of videos per class in each of the datasets for extended multi-class classification.
4.13 The mean certainty and count for true positives, true negatives, false positives, and false negatives.
4.14 Test metrics of dynamic vision model for extended multi-classification.
4.15 Measures of the combined network model evaluation. Note that the network certainty is defined in a different way for the combined model.
4.16 Test metrics for dynamic vision model for multi-classification.
4.17 The mean certainty and count for true positives, true negatives, false positives, and false negatives.

1 Introduction

This chapter presents the project and its background, delimitations, and outline.

1.1 Background

In the automotive industry, the integration of advanced technologies plays a pivotal role in enhancing safety and user experience. With the introduction of autonomous drive and driving aid features, the industry has significantly increased the deployment of sensors in its vehicles, allowing for an increased perception of their surroundings.
Furthermore, advancements in machine learning have paved the way for novel solutions that not only enhance the efficiency of features but also reduce costs for manufacturers, as they allow for the possibility of removing previously required sensors. For instance, Tesla replaced their front-facing radar with vision [1].

A feature that has gained recognition across the industry, adding convenience and innovation to the overall user experience, is the radar-based contactless control of the trunk. The radar system is centered underneath the rear bumper and requires a person to approach the center of the rear bumper and perform a ’kick’ gesture to activate and open the trunk. The detection range of the radar is limited by its placement, which consequently requires the person to stand close to the rear of the vehicle to activate the trunk. Depending on the model of the vehicle, the person may then have to take an inconvenient step back to stay out of the trunk’s path.

In addition to the user experience challenges, the current implementation of the radar-based system incurs significant costs. The cost associated with the current solution for contactless trunk control through a single-purpose radar system is extensive, considering the intricate integration projects and the expenses associated with supplier, production, logistics, and service contracts. According to the function owner, the current radar-based solution has an accuracy of around 96 %.

The automotive industry continuously strives for cost-efficient and innovative solutions. This project focuses on leveraging the existing ultrasonic sensors (USS) and a rear-view fish-eye camera to replace the radar-based system. Such a solution would eliminate the need for these radar sensors, saving all costs associated with material and logistics, which would, in turn, reduce the environmental impact.
1.2 Objectives

The main objectives of the project are as follows:
• Determine a suitable approach to detect and classify patterns of human gestures in real-time from echo and vision data.
• Translate meaningful human gestures into inputs for functional actuation.
• Compare the accuracy of the proposed system, evaluated on the data set recorded in this thesis project, with that of the radar-based system.

1.3 Delimitations

This section outlines the boundaries and limitations of the project, ensuring clarity and managing expectations regarding the outcomes.
• Only the available ultrasonic sensors and the camera positioned at the rear of the car are utilized, without exploring additional sensor options.
• The project focuses on a specific set of gestures rather than a comprehensive range, so that a sufficiently large dataset can be collected despite the limited availability of test cars. Consequently, the recorded gestures have limited diversity with respect to the number of people performing them and environmental factors such as weather.
• The recording sessions are conducted on the company’s premises, leading to further constraints.
• The project does not address constraints related to car system integration, such as computational load, storage capacity, or system architecture.
• The thesis project exclusively considers the user intention of opening the trunk by one person, without exploring other potential user intentions.
• All development is performed on local workstations to protect confidential and sensitive information and proprietary knowledge. This approach is essential to comply with company policies and ensure data security.

1.4 Ethical and Sustainability aspects

This project utilizes sensors already implemented on the car and will, therefore, have a minimal environmental impact and will not further increase the risk of privacy intrusion.
The thesis work is purely software-oriented, and the technology is intended to provide comfort, accessibility, and simplicity through functional actuation, such as opening the trunk. No personal information such as name, age, or gender is recorded, ensuring the privacy of the persons participating in the recording sessions.

1.5 Outline of thesis

Advancements in machine learning have paved the way for novel solutions that enhance feature efficiency and reduce manufacturer costs by potentially eliminating previously required sensors.

In this thesis report, the underlying theory used in the project is presented in the theory chapter, and the methods for classifying gestures are presented in the method chapter. After this, the results obtained using the presented methods are stated and illustrated. The following chapter discusses the results and potential error sources, and compares the models created in the project to the current radar-based system. After this, the conclusion is presented, followed by some ideas for future work.

2 Theory

This chapter introduces the underlying theory used in the project to motivate the method and analyze the results.

2.1 Human motion recognition

Human motion recognition (HMR) involves the processes of identification, classification, and characterization of human movements [2]. In the context of computer vision, HMR is a multidisciplinary field composed of biomechanics, machine vision, image processing, data analytics, nonlinear modeling, and pattern recognition [3]. An efficient HMR system must handle a vast diversity of human features, such as body size, posture, and appearance, as well as environmental factors such as illumination, viewing angles, and disturbances.
The complexity of human motion and the variability of recording conditions make HMR challenging, but extensive research has gone into the field because of its wide range of applications [2]. These applications face similar primary challenges: interpreting ambiguous poses and actions; varying interpretations of classification; potential partial occlusion of bodies or objects; poor video quality, including blurring and noisy data from low-quality sensors; significant time differences between actions; inadequate or excessive lighting; and difficulty in acquiring large-scale datasets [3]. These challenges necessitate advanced methods to accurately capture and analyze human motion.

HMR can be broadly divided into two categories: vision-based and sensor-based recognition [4]. The vision-based method relies on one or more cameras. The approach for reaching motion predictions therefore varies significantly depending on the techniques employed, and it continues to be an active area of study within HMR. The sensor-based method, on the other hand, is a more standard and extensively researched approach, given the feasibility of attaching sensors or using mobile devices [4].

Recent studies, such as those reviewed in [5], have explored a variety of HMR methods, covering traditional approaches based on manually designed motion features extracted from RGB and depth data, as well as modern deep learning-based approaches for motion feature representation, techniques for recognizing human-object interactions, and methods for action detection. Unlike image classification, which primarily focuses on spatial information, vision-based HMR requires the integration of temporal information to accurately capture and analyze motion sequences. The review in [5]
concludes that deep learning-based methods exhibit superior performance in motion feature learning problems, as they leverage advanced neural network architectures to learn complex patterns and relationships within the data. In addition, deep learning-based methods are described as much more resource-efficient than traditional computer vision approaches [4].

2.2 Deep learning

Deep learning is a subset of machine learning based on artificial neural networks (ANNs), in which hidden layers are introduced to capture complex and intricate patterns in data. Since many problems addressed with ANNs cannot be solved by linear separation alone, deep neural networks are used to approximate more complex patterns of information and to classify non-linear problems [6].

A deep neural network consists of one or more hidden layers in addition to the input and output layers, see figure 2.1. The hidden layers are pivotal to an ANN’s ability to capture complex classification patterns, which is necessary for nonlinear classification and intricate data. Each hidden layer makes a more complex classification possible, but it also adds more parameters to tune. The extra size and parameters also mean that a deep learning model often requires large datasets for all weights and biases to be tuned in a desirable way [7].

Figure 2.1: Illustration of a deep neural network consisting of an input layer, two hidden layers, and a single output.

A common rule of thumb for determining a reasonable number of hidden layers, according to [8], is:

\[ n_h = \frac{n_s}{\lambda (n_i + n_o)}, \tag{2.1} \]

where n_h is the number of hidden layers, n_s the number of samples in the training set, n_i the number of input neurons, n_o the number of output neurons, and λ is a constant that is usually in the range of 2 to 10.
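To make equation (2.1) concrete, the following minimal sketch evaluates the rule of thumb for a few values of λ. The function name `estimate_hidden_size` and the sample counts are hypothetical and chosen only for illustration; they are not taken from the thesis data.

```python
def estimate_hidden_size(n_samples, n_inputs, n_outputs, scale):
    """Evaluate equation (2.1): n_h = n_s / (lambda * (n_i + n_o))."""
    if scale <= 0 or n_inputs + n_outputs <= 0:
        raise ValueError("scale and (n_inputs + n_outputs) must be positive")
    return n_samples / (scale * (n_inputs + n_outputs))


# Hypothetical example values (assumptions, not from the thesis):
# 1200 training samples, 10 input neurons, 2 output neurons.
n_s, n_i, n_o = 1200, 10, 2

# lambda is usually chosen in the range 2 to 10.
for lam in (2, 5, 10):
    print(f"lambda = {lam:2d} -> n_h = {estimate_hidden_size(n_s, n_i, n_o, lam):.1f}")
```

A larger λ gives a smaller estimate, so the choice of λ acts as a safety margin against having more tunable parameters than the training set can support.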
2.2.1 Activation functions

An activation function in an ANN is a mathematical function applied to the output of each network layer. Depending on the network specification, it maps that output to a restricted set or range of values, for example a binary value, a bounded interval, or the non-negative numbers. Activation functions can take different forms. Two commonly used functions are tanh(b) and sgn(b), where b is computed from the neuron states, weights, and biases of the current layer. There is one distinct difference: tanh(b) is continuous while sgn(b) is not. This detail becomes important when the networks are trained, as it is relatively common to use training algorithms, such as backpropagation, that rely on the derivative of the activation function. It is also important to note that when the activation function is continuous, the states of the neurons also become continuous. Another activation function that is commonly used in image classification networks and CNNs is the Rectified Linear Unit (ReLU). ReLU is a piecewise linear, non-negative activation function. One of its key advantages is its non-saturating property, which mitigates the vanishing gradient phenomenon [9].

2.2.2 Convolutional neural networks

A convolutional neural network (CNN) is a deep learning model designed and mainly used for processing visual data. More specifically, CNNs are well suited for tasks such as image classification and object detection within images or videos. CNNs include convolutional layers, where each layer applies filters, also called kernels, to the input data. The kernels are used to detect features and patterns within the visual data. Pooling layers are a common way to reduce the spatial dimensions after a convolutional layer. This reduces the computational resources necessary for training and using the network. At the end of the network, after the convolutional and pooling layers, CNNs typically have one or more fully connected layers connecting the last layer with the output.
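The building blocks described in sections 2.2.1 and 2.2.2 can be sketched in a few lines of plain Python. This is an illustrative toy, not the thesis implementation: the 5x5 image, the 2x2 edge kernel, and all function names are assumptions made for the example.

```python
import math


def sgn(b):
    """Sign activation: discontinuous at b = 0, so it has no useful derivative."""
    return 1.0 if b >= 0 else -1.0


def tanh(b):
    """Continuous, saturating activation with outputs in (-1, 1)."""
    return math.tanh(b)


def relu(b):
    """Piecewise linear, non-negative, and non-saturating for b > 0."""
    return max(0.0, b)


def conv2d(image, kernel):
    """Stride-1 cross-correlation without padding, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 5x5 image with a vertical dark-to-bright edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to dark-to-bright transitions.
kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, kernel)                          # 4x4 feature map
activated = [[relu(v) for v in row] for row in features]  # element-wise ReLU
pooled = max_pool(activated, size=2)                      # 2x2 after pooling

print(features)
print(pooled)
```

The kernel responds only where the edge is located, and pooling halves each spatial dimension while keeping that response, which is the reduction in computational load mentioned above.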
This part performs a high-level feature extraction from the last convolution and connects it to the output. CNNs are typically trained with supervised learning; in other words, they need labeled datasets for training, which increases the need for good-quality datasets. Backpropagation is commonly used for training. The network weights and thresholds are adjusted to minimize the difference between the labeled targets and the network output [10].

2.2.3 CNN-architecture, Residual Network

Over the past decade, extensive research on CNN architectures has taken place, leading to the successive development of AlexNet, GoogLeNet, ResNet, and DenseNet, to name a few. Each of these architectures has made significant contributions to the development and performance of deep learning models, particularly in the field of image recognition [11], with unique approaches to common issues in deep learning such as vanishing gradients.

Residual Networks (ResNets) are architectures purposely designed to counter the commonly encountered vanishing gradient problem through the use of so-called skip connections [9]. As neural networks become deeper, the gradients used in backpropagation can become very small as a consequence of both the chain rule and the selection of activation functions of saturating nature, such as $\tanh(b)$, leading to slower and even stalled learning during the training process. The key element in ResNets is the residual block, shown in figure 2.2.

Figure 2.2: This figure illustrates the Residual block.

By introducing skip connections, where the input to a layer is added directly to the output of a subsequent layer, the gradients are less likely to diminish to insignificantly small values as they pass through each layer of the network [9].
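The effect of a skip connection on the gradient can be illustrated with a scalar toy model. The sketch below is not a ResNet; it only compares the derivative of a chain of tanh layers with and without the input being added back at each step. The function names, the weight value and the depth are assumptions made for the illustration.

```python
import math


def plain_gradient(x: float, w: float, depth: int) -> float:
    """Derivative of the output with respect to the input for a chain of
    `depth` scalar layers of the form x -> tanh(w * x)."""
    grad = 1.0
    for _ in range(depth):
        x = math.tanh(w * x)
        grad *= w * (1.0 - x * x)  # d/dx tanh(w*x) = w * (1 - tanh^2(w*x))
    return grad


def residual_gradient(x: float, w: float, depth: int) -> float:
    """Same chain, but each layer adds its input back: x -> x + tanh(w * x)."""
    grad = 1.0
    for _ in range(depth):
        f = math.tanh(w * x)
        grad *= 1.0 + w * (1.0 - f * f)  # the identity path contributes the 1
        x = x + f
    return grad


# Hypothetical weight and depth, chosen so that the plain chain saturates.
print("plain   :", plain_gradient(1.0, 0.5, 20))
print("residual:", residual_gradient(1.0, 0.5, 20))
```

In the plain chain every factor is at most the weight 0.5, so the product shrinks geometrically with depth. With the skip connection every factor is at least 1, so the gradient cannot vanish in this toy setting.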
If the desired underlying mapping is denoted as $H(x)$, ResNets reformulate the learning task to instead model the residual function $F(x) = H(x) - x$, and subsequently the original function becomes $H(x) = F(x) + x$. Residual blocks commonly include two or more convolution layers, batch normalization, and ReLU activation functions [9]. ResNets have been shown to achieve remarkable performance and significantly outperform traditional CNN architectures in terms of both accuracy and depth on various image recognition tasks [11]. Using residual blocks effectively allows the network to preserve the essential features learned in earlier layers.

2.2.4 Spatial-temporal data and deep learning models

The temporal dimension is crucial in capturing the dynamics of motion over time, adding complexity to the task of HMR and making it more informative. In HMR, the focus on deep learning techniques and the processing of RGB video data has greatly increased since 2015 [3]. Various methods, including deep learning architectures based on CNN, Recurrent Neural Networks (RNN), and hybrid approaches, have undergone comprehensive analysis of their advantages and limitations [3, 4, 5, 12, 13].

Different architectures for handling spatial-temporal data in HMR have been explored. One approach is the 3D CNN, where the third dimension can be viewed as the time axis. These networks build upon the architecture of 2D CNNs by adding an extra dimension to the input, allowing for the processing of temporal information for several frames in a video sequence. Another approach is the hybrid method, which combines different types of neural networks to handle both spatial and temporal features. For example, CNN-RNN architectures utilize ResNet to extract spatial features and RNNs to extract temporal features. While 2D CNN-based architectures excel in spatial data handling, they cannot capture temporal features effectively.
The limitation can be addressed by including techniques such as optical flow, Long Short-Term Memory (LSTM) networks, which handle sequential data and capture temporal dependencies effectively [4], and temporal grouping [3]. An alternative strategy is that of stream networks, meaning that different types of input are handled by different networks, for instance processing RGB frames in the first stream and optical flow in the second stream [3]. This approach allows for the capture of both spatial and temporal information.

Interestingly, although ordinary 2D CNNs are applied to individual frames and therefore cannot model temporal information, they perform remarkably well in some instances, such as the Sports-1M benchmark [13]. Nevertheless, 3D CNNs still vastly outperform 2D CNNs on large datasets [13]. As a more specific example, [3] evaluates a 3D ResNet of depth 50 and a 2D vision transformer (ViT) with a long short-term memory network (LSTM) on the human motion database (HMDB51). It was shown that the 3D ResNet outperformed the ViT with LSTM, reaching accuracy scores in the train and test phases of 96.7 ± 0.35% and 41.0 ± 0.27%, respectively.

3D CNNs continue to be an explored topic within HMR [13]. An attractive feature of 3D CNNs, compared to the two-stream method, is that the architecture creates the hierarchy and relationship between spatial and temporal features without the need for other information like optical flow [13]. Furthermore, 3D CNNs are known as end-to-end networks, as the input processing and generation of output do not require any additional steps. However, a significant disadvantage of 3D CNNs compared to 2D CNNs is their high parameter count, which is an order of magnitude greater, leading to a higher risk of overfitting and thereby requiring a large volume of data, like Kinetics [3].
In [13], several spatial-temporal architectural models based on 3D CNNs, two-stream networks, and ResNets are studied with regard to their performance on HMR. In particular, architectures such as 2D convolutions over frames, 2D convolutions over video clips, alternating 3D-2D convolutions, and factorization of 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution have been investigated. The residual 2D plus 1D CNN architecture, R(2+1)D, stems from the factorization of the $N_i$ 3D spatiotemporal convolution filters of size $N_{i-1} \times t \times d \times d$ into $M_i$ 2D spatial convolution filters of size $N_{i-1} \times 1 \times d \times d$ and $N_i$ 1D temporal convolution filters of size $M_i \times t \times 1 \times 1$. Here $t$ is the temporal extent and $d$ the spatial width and height of the filters. The hyperparameter $M_i$ defines the number of dimensions in the intermediate space where the signal is mapped during the transition between spatial and temporal convolutions [13]. To maintain a similar number of parameters as a 3D convolution block [13], $M_i$ is chosen according to:

$$ M_i = \frac{t d^2 N_{i-1} N_i}{d^2 N_{i-1} + t N_i}. \tag{2.2} $$

The study in [13] concluded that R(2+1)D, which is closely related to Pseudo-3D, outperforms the other models and even achieves comparable or superior results to benchmark models such as the Inflated 3D ConvNet (I3D) on the datasets Sports-1M, Kinetics, UCF101, and HMDB51 [13]. The performance gain of the R(2+1)D model can be attributed to the factorization of each spatiotemporal block, leading to consecutive spatial and temporal convolutions across the network with the following positive effects. Firstly, an additional nonlinear rectification is incorporated between the two operations, which effectively doubles the number of nonlinearities with the same number of parameters as in 3D convolutions. Secondly, the factorization facilitates optimization, yielding a lower training and testing loss [13].
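The parameter matching behind equation (2.2) can be checked with a short calculation. The sketch below counts the weights of a full 3D convolution and of its (2+1)D factorization, ignoring biases. The function names and the channel counts are hypothetical, and rounding the intermediate width down to an integer is an assumption made here so that it can be used as a channel count.

```python
def intermediate_channels(n_in: int, n_out: int, t: int, d: int) -> int:
    """Intermediate width M_i from equation (2.2), rounded down (assumption)."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_3d(n_in: int, n_out: int, t: int, d: int) -> int:
    """Weights in n_out filters of size n_in x t x d x d."""
    return n_out * n_in * t * d * d


def params_2plus1d(n_in: int, n_out: int, t: int, d: int, m: int) -> int:
    """Weights in m spatial filters (n_in x 1 x d x d) followed by
    n_out temporal filters (m x t x 1 x 1)."""
    spatial = m * n_in * d * d
    temporal = n_out * m * t
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels, 3x3x3 kernels.
n_in, n_out, t, d = 64, 64, 3, 3
m = intermediate_channels(n_in, n_out, t, d)
print("M_i            :", m)
print("3D weights     :", params_3d(n_in, n_out, t, d))
print("(2+1)D weights :", params_2plus1d(n_in, n_out, t, d, m))
```

For this particular choice the two counts coincide exactly; in general the rounding makes the factorized block slightly smaller than the 3D block it replaces.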
2.2.5 Balance of data

In a classification problem, as well as other problems of similar characteristics, it is relevant to look into possible local optima when training a neural network. If the majority of the training data for the model is of one type or class, one such local optimum can be for the model to always predict that class. The loss will seem rather low, but in reality, the model has just approximated the problem with a constant output. To combat this problem, one can balance the dataset so that there is roughly the same number of data samples for each class or data type. In this way, the network is forced to fit another pattern within the data. Regardless of how often the data types or classes normally occur outside the test environment, the network still needs to be trained on a balanced dataset to avoid an unwanted bias [10].

It is common to split the data into a training set, a validation set, and a test set to avoid training biases in evaluation processes. By using different parts of the dataset, the evaluation will simulate the network in use, since it has to handle data that is completely new to it.

2.2.6 Overfitting

When training a neural network such as a CNN, over-training, or in other words training for too long, may result in the network learning unwanted patterns in the training dataset. This also depends on the number of hidden layers within the network, which allows for more complex information classification. The network will adapt to trends and patterns in the training set that may be unique to this set. If this happens, the accuracy against the validation set decreases: since the network has not been trained on the data from the validation set, the patterns it has picked up from the training set do not carry over. However, a network can reach a local peak in accuracy. It is not certain that
the network is overfitting if the validation accuracy is lowered temporarily, see e.g., [10] and [14].

2.2.7 Transfer learning

In neural networks and machine learning, it can sometimes be useful to use information from pre-trained weights and biases in a smaller scope than the original model. By using a pre-trained model with several classification outputs, one can use these outputs as inputs to a new layer or model where the problem dimensionality is significantly reduced. Essentially, one transfers one network model's knowledge of the domain it is trained on to another targeted domain [15]. This domain could, for instance, be a subset of the original one with a more specific classification. This also means that less data is necessary for training the specific model, since the complexity of the problem is already decreased by the pre-trained model [15].

2.2.8 Cross entropy loss

Cross entropy loss is a metric for measuring the performance of classification models. It quantifies how well the predicted probabilities match the actual class labels. For networks with multiple output classes, the cross entropy loss CEL can be calculated as

$$ \mathrm{CEL} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \ln(p_{ij}), \tag{2.3} $$

where $N$ is the number of data points, $C$ is the number of classes, $p_{ij}$ is the predicted probability that data point $i$ belongs to class $j$, and $y_{ij}$ is a boolean value that is 1 if $j$ is the correct class for data point $i$ and 0 if not [10].

2.2.9 Stochastic gradient descent

Stochastic gradient descent (SGD) is a mathematical method for optimizing parameters. The goal is to minimize a loss function. The network model's parameters are updated for each training iteration following an update rule dictated by SGD. First, the dataset is shuffled randomly, then the data is passed through the network.
The gradient of the loss function is calculated, and the parameters are updated using:

$$ w_k = w_{k-1} - \eta \nabla Q(w_{k-1}), \tag{2.4} $$

where $w$ are the weights, $\eta$ is the learning rate (a constant that scales the change), and $Q(w)$ is the loss function. The thresholds are updated in the same way. This is repeated through the dataset and for every parameter [6].

2.3 Containing information in descaled images

A standard format image, such as .jpg and .png, uses pixels to store information, where the resolution describes the number of pixels used. Descaling such an image, therefore, means approximating the same image using fewer pixels. An image of high resolution contains a lot of information: the greater the resolution, the more details can be shown and the clearer the image becomes. However, due to hardware limitations and/or runtime optimization, keeping a low image resolution is often preferable. Sometimes, information from the most important details can remain even if the image is scaled down [16]. How far the resolution can be reduced while still retaining the relevant information depends on the content of the image and the purpose of the downsizing. In the case of hardware limitations and computing speed for neural networks, it also depends on the network size and the GPU used [17]. A common way to determine this is through iterative testing with different image resolutions.

2.4 Evaluating network models

There are many ways to analyze and evaluate neural network models, and which results are relevant depends on the problem the model is aimed to solve or approximate. In machine learning classification models, measures such as accuracy, precision, and recall are commonly used to help evaluate the quality of the classifications [18]. Accuracy is a measure of how often a model can predict the correct class or outcome.
It is calculated using the following equation

$$ A = \frac{p_c}{p}, \tag{2.5} $$

where $A$ is the accuracy, $p_c$ the number of correct predictions and $p$ the total number of predictions. In a classification problem that is binary, i.e. has only two classes, one can split the prediction outcomes into four possible categories. If we imagine one class being positive and the other negative, then:

• True Positive (TP), the model correctly classified positive.
• True Negative (TN), the model correctly classified negative.
• False Positive (FP), the model incorrectly classified positive.
• False Negative (FN), the model incorrectly classified negative.

Using this terminology, accuracy can be written as

$$ A = \frac{TP + TN}{TP + TN + FP + FN}, \tag{2.6} $$

i.e., $TP + TN = p_c$ and $TP + TN + FP + FN = p$. Precision measures how reliably the model classifies true positives, or, in other words, how often the positive classifications are correct. This measure is calculated using the following equation

$$ P = \frac{TP}{TP + FP}, \tag{2.7} $$

where $P$ is the precision, $TP$ is the number of true positive predictions and $FP$ is the number of false positive predictions. Recall measures how well the model can classify one class correctly. In other words, recall measures whether the model finds all instances of this class in a given data set. Recall $R$ is calculated as [18]:

$$ R = \frac{TP}{TP + FN}. \tag{2.8} $$

To manage the trade-off between $P$ and $R$, the F1 score is used as the harmonic mean of these two metrics, giving a single performance measure. Balancing the two measurements is crucial, as both FP and FN should be minimized. The F1 score is calculated as

$$ F_1 = 2 \cdot \frac{P \cdot R}{P + R}, \tag{2.9} $$

which ensures that the score is high only if both $P$ and $R$ are high, making it a robust metric for evaluating the effectiveness of our classification models.

2.4.1 Network certainty

Network certainty measures how decisively the network model acts on each classification.
If an ANN classification model has $m$ output nodes, where each node corresponds to a class, and the node with the highest value indicates the predicted class, the absolute difference between the node values can be used to estimate a model certainty. In this thesis, the network certainty, $\Gamma$, is defined as

$$ \Gamma = \frac{k\, n_{\max} - \sum_{k} \lVert n_k \rVert}{k}, \tag{2.10} $$

where $n_k$ are the node values and $k \in \mathbb{Z}^+$, $k \in [1, m]$. For a combined model that uses several network models, the model certainty is defined as the sum of the individual models' certainties.

2.5 Ultrasonic sensor systems

Many new cars use ultrasonic sensors (USS) to detect objects in proximity. Ultrasonic sensors can measure distances with low power consumption. The sensors send a sound wave with a frequency above the range of human hearing, making them inaudible. If the sound wave hits an object, it will be reflected, and the sensor will then listen for the echo to measure the distance to the object. The reflection amplitude and general direction depend on the object's material and shape, but since the sound wave has a spherical propagation, it is very likely for some sound to reflect back regardless of the shape or material. As sound travels at $v_s \approx 343$ m/s in ground level air [19], the distance to the object can be calculated using

$$ d = \frac{v_s \Delta t}{2}, \tag{2.11} $$

where $\Delta t$ is the difference in time between the emission and detection of the sound wave. These sensors are commonly used for parking and object detection in both the front and rear of the car. New car models have several USSs in the rear and the front, which can all triangulate and listen to each other's echoes as well as their own. Therefore, a specific firing sequence is used to map and measure the objects.

Figure 2.3: Illustration of the firing sequence and how neighboring sensors listen to their own and each other's echoes. The yellow circles indicate the positions of the ultrasonic sensors, and the blue indicates where the camera is located.
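Equation (2.11) can be illustrated with a short calculation. The sketch below converts a measured round-trip time into a distance; the function and constant names are hypothetical, and the speed of sound is the approximate value quoted above.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value for ground level air, as in the text


def echo_distance_m(delta_t_s: float, speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Distance to the reflecting object from the round-trip time, d = v_s * dt / 2."""
    if delta_t_s < 0:
        raise ValueError("the round-trip time cannot be negative")
    return speed_m_s * delta_t_s / 2.0


# An echo that returns after about 5.83 ms corresponds to an object roughly 1 m away.
print(f"{echo_distance_m(5.83e-3):.3f} m")
```

The division by two accounts for the sound travelling to the object and back before it is detected.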
The sensors in the rear of the vehicles have the following notation:

• ROR - Rear Outer Right
• RIR - Rear Inner Right
• RIL - Rear Inner Left
• ROL - Rear Outer Left

In figure 2.3, RIL fires a signal and listens to its own echo, and the neighboring sensors, RIR and ROL, listen to the same echo. There are two sequences of sensor firing, where a sensor either only listens to neighboring sensors or emits a signal and listens to itself. These modes define the firing sequences and are swapped for each sequence. The sensors that listen to other sensors can distinguish which sensor signal they receive by utilizing small differences in sound signal frequency that make each signal unique. Data could be obtained from the following signal ways:

• Direct Signal way - when the receiving sensor detects its transmitted burst (RIL-RIL & RIR-RIR).
• Indirect Signal way - when the receiving sensor detects a burst from its neighbor sensor (RIL-ROL).

3 Method

In this chapter, the method is presented together with the investigated gestures.

3.1 Gestures representation

Two distinct gestures, a 'kick' gesture and a 'hand' swipe, were chosen for this project to activate the trunk actuation. The gestures were selected based on their distinctiveness and ease of detection from both the ultrasonic and the visual sensor perspective. All gestures were, therefore, conducted at a distance of roughly one meter from the trunk.

The 'kick' gesture is a well-established gesture that is sometimes used in combination with a radar sensor. The 'kick' was specifically chosen as users already know it for trunk actuation, as illustrated in figure 3.1. The other gesture investigated is the 'hand' swipe gesture, chosen for its simplicity and natural association with symbolizing opening, see figure 3.2.

Figure 3.1: Illustration of the leg 'kick' gesture. Note that the distance between the starting position and the vehicle was roughly one meter.
Figure 3.2: Illustration of the 'hand' swipe gesture.

3.2 Approach and general idea

The problem formulated in this project does not have a single solution. The created method was influenced by other gesture recognition projects. The main idea of the method is to use all the available information, meaning both the data from the USS and the visual input from the rearview camera on the vehicles, and to use the combined information from these sensors to create a robust model for classifying the information. The classification problem becomes more complex because false positives, meaning that the model classifies a non-intended movement as a gesture, must not activate the actuation. The model needs to know that the gesture was intended, and, at the same time, it should be able to distinguish the same gestures for all people.

With a combined model that fuses the information from the USS and vision data, external information and logic from the vehicle can also be utilized. For instance, the model should first determine if the key to the car is near the vehicle. If it is, the model should check for nearby objects and the change of object distances using the USS. The camera system is activated if an object moves close and this logic is satisfied. Now, the vision model is initiated and uses visual information of the nearby object to classify whether the given object is human or not. The next step is initiated if the object is classified as a human: the vision model then classifies whether a gesture is made. The first time the vision model classifies a gesture, the USS model classifier is initiated to verify the classification. Since this model requires temporal input, a time window is created where the most recent USS measurement replaces the earliest. The outputs of each classifier, vision and USS, should now be combined.
This is done using a weight function that utilizes the network certainty of each model/classifier together with a parameter α that scales each signal, creating an adjustable bias towards one of the models. This parameter is tuned by iterative testing. The vision data is used in combination with the USS data to acquire as much relevant information as possible, so that the additional information acts as a fail-safe against an incorrect classification. In addition, the power consumption is reduced by using the low-power USS before activating the vision system.

The problem was split into smaller parts, and information was handled separately to achieve this logical structure. Three networks were created: a USS-based model, a static vision human detection model, and a dynamic vision model. These models should then work together, following the logic presented in figure 3.3.

Figure 3.3: Illustration of the general simplified logic, where all networks are shown as boxes with an input and output signal. The yellow arrow illustrates the initiation of the time window used for the USS model. The vision and USS model classification outputs are weighed using a factor α to determine the total classification output.

3.3 Combined model

The USS and vision models can be used in combination with each other. Using some external logic to fuse the output classifications, a combined model was created using both the USS and visual inputs. This logic can be tuned to potentially achieve improved performance compared to the USS and vision models separately. The overall logic used in this combined model is illustrated in figure 3.3. In the figure, there are some external functions and information, such as key detection and human detection; these functions are already implemented and are, therefore, assumed to work flawlessly for this project. As an object is classified as a human, the static vision network will be triggered to classify for any gestures.
At the same time, the USS model will collect data points until the length of the time window is satisfied and then classify the measured distance pattern over time. This is triggered by the static vision model when it first classifies a gesture. The USS measurements collected before this instance are used to fill the time window in the USS model. Each new measurement is then inserted, and the oldest data point is removed, so that the time window moves. The outputs of both models are weighted by a weight function that takes the network certainty into consideration together with a tuning parameter that the user can adjust. In this way, the classifications are fused and can easily be tuned, for example to rule out false positives. This logic and utilization of the USS and vision models is defined as the combined model.

To evaluate the model, recorded data was fed to a Python script which simulated the combined model. Randomly selected USS data and vision data that belong to the same classification were fed to the model. External factors such as key proximity and human detection are assumed to always be triggered for these cases.

3.4 Data acquisition

For the network models to be able to identify certain gestures, data has to be collected for all classification tasks involved in the project. In this project, it was necessary to generate new data for the specific gestures and the sensor setup of the provided test vehicles. The test vehicles have systems for data collection from all instruments, and the data was saved on a portable solid-state drive. The data generated by the USS and the rear-view fish-eye camera were synchronized in time. Each measurement for both types of sensors was also initiated simultaneously. Each measurement could be extracted and saved into a folder containing the separate data for each type of sensor in the predetermined mf4 format. The measurements were conducted in the following order:

1.
Discuss and determine what gesture and motion should be recorded and in what position.
2. Find an appropriate area to record the measurements, free from obstructions, to ensure unimpeded movement and accurate gesture capture. The chosen environment replicates typical parking scenarios encountered in urban settings.
3. Set up logging equipment and designate one team member to operate the recording equipment from within the vehicle, starting and stopping each session and monitoring the real-time data stream to the logger, to its hard drive, and the capture through the rear camera system.
4. One person makes the agreed-upon gesture, communicating with the person starting and stopping the recordings about when to initiate each measurement.

In total, 231 recordings were acquired. A snippet of the measurement data logbook is illustrated in figure 3.4. The recordings included three different people making gestures in different situations, with various backgrounds and weather conditions. Figure 3.5 illustrates a few frames from one recording.

Figure 3.4: A snippet of the measurement data logbook. This file connects the data files to the measurements and was used to label the datasets.

Figure 3.5: Illustration of a frame sequence, panels (a) to (f), from a single recording, depicting an individual standing in an open area without performing any gestures.

3.4.1 Collected USS data

The data from the USS contains measures, such as distance and signal amplitude, for each sensor's own and neighboring echoes. The sampling frequency of the USS is 50 Hz. Each USS recording is 262 samples long, corresponding to a recording of roughly 5.2 seconds. The measurements are low-pass filtered to remove extreme points and noise. For the kick gesture, a typical measurement would look like what is illustrated in figure 3.6. For this gesture, a human would stand roughly one meter away from the car trunk and make the gesture.
In the figure, one can distinguish seven data points where the measured distance is significantly closer, which indicates the kick. Some sampled data were more similar to figure 3.7, where extreme value measurements are illustrated. Several points of valuable information were lost due to noise. In some instances, all of the distance readings during the gesture were lost to noise, which resulted in readings that only indicated the presence of an object roughly one meter away. At other times, all the important data points are captured, and a clear motion signature can be detected, which is crucial for the model in classifying this information. The data illustrated in figures 3.6 and 3.7 is filtered such that the points that are considered noise are removed.

Figure 3.6: Illustration of the general distance measured over a time span of around 5 seconds. Note how the detected distance is closer around time step 150. This is the indication of the kick gesture.

Figure 3.7: In the figure to the left, one can see an example of a measurement series where all points of interest were lost to noise, such that no kick gesture could be distinguished. The right figure shows an example of a clear gesture profile. The green points are the preprocessed and merged points, as explained in section 3.6.1. The red and blue points are from echo1 in RIL and RIR, respectively.

By combining 30 measurement sequences, a refined set of data points can be obtained, see figure 3.8.

Figure 3.8: Illustration of several (30) measurement sequences put together.

3.5 Preprocessing: Decoding

The initial preprocessing stage consisted of decoding the acquired recordings.
The process of decoding involved setting up the required environments and employing decoding utilities according to algorithm 1, described in further detail in appendix A.1.

Algorithm 1 Decoding of acquired USS and Vision recordings.
1: Ensure execution environment:
2:   - Linux environment, using WSL2 with Ubuntu for this project.
3:   - Deploy CUDA extension and set up Singularity container.
4: Extract decoding utilities to local workstation.
5: Run script decode_logg.py (see appendix A.1) to streamline decoding:
6: for all USS recordings in the input directory do
7:   - Recreate output directory structure based on input directory structure.
8:   - Construct Singularity command, employing decoding utilities, for conversion.
9:   - Execute the conversion command.
10: end for
11: for all Vision recordings in the input directory do
12:   - Recreate output directory structure based on input directory structure.
13:   - Construct Python command, employing decoding utilities, for conversion.
14:   - Execute the conversion command.
15: end for

3.6 USS model network

The ultrasonic sensor-based model network was built upon the acquired data and the merged pre-processing of USS inputs. Since the sensors used in the project only measure the distance from themselves in one dimension, the idea is to capture a pattern of distance changes over time. The model essentially uses the information from the distance changes over a given time. Several approaches were considered, and in the end, a time window approach was selected, with a window size of 262 measurements. This was done based on the collected data, as one gesture took roughly this time to complete. The vehicle that was used to collect and record the USS data had four rear sensors, where each sensor listened to its own echo and its neighboring echo. The car's mainframe manages the triangulation of these echoes to calculate distances, which restricts direct access to processed data.
This limitation requires the model network to use direct sensory data and excludes the possibility of data points in more dimensions. Therefore, time windows containing distance measurements were used to train the network for these kinds of data structures. The time window for sampling is triggered by an external signal linked to the car's security and proximity alert system, allowing for precise data capture when a gesture is likely to occur. This method enhances model accuracy by focusing on relevant data periods when a gesture is possible, see figure 3.3.

3.6.1 Preprocessing USS classification data

After the raw data extraction and conversion to the format .hd5f, it was possible to extract the direct USS distance data using a Python script, see Appendix A.2. The rear sensor echo distances are extracted from the format, noise is filtered out, and time windows are created where the data from each recorded gesture is merged together separately for each time window. Since the recordings vary in length, measurements longer than 5.2 seconds are cut and shorter measurements are padded. A comma-separated value file (csv file) was created with all the measurements of the same class or gesture. The data was plotted over the given time steps to visualize the time window. Since the data is saved as separate files if it exceeds a given storage size, the Python script combines all similar files in each directory, where each directory contains one measurement sample.

The noise filtering discussed earlier works by removing the data points that are outside the range of 2 mm to 60000 mm, since these are the limits of the hardware, and all points above or below are considered to be noise. Since the sensors are not guaranteed to have disturbances or noise at the same time stamps, the script also checks each step to see if the neighboring sensor picked up a non-noise measurement and adds that value to a new vector containing the merged values from the sensors.
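The filtering, merging and window-fitting steps above can be sketched as follows. This is a minimal illustration and not the script from Appendix A.2: the function names are hypothetical, treating the limit values themselves as valid is an assumption, preferring the first sensor when both readings are valid is an assumption, and so is the padding value, since the text does not state how short recordings were padded.

```python
MIN_MM = 2       # lower hardware limit from the text
MAX_MM = 60000   # upper hardware limit from the text
WINDOW = 262     # samples per time window (about 5.2 s at 50 Hz)


def is_valid(distance_mm):
    """A reading counts as non-noise if it lies within the hardware limits."""
    return distance_mm is not None and MIN_MM <= distance_mm <= MAX_MM


def merge_sensors(first, second):
    """Merge two distance series step by step: take the first sensor's reading
    if it is valid, otherwise fall back to the neighbor; None marks a lost step."""
    merged = []
    for a, b in zip(first, second):
        if is_valid(a):
            merged.append(a)
        elif is_valid(b):
            merged.append(b)
        else:
            merged.append(None)
    return merged


def fit_window(values, length=WINDOW, pad_value=None):
    """Cut series that are too long and pad series that are too short."""
    return values[:length] + [pad_value] * (length - len(values))


# Hypothetical readings in millimetres from two neighboring sensors.
ril = [1000, 70000, 0, 980]
rir = [990, 950, 65000, 70000]
print(merge_sensors(ril, rir))        # second step is recovered from the neighbor
print(len(fit_window(merge_sensors(ril, rir))))
```

Merging before fitting the window means that a step is only lost when both sensors report noise at the same time stamp.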
In that way, more information of interest can be saved in a single measurement vector, which can later be fed to the neural network. These vectors were then combined by the script and saved as a csv-file.

3.6.2 Build USS classification model

Based on the complexity of the data patterns in the classification problem, the network was initiated with three hidden layers and a substantial number of nodes, as stated in the theory section. By taking a given set of measurement frames, or measurements over given time steps, an input window for a short time sequence can act as the input to the network. The model uses the differences in detected distance over the time steps within this window to detect patterns from the performed gesture. Only the ’kick’ gesture was classified to narrow down the complexity of the initial data.

The network had two output nodes, corresponding to whether a ’kick’ was detected or no ’kick’ was detected. The idea of having two output nodes was also to be able to integrate a network certainty, as discussed in theory section 2.4.1. Furthermore, two outputs are useful for the evaluation of both the network and the combined model, which will be discussed later. The node with the maximum output is used as the chosen output.

3.6.3 Training and validation

The USS classification model is a linear dense NN with three hidden layers using the tanh activation function. The network was built for two classes, ’kick’ or ’no kick’, with 262 input nodes, one for each time step in the given time window. See figure 3.9 or A.3 for a detailed description of the network architecture. The time window corresponds to roughly 5.2 seconds of recorded time, which was deemed sufficient to capture all the kick data in the collected samples. A dataloader class was defined in which the labeled data is loaded from the csv files and then easily extracted by a function.
The dataset was then split into a training set, a validation set, and a test set with a respective split of 80 %, 10 %, and 10 %. A training loop was defined in which data from the training set was loaded together with the corresponding labels. The Adam optimizer was used with a learning rate of 0.001, and the mean square error was used as the loss function. Each epoch was monitored while running the loop to roughly evaluate the network model. By observing the trend of the validation accuracy and loss, one can monitor overfitting and evaluate the model as stated in the theory section. The validation accuracy was saved for each epoch, and after training, the model was saved.

Figure 3.9: Illustration of the USS network architecture.

3.7 Static vision model networks

Several projects in gesture recognition and static image analysis use a convolutional neural network (CNN) approach, which was also chosen for this project. Once trained, a CNN is quite compact and requires few computational resources compared to alternative models, such as the vision transformer (ViT) network. For these networks, greyscale imagery was used in this project.

3.7.1 Preprocessing static vision classification data

From the data collection, the recorded visual data was saved in mf4-format, similar to the raw data from the USS. However, a different decoding method was used due to the difference in size and encoding. A script, see appendix A.5, was created for saving each frame as a jpg file and downsizing the image resolution by a factor γ. The lower resolution results in a smaller network, as the input dimensions directly correspond to the number of pixels in the images. If the colors of the pixels are included, each pixel contributes three color channel inputs.
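The relation between the scale factor γ and the number of network inputs can be made concrete with a short calculation. The sketch below assumes the raw camera resolution of 1282 × 722 pixels reported elsewhere in the thesis and assumes that the scaled dimensions are truncated to whole pixels; the function names are illustrative only.

```python
from typing import Tuple


def scaled_resolution(width: int, height: int, gamma: float) -> Tuple[int, int]:
    """Scale both image dimensions by gamma, truncating to whole pixels
    (the truncation rule is an assumption)."""
    return int(width * gamma), int(height * gamma)


def input_size(width: int, height: int, channels: int) -> int:
    """Number of network inputs: one per pixel and color channel."""
    return width * height * channels


if __name__ == "__main__":
    w, h = scaled_resolution(1282, 722, 0.1)
    print("scaled resolution:", (w, h))
    print("greyscale inputs:", input_size(w, h, channels=1))
    print("RGB inputs:", input_size(w, h, channels=3))
    print("full-resolution greyscale inputs:", input_size(1282, 722, channels=1))
```

With γ = 0.1 the greyscale input shrinks from 925,604 values to 9,216, and including color triples the input size to 27,648.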
The scale factor, γ, was tested iteratively as a parameter by using the same network model with a scaled-down input size and comparing the results with the model using the maximal resolution, until a satisfactory scale factor could be determined. See section 2.3 in the theory chapter.

Figure 3.10: (a) Raw frame extracted from one of the mp4-files and (b) the down-scaled version of the same frame using the scale factor γ = 0.1.

All images were manually labeled as either the given gesture or no gesture for each measurement and saved in separate folders for each gesture, counting no gesture as a class. The dataset was split as equally as possible to balance the data.

3.7.2 Static Vision classification model

Using (2.1), a starting point for the initial linear neural network part was made. Several different network structures were tested and implemented. As a base point, three convolutions were used, as this is rather common in facial expression recognition networks of similar characteristics [10]. The network size, pooling layers, and linear deep neural network are all parameters that were varied and implemented in several different ways, yielding different results. Since gestures can contain much information, the network needs to be able to capture a vast amount of feature information. Therefore, the overall size and channels of the network were set up to be able to capture this, and several transformations can be applied to the images, [20] and [21]. A base channel size of 64 was used and varied slightly between the layers.

The static vision classification network was based on the CNN structure with three convolutional layers and three linear layers, which allows for the complex data classification required by the given dataset while still being rather compact. For the linear network part, the layers were also varied in terms of size and number.
As a starting point, one hidden layer was implemented, which connected to the seven outputs, one for each emotion classification. See figure 3.11 and A.6 for the full architecture.

The CNN model was built using PyTorch. Before the data is fed to the network, each image is resized to 100x80 pixels and converted to greyscale such that the input dimension of the network is reduced. The images are also normalized using the standard normalization for greyscale imagery. After this, the dataset is split into a training set, a validation set, and a test set. These are then fed to the training and validation functions in batches that are selected randomly from each set. For this work, a batch size of 32 was used. A cross-entropy loss function was used together with the SGD optimizer from PyTorch’s optim package. The learning rate was set to 0.005. Using the preprocessed data, a training and validation function could be defined, see section 3.7.3.

3.7.3 Training and validation

A validation function was defined, in which the network model is fed data from the validation set and its output is compared to the labeled targets. In this function, each correct classification is counted, and the count is then divided by the total number of validations. This is done using the dataloader for the validation set. Also, for each image or item in the dataloader, the validation loss is calculated using the mean square distance between the network output and the validation targets. This is done in the same way as for the training set.

To combat overfitting, each epoch is monitored while training. When the validation accuracy has reached a peak and then decreases for several iterations, the network is assumed to be overfitting. Also, the training loss was calculated using cross-entropy, which was monitored in the same way.

Figure 3.11: Illustration of the static vision network architecture.
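The overfitting criterion described above, a peak in validation accuracy followed by several epochs without recovery, can be sketched as a small helper. The function name and the default of three epochs for "several iterations" are assumptions for illustration; the thesis does not state a fixed number.

```python
from typing import List, Tuple


def best_epoch_and_overfit(val_acc: List[float], patience: int = 3) -> Tuple[int, bool]:
    """Return the index of the best validation accuracy so far and whether
    overfitting is suspected, i.e. whether at least `patience` epochs have
    passed since that peak without a new maximum."""
    # max() returns the first maximum, so ties resolve to the earliest epoch.
    best_idx = max(range(len(val_acc)), key=val_acc.__getitem__)
    epochs_since_best = len(val_acc) - 1 - best_idx
    return best_idx, epochs_since_best >= patience


if __name__ == "__main__":
    history = [0.60, 0.80, 0.90, 0.85, 0.84, 0.80]
    print(best_epoch_and_overfit(history))
```

In a training loop, the network weights would be stored whenever the returned best index equals the current epoch, so that later epochs cannot degrade the saved model.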
If the training loss continues to decrease while the validation accuracy does not increase, this can also be a sign of overfitting. The validation accuracy was saved for each epoch above a set limit of validation accuracy, and if this accuracy was better than the previous one, the current network was saved. In this way, further training would not negatively impact the saved network.

3.8 Dynamic vision model network

The dynamic vision model network was developed to model spatial-temporal features, in contrast to the static vision model network, which only models the spatial features. The literature study in section 2.2.4 suggests that the ResNet 3D CNN, and especially the R(2+1)D classification model, was a suitable choice as a basis for the dynamic vision model.

Due to time limitations, it was not feasible to develop the R(2+1)D model from scratch, and it was therefore retrieved from the PyTorch library [22]. The model is based on the work described in [13]. To meet the project’s classification requirements, adjustments were made to the model’s architecture as stated in section 3.8.2.

Training the model from scratch on a small dataset, which consisted of 231 videos of multiple classes, posed a significant risk of overfitting due to the high number of trainable parameters, approximately 31.5 million. A large volume of data would be required to minimize the risk of overfitting [3]. Fortunately, PyTorch provides pre-trained models, trained on the large benchmark dataset Kinetics-400, which is widely used for human action recognition [22]. Transfer learning was, therefore, possible [23], with the acquired dataset used for fine-tuning. As the pre-trained model does not accept videos of arbitrary size and length, preprocessing was required.
3.8.1 Preprocessing dynamic vision classification data

As stated in section 3.5, the directory structure consists of folders for each separate measurement, containing one or several mp4-files, depending on the length of the recording. This required the creation of the script merge_videos.py, presented in appendix A.9, which finds and merges mp4-files by concatenation. The fish-eye camera captures clips at 30 frames per second (fps), and through iterative testing, it became clear that 2 fps was sufficient for the dynamic model to achieve high accuracy.

Subsequent steps addressed the labeling process. The script video_Labeling.py, presented in appendix A.10, labels merged video files according to the logbook entries shown in figure 3.4. It executes three types of labeling functions: binary class labeling, multi-class labeling, and extended multi-class labeling. The binary labeling function assigns the labels ’kick’ - ’no_kick’ or ’hand’ - ’no_hand’. The multi-class labeling function assigns the three labels ’kick’, ’hand’ and ’no_gesture’, and the extended multi-class labeling function uses combined gestures and attributes, reaching a total of nine labels, as seen in table 3.1. All three functions generate a csv-file with video paths and labels and a csv-file with the label mapping, creating a dataframe.

Table 3.1: Overview of labeling functions and their corresponding labels

Binary class labeling      Multi-class labeling   Extended multi-class labeling
’kick’      ’hand’         ’kick’                 ’kick right leg’
’no_kick’   ’no_hand’      ’hand’                 ’kick left leg’
                           ’no_gesture’           ’kick right leg and 2 bags’
                                                  ’walk pass perpendicular forward and back’
                                                  ’walk approach and depart gull wing’
                                                  ’walk approach and depart straight’
                                                  ’walk approach and depart straight 2 bags’
                                                  ’hand motion right hand’
                                                  ’hand motion left hand’

The remaining preprocessing was deployed within the principal script named dynamicVisionNetwork.py, presented in appendix A.11.
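The relationship between the three labeling levels in table 3.1 can be illustrated with a small sketch. It is not the code of video_Labeling.py, which labels from logbook entries; the prefix-based rule and the function names below are assumptions chosen to reproduce the grouping in the table.

```python
from typing import Dict, List

# The nine extended labels listed in table 3.1.
EXTENDED_LABELS: List[str] = [
    "kick right leg",
    "kick left leg",
    "kick right leg and 2 bags",
    "walk pass perpendicular forward and back",
    "walk approach and depart gull wing",
    "walk approach and depart straight",
    "walk approach and depart straight 2 bags",
    "hand motion right hand",
    "hand motion left hand",
]


def to_multiclass(extended: str) -> str:
    """Collapse an extended label to 'kick', 'hand' or 'no_gesture'
    (assumed rule: decide by the leading word of the extended label)."""
    if extended.startswith("kick"):
        return "kick"
    if extended.startswith("hand"):
        return "hand"
    return "no_gesture"


def to_binary(extended: str, gesture: str) -> str:
    """Collapse an extended label to '<gesture>' or 'no_<gesture>'."""
    return gesture if to_multiclass(extended) == gesture else f"no_{gesture}"


def label_mapping(labels: List[str]) -> Dict[str, int]:
    """Map each distinct label to an integer class index."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


if __name__ == "__main__":
    for name in EXTENDED_LABELS:
        print(name, "->", to_multiclass(name), "/", to_binary(name, "kick"))
    print(label_mapping([to_multiclass(n) for n in EXTENDED_LABELS]))
```

Under this rule, all ’walk’ recordings fall into ’no_gesture’ in the multi-class task, and every non-kick recording becomes ’no_kick’ in the binary kick task.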
The script was initiated by reading video paths, labels, and mapping from the previously created data. The video data and labels are split into training, validation, and test sets using the train_test_split function from the scikit-learn library in two steps. The initial split separates the data into training and test sets with an 80/20 ratio. A subsequent split further divides the training data into training and validation sets, also with an 80/20 ratio. Stratified sampling was employed for all the datasets to address the imbalance in the distribution of classes.

Following data loading, transformation parameters are defined using PyTorch’s transforms.Compose to ensure that the input data is consistent in size and normalized according to the requirements. To effectively leverage the learned features of the pre-trained model, the configuration parameters and preprocessing operations, such as the frame size, should preferably be consistent with those of the pre-trained model. These are presented in table 3.2 [24].

Table 3.2: Configuration parameters and preprocessing operations for the pre-trained R(2+1)D model [24].

Parameter         Configuration
Frame rate        15
Clips per video   5
Clip length       16
Resize size       [128, 171]
Crop size         [112, 112]
RGB x̄            [0.43216, 0.394666, 0.37645]
RGB σ             [0.22803, 0.22145, 0.216989]

Video datasets are created using a custom VideoDataset class, presented in appendix A.11, which handles the loading and processing of video data according to the defined transformations. In the VideoDataset class, the video frames are read using the read_video method from PyTorch, which returns the frames in a tensor format (C, T, H, W), where C stands for channels, which is three in the case of RGB video, T stands for the temporal dimension, H stands for the height of each frame in pixels, and W stands for the width of each frame in pixels.
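As a side note on the two-step 80/20 split described earlier in this section, the resulting set sizes can be worked out by hand. The sketch below does not call scikit-learn; it assumes that a fractional test share is rounded up to a whole number of samples, which is understood to be how train_test_split treats a fractional test_size, and it ignores the per-class rounding of stratified sampling.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_fraction: float = 0.2,
                val_fraction: float = 0.2) -> Tuple[int, int, int]:
    """Sizes of the training, validation and test sets after two
    consecutive splits (assumed rounding: held-out share rounded up)."""
    n_test = math.ceil(test_fraction * n_samples)
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_fraction * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


if __name__ == "__main__":
    train, val, test = split_sizes(231)
    print(f"train={train}, validation={val}, test={test}")
    print(f"effective shares: {train / 231:.2f}, {val / 231:.2f}, {test / 231:.2f}")
```

For the 231 acquired videos, this gives 147 training, 37 validation and 47 test videos, which corresponds to an effective split of roughly 64 %, 16 % and 20 %.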
Due to time limitations, the configuration parameters in table 3.2 regarding length were not considered. Instead, reducing the frame rate from 30 to 2 fps gives the longest video a length of 70 frames. To ensure that all videos in the datasets have a consistent length, the elementary method of padding with the last frame up to 70 frames was used.

Transformations are applied to each frame using the apply_transform method. This method converts each frame from a tensor to Python Imaging Library image format, normalizes it to the range [0, 1], and applies the predefined transformations. These transformations ensure that the input data is consistent in size and normalized according to the specified mean and standard deviation values. As the R(2+1)D model accepts video frames in the batched format (B, C, T, H, W), where B stands for the number of video samples in a batch, the apply_transform method finally converts the transformed frames back into a tensor and reorders the dimensions to match the expected format (C, T, H, W) [24].

Figure 3.12 shows how a video frame with an original resolution of 1282 × 722 = 925,604 pixels was resized to 128 × 171 = 21,888 pixels and cropped to 112 × 112 = 12,544 pixels. The final cropped image has about 1.36 % of the original number of pixels.

Figure 3.12: (a) A frame at the original mp4 resolution of 1282 × 722 and (b) the down-scaled version of the same frame at 112 × 112; the scale factor is approximately γ = 0.1.

3.8.2 Dynamic Vision classification model

As mentioned earlier, the dynamic vision classification model is based on the 18-layer R(2+1)D model from PyTorch [24], which is in turn based on [13]. PyTorch provides an R(2+1)D-18 model that has been pretrained on the Kinetics-400 dataset; the model has 40.52 GFLOPS and a file size of 120.3 MB. In dynamicVisionNetwork.py, presented in appendix A.11, the class DynamicVisionNN defines a custom neural network model.
It was initiated by loading the pre-trained weights configuration R2Plus1D_18_Weights.DEFAULT into the model r2plus1d_18. Furthermore, the final fully connected layer of the pre-trained model was replaced with a new linear fully connected layer sized for the number of output classes required by the dynamic vision classification model. To prevent the weights of the pre-trained model from being updated during training, all parameters were frozen except for those of the newly added fully connected layer, ensuring that only this layer’s weights would be updated during training.

Figure 3.13 gives a visual representation of the modules, module hierarchy, tensor operations, shapes, and tensors involved during the forward pass of the model. Similarly, figure 3.14 highlights how gradients are computed and how they pass through the model during backpropagation. The nodes are color-coded and represent different types of tensors and functions: gray indicates backward functions, blue indicates reachable tensors requiring gradients, and green indicates the output tensor. A detailed description of the R(2+1)D architecture is presented in A.12.

Figure 3.13: Illustration of the dynamic vision network architecture based on [24]. The final fully connected layer is adjusted for binary classification.

Figure 3.14: Overview of the entire model during the backward pass [24].

3.8.3 Training and validation

Building, training, and validating the dynamic vision model was done with a Python script, seen in appendix A.11. The necessary components for training the model include data loaders, the model itself, an optimizer, and a loss function. A training function was defined, containing a loop that runs for a specified number of epochs. During each epoch, the training data is processed in batches, and for each batch, the model performs a forward pass to compute the output.
The loss was calculated using the cross-entropy loss function defined in section 2.2.8. Backpropagation was then performed, and the Adam optimizer updated the model’s parameters. A batch size of 4 and the Adam optimizer were chosen based on recommendations from the paper [13] and on memory capacity. The learning rate was initiated as η = 0.001.

After the training loop for an epoch, the model’s performance was validated. A validation function was defined and provided with a validation dataset; it returned the validation loss, F1 score, and accuracy. To further detect or monitor overfitting, the model’s performance was validated and printed at the end of each epoch. When the validation F1 score reached a peak and then decreased for several iterations, it was seen as an indication of overfitting. Similarly, if the training loss continued to decrease while the validation accuracy did not improve, this was also seen as a sign of overfitting. To retain the model with the best generalized performance, the model with the best validation F1 score was saved.

4 Results

In this section, the results for the different models and the combined model are presented.

4.1 USS model

The USS-based NN model architecture consists of 262 input nodes, which are connected to a linear hidden layer of 100 nodes. These are then connected to a second linear hidden layer with 64 nodes. After this, a third hidden layer consisting of 32 nodes and biases is connected. These are then connected to two output nodes, which are used to indicate the classification. All layers have biases and use tanh as an activation function.

4.1.1 Model evaluation

The USS model achieved a validation accuracy of 75 percent after 26 epochs of training using a batch size of 10 samples while training on a refined dataset containing only kicks and no kicks.
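The size of the architecture described in section 4.1 can be checked with a short calculation. The sketch below counts the trainable parameters of a fully connected network with biases, using the layer sizes 262, 100, 64, 32 and 2 stated in the text; the helper name is illustrative.

```python
from typing import List


def dense_parameter_count(layer_sizes: List[int]) -> int:
    """Trainable parameters of a dense network with biases:
    for each pair of consecutive layers, n_in * n_out weights plus n_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


if __name__ == "__main__":
    uss_layers = [262, 100, 64, 32, 2]
    print("USS model parameters:", dense_parameter_count(uss_layers))
```

This gives 34,910 trainable parameters, which is about three orders of magnitude fewer than the approximately 31.5 million parameters of the dynamic vision model.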
The model’s performance varies rather distinctly between randomized training loops, but the loss becomes rather stable using the stated parameters, see figure 4.1.

Figure 4.1: Illustration of the validation accuracy over each epoch using a batch size of 12 (blue) together with the validation loss scaled up by a factor of three (orange).

While evaluating the model using all data points in the set, a classification accuracy of 70.59 percent was achieved. The model precision was calculated as 72.50 percent and the recall as 76.32 percent. See table 4.1.

Table 4.1: The number of true positives, true negatives, false positives, and false negatives and their relative mean certainty for the USS model.

      Percentage of data   Quantity   Mean certainty
TP    42.65                29         0.06476634
TN    27.94                19         0.04153539
FP    16.18                11         0.051356044
FN    13.24                9          0.06391618

Figure 4.2: Illustration of the number of TPs and TNs together with the FPs and FNs.

4.2 Static vision models

The static vision model NN architecture consists of three convolutional layers, three max-pooling layers, and three linear layers. The structure can be seen in appendix A.6. The rear-end fish-eye camera captured clips at 30 fps with a resolution of 1282x722 pixels. This resolution was scaled down to 128x72 pixels.

4.2.1 Binary gesture classification

The binary version of the static gesture model network, classifying the kick motion described in section 3.1 against no kick, yielded a validation accuracy of 100 percent over a dataset containing over a thousand data points. The network was iterated for 10 epochs with a batch size of 15. Evaluating the network model over the test dataset, an accuracy of 99.906 percent was obtained with a precision score of 99.825 percent and a recall of 100 percent. This gives an F1 score of 99.912 percent, see table 4.2.
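The accuracy, precision, recall and F1 values reported in this chapter follow directly from the confusion counts. A minimal sketch of the standard definitions is given below, applied to the USS counts in table 4.1 and the static vision counts in table 4.2; the function name and the convention of returning 0 for an undefined ratio are assumptions.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion counts.
    Undefined ratios (zero denominator) are returned as 0.0 by convention."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    uss = classification_metrics(tp=29, tn=19, fp=11, fn=9)
    static = classification_metrics(tp=572, tn=494, fp=1, fn=0)
    for name, metrics in (("USS", uss), ("static vision", static)):
        print(name, {k: round(100 * v, 3) for k, v in metrics.items()})
```

For the USS counts this reproduces the reported accuracy of 70.59 percent, precision of 72.50 percent and recall of 76.32 percent.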
The accuracy and loss trend is illustrated in figure 4.3, and a confusion matrix for the model is shown in figure 4.4.

Table 4.2: The number of true positives, true negatives, false positives, and false negatives and their relative mean certainty.

      Percentage of data   Quantity   Mean certainty
TP    53.61                572        9.971248
TN    46.29                494        6.7518287
FP    0.094                1          1.3280579
FN    0.000                0          -

Figure 4.3: Illustration of the validation accuracy (blue line) over epochs and the validation loss scaled by a factor of three (orange).

Figure 4.4: Confusion matrix for the binary static vision model.

4.2.2 Multiclass gesture classification

When the static vision network was trained on several gestures or classifications, the ’kick’ and ’hand’ gestures explained previously, the model achieved a total validation accuracy of 100 percent using a batch size of 30 while training for 20 epochs. Figure 4.5 shows the validation accuracy for each gesture together with the total validation accuracy and validation loss. The validation loss is scaled up by a factor of 30 for visibility.

Figure 4.5: Illustration of the total validation accuracy over all gestures together with the separate validation accuracies for each gesture (no gesture, kick, hand gesture) and the validation loss.

4.3 Dynamic vision models

The dynamic vision models underwent training, validation and testing on three distinct classification tasks. The first model was designed for binary classification. The second one handled multi-class classification with three classes. The third model extended the multi-class model to a nine-class classification task. The networks were iterated for 20 epochs with a batch size of four.
4.3.1 Binary gesture classification

The first iteration of the dynamic vision model network, classifying only the binary ’kick’ - ’no kick’ gesture, yielded a validation accuracy of 100 percent over a dataset containing 37 videos. The second iteration of the dynamic vision model network, classifying only the binary ’hand’ - ’no hand’ gesture, yielded a validation accuracy of 100 percent over a dataset containing 47 videos.

4.3.1.1 Collected and preprocessed data

Tables 4.3 and 4.4 present the distribution of the binary classes over each of the datasets. As mentioned in section 3.8.1, the distribution was achieved by using train_test_split and stratified sampling. In figure 4.6, it can be observed that a great number of videos at 2 fps are in the range of 10 to 15 frames long. The resulting maximum length of the videos in the dataset was 70 frames.

Table 4.3: Number of videos per class for the ’kick’ - ’no kick’ gesture from the acquired dataset.

Dataset          ’kick’   ’no kick’
Overall set      131      100
Training set     84       63
Validation set   22       15
Test set         25       22

Table 4.4: Number of videos per class for the ’hand’ - ’no hand’ gesture from the acquired dataset.

Dataset          ’hand’   ’no hand’
Overall set      59       172
Training set     38       109
Validation set   9        28
Test set         12       35

Figure 4.6: Illustration of the video lengths, in number of frames, per class in the case of a binary classification task: (a) ’kick’ - ’no kick’ gesture, (b) ’hand’ - ’no hand’ gesture.

4.3.1.2 Model evaluation

For the binary classification task, the validation accuracies of the dynamic vision model reached 100 % for both the ’kick’ - ’no kick’ and the ’hand’ - ’no hand’ classification. The model achieved a perfect score on the test dataset, as seen in tables 4.7 and 4.8. This can be further visualized in the confusion matrices for this model, figures 4.8a and 4.8b, as the diagonal is 100 %.
Figure 4.7: Illustration of the training and validation loss together with validation accuracy and F1 score over 20 epochs: (a) ’kick’ - ’no kick’ gesture metrics, (b) ’hand’ - ’no hand’ gesture metrics.

Table 4.5: The mean certainty and count for true positives, true negatives, false positives, and false negatives for the ’kick’ - ’no kick’ gesture.

      Percentage of data   Quantity   Mean certainty
TP    57.45                27         6.260
TN    42.55                20         3.776
FP    0.0                  0          -
FN    0.0                  0          -

Table 4.6: The mean certainty and count for true positives, true negatives, false positives, and false negatives for the ’hand’ - ’no hand’ gesture.

      Percentage of data   Quantity   Mean certainty
TP    25.53                12         5.808
TN    74.46                35         5.340
FP    0.0                  0          -
FN    0.0                  0          -

Table 4.7: Test metrics of the dynamic vision model for binary classification of the ’kick’ - ’no kick’ gesture.

           Loss    Accuracy   F1
Test set   0.025   1          1

Table 4.8: Test metrics of the dynamic vision model for binary classification of the ’hand’ - ’no hand’ gesture.

           Loss    Accuracy   F1
Test set   0.034   1          1

Figure 4.8: Illustration of the confusion matrix from the test evaluation: (a) ’kick’ - ’no kick’ gesture matrix, (b) ’hand’ - ’no hand’ gesture matrix.

4.3.2 Multi-class gesture classification

The multi-class gesture classification setup of the dynamic vision model classifies, as previously mentioned, three distinct classes: ’kick’, ’hand’ and ’no gesture’. The model again yielded a validation accuracy of 100 percent over a dataset containing 30 videos. In table 4.9, the distribution of the three classes over each of the datasets is presented.
Table 4.9: The number of videos per class in each of the datasets.

Dataset          ’kick’   ’hand’   ’no gesture’
Overall set      131      59       41
Training set     83       38       26
Validation set   21       9        7
Test set         27       12       8

Figure 4.9: Illustration of the video lengths, in number of frames, per class in the case of the multi-class classification task.

4.3.2.1 Model evaluation

The recorded validation accuracy reached 100 % for all three classifications, ’kick’, ’hand’ and ’no gesture’. The model achieved a perfect score on the test dataset, as seen in table 4.11. This can be further visualized in the confusion matrix for this model in figure 4.11, as the diagonal is 100 %.

Figure 4.10: Illustration of the training and validation loss together with validation accuracy and F1 score over 20 epochs.

Table 4.10: The mean certainty and count for true positives, true negatives, false positives, and false negatives.

      Percentage of data   Quantity   Mean certainty
TP    77.14                27         6.060
TN    22.86                8          6.328
FP    0.0                  0          -
FN    0.0                  0          -

Table 4.11: Test metrics of the dynamic vision model for multi-classification.

           Loss    Accuracy   F1
Test set   0.019   1          1

Figure 4.11: Illustration of the confusion matrix from the test evaluation.

4.3.3 Extended multi-class gesture classification

The extended multi-class version includes several gestures, as presented in table 4.12, which shows the distribution of the nine classes over each of the datasets. As mentioned in section 3.8.1, the distribution was achieved by using train_test_split and stratified sampling. In figure 4.12, it can be observed that the ’kick’ and ’hand’ classes have a higher concentration of shorter videos, in the range of 10 to 20 frames long, while the ’walk’ classes have a wider range of video lengths. The maximum length of the videos at 2 fps in the datasets is 70 frames.

Table 4.12: The number of videos per class in each of the datasets for extended multi-class classification.

Class                                        Overall set   Training set   Validation set   Test set
’kick right leg’                             62            35             15               12
’kick left leg’                              50            34             7                10
’kick right leg and 2 bags’                  38            15             1                3
’walk pass perpendicular forward and back’   13            8              3                2
’walk approach and depart gull wing’         12            8              1                3
’walk approach and depart straight’          10            6              2                2
’walk approach and depart straight 2 bags’   4             2              1                1
’hand motion right hand’                     38            25             7                6
’hand motion left hand’                      23            13             3                7

Figure 4.12: Illustration of the video lengths, in number of frames, per class in the case of the extended multi-class classification task.

4.3.3.1 Model evaluation

The dynamic model with extended multi-class classification achieved an accuracy of 85 %, as shown in table 4.14. This can be further visualized in the confusion matrix in figure 4.14, as the diagonal is approximately 100 %.

Figure 4.13: Illustration of the training and validation loss together with validation accuracy and F1 score over epochs.

Table 4.13: The mean certainty and count for true positives, true negatives, false positives, and false negatives.

      Percentage of data   Quantity   Mean certainty
TP    33.33                7          2.886
TN    47.62                10         3.590
FP    4.76                 1          1.013
FN    14.28                3          1.953

Table 4.14: Test metrics of the dynamic vision model for extended multi-classification.

           Loss    Accuracy   F1
Test set   0.511   0.851      0.896

Figure 4.14: Illustration of the confusion matrix from the test evaluation.

4.4 Combined model

The data fed into the combined model consisted of all gesture data of the kick motion and a similar amount of gesture data containing no kick for both models, i.e., image data and USS data.
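The idea of weighing the two model outputs with a factor α can be illustrated with a minimal sketch. The exact fusion rule is not restated in this section, so the convex combination below, the function name and the example values are assumptions for illustration only; the choice of the class with the maximum combined output follows the rule used for the individual networks.

```python
from typing import List, Sequence, Tuple


def fuse_outputs(vision_out: Sequence[float], uss_out: Sequence[float],
                 alpha: float = 0.5) -> Tuple[List[float], int]:
    """Weigh the vision and USS class outputs with alpha and (1 - alpha)
    (an assumed convex combination) and pick the class with the largest
    combined output."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(vision_out) != len(uss_out):
        raise ValueError("both models must output the same number of classes")
    combined = [alpha * v + (1.0 - alpha) * u
                for v, u in zip(vision_out, uss_out)]
    predicted = max(range(len(combined)), key=combined.__getitem__)
    return combined, predicted


if __name__ == "__main__":
    # Hypothetical outputs ordered as [no kick, kick].
    vision = [0.1, 0.9]
    uss = [0.6, 0.4]
    print(fuse_outputs(vision, uss, alpha=0.8))
    print(fuse_outputs(vision, uss, alpha=0.0))
```

With a large α the decision follows the vision model, whereas α = 0 reduces the combined model to the USS model alone, which makes the factor a simple way to express how much each sensor is trusted.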
4.4.1 Model evaluation

By combining the network models for static vision and USS and feeding them randomly selected samples of gesture data and data with no gesture, the results in table 4.15 were obtained after 2000 iterations (chosen to likely capture most of the unique data in the respective datasets).

Table 4.15: Measures of the combined network model evaluation. Note that the network certainty is defined in a different way for the combined model.

      Percentage of data   Quantity   Mean certainty
TP    50                   1000       11.279448
TN    50                   1000       11.277735
FP    0.000                0          -
FN    0.000                0          -

The combined model’s accuracy was 1.0. This accuracy can be compared to those of the static vision network model, 1.0, and the USS network, 0.773.

4.4.2 Combined model using dynamic vision model

Tables 4.16 and 4.17 present the results of processing the full dataset with the binary dynamic vision model for classifying the ’kick’ - ’no kick’ gesture.

Table 4.16: Test metrics for dynamic vision model for multi-classification.

           Loss      Accuracy   F1
Test set   0.03134   0.995      0.995

Table 4.17: The mean certainty and count for true positives, true negatives, false positives, and false negatives.

      Percentage of data   Quantity   Mean certainty
TP    56.71                131        5.971
TN    42.86                99         3.938
FP    0.43                 1          2.059
FN    0                    0          -

5 Discussion

This chapter discusses the results and potential error sources and compares the tested models and the current radar-based systems.

5.1 Non neural network based approach

It is possible to use classical measures, such as the radar-based system, to activate the requested actuation using information from other sources. Still, there are some drawbacks as well. Without a neural network approach to classify the movement signatures of intended gestures, it could be hard to determine whether a human made a gesture intending to open the trunk or not. More accurate real-time analysis would require significant computational resources compared to the ANN solution.
Perhaps someone is walking by, or an object, for instance an animal or a plastic bag, moves close. It would not be optimal if such scenarios triggered the actuation. Therefore, one would need more information to avoid false positives, which would be quite unpleasant or even dangerous for customers. To combat this issue, one could use the key position as an indicator of where the driver is. If the person holding the key is standing behind the car and the classical method determines that the trunk should open, this information could potentially be matched and used to cause an intended actuation more reliably. However, that also means that only the person carrying the key can use this feature. This removes the possibility of, for instance, a family member using the feature without the key.

Perhaps it would be enough to use only the measured ultrasonic distance, but there could still be false positives, for example if the driver walks past the vehicle too closely while carrying the key. This method would then classify the event as a gesture even though it is not. Of course, there are workarounds for this as well, for instance, requiring the driver to stand at the right position for a given time span before the trunk opens. However, this might be somewhat impractical.

5.2 USS model and data

The obtained data from the ultrasonic sensors only captured distance in one dimension, making the human gesture pattern hard to distinguish from a potential object over time. As seen in figures 3.6 and 3.7, each gesture has no clear, unique pattern. Due to measurement noise and errors, not all sampled measurement points in the data collection gave reasonable results, and the data density in time after the noise filtration was ins