Exploring the feasibility of using ultrasonic sensors and cameras for human
gesture recognition to activate trunk opening in vehicles
Master’s thesis in Complex Adaptive Systems and Systems, Control and Mechatronics

Tim Johansson and Krister Mattsson

DEPARTMENT OF ELECTRICAL ENGINEERING

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2024
www.chalmers.se



Master’s thesis 2024

Exploring the feasibility of using ultrasonic
sensors and cameras for human gesture
recognition to activate trunk opening in vehicles

Tim Johansson, Krister Mattsson

Department of Electrical Engineering
Chalmers University of Technology

Gothenburg, Sweden 2024


Exploring the feasibility of using ultrasonic sensors and cameras for human gesture
recognition to activate trunk opening in vehicles
Tim Johansson, Krister Mattsson

© Tim Johansson, Krister Mattsson, 2024.

Supervisor: Pratish Ray, Volvo Cars Exterior Vision & Ultrasonics
Supervisor: Jonas Fredriksson, Department of Electrical Engineering
Examiner: Jonas Fredriksson, Department of Electrical Engineering

Master’s Thesis 2024
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Illustration of the general simplified logic where all networks are shown as
boxes with an input and output signal. The yellow arrow illustrates the initiation
of the time window used for the USS model. Both the vision model and the USS
model classification outputs are weighed using a factor α to determine the total
classification output.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2024



Exploring the feasibility of using ultrasonic sensors and cameras for human gesture
recognition to activate trunk opening in vehicles
Tim Johansson, Krister Mattsson
Department of Electrical Engineering
Chalmers University of Technology

Abstract
The integration of new advanced technologies plays a crucial role in the industrial
market, and the automotive industry is no different. With the introduction of
ultrasonic parking sensors and high-resolution cameras in new vehicles, combined with
the integration of high-performance computing power, it is possible to implement
machine learning and classical methods to process real-time sensor information. This
thesis focuses on recognizing human gestures using the combined information from
the ultrasonic sensors and visual camera data for functional actuation. In particular,
the thesis serves as a feasibility study for using gesture recognition as an input for
activating the automatic opening of the trunk.

Several approaches to this problem have been investigated through literature studies,
and the most suitable method has been determined to be a combination of machine
learning neural networks and sensor fusion from classical methods. Two different
machine learning methods are implemented and analyzed for the visual input: one
model that classifies static images and one model that classifies a series of images
to capture information from dynamic movement. Another model is built for the
ultrasonic parking sensor input, which, similarly to the dynamic vision model, utilizes
a series of measurements in time for the classification. Together, these models form a
logical pipeline that utilizes classical ultrasonic sensor input as an indicator for
activating the models. The models are evaluated for both binary outputs, meaning
classifying gesture or no gesture, and multi-class outputs, meaning classifying several
different gestures.

Separately, the vision models achieved close to perfect test accuracy for both the
binary and the multi-class implementations, while the model for the ultrasonic sensors
achieved a test accuracy of around 70 %. Using sensor fusion, the combined model
achieved perfect test accuracy for both the static and the dynamic implementation,
demonstrating the feasibility of the proposed solution. However, one should note that
the results are all based on a small data pool collected during the thesis. Furthermore,
the data lacks diversity. Implementing the solution on a greater scale would likely
yield some changes in the results. In conclusion, it is possible to reliably use human
gesture recognition for functional actuation from ultrasonic and visual data.

Keywords: Human gesture recognition, machine learning, neural networks, sensor fusion.



Acknowledgements
This thesis was conducted in collaboration with Volvo Cars in Torslanda, Göteborg,
within the department of Safe Vehicle Automation. We would like to extend our
deepest gratitude to the team members of USS Enterprise.

A special thank you goes to Pratish Ray, our supervisor at Volvo, for his unwavering
support throughout the project. We are also grateful to Venu Gopal Puripanda and
Simon Rudh for their technical support related to test vehicles. We would like to
express our appreciation to Khadija Dallah and Srinath Shanmugam for their guidance
in decoding recorded data. Finally, we thank Jonas Fredriksson for taking on
the roles of supervisor and examiner at Chalmers University of Technology. Your
collective expertise, guidance, and support have been invaluable to the success of
this project.

Tim Johansson, Krister Mattsson, Gothenburg, June 2024



List of Acronyms

Below is the list of acronyms that have been used throughout this thesis, listed in
alphabetical order:

ANN Artificial Neural Network
CNN Convolutional Neural Network
FN False Negatives
FP False Positives
HMR Human Motion Recognition
ML Machine Learning
NN Neural Network
PoC Proof of Concept
SGCM Static Gesture Classification Model
TN True Negatives
TP True Positives
USS Ultrasonic Sensors



Nomenclature

Below is the nomenclature of indices, sets, parameters, and variables that have been
used throughout this thesis.

Indices

i,j,k Indices in tensors/matrices
t Index for time step

Parameters

η Learning rate
λ Scaling parameter
$n_h$ Number of hidden layers
$n_s$ Number of samples in training set
$n_i$ Number of input neurons
$n_o$ Number of output neurons

Variables

$O_i$ Outputs
$g(x)$ General activation function
$w_{ij}$ Weights
$x_j$ Nodes
$\theta_i$ Biases
Q(w) Loss function
∆t Time step (time interval)



Contents

List of Acronyms
Nomenclature
List of Figures
List of Tables

1 Introduction
1.1 Background
1.2 Objectives
1.3 Delimitations
1.4 Ethical and Sustainability aspects
1.5 Outline of thesis

2 Theory
2.1 Human motion recognition
2.2 Deep learning
2.2.1 Activation functions
2.2.2 Convolutional neural networks
2.2.3 CNN-architecture, Residual Network
2.2.4 Spatial-temporal data and deep learning models
2.2.5 Balance of data
2.2.6 Overfitting
2.2.7 Transfer learning
2.2.8 Cross entropy loss
2.2.9 Stochastic gradient descent
2.3 Containing information in descaled images
2.4 Evaluating network models
2.4.1 Network certainty
2.5 Ultrasonic sensor systems

3 Method
3.1 Gestures representation
3.2 Approach and general idea
3.3 Combined model
3.4 Data acquisition
3.4.1 Collected USS data
3.5 Preprocessing: Decoding
3.6 USS model network
3.6.1 Preprocessing USS classification data
3.6.2 Build USS classification model
3.6.3 Training and validation
3.7 Static vision model networks
3.7.1 Preprocessing static vision classification data
3.7.2 Static Vision classification model
3.7.3 Training and validation
3.8 Dynamic vision model network
3.8.1 Preprocessing dynamic vision classification data
3.8.2 Dynamic Vision classification model
3.8.3 Training and validation

4 Results
4.1 USS model
4.1.1 Model evaluation
4.2 Static vision models
4.2.1 Binary gesture classification
4.2.2 Multiclass gesture classification
4.3 Dynamic vision models
4.3.1 Binary gesture classification
4.3.1.1 Collected and preprocessed data
4.3.1.2 Model evaluation
4.3.2 Multi-class gesture classification
4.3.2.1 Model evaluation
4.3.3 Extended multi-class gesture classification
4.3.3.1 Model evaluation
4.4 Combined model
4.4.1 Model evaluation
4.4.2 Combined model using dynamic vision model

5 Discussion
5.1 Non neural network based approach
5.2 USS model and data
5.3 Static vision model
5.4 Dynamic vision model
5.5 Combined model
5.6 Compared to current solution
5.7 Data distribution

6 Conclusion

7 Future work
7.1 Dataset expansion
7.2 USS model
7.3 Static vision models
7.4 Improved performance of ResNet
7.5 Improved approach for videos with arbitrary size and length
7.6 Technological Scope
7.7 Combined model

Bibliography

A Appendix 1
A.1 Preprocessing: Decoding
A.1.1 Decoding recorded files
A.2 Python code
A.2.1 USS model
A.2.2 Static vision model
A.3 Preprocessing dynamic vision model
A.4 Dynamic Vision Model
A.5 R(2+1)D Architecture


List of Figures

2.1 Illustration of a deep neural network consisting of an input layer, two
hidden layers and a singular output.

2.2 This figure illustrates the Residual block.

2.3 Illustration of the firing sequence and how neighboring sensors listen
to their own and each other’s echoes. The yellow circles indicate the
positions of the ultrasonic sensors, and the blue indicates where the
camera is located.

3.1 Illustration of the leg ’kick’ gesture. Note that the distance between
the starting position and the vehicle was roughly one meter.

3.2 Illustration of the ’hand’ swipe gesture.

3.3 Illustration of the general simplified logic, where all networks are
shown as boxes with an input and output signal. The yellow arrow
illustrates the initiation of the time window used for the USS model.
The vision and USS model classification outputs are weighed using a
factor α to determine the total classification output.

3.4 A snippet of the measurement data logbook. This file connects the
data files to the measurements and was used to label the datasets.

3.5 Illustration of a frame sequence from a single recording, depicting an
individual standing in an open area without performing any gestures.

3.6 Illustration of the general distance measured over a time span of around
5 seconds. Note how the detected distance is closer around time step
150. This is the indication of the kick gesture.

3.7 In the figure to the left one can see an example of a measurement
series where all points of interest were lost in noise such that no kick
gesture could be distinguished. The right figure shows an example
of a clear gesture profile. The green points are the preprocessed and
merged points, as explained in the methods chapter. The red and
blue points are from echo1 in RIL and RIR respectively.

3.8 Illustration of several (30) measurement sequences put together.

3.9 Illustration of the USS network architecture.

3.10 Raw frame extracted from one of the mp4-files to the left and the
down-scaled version of the same frame to the right using the scale
factor γ = 0.1.

3.11 Illustration of the static vision network architecture.

3.12 In figure 3.12a one can see the original resolution of an mp4-file, and in
figure 3.12b one can see the down-scaled version of the same mp4-file;
the scale factor is approximately γ = 0.1.

3.13 Illustration of the dynamic vision network architecture based on [24].
The final fully connected layer is adjusted for binary classification.

3.14 Overview of the entire model during the backward pass [24].

4.1 Illustration of the validation accuracy over each epoch using a batch
size of 12 (blue) together with the validation loss scaled up by a factor
of three (orange).

4.2 Illustration of the number of TPs and TNs together with the FPs
and FNs.

4.3 Illustration of the validation accuracy (blue line) over epochs and the
validation loss scaled by a factor of three (orange).

4.4 Illustration of the confusion matrix for the binary static vision model.

4.5 Illustration of the total validation accuracy over all gestures together
with the separate validation accuracies for each gesture and the
validation loss.

4.6 Illustration of the video lengths, in number of frames, per class in the
case of a binary classification task.

4.7 Illustration of the training and validation loss together with validation
accuracy and F1 score over 20 epochs.

4.8 Illustration of the confusion matrix from the test evaluation.

4.9 Illustration of the video lengths, in number of frames, per class in the
case of the multi-class classification task.

4.10 Illustration of the training and validation loss together with validation
accuracy and F1 score over 20 epochs.

4.11 Illustration of the confusion matrix from the test evaluation.

4.12 Illustration of the video lengths, in number of frames, per class in the
case of the extended multi-class classification task.

4.13 Illustration of the training and validation loss together with validation
accuracy and F1 score over epochs.

4.14 Illustration of the confusion matrix from the test evaluation.

7.1 Illustration of the leg swipe motion.

A.1 Overview of the system environment configuration for preprocessing
of USS and vision data.


List of Tables

3.1 Overview of labeling functions and their corresponding labels.

3.2 Configuration parameters and preprocessing operations for the
pretrained R(2+1)D model [24].

4.1 The number of true positives, true negatives, false positives, and false
negatives and their relative mean certainty are shown in this table for
the USS model.

4.2 This table shows the number of true positives, true negatives, false
positives, and false negatives and their relative mean certainty.

4.3 Number of videos per ’kick’ - ’no kick’ gesture from the acquired dataset.

4.4 Number of videos for the ’hand’ - ’no hand’ gesture from the acquired
dataset.

4.5 The mean certainty and count for true positives, true negatives, false
positives, and false negatives for the ’kick’ - ’no kick’ gesture.

4.6 The mean certainty and count for true positives, true negatives, false
positives, and false negatives for the ’hand’ - ’no hand’ gesture.

4.7 Test metrics of dynamic vision model for binary classification of ’kick’
- ’no kick’ gesture.

4.8 Test metrics of dynamic vision model for binary classification of ’hand’
- ’no hand’ gesture.

4.9 This table shows the number of videos per class in each of the datasets.

4.10 The mean certainty and count for true positives, true negatives, false
positives, and false negatives.

4.11 Test metrics of dynamic vision model for multi-classification.

4.12 This table shows the number of videos per class in each of the datasets
for extended multi-class classification.

4.13 The mean certainty and count for true positives, true negatives, false
positives, and false negatives.

4.14 Test metrics of dynamic vision model for extended multi-classification.

4.15 This table shows measures of the combined network model evaluation.
Note that the network certainty is defined in a different way for the
combined model.

4.16 Test metrics for dynamic vision model for multi-classification.

4.17 The mean certainty and count for true positives, true negatives, false
positives, and false negatives.


1 Introduction

This chapter presents the project and its background, delimitations, and outline.

1.1 Background

In the automotive industry, the integration of advanced technologies plays a pivotal
role in enhancing safety and user experience. With the introduction of autonomous
driving and driver assistance features, the industry has significantly increased the
deployment of sensors in its vehicles, allowing for an increased perception of their
surroundings. Furthermore, advancements in machine learning have paved the way
for novel solutions that not only enhance the efficiency of features but also reduce
costs for manufacturers, as they allow for the possibility of removing previously
required sensors. For instance, Tesla replaced their front-facing radar with vision [1].

A feature that has gained recognition across the industry, adding convenience and
innovation to the overall user experience, is the radar-based contactless control of
the trunk. The radar system is centered underneath the rear bumper and requires
a person to approach the center of the rear bumper and perform a ’kick’ gesture
to activate and open the trunk. The detection range of the radar is limited by its
placement, which consequently requires the person to stand close to the rear to
activate the trunk. Depending on the model of the vehicle, it can be necessary for
the person to inconveniently take a step back to not be in the way of the trunk lid’s
path. In addition to the user experience challenges, the current radar-based
implementation incurs significant costs. The cost associated with the current
solution for contactless trunk control through a single-purpose radar system is
extensive, considering the intricate integration projects and expenses associated with
supplier, production, logistics, and service contracts. According to the function
owner, the radar-based system has an accuracy of around 96 %.

The automotive industry continuously strives for cost-efficient and innovative
solutions. This project focuses on leveraging the existing ultrasonic sensors (USS)
and a rear-view fish-eye camera to replace the radar-based system. Such a solution
would eliminate the need for these radar sensors, saving all costs associated with
material and logistics, which would, in turn, reduce the environmental impact.


1.2 Objectives
The main objectives of the project are as follows:

• Determine a suitable approach to detect and classify patterns of human
gestures in real-time from echo and vision data.

• Translate meaningful human gestures into inputs for functional actuation.

• Compare model accuracy between the proposed system from the captured data
set recorded in this thesis project and the radar-based system.

1.3 Delimitations
This section outlines the boundaries and limitations of the project, ensuring clarity
and managing expectations regarding the outcomes.

• In this project, only the available ultrasonic sensors and the camera positioned
at the rear of the car are utilized, without exploring additional sensor options.

• The project focuses on a specific set of gestures rather than a comprehensive
range. Given the limited availability of test cars, this ensures a sufficiently
large dataset for each considered gesture. Consequently, there is a limited
distribution of the performed gestures concerning the number of people
performing the gesture and environmental factors like weather.

• The recording sessions are conducted on the company’s premises, leading to
further constraints.

• The project does not address constraints related to car system integration,
such as computational load, storage capacity, or system architecture.

• The thesis project exclusively considers the user intention of opening the trunk
by one person, without exploring other potential user intentions.

• All development is performed on local workstations to protect confidential and
sensitive information and proprietary knowledge. This approach is essential
to comply with company policies and ensure data security.

1.4 Ethical and Sustainability aspects
This project utilizes sensors already installed on the car and will, therefore, have
a minimal impact on sustainability and will not further increase the risk of privacy
intrusion. The thesis work is purely software-oriented, and the technology is intended
for comfort, accessibility, and simplicity through functional actuation, such as
opening the trunk. No personal information such as name, age, or gender is
recorded, ensuring the privacy of the persons participating in the recording sessions.

1.5 Outline of thesis
Advancements in machine learning have paved the way for novel solutions that
enhance feature efficiency and reduce manufacturer costs by potentially eliminating
previously required sensors.

In this thesis report, the underlying theory used in the project is presented in
the theory chapter, and the methods for classifying gestures are presented in the
method chapter. After this, the results obtained using the presented methods are
stated and illustrated. The discussion chapter then examines the results and
potential error sources. Here, the created models for the project are also
compared to the current radar-based system. After this, the conclusion is presented,
followed by some ideas for future work.



2 Theory

This section introduces the underlying theory used in the project to motivate the
method and analyze the results.

2.1 Human motion recognition
Human motion recognition (HMR) involves the processes of identification,
classification, and characterization of human movements [2]. In the context of computer
vision, HMR is a multidisciplinary field composed of biomechanics, machine vision,
image processing, data analytics, nonlinear modeling, and pattern recognition [3].
The development of an efficient HMR system requires it to handle a vast diversity
of human features like body size, postures, and appearances, as well as environmental
factors like illumination, viewing angles, and disturbances. The complexity of
human motion and the variability of recording conditions make HMR challenging,
but extensive research has gone into HMR due to its wide range of applications [2].
Each of the applications faces similar primary challenges: interpreting ambiguous
poses and actions; varying interpretations of classification; potential partial
occlusion of bodies or objects; poor video quality, including blurring and noisy data
from low-quality sensors; significant time differences between actions; inadequate or
excessive lighting; and difficulty in acquiring large-scale datasets [3]. These challenges
necessitate advanced methods to accurately capture and analyze human motion.

HMR can be broadly divided into two categories: vision-based and sensor-based
recognition [4]. The vision-based method relies on one or more cameras. The approach
for reaching motion predictions varies significantly depending on the techniques
employed, and it continues to be a field of interest for studies within the topic of
HMR. The sensor-based method, on the other hand, is a more standard approach and
an extensively researched area, given the feasibility of attaching sensors or using
mobile devices [4].

Recent studies, such as those reviewed in [5], have explored a variety of HMR
methods, covering traditional approaches to manually designed motion features extracted
from RGB and depth data, as well as modern deep learning-based approaches for
motion feature representation, techniques for recognizing human-object interactions,
and methods for action detection. Unlike image classification, which primarily
focuses on spatial information, vision-based HMR requires the integration of temporal
information to accurately capture and analyze motion sequences. The review in [5]
concludes that deep learning-based methods exhibit superior performance in motion
feature learning problems, as they leverage advanced neural network architectures to
learn complex patterns and relationships within the data. In addition, deep
learning-based methods are considerably more resource-efficient than traditional
computer vision approaches [4].

2.2 Deep learning
Deep learning is a subset of machine learning based on Artificial Neural Networks
(ANN) in which hidden layers are introduced to capture complex and intricate patterns
in data. Many problems addressed with artificial neural networks cannot be solved by
linear separation alone; deep learning models, or deep neural networks, can
approximate more complex patterns of information and can therefore classify
non-linear problems [6]. A deep neural network consists of one or more hidden layers
in addition to the input and output layers, see figure 2.1. The hidden layers are
pivotal for an ANN’s ability to capture complex classification patterns. For nonlinear
classification, this ability to separate intricate data patterns becomes necessary. Each
hidden layer in the network makes a more complex classification possible, but it also
adds more parameters to tune. The extra size and parameters also mean that a deep
learning model often requires large datasets for all weights and biases to be tuned in
a desirable way [7].

Figure 2.1: Illustration of a deep neural network consisting of an input layer, two
hidden layers and a singular output.

A common approximation measure for determining a reasonable number of hidden
layers according to [8] is:

$$ n_h = \frac{n_s}{\lambda (n_i + n_o)}, \tag{2.1} $$

where $n_h$ is the number of hidden layers, $n_s$ the number of samples in the training
set, $n_i$ the number of input neurons, $n_o$ the number of output neurons, and $\lambda$
is a constant which is usually in the range of 2-10.
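
As a worked illustration, equation (2.1) can be evaluated directly. The sketch below is a minimal example and not part of the thesis code; the function name and the sample counts (3000 training samples, 48 inputs, 2 outputs) are hypothetical values chosen only to show how the estimate changes over the stated range of the scaling constant. Rules of thumb of this form are often quoted for the number of hidden neurons, so the sketch simply follows the definition given in the text.

```python
def hidden_layer_estimate(n_samples: int, n_inputs: int, n_outputs: int, scale: float) -> float:
    """Evaluate equation (2.1): n_h = n_s / (lambda * (n_i + n_o))."""
    if scale <= 0 or (n_inputs + n_outputs) <= 0:
        raise ValueError("scale and (n_inputs + n_outputs) must be positive")
    return n_samples / (scale * (n_inputs + n_outputs))


# Hypothetical problem size, used for illustration only.
N_SAMPLES, N_INPUTS, N_OUTPUTS = 3000, 48, 2

# Sweep the scaling constant over the range mentioned in the text (2-10).
for scale in (2, 5, 10):
    estimate = hidden_layer_estimate(N_SAMPLES, N_INPUTS, N_OUTPUTS, scale)
    print(f"lambda = {scale:>2}: n_h = {estimate:.1f}")
```

A larger scaling constant gives a smaller estimate, which makes the rule more conservative when the training set is small.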


2.2.1 Activation functions
An activation function within the field of ANNs is a mathematical function that
converts the input of each neuron in a network layer to an output value of some type,
ranging from continuous positive and negative numbers to specific integers, depending
on the network specifications. Activation functions can be of different forms. Two
commonly used functions are tanh(b) and sgn(b), where b is the local field of a neuron,
computed from the neuron states, weights, and biases of the current layer. There is
one distinct difference: tanh(b) is continuous while sgn(b) is not. This detail becomes
important when the networks are trained, as it is relatively common to use training
algorithms, such as backpropagation, which utilize the activation function’s
derivative. It is also important to note that when the activation function is
continuous, the states of the neurons also become continuous. Another activation
function that is commonly used in image classification networks and CNNs is the
Rectified Linear Unit (ReLU) function. ReLU is a piecewise linear, non-negative
activation function. One of the key advantages of ReLU is its non-saturating property
for positive inputs, which mitigates the phenomenon of vanishing gradients [9].
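
The difference between the three activation functions can be made concrete with a small numerical sketch. The code below is illustrative only and uses the Python standard library; the function names and the convention sgn(0) = +1 are assumptions made for this example. It prints each function together with the derivatives of tanh and ReLU for a few values of the argument b.

```python
import math


def sgn(b: float) -> float:
    """Sign activation; discontinuous at b = 0 (sgn(0) = +1 is an assumed convention)."""
    return 1.0 if b >= 0 else -1.0


def tanh(b: float) -> float:
    """Hyperbolic tangent activation; continuous and saturating."""
    return math.tanh(b)


def tanh_prime(b: float) -> float:
    """Derivative of tanh: 1 - tanh(b)^2, which approaches 0 for large |b|."""
    return 1.0 - math.tanh(b) ** 2


def relu(b: float) -> float:
    """Rectified Linear Unit: max(0, b)."""
    return max(0.0, b)


def relu_prime(b: float) -> float:
    """Derivative of ReLU: 1 for b > 0, otherwise 0 (value at b = 0 chosen as 0)."""
    return 1.0 if b > 0 else 0.0


for b in (-5.0, -0.5, 0.5, 5.0):
    print(
        f"b = {b:+.1f}  sgn = {sgn(b):+.0f}  tanh = {tanh(b):+.4f}  "
        f"tanh' = {tanh_prime(b):.4f}  relu = {relu(b):.1f}  relu' = {relu_prime(b):.0f}"
    )
```

For large positive b the derivative of tanh is close to zero while the derivative of ReLU stays at one, which is the non-saturating behavior referred to above.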

2.2.2 Convolutional neural networks
A convolutional neural network (CNN) is a deep learning model designed and mainly
used for processing visual data. More specifically, CNNs are well suited for tasks
such as image classification and object detection within images or videos. CNNs
include convolutional layers, where each layer applies filters or kernels to the input
data. The kernels or filters are used to detect features and patterns within the visual
data. Pooling layers are a common way to downsize the spatial dimensions after a
convolution layer. This reduces the computational resources necessary for training
and using the network. At the end of the network, after the convolutions and pool-
ing layers, CNNs typically have one or more fully connected layers connecting the
last layer with the output. This part performs a high-level feature extraction from
the last convolution and connects it to the output.
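
The convolution and pooling steps described above can be sketched in plain Python
as follows. This is a toy example on a 5x5 image with a hypothetical hand-picked
kernel; in a real CNN the kernel values are learned during training.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation,
    which is what deep learning frameworks usually call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows and columns that do not fill a
    complete window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 5x5 toy image with a vertical edge between column 1 and column 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

features = conv2d_valid(image, kernel)  # 4x4 feature map
pooled = max_pool2d(features)           # 2x2 map after pooling
print(features)
print(pooled)
```

The feature map is non-zero only where the edge is located, and the pooled map
keeps that response while halving each spatial dimension.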

CNNs use supervised learning, or in other words, they need labeled datasets for
training, which furthers the need for good-quality datasets. Backpropagation is
commonly used for training. The network weights and thresholds are adjusted to
minimize the difference between the labeled targets and the network output [10].

2.2.3 CNN-architecture, Residual Network
Over the past decade, extensive research on CNN architectures has taken place, lead-
ing to the successive development of AlexNet, GoogLeNet, ResNet, and DenseNet, to
name a few. Each of these architectures has made significant contributions to the
development and performance of deep learning models, particularly in the field of
image recognition [11], with unique approaches to address common issues in deep
learning, such as vanishing gradients.

To address the commonly encountered vanishing gradient problem, Residual Networks
(ResNets) are architectures purposely designed to counter the issue through the
use of so-called skip connections [9]. As neural networks become deeper, the gra-
dients used in backpropagation can become very small as a consequence of both
the chain rule and the selection of saturating activation functions, such as
tanh(b), leading to slower and even stalled learning during the training process.
The key element in ResNets is the Residual block, shown in figure 2.2.

Figure 2.2: This figure illustrates the Residual block.

By introducing skip connections, where the input to a layer is added directly to
the output of a subsequent layer, the gradients are less likely to diminish to in-
significantly small values as they pass through each layer of the network [9]. If the
desired underlying mapping is denoted as H(x), ResNets reformulate the learning
task to instead model the residual function F(x) = H(x) − x, and subsequently the
original function becomes H(x) = F(x) + x. Residual blocks will commonly include
two or more convolution layers, batch normalization, and ReLU activation functions
[9]. ResNets have been shown to achieve remarkable performance and significantly
outperform traditional CNN architectures in terms of both accuracy and depth on
various image recognition tasks [11]. Using residual blocks effectively allows the
network to preserve the essential features learned in earlier layers.
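
The effect of the skip connection on the gradient can be illustrated with a scalar
toy model. The sketch below is an assumption-laden simplification (one weight per
layer, tanh as F, no batch normalization) and is not the residual block used in
this thesis. It compares the derivative of the output with respect to the input
for a plain chain of layers and for a chain where each layer computes
H(x) = F(x) + x.

```python
import math


def input_gradient(depth, weight, x=0.5, residual=False):
    """Return d(output)/d(input) for a chain of scalar tanh layers.

    Plain layer:    h_next = tanh(weight * h)
    Residual layer: h_next = h + tanh(weight * h), i.e. H(x) = F(x) + x
    """
    h, grad = x, 1.0
    for _ in range(depth):
        f = math.tanh(weight * h)
        f_prime = weight * (1.0 - f ** 2)  # dF/dh via the chain rule
        if residual:
            grad *= 1.0 + f_prime          # the skip connection adds the identity term
            h = h + f
        else:
            grad *= f_prime
            h = f
    return grad


plain = input_gradient(depth=20, weight=0.5)
skip = input_gradient(depth=20, weight=0.5, residual=True)
print(f"plain: {plain:.3e}, residual: {skip:.3e}")
```

In the plain chain every layer multiplies the gradient by a factor below one, so
it shrinks towards zero with depth. With the skip connection each factor is
1 + F'(x), so the gradient does not vanish in this example.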

2.2.4 Spatial-temporal data and deep learning models
The temporal dimension is crucial in capturing the dynamics of motion over time,
adding complexity to the task of HMR and making it more informative. In HMR,
the focus on deep learning techniques and the processing of RGB video data has
greatly increased since 2015 [3]. Various methods, including deep learning architec-
tures based on CNN, Recurrent Neural Networks (RNN), and hybrid approaches,
have undergone comprehensive analysis of their advantages and limitations [5, 12,
3, 4, 13].

Different architectures for handling spatial-temporal data in HMR have been ex-
plored. One approach is 3D CNNs, where the third dimension can be viewed as
the time axis. These networks build upon the architecture of 2D CNNs by adding an
extra dimension to the input, allowing for the processing of temporal information
across several frames in a video sequence. Another approach is the hybrid method,
which combines different types of neural networks to handle both spatial and tem-
poral features. For example, CNN-RNN architectures utilize a ResNet to extract
spatial features and RNNs to extract temporal features. While 2D CNN-based ar-
chitectures excel in spatial data handling, they cannot capture temporal features
effectively. This limitation can be addressed by including techniques such as optical
flow, temporal grouping [3], and Long Short-Term Memory (LSTM) networks, which
handle sequential data and capture temporal dependencies effectively [4]. An
alternative strategy is that of two-stream networks, meaning that different types of
inputs are handled by separate networks, for instance by processing RGB frames in
the first stream and optical flow in the second stream [3]. This approach allows for
the capture of both spatial and temporal information.

Interestingly, although ordinary 2D CNNs are applied to individual frames and
therefore cannot model temporal information, they perform remarkably well in some
instances, such as on the Sports-1M benchmark [13]. Nevertheless, 3D CNNs still
vastly outperform 2D CNNs on large datasets [13]. As a more specific example, [3]
evaluates a 3D ResNet of depth 50 and a 2D vision transformer (ViT) with a long
short-term memory network (LSTM) on the human motion database (HMDB51). It was
shown that the 3D ResNet outperformed the ViT with LSTM, reaching accuracy scores
in the train and test phases of 96.7 ± 0.35% and 41.0 ± 0.27%, respectively.

3D CNNs continue to be an actively explored topic within HMR [13]. An attractive
feature of 3D CNNs, compared to the two-stream method, is that the architecture
creates the hierarchy and relationship between spatial and temporal features with-
out the need for other information such as optical flow [13]. Furthermore, 3D CNNs
are known as end-to-end networks, as the input processing and generation of output
do not require any additional processing steps. However, a significant disadvantage
of 3D CNNs compared to 2D CNNs is their parameter count, which is an order of
magnitude greater, leading to a higher risk of overfitting and thereby requiring a
large volume of data, such as the Kinetics dataset [3].

In [13], several spatial-temporal architectural models based on 3D CNNs, two-stream
networks, and ResNets are studied with regard to their performance on HMR. In
particular, architectures such as 2D convolutions over frames, 2D convolutions over
video clips, alternating 3D-2D convolutions, and factorization of 3D convolution
into a 2D spatial convolution followed by a 1D temporal convolution have been in-
vestigated. The residual 2D plus 1D CNN architecture, R(2+1)D, stems from the
factorization of the $N_i$ 3D spatiotemporal convolution filters of size
$N_{i-1} \times t \times d \times d$ into $M_i$ 2D spatial convolution filters of size
$N_{i-1} \times 1 \times d \times d$ and $N_i$ 1D temporal convolution filters of size
$M_i \times t \times 1 \times 1$. The hyperparameter $M_i$ defines the number of dimen-
sions in the intermediate space onto which the signal is mapped during the transition
between the spatial and temporal convolutions [13]. To maintain approximately the
same number of parameters as a 3D convolution block [13], $M_i$ is chosen according to:

$$M_i = \frac{t d^2 N_{i-1} N_i}{d^2 N_{i-1} + t N_i}. \tag{2.2}$$
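
To see why this choice of the intermediate dimension keeps the parameter count
comparable, the sketch below evaluates equation (2.2) and counts the weights of
both variants. The kernel sizes t = d = 3, the 64 input and output channels, and
the rounding down to an integer are assumptions made for this illustration;
biases are ignored.

```python
def intermediate_channels(n_in, n_out, t=3, d=3):
    """M_i from equation (2.2), rounded down to an integer (assumption)."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_3d(n_in, n_out, t=3, d=3):
    """Weights in n_out 3D filters of size n_in x t x d x d."""
    return n_out * n_in * t * d * d


def params_2plus1d(n_in, n_out, t=3, d=3):
    """Weights in the factorized block: M_i spatial filters of size
    n_in x 1 x d x d plus n_out temporal filters of size M_i x t x 1 x 1."""
    m = intermediate_channels(n_in, n_out, t, d)
    return m * n_in * d * d + n_out * m * t


m = intermediate_channels(64, 64)
print(m, params_3d(64, 64), params_2plus1d(64, 64))
```

For this configuration the two blocks have exactly the same number of weights.
In general the rounding makes the factorized block slightly smaller.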

The study in [13] concluded that R(2+1)D, which is closely related to Pseudo-3D,
outperforms the other models and even achieves results comparable or superior to
those of benchmark models such as the Inflated 3D ConvNet (I3D) on the Sports-1M,
Kinetics, UCF101, and HMDB51 datasets [13]. The performance gain of the R(2+1)D
model can be attributed to the factorization of each spatiotemporal block, leading
to consecutive spatial and temporal convolutions across the network with the fol-
lowing positive effects. Firstly, an additional nonlinear rectification is incorporated
between the two operations, which effectively doubles the number of nonlinearities
with the same number of parameters as in 3D convolutions. Secondly, the factorization
facilitates optimization, yielding lower training and testing losses [13].

2.2.5 Balance of data
When a neural network is applied to a classification problem, or to other problems
of similar characteristics, it is relevant to look into possible local optima during
training. If the majority of the training data is of one type or class, one such
local optimum is for the model to always predict that class. The loss will seem
rather low, but in reality, the model has just approximated the problem with a
constant output determined by the dominant class. To combat this problem, one can
balance the dataset so that there is roughly the same number of data samples for
each class or data type. In this way, the network is forced to fit another pattern
within the data. Regardless of how often the data types or classes normally occur
outside the test environment, the network still needs to be trained on a balanced
dataset to avoid an unwanted bias [10].

It is common to split the data into a training set, a validation set, and a test
set to avoid training biases in evaluation processes. By using different parts of the
dataset, the evaluation will simulate the network in use since it has to handle data
that is completely new to it.
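
The two steps above, balancing and splitting, can be sketched as follows. The
example uses random undersampling of the larger class and a 70/15/15 split; both
are assumptions chosen for illustration, as are the class names and sample
counts.

```python
import random
from collections import defaultdict


def balance_by_undersampling(samples, labels, seed=0):
    """Keep the same number of samples per class by randomly dropping
    samples from the larger classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_keep = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        balanced.extend((s, label) for s in rng.sample(items, n_keep))
    rng.shuffle(balanced)
    return balanced


def split_dataset(data, train=0.7, val=0.15):
    """Split an already shuffled list into training, validation and test parts."""
    n_train = int(len(data) * train)
    n_val = int(len(data) * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]


# Hypothetical imbalanced dataset: 90 'no_gesture' and 10 'kick' samples.
samples = list(range(100))
labels = ["no_gesture"] * 90 + ["kick"] * 10
balanced = balance_by_undersampling(samples, labels)
train_set, val_set, test_set = split_dataset(balanced)
print(len(balanced), len(train_set), len(val_set), len(test_set))
```

Undersampling discards data from the majority class. When data is scarce,
oversampling the minority class is an alternative with the same balancing effect.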

2.2.6 Overfitting
When training a neural network such as a CNN, over-training, or in other words
training for too long, may cause the network to pick up unwanted patterns in the
training dataset. This also depends on the number of hidden layers within the
network, since more layers allow for more complex patterns to be learned. The
network will adapt to trends and patterns that may be unique to the specific
training set. If this happens, the accuracy against the validation set decreases,
since the network has not been trained on the validation data and the patterns
specific to the training set do not carry over to it. However, a network can reach
a local peak in accuracy, so it is not certain that the network is overfitting if
the validation accuracy is lowered temporarily, see e.g., [10] and [14].

2.2.7 Transfer learning
In neural networks and machine learning, it can sometimes be useful to use infor-
mation from pre-trained weights and biases in a smaller scope than the original
model. By using a pre-trained model with several classification outputs, one can use
these outputs as inputs to a new layer or model where the problem dimensionality
is significantly reduced. Essentially, one transfers a network model’s knowledge of
the domain or area it is trained on to another targeted domain [15]. This domain
could, for instance, be a subset of the original one with a more specific classification.
This also means that less data is necessary for training the specific model since the
complexity of the problem is already decreased by the pre-trained model [15].

2.2.8 Cross entropy loss
Cross entropy loss is a metric for measuring the performance of classification
networks. It quantifies how well the predicted probabilities match the actual class
labels. For networks with multiple output classes, the cross entropy loss CEL can
be calculated as

$$\mathrm{CEL} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \ln(p_{ij}), \tag{2.3}$$

where $N$ is the number of data points, $C$ is the number of classes, $p_{ij}$ is the
predicted probability that data point $i$ belongs to class $j$, and $y_{ij}$ is a boolean
value that is 1 if $j$ is the correct class for data point $i$ and 0 if not [10].
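
A direct implementation of equation (2.3) could look like the sketch below. The
function name and the small constant eps, which guards against taking the
logarithm of zero, are assumptions of this example and not part of the definition.

```python
import math


def cross_entropy_loss(y_true, p_pred, eps=1e-12):
    """Mean cross entropy over N data points and C classes, equation (2.3).

    y_true: N lists of one-hot labels (0 or 1).
    p_pred: N lists of predicted class probabilities.
    """
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        for y, p in zip(y_row, p_row):
            if y:  # terms with y_ij = 0 do not contribute
                total += y * math.log(max(p, eps))
    return -total / n


y = [[1, 0], [0, 1]]
confident = cross_entropy_loss(y, [[0.9, 0.1], [0.2, 0.8]])
undecided = cross_entropy_loss(y, [[0.5, 0.5], [0.5, 0.5]])
print(confident, undecided)
```

Predictions that put a high probability on the correct class give a loss close to
zero, while an undecided two-class prediction gives ln(2).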

2.2.9 Stochastic gradient descent
Stochastic gradient descent (SGD) is a mathematical method for optimizing param-
eters. The goal is to minimize a loss function. The network model’s parameters are
updated in each training iteration following an update rule dictated by SGD. First,
the dataset is shuffled randomly, then the data is passed through the network. The
gradient of the loss function is calculated, and the parameters are updated using:

$$w_k = w_{k-1} - \eta \nabla Q(w_{k-1}), \tag{2.4}$$

where $w$ are the weights, $\eta$ is the learning rate (a constant that scales the
change), and $Q(w)$ is the loss function. The thresholds are updated in the same way.
This is repeated through the dataset and for every parameter [6].
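
The update rule (2.4) can be demonstrated on a one-parameter problem. The sketch
below fits the slope of a line with per-sample updates; the squared-error loss,
the learning rate, and the noise-free data are assumptions made for this
illustration.

```python
import random


def sgd_fit_slope(data, learning_rate=0.05, epochs=50, seed=0):
    """Fit y = w * x with per-sample updates w_k = w_{k-1} - eta * grad Q(w_{k-1}),
    where Q is the squared error of a single sample."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                 # reshuffle the dataset every epoch
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # dQ/dw for Q = (w * x - y)^2
            w -= learning_rate * grad     # update rule (2.4)
    return w


# Hypothetical noise-free data generated from y = 2x.
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]
w_fit = sgd_fit_slope(data)
print(w_fit)
```

The fitted weight approaches the true slope of 2. With a too large learning rate
the updates would overshoot and diverge instead.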

2.3 Containing information in descaled images
A standard format image, such as .jpg or .png, uses pixels to store information,
where the resolution describes the number of pixels used. Descaling such an image
therefore means approximating the same image using fewer pixels. An image of high
resolution contains a lot of information: the greater the resolution, the more
details can be shown and the clearer the image becomes. However, due to hardware
limitations and/or runtime optimization, keeping a low image resolution is often
preferable. Sometimes, information from the most important details can remain even
if the image is scaled down [16]. How far an image can be descaled while still
containing the relevant information depends on the content of the image and the
purpose of the downsizing. In the case of hardware limitations and computing speed
for neural networks, it also depends on the network size and the GPU used [17]. A
common way to determine a suitable resolution is through iterative testing with
different image resolutions.

2.4 Evaluating network models
There are many ways to analyze and evaluate neural network models, and what
results are relevant depends on the problem the model is aimed to solve or ap-
proximate. In machine learning classification models, measures such as accuracy,
precision, and recall are commonly used to help evaluate the quality of the classifi-
cations [18].

Accuracy is a measure of how often a model can predict the correct class or outcome.
It is calculated using the following equation

$$A = \frac{p_c}{p}, \tag{2.5}$$

where $A$ is the accuracy, $p_c$ the number of correct predictions and $p$ the total
number of predictions. In a binary classification problem, i.e., one with only two
classes, one can split the prediction outcomes into four possible categories. If
one class is regarded as positive and the other as negative, these are:

• True Positive (TP), the model correctly classified positive.
• True Negative (TN), the model correctly classified negative.
• False Positive (FP), the model incorrectly classified positive.
• False Negative (FN), the model incorrectly classified negative.

Using this terminology, accuracy can be written as

$$A = \frac{TP + TN}{TP + TN + FP + FN}, \tag{2.6}$$

i.e., $TP + TN = p_c$ and $TP + TN + FP + FN = p$.

Precision measures how reliably the model classifies positives, or in other words,
how often the positive classifications are correct. This measure is calculated
using the following equation

$$P = \frac{TP}{TP + FP}, \tag{2.7}$$

where $P$ is the precision, $TP$ is the number of true positive predictions and $FP$
is the number of false positive predictions.

Recall measures how well the model can classify one class correctly. In other words,
recall measures whether the model finds all instances of this class in a given data
set. Recall $R$ is calculated as [18]:

$$R = \frac{TP}{TP + FN}. \tag{2.8}$$

To manage the trade-off between $P$ and $R$, the F1 score is used as the harmonic mean
of these two metrics, giving a single measure of performance. Balancing the two
measurements is crucial as both FP and FN should be minimized. The F1 score is
calculated as

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}, \tag{2.9}$$

which ensures that the score is high only if both $P$ and $R$ are high, making it a
robust metric for evaluating the effectiveness of our classification models.
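
The four measures can be computed directly from the confusion counts, as in the
sketch below. The counts are hypothetical, and returning 0 when a denominator is
zero is a convention assumed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from equations (2.6)-(2.9)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical confusion counts for a binary gesture / no-gesture classifier.
accuracy, precision, recall, f1 = classification_metrics(tp=8, tn=5, fp=2, fn=1)
print(f"A={accuracy:.3f} P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

For a trunk opening function, a false positive opens the trunk unintentionally,
so precision is typically the measure to watch most closely.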

2.4.1 Network certainty
Network certainty measures how decisive the network model acts on each classi-
fication. If an ANN classification model has m output nodes, where each node
corresponds to a class, and the node with the highest value indicates the predicted
class, the absolute difference between the node values can be used to estimate a
model certainty. In this thesis, the network certainty, Γ, is defined as

$$\Gamma = \frac{k\, n_{\max} - \sum_{k} \lVert n_k \rVert}{k}, \tag{2.10}$$

where $n_k$ are the node values and $k \in \mathbb{Z}^+$, $k \in [1, m]$. For a combined
model that uses several network models, the model certainty is defined as the sum of
the models’ certainties.
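
A possible reading of the certainty measure is sketched below. It assumes that the
factor and the denominator in (2.10) both equal the number of output nodes m, so
that the certainty becomes the largest node value minus the mean absolute node
value. This interpretation and the example outputs are assumptions.

```python
def network_certainty(node_values):
    """Gamma = (m * n_max - sum(|n_k|)) / m, reading the factor k in (2.10) as m."""
    m = len(node_values)
    n_max = max(node_values)
    return (m * n_max - sum(abs(n) for n in node_values)) / m


decisive = network_certainty([0.9, 0.05, 0.05])
undecided = network_certainty([0.4, 0.3, 0.3])
print(decisive, undecided)
```

Under this reading, an output where one node clearly dominates yields a higher
certainty than an output where the node values are close to each other.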

2.5 Ultrasonic sensor systems
Many new cars use ultrasonic sensors (USS) to detect objects in proximity. Ultra-
sonic sensors can measure distances with low power consumption. The sensors send
a sound wave with a frequency above the human hearing range, making them
inaudible. If the sound wave hits an object, it will be reflected, and the sensor
will then listen for the echo to measure the distance to the object. The reflection
amplitude and general direction depend on the object’s material and shape, but
since the sound wave propagates spherically, it is very likely for some sound to
be reflected back regardless of the shape or material. As sound travels at
$v_s \approx 343$ m/s in ground level air [19], the distance to the object can be
calculated using

$$d = \frac{v_s \Delta t}{2}, \tag{2.11}$$

where $\Delta t$ is the difference in time between the emission and detection of the
sound wave. These sensors are commonly used for parking and object detection in both
the front and rear of the car. New car models have several USSs in the rear and
the front, which can all triangulate and listen to each other’s echoes as well as
their own. Therefore, a specific firing sequence is used to map and measure the
objects.
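
As a worked example of equation (2.11), the sketch below converts an echo delay to
a distance. The function and constant names are hypothetical; the example delay
is chosen to match the roughly one-meter gesture distance used later in this
thesis.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value in air at ground level


def echo_distance_m(delta_t_s, v_s=SPEED_OF_SOUND_M_PER_S):
    """Distance d = v_s * delta_t / 2; the factor 2 accounts for the round trip."""
    if delta_t_s < 0:
        raise ValueError("delta_t_s must be non-negative")
    return v_s * delta_t_s / 2.0


# An echo received about 5.83 ms after emission corresponds to roughly 1 m.
print(echo_distance_m(5.83e-3))
```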

Figure 2.3: Illustration of the firing sequence and how neighboring sensors listen
to their own and each other’s echoes. The yellow circles indicate the positions of the
ultrasonic sensors, and the blue indicates where the camera is located.

The sensors in the rear of the vehicles have the following notation:
• ROR - Rear Outer Right
• RIR - Rear Inner Right
• RIL - Rear Inner Left
• ROL - Rear Outer Left

In figure 2.3, RIL fires a signal and listens to its own echo, and the neighboring
sensors, RIR and ROL, listen to the same echo. There are two modes of sensor
firing: a sensor either only listens to neighboring sensors, or emits a signal and
listens to itself. These modes define the firing sequences and are swapped for each
sequence. The sensors that listen to other sensors can distinguish which sensor’s
signal they receive by utilizing small differences in sound signal frequency that
make each signal unique.

Data could be obtained from the following signal ways:
• Direct Signal way - when the receiving sensor detects its own transmitted burst
(RIL-RIL & RIR-RIR).
• Indirect Signal way - when the receiving sensor detects a burst from its neigh-
bor sensor (RIL-ROL).



3
Method

In this chapter, the method is presented together with the investigated gestures.

3.1 Gestures representation

Two distinct gestures were chosen for this project to trigger the trunk actuation:
a ’kick’ gesture and a ’hand’ swipe. The gestures were selected based on their
distinctiveness and ease of detection from both the ultrasonic and the visual sensor
perspective. All gestures were, therefore, conducted at a distance of around one
meter from the trunk.

The ’kick’ gesture is a well-established gesture that is sometimes used in combi-
nation with a radar sensor. It was specifically chosen because users already know
it as a way to trigger trunk actuation, as illustrated in figure 3.1. The other
gesture investigated is the ’hand’ swipe gesture, chosen due to its simplicity and
natural association with symbolizing opening, see figure 3.2.

Figure 3.1: Illustration of the leg ’kick’ gesture. Note that the distance between
the starting position and the vehicle was roughly one meter.


Figure 3.2: Illustration of the ’hand’ swipe gesture.

3.2 Approach and general idea
There is more than one possible solution to the problem formulated in this project.
The method created here was influenced by other gesture recognition projects. The
main idea of the method is to use all the available information, meaning both the
data from the USS and the visual input from the rearview camera on the vehicles,
and to use the combined information from these sensors to create a robust model for
classifying the information. The classification problem becomes more complex because
false positives, meaning that the model classifies an unintended motion as a
gesture, should not activate the actuation. The model needs to know that the gesture
was intended, and, at the same time, it should be able to recognize the same
gestures for all people.

With a combined model that fuses the information from the USS and vision data,
external information and logic from the vehicle can also be utilized. For instance,
the model should first determine if the key to the car is near the vehicle. If it
is, the model should check for nearby objects and the change of object distances
using the USS. The camera system is activated if an object moves close and this
logic is satisfied. Now, the vision model is initiated and uses visual information
of the nearby object to classify whether the given object is human or not. The next
step can be initiated if the object is classified as a human. After this, the vision
model will classify whether a gesture is made. The first time the vision model
classifies a gesture, the USS model classifier is initiated to verify the
classification. Since this model requires temporal input, a time window is created
where the most recent USS measurement replaces the earliest. The outputs of the two
classifiers, vision and USS, should now be combined. This is done using a weight
function that utilizes the network certainty of each model/classifier together with
a parameter α that scales each signal, creating an adjustable bias towards one of
the models. This parameter is tuned by iterative testing. The vision data is used in
combination with the USS data to acquire as much relevant information as possible,
such that this information acts as a fail-safe against an incorrect classification.
In this way, the power consumption is also reduced by using the low-power USS
before activating the vision system.

The problem was split into smaller parts, and information was handled separately
to achieve this logical structure. Three networks were created: a USS-based model,
a static vision human detection model, and a dynamic vision model. These models
should then work together, following the logic presented in figure 3.3.

Figure 3.3: Illustration of the general simplified logic, where all networks are shown
as boxes with an input and output signal. The yellow arrow illustrates the initiation
of the time window used for the USS model. The vision and USS model classification
outputs are weighed using a factor α to determine the total classification output.

3.3 Combined model
The USS and vision models can be used in combination with each other. Using some
external logic to fuse the output classifications, a combined model was created using
both the USS and visual inputs. This logic can be tuned to potentially achieve
improved performance compared to the USS and vision models separately.
The overall logic used in this combined model is illustrated in figure 3.3. In the
figure, there are some external functions and information, such as key detection
and human detection; these functions are already implemented and are, therefore,
assumed to work flawlessly for this project.

When an object is classified as a human, the static vision network is triggered
to classify any gestures. At the same time, the USS model collects data points
until the time window is filled and then classifies the measured distance pattern
over time. This is triggered by the static vision model when it first classifies a
gesture. The USS measurements collected before this instance are used to fill the
time window in the USS model. Each new measurement is then inserted, and the oldest
data point is removed, so that the time window moves forward. The outputs of both
models are weighted by a weight function that takes the network certainty and a
user-adjustable tuning parameter into consideration. In this way, the classifications
are fused and can easily be tuned to compensate for and rule out false positives,
etc. This logic and utilization of the USS and vision models is defined as the
combined model.
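
The moving time window can be sketched with a fixed-length buffer, as below. The
class name is hypothetical and the buffer is only an illustration of the
replace-the-oldest behaviour; the window length of 262 samples is the one used
for the USS model in this thesis.

```python
from collections import deque

WINDOW_LENGTH = 262  # samples per USS time window


class UssTimeWindow:
    """Fixed-length buffer where each new measurement pushes out the oldest one."""

    def __init__(self, length=WINDOW_LENGTH):
        self._buffer = deque(maxlen=length)

    def add(self, distance_mm):
        self._buffer.append(distance_mm)  # deque drops the oldest item when full

    def is_full(self):
        return len(self._buffer) == self._buffer.maxlen

    def values(self):
        return list(self._buffer)


window = UssTimeWindow(length=5)  # short window for illustration
for distance in [1000, 990, 700, 400, 720, 1000, 1010]:
    window.add(distance)
print(window.is_full(), window.values())
```

Once the window is full, its content can be passed to the USS classifier after
every new measurement without copying or re-reading older data.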

To evaluate the model, recorded data was fed to a Python script which simulated
the combined model. Randomly selected USS data and vision data that belong to
the same classification were fed to the model. External factors such as key proximity
and human detection are assumed to always be triggered for these cases.

3.4 Data acquisition

For the network models to be able to identify certain gestures, data has to be col-
lected for all classification tasks involved in the project. In this project, it was
necessary to generate new data for the specific gestures and the sensor setup of the
provided test vehicles. The test vehicles have systems for collecting data from
all instruments, and the collected data was saved on a portable solid-state drive.
The data generated by the USS and the rear-view fish-eye camera were synchronized
in time. Each measurement for both types of sensors was also initiated
simultaneously. Each measurement could be extracted and saved into a folder
containing the separate data for each type of sensor in the predetermined mf4 format.

The measurements were conducted in the following order:

1. Discuss and determine what gesture and motion should be recorded and in
what position.

2. Find an appropriate area to record the measurements, free from obstructions,
to ensure unimpeded movement and accurate gesture capture. The chosen
environment replicates typical parking scenarios encountered in urban settings.

3. Set up logging equipment and designate one team member to operate the
recording equipment from within the vehicle, starting and stopping each ses-
sion and monitoring real-time data stream to the logger, to its hard drive, and
the capture through the rear camera system.

4. One person makes the agreed-upon gesture, communicating with the person
starting/stopping the recordings about when to initiate each measurement.

In total, 231 recordings were acquired. A snippet of the measurement data logbook
is illustrated in figure 3.4. The recordings included three different people making
gestures in different situations, with various backgrounds and weather conditions.
Figure 3.5 illustrates a few frames from one recording.


Figure 3.4: A snippet of the measurement data logbook. This file connects the
data files to the measurements and was used to label the datasets.

Figure 3.5: Illustration of a frame sequence, panels (a)-(f), from a single recording,
depicting an individual standing in an open area without performing any gestures.

3.4.1 Collected USS data
The data from the USS contains measures, such as distance and signal amplitude,
for each sensor’s own and neighboring echoes. The sampling frequency of the USS
is 50 Hz. Each USS recording is 262 samples long, corresponding to a recording of
roughly 5.2 seconds. The measurements are low-pass filtered to remove extreme
points and noise. For the kick gesture, a typical measurement looks like the one
illustrated in figure 3.6. For this gesture, a human would stand roughly
one meter away from the car trunk and make the gesture. In the figure, one can
distinguish seven data points where the measured distance is significantly shorter,
which indicates the kick.

Some measurements were more similar to figure 3.7, where extreme value
measurements are illustrated. Several points of valuable information were lost due
to noise. In some instances, all of the distance readings during the gesture were
lost to noise, which resulted in readings that only indicated the presence of an
object roughly one meter away. At other times, all the important data points were
captured, and a clear motion signature could be detected, which is crucial for the
model in classifying this information. The data illustrated in figures 3.6 and 3.7
is filtered such that the points that are considered noise are removed.

Figure 3.6: Illustration of the general distance measured over a time span around
5 seconds. Note how the detected distance is closer around time step 150. This is
the indication of the kick gesture.

Figure 3.7: In the figure to the left one can see an example of a measurement
series where all points of interest were lost to noise such that no kick gesture could
be distinguished. The right figure shows an example of a clear gesture profile. The
green points are the preprocessed and merged points, as explained in the methods
chapter. The red and blue points are from echo1 in RIL and RIR respectively.


By combining 30 measurements of data points, a refined set of data points can be
obtained, see figure 3.8.

Figure 3.8: Illustration of several (30) measurement sequences put together.

3.5 Preprocessing: Decoding
The initial preprocessing stage consisted of decoding the acquired recordings. The
decoding process involved setting up the required environments and employing de-
coding utilities according to algorithm 1, described in further detail in
appendix A.1.

Algorithm 1 Decoding of acquired USS and Vision recordings.
1: Ensure execution environment:
2:   - Linux environment, using WSL2 with Ubuntu for this project.
3:   - Deploy CUDA extension and set up Singularity container.
4: Extract decoding utilities to local workstation.
5: Run script decode_logg.py (see appendix A.1) to streamline decoding:
6: for all USS recordings in the input directory do
7:   - Recreate output directory structure based on input directory structure.
8:   - Construct Singularity command, employing decoding utilities, for conversion.
9:   - Execute the conversion command.
10: end for
11: for all Vision recordings in the input directory do
12:   - Recreate output directory structure based on input directory structure.
13:   - Construct Python command, employing decoding utilities, for conversion.
14:   - Execute the conversion command.
15: end for


3.6 USS model network

The ultrasonic sensor-based model network was built upon the acquired data and
the merged, preprocessed USS inputs. Since the sensors used in the project only
measure the distance from themselves in one dimension, the idea is to capture a
pattern of distance changes over time. Several approaches were considered, and in
the end, a time window approach was selected, where the time window covers 262
measurements. This size was chosen based on the collected data, as one gesture took
roughly this time to complete. The vehicle that was used to collect and record the
USS data had four rear sensors, where each sensor listened to its own echo and its
neighboring echo. The car’s mainframe manages the triangulation of these echoes to
calculate distances, which restricts direct access to processed data. This
limitation requires the model network to use direct sensory data and excludes the
possibility of data points in more dimensions. Therefore, time windows containing
distance measurements were used to train the network for these kinds of data
structures. The time window for sampling is triggered by an external signal
linked to the car’s security and proximity alert system, allowing for precise data
capture when a gesture is likely to occur. This method enhances model accuracy by
focusing on relevant data periods when a gesture is possible, see figure 3.3.

3.6.1 Preprocessing USS classification data

After the raw data extraction and conversion to the .hdf5 format, it was possible
to extract the direct USS distance data using a Python script, see Appendix A.2.
The rear sensor echo distances are extracted from the file, noise is filtered out,
and time windows are created where the data from each recorded gesture is merged
separately for each time window. Since the recordings vary in length,
recordings longer than 5.2 seconds are cut and shorter ones are padded. A
comma-separated value (csv) file was created with all the measurements of the
same class or gesture. The data was plotted over the given time steps to visualize
the time window. Since a recording is saved as separate files if it exceeds a given
storage size, the Python script combines all such files in each directory, where
each directory contains one measurement sample.
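
The cutting and padding step can be illustrated with a minimal sketch. It is not
the script from Appendix A.2; the function name fit_to_window and the padding
value 0.0 are assumptions made for illustration, while the window size of 262
measurements is taken from the text.

```python
WINDOW_SIZE = 262  # measurements per time window (roughly 5.2 s), from the text


def fit_to_window(distances, window_size=WINDOW_SIZE, pad_value=0.0):
    """Cut a recording that is too long and pad one that is too short.

    pad_value is a hypothetical choice; the thesis does not state which
    value is used for padding.
    """
    clipped = list(distances[:window_size])
    return clipped + [pad_value] * (window_size - len(clipped))


long_recording = [1500.0] * 300
short_recording = [1500.0, 1480.0]
print(len(fit_to_window(long_recording)))   # 262
print(len(fit_to_window(short_recording)))  # 262
```

Padding at the end keeps the start of the gesture aligned with the trigger signal,
which is one plausible design choice among several.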

The noise filtering discussed earlier works by removing the data points that
fall outside the range of 2 mm to 60000 mm. These are the limits of the
hardware, and all points above or below them are considered to be noise. Since
the sensors do not necessarily pick up disturbances or noise at the same time
stamps, the script also checks at each time step whether a neighboring sensor
picked up a non-noise measurement and, if so, adds that value to a new vector
containing the merged values from the sensors. In that way, more information of
interest can be saved in a single measurement vector, which can later be fed to
the neural network. These vectors were then combined by the script and saved as
a csv-file.
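
A minimal sketch of the range filter and the merging of sensor readings is given
below. The range limits are taken from the text. The rule of keeping the first
valid reading per time step and the fill value used when no sensor has a valid
reading are assumptions, as are the function names.

```python
MIN_MM, MAX_MM = 2, 60000  # hardware limits stated in the text


def is_valid(distance_mm):
    """A reading outside the hardware range is treated as noise."""
    return MIN_MM <= distance_mm <= MAX_MM


def merge_sensors(sensor_rows, fill_value=0.0):
    """Merge per-sensor distance lists into one measurement vector.

    sensor_rows holds one list per sensor, all of equal length. For each
    time step the first valid reading is kept (assumed rule); fill_value
    is a hypothetical placeholder when every sensor reports noise.
    """
    merged = []
    for readings in zip(*sensor_rows):
        valid = [r for r in readings if is_valid(r)]
        merged.append(valid[0] if valid else fill_value)
    return merged


sensor_a = [70000, 500, 1]
sensor_b = [800, 600, 0]
print(merge_sensors([sensor_a, sensor_b]))  # [800, 500, 0.0]
```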


3.6.2 Build USS classification model

Based on the complexity of the patterns in the input data, the
network was initiated with three hidden layers and a substantial number of nodes,
as stated in the theory section. A given set of measurements over consecutive
time steps forms an input window for a short time sequence, which acts as the
input to the network. The model uses the differences in detected distance over
the time steps within this window to detect patterns from the performed gesture.
Only the ’kick’ gesture was classified, to narrow down the complexity of the
initial data. The network had two output nodes, corresponding to ’kick’ detected
and no ’kick’ detected. Having two output nodes also made it possible to
integrate a network certainty, as discussed in the theory section 2.4.1.
Furthermore, two outputs are useful for the evaluation of both the network and
the combined model, which will be discussed later. The node with the maximum
output is used as the chosen output.

3.6.3 Training and validation

The USS classification model is a dense NN built from linear layers, with three
hidden layers that use the tanh activation function. The network was built for
two classes, ’kick’ or ’no kick’, with 262 input nodes, one for each time step in
the given time window. See figure 3.9 or A.3 for a detailed description of the
network architecture. The time window corresponds to roughly 5.2 seconds in
recorded time, which was deemed sufficient to capture all the kick data in the
collected samples. A dataloader class was defined where the labeled data is loaded
from the csv files and then extracted by a function. The dataset was then split
into a training set, a validation set, and a test set containing 80 %, 10 %, and
10 % of the data, respectively. A training loop was defined where data from the
training set was loaded together with their respective labels. The Adam optimizer
was used with a learning rate of 0.001, and the mean squared error was used as the
loss function.
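
The setup described above can be sketched in PyTorch as follows. This is a
minimal sketch and not the code from the appendix: the hidden layer sizes (100,
64 and 32 nodes) are those reported in section 4.1, while the class name USSNet,
the one-hot encoding of the targets, and the random stand-in data are assumptions
made for illustration.

```python
import torch
from torch import nn


class USSNet(nn.Module):
    """Dense network: 262 inputs, three tanh hidden layers, two outputs."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(262, 100), nn.Tanh(),
            nn.Linear(100, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, 2), nn.Tanh(),  # tanh on the output as well, per section 4.1
        )

    def forward(self, x):
        return self.layers(x)


model = USSNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# One illustrative training step on random stand-in data (not thesis data).
windows = torch.randn(10, 262)                       # batch of 10 time windows
labels = torch.randint(0, 2, (10,))                  # 0 = 'no kick', 1 = 'kick'
targets = torch.eye(2)[labels]                       # assumed one-hot targets for MSE

optimizer.zero_grad()
loss = loss_fn(model(windows), targets)
loss.backward()
optimizer.step()

# The node with the maximum output is taken as the predicted class.
predicted_class = model(windows).argmax(dim=1)
```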

Each epoch was monitored while running the loop to roughly evaluate the network
model. By observing the trend of the validation accuracy and loss, one can monitor
overfitting and evaluate the model as stated in the theory section. The validation
accuracy was saved for each epoch, and after training, the model was saved.

3.7 Static vision model networks

Several projects in gesture recognition and static image analysis use a
Convolutional Neural Network (CNN) approach, which was also chosen for this
project. After training, a CNN is quite compact and requires few computational
resources compared to alternative models, such as the vision transformer (ViT)
network. For these networks, greyscale imagery was used in this project.


Figure 3.9: Illustration of the USS network architecture.

3.7.1 Preprocessing static vision classification data
From the data collection, the recorded visual data was saved in mf4-format, similar
to the raw data from the USS. However, a different decoding method was used due
to the difference in size and encoding. A script, see Appendix A.5, was created
for saving each frame as a jpg file and downsizing the image resolution by a factor
γ. A lower resolution results in a smaller network, as the input dimension
directly corresponds to the number of pixels in the images. If the colors of the
pixels are included, each pixel contributes three color channel inputs. The
scale factor γ was tuned iteratively: the same network model was trained with a
scaled-down input size and the results were compared with those of the model
using the maximal resolution, until a satisfactory scale factor could be
determined. See 2.3 in the theory section.
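
The effect of the scale factor on the input dimension can be illustrated with a
short calculation. The resolution of 1282x722 pixels and the scaled resolution of
128x72 pixels are those reported in section 4.2; rounding the scaled sides down is
an assumption that reproduces these numbers, and the function name is
hypothetical.

```python
def input_size(width, height, gamma, channels=1):
    """Return scaled width, scaled height, and number of network inputs."""
    scaled_w = int(width * gamma)   # assumed: round down to whole pixels
    scaled_h = int(height * gamma)
    return scaled_w, scaled_h, scaled_w * scaled_h * channels


print(input_size(1282, 722, 1.0))               # (1282, 722, 925604)
print(input_size(1282, 722, 0.1))               # (128, 72, 9216)
print(input_size(1282, 722, 0.1, channels=3))   # (128, 72, 27648)
```

With γ = 0.1 and greyscale input, the number of inputs drops by roughly a factor
of one hundred compared to the full resolution.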

(a) (b)

Figure 3.10: Raw frame extracted from one of the mp4-files to the left and the
down-scaled version of the same frame to the right using the scale factor γ = 0.1.

All images were manually labeled as either the given gesture or no gesture for each
measurement and saved in separate folders for each gesture, counting no gesture as
a class. The dataset was split as equally as possible to balance the data.

3.7.2 Static Vision classification model
Using (2.1), a starting point for the initial linear part of the neural network
was determined. Several different network structures were implemented and tested.
As a baseline, three convolutional layers were used, as this is rather common in
facial expression recognition networks of similar characteristics [10]. The
network size, the pooling layers, and the linear deep neural network are all
parameters that were varied and implemented in several different ways, yielding
different results. Since gestures can contain much information, the network needs
to be able to capture a large amount of feature information. Therefore, the
overall size and the number of channels of the network were set up to capture
this, and several transformations can be applied to the images [20], [21]. A base
channel size of 64 was used and varied slightly between the layers. The static
vision classification network was based on a CNN structure with three
convolutional layers and three linear layers. This allows for the complex data
classification that the given dataset requires, while still being rather compact.

For the linear network part, the layers were also varied in terms of size and
number. As a starting point, one hidden layer was implemented, which connected to
the seven outputs, one for each emotion classification. See figure 3.11 and A.6
for the full architecture.

The CNN model was built using PyTorch. Before the data is fed to the network,
each image is resized to 100x80 pixels and converted to greyscale, so that the
input dimension of the network is reduced. The images are also normalized using
the standard normalization for greyscale imagery. After this, the dataset is split
into a training set, a validation set, and a test set. These are then fed to the
training and validation functions in batches that are selected randomly from each
set. For this work, a batch size of 32 was used. A cross-entropy loss function was
used together with PyTorch’s SGD optimizer. The learning rate was set to 0.005.
Using the preprocessed data, a training and validation function could be defined,
see 3.7.3.

3.7.3 Training and validation
A validation function was defined, where the network model is fed data from the
validation set and its output is compared to the labeled targets. In this function,
the correct classifications are counted and the count is divided by the total
number of validation samples. This is done using the dataloader for the validation
set. Also, for each item in the dataloader, the validation loss is calculated as
the mean squared distance between the network output and the validation targets.
This is done in the same way as for the training set.

Figure 3.11: Illustration of the static vision network architecture.

To combat overfitting, each epoch is monitored while training. When the validation
accuracy has reached a peak and then decreases for several iterations, the network
is assumed to be overfitting. The training loss was calculated using cross-entropy
and monitored in the same way. If the training loss continues to decrease while
the validation accuracy is not increasing, this can also be a sign of overfitting.
For each epoch in which the validation accuracy exceeded a set limit, the accuracy
was recorded, and if it was better than the previous best, the current network was
saved. In this way, further training would not negatively impact the saved
network.
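
The saving rule described above can be sketched without any training code. The
function below only decides at which epochs a checkpoint would be written; the
function name and the limit of 0.9 are hypothetical, since the limit used in the
project is not stated here.

```python
def select_checkpoints(val_accuracies, min_accuracy=0.9):
    """Return the epochs at which the current network would be saved.

    A network is saved only when its validation accuracy is above the set
    limit and better than the best accuracy seen so far.
    """
    best = min_accuracy
    saved_epochs = []
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best:
            best = accuracy
            saved_epochs.append(epoch)
    return saved_epochs


history = [0.50, 0.92, 0.91, 0.95, 0.94]
print(select_checkpoints(history))  # [1, 3]
```

In this example the drop after epoch 3 does not overwrite the saved network,
which is the behavior the paragraph describes.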

3.8 Dynamic vision model network
The dynamic vision model network was developed to model spatio-temporal features,
in contrast to the static vision model network, which only models spatial
features. The literature study in 2.2.4 suggests that the ResNet 3D CNN, and
especially the R(2+1)D classification model, is a suitable choice as a basis for
the dynamic vision model.

Due to time limitations, it was not feasible to develop the R(2+1)D model from
scratch, and it was therefore retrieved from the PyTorch library [22]. The model
is based on the work described in [13]. To meet the project’s classification
requirements, adjustments were made to the model’s architecture as stated in
section 3.8.2. Training the model from scratch on a small dataset, which consisted
of 231 videos of multiple classes, posed a significant risk of overfitting due to
the high number of trainable parameters, approximately 31.5 million. A large
volume of data would be required to minimize this risk [3]. Fortunately, PyTorch
provides models pre-trained on the large benchmark dataset Kinetics-400, which is
widely used for human action recognition [22]. Transfer learning [23] was
therefore possible, with the acquired dataset used for fine-tuning. As the
pre-trained model does not accept videos of arbitrary size and length,
preprocessing was required.

3.8.1 Preprocessing dynamic vision classification data
As stated in section 3.5, the directory structure consists of one folder for each
separate measurement, containing one or several mp4-files depending on the length
of the recording. This required the creation of the script merge_videos.py,
presented in appendix A.9, which finds and merges mp4-files by concatenation. The
fish-eye camera captures clips at 30 frames per second (fps), and through
iterative testing, it became clear that 2 fps was sufficient for the dynamic model
to achieve high accuracy.
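
One simple way to reduce the frame rate from 30 fps to 2 fps is to keep every
fifteenth frame. The sketch below assumes this stride-based approach; how the
project scripts perform the reduction is not stated in this section, and the
function name is hypothetical.

```python
def subsample(frames, source_fps=30, target_fps=2):
    """Keep every (source_fps // target_fps)-th frame of a clip."""
    step = source_fps // target_fps  # 15 for 30 fps -> 2 fps
    return frames[::step]


ten_second_clip = list(range(300))        # frame indices of a 10 s clip at 30 fps
print(len(subsample(ten_second_clip)))    # 20
print(subsample(list(range(45))))         # [0, 15, 30]
```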

Subsequent steps addressed the labeling process. The script video_Labeling.py,
presented in appendix A.10, labels merged video files according to the logbook
entries shown in figure 3.4. It provides three labeling functions: binary class
labeling, multi-class labeling, and extended multi-class labeling. The binary
labeling function assigns the labels ’kick’ - ’no_kick’ or ’hand’ - ’no_hand’. The
multi-class labeling function assigns the three labels ’kick’, ’hand’, and
’no_gesture’, and the extended multi-class labeling function uses combined
gestures and attributes, reaching a total of nine labels, see table 3.1. All three
functions generate a csv-file with video paths and labels and a csv-file with the
label mapping, creating a dataframe.

Table 3.1: Overview of labeling functions and their corresponding labels

Binary class labeling     Multi-class labeling   Extended multi-class labeling
’kick’      ’hand’        ’kick’                 ’kick right leg’
’no_kick’   ’no_hand’     ’hand’                 ’kick left leg’
                          ’no_gesture’           ’kick right leg and 2 bags’
                                                 ’walk pass perpendicular forward and back’
                                                 ’walk approach and depart gull wing’
                                                 ’walk approach and depart straight’
                                                 ’walk approach and depart straight 2 bags’
                                                 ’hand motion right hand’
                                                 ’hand motion left hand’

The remaining preprocessing was performed within the principal script,
dynamicVisionNetwork.py, presented in appendix A.11. The script starts by reading
the video paths, labels, and label mapping from the previously created files. The
video data and labels are split into training, validation, and test sets in two
steps using the train_test_split function from the scikit-learn library. The
initial split separates the data into training and test sets with an 80/20 ratio.
A subsequent split further divides the training data into training and validation
sets, also with an 80/20 ratio. Stratified sampling was employed for all the
datasets to address the imbalance in the distribution of classes. After data
loading, transformation parameters are defined using PyTorch’s transforms.Compose
to ensure that the input data is consistent in size and normalized according to
the requirements. To effectively leverage the learned features of the pre-trained
model, the configuration parameters and preprocessing operations, such as the
frame size, should preferably be consistent with those of the pre-trained model.
These are presented in table 3.2 [24].

Table 3.2: Configuration parameters and preprocessing operations for the pre-
trained R(2+1)D model [24].

Parameter          Configuration
Frame rate         15
Clips per video    5
Clip length        16
Resize size        [128, 171]
Crop size          [112, 112]
RGB mean (x̄)       [0.43216, 0.394666, 0.37645]
RGB std (σ)        [0.22803, 0.22145, 0.216989]
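
The two-step 80/20 split described before table 3.2 leaves roughly 64 % of the
videos for training, 16 % for validation, and 20 % for testing. The arithmetic
can be sketched as below for the 231 collected videos. The sketch assumes that
the held-out share is rounded up to a whole number of samples in each step, which
is how scikit-learn is understood to treat a fractional test_size; the function
name is hypothetical and no actual splitting or stratification is performed.

```python
import math


def two_step_split(n_samples, test_fraction=0.2, val_fraction=0.2):
    """Return (train, validation, test) sizes for two consecutive splits."""
    n_test = math.ceil(test_fraction * n_samples)     # first split: 80/20
    n_train_val = n_samples - n_test
    n_val = math.ceil(val_fraction * n_train_val)     # second split: 80/20
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


print(two_step_split(231))  # (147, 37, 47)
```

Under this assumption, the sizes agree with the training, validation, and test
set totals reported in tables 4.3 and 4.4.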

Video datasets are created using a custom VideoDataset class, presented in
appendix A.11, which handles the loading and processing of video data according to
the defined transformations. In the VideoDataset class, the video frames are read
using the read_video method from PyTorch, which returns the frames in a tensor
format (C, T, H, W), where C stands for the number of channels, which is three in
the case of RGB video, T stands for the temporal dimension, H stands for the
height of each frame in pixels, and W stands for the width of each frame in
pixels. Due to time limitations, the configuration parameters in table 3.2 that
concern length were not considered. Instead, after the change from 30 fps to
2 fps, the longest video is 70 frames long. To ensure that all videos in the
datasets have the same length, shorter videos were padded to 70 frames by the
elementary method of repeating the last frame. Transformations are applied to each
frame using the apply_transform method. This method converts each frame from a
tensor to the Python Imaging Library image format, normalizes it to the range
[0, 1], and applies the predefined transformations.
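
The padding to a common length can be sketched as follows, using plain Python
lists in place of frame tensors. The function name is hypothetical, and the
interpretation that the last frame is repeated until 70 frames are reached is an
assumption based on the description above.

```python
MAX_FRAMES = 70  # length of the longest video at 2 fps, from the text


def pad_with_last_frame(frames, target_length=MAX_FRAMES):
    """Repeat the last frame until the video has target_length frames."""
    if not frames:
        raise ValueError("cannot pad an empty video")
    frames = list(frames[:target_length])
    return frames + [frames[-1]] * (target_length - len(frames))


print(pad_with_last_frame(["f0", "f1"], target_length=4))  # ['f0', 'f1', 'f1', 'f1']
print(len(pad_with_last_frame(list(range(12)))))           # 70
```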

These transformations ensure that the input data is consistent in size and
normalized according to the specified mean and standard deviation values. As the
R(2+1)D model accepts video frames in the batched format (B, C, T, H, W), where B
stands for the number of video samples in a batch, the apply_transform method
finally converts the transformed frames back into a tensor and reorders the
dimensions so that each video has the format (C, T, H, W) [24]. Figure 3.12 shows
how a video frame with the original resolution of 1282 × 722 = 925,604 pixels was
resized to 128 × 171 = 21,888 pixels and then cropped to 112 × 112 = 12,544
pixels. The final cropped image has about 1.36 % of the original number of pixels.


(a) 1282 × 722 (b) 112 × 112

Figure 3.12: Figure 3.12a shows a frame of an mp4-file at the original
resolution, and figure 3.12b shows the down-scaled version of the same frame. The
scale factor is approximately γ = 0.1.

3.8.2 Dynamic Vision classification model

As mentioned earlier, the dynamic vision classification model is based on the
18-layer R(2+1)D model from PyTorch [24], which is in turn based on [13]. PyTorch
provides an R(2+1)D-18 model that has been pre-trained on the Kinetics-400
dataset; the model requires 40.52 GFLOPS and has a file size of 120.3 MB. In
dynamicVisionNetwork.py, presented in appendix A.11, the class DynamicVisionNN
defines a custom neural network model. It is initialized by loading the
pre-trained weights configuration R2Plus1D_18_Weights.DEFAULT into the model
r2plus1d_18. Furthermore, the final fully connected layer of the pre-trained model
was replaced with a new fully connected linear layer sized for the number of
output classes required by the dynamic vision classification model. All parameters
of the pre-trained model were frozen except for those of the newly added fully
connected layer, ensuring that only this layer’s weights would be updated during
training.

Figure 3.13 gives a visual representation of the modules, module hierarchy,
tensor operations, shapes, and tensors involved in the forward pass of the model.
Similarly, figure 3.14 highlights how gradients are computed and how they pass
through the model during backpropagation. The nodes are color-coded and represent
different types of tensors and functions: gray indicates backward functions,
blue indicates reachable tensors requiring gradients, and green indicates the
output tensor. A detailed description of the R(2+1)D architecture is presented in
A.12.


Figure 3.13: Illustration of the dynamic vision network architecture based on [24].
The final fully connected layer is adjusted for binary classification.

Figure 3.14: Overview of the entire model during the backward pass [24].


3.8.3 Training and validation
Building, training, and validating the dynamic vision model was done by a Python
script, see appendix A.11. The necessary components for training the model include
data loaders, the model itself, an optimizer, and a loss function. A training
function was defined, containing a loop that runs for a specified number of
epochs. During each epoch, the training data is processed in batches, and for each
batch, the model performs a forward pass to compute the output. The loss is
calculated using the cross-entropy loss function defined in section 2.2.8.
Backpropagation is then performed, and the Adam optimizer updates the model’s
parameters. The batch size of 4 and the Adam optimizer were chosen based on
recommendations from the paper [13] and on the available memory capacity. The
learning rate was initialized as η = 0.001.

After the training loop for an epoch, the model’s performance was validated. A
validation function was defined and provided with the validation dataset. It
returned the validation loss, F1 score, and accuracy. To detect overfitting, the
validation results were printed at the end of each epoch. When the validation F1
score reached a peak and then decreased for several iterations, it was seen as an
indication of overfitting. Similarly, if the training loss continued to decrease
while the validation accuracy did not improve, this was also seen as a sign of
overfitting. To retain the model with the best generalization performance, the
model with the best validation F1 score was saved.



4
Results

In this chapter, the results for the different models and the combined model are
presented.

4.1 USS model

The USS-based NN model architecture consists of 262 input nodes, which are
connected to a linear hidden layer of 100 nodes. This layer is connected to a
second linear hidden layer with 64 nodes, followed by a third hidden layer with
32 nodes. The third hidden layer is connected to two output nodes, which are used
to indicate the classification. All layers have biases and use tanh as an
activation function.

4.1.1 Model evaluation

The USS model achieved a validation accuracy of 75 percent after 26 epochs of
training with a batch size of 10 samples, while training on a refined dataset
containing only kicks and no kicks. The model’s performance varies noticeably
between randomized training runs, but the loss becomes rather stable using the
stated parameters, see figure 4.1.


[Plot: validation accuracy and validation loss over epochs 0 to 25.]

Figure 4.1: Illustration of the validation accuracy over each epoch using a batch
size of 12 (blue) together with the validation loss scaled up by a factor of three
(orange).

When the model was evaluated using all data points in the set, a classification
accuracy of 70.59 percent was achieved. The model precision was calculated as
72.50 percent and the recall as 76.32 percent. See table 4.1.

Table 4.1: The number of true positives, true negatives, false positives, and
false negatives and their relative mean certainty for the USS model.

      Percentage of data   Quantity   Mean certainty
TP    42.65                29         0.06476634
TN    27.94                19         0.04153539
FP    16.18                11         0.051356044
FN    13.24                9          0.06391618
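
The accuracy, precision, and recall stated above follow directly from the counts
in table 4.1. The sketch below uses the standard definitions of these metrics;
the function name is hypothetical, and the F1 score is included only because it
is used for the vision models later in this chapter.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


uss = classification_metrics(tp=29, tn=19, fp=11, fn=9)
for name, value in uss.items():
    print(f"{name}: {100 * value:.2f} %")
```

For the USS counts this gives an accuracy of 48/68, a precision of 29/40, and a
recall of 29/38, in agreement with the percentages in the text.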

4.2 Static vision models
The static vision model NN architecture consists of three convolutional layers,
three max-pooling layers, and three linear layers. The structure can be seen in
appendix A.6. The rear-end fish-eye camera captured clips at 30 fps with a
resolution of 1282x722 pixels. This resolution was scaled down to 128x72 pixels.


Figure 4.2: Illustration of the number of TPs and TNs together with the FPs
and FNs.

4.2.1 Binary gesture classification
The binary version of the static gesture model network, which classifies the kick
motion described in section 3.1 against no kick, yielded a validation accuracy
of 100 percent over a dataset containing over a thousand data points. The network
was trained for 10 epochs with a batch size of 15. Evaluating the network model
on the test dataset gave an accuracy of 99.906 percent, with a precision of
99.825 percent and a recall of 100 percent. This gives an F1 score of 99.912
percent, see table 4.2. The accuracy and loss trends are illustrated in figure 4.3
and a confusion matrix for the model is shown in figure 4.4.

Table 4.2: The number of true positives, true negatives, false positives, and
false negatives and their relative mean certainty.

      Percentage of data   Quantity   Mean certainty
TP    53.61                572        9.971248
TN    46.29                494        6.7518287
FP    0.094                1          1.3280579
FN    0.000                0          -

[Plot: validation accuracy and validation loss over epochs 1 to 10.]

Figure 4.3: Illustration of the validation accuracy (blue line) over epochs and the
validation loss scaled by a factor of three (orange).

Figure 4.4: Confusion matrix for the binary static vision model.

4.2.2 Multiclass gesture classification
When the static vision network was trained on several gestures, the ’kick’ and
’hand’ gestures explained previously, the model achieved a total validation
accuracy of 100 percent using a batch size of 30 while training for 20 epochs.
Figure 4.5 shows the validation accuracy for each gesture together with the total
validation accuracy and the validation loss. The validation loss is scaled up by a
factor of 30 for visibility.


[Plot: total validation accuracy, validation loss, and the validation accuracy
for ’no gesture’, ’kick’, and ’hand gesture’, in percent, over epochs 0 to 20.]

Figure 4.5: Illustration of the total validation accuracy over all gestures together
with the separate validation accuracies for each gesture and the validation loss.

4.3 Dynamic vision models
The dynamic vision models were trained, validated, and tested on three
distinct classification tasks. The first model was designed for binary
classification. The second handled multi-class classification with three classes.
The third extended the multi-class model to a nine-class classification task. The
networks were trained for 20 epochs with a batch size of four.

4.3.1 Binary gesture classification
The first iteration of the dynamic vision model network, classifying only the
binary ’kick’ - ’no kick’ gesture, yielded a validation accuracy of 100 percent
over a dataset containing 37 videos. The second iteration, classifying only the
binary ’hand’ - ’no hand’ gesture, yielded a validation accuracy of 100 percent
over a dataset containing 47 videos.

4.3.1.1 Collected and preprocessed data

Tables 4.3 and 4.4 present the distribution of the binary classes over each of the
datasets. As mentioned in section 3.8.1, the distribution was achieved
by using train_test_split and stratified sampling. In figure 4.6, it can be
observed that a large number of the videos at 2 fps are 10 to 15 frames long. The
maximum length of the videos in the dataset was 70 frames.


Table 4.3: Number of videos per class for the ’kick’ - ’no kick’ gesture from the
acquired dataset.

Dataset          ’kick’   ’no kick’
Overall set      131      100
Training set     84       63
Validation set   22       15
Test set         25       22

Table 4.4: Number of videos per class for the ’hand’ - ’no hand’ gesture from the
acquired dataset.

Dataset          ’hand’   ’no hand’
Overall set      59       172
Training set     38       109
Validation set   9        28
Test set         12       35

(a) ’kick’ - ’no kick’ gesture (b) ’hand’ - ’no hand’ gesture

Figure 4.6: Illustration of the video lengths, in number of frames, per class in the
case of a binary classification task.

4.3.1.2 Model evaluation

For the binary classification task, the validation accuracies recorded for the
dynamic vision model reached 100 % for both the ’kick’ - ’no kick’ and the
’hand’ - ’no hand’ classification. The model also achieved a perfect score on the
test dataset, as seen in tables 4.7 and 4.8. This is further visualized in the
confusion matrices in figures 4.8a and 4.8b, where the diagonal is 100 %.

[Plots: validation accuracy, validation loss, training loss, and F1 score over
epochs 0 to 20.]

(a) ’kick’ - ’no kick’ gesture metrics (b) ’hand’ - ’no hand’ gesture metrics

Figure 4.7: Illustration of the training and validation loss together with validation
accuracy and F1 score over 20 epochs


Table 4.5: The mean certainty and count for true positives, true negatives, false
positives, and false negatives for the ’kick’ - ’no kick’ gesture.

      Percentage of data   Quantity   Mean certainty
TP    57.45                27         6.260
TN    42.55                20         3.776
FP    0.0                  0          -
FN    0.0                  0          -

Table 4.6: The mean certainty and count for true positives, true negatives, false
positives, and false negatives for the ’hand’ - ’no hand’ gesture.

      Percentage of data   Quantity   Mean certainty
TP    25.53                12         5.808
TN    74.46                35         5.340
FP    0.0                  0          -
FN    0.0                  0          -

Table 4.7: Test metrics of dynamic vision model for binary classification
of ’kick’ - ’no kick’ gesture.

           Loss    Accuracy   F1
Test set   0.025   1          1

Table 4.8: Test metrics of dynamic vision model for binary classification
of ’hand’ - ’no hand’ gesture.

           Loss    Accuracy   F1
Test set   0.034   1          1

(a) ’kick’ - ’no kick’ gesture matrix (b) ’hand’ - ’no hand’ gesture matrix

Figure 4.8: Illustration of the confusion matrix from the test evaluation

4.3.2 Multi-class gesture classification
The multi-class gesture classification setup of the dynamic vision model
classifies, as previously mentioned, three distinct classes: ’kick’, ’hand’, and
’no gesture’. The model again yielded a validation accuracy of 100 percent over a
dataset containing 30 videos. Table 4.9 presents the distribution of the three
classes over each of the datasets.

Table 4.9: This table shows the number of videos per class in each of the datasets

Dataset          ’kick’   ’hand’   ’no gesture’
Overall set      131      59       41
Training set     83       38       26
Validation set   21       9        7
Test set         27       12       8

Figure 4.9: Illustration of the video lengths, in number of frames, per class in the
case of the multi-class classification task.

4.3.2.1 Model evaluation

The recorded validation accuracy reached 100 % for all three classes, ’kick’,
’hand’, and ’no gesture’. The model achieved a perfect score on the test dataset,
as seen in table 4.11. This is further visualized in the confusion matrix in
figure 4.11, where the diagonal is 100 %.


[Plot: validation accuracy, validation loss, training loss, and F1 score over
epochs 0 to 20.]

Figure 4.10: Illustration of the training and validation loss together with validation
accuracy and F1 score over 20 epochs

Table 4.10: The mean certainty and count for true positives, true negatives, false
positives, and false negatives.

      Percentage of data   Quantity   Mean certainty
TP    77.14                27         6.060
TN    22.86                8          6.328
FP    0.0                  0          -
FN    0.0                  0          -

Table 4.11: Test metrics of dynamic vision model for multi-classification.

           Loss    Accuracy   F1
Test set   0.019   1          1


Figure 4.11: Illustration of the confusion matrix from the test evaluation

4.3.3 Extended multi-class gesture classification

The extended multi-class version includes several gestures, as presented in table
4.12, which shows the distribution of the nine classes over each of the datasets.
As mentioned in section 3.8.1, the distribution was achieved by using
train_test_split and stratified sampling. In figure 4.12, it can be observed that
the ’kick’ and ’hand’ classes have a higher concentration of short videos, in the
range of 10 to 20 frames, while the ’walk’ classes have a wider range of video
lengths. The maximum length of the videos at 2 fps in the datasets is 70 frames.


Table 4.12: This table shows the number of videos per class in each of the datasets
for extended multi-class classification.

Class                                        Overall set  Training set  Validation set  Test set
’kick right leg’                             62           35            15              12
’kick left leg’                              50           34            7               10
’kick right leg and 2 bags’                  38           15            1               3
’walk pass perpendicular forward and back’   13           8             3               2
’walk approach and depart gull wing’         12           8             1               3
’walk approach and depart straight’          10           6             2               2
’walk approach and depart straight 2 bags’   4            2             1               1
’hand motion right hand’                     38           25            7               6
’hand motion left hand’                      23           13            3               7

Figure 4.12: Illustration of the video lengths, in number of frames, per class in
the case of the extended multi-class classification task.


4.3.3.1 Model evaluation

The dynamic model with extended multi-class classification achieved an accuracy of
85 %, as shown in table 4.14. This can be further visualized in the confusion
matrix in figure 4.14, as the diagonal is approximately 100 %.

[Plot: validation accuracy, validation loss, training loss, and F1 score over
epochs 0 to 20.]

Figure 4.13: Illustration of the training and validation loss together with validation
accuracy and F1 score over epochs

Table 4.13: The mean certainty and count for true positives, true negatives, false
positives, and false negatives.

      Percentage of data   Quantity   Mean certainty
TP    33.33                7          2.886
TN    47.62                10         3.590
FP    4.76                 1          1.013
FN    14.28                3          1.953

Table 4.14: Test metrics of dynamic vision model for extended multi-classification.

           Loss    Accuracy   F1
Test set   0.511   0.851      0.896


Figure 4.14: Illustration of the confusion matrix from the test evaluation

4.4 Combined model

The data fed into the combined model consisted of all gesture data of the kick motion
and a similar amount of data containing no kick. This applies to the inputs of both
models, i.e., image data and USS data.

4.4.1 Model evaluation

The network models for static vision and USS were combined and fed randomly
selected samples of gesture data and data with no gesture. The results in table 4.15
were obtained after 2000 iterations, a number chosen so that most of the unique data
in the respective datasets would likely be sampled.
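
The reasoning behind the number of iterations can be illustrated with a simple
coverage estimate. The sketch below assumes that samples are drawn uniformly at
random with replacement, which is an assumption and not a confirmed detail of the
evaluation. Under that assumption, the expected fraction of unique samples seen
after k draws from N samples is 1 - (1 - 1/N)^k. The function name and the dataset
sizes are hypothetical; the 1000 draws per class mirror the quantities in table 4.15.

```python
def expected_coverage(n_unique: int, n_draws: int) -> float:
    """Expected fraction of unique samples seen after n_draws uniform
    draws with replacement from n_unique samples."""
    if n_unique <= 0:
        raise ValueError("n_unique must be positive")
    return 1.0 - (1.0 - 1.0 / n_unique) ** n_draws


# Hypothetical dataset sizes, chosen only for illustration.
for n_unique in (100, 500, 1000):
    fraction = expected_coverage(n_unique, n_draws=1000)
    print(f"N = {n_unique:5d}: expected coverage {fraction:.3f}")
```

The estimate shows that coverage depends strongly on the ratio between the number
of draws and the dataset size, so the iteration count should be chosen with the
actual dataset sizes in mind.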


Table 4.15: This table shows measures of the combined network model evaluation.
Note that the network certainty is defined in a different way for the combined model.

     Percentage of data   Quantity   Mean certainty
TP   50                   1000       11.279448
TN   50                   1000       11.277735
FP   0.000                0          -
FN   0.000                0          -

The combined model’s accuracy was 1.0. This can be compared to the accuracies of
the individual models: 1.0 for the static vision network model and 0.773 for the
USS network model.
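
The weighting of the two classification outputs by a factor α can be sketched as a
convex combination. This is a minimal illustration, assuming that both models
output a kick probability in [0, 1] and that a decision threshold of 0.5 is used.
The function names, the form α · p_vision + (1 - α) · p_USS, and all numeric
values are assumptions made for the example, not the implementation used in this
work.

```python
def combined_kick_score(p_vision: float, p_uss: float, alpha: float) -> float:
    """Weigh the vision and USS kick probabilities with a factor alpha.

    alpha = 1 trusts only the vision model, alpha = 0 only the USS model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_vision + (1.0 - alpha) * p_uss


def classify_kick(p_vision: float, p_uss: float, alpha: float = 0.7,
                  threshold: float = 0.5) -> bool:
    """Return True if the weighted score reaches the (assumed) threshold."""
    return combined_kick_score(p_vision, p_uss, alpha) >= threshold


# Hypothetical outputs: the vision model is confident, the USS model is not.
print(classify_kick(p_vision=0.95, p_uss=0.40))
```

Since the static vision model reached a higher accuracy than the USS model, a
value of α above 0.5 would give more weight to the more reliable model.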

4.4.2 Combined model using dynamic vision model
Tables 4.16 and 4.17 present the results of processing the full dataset with the
binary dynamic vision model, which classifies ’kick’ versus ’no kick’ gestures.

Table 4.16: Test metrics for dynamic vision model for multi-classification.

           Loss      Accuracy   F1
Test set   0.03134   0.995      0.995

Table 4.17: The mean certainty and count for true positives, true negatives, false
positives, and false negatives.

     Percentage of data   Quantity   Mean certainty
TP   56.71                131        5.971
TN   42.86                99         3.938
FP   0.43                 1          2.059
FN   0                    0          -
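
The relation between the counts in table 4.17 and the usual summary metrics can be
sketched as follows. The function name is hypothetical, and the F1 score is
computed for the positive (’kick’) class only. How the reported metrics in table
4.16 were averaged is not stated, so values computed this way may differ slightly
from the reported ones.

```python
def summarize(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive percentages and summary metrics from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    counts = (("TP", tp), ("TN", tn), ("FP", fp), ("FN", fn))
    return {
        "percent": {name: 100 * count / total for name, count in counts},
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from table 4.17.
result = summarize(tp=131, tn=99, fp=1, fn=0)
for name, value in result["percent"].items():
    print(f"{name}: {value:.2f} %")
print(f"accuracy = {result['accuracy']:.4f}, F1 = {result['f1']:.4f}")
```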



5 Discussion

This chapter discusses the results and potential error sources, and compares the
tested models with the current radar-based systems.

5.1 Non-neural-network-based approach
It is possible to use classical methods, such as the radar-based system, to trigger
the requested actuation using information from other sources, but this approach has
drawbacks. Without a neural network to classify the movement signatures of intended
gestures, it could be hard to determine whether a human made a gesture intending to
open the trunk. Perhaps someone is walking by, or an object, for instance an animal
or a plastic bag, moves close to the vehicle. It would not be acceptable if such
scenarios triggered the actuation, since an unintended opening would be unpleasant
or even dangerous for customers. More accurate real-time analysis with classical
methods would also require significant computational resources compared to the ANN
solution. Therefore, more information is needed to avoid false positives. One option
is to use the key position as an indicator of where the driver is. If the person
holding the key is standing behind the car and the classical method determines that
the trunk should open, the two pieces of information could be matched to trigger an
intended actuation more reliably. However, this also means that only the person
carrying the key can use the feature, which removes the possibility of, for
instance, a family member using it without the key. Perhaps it would be enough to
use only the measured ultrasonic distance, but false positives could still occur,
for example if the driver walks close by the vehicle with the key. This method would
then classify the movement as a gesture even though it is not. There are workarounds
for this as well, for instance requiring the driver to stand in the right position
for a given time span before the trunk opens. However, this might be impractical.
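
The workaround of requiring the key holder to stand in position for a given time
span can be sketched as a simple rule. This is only an illustration of the idea
discussed above: the data structure, the function name, the distance threshold of
0.8 m, and the dwell time of 2 s are all hypothetical values and not taken from
any existing system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    time_s: float          # timestamp of the measurement in seconds
    distance_m: float      # ultrasonic distance to the nearest object in metres
    key_behind_car: bool   # whether the key is located behind the vehicle


def should_open_trunk(samples: Sequence[Sample],
                      max_distance_m: float = 0.8,
                      min_dwell_s: float = 2.0) -> bool:
    """Open only if the key holder stays close behind the car long enough."""
    dwell_start = None
    for sample in samples:
        in_position = (sample.key_behind_car
                       and sample.distance_m <= max_distance_m)
        if not in_position:
            dwell_start = None  # condition broken, restart the timer
            continue
        if dwell_start is None:
            dwell_start = sample.time_s
        if sample.time_s - dwell_start >= min_dwell_s:
            return True
    return False


# Hypothetical traces: walking past the car versus standing behind it.
walk_by = [Sample(0.0, 0.5, True), Sample(0.5, 0.6, True), Sample(1.0, 1.5, True)]
standing = [Sample(0.5 * i, 0.5, True) for i in range(6)]
print(should_open_trunk(walk_by), should_open_trunk(standing))
```

A longer dwell time reduces false activations from a driver walking past, at the
cost of a slower and less convenient interaction, which is the trade-off noted
above.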

5.2 USS model and data
The data obtained from the ultrasonic sensors captured distance in only one
dimension, making a human gesture pattern hard to distinguish from that of an
object over time. As seen in figures 3.6 and 3.7, the gestures have no clear, unique
pattern. Due to measurement noise and errors, not all sampled measurement points
in the data collection gave reasonable results, and the data density in time after the
noise filtration was ins