Using Machine Learning to Estimate Road-user Kinematics from Video Data

Luhan Fang, luhan@student.chalmers.se
Oskar Malm, omalm@student.chalmers.se
Yahui Wu, yahuiw@student.chalmers.se
Tianshuo Xiao, tianshuo@student.chalmers.se
Minxiang Zhao, minxiang@student.chalmers.se

December 2023

Abstract

Each year, more than one million people sustain fatal injuries in traffic-related crashes, and vulnerable road users (VRUs) are involved in more than half of them. In the context of road safety, VRUs are mainly pedestrians, cyclists, motorcyclists, and e-scooterists. To mitigate these crashes, it is essential to understand their causation mechanisms. Naturalistic data is widely recognised as a valuable tool for understanding road-user behaviour and addressing safety concerns in traffic safety research. Traditionally, critical events are identified from naturalistic data using kinematic information from sensors onboard the vehicles; the video footage of the corresponding trip is then used to validate and annotate the events. While this method reliably identifies crashes, near-crashes may not produce any anomalies in the sensor readings and therefore go unidentified. Such near-crash events are visible in the video footage, but identifying them by manually watching videos is not feasible because the volume of video is far too large for human review. Developing tools that can identify road users and estimate their positions from video footage is therefore essential and will enable the automation of critical-event identification. This report describes such models and also discusses how machine learning could, in the future, be used to assess the severity of imminent critical interactions among road users.
This project investigates how models can be developed to estimate and predict the position and kinematics of various road users from video data recorded by a camera mounted on an e-scooter. The initial generation of bounding boxes and categories for road users used the You Only Look Once (YOLOv7) algorithm. Cyclist detection was achieved by a simple rule-based model that calculates the overlap area between a pedestrian and a bicycle detected by YOLOv7. The e-scooterist detection model was implemented by combining the YOLOv7 and MobileNetV2 models. Separate machine learning models were trained to estimate the distance to each of the four road-user types (pedestrians, cyclists, e-scooterists, and cars), using LiDAR and GPS data as the position ground truth. The input to these models was derived from bounding-box data extracted from the videos. Furthermore, a DBSCAN-based noise remover was applied to the output of the distance estimation model to filter out points with excessive errors. Finally, a Rauch-Tung-Striebel smoother was applied to the output of the noise remover to improve the distance estimation accuracy and to generate both the relative position and the velocity of the target road user. It was concluded that the object detection model could achieve an accuracy above 90%, and that, for all road users, distance estimation was most accurate when using polar coordinates, compared with the Cartesian system. The highest R² score for distance estimation was obtained with a k-nearest neighbors regression model (with n = 2) using, as input, the pixel coordinates of the center of the bounding box's bottom edge together with the height and width of the bounding box. With this setup, the e-scooterist model achieved an R² score of 0.978, while the cyclist and car models attained commendable scores of 0.92 and 0.96 respectively. This means that the distances predicted by the models are highly accurate.
These models can now be used to detect critical interactions among road users in the naturalistic data collected with e-scooters.

Acknowledgements

Throughout the project, we have been fortunate to receive assistance from our supervisors, Marco Dozza, Rahul Rajendra Pai, and Alexander Rasch, who helped us greatly through weekly meetings, fast responses over email, and guidance in the right direction to finish the project on time. We also want to express our gratitude to Karan Bharti, whose pedestrian position estimation model gave us a starting point from which to continue our work. Special thanks to Veoneer (now Magna) for letting us collect data on their test track with their vehicle, and to the e-SAFER and MicroVision projects, which financed the data collection. Finally, we extend our thanks to Kumar Apurv, Renran Tian, and Rini Sherony for their previous studies on e-scooterist detection, which provided us with a dataset of e-scooterists and a framework for the detection model.

Contents

1 Introduction 6
  1.1 Background 6
  1.2 Literature review 6
  1.3 Aims and objectives 7
2 Theory 9
  2.1 Coordinate systems 9
    2.1.1 Polar coordinate system 9
    2.1.2 Rotation of coordinate systems 9
  2.2 YOLOv7 10
  2.3 MobileNetV2 10
  2.4 Machine learning models 11
    2.4.1 Linear Regression 11
    2.4.2 Decision Tree Regressor 11
    2.4.3 K-nearest neighbors 11
    2.4.4 Random Forest Regressor 11
    2.4.5 MLP Regressor 11
  2.5 DBSCAN 11
  2.6 Rauch–Tung–Striebel smoother 12
3 Methodology 14
  3.1 Data collection 14
  3.2 Object detection and classification 16
    3.2.1 YOLO detection 16
    3.2.2 E-scooter rider detection 16
    3.2.3 Cyclist detection 18
  3.3 Object position extraction 21
    3.3.1 Position from GPS 21
    3.3.2 Position from LiDAR 24
  3.4 Training data 25
    3.4.1 Pre-processed ground truth data 26
    3.4.2 Extracting bounding box information from camera frames 26
    3.4.3 Synchronizing LiDAR/GPS and camera data 26
  3.5 Data post-processing 27
    3.5.1 Trajectory extraction 27
    3.5.2 Outlier removal 27
    3.5.3 RTS smoothing 28
  3.6 Evaluation methods 29
    3.6.1 MSE 29
    3.6.2 MAE 30
    3.6.3 R-squared 30
4 Results 31
  4.1 Object detection model 31
    4.1.1 Object detection model for e-scooterists 31
    4.1.2 Object detection model for cyclists 32
  4.2 Position estimation model 32
    4.2.1 Position estimation model for cyclists 33
    4.2.2 Position estimation model for e-scooter riders 39
    4.2.3 Position estimation model for cars 45
5 Discussion 53
  5.1 Discussion on the overall result 53
  5.2 Training dataset for YOLOv7 53
  5.3 Detection accuracy of PCM & CDM 53
  5.4 Limitations of the test data for distance estimation 54
  5.5 Calculation of the closest point of the car 54
  5.6 E-scooter orientation 54
  5.7 Object detection at extreme angles and distances 55
  5.8 Critical event detection 56
6 Conclusion 57
7 Future work 58
References 59

1 Introduction

1.1 Background

Every year, around 1.3 million people are killed globally in traffic-related crashes, and between 20 and 50 million people sustain non-fatal injuries, according to the World Health Organization (WHO) [1]. VRUs are involved in more than half of the annual fatal crashes, which means that the risk of a fatal crash is particularly high for this group. A VRU is usually classified as someone who is not inside a vehicle; this group includes, for example, cyclists, e-scooterists, and pedestrians [2].
In recent years, the number of e-scooters in our cities has increased as a form of sustainable transport. People who use e-scooters as a form of transportation are VRUs with a high risk of injury, and it is essential to understand the interactions between them and other road users. Since this is a fairly new mode of transport, little research has been done on how e-scooterists interact with other road users. To reduce the involvement of e-scooterists and other road users in crashes, it is important to learn the mechanisms by which a crash can happen. Naturalistic data is collected in real traffic; it captures genuine driver behavior and how the driver interacts with other road users before a crash happens. It is therefore a good tool for understanding crash mechanisms and identifying critical events. In the traditional active safety area, critical events are often identified by analyzing kinematic information based on signals from sensors onboard vehicles. However, the sensor signals are insufficient for near-crash identification, since the signals do not show any anomalies in near-crashes. Video footage from an onboard camera can easily capture near-crash events, but manually watching those videos to identify critical events is time-consuming, and kinematic information cannot be derived directly from the video footage. Vision-based methods therefore need to be developed that estimate the position of different road users and enable automation of the process of identifying critical events. By using artificial intelligence and machine learning, this process becomes not only automatic, saving time, but also objective, based on a predetermined threshold for what constitutes a critical situation.

1.2 Literature review

Surveying the relevant literature, we come across various methods for e-scooterist identification and distance estimation.
Apurv et al. [3][4] introduced a vision-based system for detecting e-scooter riders in natural scenes, employing YOLOv3 and MobileNetV2. The system achieves commendable recall and accuracy, and the methodology's strength lies in its efficient pipeline. However, the absence of an existing dataset poses a limitation.

Zhu [5] proposes a learning-based model for object-specific distance estimation. Features in the image are first extracted by convolutional neural networks, and the distance is then predicted along with the bounding box of the identified object. The strengths include addressing challenges with traditional methods and introducing extended datasets. However, the effectiveness of the model is demonstrated primarily for objects on curved roads.

The report by Bharti [6] delves into a machine learning approach for automating video data reduction when analyzing road-user interactions. Leveraging LiDAR and camera data, the study aims to identify the positions of pedestrians with respect to the e-scooter in order to identify critical events. The scope of that work is limited to the identification of pedestrian positions and excludes e-scooter riders and cyclists. The methodology's robustness lies in its multidimensional data utilization.

The article by Davydov [7] focuses on supervised object-specific distance estimation from monocular images for autonomous driving. The proposed lightweight convolutional deep learning model outperforms existing methods. Its strengths include accurate distance estimation and relevance to advanced driver assistance systems, although challenges may arise in various real-world scenarios.

The papers and reports above address a crucial aspect of road safety by detecting e-scooterists and estimating object-specific distances.
As crashes involving VRUs continue to cause injuries, understanding the intricate interactions between emerging sustainable transport modes and other road users becomes imperative. To understand why critical events occur, it is crucial to identify these interactions. Traditional sensor readings may capture crashes, but numerous near-crash scenarios, discernible only in video, remain unexplored due to resource-intensive manual analysis. Using AI and machine learning, these papers propose algorithms that efficiently identify and assess interactions, pinpointing potential threats and critical scenarios based on predefined thresholds.

However, the limitation of the current methods lies in their exclusive reliance on computer vision for VRU identification, without integrating LiDAR and camera data for distance prediction. While traditional camera calibration provides a robust theoretical foundation, it involves manual processes and may face challenges in dynamic environments. In contrast, machine learning automates the process and adapts to diverse conditions. In this project, we use LiDAR data, which offers accurate ground truth for model training. This approach proves superior to traditional camera calibration methods, enhancing accuracy and adaptability.

1.3 Aims and objectives

The overall objective of this project is to automatically estimate kinematic information of road users from video data collected from an e-scooter. By developing methods to analyze video data collected in the real world by e-scooterists, we can objectively investigate rider behavior and interactions. The data and knowledge could then be used to develop new active safety systems or to make riders aware of high-risk scenarios that can occur during riding. To achieve this goal, a number of tasks were completed:

1. Object detection and classification from video data. This was done to ensure that different road users are detected and classified correctly and reliably.
The model must also be able to differentiate pedestrians, cyclists, and e-scooterists, all of whom a traditional computer vision algorithm would identify as pedestrians.

2. Processing LiDAR and GPS data, collected in a controlled environment, to acquire accurate distances and angles. This data was used to establish the ground truth for training and testing the machine learning algorithms that determine distances from the video data. Finally, the data from the different sensors was synchronized, due to their different activation times and sample rates, to form the training data.

3. Training the machine learning models on the output of the object detection and classification, which generated bounding boxes for each road user: cyclist, e-scooterist, and car. The training data for the models is the combination of the data from object detection and the distance data from the LiDAR and GPS (ground truth).

4. Validation of the models, done by applying the models to the validation data and comparing predicted values to the ground truth.

5. Filtering the predictions made by the models to ensure that the final predictions are less affected by noise.

6. Testing on a sample of the naturalistic data to determine the performance in real-world conditions.

2 Theory

This chapter presents and explains the different theoretical aspects used during this project.

2.1 Coordinate systems

This subsection explains the coordinate systems and the transformations used during the work.

2.1.1 Polar coordinate system

Coordinate systems are used to represent the location of a point or object in space. In this project, they were used to calculate the radial distances and angles to objects detected by the camera; this information was then related to pixels in the recorded video data.
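The conversions described in the remainder of this subsection can be sketched in numpy as follows (a minimal sketch; the function names are ours, not from the report):

```python
import numpy as np

def cart2pol(x, y):
    """Cartesian (x, y) -> polar (r, theta), with theta measured from the x-axis."""
    return np.hypot(x, y), np.arctan2(y, x)

def pol2cart(r, theta):
    """Polar (r, theta) -> Cartesian (x, y)."""
    return r * np.cos(theta), r * np.sin(theta)

def rotate(x, y, theta):
    """Rotate a point counter-clockwise by theta (radians) about the origin."""
    return (x * np.cos(theta) - y * np.sin(theta),
            x * np.sin(theta) + y * np.cos(theta))
```

Note that `np.arctan2` handles all four quadrants, which a plain `atan(y/x)` does not.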
The polar coordinate system represents a point in space with a radial distance (r) and an angle (θ) to the x-axis. To transform a point (x, y) in the Cartesian coordinate system to a radius and angle, the following formulas are used:

    r = √(x² + y²)  and  θ = atan(y/x)    (1)

To transform from polar coordinates to Cartesian coordinates, the following formulas are used:

    x = r · cos(θ)  and  y = r · sin(θ)    (2)

2.1.2 Rotation of coordinate systems

Since different objects are rotated relative to each other, it is useful to rotate one of the coordinate systems so that they become aligned, which simplifies calculations. This can be done by applying the rotation matrix, which has the following form in two dimensions:

    [ cos(θ)  −sin(θ) ]
    [ sin(θ)   cos(θ) ]    (3)

It is then used to calculate the rotated coordinates (x′ and y′) in the following way:

    [ x′ ]   [ cos(θ)  −sin(θ) ] [ x ]
    [ y′ ] = [ sin(θ)   cos(θ) ] [ y ]    (4)

where x′ and y′ are rotated by the angle θ. A positive angle rotates the system counter-clockwise (the positive mathematical direction), and a negative angle rotates it clockwise.

2.2 YOLOv7

You Only Look Once (YOLOv7) [8] is an object detection model based on a backbone-head pattern. The backbone is composed of several convolutional layers and max-pooling layers that extract features at different levels from the input image. It also contains a network structure called ELAN that controls the gradient flow to extract more features and improve robustness. The head is responsible for further feature extraction, classification, and detection. It uses re-parameterized convolutional layers (RepConv in the configuration file) to perform detection at the low, middle, and high levels, covering targets of interest from small objects to large objects and from simple to complicated detections.
YOLOv7 can be trained on various data, and once training is finished, a pre-trained YOLOv7 model with its weight and configuration files is generated. Other users can then run detection with these weights directly, or retrain the model from them in less time than training from scratch.

2.3 MobileNetV2

MobileNet [9] is a lightweight convolutional neural network (CNN) that features depthwise separable convolution. This feature drastically reduces computational cost while minimizing the loss in accuracy. Compared with the previous version, MobileNetV2 [10] adds an inverted residual structure with a linear bottleneck. The linear bottleneck addresses the information loss that the ReLU activation function causes in low-dimensional representations, while the inverted residual structure reduces memory usage. As shown in Figure 1, the input to the network has size (160, 160, 3), scaled to [-1, 1]. The output is a (5, 5, 1280) tensor. After adding a dense layer, an output value between 0 and 1 is obtained, where 0 corresponds to a pedestrian and 1 to an e-scooter rider.

Figure 1: MobileNetV2 architecture for the binary classification task [3].

2.4 Machine learning models

2.4.1 Linear Regression

Linear regression is a fundamental regression method that attempts to establish a linear relationship between input features and output labels while minimizing the prediction error. It fits a straight line by minimizing the squared differences between predicted and actual values, using the method of least squares. Linear regression is suitable when there is a linear relationship between the features and the target [11].

2.4.2 Decision Tree Regressor

Decision trees are tree-like structures where each node represents a test on a feature, each branch represents the result of the test, and each leaf node stores a target value.
The model recursively partitions the data into different regions, with the target value in each region represented by the average of the samples in that region. Decision trees are suitable for capturing non-linear relationships, but they can be prone to overfitting [12].

2.4.3 K-nearest neighbors

The K-nearest neighbors algorithm predicts by finding the K closest neighbors of a new sample and averaging (or weighted-averaging) their target values. It determines the most similar samples based on distance in the feature space. K-nearest neighbors is suitable for data with local patterns and performs well when there is a clear local structure [13].

2.4.4 Random Forest Regressor

A random forest is an ensemble learning method composed of multiple decision trees. It reduces overfitting by averaging the predictions of the individual trees. Random forests introduce randomness during tree construction, such as feature subset sampling, to increase model diversity. They perform well on large datasets with many features, effectively modeling complex relationships [14].

2.4.5 MLP Regressor

A multi-layer perceptron (MLP) is an artificial neural network with multiple layers (input, hidden, output). In regression tasks, the output layer typically has a single node. It learns nonlinear relationships in the data through forward and backward propagation. MLPs are suitable for complex non-linear relationships, demonstrating strong expressive power on large-scale datasets and highly nonlinear problems [15].

2.5 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [16] is a clustering algorithm that groups data points based on their density in the feature space. It identifies core points as those whose neighborhood of a specified radius contains at least a minimum number of points. Clusters are formed by connecting core points and their reachable neighbors, while points outside such clusters are classified as noise.
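As a sketch of how DBSCAN can act as an outlier filter on a distance-over-time trace (the synthetic data and the parameter values here are ours, chosen for illustration, not taken from the report):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic distance-over-time trace: a smooth approach plus two gross outliers.
t = np.arange(50, dtype=float)
d = 20.0 - 0.2 * t
d[10] += 15.0   # injected outlier
d[30] += 15.0   # injected outlier

# Cluster in (time, distance) space; DBSCAN labels noise points as -1.
labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(np.column_stack([t, d]))
inliers = labels != -1
t_clean, d_clean = t[inliers], d[inliers]
```

Here the smooth trace forms one dense cluster, while the two injected points fall outside every eps-neighborhood and are dropped as noise.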
This method effectively captures dense regions in the data, enabling the discovery of arbitrarily shaped clusters and the filtering out of noise points, which makes it robust for various spatial clustering applications.

In Figure 2, consider a randomly selected point, A, in the sample dataset. Begin by defining a radius and a minimum number of sample points required within a circular region centered on A. If the circle encompasses enough sample points, such as B, C, M, and N, the center of the circle is shifted to these internal points. This process of drawing circles is repeated until the number of enclosed sample points becomes less than the specified minimum number of points (minPts).

Figure 2: Principle of the DBSCAN algorithm [17].

DBSCAN groups data points into clusters, discerning regions of high density while classifying outliers as noise. In the context of filtering, this translates to the removal of data points that deviate from expected patterns or exhibit anomalous behavior. By leveraging DBSCAN's ability to form clusters based on data density, the algorithm enhances data quality by filtering out noise, ensuring that subsequent analyses and modeling efforts are conducted on a more accurate and reliable dataset.

2.6 Rauch–Tung–Striebel smoother

The Rauch–Tung–Striebel (RTS) smoother [18] is a mathematical method used to improve the accuracy of state estimation by incorporating both past and future measurements. The RTS smoother operates on a state-space model, which consists of a state transition equation (describing how the system evolves) and a measurement equation (relating the system's state to the observed measurements). The system state is first estimated by a Kalman filter, which recursively processes incoming measurements to estimate the current state of the system. Then, the RTS smoother performs a backward pass through the state data.
The backward pass revisits the estimates produced by the Kalman filter and refines them by incorporating information from future measurements.

Several parameters need to be specified before applying the RTS smoother: the transition matrix, the observation matrix, the transition covariance, the observation covariance, the initial state mean, and the initial state covariance. The state is a vector that describes the kinematic information. The transition matrix and the transition covariance together form the motion model, while the observation matrix and the observation covariance form the measurement model. The RTS smoother not only enhances the accuracy of state estimation by incorporating both past and future measurements, yielding improved estimates of the system's true state over time, but, in this project, it also estimates state variables (the velocities) that are not included in the measurements (which contain only positions). The RTS smoother thus both estimates the kinematic information of an object and improves the data quality and the distance estimation accuracy.

3 Methodology

In this chapter, the methods used are presented and discussed.

3.1 Data collection

The ground truth data used to train the position estimation model was collected with an e-scooter carrying a camera and a LiDAR, as shown in Figure 3. A GPS receiver was mounted on the car, as shown in Figure 6, and its output was also logged during the data collection. During data collection, typical objects such as pedestrians, cyclists, e-scooter riders, and cars moved around in front of the e-scooter so that different objects could be captured at different angles and distances.

Figure 3: E-scooter for data collection, equipped with a camera, LiDAR, and logging equipment.

Figure 4: The e-scooter from another angle.
The e-scooter's camera is a monocular fish-eye camera with a resolution of 720x532 pixels, a field of view (FOV) of 220°, and a frame rate of 30 frames per second (fps). The LiDAR is a VLP-16, manufactured by Velodyne Lidar, with vertical and horizontal FOVs of 30 and 360 degrees respectively, distributed over 16 channels, and a frame rate of 10 fps.

The car in Figure 5 that was used during data collection was a Lincoln MKZ with the following dimensions: a length of 4.930 m, a width of 1.864 m, and a height of 1.478 m. GPS data was obtained from the car and analyzed, as described in more detail in Chapter 3.3.1. The GPS was mounted at the geometrical center of the car, i.e., at (length/2, width/2), as shown in Figure 6.

Figure 5: The car used during testing, equipped with a GPS mounted at the car's geometrical center, i.e., at (length/2, width/2).

Figure 6: Schematic drawing of the car and the mounting position of the GPS at the car's geometrical center (length/2, width/2).

3.2 Object detection and classification

In this part, the road users in the video are detected and classified. First, the video is input to the YOLOv7 model for preliminary detection. Then, e-scooterists and cyclists are classified by the Person Classification Model (PCM) and the Cyclist Detection Model (CDM). Figure 7 shows the pipeline of object detection and classification.

Figure 7: Pipeline of the object detection and classification model.

3.2.1 YOLO detection

YOLOv7 detects objects of the desired categories in a video. The outputs of the YOLOv7 model are a JSON file, whose structure is shown in Figure 8, and the annotated video. The weight file 'yolov7x.pt', trained on the COCO dataset, is used as the weights of the YOLOv7 object detection model.
A class filter can be set so that YOLOv7 detects traffic objects only, i.e., 'person', 'bicycle', 'car', 'motorcycle', 'bus', 'train', and 'truck'. For each frame, each object's ID, class, and bounding box (BBox) are detected by the YOLOv7 model and written to the JSON file. The uptime, which is the timestamp of the frame, is read by an Optical Character Recognition (OCR) model that recognizes the characters in the image; the detected uptime is also written to the JSON file.

3.2.2 E-scooter rider detection

The e-scooter rider (e-scooterist) is identified by the PCM. Figure 9 shows the framework of the PCM. The PCM takes the images belonging to one person ID as input and returns the class of the person, which can be pedestrian or e-scooterist. In the first step, all frames corresponding to the person's ID are extracted. For each frame, the BBox of the person is identified and then extended using Equation 5 to obtain an enlarged bounding box. Here, x1, y1, x2, y2 are the coordinates of the top-left and bottom-right corners of the original BBox, w and h are the width and height of the original BBox, x′, y′ is the top-left coordinate of the extended BBox, and w′, h′ are the width and height of the extended BBox. This extended bounding box forms the input to the Image Classification Model (ICM).

    (w, h) = (x2 − x1, y2 − y1)
    (x′, y′, w′, h′) = (x1 − w, y1, 3w, h + h/4)    (5)

Figure 8: Example of the structure of a JSON file with BBox data for each category of detected objects.

The ICM, which is based on MobileNetV2, takes the extended person image as input and returns a sigmoid score. The sigmoid score ranges from 0 to 1: the closer the score is to 1, the more likely the person is an e-scooter rider, and vice versa.
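The bounding-box extension in Equation 5 can be sketched directly (a minimal transcription; the function name is ours, and coordinates are assumed to be in pixels):

```python
def extend_bbox(x1, y1, x2, y2):
    """Extend a person BBox as in Equation 5: widen it to 3x the original
    width (one extra width on each side) and add a quarter of the height at
    the bottom. Returns the top-left corner plus the extended width/height."""
    w, h = x2 - x1, y2 - y1
    return x1 - w, y1, 3 * w, h + h / 4
```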
The ICM is trained on the dataset in [19]. By passing all extended images of one person ID to the ICM, the score for each frame of that ID is computed. The weighted score S_weighted is then computed according to Equation 6:

    S_weighted = ( Σ_{f=1}^{F} W_f · S_f ) / ( Σ_{f=1}^{F} W_f )

    W_f = 1                            if (w′_f · h′_f) / (160 × 160) ≥ 1
    W_f = (w′_f · h′_f) / (160 × 160)  if (w′_f · h′_f) / (160 × 160) < 1    (6)

In Equation 6, F is the number of frames in which the ID exists, S_f is the score in each frame, and W_f is the weight of frame f, i.e., the ratio (capped at 1) between the extended image's size and the desired input size of the Image Classification Model; w′_f and h′_f are the width and height of the extended image in frame f. The weight is introduced to minimize the effect of low-quality images, which are more likely to be misclassified.

Figure 9: The framework of the person classification model.

3.2.3 Cyclist detection

Figure 10: The framework of the cyclist detection model.

YOLOv7 can detect persons and bicycles, but it cannot classify a person riding a bicycle as a cyclist. Figure 10 shows the framework of the cyclist detection model. The key to enabling cyclist detection in the object detection model is the overlap area between the person and the bicycle. The left part of Figure 11 shows the overlap area. If the overlap area between the person and the bicycle is larger than a threshold, the person is considered the overlap person of the bicycle in the current frame. The 'if BBox Overlap Function' (iBO) checks whether the person is the overlap person of the bicycle by comparing the metric r_overlap, the ratio between the overlap area and the combined area of the bicycle and person BBoxes, against a threshold.
Equation 7 shows how the metric is calculated, with variable definitions given in Table 1. Both BBoxes are first passed to the Overlap Checking Function, shown in Figure 12, to make sure the overlap area is larger than 0.

Figure 11: Schematic of person-bicycle overlapping and BBox merging.

Figure 12: A schematic drawing of the overlap checking function.

    A_p = (x_p2 − x_p1)(y_p2 − y_p1)
    A_b = (x_b2 − x_b1)(y_b2 − y_b1)
    A_overlap = |min(x_p2, x_b2) − max(x_p1, x_b1)| · |min(y_p2, y_b2) − max(y_p1, y_b1)|
    r_overlap = A_overlap / (A_p + A_b − A_overlap)    (7)

To regard the overlapping person as the rider of the bicycle, the person should either overlap in most of the frames of the corresponding bicycle ID or have an overlap duration beyond a threshold time. Figure 13 shows how the rider detection function works. The model has two branches, selected according to the frame count of the bicycle's ID. The overlap time of the rider is required to be long in order to suppress noise. The lower branch of Figure 13 is included in the framework because the overlapping person may vary due to limitations of the YOLOv7 detection. When a person is detected as a rider, the Updating Function in Figure 12 merges the BBoxes, updates the bicycle label to cyclist, and deletes the person's frames from the list. Finally, the updated JSON file is created.
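The overlap metric in Equation 7 is an intersection-over-union of the two BBoxes, which could be sketched as below (a hypothetical helper; it assumes a prior check, the Overlap Checking Function, has guaranteed the boxes actually intersect).

```python
# Sketch of the overlap metric in Equation 7 (hypothetical helper).
# Boxes are given as (x1, y1, x2, y2) with (x1, y1) top-left and
# (x2, y2) bottom-right.
def overlap_rate(person, bicycle):
    xp1, yp1, xp2, yp2 = person
    xb1, yb1, xb2, yb2 = bicycle
    a_p = (xp2 - xp1) * (yp2 - yp1)                  # person BBox area
    a_b = (xb2 - xb1) * (yb2 - yb1)                  # bicycle BBox area
    a_overlap = (min(xp2, xb2) - max(xp1, xb1)) * \
                (min(yp2, yb2) - max(yp1, yb1))      # intersection area
    return a_overlap / (a_p + a_b - a_overlap)       # intersection over union
```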
Table 1: Definitions of variables in Equation 7.

    Variable        Definition
    A_p             BBox area of the person
    A_b             BBox area of the bicycle
    A_overlap       Overlap area of the person and bicycle BBoxes
    x_p1, y_p1      Top-left corner of the person BBox
    x_p2, y_p2      Bottom-right corner of the person BBox
    x_b1, y_b1      Top-left corner of the bicycle BBox
    x_b2, y_b2      Bottom-right corner of the bicycle BBox
    r_overlap       Ratio between the overlap area and the combined area of both BBoxes

Figure 13: A schematic drawing of the framework of the rider detection function.

3.3 Object position extraction

The ground-truth positions used as input to the machine learning algorithms were obtained from a GPS mounted in the center of a car and a LiDAR mounted on the e-scooter. The position data for the car came from the car's GPS, while the data for the cyclist and e-scooterist were obtained from the LiDAR mounted on the stationary e-scooter.

3.3.1 Position from GPS

The GPS data contains the following information:
• UTC timestamp
• Latitude and longitude in degrees
• Heading angle in the GPS reference system
• The 2-D velocity

To simplify the data analysis, the latitude and longitude coordinates were converted to the WGS84 reference system and then rotated by the geographical angle (40.333 degrees) of the airfield. Since the coordinates of the e-scooter are known, this point was set as the origin (0, 0), and the coordinates of the car could then be expressed as x and y positions relative to this new origin. The coordinates after these conversions therefore represent the distance from the e-scooter to the center of the car. An example is plotted below:

Figure 14: GPS data showing the distance between the e-scooter and the center of the car during one test.
The e-scooter was oriented at different angles relative to the runway, as visualized in the figure below:

(a) Illustration of 0° rotation of the e-scooter relative to the runway. (b) Illustration of -90° rotation of the e-scooter relative to the runway.
Figure 15: Illustration of different rotations of the e-scooter relative to the runway.

There were also cases where the e-scooter was rotated -45° and 45° relative to the runway. Since the e-scooter was oriented at different angles to the runway, this had to be accounted for to ensure that the coordinate system of the car matched the camera's coordinate system. This was done by applying a rotation matrix to the x and y values, as illustrated in the figures below:

(a) GPS data before being rotated by the e-scooter's angle to the runway. (b) GPS data after the rotation matrix is applied.
Figure 16: The first plot shows the GPS data before taking the e-scooter's angle relative to the runway into account. Applying the rotation matrix gives the second plot, which has the correct orientation.

The car's position cannot be used directly, since the model should ideally be based on the distance to the closest point of the tracked object rather than to the geometric center of the car, where the GPS was mounted. In other words, the recorded position is the position of the car's geometric center. Because of the car's dimensions, the geometric center can still be some distance from the e-scooter while part of the car has already reached it. It is therefore necessary to convert the position of the geometric center to the position of the closest point of the car. This problem was solved by first considering the dimensions of the car.
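The coordinate rotation described above can be sketched as below (a hypothetical helper; the angle values used in the report are the airfield angle and the e-scooter's mounting angle of 0, ±45, or -90 degrees).

```python
# Sketch of the coordinate rotation (hypothetical helper): GPS x/y
# positions are rotated by a given angle so that the car's
# coordinate system is aligned with the e-scooter camera.
import numpy as np

def rotate_xy(x, y, angle_deg):
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])   # 2-D rotation matrix
    return rot @ np.array([x, y])
```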
Because the car does not drive in a straight line, its heading angle affects which point of the car is closest. This was handled by applying a rotation matrix with the same angle as the car's heading angle. The closest point of the car to the e-scooter was then found by placing 100 points along each of the sides, front, and rear of the car (400 in total). These points' coordinates were transformed to the polar system, and the point with the smallest polar radius was taken as the closest point. This is illustrated in the following schematic:

Figure 17: Schematic of how the closest point of the car to the e-scooter was calculated. The red dots represent the points placed along the front, sides, and rear of the car. The polar radius of each point was calculated, and the smallest radius was identified and saved. The green circle marks the point of the car closest to the e-scooter.

The resulting plot of the closest point of the car to the e-scooter is shown in Figure 18:

Figure 18: Based on the same data as Figure 14, with the addition of the position of the closest point of the car to the e-scooter.

The radius, angle, and timestamps were saved to a CSV file, which was later used to generate training data for the position estimation model. During the transition of the closest point from one side of the car to another, a sudden fluctuation is observed, causing the jump in the green line in Figure 18.

3.3.2 Position from LiDAR

The position identification from LiDAR is based on the work by Bharti [6], which uses DBSCAN to process the LiDAR point cloud map. In the point cloud map, the pedestrian, the e-scooter rider, and the cyclist are tracked.
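The closest-point computation could be sketched as below. This is a hypothetical implementation with assumed car dimensions: 100 points are placed along each edge of the car's rectangular outline, rotated by the heading angle, translated to the GPS position of the car's center, and the point with the smallest polar radius (distance to the e-scooter at the origin) is returned.

```python
# Sketch of the closest-point computation (hypothetical helper;
# length/width are assumed example dimensions, not the test car's).
import numpy as np

def closest_point(center_x, center_y, heading_deg, length=4.5, width=1.8, n=100):
    l2, w2 = length / 2, width / 2
    t = np.linspace(-1, 1, n)
    # n points on each of the front, rear, left, and right edges (car frame)
    edges = np.concatenate([
        np.stack([np.full(n, l2), t * w2], axis=1),    # front
        np.stack([np.full(n, -l2), t * w2], axis=1),   # rear
        np.stack([t * l2, np.full(n, w2)], axis=1),    # left side
        np.stack([t * l2, np.full(n, -w2)], axis=1),   # right side
    ])
    a = np.deg2rad(heading_deg)
    rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    pts = edges @ rot.T + np.array([center_x, center_y])   # world frame
    radii = np.hypot(pts[:, 0], pts[:, 1])   # polar radius of each point
    return pts[np.argmin(radii)]             # point with smallest radius
```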
The Unix epoch time of the first frame in VeloView serves as a standardized and synchronized time reference for the script. This improves consistency, ensures accurate alignment of the data, and facilitates proper initialization of the processing environment. The initial step was therefore to extract the Unix epoch time of the first frame using VeloView and use this time as the script's starting point. After these initial settings, the object could be tracked, as shown in Figure 19a and Figure 19b below:

(a) Point cloud map before applying DBSCAN to remove noise. (b) Tracked object point cloud map after applying DBSCAN.
Figure 19: The two plots show how traffic participants are tracked.

For different tracked objects, different X-axis and Y-axis ranges need to be set to ensure that the object can be tracked continuously and that more data related to the object can be obtained. A threshold on the distance between consecutive centroids is used to detect tracking loss: if the calculated distance between consecutive centroids exceeds this threshold, the tracked object is assumed to have moved unrealistically fast within the specified time frame. Note that some objects move faster than others, which requires adjusting the threshold used to determine segment loss. Finally, by processing 22 LiDAR point cloud files, the X position, Y position, and angle of the tracked objects were saved in separate CSV files, which were used for the subsequent training and trajectory prediction. The polar coordinate system used to represent the position from the LiDAR is the following:

Figure 20: The polar coordinate system used to represent the position data from the LiDAR.
With this coordinate system, Equation 2 has to be modified to:

    x = −R cos θ,  y = −R sin θ    (8)

3.4 Training data

To form the training data for the position estimation, the position data from the LiDAR or GPS was used in combination with the JSON file from the video data. The first step was to synchronize the LiDAR and GPS timestamps with the timestamps of the camera frames. This was done by interpolating the different time vectors to obtain one that relates the tracked IDs to the object's position relative to the e-scooter. Then, for each frame that contains the object of interest, the position and BBox information are saved to form the training data. The table below lists the variables saved for each frame that contains the tracked object.

Table 2: Variables saved from the tracked object for each camera frame that contains the tracked object. This data is saved and used as training data for the models.

    X       x-coordinate of the object
    Y       y-coordinate of the object
    R       Radial distance to the object
    θ       Angle to the object
    x       x-coordinate of the lower mid point of the BBox
    y       y-coordinate of the bottom of the BBox
    h       Height of the BBox
    w       Width of the BBox
    y_top   y-coordinate of the top of the BBox

This data was then input to different machine learning algorithms. To avoid overfitting, different combinations of inputs were tested to find the optimal set of inputs.

3.4.1 Pre-processed ground truth data

In the LiDAR and GPS data processing, the data from all traffic participants was saved in separate CSV files with the format of Table 3. The CSV files contain the UTC time, X position, Y position, and angle for each frame of the tracked object, and were used as input for the model training.

Table 3: Example of the content of the LiDAR and GPS CSV files.
    UTC                       X position   Y position   Distance   Degree
    2023-08-17 12:16:53.399   -17.0146     0.78729      17.0335    -2.649273
    2023-08-17 12:16:53.510   -16.3035     0.84513      16.326     -2.967397
    2023-08-17 12:16:53.620   -15.5918     0.8581       15.616     -3.150115
    2023-08-17 12:16:53.731   -14.8761     0.86372      14.9017    -3.322912
    2023-08-17 12:16:53.842   -14.1617     0.88015      14.1896    -3.556359
    2023-08-17 12:16:53.953   -13.439      0.8971       13.4694    -3.819038
    2023-08-17 12:16:54.064   -12.7228     1.0048       12.7629    -4.515852
    2023-08-17 12:16:54.174   -12.0091     0.95308      12.048     -4.537653
    ...                       ...          ...          ...        ...

From the test data set, the following amounts of ground-truth data were obtained:

Table 4: Number of ground-truth data points for the three different road users.

    Road user      Sensor   Total number of data points
    Cyclist        LiDAR    1764
    Car            GPS      1387
    E-scooterist   LiDAR    4374

3.4.2 Extracting bounding box information from camera frames

The video data processed by the object detection model was saved in a JSON file that contains, for every camera frame, the frame number, GPS time, and uptime of the frame, as well as the BBox, object type, and ID of the target objects appearing in that frame. By indexing the ID of the target object, the corresponding frame number, GPS time, uptime, and BBox information could be extracted. In this step, the BBox information consists of the x and y coordinates of the top-left point of the BBox and its height and width. The BBox information was used as the input features for the position estimation model, and the GPS time, which indicates when the target object appears in the video, was used to synchronize the camera data and the ground-truth data.

3.4.3 Synchronizing LiDAR/GPS and camera data

The whole dataset for position estimation is based on the camera data, which means that each camera frame containing the target object should have a corresponding ground truth.
However, because the camera has a frame rate of 30 Hz while the LiDAR and GPS have frame rates of 10 Hz and 50 Hz respectively, the camera data needs to be synchronized with the LiDAR/GPS data. The LiDAR and GPS data are interpolated to match the timestamps of the camera frames. The LiDAR has a lower sampling frequency than the camera, which means there are several camera frames between two consecutive LiDAR frames. The interpolation for a given camera frame was done using the two nearest LiDAR frames, one before and one after the camera frame. Through this interpolation, each camera frame obtains the corresponding positions X and Y, as well as the distance and angle of the target object. The synchronization for GPS followed the same process.

Figure 21: Interpolation to synchronize camera and LiDAR.

3.5 Data post-processing

To improve the quality of the model's estimates and to estimate the kinematic information, an outlier-removing function based on DBSCAN and a Rauch–Tung–Striebel (RTS) smoother are applied to the raw output of the position estimation algorithm.

3.5.1 Trajectory extraction

The input to the outlier-removing function and the RTS smoother is the trajectory of the object, i.e. time-series data describing the object's position over time. Since the JSON output of the position estimation contains the objects' positions in each frame of the input video, rather than the trajectory of each object, a trajectory extraction step is required. The extracted trajectory may, however, not be continuous in time, since the position information of an object is missing in some frames, while the subsequent functions expect a continuous trajectory as input. The missing position information is therefore identified and filled in by linear interpolation to obtain a continuous trajectory.
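The synchronization step can be sketched as below (a hypothetical helper): LiDAR/GPS values are linearly interpolated onto the camera frame timestamps, so that every camera frame containing the target object gets a corresponding ground-truth value.

```python
# Sketch of the timestamp synchronization (hypothetical helper).
import numpy as np

def sync_to_camera(cam_times, sensor_times, sensor_values):
    # For each camera timestamp, np.interp interpolates linearly
    # between the two nearest sensor samples (before and after).
    return np.interp(cam_times, sensor_times, sensor_values)
```

The same call is repeated for each ground-truth channel (X, Y, distance, angle).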
3.5.2 Outlier removal

The output of the position estimation may include outliers (e.g. Figure 22) that lie far from the correct trajectory points, due to insufficient training data or other causes.

Figure 22: Two example trajectories that include a few outliers.

Table 5: Selection rule for min_samples.

    n (points of trajectory)   n < 20   20 ≤ n < 100   n ≥ 100
    min_samples                2        0.1n           10

To find the outliers, DBSCAN, a clustering algorithm that groups connected points and flags isolated points as noise, is applied to the trajectory points of each object. To apply the outlier-removing function, two hyper-parameters must be tuned: eps (the maximum distance between two samples for one to be considered in the neighborhood of the other) and min_samples (the minimum number of samples in a neighborhood for a point to be considered a core point). The eps value is set to 2 based on the motion limits of the road users. The value of min_samples is set based on the number of points n in the trajectory (Table 5). For the removing function to identify the outliers, which are a few points far away from the trajectory, i.e. a minority of all points, min_samples needs to be small. However, if n is large, the outliers may form groups of more than 2 points, although such groups remain small (fewer than 10 points). As a result, min_samples is varied with n.

3.5.3 RTS smoothing

The state vector of the target's kinematics contains the lateral position x_k^px, the longitudinal position x_k^py, the lateral velocity x_k^vx, and the longitudinal velocity x_k^vy. The motion model is shown in Equation 9, where the 4 × 4 matrix is the transition matrix and the covariance of the motion noise q_k, which follows a normal distribution, is the transition covariance. The parameter T is the time interval between two frames.
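The outlier removal could be sketched with scikit-learn's DBSCAN as below (a hypothetical helper): points labelled -1 (noise) by DBSCAN are dropped from the trajectory, with eps = 2 and the min_samples rule of Table 5.

```python
# Sketch of the DBSCAN-based outlier removal (hypothetical helper).
import numpy as np
from sklearn.cluster import DBSCAN

def min_samples_rule(n):
    # Selection rule of Table 5.
    if n < 20:
        return 2
    if n < 100:
        return max(2, int(0.1 * n))
    return 10

def remove_outliers(traj, eps=2.0):
    # traj: (n, 2) array of trajectory points; DBSCAN labels noise
    # points as -1, and those points are removed.
    labels = DBSCAN(eps=eps, min_samples=min_samples_rule(len(traj))).fit_predict(traj)
    return traj[labels != -1]
```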
Since the video has 30 frames per second, the time interval is set to T = 1/30 s. The motion model describes a constant-velocity process: the position in the next state is derived from the position and velocity of the previous state, and the velocity in the next state equals the velocity in the previous state plus zero-mean Gaussian noise (derived in Equation 10).

    x_{k+1} = [1 0 T 0; 0 1 0 T; 0 0 1 0; 0 0 0 1] x_k + q_k,
    q_k ~ N(0, diag(0, 0, 0.1, 0.1))    (9)

where x_k = [x_k^px, x_k^py, x_k^vx, x_k^vy]^T.

    ẋ = [0 0 1 0; 0 0 0 1; 0 0 0 0; 0 0 0 0] x + G v,
    G = [0 0; 0 0; 1 0; 0 1],  v ~ N(0, diag(σ²_vx, σ²_vy))
    ⇒ cov(q_k) = T · G diag(σ²_vx, σ²_vy) G^T = diag(0, 0, T σ²_vx, T σ²_vy)    (10)

The measurement model is shown in Equation 11, where the 2 × 4 matrix is the observation matrix and the covariance of the measurement noise r_k, which follows a normal distribution, is the observation covariance. Both the transition covariance and the observation covariance can be tuned to optimize the performance of the RTS smoother on the training video. The variance in the measurement model is larger than in the motion model, since the measurement, i.e. the output of the position estimation model, contains noise that makes the trajectory rough.

    y_k = [1 0 0 0; 0 1 0 0] x_k + r_k,  r_k ~ N(0, diag(2, 2))    (11)

where y_k = [y_k^px, y_k^py]^T. The position (both lateral and longitudinal) in the initial state is set equal to the first measurement, and the velocity in the initial state is derived from the first and second position measurements under the constant-velocity assumption.
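The constant-velocity Kalman filter with an RTS smoothing pass could be sketched as below. This is a minimal hypothetical implementation, not the authors' code; the noise covariances follow the tuned values in Equations 9 and 11 and the initial state covariance given in the report.

```python
# Minimal sketch of a constant-velocity Kalman filter followed by a
# Rauch-Tung-Striebel (RTS) backward smoothing pass (hypothetical).
# State: [px, py, vx, vy]; meas: (n, 2) array of position measurements.
import numpy as np

def rts_smooth(meas, T=1 / 30):
    F = np.array([[1, 0, T, 0], [0, 1, 0, T], [0, 0, 1, 0], [0, 0, 0, 1.0]])
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0.0]])
    Q = np.diag([0, 0, 0.1, 0.1])   # transition covariance (Eq. 9)
    R = np.diag([2.0, 2.0])         # observation covariance (Eq. 11)
    n = len(meas)
    v0 = (meas[1] - meas[0]) / T    # initial velocity from first two points
    x = np.concatenate([meas[0], v0])
    P = np.diag([2, 2, 9, 9.0])     # initial state covariance
    xf, Pf, xp, Pp = [], [], [], []
    for i, z in enumerate(meas):    # forward Kalman filter
        if i == 0:
            x_pred, P_pred = x, P   # first measurement: no prediction step
        else:
            x_pred, P_pred = F @ x, F @ P @ F.T + Q
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
        x = x_pred + K @ (z - H @ x_pred)
        P = (np.eye(4) - K @ H) @ P_pred
        xp.append(x_pred); Pp.append(P_pred); xf.append(x); Pf.append(P)
    xs, Ps = [xf[-1]], [Pf[-1]]
    for k in range(n - 2, -1, -1):  # backward RTS pass
        G = Pf[k] @ F.T @ np.linalg.inv(Pp[k + 1])
        xs.insert(0, xf[k] + G @ (xs[0] - xp[k + 1]))
        Ps.insert(0, Pf[k] + G @ (Ps[0] - Pp[k + 1]) @ G.T)
    return np.array(xs)             # smoothed [px, py, vx, vy] per frame
```

Note that the smoother outputs both the smoothed positions and the velocities, which is how the relative velocity of the target road user is obtained.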
The initial state covariance is set to diag(2, 2, 9, 9), where the velocity variance is relatively large because the derived initial velocity is not accurate.

3.6 Evaluation methods

When evaluating the performance of a regression model, three metrics are commonly used: the mean squared error (MSE), the mean absolute error (MAE), and the coefficient of determination (R²).

3.6.1 MSE

MSE is the average of the squared differences between predicted and actual values:

    MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²    (12)

where n is the number of samples, y_i is the actual value, and ŷ_i is the corresponding predicted value. The lower the MSE, the better the prediction performance of the model.

3.6.2 MAE

MAE is the average of the absolute differences between predicted and actual values:

    MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|    (13)

where n is the number of samples, y_i is the actual value, and ŷ_i is the corresponding predicted value. As for the MSE, the lower the MAE, the better the prediction performance of the model.

3.6.3 R-squared

R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is at most 1 (and can be negative for very poor fits), where a higher value indicates better explanatory power of the model:

    R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²    (14)

where n is the number of samples, y_i is the actual value, ŷ_i is the corresponding predicted value, and ȳ is the mean of the actual values.

4 Results

In this chapter, the results are presented and visualized.

4.1 Object detection model

4.1.1 Object detection model for e-scooterists

First, to test the performance of the ICM, a test image set was built: all images categorized as persons were extracted from the test videos and manually classified as pedestrian or e-scooterist.
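The three metrics in Equations 12-14 can be written out explicitly as below; scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute the same quantities.

```python
# The three evaluation metrics of Equations 12-14, written out.
import numpy as np

def mse(y, y_hat):
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

def r2(y, y_hat):
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot
```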
The receiver operating characteristic (ROC) curve (Figure 23) and the proportion distribution curve (Figure 24) of the ICM could then be derived.

Figure 23: ROC curve of the ICM.

Figure 24 shows that the ICM distinguishes pedestrians well, but is worse at distinguishing e-scooterists. The threshold (the sigmoid score at which the classification is made) was therefore optimized to improve the performance of the ICM. The threshold was set to 0.7, giving the model high specificity for pedestrians while keeping a reasonable sensitivity for e-scooterists. The confusion matrix for the selected threshold is shown in Table 6.

Table 6: Confusion matrix of e-scooterist detection (based on images).

    Actual \ Predicted   Positive   Negative
    E-scooterist         1413       1358
    Pedestrian           232        2508

Using this ICM, the PCM pipeline was built and its performance tested on the test videos. By manual observation, the confusion matrix of the e-scooterist detection was derived (Table 7). As Table 7 shows, unlike Table 6, the PCM performs well at distinguishing both e-scooterists and pedestrians, giving an accuracy of 90.48%.

Figure 24: Proportion distribution curve of the ICM (0: e-scooterist; 1: pedestrian).

Table 7: Confusion matrix of e-scooterist detection (based on objects).

    Actual \ Predicted   Positive   Negative
    E-scooterist         23         1
    Pedestrian           3          15

4.1.2 Object detection model for cyclists

The performance of the CDM was tested on the test videos. By manual observation, 26 cyclist IDs were successfully detected and 2 were not (Table 8). The two missed cyclists were not detected because the riders (persons) were not detected by YOLOv7, since they were far away from the ego vehicle, which affects the CDM because the CDM updates the category based on rider detection. Overall, the CDM detects most cyclists in the test videos, with an accuracy of 92.86%.
Table 8: Detection accuracy of the CDM.

               True   False
    Cyclists   26     2

4.2 Position estimation model

To obtain position information from naturalistic data, position estimation models were trained for each kind of road user using different machine learning algorithms: Linear Regression, Decision Tree Regressor, K-Nearest Neighbors Regressor, Random Forest Regressor, and Multi-layer Perceptron Regressor. The output of each model is the distance and angle of the object.

4.2.1 Position estimation model for cyclists

It can be concluded from Table 9 that the KNeighborsRegressor has the highest R² score and, at the same time, the lowest errors. The RandomForestRegressor also performs well in general. Figure 26 shows the actual and predicted positions estimated by the KNeighborsRegressor in the Cartesian coordinate system, while Figure 27 shows the same in the polar system.

Table 9: Comparison of the performance of different regression algorithms.

    Model                   MSE     MAE     R² Score
    LinearRegression        5.692   1.923   0.477
    DecisionTreeRegressor   1.842   0.576   0.835
    KNeighborsRegressor     0.649   0.268   0.944
    RandomForestRegressor   0.908   0.560   0.918
    MLPRegressor            2.437   1.162   0.781

Figure 27 and Figure 26 show the reference data used to test and evaluate the polar and Cartesian models respectively. From the data, it is clear that the accuracy of cyclist detection decreases as the angle and distance increase.

Table 10: Performance of KNN using different parameters.

    Parameters                             MSE     MAE     R² Score
    Low mid x, Low mid y                   2.064   0.984   0.819
    Low mid x, Low mid y, height, width    1.063   0.606   0.905
    Low mid x, Low mid y, height           1.522   0.787   0.861
    Low mid x, Low mid y, width            1.242   0.667   0.889

Table 11: Performance of Random Forest using different parameters.
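The training of one of these position estimation models could be sketched as below. This is a hypothetical scikit-learn sketch, not the authors' code: the input features are the BBox lower-mid x/y, height, and width from the synchronized training files, and the targets are the polar distance R and angle θ (or the Cartesian X and Y).

```python
# Sketch of training a position estimation model (hypothetical).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

def train_position_model(features, targets):
    X_train, X_test, y_train, y_test = train_test_split(
        features, targets, test_size=0.2, random_state=0)
    model = KNeighborsRegressor(n_neighbors=2)   # k = 2, as in the report
    model.fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test))
```

Swapping `KNeighborsRegressor` for `RandomForestRegressor` or the other algorithms gives the comparison tables.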
    Parameters                             MSE     MAE     R² Score
    Low mid x, Low mid y                   2.112   0.947   0.813
    Low mid x, Low mid y, height, width    0.942   0.577   0.917
    Low mid x, Low mid y, height           1.277   0.722   0.886
    Low mid x, Low mid y, width            1.242   0.667   0.889

In Tables 10 and 11, the two models with the highest accuracy were chosen for comparison. Different feature combinations were used during training and the resulting R² values compared. When the features Low mid x, Low mid y, height, and width are selected, the models perform best.

Table 12: Performance of RF and KNN models for polar and Cartesian coordinates.

    Algorithms      MSE     MAE     R² Score
    RF Polar        0.979   0.639   0.892
    RF Cartesian    1.049   0.687   0.854
    KNN Polar       0.713   0.365   0.920
    KNN Cartesian   0.731   0.402   0.913

It is clear from Table 12 that the R² score is highest for the KNN model predicting distance and angle in the polar coordinate system, which gives the best model performance.

Figure 25: KNeighborsRegressor hyperparameter optimization, where k = 2 gives the highest R² score.

By tuning the hyperparameters of the KNN model, as shown in Figure 25, the model performs best when the number of neighbors k is 2. Therefore, the best model uses k = 2 and achieves an R² score of 0.942.

Figure 26: Cyclist test data and predicted data using Cartesian coordinates, where the blue points correspond to the test data and the orange points to the points predicted by the model.

Figure 27: Cyclist test data and predicted data using polar coordinates. This plot shows the same data as Figure 26 but represented in polar coordinates.

The predicted data is plotted against the test data in Figures 26 and 27. In general, the model predictions are very close to the actual values.

Figure 28: Cyclist model error heatmap in Cartesian coordinates, showing small errors with a maximum error of 0.8 m.
While the cyclist model exhibits good overall performance (as shown in Figure 28, the prediction error is within 0.4 m in most areas), challenges arise when the cyclist is at a large distance and angle (as shown in Figure 29 and Figure 32). Additionally, notable angular errors are observed close to the e-scooter, indicating specific areas where the model's predictive accuracy could be improved.

Figure 29: Cyclist model distance error heatmap using polar coordinates. Notably, the largest distance errors in the predictions occur when the cyclist is at a large angle to the e-scooter and far away.

Figure 30: Cyclist model distance error heatmap in polar coordinates after DBSCAN is applied to the model predictions.

(a) KNN model distance prediction before filtering. (b) KNN model distance prediction after filtering.
Figure 31: Pre-filter and post-filter K Neighbors Regressor (n = 2) normalised polar model distance prediction vs ground truth.

Figure 32: Cyclist angle error heatmap in polar coordinates.

Figure 33: Cyclist angle error heatmap in polar coordinates after filtering.

(a) KNN model angle prediction before filtering. (b) KNN model angle prediction after filtering.
Figure 34: Pre-filter and post-filter K Neighbors Regressor (n = 2) normalised polar model angle prediction vs ground truth.

After filtering and smoothing, the results are shown in Figures 30 and 33. Compared with the error heatmaps before filtering, the error is significantly reduced. By plotting the regression curves (distance and angle) before and after filtering (Figures 31a and 31b, Figures 34a and 34b), we can see that the filtered model predicts better.

4.2.2 Position estimation model for e-scooter riders

In this case, the KNeighborsRegressor again has the best performance, according to Table 13.
The R² score reaches 0.9875 while the mean absolute error is only 0.2371. The RandomForestRegressor has a similar performance.

Table 13: Performance of different regression algorithms.

    Model                   MSE     MAE     R² Score
    LinearRegression        7.391   1.999   0.682
    DecisionTreeRegressor   0.527   0.288   0.976
    KNeighborsRegressor     0.268   0.237   0.987
    RandomForestRegressor   0.312   0.280   0.986
    MLPRegressor            1.336   0.705   0.935

From Figure 36 and Figure 37, it can be seen that the consistency between predicted and actual positions is very high, which indicates that the model makes good predictions.

Table 14: Performance of KNN with different parameters.

    Parameters                             MSE     MAE     R² Score
    Low mid x, Low mid y                   0.611   0.463   0.975
    Low mid x, Low mid y, height, width    0.342   0.297   0.984
    Low mid x, Low mid y, height           0.432   0.367   0.982
    Low mid x, Low mid y, width            0.413   0.345   0.982

Table 15: Performance of Random Forest with different parameters.

    Parameters                             MSE     MAE     R² Score
    Low mid x, Low mid y                   0.612   0.495   0.975
    Low mid x, Low mid y, height, width    0.393   0.320   0.982
    Low mid x, Low mid y, height           0.479   0.416   0.979
    Low mid x, Low mid y, width            0.621   0.393   0.970

Table 16: Performance of RF and KNN models using polar and Cartesian coordinates.

    Algorithms      MSE     MAE     R² Score
    RF Polar        0.553   0.385   0.969
    RF Cartesian    0.564   0.386   0.967
    KNN Polar       0.386   0.294   0.978
    KNN Cartesian   0.394   0.304   0.978

Similarly, it can be concluded from Tables 16, 15, and 14 that the KNN model with polar coordinates as output again has the best performance. Next, the e-scooter rider data was processed in the same way as the cyclist data: the inputs are Low mid x, Low mid y, height, and width, the KNN model was chosen for prediction, and the hyperparameters were tuned to obtain the final model. As can be seen in Figure 35, k = 2 gives the highest R² score (best performance), resulting in a final R² score of 0.978.
Figure 35: KNeighborsRegressor hyperparameter optimization, where k = 2 gives the best result.

Figure 36: E-scooter test data and predicted data in Cartesian coordinates.

Figure 37: E-scooter test data and predicted data in polar coordinates.

The predicted data is plotted against the test data for e-scooterists in Figures 36 and 37. Overall, the model predictions closely align with the actual values.

Figure 38: E-scooter error heatmap in Cartesian coordinates.

The analysis reveals negligible deviations in both distance errors (Figure 39) and angular errors (Figure 42) across the majority of areas. At the same time, the distance prediction errors were very small, mostly within 0.4 m (Figure 38). Notably, only a small fraction of regions exhibited angular errors above 10 degrees.

Figure 39: E-scooter distance error heatmap in polar coordinates.

Figure 40: E-scooter distance error heatmap in polar coordinates after filtering.

(a) KNN model distance prediction before filtering. (b) KNN model distance prediction after filtering.
Figure 41: Pre-filter and post-filter K Neighbors Regressor (n = 2) normalised polar model distance prediction vs ground truth.

Figure 42: E-scooter angle error heatmap in polar coordinates.

Figure 43: E-scooter angle error heatmap in polar coordinates after filtering.

(a) KNN model angle prediction before filtering. (b) KNN model angle prediction after filtering.
Figure 44: Pre-filter and post-filter K Neighbors Regressor (n = 2) normalised polar model angle prediction vs ground truth.

After filtering and smoothing, the prediction error results are shown in Figures 40 and 43. Compared with the error heatmaps before filtering, the error is significantly reduced.
By plotting the regression curves (distance and angle) before and after filtering (Figures 41a and 41b, Figures 44a and 44b), we can see that the filtered model predicts better.

4.2.3 Position estimation model for cars

It can be concluded from the previous sections that the KNeighborsRegressor and RandomForestRegressor are the strongest algorithms, so the performance of these two is compared in this section.

Table 17: Performance of different regression algorithms.

    Model                   MSE     MAE     R² Score
    KNeighborsRegressor     0.829   0.278   0.989
    RandomForestRegressor   1.872   0.618   0.974

Table 18: Performance of KNN with different parameters.

    Parameters                             MSE     MAE     R² Score
    Low mid x, Low mid y                   4.866   0.770   0.933
    Low mid x, Low mid y, height, width    1.566   0.326   0.979
    Low mid x, Low mid y, height           2.339   0.470   0.968
    Low mid x, Low mid y, width            1.504   0.341   0.980

Table 19: Performance of Random Forest with different parameters.

    Parameters                             MSE     MAE     R² Score
    Low mid x, Low mid y                   4.522   1.017   0.938
    Low mid x, Low mid y, height, width    2.223   0.655   0.970
    Low mid x, Low mid y, height           2.481   0.730   0.966
    Low mid x, Low mid y, width            2.706   0.724   0.963

Table 20: Performance of RF and KNN models using polar and Cartesian coordinates.

    Algorithms      MSE     MAE     R² Score
    RF Polar        3.291   0.843   0.955
    RF Cartesian    3.475   0.883   0.951
    KNN Polar       2.804   0.558   0.963
    KNN Cartesian   2.845   0.563   0.961

From Tables 17, 18, 19, and 20, it can be seen that the KNN model using polar coordinates as output has almost the same R² score as when using Cartesian coordinates, but its MSE and MAE values are slightly lower with polar coordinates. Therefore, the KNN model with polar coordinates as output has the best performance. Using 2 as the number of neighbors (k = 2) gives the best result for the car models, see Figure 45.

Figure 45: KNeighborsRegressor hyperparameter optimization.
From the figure, it can be concluded that using k = 2 neighbors gives the highest R2 score.

Figure 46: Car test data and predicted data by Cartesian coordinates.
Figure 47: Car test data and predicted data by polar coordinates.

Comparing the predicted values in Cartesian coordinates (Figure 46) and polar coordinates (Figure 47), the predicted and true values match closely, which means that the prediction performance of the model is good.

Figure 48: Car error heatmap by Cartesian coordinates.

As Figure 48 shows, the model predictions constrain the distance prediction error for the car to within 2 meters. Certain regions exhibit white spaces (Figures 48, 49 and 52), indicating missing data, a consequence of the constraints imposed by our limited test dataset.

Figure 49: Car distance error heatmap by polar coordinates.
Figure 50: Car distance error heatmap by polar coordinates after filtering.
(a) KNN model distance prediction before filtering. (b) KNN model distance prediction after filtering.
Figure 51: Pre-filter and post-filter KNeighborsRegressor (n = 2) normalised polar model distance prediction vs ground truth.
Figure 52: Car angle error heatmap by polar coordinates.
Figure 53: Car angle error heatmap by polar coordinates after filtering.
(a) KNN model angle prediction before filtering. (b) KNN model angle prediction after filtering.
Figure 54: Pre-filter and post-filter KNeighborsRegressor (n = 2) normalised polar model angle prediction vs ground truth.

After filtering and smoothing, the prediction error results are shown in Figures 50 and 53. Compared to the error heatmaps before filtering, the errors are significantly reduced.
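The model-selection step behind these results — a KNeighborsRegressor trained on bounding-box features with polar-coordinate output and a sweep over the number of neighbors (Figures 35 and 45) — can be sketched as below. The data here are synthetic stand-ins, and the feature-to-distance mapping is a toy assumption; the project's real models were trained against LiDAR and GPS ground truth:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the bounding-box features used in the report:
# bottom-centre pixel x and y, box height and width.
X = rng.uniform([0.0, 300.0, 20.0, 10.0],
                [1280.0, 720.0, 400.0, 200.0], size=(600, 4))
# Toy polar ground truth (NOT the real camera geometry): distance falls
# with box height, bearing follows the horizontal pixel position.
r = 30.0 - 0.06 * X[:, 2] + rng.normal(0.0, 0.1, 600)
theta = (X[:, 0] - 640.0) / 640.0 * 90.0
y = np.column_stack([r, theta])            # polar output (r, theta)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

best_k, best_r2 = 0, -np.inf
for k in range(1, 11):                     # hyperparameter sweep over k
    model = make_pipeline(StandardScaler(),
                          KNeighborsRegressor(n_neighbors=k))
    model.fit(X_tr, y_tr)
    score = r2_score(y_te, model.predict(X_te))
    if score > best_r2:
        best_k, best_r2 = k, score
```

KNeighborsRegressor natively supports the two-dimensional (r, theta) target, and feature scaling matters here because the pixel coordinates and box sizes have very different ranges.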
By plotting the regression curves for distance and angle before and after filtering (Figures 51a and 51b, Figures 54a and 54b), it can be seen that the filtered model predicts better. In conclusion, the entire training process adheres to a systematic pipeline, encompassing the selection of a machine learning model, definition of its input and output parameters, subsequent hyperparameter tuning, and final evaluation of its performance. The final results are shown in Table 21.

Table 21: Final model selection results.
Target        | Model | Input                               | Output            | Hyperparameter
Cyclists      | KNN   | Low mid x, Low mid y, height, width | Polar coordinates | n = 2
E-scooterists | KNN   | Low mid x, Low mid y, height, width | Polar coordinates | n = 2
Cars          | KNN   | Low mid x, Low mid y, height, width | Polar coordinates | n = 2

5 Discussion

In this chapter, various aspects of the work done will be discussed and potential limitations presented.

5.1 Discussion on the overall result

We first delve into the performance of the models, particularly the e-scooterist detection and distance estimation models. When validating the models on the training set, we observed commendable performance. However, challenges arise when validating on naturalistic data sets, particularly in the case of pedestrian prediction. For instance, when a distance visually estimated at 10 meters in a video is compared with the model's prediction of 3 meters, a significant discrepancy is evident, which indicates notable errors. This highlights the need for a more in-depth investigation into the models' generalization capabilities in real-world scenarios. A significant challenge stems from the absence of ground truth values in naturalistic datasets, hindering our ability to accurately assess the models' performance in real-world environments.
The limitations imposed by the lack of known ground truth values underscore the importance of future work on collecting additional real-world data to enhance the robustness of our evaluation framework. Furthermore, in the context of the distance prediction models, we observed minimal differences between the prediction results of the KNN model and the Random Forest regression models. This makes it challenging to definitively determine which model performs better on naturalistic datasets, and suggests a need for further optimization in model selection and parameter tuning to enhance performance in real-world environments.

5.2 Training dataset for YOLOv7

The training dataset used for the YOLOv7 weights "yolov7x.pt" was based on COCO, which is a general-purpose training dataset and could therefore introduce more errors in object detection and classification than a more specific training dataset would. The benefit of using a more traffic-related dataset is unknown; due to time constraints this was not investigated, but it could improve the performance of the models. An example of when this potentially caused a problem was an incident where the model classified a road marking as a cyclist.

5.3 Detection accuracy of PCM & CDM

Figure 24 shows the proportion distribution curve of the ICM, which is the model that classifies a single image as pedestrian or e-scooterist. Ideally, the e-scooterist class should have a high proportion when the sigmoid score is near 0 (in other words, the blue line (e-scooterist) should have a high proportion value (>90%) when the sigmoid value is smaller than 0.5), and vice versa for pedestrians. However, only pedestrian detection performs well; e-scooterist detection performs worse than pedestrian detection. The most significant reason for this is that the image dataset used to train the ICM differs from the images extracted from the track test video.
The image dataset used to train the ICM was not captured by a fisheye camera, and the camera mounting locations and image capture environments also vary. All these factors could affect the performance of the ICM. Another possible reason is that the image quality of person objects that are too far away is too poor (the bounding box is too small), leading to inaccurate predictions by the ICM. As a result, the weighted score algorithm (Figure 9) was introduced to mitigate the shortcomings of the ICM. Checking the results of the PCM (Table 7), we found that the detection accuracy is acceptable and much better than using the ICM alone as the e-scooterist detection model, which means that the weighted score algorithm works and is necessary. The CDM is a rule-based algorithm based on the overlap area. Although using the ratio between the overlap area and the overall area for cyclist detection is not rigorous, the detection performance on the test track video (Table 8) shows an acceptable accuracy.

5.4 Limitations of the test data for distance estimation

Since the test dataset only includes a few people acting as VRUs, the developed algorithms may be biased towards people of similar height and size. In normal traffic situations, there could be many different people on the road, which could introduce problems. To combat this issue, the algorithms should ideally be trained on people of all ages and heights to avoid bias. It would also be preferable to include strollers and wheelchairs. The same can be said about the car used, since there are many different vehicles on the road, not just sedans. To properly train the algorithms, the test data should also include (but not be limited to) trucks, buses, and trams. Finally, the impact of different weather conditions would also be of great interest to investigate.
However, including such a vast number of road users, vehicles, and weather conditions would increase the cost of the test data substantially, making it difficult to finance. From Table 4, the number of tests for the e-scooter rider is half of the number for the cyclist and the car, which is not ideal; there should preferably be the same number of tests for each road user. As can be seen in Figures 48 to 50 and Figures 52 to 53, there are parts of the figures which are white, meaning that there are no data in these areas. This results in a model that is not trained properly, since the training data are incomplete. This risks lowering the model's performance, making predictions in some parts of the video frames less accurate. This is not the case for e-scooterists and cyclists, since their training data do not have the same problem.

5.5 Calculation of the closest point of the car

As can be seen in Figure 18, the plot of the closest point of the car to the e-scooter deviates more than expected from the center of the car when the car is far away. This is most likely due to rounding errors and the fact that a small error in the polar angle causes a greater error in the calculated position when the radial distance is large. This did not cause any significant problems, since the error occurs at large distances where the camera is unable to detect the car. Since the machine learning algorithms are based on frames where the car is detected, this error is not included in the models.

5.6 E-scooter orientation

Since the transformation of the coordinate systems was based on the orientation of the e-scooter in relation to the runway, errors could be introduced in the calculations. The angles were determined from the videos as -90°, 90°, -45°, and 45°, but it is possible that the actual angles were slightly different.
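The coordinate-system rotation involved, and the sensitivity of the computed position to small heading errors at large range, can be illustrated with a standard 2-D rotation matrix. This is a sketch: the function name, the 2° heading error and the 50 m range are illustrative choices, not values from the project.

```python
import numpy as np

def rotate_to_runway(x, y, heading_deg):
    """Rotate a point from the e-scooter frame into the runway frame.

    heading_deg is the e-scooter's orientation relative to the runway;
    the report uses -90, -45, 45 and 90 degrees.
    """
    a = np.radians(heading_deg)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    return rot @ np.array([x, y])

# A small heading error propagates into a large position error at range:
p_true = rotate_to_runway(0.0, 50.0, 45.0)
p_off = rotate_to_runway(0.0, 50.0, 47.0)   # 2 degree estimation error
err = np.linalg.norm(p_true - p_off)        # roughly 1.7 m at 50 m range
```

The chord-length relation err = 2·r·sin(Δθ/2) shows why the lateral and longitudinal distances, which are derived from the distance and angle, degrade with range even when the angular error itself is small.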
The angles were based on visual estimation, which could be improved by using the containers seen in Figure 55 as a reference together with the LiDAR data. This is why the orientation could introduce errors when using polar coordinates, especially when the distance is large, because the lateral and longitudinal distances are derived from the distance and angle. Below follows an example of the -45° and 45° rotations of the e-scooter relative to the runway, which could deviate from the actual angle:

Figure 55: 45° and -45° rotation of the e-scooter relative to the runway.

5.7 Object detection at extreme angles and distances

When watching the videos produced by YOLOv7, the ID of a tracked object can change over time due to the object moving out of frame or a short loss of detection. When synchronizing the video and distance data, each ID of the tracked object is used to make sure that the amount of training data extracted from each video is maximized. The most problematic case is when the tracked object is only partially in the frame, as shown below:

Figure 56: ID of the tracked object changes as the object moves into frame.

As shown in the figure above, the ID of the car changes from 11 to 12 as the car moves into frame. This can be a problem because camera performance is poor at extreme angles, where the distortion of the fisheye lens is greatest, as discussed in the report by Bharti [6]. This makes the data noisy, which could affect the training data and in turn the models, since the video data differ substantially from the ground truth. However, since the amount of data from these cases is small compared to the whole training data set, this problem is not likely to affect the machine learning algorithms noticeably. As the distance increases, the same type of problems can occur.
These frames could be excluded by only considering objects that exist within a more limited field of view, or by decreasing the distance limit of the detection model to ensure better camera performance for all recorded frames. Another compromise, which is not ideal, could be to exclude IDs which only appear for a short time (<1 second) from the training data set.

5.8 Critical event detection

Currently, the relative kinematics of traffic objects can be derived from video, but critical events cannot yet be precisely detected. Firstly, the accuracy of the kinematics detected on naturalistic video data has not been verified. The naturalistic video does not have ground truth data for its traffic objects, so the accuracy of the traffic objects' kinematics cannot be rigorously proven. By observing the naturalistic video, we found that some objects labeled as persons are far away from the ego vehicle, yet the model predicts them as close, even though the model works well on the test video. As a result, we cannot fully trust the kinematics derived from the model when it is applied to naturalistic video. Secondly, the velocity and heading of the ego e-scooter are not included in the given naturalistic data; currently, we only have the relative position and velocity of the traffic objects. A more accurate trajectory prediction could be made by combining the relative kinematics of the traffic objects with the ego e-scooter kinematics, such as velocity and heading. Thirdly, a simple critical event detection algorithm could be developed, for example: if the predicted trajectory of a traffic object comes close to the ego e-scooter within 2 s, the algorithm is triggered. However, the accuracy of such an algorithm would not be good, which means the false positive and false negative rates would be high.
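The simple trigger described above can be sketched as a closest-approach test under a constant-velocity assumption. The 2 s horizon follows the example in the text, while the 1.5 m radius is an assumed threshold, not a value from the project:

```python
import numpy as np

def is_critical(rel_pos, rel_vel, horizon=2.0, radius=1.5):
    """Flag a critical event if the target's extrapolated position comes
    within `radius` metres of the ego e-scooter within `horizon` seconds
    (constant-velocity assumption; ego sits at the origin).

    horizon follows the 2 s example in the text; radius is an assumed
    threshold.
    """
    p = np.asarray(rel_pos, dtype=float)
    v = np.asarray(rel_vel, dtype=float)
    # Time of closest approach of p + t*v to the origin, clipped to [0, horizon].
    vv = v @ v
    t_star = 0.0 if vv == 0.0 else float(np.clip(-(p @ v) / vv, 0.0, horizon))
    return np.linalg.norm(p + t_star * v) < radius

# Target 6 m directly ahead, closing at 4 m/s: flagged as critical.
ahead = is_critical([0.0, 6.0], [0.0, -4.0])
# Same closing speed but passing 5 m to the side: not flagged.
passing = is_critical([5.0, 6.0], [0.0, -4.0])
```

As the text notes, a rule this simple would misfire often in practice; it only illustrates how relative position and velocity from the pipeline could feed a threshold-based trigger.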
In conclusion, the information derived from the current model is insufficient to detect critical events in the naturalistic video data.

6 Conclusion

In this section, the project is summarized, including the most important steps of the work. The work done in this project shows that it is entirely possible to estimate the position of road users, although with some uncertainties. For all road users, the whole model, including both position estimation and smoothing, can reach an R2 above 0.9 on the test track video. Furthermore, by manually inspecting the results, we also found that the model has a good position estimation accuracy on naturalistic video. Estimating road-user kinematics will assist in the identification of near-crash scenarios from naturalistic data. Using kinematics to extract critical interactions is not only more objective, as each viewer has their own definition of what is and is not a critical interaction, but also saves time compared to visual inspection.

Figure 57: A summarized view of the pipeline, where a regular video is analyzed to detect the different road users. The positions of the road users are then estimated using a k-neighbors regressor and filtered using DBSCAN. Finally, the annotated video with trajectories and the JSON file with the kinematic information are obtained.

The overall process of the project is summarized in Figure 57. A video saved from the e-scooter is fed to the object detection algorithm, which detects and classifies each road user. Both the distance and position of the road users are then estimated using the machine learning algorithms. To remove outliers and noise from the predictions, a DBSCAN-based noise remover filters out the outliers and produces a more accurate result. After that, an RTS smoother is applied to both estimate the kinematic information and improve the distance estimation accuracy.
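A minimal version of such a smoothing step — a constant-velocity Kalman filter followed by a Rauch-Tung-Striebel backward pass [18], applied to a sequence of distance estimates — could look as follows. The time step and noise parameters are illustrative assumptions, not the project's tuned values:

```python
import numpy as np

def rts_smooth(z, dt=0.1, q=1.0, r=0.25):
    """Constant-velocity Kalman filter plus Rauch-Tung-Striebel backward
    pass over scalar distance measurements z.

    Returns smoothed [position, velocity] per frame. dt, q (process
    noise scale) and r (measurement noise) are illustrative values.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])        # state transition
    H = np.array([[1.0, 0.0]])                   # we only measure position
    Q = q * np.array([[dt**4 / 4, dt**3 / 2],
                      [dt**3 / 2, dt**2]])       # white-acceleration noise
    R = np.array([[r]])
    z = np.asarray(z, dtype=float)
    n = len(z)

    x, P = np.array([z[0], 0.0]), np.eye(2)      # crude initial state
    xs_f, Ps_f, xs_p, Ps_p = [], [], [], []      # filtered / predicted
    for k in range(n):                           # forward (filter) pass
        if k > 0:
            x, P = F @ x, F @ P @ F.T + Q        # predict
        xs_p.append(x.copy()); Ps_p.append(P.copy())
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        x = x + K @ (np.array([z[k]]) - H @ x)   # update with measurement
        P = (np.eye(2) - K @ H) @ P
        xs_f.append(x.copy()); Ps_f.append(P.copy())

    xs = xs_f[-1]
    out = [xs]
    for k in range(n - 2, -1, -1):               # backward (RTS) pass
        C = Ps_f[k] @ F.T @ np.linalg.inv(Ps_p[k + 1])
        xs = xs_f[k] + C @ (xs - xs_p[k + 1])
        out.append(xs)
    return np.array(out[::-1])

# Linear motion demo: the smoother should recover roughly 2 m/s.
t = np.arange(50) * 0.1
demo = rts_smooth(5.0 + 2.0 * t)   # smoothed [position, velocity] per frame
```

Because the backward pass conditions each state on the whole measurement sequence, the smoother yields both a cleaner distance track and a velocity estimate, which is how the pipeline obtains relative kinematics from per-frame distance predictions.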
Finally, a JSON file containing the kinematic information and an annotated video with a plot showing the detected object trajectories are generated. Both of these final outputs can then be used to automatically and objectively identify critical interactions between different road users. Using this pipeline, video data from the e-scooter can be analyzed without a human needing to watch everything; by defining a threshold for critical interactions, it would be possible to objectively and automatically find these critical situations.

7 Future work

For future work, it would be interesting to apply the methods used in this project to a larger data set and compare the results. A larger data set would ideally contain more road users, making it possible to train the models in a more diverse environment, which would make them more representative of real traffic. It would also be beneficial to obtain more data for each of the different road users to increase the robustness of the models, especially for the car. In terms of the training of YOLOv7, similar projects could benefit from using a more traffic-specific training data set instead of the more general COCO training set. To further improve the YOLOv7 model, more training data is needed, since the performance of the models will likely increase with more diverse data of different road users. Also, the rule-based cyclist detection model used in this project could be improved, as it is not robust enough when a cyclist appears only for a short time. A deep learning-based Convolutional Neural Network (CNN) model could be built to distinguish between a person, a bicycle, and a cyclist, leveraging its effectiveness and accuracy in image recognition tasks.

References

[1] WHO. Road traffic injuries. url: https://www.who.int/health-topics/road-safety#tab=tab_3.
[2] European Commission. ITS Vulnerable Road Users. url: https://transport.
ec.europa.eu/transport-themes/intelligent-transport-systems/road/action-plan-and-directive/its-vulnerable-road-users_en.
[3] Kumar Apurv. E-scooter Rider Detection System in Driving Environments. 2021. url: https://doi.org/10.25394/PGS.15057183.v1.
[4] Kumar Apurv, Renran Tian, and Rini Sherony. "Detection of E-scooter Riders in Naturalistic Scenes". In: arXiv preprint arXiv:2111.14060 (2021).
[5] Jing Zhu and Yi Fang. "Learning object-specific distance from a monocular image". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 3839–3848.
[6] Karan Bharti. Estimating road-user position from a camera: a machine learning approach to enable safety applications. 2023. url: https://odr.chalmers.se/items/c5c843b1-773e-4c9d-a8a7-b291ca6614fd.
[7] Yury Davydov, Wen-Hui Chen, and Yu-Chen Lin. "Supervised object-specific distance estimation from monocular images for autonomous driving". In: Sensors 22.22 (2022), p. 8846.
[8] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors". In: arXiv (2022). url: https://doi.org/10.48550/arXiv.2207.02696.
[9] Andrew G. Howard et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications". In: arXiv (2017).
[10] Mark Sandler et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks". In: arXiv (2018).
[11] Gareth James et al. An Introduction to Statistical Learning. Vol. 112. Springer, 2013.
[12] Trevor Hastie et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. Springer, 2009.
[13] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer Science+Business Media, LLC, 2006.
[14] Trevor Hastie et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. Springer, 2009.
[15] Jeff Heaton.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep Learning. The MIT Press, 2016, 800 pp, ISBN: 0262035618. Vol. 19. 1-2. Springer, 2018, pp. 305–307.
[16] Erich Schubert et al. "DBSCAN revisited, revisited: why and how you should (still) use DBSCAN". In: ACM Transactions on Database Systems (TODS) 42.3 (2017), pp. 1–21.
[17] Caizhi Zhang et al. "Review of Clustering Technology and Its Application in Coordinating Vehicle Subsystems". In: Automotive Innovation (2022). url: https://doi.org/10.1007/s42154-022-00205-0.
[18] H. E. Rauch, F. Tung, and C. T. Striebel. "Maximum likelihood estimates of linear dynamic systems". In: AIAA Journal (1965).
[19] E. (n.d.) Downloadable dataset for e-scooter Rider Detection Task, and a trained model to support the detection of e-scooter riders. url: http://situated-intent.net/e-scooter_dataset/.