Detection and classification of marine vehicles

Master's thesis in Computer Science and Engineering

ATHANASIOS ROFALIS

Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2021
www.chalmers.se

Master's Thesis 2021:82

© ATHANASIOS ROFALIS, 2021.

Supervisor: Ola Benderius, Department of Mechanics and Maritime Sciences
Examiner: Ola Benderius, Department of Mechanics and Maritime Sciences

Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Detection and classification of marine vehicles.
Typeset in LaTeX, template by Magnus Gustaver
Printed by Chalmers Digitaltryck
Gothenburg, Sweden 2021

Abstract

One of the most common tasks within the computer vision field is the detection and classification of different objects. This thesis aims to deliver software that can be deployed in real-world scenarios and manage to detect and classify marine vehicles accurately. Using one of the pre-defined deep neural network models, You only look once (YOLO), we managed to achieve high performance on the detection and classification task. The training of the model took place using a specific dataset of grayscale images, which led to a model that can classify the objects with an accuracy of 68% and predict the relevant position with a mean average precision (mAP) of 0.77.
Moreover, the model was tested in different weather conditions and achieved an accuracy of 0.85% and an mAP of 0.068. In general, the YOLO model seems to be a robust detector that can be trained and deployed for efficiently detecting objects with high performance.

Keywords: Classification, detection, deep learning, computer vision, YOLO

Acknowledgements

First of all I would like to thank everyone who supported me during my master's thesis and gave me the motivation to continue. Secondly, I feel that I need to express my deepest acknowledgement to my supervisor at Chalmers University, Ola Benderius, who guided me during these months and supported me whenever I felt overwhelmed. Moreover, it is important to thank everyone who worked on Revere's autonomous ships project and managed to collect the data that was used for this thesis.

Athanasios Rofalis, Gothenburg, May 2021

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Research questions
  1.4 Limitations
  1.5 Outline

2 Background
  2.1 Object detection using tubelets
  2.2 Sea surface analysis for ship detection
  2.3 Dataset for ship detections
  2.4 Adversarial convolutional network
  2.5 Integration and delivery

3 Methods
  3.1 Annotation
  3.2 Training
  3.3 Model's execution

4 Results

5 Discussion
  5.1 Ethical perspective and data management

6 Conclusion

Bibliography

A Appendix 1

List of Figures

3.1 The learning procedure describing the steps of the training process.
3.2 The model's deployment receiving data either directly through a camera or extracted from a database.
4.1 Automated evaluation process including the three different types of users and the interaction with the web-service.
5.1 Percentage of the number of objects for each class.

List of Tables

3.1 The YOLOv2 model architecture.
4.1 YOLO evaluation on the training and test sets using two different thresholds for the IoU.
4.2 YOLO evaluation on the training and test sets in different weather conditions using two different thresholds for the IoU.

1 Introduction

Self-driving vehicles are considered the future, while the traditional driving style might largely be eliminated. During the last decades many companies have included advanced driver-assistance systems in their mass-produced vehicles, such as lane change assistance or emergency stop systems. Moreover, the Tesla corporation was the first company to take the next step and make autonomous driving technology accessible to everyone by producing a car capable of autonomous driving in certain driving scenarios. The software of the car is a complex achievement which, using cameras and other sensors, is able to identify objects while calculating the trajectory that the car needs to follow in order to protect the passengers. Therefore, the detection and classification of different objects is necessary.
That task can be handled by the combination of automotive engineering and computer vision.

1.1 Motivation

Modern computer vision techniques are able to solve complex tasks in a short period of time and also yield results that classic methods were not able to achieve. During the machine learning era, algorithms have been implemented that simplify the computer vision task of a system, which nowadays requires neither expensive hardware nor complex software. By establishing a model that is stable and accurate, the system might be able to prevent accidents while improving the driving experience. There are machine learning algorithms that are able to solve the classification and detection task accurately; however, it is important to ensure that the model performs at its highest potential by evaluating it after its training. A problem worth mentioning concerning machine learning is that a model does not detect and classify objects that it has not seen before. That issue is known as the interpolation and extrapolation problem and it is one of the drawbacks of machine learning. This means that the model needs to be executed in scenarios that it was trained for. For instance, traffic is an example of a mathematically complex scenario which is by nature unpredictable; an algorithm that has not been trained on that situation might perform poorly and could cause an accident by failing to avoid other objects. That problem is considered a reason machine learning has not been deployed for autonomous driving vehicles yet.

1.2 Goal

Part of the extensive research in the autonomous driving field takes place at the vehicle laboratory Revere at Chalmers University of Technology, in which theoretical and current algorithms are implemented and deployed in autonomous vehicles. This thesis aims to establish software that will be able to solve the object classification and detection task specifically for an autonomous boat.
The software will consist of a machine learning model trained for detecting marine vehicles using monochrome cameras placed on the boat.

1.3 Research questions

This thesis aims to present results regarding a detection and classification task. A method for producing such results is to deploy deep learning models using annotated data, so that the model becomes as accurate and robust as it can be. The data, which consists of monochrome images, was collected under a single weather condition; therefore, images with weather diversity, for instance blurry images due to fog or rain, could affect the model's performance and robustness. Thus, we need to ensure that the model is trained properly in order to minimize potential false detections while maximizing its accuracy in different weather conditions. Furthermore, since the model will be used in autonomous vehicles it is essential to make sure that the algorithm can be structured in a way that can be executed on the vehicle's processor while it is still possible to quickly update the software. Therefore, we introduce the continuous integration and continuous deployment (CI/CD) pipeline, which is a way of automatically testing and deploying an algorithm. Considering the aforementioned, there are three research questions that this thesis will try to answer:

• How much annotation data is needed to achieve high performance, with at least 90% correct classifications and an acceptable level of false detections, in detection and classification of at least five different ship classes?
• How well does the network behave in different weather conditions given that it was trained only on data with no rain, snow, fog, or wind?
• How can the algorithms be structured based on the microservice architecture and CI/CD pipelines?
1.4 Limitations

Considering the research questions mentioned above, we will try to answer them specifically for the autonomous ship and deploy the model in real case scenarios. The algorithm was not tested in different weather conditions or during night time. The machine learning model in this project was constructed specifically for the boat at the Chalmers Revere lab and it used real case scenarios from this boat throughout the learning process. Specifically, the data was gathered at Göta älv by a monochrome camera installed on the boat used for the project.

1.5 Outline

For this thesis project, software was developed for deploying the machine learning model. It runs alongside the OpenDlv software, implemented by the researchers at the Revere lab, which is responsible for processing and transferring the data to the next microservice. The main part of the software is a robust machine learning model that was trained, deployed and tested for maximized performance. It was also trained for a fast and accurate detection and classification of marine vehicles.

2 Background

The idea of autonomous ships was presented in the early 1970s by Rolf Schonknecht, who claimed that one day ships will not be controlled by a captain onboard; instead, they will be given directions from someone on land. As a first step in this direction, autopilots were being installed on boats during the last decades of the 20th century. Extensive research then started off in 2011 with a project at the Korea research institute of ship and ocean engineering (KRISO). A year later (2012), the European Union launched a similar project. These projects fueled this field of research, and governments, such as Norway's, as well as different corporations started to invest in similar projects, Lloyd's Register and Rolls–Royce to mention a few. Nowadays, there are many actors active in the field of autonomous ships with promising results.
For instance, Rolls–Royce marine announced in 2018 that they will have an autonomous ship ready to be deployed at sea by 2025. Likewise, a Japanese consortium has made public that they will also present an autonomous ship that same year [4]. The purpose of this thesis project is, as mentioned in previous sections, to detect and classify marine vehicles using cameras installed on boats. However, part of the work has also been to take a closer look at what has been studied and tested recently. There are many publications from the foregoing years regarding object detection combining computer vision and machine learning. The next sections describe a few of the methods used in recent studies.

2.1 Object detection using tubelets

The research paper 'Object detection from video tubelets with convolutional neural networks' investigated a way of making object predictions on videos. This specific inquiry was based on the 'large scale visual recognition challenge 2015 (ILSVRC2015)' and the ImageNet VID dataset. According to the authors, the proposed method can solve the task of object detection accurately and can be deployed on videos. The model that they used was a deep neural network, but the prediction of a potential object's position differs from the anchor box technique which is widely used. The first part of the algorithm was something the authors called a 'spatio-temporal tubelet proposal'. This approach means that a line parses the image and tries to identify objects that might appear. A score-based algorithm was implemented in order to predict the position of the objects. Consequently, they deployed the GoogLeNet and AlexNet models for calculating probabilities for each object while deleting proposals with small probability. However, due to a high variance, they included two additional steps in the model to minimize overfitting and achieve a higher performance.
Following that, they deployed max-pooling and reinstated the predicted bounding boxes during the tubelet procedure. Lastly, a four-layer 1-D CNN, called 'temporal convolutional network (TCN)', was trained to decide whether the predicted bounding box overlaps the ground truth by more than 50% [5].

2.2 Sea surface analysis for ship detection

A different approach was published in 2018, treating ship detection by analysing the sea surface. The data used for the model was high-quality images captured by a satellite. The most important part of this model is how it analyses the sea surface. Due to the fact that the images are taken by a satellite, different weather conditions might affect their outcome. For instance, fog and clouds can give blurry images, and wind naturally creates waves which can make the images distorted. Therefore, the authors created a mathematical equation based on the number of pixels. The model that they implemented and trained was a support vector machine (SVM) able to classify shape features. The proposed method, called SDSSA, was able to remove objects smaller than fifty pixels, including false detections, using the compactness and length-width ratio of the object. However, a key point of the method occurs before the removal of false predictions, when the model creates a list of potential ship candidates using a mathematical equation based on the pixels and the area of an object. After completing these three steps the remaining candidates are the detected objects that the algorithm outputs [6].

2.3 Dataset for ship detections

During 2018 a dataset was published which consists of 31,455 images of six different classes of marine vehicles. The name of the dataset is SeaShip and, compared to other datasets like VOC2007 or CIFAR-10, the size of that particular dataset is larger while the resolution of the images is also higher, specifically 1920×1080.
The images were extracted from surveillance videos from cameras at different locations in order to have diversity in the background and the weather conditions. Although the main purpose of the SeaShip paper was to present the dataset and the annotation procedure, the authors went a step further, also providing baselines and comparisons of three state-of-the-art models from 2018. These models were the faster region based convolutional neural network (faster R-CNN), the you only look once (YOLO) and the single shot multibox detector (SSD). The metrics that were used to evaluate the models' performance were the mean average precision (mAP) and the frames per second (FPS). The model with the highest mAP was the faster R-CNN with almost 0.930, and the model with the highest FPS was the YOLO with 91 [7].

2.4 Adversarial convolutional network

The region based convolutional neural network (RCNN) is a widely used model for detection and classification of objects. Although it is a robust model with good average performance, there has been much research trying to improve it. The authors of [8] proposed two methods in order to optimize the performance of the Fast-RCNN (FRCNN), which is a faster version of the RCNN model. Specifically, like data augmentation techniques, these methods change the images slightly in order to train the FRCNN model on more difficult images and consequently improve its performance. These models are the Adversarial Spatial Dropout Network (ASDN) and the Adversarial Spatial Transformer Network (ASTN). The ASDN is a model which removes parts of the images and feeds them as input to the FRCNN model, which predicts the position and class of a potential object. The ASTN model reforms the image, by changing the position of some blocks of the image, and again feeds it to the FRCNN for classification and detection.
After training the models using these techniques, the authors then tested them on the VOC2007 dataset and compared them to the standard FRCNN model using the mean average precision (mAP) metric. Specifically, the ASTN model achieved 58.1 and the ASDN 58.5, and combining them into one model gave 58.9, while the FRCNN managed to score 55.4 [8].

2.5 Integration and delivery

Modern software development is not only focused on implementing robust techniques, but also aims to support the development and deployment itself. Therefore, continuous integration and delivery techniques have recently been explored. A combination of software development and IT operations is development and operations (DevOps), a method for shortening the development cycle and delivering software of higher quality. The DevOps process can be divided into five phases: 1) continuous planning, 2) integration, 3) deployment, 4) testing, and 5) monitoring. The planning phase includes continuous communication with the customer while adapting the software at the same time. One important part of the whole procedure is the deployment, in which different hardware systems are used to test the integrated software. During the last two phases the software goes through testing by the developers in order to detect possible bugs that could affect the efficiency of the software. Moreover, every step in the DevOps process is executed completely automatically, even though each step relies on the previous one. The advantages of applying such a method are that the development cycle can be shortened and also that there is no need to run the software on different expensive hardware systems, since those have been replaced by cloud computing systems [9].

3 Methods

The purpose of this thesis is to establish software that can classify and detect marine vehicles. However, the procedure of software development needs to be divided into three parts: annotation, training, and execution.
The theoretical aspect of each of these steps can be described without any input from the previous one. However, during the implementation of the software every step relies on the previous one for achieving the highest potential performance of the model.

3.1 Annotation

In supervised learning the dataset, which is used to train the machine learning model to accomplish its task, is a fundamental part of the training procedure. In general, data annotation is considered to be an underlying part of machine learning, since it is the way of establishing the ground truth for training and testing. However, in order to maximize the performance of the model the dataset needs to fulfill two requirements. Firstly, it needs to be balanced, and secondly, properly annotated. A balanced dataset consists of approximately the same number of samples for the different classification classes. For instance, if the model tries to solve a binary classification of images then we need to make sure that the dataset has an equal number of images for each of the two classes. Since the main task of a machine learning model after the training process is to achieve maximum performance, imbalanced datasets negatively affect this goal and result in a model that makes predictions based on a non-uniform distribution [10]. However, some techniques can be deployed to secure a balanced dataset. Two of the most common techniques are simple random sampling (SRS) and the trial-and-error methods. According to the SRS technique, the sampling of the dataset is done randomly following the uniform distribution. In the trial-and-error methods, the SRS method is performed multiple times and afterwards the results are averaged [11]. To create and establish a dataset that is well-formed and able to train this project's model efficiently, we use the SRS technique to establish a balanced dataset.
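To make the SRS idea concrete, the balancing step could be sketched as follows. This is an illustrative Python sketch only; the file names and ship classes below are hypothetical and are not taken from the thesis dataset.

```python
import random

def balance_by_srs(samples, per_class, seed=0):
    """Draw the same number of samples per class using simple
    random sampling (SRS) under a uniform distribution."""
    rng = random.Random(seed)
    by_class = {}
    for path, label in samples:
        by_class.setdefault(label, []).append(path)
    balanced = []
    for label, paths in by_class.items():
        balanced.extend((p, label) for p in rng.sample(paths, per_class))
    rng.shuffle(balanced)
    return balanced

# Hypothetical example: three ship classes with unequal counts.
data = [(f"img_{i}.png", "cargo") for i in range(40)] \
     + [(f"img_{i}.png", "sail") for i in range(70)] \
     + [(f"img_{i}.png", "tug") for i in range(55)]
subset = balance_by_srs(data, per_class=40)
```

The per-class budget here is the size of the smallest class, so no class has to be oversampled; repeating the draw with different seeds and averaging the results would correspond to the trial-and-error variant mentioned above.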
Data annotation is the procedure through which the data is composed into input-output format, which is going to be used for the training of the model. In classification and object detection problems, it is important for the dataset to have the format of a vector (object-class, x, y, w, h), where object-class is the type of the object. Here, (x, y) is the starting point of the bounding box, followed by w and h, which are the width and height respectively, used to calculate the coordinates of the remaining three corners of the box. The data was captured by a monochrome camera installed on the boat, and was stored in .raw format files with resolution 3208×2200 and a bit depth of 10. However, the .raw images were scaled down and stored as .png files with resolution 1920×1200 and a bit depth of 16 bits. Specifically, a microservice was implemented using the OpenCV framework to access the .raw images and then produce the required .png images. Using the open-source software labelImg, the images were properly annotated into the (object-class, x, y, w, h) vector format, which is the annotation format that the thesis model supports.

3.2 Training

The training of the model is the part where a deep neural network learns through experience how to solve a specific task, which in our case is to detect and classify marine vehicles. Although the training is a fundamental part of the whole process, the architecture of the neural network is not the same for each task. There are many algorithms and models that are able to solve the detection and classification task efficiently, like the region based CNN (R-CNN) [12] or the single shot multibox detector (SSD) [13]. However, you only look once (YOLO) is a model that outperforms most of the detectors, and this is the reason why it was decided to use this model.
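As a side note on the image pipeline described in Section 3.1, the .raw-to-.png conversion step could look roughly like the following sketch. It is hedged: the exact byte layout of the camera's 10-bit .raw files is an assumption (taken here as one little-endian 16-bit word per pixel), and the actual microservice was built with OpenCV rather than this standalone Python.

```python
import numpy as np

def raw_to_16bit(raw_bytes, width=3208, height=2200):
    """Decode a monochrome .raw frame, assumed to store one 10-bit
    sample per little-endian 16-bit word, and rescale the values to
    the full 16-bit range expected by the .png output."""
    frame = np.frombuffer(raw_bytes, dtype="<u2").reshape(height, width)
    return (frame.astype(np.uint32) * 65535 // 1023).astype(np.uint16)

# Resizing to 1920x1200 and writing the .png (e.g. with OpenCV) would
# then be something like:
#   small = cv2.resize(img, (1920, 1200), interpolation=cv2.INTER_AREA)
#   cv2.imwrite("frame.png", small)
```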
The YOLO model is already trained on a specific dataset, but since we need it to be applicable to our marine vehicles dataset, the training will be based on the dataset we have collected. Despite the fact that the architecture of the model remains the same, it is essential to change some of its parameters. As presented in Figure 3.1, the process is split into two parts, with the first being the model's selection and the dataset's establishment. After forming the dataset, we need to annotate it and use data augmentation techniques, like blurring or changing the brightness of the images, to extend its size. Afterwards, we train the selected model on the dataset multiple times, while fine-tuning its parameters for maximum performance. Once the model reaches its highest performance, we export and deploy it. The YOLO model uses the bounding box technique over the image. Initially, the model splits the image into S×S grid cells. For each cell the algorithm makes predictions about possible bounding boxes while computing probabilities, named confidence scores. These scores represent the likelihood that each potential box has been placed correctly and contains an object. During the testing process, the model multiplies these probabilities with the confidence score for each of the classes, and selects the highest one [14]. However, the developers of YOLO wanted to improve it; consequently, they created an updated version that is faster and more accurate than the previous one. Therefore, they presented YOLOv2 by making changes compared to the first version.

Figure 3.1: The learning procedure describing the steps of the training process.

First of all, they doubled the input size of the images and instead of 224×224, which is the size of YOLOv1, they used 448×448 to train the classifier on the ImageNet dataset. Also, they changed the way of making predictions of the bounding boxes and instead used the anchor box technique.
According to that method, the model can generate bounding boxes larger than the grid cells and also link multiple objects to them. Moreover, they added batch normalization to every convolutional layer of the model, which led to an improvement in performance. Furthermore, YOLOv2 is able to detect more than the 20 classes that is the limit for YOLOv1. Since there is great demand for fast detectors, as in autonomous driving concepts, YOLO's authors implemented an improved model that was both robust and fast. Thus, instead of establishing a deeper model they reduced the number of layers. YOLOv1 consists of 24 convolutional layers, while YOLOv2 has 5 fewer, i.e., 19 layers, and the name of the model's architecture is Darknet-19, as shown in Table 3.1. That way, they managed to establish a faster version of YOLO while the robustness remained untouched [15]. In this thesis the second version of YOLO is going to be used, since it combines the robustness and the speed that a detector and classifier needs. Furthermore, because the model's architecture is not large, it could be faster during the training procedure, especially when the training process is performed on a single GPU. Additionally, the YOLO developers presented the YOLOv3 model, which is a larger deep neural network that gains in performance. The model's architecture is called Darknet-53 and consists of 53 convolutional layers. The model is able to detect smaller objects compared to the objects that YOLOv1 and YOLOv2 can detect. An important improvement is that YOLOv3 is able to make multi-label predictions, meaning that an object can be classified into its class but also categorized into a broader class that it belongs to [16]. However, the researchers continued their work and presented YOLOv4, which is the latest version of YOLO. According to their results, the YOLOv4 model has higher performance regarding the detection and localization part of the task but is also faster, which means that it has a higher FPS rate and allows the model to make predictions accurately in a shorter period of time [17].

Darknet-19

Type            Filters   Size/Stride   Output
Convolutional   32        3×3           224×224
Maxpool                   2×2/2         112×112
Convolutional   64        3×3           112×112
Maxpool                   2×2/2         56×56
Convolutional   128       3×3           56×56
Convolutional   64        1×1           56×56
Convolutional   128       3×3           56×56
Maxpool                   2×2/2         28×28
Convolutional   256       3×3           28×28
Convolutional   128       1×1           28×28
Convolutional   256       3×3           28×28
Maxpool                   2×2/2         14×14
Convolutional   512       3×3           14×14
Convolutional   256       1×1           14×14
Convolutional   512       3×3           14×14
Convolutional   256       1×1           14×14
Convolutional   512       3×3           14×14
Maxpool                   2×2/2         7×7
Convolutional   1024      3×3           7×7
Convolutional   512       1×1           7×7
Convolutional   1024      3×3           7×7
Convolutional   512       1×1           7×7
Convolutional   1024      3×3           7×7
Convolutional   1000      1×1           7×7
Avgpool                   Global        1000
Softmax

Table 3.1: The YOLOv2 model architecture.

During the learning procedure, the model needs to be evaluated multiple times, after each training, in order to have an insight into the performance of the model. For this project the evaluation is divided into two parts. Firstly, we need to evaluate how the classifier performs and secondly, how the detector does. Thus, we need to use three different metrics: the percentage accuracy of correctly classified objects, the percentage accuracy of misclassified objects, and the mean average precision (mAP), respectively. Since the dataset is annotated, we know in advance the optimal result that the model needs to predict. However, in order to compute these metrics we need to define the precision and recall. The mathematical expression of precision is

P = TP / (TP + FP)

and recall is

R = TP / (TP + FN)

where TP = true positive (predicted as positive and it was correct), TN = true negative (predicted as negative and it was correct), FP = false positive (predicted as positive but it was wrong) and FN = false negative (predicted as negative but it was wrong).
Therefore, the accuracy of the correctly classified objects is

Ac = (TN + TP) / (TN + TP + FN + FP)

and the accuracy of the misclassified objects is

Am = (FN + FP) / (TN + TP + FN + FP).

However, in object detection tasks the mAP is determined by the intersection over union (IoU), which is

IoU = AoO / AoU,

where AoO is the area of overlap, i.e., the area where the predicted box overlaps the ground truth box, and AoU is the area of union of the predicted and ground truth boxes. Setting up a threshold, we calculate the IoU for the predicted bounding box. If the IoU is bigger than the threshold, then we classify the prediction as TP, otherwise as FP. Repeating this process for every class in the dataset, we compute the average precision (AP), which is the area under the precision–recall curve. For the mAP we just compute the average of the AP of each class in our dataset. However, in order to report any results for the second research question we need to handle the images in a specific way. As mentioned before, the data collection was not conducted in different weather conditions. Therefore, the images do not have weather diversity, but it is important to establish a method for evaluating the trained model in different weather conditions. Accordingly, a script has been implemented in order to create images with different light exposure, snow, rain or fog conditions. Since these images are not part of the regular evaluation of the model, we are going to use the same metrics, namely accuracy for classification and mAP for detection. It is important to say that even though we are making changes to the images, the ground truth remains the same, because the images are not going to be rotated at all. Thus, the annotation made in the previous step has not changed and those files can be used for evaluating the model. The script that was used to create the diverse images can be found in Appendix 1.
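The evaluation quantities above can be made concrete with a small sketch. This is an illustrative Python version only, not the project's evaluation code; boxes follow the (x, y, w, h) annotation format with (x, y) as the top-left corner.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x, y, w, h),
    with (x, y) the top-left corner, as in the annotation format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # area of overlap (AoO)
    union = aw * ah + bw * bh - inter              # area of union (AoU)
    return inter / union

def classify_prediction(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a true positive if its IoU with the
    ground truth exceeds the chosen threshold, otherwise false positive."""
    return "TP" if iou(pred_box, gt_box) > threshold else "FP"
```

For example, two 10×10 boxes offset by (5, 5) overlap in a 5×5 region, giving IoU = 25 / 175 ≈ 0.143, which would count as a false positive at either of the thresholds (0.25 and 0.50) used in the results chapter.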
3.3 Model’s execution At this point, two of the main parts of the project have been successfully imple- mented and tested in order to make sure that the software is ready for the deploying 13 3. Methods Figure 3.2: The model’s deployment receiving data either directly through camera or extracted from a database. procedure. According to this step, we need to make sure that the implemented soft- ware is applicable to the autonomous boat and there are no conflicts between and the software and the hardware. Hence, we use the software development technique CI/CD, which allows us to test if the software is applicable to the platform without actually deploying it. For the CI/CD part of the process we use a docker file that allows us to integrate the software into a specific operating system with specific requirements. In our project, since the machine learning model requires GPU for fast execution, we use as a builder for the docker container the Linux Ubuntu distribution, specifically the 20.04 version while enabling the software responsible for executing the model using Nvidia Cuda. Since this docker image is a fresh operating system that lives into a docker container, we need to install the necessary packages needed for our software. Afterwards, using a cmake file we build the software into the container. The code is compiled and tested for any kind of conflicts, like non-installed packages. In this project there are two major programs that are important for this part and those are the Libcluon and the program that runs the model for detection and classification. The fist one is the software that is used for the communication between components in a distributed software system and the other one is the program that it is able to execute the YOLO model into the frames shared through the cluon software. As it is shown in the Figure 3.2, the process of the model’s deployment can be per- formed using two different methods. 
The first one is to deploy the model on data captured directly by the boat's sensors, and the second one is through the shared memory of a system, using images extracted from a database. In the first case, the camera captures frames while the OpenDlv software handles the data and transfers it to the opendlv-perception microservice, which executes the YOLO model and returns the potential detected objects. In the deployment through shared memory, the stored images are passed to the opendlv-perception microservice, which executes the YOLO model and makes predictions about potential marine objects and their relative positions. In both cases the procedure continues until the user stops the data input to the microservice, from the camera or the database respectively.

The evaluation of the model on new data is a complex task because it requires the annotation of the new images. Similarly to the Zooniverse website [18], a website could be implemented in order to extract the files from the server and, using an online service, annotate the images and save the annotation files. Using the shared memory technique, the model can then be trained on the new data and also be evaluated.

4 Results

This thesis aims to give a comprehensive answer to the research questions specified in the introduction. After training the model and performing multiple evaluations, we reached the following results.

The first research question concerns evaluating the trained model and giving a comprehensive answer about the number of images needed in order to achieve the highest possible classification and detection performance over the different classes. The dataset that we designed and annotated consisted of 1,000 images. The dataset was split into two sets, the training set and the test set, which were used to train and evaluate the model respectively. Table 4.1 presents the performance results of the model after training.
Table 4.1: YOLO evaluation on the training and test sets using two different IoU thresholds.

    IoU threshold = 0.25   Ac      Am       mAP
    Training Set           68%     31.86%   0.75
    Test Set               68%     31.87%   0.77

    IoU threshold = 0.50   Ac      Am       mAP
    Training Set           64%     35.5%    0.63
    Test Set               68.1%   31.87%   0.68

The second research question concerns evaluating the model under different weather conditions. Since the model was trained on a dataset with no weather diversity, it was evaluated using both the training and the test sets in order to get a better insight into how the model actually performs. Table 4.2 presents the results of the evaluation procedure; it is clear that the model achieved lower scores compared to the original sets without weather diversity.

Table 4.2: YOLO evaluation on the training and test sets under different weather conditions, using two different IoU thresholds.

    IoU threshold = 0.25   Ac      Am       mAP
    Training Set           18.4%   81.59%   0.1506
    Test Set               0.85%   91.48%   0.1106

    IoU threshold = 0.50   Ac      Am       mAP
    Training Set           16.2%   83.75%   0.107
    Test Set               0.7%    92.90%   0.068

The last question concerns an optimal design of the algorithm, giving an external user the opportunity to utilize the microservices and the model that we trained. A specially designed web-service platform will host the implemented microservices, and through that site external users will have access to the collected data.

Figure 4.1: Automated evaluation process including the three different types of users and the interaction with the web service.

As shown in Figure 4.1, the data are initially stored in a database. From that database we extract the data that will be used for running the model. If we have the data annotations needed as ground truth, then we can run the microservice immediately and report the results to the user. However, if the annotations are not available, we need to follow a different approach.
In that case, an external researcher needs to annotate the data manually; these annotations are then stored and used for training and evaluation. However, we need to make sure that the annotations are correct, which is accomplished through a filtering procedure. For instance, we do not gather data that has been annotated only once, but select data that has been annotated roughly the same way by more than one annotator. The next step is to train the model using the configuration that the developer has decided is optimal, evaluate the model, and display the results to the external user alongside the results of the other algorithms executed on the same data.

5 Discussion

The results indicate that the model is able to solve the task of classification and detection of marine vehicles. The first two questions are relevant to a machine learning model, specifically to its performance. Those questions concern the evaluation of the classification and object detection of marine vehicles based on data collected for Chalmers Revere's ongoing project on autonomous ships. Additionally, there is a question regarding a way of establishing a web service through which users will have access to the collected data and can either use the algorithms and get results, or develop and deploy an algorithm of their own. Regarding the first two questions, the model achieved a classification accuracy of 68% and a mAP of 0.77 on the test set, while on the training set the accuracy was 68% and the mAP 0.75. Concerning the last question, we gave the theoretical baseline for developing the required website and showed how external users and researchers could use the platform.

The dataset used for training the model consists of 1,050 grayscale images with a resolution of 1920x1080 and a bit depth of 10 bits.
The dataset is split into two sets: 1) the training set, which contains 1,000 images and their annotation files, and 2) the test set, which contains 50 images with their annotation files, around 5% of the initial dataset. The training set is used to train the model and the test set to evaluate its performance. Both sets are balanced, since they contain approximately the same number of objects for each of the five classes, as shown in Figure 5.1. Moreover, the sets were established so that there are no common images. That way the evaluation of the model is more accurate, and we can ensure that the testing is performed on images the model has not seen before.

Figure 5.1: Percentage of objects per class in (a) the training set and (b) the test set.

The training of the model is a procedure that requires many repetitions while changing and adapting the parameters of the model in order to reach its highest performance. However, we managed to train a model that is able to solve our task efficiently and make accurate predictions. The model was initially trained for 1,500 epochs with an input image size of 512x512 and achieved a correct-classification accuracy of 54.65% and a mAP of 0.576 on the test set. The training then continued for an additional 1,000 epochs with an input size of 608x608. It is worth mentioning that the model was trained on a single GPU, specifically an NVIDIA GeForce RTX 2060, for approximately 15 hours. For the performance evaluation, the configuration was changed to an input size of 896x896, which allowed the model to achieve the presented results. The input size was increased following the instructions of YOLO's authors, who mention that it is important to increase the input size of the model after its training.
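The disjoint train/test split described above can be sketched as follows; the function name and the fixed seed are illustrative assumptions, not the thesis code:

```python
import random

def split_dataset(image_names, test_size=50, seed=0):
    """Shuffle once, then split into disjoint training and test sets,
    so that no image ever appears in both."""
    names = sorted(image_names)            # deterministic starting order
    random.Random(seed).shuffle(names)
    test = set(names[:test_size])
    train = set(names[test_size:])
    assert not (train & test)              # the sets share no images
    return train, test
```

Applied to the 1,050 file names of the dataset, this yields the 1,000/50 split used in the thesis.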
Additionally, the model was trained for another 500 epochs with an input size of 640x640, but this did not outperform the previous configuration. However, even though the training of the model is considered short, YOLO managed to achieve remarkable performance. Specifically, with the IoU threshold set to 0.25, the model classified 68% of the objects correctly and reached a mean average precision of 0.77, meaning that it detected the objects but was not able to reproduce the bounding boxes exactly as the ground-truth annotations did.

On the other hand, the YOLO model did not manage to perform robustly when deployed on images with different weather conditions. That outcome is rational, considering that the model was not trained on images with weather diversity. Since the model is a deep neural network and needs data to be trained efficiently, a dataset containing images with different weather conditions could possibly have led to better performance. This is an issue that any machine learning algorithm needs to face: the interpolation and extrapolation issue, which means that an algorithm tries to solve a task that is harder than the one it was trained for. In our case, interpolation is the detection and classification of marine vehicles in images with the weather conditions the model was trained on. Extrapolation is the harder version of the task, namely the detection and classification of marine vehicles under different weather conditions. Since the deep neural network was not trained to solve the task on images with weather diversity, it performs poorly. The model tries to somehow separate the weather conditions from the images and still make accurate predictions about the marine vehicles that appear.
That task is not easy for the model to solve, since it was not trained on it, a fact that shows that the extrapolation problem remains one of the drawbacks of machine learning.

5.1 Ethical perspective and data management

Nowadays data is considered the new oil. Data is collected, stored, and handled in order to extract information, which in some cases trespasses on the user's right to privacy. Therefore, it is important to ensure that data is protected by security protocols, that no unauthorized users have access to it, and that it does not contain sensitive personal information. In May 2018, the EU enacted the General Data Protection Regulation (GDPR) to establish a legal framework for data protection. The GDPR sets boundaries on the collection, storage, and processing of data, and specifically regulates that data must not contain information that could connect it to the user it was collected from [19]. The data that this thesis processes consists of images collected by a marine vehicle and may not contain sensitive information such as names or national ID numbers. However, the images of other boats can enclose information that could eventually be personal. Consider an image of a boat: directly from the image, someone could extract the boat's name and registration number, information sufficient to look up the owner of the ship through Sjöfartsverket, the Swedish Maritime Administration.

Following the GDPR legislation, we need to ensure that there are no violations. As mentioned above regarding the design of the web service, there are three types of users: the developer, the annotator, and the user. It is important to make sure that each of them complies with the GDPR and personal information laws. The external user is not able to download and process the data, because they use the web service only to see the displayed results of the algorithms and have no direct access to the data.
Likewise, the annotator has no access to the data, since the annotation is done online and no data is downloaded. The developer, on the other hand, has access to the data and can download and process it, because the algorithm to be implemented needs to be tested locally for maximum performance. In that case, the GDPR takes immediate effect: before even downloading any data, the developers need to sign a contract stating that there will be no data leaks and that no data will be stored after the implementation of the algorithm.

6 Conclusion

The task of detection and classification of objects is considered a traditional task in the field of computer vision. Due to the extensive research on deep learning, there are many models that can be trained and deployed for this particular task. This thesis aims to detect and classify marine vehicles using data collected by a monochrome camera on a boat, as part of Chalmers Revere's ongoing project on autonomous ships. Training the model is not the only task that the thesis addresses; the thesis also proposes a system that integrates the model and its training into an end-to-end data process.

Following the presented method, the predefined deep learning model, YOLO, managed to classify the required objects with an accuracy of 68% and detect the correct bounding boxes in 77% of the cases. However, the model did not achieve the goal of at least 90% correct classification while at the same time keeping the percentage of incorrect classifications low. Additionally, a software for deploying the model in real-world scenarios was implemented, providing a tool to test the model and its performance on real-time detections.
Moreover, the thesis presented the theoretical baseline for implementing a public web service in which users and researchers will have access to the model and the data while testing several implemented algorithms on the same data.

The dataset that was used for training the model consists of 1,000 images, a number that could be increased in future work, since there is an adequate amount of recorded data from which additional images can be extracted. Also, the training of the model is considered short, due to the fact that it was performed on a single GPU, specifically an NVIDIA GeForce RTX 2060, for 15 hours. With a larger dataset and longer training, the model could become more robust and accurate, achieving higher performance than the one presented. The model could then reach the goal of at least 90% accuracy on the classification task while minimizing the misclassification percentage, and could achieve a higher mAP for object detection, which relates to the prediction of the objects' relative positions. Correspondingly, the model's performance on the dataset consisting of images with weather diversity is considered poor. This confirms the already mentioned interpolation and extrapolation issue of machine learning models, since both the classification accuracy and the mAP are low, which means that the model is not able to solve its task whenever the weather conditions differ from the ones it was trained on.

According to this thesis, a baseline has been established for the detection and classification of marine vehicles related to Chalmers Revere's project on autonomous ships. Handling data captured by a monochrome camera, a deep neural network model has been trained, providing results on the classification and detection of marine vehicles.
Moreover, if future work is conducted following the proposed recommendations, the model could become more accurate and robust. In conclusion, the current work hands over the process that can be followed in order to establish a dataset using the captured images, train the model, and eventually deploy it using software in a microservice architecture.

Bibliography

[1] Schmidhuber, Jürgen. "Deep Learning in Neural Networks: An Overview." Neural Networks 61 (2015): 85-117.
[2] Shahin, Mojtaba, Muhammad Ali Babar, and Liming Zhu. "Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices." IEEE Access 5 (2017): 3909-3943.
[3] Wang Qi, Fu Li and Liu Zhenzhong, "Review on camera calibration," 2010 Chinese Control and Decision Conference, Xuzhou, 2010, pp. 3354-3358, doi: 10.1109/CCDC.2010.5498574.
[4] "Timeline: Development of Autonomous Ships (1970s to 2018)." Infomaritime.eu, 2018. http://infomaritime.eu/index.php/2018/06/08/timeline-development-of-autonomous-ships/
[5] Kai Kang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 817-825.
[6] G. Yang, B. Li, S. Ji, F. Gao and Q. Xu, "Ship Detection From Optical Satellite Images Based on Sea Surface Analysis," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 3, pp. 641-645, March 2014, doi: 10.1109/LGRS.2013.2273552.
[7] Z. Shao, W. Wu, Z. Wang, W. Du and C. Li, "SeaShips: A Large-Scale Precisely Annotated Dataset for Ship Detection," IEEE Transactions on Multimedia, vol. 20, no. 10, pp. 2593-2604, Oct. 2018, doi: 10.1109/TMM.2018.2865686.
[8] Xiaolong Wang, Abhinav Shrivastava, Abhinav Gupta. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2606-2615.
[9] Virmani, Manish. "Understanding DevOps & bridging the gap from continuous integration to continuous delivery."
Fifth International Conference on the Innovative Computing Technology (INTECH 2015). IEEE, 2015.
[10] Provost, Foster. "Machine learning from imbalanced data sets 101." Proceedings of the AAAI'2000 Workshop on Imbalanced Data Sets. AAAI Press, 2000.
[11] Reitermanova, Z. (2010). Data splitting. In WDS (Vol. 10, pp. 31-36).
[12] Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440-1448).
[13] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37). Springer, Cham.
[14] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[15] Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, faster, stronger." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[16] Redmon, J., Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
[17] Bochkovskiy, A., Wang, C. Y., Liao, H. Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
[18] Zooniverse. 12 Dec. 2009, www.zooniverse.org/
[19] N. Gruschka, V. Mavroeidis, K. Vishi and M. Jensen, "Privacy Issues and Data Protection in Big Data: A Case Study Analysis under GDPR," 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 5027-5033, doi: 10.1109/BigData.2018.8622621.

A Appendix 1

The script that was used to reform the images by changing their brightness, blurring them, or adding snow or rain drops:

import argparse
import glob
import os
import random

import cv2
import numpy as np

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True, help="path to images")
args = vars(ap.parse_args())

inputFolder = os.path.sep.join([args["image"]])
folderlen = len(inputFolder)

def rescale(image8):
    # Scale 8-bit values back to the 16-bit range of the 10-bit source
    # images before saving (the uint16 cast avoids uint8 overflow).
    return ((image8.astype(np.uint16) + 1) * 256) - 1

for img in glob.glob(inputFolder + "/*.png"):
    image = cv2.imread(img, cv2.IMREAD_GRAYSCALE)
    randomWeather = random.choice(["Brighter", "Darker", "Rainy", "Snowy", "Fogy"])

    if randomWeather == "Brighter":
        # Add a random offset to every pixel (saturating at 255).
        randomBrightness = random.randint(10, 80)
        bright = np.ones(image.shape, dtype="uint8") * randomBrightness
        brightIncrease = cv2.add(image, bright)
        cv2.imwrite(inputFolder + img[folderlen:], rescale(brightIncrease))

    elif randomWeather == "Darker":
        # Subtract a random offset from every pixel (saturating at 0).
        randomDarkness = random.randint(10, 80)
        dark = np.ones(image.shape, dtype="uint8") * randomDarkness
        brightDecrease = cv2.subtract(image, dark)
        cv2.imwrite(inputFolder + img[folderlen:], rescale(brightDecrease))

    elif randomWeather == "Rainy":
        # Draw 1000-2000 slanted streaks, blur, and brighten the result.
        rain = []
        rain_drops = random.randint(1000, 2000)
        random_number = random.randint(-10, 10)
        for _ in range(rain_drops):
            if random_number < 0:
                x = random.randint(random_number, image.shape[1])
            else:
                x = random.randint(random_number, image.shape[1] - random_number)
            rain.append((x, random.randint(0, image.shape[0] - 5)))
        for drop in rain:
            start_point = (drop[0], drop[1])
            end_point = (drop[0] + random_number, drop[1] + 5)
            color = (175, 195, 204)
            cv2.line(image, start_point, end_point, color, 5)
        image = cv2.blur(image, (7, 7))
        rainy_image = cv2.add(image, 70)
        cv2.imwrite(inputFolder + img[folderlen:], rescale(rainy_image))

    elif randomWeather == "Snowy":
        # Same idea as rain, but with shorter, brighter, thinner streaks.
        snow = []
        snow_flakes = random.randint(1000, 2000)
        random_number = random.randint(-2, 2)
        for _ in range(snow_flakes):
            if random_number < 0:
                x = random.randint(random_number, image.shape[1])
            else:
                x = random.randint(random_number, image.shape[1] - random_number)
            snow.append((x, random.randint(0, image.shape[0] - 2)))
        for flake in snow:
            start_point = (flake[0], flake[1])
            end_point = (flake[0] + random_number, flake[1] + 2)
            color = (255, 250, 250)
            cv2.line(image, start_point, end_point, color, 2)
        image = cv2.blur(image, (5, 5))
        snowy_image = cv2.add(image, 70)
        cv2.imwrite(inputFolder + img[folderlen:], rescale(snowy_image))

    elif randomWeather == "Fogy":
        # Approximate fog with a strong blur.
        foggy_image = cv2.blur(image, (10, 10))
        cv2.imwrite(inputFolder + img[folderlen:], rescale(foggy_image))