Automatic Calibration of Multiple Fisheye Cameras Trajectory based Extrinsic Calibration of a Multicamera Rig Master’s thesis in Systems, Control and Mechatronics OLA SÆTHERN & DAVID HEIMAN DEPARTMENT OF MECHANICS AND MARITIME SCIENCES CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2021 www.chalmers.se www.chalmers.se Master’s thesis 2021:38 Automatic Calibration of Multiple Fisheye Cameras Trajectory based Extrinsic Calibration of a Multicamera Rig DAVID HEIMAN OLA SAETHERN Department of Mechanics and Maritime Sciences Division of Vehicle Engineering and Autonomous Systems Adaptive Systems Research Group Chalmers University of Technology Gothenburg, Sweden 2021 Automatic Calibration of Multiple Fisheye Cameras Trajectory based Extrinsic Calibration of a Multicamera Rig Ola Sæthern David Heiman © Ola Saethern and David Heiman, 2021. Supervisor: Yaowen Xu, CPAC Systems AB Examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences Master’s Thesis 2021:38 Department of Mechanics and Maritime Sciences Division of Vehicle Engineering and Autonomous Systems Adaptive Systems Research Group Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Feature tracking visualized in Gazebo VRX Simulator, with a customized setup of vehicle and objects. Typeset in LATEX, template by Magnus Gustaver Printed by Chalmers Reproservice Gothenburg, Sweden 2021 iv Automatic Calibration of Multiple Fisheye Cameras Trajectory based Extrinsic Calibration of a Multicamera Rig David Heiman Ola Sæthern Department of Mechanics and Maritime Sciences Chalmers University of Technology Abstract This thesis presents a new technique for automatic extrinsic calibration of fisheye cameras on boats. In contrast to most of the existing methods that require over- lapping field of view, our technique does not, which opens up for completely free placement of cameras. The method selected for this work was trajectory based cali- bration, meaning that the fixed relative pose between the cameras can be determined from the trajectories of the cameras. For development, Gazebo VRX simulator was used to generate the data for calibration, and KITTI odometry to evaluate algo- rithms. The method for generating the trajectories was visual inertial odometry and ground truth trajectories were extracted from the simulator. The calibration method was selected based on the availability of sensors onboard boats, for the ob- jective to generate accurate trajectories. It was found that the developed trajectory based calibration of relative pose between cameras is working, given an accurate trajectory. The method was independent of which algorithm is used to obtain the trajectory. However, there were difficulties getting visual inertial odometry alone to generate trajectories accurate enough for the simulated data sets. Keywords: VO, VIO, visual odometry, extrinsic calibration, trajectory based cali- bration, marine, VRX, Gazebo, Computer Vision. v Acknowledgements This thesis has been done as a final project to our master in Systems, Control and Mechatronics at Chalmers University of Technology. The work was carried out on behalf of CPAC Systems in spring 2021. We want to especially thank our supervisor Yaowen for guidance and brilliant ideas. We want to thank our examiner Peter for actively being present with years of experience with examination of thesis workers. Also for the inspiring talks on the weekly meetings with both Yaowen and Peter. And we want to thank CPAC systems for the opportunity! 
Ola Sæthern and David Heiman, Gothenburg, June 2021 Thesis advisor: Yaowen Xu, CPAC Systems AB Thesis examiner: Peter Forsberg, Department of Mechanics and Maritime Sci- ences vii Contents List of Figures xi List of Tables xiii 1 Introduction 1 1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Purpose and goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Theory 5 2.1 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Intrinsic calibration . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Extrinsic calibration . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Camera models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Pinhole camera . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2 Fisheye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2.1 Scaramuzza . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2.2 Kannala-Brandt model . . . . . . . . . . . . . . . . . 8 2.2.2.3 Unified projection model . . . . . . . . . . . . . . . . 8 2.2.2.4 Undistortion . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Stereo vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.1 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4.2 FAST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4.3 BRIEF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 Feature matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6 RANSAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.7 DLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.8 Epipolar geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.8.1 Essential matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.8.2 Fundamental matrix . . . . . . . . . . . . . . . . . . . . . . . 16 2.9 Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.10 Scale estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 ix Contents 2.10.1 Bundle adjustment . . . . . . . . . . . . . . . . . . . . . . . . 17 2.10.2 Loop closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.11 IMU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.12 Visual odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.13 Visual inertial odometry . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.14 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3 Methods 21 3.1 Choice of method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Simulator setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.1 Data recording . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 Detecting features on water . . . . . . . . . . . . . . . . . . . . . . . 24 3.4 Finding intrinsic and distortion parameters . . . . . . . . . . . . . . . 24 3.5 Obtaining a trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.5.1 Visual odometry . . . . . . 
. . . . . . . . . . . . . . . . . . . 25 3.5.2 Visual inertial odometry . . . . . . . . . . . . . . . . . . . . . 25 3.5.3 Ground truth camera trajectories . . . . . . . . . . . . . . . . 26 3.6 Finding extrinsic parameters . . . . . . . . . . . . . . . . . . . . . . . 27 3.7 Addition of noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.8 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4 Results 33 4.1 Visual odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Visual inertial odometry . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.3 Extrinsic calibration on ground truth trajectories . . . . . . . . . . . 37 4.3.1 Trajectories with added noise . . . . . . . . . . . . . . . . . . 37 5 Discussion 41 5.1 Evaluation of visual odometry . . . . . . . . . . . . . . . . . . . . . . 41 5.2 Evaluation of importance of more features . . . . . . . . . . . . . . . 41 5.3 Evaluation of feature extraction algorithms . . . . . . . . . . . . . . . 42 5.4 Evaluation of masking . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.5 Evaluation of the addition of gyro . . . . . . . . . . . . . . . . . . . . 43 5.6 Quality evaluation of trajectories . . . . . . . . . . . . . . . . . . . . 43 5.7 Evaluation of results with added noise . . . . . . . . . . . . . . . . . 44 5.8 Determination of section of trajectory . . . . . . . . . . . . . . . . . . 44 6 Conclusion 45 6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Bibliography 47 x List of Figures 1.1 How the visual algorithms relate to each other. . . . . . . . . . . . . . 2 2.1 A visual interpretation of the camera extrinsics, a rotation [R] and translation [t] is applied on the Base frame {B} to reach the Camera frame {C}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 The derivation of the disparity depth estimate can be drawn from this figure. CR and CL is the camera centers, P is a point, xL and xR is the image point referring to the same point in the world frame. The red and blue triangle have the same relationship between height and width, T Z = T−xl+xr Z−f . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 A visualization of epipolar geometry, xL and xR are the matched points in the image plane. The epipolar line is displayed in red from xR to eR(epipole) the intersection of the line between camera centers in the right image plane. The plane XOLOR is the epipolar plane. Note figure is taken from [33]. . . . . . . . . . . . . . . . . . . . . . . 15 2.4 An overview of coupled systems between camera sensors and IMU sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.1 An overview of the sensor setup. . . . . . . . . . . . . . . . . . . . . . 23 3.2 Bird’s eye view of the sensor setup. . . . . . . . . . . . . . . . . . . . 23 3.3 Ground truth transforms. . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4 Rig transforms, rigid link between the two cameras. . . . . . . . . . . 28 4.1 KITTI Odometry sequence 9. Different thresholds for detecting new features [100, 400, 700, 2000] for [a, b, c, d]. . . . . . . . . . . . . . . 34 4.2 KITTI sequence 9, SIFT detector, Flann matcher, Lowe’s ratio 0.75 and 0.85. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3 VIO Algorithm, SIFT, Lowe’s ratio [0.75, 0.75, 0.85, 0.95], [70, 100, 150, 200] update threshold, values of γ: [1.0, 1,0, 1.0, 1.0], meaning no IMU influence. . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 35 4.4 VIO Algorithm, FAST-BRIEF with the values of γ: [0, 0.3, 0.7, 1.0]. . 36 4.5 The cameras own trajectory viewed from their coordinate system. SIFT, Lowe = 0.75, γ = 0.3. . . . . . . . . . . . . . . . . . . . . . . . 37 4.6 VRX simulation run. Top row, ground truth with additive translation noise. Bottom row, ground truth with cumulative translation noise. Noise level N (0, 0.1). . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 xi List of Figures 5.1 Screenshot of vessel with and without a minimum euclidean distance condition for SIFT. In the top picture there is approximately 40 matches, seen as the red dots. . . . . . . . . . . . . . . . . . . . . . . 42 5.2 Resulting trajectory without max/min euclidean distance constraint between matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 xii List of Tables 2.1 Description of the intrinsic parameters and common values for a camera. 7 3.1 The recorded ROS messages and a description. . . . . . . . . . . . . . 24 4.1 The angular and translation errors for the transformation between front and left camera, for different values of additive noise. . . . . . . 39 4.2 The angular and translational error for the transformation between front and left camera, for different values of Cumulative noise. . . . . 39 xiii List of Tables xiv Nomenclature BRIEF Binary Robust Independent Elementary Feature CRF Corner Response Function DLT Direct Linear transform DoG Difference of Gaussian EKF Extended Kalman Filter EXIF Exchangeable Image file Format FAST Features from Accelerated and Segments Test FOV Field Of View GNSS Global Navigation Satellite System IMU Inertial Measuring Unit MLE Maximum Likelihood Estimation RANSAC Random Sample Consensus ROI Region Of Interest ROS Robot Operating System SIFT Scale-invariant Feature Transform SLAM Simultaneous Locating and Mapping SV D Singular Value Decomposition V IO Visual Inertial Odometry V O Visual Odometry xv Nomenclature xvi 1 Introduction In current years, the utilization of image sensors has increased for motor vehicles [17]. Advanced driver assistance systems is one example of its usefulness, helping with driving and parking for increased safety. For increased robustness, a combina- tion with other sensors such as LiDAR and radar can be employed. Those sensors can support the camera in situations with poor visibility or tough light conditions, e.g. direct sunlight, night time and snow. Implementation of camera functions can contribute to the safety of autonomous vehicles, where awareness of surroundings is critical. The awareness is needed to react to events and objects near the vehicle. Image sensors can also be used to reveal blind spots that would otherwise be invis- ible to the driver. However, high performance camera calibration is necessary for high quality computer vision applications, and in particular to fully utilize the potential of a camera as a sensor. In practise, calibration is divided into intrinsic and extrinsic. The intrinsic calibration aims at finding the correct way to undistort the camera image whereas the extrinsic to find the correct coordinates of the camera in some known coordi- nate system. Whenever there is a need to relate the camera to another coordinate system, e.g. when working with stereo cameras, the transform between those coor- dinate systems is needed. The transform is the cameras extrinsic parameters. 
For detailed explanations of these concepts, see Section 2.1 Since extrinsic calibration entails finding the relative pose of the cameras with respect to a base, it is necessary for image stitching, 3D reconstruction and more. To do extrinsic calibration, high quality odometry data and good key-frame matching can be used to calculate the transform between the cameras and base frame. Several approaches to gather odom- etry data are available depending on the sensor modalities: Visual odometry (VO), visual inertial odometry (VIO), laser sensors, global navigation satellite systems as well as loosely or tightly coupled combinations of these. The focus of this thesis is to build a framework that automatically calibrates the extrinsics of a rig of cameras by driving the vessel around. Inspiration from the different algorithms mentioned in 1.1 has contributed to creating a combination of techniques that result in good trajectories, considering difficulties imposed by the marine simulation environment. 1 1. Introduction 1.1 Related work Camera calibration is a well investigated field of computer vision, with many ap- proaches and usages. The intrinsic calibration methods are mostly the same, and uses a known pattern e.g. a chessboard. There exists a large variety in camera mod- els, Mei [2] and Kannala-Brandt [29] are some of them. Mei, used in OpenCV, and Kannala-Brandt can handle a wide variety of cameras. MATLAB’s calibration tool- box is developed by Scaramuzza [3] and use the Scaramuzza model. The common approach when doing visual odometry is to rectify the fisheye images by normalizing rectifying, using the parameters found doing intrinsic calibration. In direct visual odometry [22], however, greater field of view is reached by not converting fish-eye images to rectified images, at the cost of epipolar lines becoming epipolar curves for the plane-sweeping stereo matching. The approach is shown to have high perfor- mance for both motion estimate accuracy and point cloud quality. The intrinsic parameters are initialized and optimized over several images to per- ceive a pattern like a chessboard in the right way, all vertical lines should meet in a point. The thesis was mostly concerned with the extrinsic calibration, which can be done in a few different ways. Hand eye calibration is commonly used in robotics when calibrating the perceiving camera to the end effector of a robotic manipulator. G. Carrera proposes a SLAM-based calibration method [19] for a multi camera rig on a small wheeled robot. G. Carrera’s method requires the robot to make certain maneuvers which are hard to achieve on a boat. CamOdoCal [4] uses visual odometry to get the pose and transforms of the cameras, key-frames are selected and used for feature matching between the different cameras to find a relative pose. CamOdoCal also makes use of loop closures which makes it very similar to visual-SLAM. To see the context of visual-SLAM vs. structure from motion (SFM) and VO, see Fig. 1.1. CamOdoCal is a fully automatic calibra- tion pipeline for intrinsic and extrinsic calibration. The same team have also made a quicker variant using a prerecorded track in infrastructure based calibration [5], with similar results. Figure 1.1: How the visual algorithms relate to each other. 
In [23] another approach is chosen for doing extrinsic calibration, the existing meth- ods of comparing different odometry trajectories come up short because it can be prone to failure, takes longer time, requires initial guesses and is generally con- strained. The approach uses a common plane found in both cameras to get the 2 1. Introduction extrinsics. With only three plane correspondences, good results are achieved. The drawback with [23] approach is that it requires RGB-D cameras. While visual odometry can work just fine, for applications with sparse features and unreliable/changing features, it can be beneficial to add an additional sensor. In VINS-Mono [14] the integration of an IMU is introduced. The IMU always provides an estimate of pitch and roll which does not drift, along with some initial information about the position by integrating the acceleration twice (deteriorate quickly over time). On the bleeding edge of visual odometry research there is among others OV2SLAM [13], which outperforms the previous best slam methods running real time. The initial approach in this thesis was to attain a trajectory in a similar way as CamOdoCal, but had the odometry aided with the addition of an IMU. The IMU addition was for robustness during time windows of poor features, which is usually the case in marine environments. 1.2 Purpose and goal The primary goal was to replace the current manual calibration with an automatic calibration of a camera rig. The calibration was to work on a marine vessel of arbitrary size. A part of the purpose behind the task was that time needed for au- tomatic calibration of a camera rig is a fraction of what manual calibration requires. The manual calibration done at time of writing required the boat to be on land to take measurements. Taking the boat on land was not always possible and always costly. A second part of the purpose was that with one algorithm, any marine ves- sel’s camera rig could be calibrated, thereby creating a generalized solution where at present only a case-by-case solution exists. Furthermore, a generalized solution would allow re-calibration in case any camera suffered an off-set as time progressed. After implementation of the calibration, the next step was to use the camera rig to build and maintain a seamless 360◦ view of the surroundings and additional other views like disparity map, stixels [20] and segmentation if considered useful. 1.3 Problem statement This thesis targets the problem of accurately calibrating a large camera rig, found on marine vessels. Challenges arise when trying to archive good visual odometry on water. On large vessels there are often multiple cameras arbitrarily placed, which is a factor factor that makes manual calibration a time consuming task, hence creating a need for automatic calibration. 1.4 Limitations • No considerations were made for feature rich environments with high quality features, which might have been available in a natural harbor or a harbor with varied and high amount of objects. 3 1. Introduction • The computing power limited the frame rate for the recorded data sets, which had a negative impact on some algorithms. • For some functions in the software, open-source code was used in order to save time. • Intrinsics were assumed to be provided either by the camera manufacturer, or from the data sets. 1.5 Contribution This thesis contributed with a specific calibration algorithm. 
Most noteworthy were the ability to use non-overlapping cameras and possibility to deploy in environments poor of features, e.g. marine environments. Although we have used existing soft- ware parts where possible, our method is unique and offer advantages over existing ditto. The advantages such as being able to use generically obtained trajectories for calibration. In addition, the pipeline we have created is simple to understand and to build upon and was implemented in Python and Robotic Operating System (ROS). 1.6 Thesis outline The thesis is split into five chapters: introduction, theory, method, results and conclusion. The introduction is a short study of relevant work in similar areas and description of the problem, followed by the theory chapter describing the theory needed to solve the problem of automatic extrinsic calibration. In the methodology chapter, the approach taken and choice of hardware and software are covered. The chapter also covers how the results are collected. The result section compares how well different methods perform and the final chapter concludes the study. 4 2 Theory This chapter contains explanations of theory segments necessary for understanding the method. 2.1 Calibration To utilize the full potential of cameras as sensors, it is important to have well calibrated camera parameters and a correct transform (rotation and translation) between camera pairs in a rig. Furthermore, with well calibrated cameras it is possible to achieve good surround views, disparity calculations etc. In addition, camera matrices can be split into intrinsic and extrinsic parts as such: P = K[R|t] (2.1) Here K is the intrinsic part and the R|t matrix the extrinsic part respectively. Cal- ibration can essentially be regarded as to fit a model to match some a priori known properties. 2.1.1 Intrinsic calibration Intrinsic calibration is to estimate the camera parameters K which normalizes the image. A normalized image has the principle axis in the image center. The coor- dinates are normalized and pixels are corrected for skewness, aspect ratio for non square pixels and distortions from the lens. Calibration process of camera intrinsics is usually carried out by capturing a known object, such as a checkerboard pattern, from several views. The inner and outer corners will be detected using an automatic corner detection algorithm or manually selected. The intrinsic parameters needed will differ depending on the camera model of choice, but the focal length and image center is usually initialized with the known values from Exchangeable Image File Format (EXIF) data of the images. The rest of the parameters for distortions, aspect ratio and skew are initially set to 0. The initial values of the intrinsic parameters are refined with a nonlinear optimization method. The objective of the optimization is to minimize the sum of the reprojection errors on all images. To find the intrinsic parameters of a camera, the camera first needs to be modelled. See Section 2.2 for an explanation of how modelling a camera can be done. Since intrinsic parameters are typically provided by manufacturers, the parameters are assumed to be known in this case. 5 2. Theory 2.1.2 Extrinsic calibration The goal of the extrinsic calibration is to estimate the relative position and orienta- tion with respect to a global coordinate frame shown in Fig. 2.1. The estimation is done by finding the rotation matrices and translation vectors for the different cam- eras. There are a few ways to do this. 
Manually measuring can be very useful for simple camera rigs but is tedious work on large and complex rigs. There are also automatic ways of doing this, with different constraints and methods. The different methods for automatic extrinsic calibration are explained in Chapter 3.

Figure 2.1: A visual interpretation of the camera extrinsics: a rotation [R] and translation [t] is applied to the base frame {B} to reach the camera frame {C}.

2.2 Camera models
Various lenses are modelled differently to represent the projection from 3D point coordinates to image plane coordinates. The most common approach is the pinhole camera model, which assumes no lens and is modelled after the first cameras ever made. Alternatively, one can use the Kannala-Brandt or Mei models, which better represent fisheye lenses.

2.2.1 Pinhole camera
The pinhole model considers focal length, skew, image center and aspect ratio. It assumes the aperture is small enough to be considered a point. The projection of a pinhole camera is the following:

x = PX    (2.2)

P = K[R|t] = \begin{bmatrix} \gamma f & s f & x_0 \\ 0 & f & y_0 \\ 0 & 0 & 1 \end{bmatrix} [R|t]    (2.3)

Table 2.1: Description of the intrinsic parameters and common values for a camera.
Symbol      Description                                              Typical value
f           Focal length                                             3-1000 mm
s           Pixel skew, f tan(α) if the pixel is a parallelogram     ~0
γ           Aspect ratio, for non-square pixels                      1
[x0, y0]    Image center                                             [w/2, h/2]

The normalized points x can then be corrected for radial distortion, which can be done with a simple polynomial model [7] as shown in Eq. (2.4):

\hat{x} = x(1 + k_1 r + k_2 r^2)
\hat{y} = y(1 + k_1 r + k_2 r^2)    (2.4)

In Eq. (2.4), r is the squared distance from the origin, r = x^2 + y^2. The sign of k_1 usually determines whether the radial distortion is of barrel or pincushion type. Tangential distortion, on the other hand, is present when the lens is not parallel to the sensor and is often neglected. Despite being a crude approximation of cameras with lenses focusing the light, adjustable apertures and varying focal lengths, the pinhole model can still handle distortions quite well.

2.2.2 Fisheye
Fisheye lenses are wide-angle lenses with an extreme field of view (FOV) of up to around 180◦. They are commonly seen in surveillance and rear-view cameras because they cover large areas with few cameras. Fisheye cameras do not have a single viewpoint, but the small cluster of viewpoints is close enough to be treated as one. Due to the extreme view they offer, they cannot be modeled with a pinhole camera model and a simple polynomial radial distortion model. Instead, there exist a few other commonly used camera models that can capture everything from regular lenses to omnidirectional lenses.

2.2.2.1 Scaramuzza
The Scaramuzza model [3] can represent all view angles, from narrow FOV to omnidirectional lenses up to 195◦. In this report, we assume that the sensor is centered and aligned with the lens. The following model is proposed to capture the lens distortion:

P = \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \lambda \begin{bmatrix} u \\ v \\ f(\rho) \end{bmatrix} = \lambda \begin{bmatrix} u \\ v \\ a_0 + a_1\rho + a_2\rho^2 + a_3\rho^3 + a_4\rho^4 \end{bmatrix}    (2.5)

Here ρ is the distance between the principal point and the pixel coordinate, ρ = \sqrt{u^2 + v^2}. To speed up calibration, fewer parameters should be used, and a reasonable simplification for fisheye lenses is

\left. \frac{df}{d\rho} \right|_{\rho = 0} = 0    (2.6)

such that a_1 = 0. One could potentially choose a higher polynomial degree, but Scaramuzza found that a 4th-order polynomial gave the best trade-off in calibration results.
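As an illustration of how Eq. (2.5) is used, the following Python sketch back-projects a pixel (given relative to the principal point) to a viewing ray direction. It is only a sketch of the model equation: the function name and the polynomial coefficients are placeholders, not values from this thesis or from Scaramuzza's toolbox.

import numpy as np

def scaramuzza_ray(u, v, coeffs):
    # Back-project a pixel (u, v), expressed relative to the principal
    # point, to a viewing ray using the polynomial of Eq. (2.5).
    # coeffs = [a0, a1, a2, a3, a4]; a1 is normally fixed to 0 by Eq. (2.6).
    a0, a1, a2, a3, a4 = coeffs
    rho = np.hypot(u, v)                    # distance to the principal point
    w = a0 + a1*rho + a2*rho**2 + a3*rho**3 + a4*rho**4
    ray = np.array([u, v, w], dtype=float)  # P = lambda * [u, v, f(rho)]
    return ray / np.linalg.norm(ray)        # the scale lambda is removed

# Placeholder coefficients, for illustration only; real values come from
# an intrinsic calibration of the actual lens.
coeffs = [-180.0, 0.0, 1.2e-3, -2.5e-6, 4.0e-9]
print(scaramuzza_ray(120.0, -45.0, coeffs))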
2.2.2.2 Kannala-Brandt model Kannala-Brandt model [29] is a camera model which covers most types of lenses, and handles fisheye lenses better than the pinhole model. It is used in Infrastructure- Based Calibration of a Multi-Camera Rig [5], where it is described as an accurate model for the fisheye lens, but with the downside of a high computation time. The perspective projection is described by 5 equations, where one is for pinhole model and the rest is for fisheye variants. [29] proposes a general polynomial function instead, to handle every fisheye camera type. For feasibility [5] proposes a reduced model for fisheye cameras, the reduced model ignores the tangential distortion and uses 8 parameters [k1, k2, k3, k4, u0, v0, fx, fy]. The general description of radial distortion for this camera model is Dr(α) = (a1θ + a2θ 3 + a3θ 5)(p1cos(ϕ) + p2sin(ϕ) + p3cos(2ϕ) + p4sin(2ϕ)) (2.7) and the tangential can be described by Dt(θ, ϕ) = (b1θ + b2θ 3 + b3θ 5)(q1cos(ϕ) + q2sin(ϕ) + q3cos(2ϕ) + q4sin(2ϕ)) (2.8) θ and ϕ are the two angles needed to describe the direction from which a light ray enters the fisheye lens, similar as one see in light refraction between air and water. The mapping from an incoming light ray to the image coordinate is then( x y ) = r(θ) ( cos(ϕ) sin(ϕ) ) (2.9) where r(θ) = o1θ + o2θ 3 + o3θ 5 + o4θ 7 + o5θ 9 . (2.10) Describing the radius of the image coordinate from the principal point in this manner allows us to accurately handle various types of lenses. 2.2.2.3 Unified projection model This model is often referred to as Mei camera model [2] and it treats the mirror and camera as one sensor. It is built for omnidirectional cameras and is similar to Scaramuzza, but with one important exception which is the assumption that for Mei there is only a single viewpoint for the fisheye. To model the radial distortion a three parameter model is used: L(ρ) = 1 + k1ρ 2 + k2ρ 4 + k5ρ 6 (2.11) The tangential distortion dx abides by Eq. (2.12). dx = [ 2k3xy + k4(ρ2 + 2x2) k3(ρ2 + 2y2) + 2k4xy ] (2.12) 8 2. Theory 2.2.2.4 Undistortion When a camera captures an image the lens usually causes a distortion in the image to some extent. Distortion effect is undesirable and should be removed, first one have to normalize the image coordinates. Normalization is reached by multiplying with K−1 and returns the image with coordinates centered at 0 and with a range of ±1. Secondly, one should apply the distortion from the lens, depending on which model is used (Mei, Scaramuzza, Pinhole). Applying the distortion will move all pixels, after which an interpolation method is applied to fill the empty spots with estimated values. For a fisheye type of lens, the corners will be most distorted and have a lower resolution. The reason is that they have the greatest distance from the center of the image, and the correlation is shown in Eq. (2.4). Finally, to get back to pixel coordinates again, the result is multiplied by K. Resulting in an undistorted image. 2.3 Stereo vision Stereo vision is becoming increasingly more popular for depth estimates. The phe- nomena used to gain a depth estimate is disparity, the distance between the same point in two images, which is also one of many ways humans sense depth. In stereo cameras, the cameras are usually locked in place in one common plane. In Fig. 2.2 the geometry behind the concept is explained. 9 2. Theory Figure 2.2: The derivation of the disparity depth estimate can be drawn from this figure. 
CR and CL is the camera centers, P is a point, xL and xR is the image point referring to the same point in the world frame. The red and blue triangle have the same relationship between height and width, T Z = T−xl+xr Z−f . When manipulating the expression from the relation between the two triangles, one is left with an expression for depth Z: Z = Tf xl − xr (2.13) What determines the range and accuracy of the depth estimates is the resolution, lens and baseline between cameras T. Since xL and xR is measured in pixels, there is no continuous scale and the absolute max depth possible to calculate is Tf , when the disparity is 1 px. Another thing that might affect the accuracy is how well features is matched between stereo images, this varies with the texture of the surfaces of the measured objects. An interactive spreadsheet can display these relations in [30]. 2.4 Feature extraction To find and track objects in images, these objects are broken down into small parts such as corners or other features with characteristic properties. These features are 10 2. Theory first detected through tests of e.g. changing illumination, then described more specif- ically. The description is used to match features that have been detected in different images. Good and reliable feature extraction is the foundation for many computer vision applications. When features are properly described you can find the same features in another image even if it is scaled, rotated or illuminated differently. Several methods exists for doing this, some of the most important is described in this section. 2.4.1 SIFT Scale Invariant Feature Transform (SIFT) is a feature detection and description method consisting of the following steps [34]. Scale-space extrema detection To find points of interest a search over all scales and image locations is performed. A difference of Gaussians (DoG) function is used. Define an image’s scale space as L(x, y, σ) = G(x, y, σ) ∗ I(x, y) (2.14) where G(x, y, σ) is a Gaussian of changeable scale G(x, y, σ) = 1 2πσ2 e −(x2+y2)/2σ2 , (2.15) I(x, y) is an input image and ∗ denotes a convolution. Then the difference of Gaus- sians becomes D(x, y, σ) = L(x, y, kσ)− L(x, y, σ) . (2.16) In Eq. (2.16) k = 21/s and s is the number of stages that each octave is divided into. With the DoG function a good approximation to the so called scale-normalized Laplacian of Gaussians can be made, which means that true scale-invariance is achieved. The scale space is contructed like a pyramid, with the bottom layer being an octave of full-size images with DoG from increasingly blurred images. Each octave is down-sampled by a factor of 2. Keypoint localization Extrema are found by iterating over each DoG image, checking if the center point in a 3-by-3 square is a minima or maxima and then comparing to the corresponding squares in the closest neighboring DoG images. If the contrast is higher, then a point of interest has been found. A certain threshold is set for how strong contrast is needed to consider the keypoint robust to noise and therefore worth keeping. Another robustness aspect of keypoints is whether or not they can be accurately localized, and one case where that is difficult is when a keypoint is along an edge. Along edges, the DoG function might yield a similar enough result further down the sides of the edge. To resolve that issue, keypoints with a ratio of principal curvatures above a decided threshold are rejected. 11 2. 
Theory Orientation assignment A feature orientation is assigned from local image properties. The descriptor be- comes relative to the assigned orientation and not the rotation of the image, thus becoming rotation invariant. An orientation histogram is created from gradient ori- entations of sample points in the vicinity of the keypoint. The histogram covers 360◦ in 10◦ intervals. Keypoint descriptor Computation of descriptor by taking a 16 by 16 sample array divided into 4 by 4 subregions. For each sample point the magnitude and orientation is calculated. Then a Gaussian blur is applied on the sample array. For each subregion the updated values of sample points are accumulated, such that the magnitudes are added and the orientations are divided into 8 generalized directions. Hence, forming a 128 element feature descriptor. 2.4.2 FAST Features from Accelerated Segment Test (FAST) [35] is a feature detector which compares a pixel’s px intensity to 16 surrounding pixels placed more or less in a circle around the pixel. A least number m of these pixels that should be either less or more intense, for the pixel to be considered a corner can be chosen. A threshold is applied to ensure distinctiveness of the feature. A typical case is m = 12, and for that case the test can be accelerated by at first ascertaining that at least three of four pixels placed in compass directions from px were more or less intense. To avoid several pixels being detected adjacent to each other, non-maximum suppression can be applied. 2.4.3 BRIEF Binary Robust Independent Elementary Features [36] (BRIEF) is a feature descrip- tor method, unlike SIFT the feature points are described as binary strings of test results where each test can be described by Eq. (2.17). In Eq. (2.17), the test result depends on the input smoothed patch p and the intensities p(x), p(y) of p at given image coordinates x, y. τ(p;x, y) = 1 if p(x) < p(y) 0 otherwise (2.17) Common descriptor sizes ns are between 128 to 512 bits. The feature descriptor fns can be composed according to Eq. (2.18). fns(p) := ∑ 1≤i≤ns 2i−1τ(p;xi, yi) (2.18) 12 2. Theory 2.5 Feature matching The two most commonly used matchers are the brute force matcher and FLANN [37] based matcher. Brute force matcher as the name suggests goes through all keypoints in both images and matches the ones with the smallest distance i.e the best match. Distances are typically Hamming or L2-norm. L2- norm is the Euclidean distance and can be used for SIFT with its float value descriptors. Hamming distance is in this case the number of elements in a binary string that differs. It works well for feature descriptors such as BRIEF which descriptors are binary strings. Fast Library for Approximate Nearest Neighbor (FLANN) based matcher builds a kd- tree with descriptors and does not run through every possible descriptor to find a nearest neighbor, instead it finds an approximate best match via the kd-tree. Thus, becoming efficient for larger sets of descriptors compared to brute force matcher. 2.6 RANSAC Random sample consensus [27] (RANSAC) is a model fitting algorithm, see Algo- rithm 1. Given the least number of data points needed to estimate a model, the data points that fit this model within a predefined threshold are called inliers. The best model is defined as the model with the most inliers. It is found through iteration until either enough iterations have passed, or until it is very likely that the best model has been found. 
The inlier ratio ε is the ratio of inliers compared to the total number of points. Probability of picking n inliers randomly εn and probability of a sample containing at least one outlier (1 − εn) are used to determine when the algorithm can finish more rapidly. Algorithm 1: RANSAC Input: m data points; Initialization:k = 0, ε = ε0, kmax = log(η)/log(1− εn); while k < kmax do Model estimation from n random data points; Estimate number of inliers; if currInliers>bestInliers then Update best model; Update ε = (number of inliers)/m; Update kmax = log(η)/log(1− εn); end end Return: Best model 13 2. Theory 2.7 DLT Direct linear transformation (DLT) is a simple way of finding a model. The first step is to compose a system matrix consisting of a number of linear equations, depending on the problem it needs a different amount of points and equations, e.g. camera resection N >= 6 and homography estimation N >= 4. The goal is to find the null space of the system matrix M, this is done with a singular value decomposition and the solution lies within the top N elements of the rightmost column of V (the right singular matrix). DLT can be used as a minimal solver in an RANSAC algorithm, to avoid using outliers in the estimate. 2.8 Epipolar geometry Epipolar geometry is a way of relating two or more cameras to each other, it gives a way of perceiving the depth and gaining their relative pose to each other. When capturing an image of a scene, the information about the depth is lost. All the points lying on the line, projected by λx = PX (viewing ray) of camera 1, refers to the same image coordinate. With the addition of another camera, it is possible to get the correct depth of the 3D point from the matching point xR. xR can be quickly found by searching along the epipolar line and not the whole image, this is called the epipolar constraint. To find the epipolar line, the first camera is set to be in [I|0] pose and the second camera is rotated and translated by P2. The scene point X can bee seen as a function of λ, so we get Eq. (2.19). Namely, the equation for the epipolar line. P2X(λ) = [ R2 t2 ] [λx 1 ] = R2λx+ t2 (2.19) In Fig. 2.3, the green triangle OLXOR is called the epipolar plane. The projection of the other camera centers passes through the epipole eL = P1OR, the epipole is represented by the null space of the corresponding camera matrix. The epipole is always present, but can only be found in the image if the cameras can see each other. All the possible epipolar lines from different image coordinates passes through the epipole. A line can be described by two points, and the general line equation is lTx = ax + by + c = 0. Another useful trick to know is that the cross product between two vectors give a perpendicular vector to both of them. Taking the cross product of two lines will also return the point of the intersection of the lines. 14 2. Theory Figure 2.3: A visualization of epipolar geometry, xL and xR are the matched points in the image plane. The epipolar line is displayed in red from xR to eR(epipole) the intersection of the line between camera centers in the right image plane. The plane XOLOR is the epipolar plane. Note figure is taken from [33]. 2.8.1 Essential matrix An essential matrix is the foundation of epipolar geometry, it contains information about the rotation and translation between the cameras. As opposed to the funda- mental matrix, the essential matrix can only handle normalized image coordinates. However, when the image is normalized and rectified then E = F. 
The essential matrix is defined by Eq. (2.20) E = R[t]× (2.20) That satisfies the epipolar constraint. xT2n Ex1n = 0 (2.21) Here, the notation of t is the skew symmetric matrix of t, i.e. a matrix composed of the cross product. The essential matrix has the following properties. Two constraints on the singular values, two of them should be equal, and one should be zero. These constraints should always be enforced. The determinant of E should be zero and the epipolar constraint should be fulfilled. One can find the essential matrix in a few ways, if the fundamental matrix F and the camera intrinsics K is known then E can be found with Eq. (2.22) E = KT 2 FK1 (2.22) It can also be found directly from five or more point correspondences, using either RANSAC or singular value decomposition (SVD). Calculating it with RANSAC is 15 2. Theory most efficient, a minimal model like the 5 point algorithm from David Nister [21] is applied to random samples and the solution from the samples that gave the most inliers is chosen. A solution from SVD can yield better results if the images are not too noisy. From the essential matrix, the transform between the two camera centers can be recovered. The first camera is always set to be at 0 with no rotation, and then R and t can be found from the singular value decomposition of E, [U,S,V] = svd(E), where R and t are described by Eq. (2.23) t = U [1 : 3, 3], R = UWV T (2.23) Here W is selected in such a way that it satisfies the constraints on the singular values of E: W = 0 −1 0 1 0 0 0 0 1  (2.24) 2.8.2 Fundamental matrix The fundamental matrix is similar to the essential matrix, except the addition of the intrinsic parameters so it handles pixel coordinates directly. F = K−1T 2 EK−1 1 (2.25) F is not restricted the same way with the singular values, but has a similar epipolar constraint. uT2 Fu1 = 0 (2.26) Similar to the essential matrix, the fundamental matrix is also rank deficient and its determinant is 0. To ensure the determinant is 0, the last singular value is set to 0. For finding the fundamental matrix having 7 degrees of freedom, the 7 or 8 point algorithm is chosen to solve for F. It can be done using RANSAC or DLT. 2.9 Triangulation Triangulation is essentially to find the coordinate of a 3D scene point visible in two cameras. Triangulation is done by taking the matched points and set them up in a direct linear transform (DLT) equation. The goal is to solve Eq. (2.27) for the common 3D point X. λ1x1 = P1X...λ2x2 = P2X (2.27) To get rid of the scale λ, the cross product is used and we get Eq. (2.28) x1 × (P1X) = 0 (2.28) Because the multiplicand and the multiplier have the same direction the cross prod- uct is equal to zero, writing out the cross product in Eq. (2.28) P n is the nth row 16 2. Theory of P . (P 3 1X T )y − (P 2 1X T ) = 0 (P 1 1X T )− (P 3 1X T )x = 0 (P 2 1X T )x− (P 1 1X T )y = 0 (2.29) From Eq. (2.29) we can factor out X and we get the DLT matrix MX = 0 of size 4N × 4(N number of cameras), finally the 3D scene point is found by taking the singular value decomposition (SVD) and decomposing the right singular vector. 2.10 Scale estimation Scale estimation is needed because when using tools from epipolar geometry one get the transform between two cameras, but the translation is only defined up to scale. The scale can either be estimated from a ground truth, triangulated 3D points or from an additional sensor e.g. GNSS or IMU. 
For visual odometry, image data is the only type of data available, and the scale is estimated from the difference between the matched triangulated points of the current and previous frame:

\frac{\|t_{k-1,k}\|}{\|t_{k,k+1}\|} = \frac{\|X'_{0:k-1} - X_{0:k-1}\|}{\|X'_{1:k} - X_{1:k}\|}    (2.30)

In Eq. (2.30), X' represents the matched scene points in the new frame and X those from the old frame. From Eq. (2.30) an array of ratios is obtained; a common approach to get a relative scale is simply to select the median of the array, to reduce the effect of outliers. One can aid the scale measurement either by adding another camera (stereo vision) or an IMU or GNSS sensor. Equipped with a calibrated stereo camera setup, one can calculate the disparity between matches, see Section 2.3, and get a good point cloud with correct depths. Then, applying the same method as for monocular vision on the 3D points gives the scale. With the addition of an IMU, one can loosely or tightly couple it to get better estimates of the metric scale. The accelerometer data is usually integrated between the frames only, and used together with a bias and measurement error model and a Kalman filter. The integration approach is shown in [25].

2.10.1 Bundle adjustment
Bundle adjustment (BA) refers to the bundles of light rays that go from a feature point to a camera, and the aim is to reduce the reprojection errors. BA can be described as the problem of refining a visual reconstruction by optimizing the 3D scene points, relative motion parameters and viewing parameter estimates. Given normalized image points x_{i,j}, find 3D scene points X_j = [X_j; 1] and camera matrices P_i = [R_i | t_i]. To reduce reprojection errors, the cost function

\sum_{i=1}^{n} \sum_{j=1}^{m} \left\| \left( x^1_{ij} - \frac{R^1_i X_j + t^1_i}{R^3_i X_j + t^3_i},\; x^2_{ij} - \frac{R^2_i X_j + t^2_i}{R^3_i X_j + t^3_i} \right) \right\|^2    (2.31)

is minimized. Here the subscript of a rotation matrix R specifies the camera and the superscript specifies the row of the rotation matrix. Bundle adjustment optimizes camera poses, scene points, or both.

2.10.2 Loop closure
When a vessel needs to keep track of its position and pose only from camera feed, small errors accumulate continuously. They can only be removed if there is a way to recognize when the vessel returns to a previously visited location; the accumulated error in position and pose can then be erased, while also correcting the locations estimated since the vessel was last at the revisited location. To make these adjustments, the vessel must be able to recognize that it is in the same location as before. The combination of detecting such occurrences and adjusting the positions and poses is called loop closure.

Loop closure detection can be divided into two parts: visual place recognition and geometric verification. Different options for visual place recognition include bag of visual words, deep learning, visual feature matching, and kd-tree search [8]. The procedure to match images with bag of visual words is to first convert images to numerical vectors, and then compare these vectors in the bag-of-words space [11]. The benefit of bag of words is the speed and efficiency of the image matching [11]. The downside is perceptual aliasing, namely that two locations are erroneously found to be the same. Image matching with kd-tree search can be summarized as follows. Key frames are images that are either spatially far apart or have limited overlap, where the first frame is always a key frame. For all such key frames the features are added to a feature pool D.
Perform a nearest neighbor search, in the current KD-tree, for each feature in a current image. Then for the top K key frames, with the most matched features to a current image, do a similarity test checking whether a loop closure has been found. The similarity test uses feature matching on image level and is accurate. The benefits of this method is the accuracy of the image matching, while the problem can be scalability of the environment since no offline feature extraction is performed, thus creating computational speed issues for larger environments. [9] 2.11 IMU An inertial measurement unit (IMU) is used to measure the specific force and angu- lar velocity, and with these measurements the position is later calculated. In visual inertial odometry, sensor output from the cameras and the IMU can be either loosely coupled or tightly coupled. The difference of how the sensor data flows for the two approaches is illustrated in Fig. 2.4. Inspiration for Fig. 2.4 was taken from [38], however different sensors were used and a simpler illustration was made. The point was that for loosely coupled sensor setups, the data from one sensor might go through an algorithm before being fused 18 2. Theory Figure 2.4: An overview of coupled systems between camera sensors and IMU sensors. with data from another sensor. In the loosely coupled setting a common approach is to use the extended kalman filter (EKF), here the state propagation typically stems from IMU data and the pose calculated from camera feed is used for the update [14]. The EKF is a filtering method, while there are also optimization based techniques for visual inertial odometry. 2.12 Visual odometry Visual odometry mono or stereo is a method for generating a trajectory solely based on observations from cameras. VO is a cheap and usually very good way to get accurate estimates of the pose. The approach has some shortcomings when it comes to robustness in featureless environments and when the object comes to a stop. The stereo methods can yield some improvements to the mono scheme because of better scale estimation. The simplest approach of visual odometry is described below. 1. Firstly features have to be matched between the current and previous frame. 2. After two frames are captured and matched, the essential (or fundamental matrix if the intrinsic is not available) is found according to description in Section 2.8. 3. From the essential matrix the pose of the current frame with respect to [I|0] is decomposed. 4. With the new pose the matched points are triangulated to a current 3D point cloud. 19 2. Theory 5. From the current point cloud and previous point cloud the relative scale is estimated as described in 2.10 6. Calculate the current total position and rotation: Rtotk = RkRtotk−1 (2.32) , Posek = Posek−1 + RtotktkS (2.33) In addition to the simple approach, one can add several local and global optimizing steps and loop closure. However, logging frames, matches, poses and 3D points is needed then and it essentially becomes a SLAM algorithm. 2.13 Visual inertial odometry Visual inertial odometry combines camera feed and IMU feed to estimate a trajec- tory. Where the visual odometry is less robust in certain situations, the addition of an IMU can provide extra stability. E.g. when the object travels through an area of few features, then it has support in the IMU measurements to know if it is changing its trajectory. 
To get the information about the position from the IMU, the accelerometer readings are integrated twice and this will accumulate errors quickly. The orientation can on the contrary be accurately estimated. Roll and pitch can be observed directly due to the gravity vector, and the yaw can be integrated from the gyroscope which is accurate over shorter periods. 2.14 Framework ROS Robotic Operating System is a framework which is commonly run on embed- ded machines. It is structured in such a way that it is one master or core, with one or more nodes related to it. Each node typically performs some function and exchange messages in between the other nodes. The node can either publish or sub- scribe to a topic, e.g. in a car one node could measure the state of the vehicle and publish among other velocity messages to a topic. Then a cruise control node would subscribe to the same topic and use the velocity messages to control the acceleration needed to keep a constant velocity. A nice feature of ROS is that all messages have a timestamp so it is easy to synchro- nize, and one can easily find the sensor updates in between each frame captured. 20 3 Methods Camera calibration includes estimating parameters for intrinsics, distortion and ex- trinsics. The intrinsic parameters are [fx, fy, cx, cy, s, α] and the distortion param- eters varies depending on the complexity of the lens. All the mentioned parame- ters are fitted by using images with known features, often images of chessboards. The extrinsic parameters are transformation matrix containing rotation matrix and translation vector. The rotation matrix from a base orientation, usually one of the cameras’ orientations, to another camera’s orientation. Likewise, the translation from a base position to a camera’s position Regarding extrinsic calibration, the most common situation is that a stereo camera is used, which shares a large portion of FOV and the features in this shared section are used to find the parameters. When there is no shared FOV between cameras, or the nature of the lens reduces the quality of the overlap beyond a certain degree, other methods of estimating extrinsic parameters are needed. Some categories that have proven to yield accurate results are trajectory, SLAM, planar and mirror based methods [24]. Mirror based methods require a calibration pattern such as a chessboard to be visi- ble to all cameras with the help of mirrors. With static camera pose and calibration pattern, the mirror pose can be calculated and in turn the relative pose between cameras can be calculated. Problems with mirror based methods include obtaining high quality captures of calibration pattern due to mirror poses, where there is a difficulty in properly arranging mirror poses while handling constraints imposed by the physical environment [24]. Trajectory based methods cameras are fixed on a rig, with relative poses between cameras that do not change regardless of how the rig moves. An image sequence per camera is produced, during which time the rig has to experience complex movements or the performance can decrease [24]. SLAM based methods estimates the extrinsics after registering the scene details with each camera. Challenges include accurate 3D reconstruction of the scene [24]. Planar calibration [23] can be done if all cameras are RGB-D type. Plane calibration method require the cameras to view one plane from several views or have several planes visible in one image in each camera. 
The algorithm works by first giving an initial guess of the extrinsics to find plane correspondences in each camera and the normal vector of the plane visible in each camera. From the plane correspondences they do a MLE (maximum likelihood estimation) of first relative rotation then the translation. Plane calibration method is quick and reliable. 21 3. Methods For applications with 360◦ view it is common to have some overlap between the different frames. With overlap it may be possible to estimate the essential matrix and from that recover the relative rotation between the two cameras and a translation up to scale. A problem with finding features on the edges of frames from cameras with a high field of view, is that the edges are of lower resolution due to the construction of the lens. 3.1 Choice of method To the best of our knowledge there were not any papers on automatic calibration of non overlapping cameras for marine vessels to date. A consideration to take into account for extrinsic calibration on vessels is that water is not well suited for finding good features. When good features are not available one should have additional sensors along cameras to estimate the pose. A water vehicle does not have the same traction as a land going one, and the same motion models do not apply directly. Camera placement is not as well defined as in cars, and there are large varieties in size and construction. The choice in this report was based on the need for a general solution, that worked for a lot of different setups. E.g. fisheye lenses, regular cameras and stereo cameras. It was also not desired to require external cameras, mirrors or calibration patterns. The planar calibration method would have been a good option. However, it would have limited us to RGB-D cameras. The factors mentioned above lead the way to a trajectory based method. The trajectory based methods could be done on all cameras and the position estimates could be fused together with other available sensors on a vessel. 3.2 Simulator setup For maximal flexibility and availability to get exactly the data set desired with no cost, a simulator was chosen. The selected simulator is based upon ROS and Gazebo, the simulator world and vessel was built by Open Robotics for their Virtual RobotX competition[26]. The platform based upon Marine Advanced Robotics vessels, is equipped with an IMU, LiDAR, cameras and GPS. For this work the original sensor package was modified. The boat was fitted with 4 cameras, placed 90◦ upon each other and every camera was aimed downwards with an angle of 15◦. The new sensor setup was specified in a yaml file used to generate a new urdf file for the vessel. The camera sensor itself was replaced with a wide angle camera model with 180 ◦ FOV, which was done by modifying a camera sensor specification file wamw_camera.xacro. In Fig. 3.2 the modified sensor setup is shown in bird’s eye and side view. 22 3. Methods Figure 3.1: An overview of the sensor setup. Figure 3.2: Bird’s eye view of the sensor setup. To enrich the water with features, several docks were added to the Gazebo world in order for all the cameras to have some features available. 3.2.1 Data recording The data set was collected in ROS bag files. The ROS topics recorded are displayed in Table 3.1. The bag file was recorded while steered in a circuit with the keyboard and a localization node was running to get the /tf topic. With four cameras the laptops used in this thesis manage recording images at 6 FPS, IMU is kept at 15 Hz and tf at 100 Hz. 
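For reference, the sketch below shows how such a recording can be read back in Python with the standard rosbag API; the bag file name is hypothetical and only a few of the topics from Table 3.1 are selected.

import rosbag

# Hypothetical bag file name; the topics match Table 3.1.
bag = rosbag.Bag("vrx_circuit.bag")
topics = [
    "/wamv/sensors/cameras/front_camera/image_raw",
    "/wamv/sensors/imu/imu/data",
]

# read_messages() yields messages from the selected topics in time order.
# Every message carries a header stamp, which is what is later used to
# associate the 15 Hz IMU samples with the 6 FPS camera frames.
for topic, msg, t in bag.read_messages(topics=topics):
    if topic.endswith("image_raw"):
        print("image at", msg.header.stamp.to_sec())
    else:
        print("imu   at", msg.header.stamp.to_sec(), msg.angular_velocity.z)

bag.close()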
Table 3.1: The recorded ROS messages and a description.

| ROS topic | Description |
| /tf | Dynamic transforms, world -> base, base -> propeller |
| /tf_static | All static transforms |
| /wamv/sensors/cameras/front_camera/image_raw | Raw image data |
| /wamv/sensors/cameras/right_camera/image_raw | Raw image data |
| /wamv/sensors/cameras/back_camera/image_raw | Raw image data |
| /wamv/sensors/cameras/left_camera/image_raw | Raw image data |
| /wamv/sensors/cameras/front_camera/camera_info | Camera info: K, resolution, distortion, ... |
| /wamv/sensors/imu/imu/data | Angular velocity, linear acceleration and timestamp |

3.3 Detecting features on water

Compared to roads, water is not stationary in any way: the water surface is highly reflective and changes shape constantly. Because of the wide field of view, parts of the boat are visible in some of the cameras. Features tracked on the boat distort the estimated trajectory, because such features appear to be stationary. These stationary features are a problem when few other features are available, since they will then be considered inliers by the RANSAC algorithm. For this reason it was desired to detect as few features as possible on the water and on the vessel itself. To avoid features on the vessel, a black mask was multiplied with the images. The black region is not tracked by most trackers, but SIFT sometimes finds features along its edge; when SIFT was used, a region of interest (ROI) therefore had to be defined in which features were tracked. The good features came from objects that are relatively stationary on the water, e.g. docks, buoys, pillars or objects on the coastline. The docks added to the simulator environment were identical to each other, so a constraint was added requiring matches in two consecutive frames to be at most a maximum distance and at least 2 pixels apart. The maximum distance is directly correlated to the frame rate and was adjusted accordingly.

3.4 Finding intrinsic and distortion parameters

The focus of this thesis was not on finding the intrinsic parameters. For public data sets, e.g. KITTI odometry [28], the intrinsic parameters are provided. In the simulator, the camera intrinsic matrix was provided in the camera info message published by the camera sensor. The distortion parameters, resolution and FOV could be set in the sensor specification file wamv_camera.xacro.

3.5 Obtaining a trajectory

The chosen calibration method depends on independent trajectories from all of the cameras. To calibrate the extrinsics of the cameras solely based on trajectories, the trajectories need to be accurate.

3.5.1 Visual odometry

Visual odometry is a simple but effective way of obtaining a trajectory. Its drawback is that it requires a good amount of features spread out evenly in the image, because it can then:
• track more features over a longer time before computing new features, and
• get a better estimate of the essential or fundamental matrix.
A commonly used trick is to exclude any motion that is not forward dominated. This constraint does not always hold: when turning through sharp corners the boat also slides sideways a bit, and from the perspective of the camera on the stern of the boat the motion is of course backward.

3.5.2 Visual inertial odometry

Visual inertial odometry was used to add stability to the visual odometry when there was a lack of features. To get the correct yaw rotation, an IMU was used.
The IMU was mounted on the platform of the boat with its x axis in the driving direction, so that the rotation around the z axis represents the heading of the boat. The coordinate system of the gyroscope in the IMU was transformed to match the camera with Eq. (3.1),

\[
R_{\text{optical}} =
\begin{bmatrix}
0 & 0 & 1 \\
-1 & 0 & 0 \\
0 & -1 & 0
\end{bmatrix} .
\tag{3.1}
\]

The gyro was considered noise free in this application, a reasonable approximation since a bias-compensated IMU typically drifts by about 1°/min and the sequences used in this report were typically shorter than one minute. The update frequency of the IMU was set in the IMU specification file, and that frequency f was read from the rosbag. There were typically 4-5 IMU updates per frame when the IMU was running at 15 Hz, and they were integrated between frames to obtain the rotation between frames,

\[
r_{\text{IMU}} = \omega \frac{1}{f} .
\tag{3.2}
\]

The final rotation for a frame update was found with the simple complementary filter shown in Eq. (3.3),

\[
r = (1-\gamma)\, r_{\text{IMU}} + \gamma\, r_{\text{cam}} .
\tag{3.3}
\]

Both the angular velocity measurements from the IMU and the poses were filtered with a short moving average filter to remove non-smooth behavior.

3.5.3 Ground truth camera trajectories

For testing the extrinsic calibration algorithm, a ground truth trajectory was used. In order to connect the transformations needed by the algorithm with the data extracted from the simulator, an understanding of the simulator coordinate systems is needed. Fig. 3.3 shows the relation between the simulator world coordinate system and the vehicle. The vehicle has a base, from which further transformations are needed to reach each individual camera.

Figure 3.3: Ground truth transforms.

In a real world scenario, when estimating trajectories for each camera, the first motion T_i^1 for camera i would be set to

\[
T_i^1 =
\begin{bmatrix}
I & 0 \\
0^{\mathsf T} & 1
\end{bmatrix} ,
\tag{3.4}
\]

i.e. the origin. The transformation T_i^k then represents a motion from this "origin", i.e. the first camera frame of camera i, to the kth camera frame of the same camera. Naturally, the pose of the kth frame could then be calculated as a subsequent multiplication of each transform up until that frame. However, for the ground truth based on the simulator values, this generated issues, which stemmed from the representation of the data in the simulator. The time-varying rotation and translation from world to vehicle base coordinates were read from the ROS topic tf, with frame id wamv/odom and child frame id wamv/base_link. Static data was read from tf_static; these transforms were defined when creating the vessel in the simulator. The four static links used from vessel base to camera center are:
• Base to post - T_base_to_post
• Post to arm - T_post_to_arm
• Arm to camera - T_arm_to_cam
• Camera to optical - T_cam_to_opt
where T_base_to_post is unique for each camera while the remaining links are identical. To obtain a transformation T_{s,i} from base to optical frame, the calculation

\[
T_{s,i} = T_{\text{base\_to\_post},i}\, T_{\text{post\_to\_arm}}\, T_{\text{arm\_to\_cam}}\, T_{\text{cam\_to\_opt}}
\tag{3.5}
\]

was made. Another necessary transformation was T_d^k, from the simulator world coordinates to the current base of the vehicle. Let the scenario with all the transforms of the simulator be described by Fig. 3.3. From Fig. 3.3, it can be deduced that the desired transformation T_i^k corresponds to

\[
T_i^k = T_{s,i}^{-1}\, (T_d^{0})^{-1}\, T_d^{k}\, T_{s,i} ,
\tag{3.6}
\]

where T_{s,i}^{-1} was calculated by applying Eq. (3.7) to Eq. (3.5), and (T_d^{0})^{-1}, the inverse of the world-to-base transform for the first set of frames, by applying Eq. (3.7) to that transform.
The inverse of a homogeneous transformation T is given by Eq. (3.7),

\[
T^{-1} =
\begin{bmatrix}
R^{\mathsf T} & -R^{\mathsf T} t \\
0^{\mathsf T} & 1
\end{bmatrix} .
\tag{3.7}
\]

3.6 Finding extrinsic parameters

For each camera, a sequence of transformations T_i^k from a reference pose to the current pose was obtained from ground truth. The reference pose of a camera was defined by setting the pose of the 0th frame to an identity matrix. The calibration requires transformation data for at least two cameras, a master camera 0 and slave cameras i. To recover the relative pose ΔT_i between two cameras set up according to Fig. 3.4, two major steps were taken: solving for the relative rotation ΔR_i, and solving for the relative translation Δt_i given the estimated ΔR_i.

Figure 3.4: Rig transforms, rigid link between the two cameras.

From Fig. 3.4, in combination with an assumption of rigidity, Eq. (3.8) was derived:

\[
T_0^k\, \Delta T_i = \Delta T_i\, T_i^k .
\tag{3.8}
\]

The construction of a pose T from a rotation matrix R and a translation vector t is shown in Eq. (3.9),

\[
T =
\begin{bmatrix}
R & t \\
0^{\mathsf T} & 1
\end{bmatrix} .
\tag{3.9}
\]

Given Eq. (3.9), Eq. (3.8) can be divided into two constraints, where Eq. (3.10) is for rotation and Eq. (3.11) for translation:

\[
R_0^k\, \Delta R_i = \Delta R_i\, R_i^k
\tag{3.10}
\]

\[
R_0^k\, \Delta t_i + t_0^k = \Delta R_i\, t_i^k + \Delta t_i .
\tag{3.11}
\]

Examples of the vectorization of a matrix and of the Kronecker product of two matrices are shown in Eq. (3.12) and Eq. (3.13) respectively, where the vectorization stacks the matrix row-wise (matching NumPy's default ordering):

\[
\operatorname{vec}\!\left(
\begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{bmatrix}\right)
=
\begin{bmatrix}
a_{11} \\ a_{12} \\ a_{21} \\ a_{22}
\end{bmatrix}
\tag{3.12}
\]

\[
\begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{bmatrix}
\otimes
\begin{bmatrix}
b_{11} & b_{12} \\
b_{21} & b_{22}
\end{bmatrix}
=
\begin{bmatrix}
a_{11}b_{11} & a_{11}b_{12} & a_{12}b_{11} & a_{12}b_{12} \\
a_{11}b_{21} & a_{11}b_{22} & a_{12}b_{21} & a_{12}b_{22} \\
a_{21}b_{11} & a_{21}b_{12} & a_{22}b_{11} & a_{22}b_{12} \\
a_{21}b_{21} & a_{21}b_{22} & a_{22}b_{21} & a_{22}b_{22}
\end{bmatrix}
\tag{3.13}
\]

Vectorization and the Kronecker product are useful because of the property

\[
\operatorname{vec}(M_1 M_2 M_3) = (M_1 \otimes M_3^{\mathsf T})\operatorname{vec}(M_2) ,
\tag{3.14}
\]

which allows Eq. (3.10) and Eq. (3.11) to be rewritten as

\[
(I_9 - R_0^k \otimes R_i^k)\operatorname{vec}(\Delta R_i) = 0
\tag{3.15}
\]

and

\[
(I_3 - R_0^k)\,\Delta t_i = t_0^k - \Delta R_i\, t_i^k
\tag{3.16}
\]

respectively. For k_max motions, the resulting system of linear equations becomes

\[
\underbrace{
\begin{bmatrix}
I_9 - R_0^1 \otimes R_i^1 \\
\vdots \\
I_9 - R_0^k \otimes R_i^k \\
\vdots \\
I_9 - R_0^{k_{\max}} \otimes R_i^{k_{\max}}
\end{bmatrix}}_{M_i}
\operatorname{vec}(\Delta R_i) = 0 .
\tag{3.17}
\]

A vector w_i = vec(W_i) in the null space of M_i in Eq. (3.17) was calculated via singular value decomposition, see Eq. (3.18). NumPy conventions are used here; in e.g. MATLAB, the matrix named V_i would be transposed:

\[
U_i S_i V_i = M_i \;\Rightarrow\; w_i = V_i[-1, :] .
\tag{3.18}
\]

The rotation estimate is then given by Eq. (3.19),

\[
\Delta R_i = W_i \operatorname{sign}(\det(W_i))\, |\det(W_i)|^{-1/3} .
\tag{3.19}
\]

From Eq. (3.16), iterating over the motions in the sequence of frames, Eq. (3.20) is obtained:

\[
\underbrace{
\begin{bmatrix}
I_3 - R_0^1 \\
\vdots \\
I_3 - R_0^k \\
\vdots \\
I_3 - R_0^{k_{\max}}
\end{bmatrix}}_{N_i}
\Delta t_i =
\underbrace{
\begin{bmatrix}
t_0^1 - \Delta R_i\, t_i^1 \\
\vdots \\
t_0^k - \Delta R_i\, t_i^k \\
\vdots \\
t_0^{k_{\max}} - \Delta R_i\, t_i^{k_{\max}}
\end{bmatrix}}_{O_i}
\tag{3.20}
\]

The translation Δt_i was estimated by multiplying both sides of Eq. (3.20) with the pseudo-inverse of N_i.
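As a concrete reference for this estimation step, a minimal NumPy sketch of solving Eq. (3.17) and Eq. (3.20) is given below. The function and variable names are ours rather than from an existing library, and the inputs are the per-frame rotations R_0^k, R_i^k and translations t_0^k, t_i^k of the master and slave camera.

```python
import numpy as np

def estimate_relative_pose(R0_list, t0_list, Ri_list, ti_list):
    """Linear estimate of the fixed pose (dR, dt) between master camera 0 and slave camera i.

    Each list holds the per-frame rotations/translations of one camera, expressed
    relative to that camera's first frame (Eq. (3.4))."""
    # Rotation: stack I9 - R0 (x) Ri over all frames and take the null-space
    # vector via SVD (Eq. (3.17)-(3.18)).
    M = np.vstack([np.eye(9) - np.kron(R0, Ri) for R0, Ri in zip(R0_list, Ri_list)])
    _, _, Vt = np.linalg.svd(M)
    W = Vt[-1].reshape(3, 3)  # row-major reshape, matching the vec() convention used here
    # Rescale so that det(dR) = +1 (Eq. (3.19)).
    dR = W * np.sign(np.linalg.det(W)) / abs(np.linalg.det(W)) ** (1.0 / 3.0)

    # Translation: stack (I3 - R0) dt = t0 - dR ti over all frames and solve in a
    # least-squares sense, equivalent to the pseudo-inverse of N_i (Eq. (3.20)).
    N = np.vstack([np.eye(3) - R0 for R0 in R0_list])
    O = np.concatenate([t0 - dR @ ti for t0, ti in zip(t0_list, ti_list)])
    dt, *_ = np.linalg.lstsq(N, O, rcond=None)
    return dR, dt
```

Note that rotation-free sections of the trajectory make N_i nearly singular, which is the divergence of the pseudo-inverse discussed in Section 5.8.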
3.7 Addition of noise

The noise of a generated track is cumulative by nature, so a noisy rotation affects the whole trajectory that comes after it. Disturbances were created following zero mean Gaussian distributions, with independently selected amplitudes for rotation noise and translation noise. Individual samples were generated for each element of R and t, as shown in Eq. (3.21) and Eq. (3.22):

\[
R =
\begin{bmatrix}
r_{11} & r_{12} & r_{13} \\
r_{21} & r_{22} & r_{23} \\
r_{31} & r_{32} & r_{33}
\end{bmatrix}
+
\begin{bmatrix}
n_{11} & n_{12} & n_{13} \\
n_{21} & n_{22} & n_{23} \\
n_{31} & n_{32} & n_{33}
\end{bmatrix}
\tag{3.21}
\]

\[
t =
\begin{bmatrix}
t_1 \\ t_2 \\ t_3
\end{bmatrix}
+
\begin{bmatrix}
n_1 \\ n_2 \\ n_3
\end{bmatrix}
\tag{3.22}
\]

Two methods of adding noise were tried. The first was to simply generate noise and add it to each ground truth pose. However, it was reasoned that in reality the effect on a trajectory is permanent once it has diverged from the ground truth, so a cumulative noise model was also tried. With the cumulative model, the noise was generated in the same fashion, but the list of poses was instead divided into transformations between subsequent poses; by multiplying the transformations up to a frame index k, the noise is included and preserved in those transformations. A Gaussian distribution was chosen for the noise because of the unpredictability of wave and wind disturbances.

Why were the additive and cumulative models chosen? The additive model was chosen because it was desired to try the calibration algorithm on a trajectory that was "more or less" correct, so that the constraint of fixed links between cameras is easier to enforce correctly. The cumulative model was chosen as the more realistic situation: whenever a transformation between two frames is badly estimated, the error stacks onto the previous errors.

3.8 Optimization

After the initial estimate was found, a nonlinear refinement method can be applied to further improve the result. The goal was to minimize the error function in Eq. (3.23),

\[
L(\Delta q_i, \Delta t_i) = \sum_{n=1}^{N} \lVert A_i^n \Delta q_i \rVert + \lVert O_i^n \Delta t_i - t_0^n \rVert ,
\tag{3.23}
\]

where the subscript denotes the slave camera number and the superscript the frame number. Quaternions were used rather than rotation matrices, so that 7 parameters had to be estimated rather than 12. Several optimization algorithms could minimize this error function; the most common choice for visual applications is Levenberg-Marquardt, a damped least squares approach. In this work, however, Trust Region Reflective was used since it allows bounds. Typically the refinement yields only a slightly better solution when the initial guess is close but not already perfect.

4 Results

This chapter contains results from the stages towards extrinsic calibration, and results on how well the trajectory based calibration handles noise.

4.1 Visual odometry

The first test was to investigate whether visual odometry would be sufficient for calibration of the cameras. The algorithm was tested on KITTI Odometry sequence 9. The KITTI data sets are very popular, feature rich, rectified, provided with ground truth, and commonly used as benchmarks. The features and descriptors were generated using FAST and BRIEF and a brute force matcher. The first test conducted concerned how long to track features. The tracking was done by matching the features in the latest frame with the tracked features until the number of matches fell below a certain threshold; when the threshold was crossed, all matches from the latest frame pair were used for feature tracking. In Fig. 4.1 this was done with FAST+BRIEF with the limits [100, 400, 700, 2000]. The biggest difference in Fig. 4.1 is between 100 and 400; while there was an improvement between 400, 700 and 2000, it was not as significant. A similar test was done using SIFT. With SIFT, the ratio in Lowe's test [32] determines how good the matches have to be, which also determines the number of features found. Fig. 4.2 displays two commonly used ratios; a higher ratio gives more keypoints.

Figure 4.1: KITTI Odometry sequence 9, ground truth (GT) versus visual odometry (VO). Different thresholds for detecting new features: [100, 400, 700, 2000] for subfigures (a)-(d).
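The VO evaluated here follows the standard monocular pipeline from Section 3.5.1: detect and match features, estimate the essential matrix with RANSAC, and recover the relative pose (up to scale). A minimal OpenCV sketch of one frame-to-frame update is shown below; the detector settings are illustrative rather than the exact configuration used in these experiments, and BRIEF requires the opencv-contrib package.

```python
import cv2
import numpy as np

def relative_pose(img_prev, img_curr, K):
    """One frame-to-frame VO update: FAST keypoints, BRIEF descriptors,
    brute-force Hamming matching, essential matrix with RANSAC, pose recovery.
    The returned translation is only defined up to scale."""
    fast = cv2.FastFeatureDetector_create(threshold=25)
    brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()  # opencv-contrib
    kp1 = fast.detect(img_prev, None)
    kp2 = fast.detect(img_curr, None)
    kp1, des1 = brief.compute(img_prev, kp1)
    kp2, des2 = brief.compute(img_curr, kp2)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```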
Figure 4.2: KITTI Odometry sequence 9, ground truth (GT) versus visual odometry (VO). SIFT detector, FLANN matcher, Lowe's ratio 0.75 and 0.85.

SIFT outperformed FAST and BRIEF in terms of distance from ground truth on the KITTI data set. In the test, a Lowe's ratio of 0.75 performed better than 0.85, meaning that discarding the lower quality matches allowed by a 0.85 ratio was preferable. In terms of time consumption, SIFT took 10 times longer than FAST and BRIEF on the same sequence.

4.2 Visual inertial odometry

The first test, see Fig. 4.3, was to see what performed best in the simulator with no IMU influence (γ = 1), using SIFT, which had proven to be the best choice for the KITTI data set.

Figure 4.3: VIO algorithm, ground truth (GT) versus VIO. SIFT with Lowe's ratio [0.75, 0.75, 0.85, 0.95], update thresholds [70, 100, 150, 200] and γ = [1.0, 1.0, 1.0, 1.0] for subfigures (a)-(d), meaning no IMU influence.

Since the gyroscopes in IMUs do not drift significantly over the course of a minute, the heading of the boat could be accurately found by integrating the angular velocity around the z axis of the IMU. In Fig. 4.4 the values γ = [0, 0.3, 0.7, 1] are tested. With a γ value of 1, this is a VO algorithm with the FAST and BRIEF feature extractor, since most of the features available here are corners.

Figure 4.4: VIO algorithm, FAST+BRIEF, with γ = [0, 0.3, 0.7, 1.0] for subfigures (a)-(d).

Even though γ = 0 gave the correct rotations, it was not perfect with respect to the estimated translations and scales; the best result was obtained with γ = 0.3. For calibration, a trajectory from all of the cameras was needed. This test was conducted with SIFT, Lowe's ratio 0.75, γ = 0.5 and an update threshold of 100. The result is shown in Fig. 4.5. The trajectories in Fig. 4.5 behave the way one would expect from four cameras mounted 90° apart.

Figure 4.5: Each camera's own trajectory (front, right, back, left) viewed from its coordinate system. SIFT, Lowe's ratio 0.75, γ = 0.3.

4.3 Extrinsic calibration on ground truth trajectories

In the sections below, results for clean ground truth trajectories are presented, along with results for the same trajectories with noise subsequently added.

4.3.1 Trajectories with added noise

Four plots are shown in Fig. 4.6. The upper row illustrates the effect of additive noise applied to the translation vectors; the ground truth, plotted for comparison in each subfigure, remains similar since the errors only affect the trajectory locally. The bottom row of Fig. 4.6 shows cumulative noise, where the effect on the translation vectors is permanent; as a result, those trajectories differ much more from the ground truth. For additive rotation noise, N(0, 0.0001) started to result in noticeable errors. Further increasing the standard deviation by an order of magnitude increased the largest translation error by about 5 times. In general, the optimized transform improved the result slightly, but not significantly.
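The noise used in these experiments was injected as described in Section 3.7. The sketch below illustrates the two schemes, with per-element zero mean Gaussian noise as in Eq. (3.21)-(3.22); the helper names are ours.

```python
import numpy as np

def additive_noise(poses, std_R, std_t, rng=None):
    """Perturb every ground-truth 4x4 pose independently (local effect only)."""
    rng = rng or np.random.default_rng()
    noisy = []
    for T in poses:
        Tn = T.copy()
        Tn[:3, :3] += rng.normal(0.0, std_R, (3, 3))  # Eq. (3.21)
        Tn[:3, 3] += rng.normal(0.0, std_t, 3)        # Eq. (3.22)
        noisy.append(Tn)
    return noisy

def cumulative_noise(poses, std_R, std_t, rng=None):
    """Perturb the frame-to-frame transforms instead, so every error propagates
    to all later poses (the 'derailing' visible in Fig. 4.6)."""
    rng = rng or np.random.default_rng()
    steps = [np.linalg.inv(poses[k - 1]) @ poses[k] for k in range(1, len(poses))]
    noisy = [poses[0].copy()]
    for S in steps:
        Sn = S.copy()
        Sn[:3, :3] += rng.normal(0.0, std_R, (3, 3))
        Sn[:3, 3] += rng.normal(0.0, std_t, 3)
        noisy.append(noisy[-1] @ Sn)
    return noisy
```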
For cumulative rotation noise, the same increase in standard deviation instead caused the error to grow by two orders of magnitude. For 176 poses per camera, the algorithm took about 0.4 s to perform an extrinsic calibration between a pair of cameras. The increase in translation errors appeared more or less proportional to the increase in translation noise standard deviation, both for additive and cumulative noise. However, the error values were orders of magnitude higher when applying cumulative noise.

Figure 4.6: VRX simulation run, trajectories of the four cameras (front, right, back, left) together with ground truth. Top row: ground truth with additive translation noise. Bottom row: ground truth with cumulative translation noise. Noise level N(0, 0.1).

Table 4.1: The angular and translation errors for the transformation between the front and left camera, for different values of additive noise.

NOT OPTIMIZED
| stdR | stdT | Rot. error x [deg] | Rot. error y [deg] | Rot. error z [deg] | Transl. error x [cm] | Transl. error y [cm] | Transl. error z [cm] |
| 0 | 0 | 0 | 0 | 0 | 0 | 1.11438E-03 | -1.75659E-04 |
| 0.0001 | 0 | 0.1360499 | 0.00264001 | 0.12934821 | -0.64888966 | 2.108529787 | -4.12838181 |
| 0.001 | 0 | -0.0659454 | 0.0063285 | -0.05965763 | 0.3074497067 | -9.793094693 | 0.40649455 |
| 0 | 0.001 | 0 | 0 | 0 | 0.002163605 | 0.12289717 | 0.083389795 |
| 0 | 0.01 | 0 | 0 | 0 | 0.051626345 | 2.924271455 | -0.64143827 |
| 0 | 0.1 | 0 | 0 | 0 | -0.5982221 | -21.42732076 | -6.327840835 |
| 0.0001 | 0.0001 | 0.15780312 | -0.00081437 | 0.15404562 | -0.750350035 | 4.30544334 | -1.95124992 |
| 0.001 | 0.01 | 2.2095363 | 0.0096517 | 2.13640167 | -10.6353266 | -24.81991883 | -22.5635101 |

OPTIMIZED
| stdR | stdT | Rot. error x [deg] | Rot. error y [deg] | Rot. error z [deg] | Transl. error x [cm] | Transl. error y [cm] | Transl. error z [cm] |
| 0 | 0 | 0 | 0 | 0 | 0 | 1.11438E-03 | -1.75659E-04 |
| 0.0001 | 0 | 0.1501814 | 0.00152543 | 0.051467363 | -0.64888895 | 2.108552713 | -1.70618995 |
| 0.001 | 0 | -0.9304554 | 0.00107273 | -0.898786696 | 0.30736692 | -9.438791183 | 0.49425971 |
| 0 | 0.001 | 0 | 0 | 0 | 0.002163605 | -0.30704108 | 0.083389795 |
| 0 | 0.01 | 0 | 0 | 0 | 0.051626345 | 2.924271455 | -0.64143827 |
| 0 | 0.1 | 0 | 0 | 0 | -0.5982221 | -21.42732076 | -6.327840835 |
| 0.0001 | 0.0001 | 0.25382848 | 0.00016301 | 0.2477051 | -0.75034958 | 4.305438495 | -1.9512479 |
| 0.001 | 0.01 | 0.48233448 | -0.01627355 | 0.4633280 | -10.63597427 | -23.968070 | -22.350672 |

Table 4.2: The angular and translation errors for the transformation between the front and left camera, for different values of cumulative noise.
NOT OPTIMIZED
| stdR | stdT | Rot. error x [deg] | Rot. error y [deg] | Rot. error z [deg] | Transl. error x [cm] | Transl. error y [cm] | Transl. error z [cm] |
| 0 | 0 | 0 | 0 | 0 | 3.14078E-03 | -1.44161E-01 | 3.90152E-02 |
| 0.0001 | 0 | -2.36737E-01 | -3.55691E-02 | -1.84910E-01 | -6.40138E-01 | -3.91255E+01 | -6.60384E+00 |
| 0.001 | 0 | -1.00882E+00 | 1.10114E-02 | -8.40895E-01 | 8.42317E+00 | -3.00070E+02 | -5.01448E+01 |
| 0 | 0.001 | 0 | 0 | 0 | 4.52670E-01 | -2.73476E+01 | 4.23833E+00 |
| 0 | 0.01 | 0 | 0 | 0 | -5.84884E+00 | -2.24266E+02 | -3.06379E+01 |
| 0 | 0.1 | 0 | 0 | 0 | 3.50068E+01 | -4.03357E+03 | -1.01745E+03 |
| 0.0001 | 0.0001 | -1.34492E-03 | 1.88126E-02 | 4.66248E-02 | -2.15189E-01 | -1.48278E+01 | 2.98691E+00 |
| 0.001 | 0.01 | 9.05291E-01 | 3.79198E-01 | 4.07749E-01 | 3.44968E+00 | -1.20469E+02 | 2.49594E-02 |

OPTIMIZED
| stdR | stdT | Rot. error x [deg] | Rot. error y [deg] | Rot. error z [deg] | Transl. error x [cm] | Transl. error y [cm] | Transl. error z [cm] |
| 0 | 0 | 0 | 0 | 0 | 3.14107E-03 | -1.44161E-01 | 3.90154E-02 |
| 0.0001 | 0 | -1.72583E-01 | -2.97113E-02 | -1.44507E-01 | -6.40140E-01 | -3.91204E+01 | -6.60257E+00 |
| 0.001 | 0 | -9.24552E-01 | 8.84481E-02 | -7.36161E-01 | 8.42684E+00 | -2.95187E+02 | -4.89000E+01 |
| 0 | 0.001 | 0 | 0 | 0 | 4.52670E-01 | -2.73476E+01 | 4.23833E+00 |
| 0 | 0.01 | 0 | 0 | 0 | -5.84884E+00 | -2.24266E+02 | -3.06379E+01 |
| 0 | 0.1 | 0 | 0 | 0 | 3.50068E+01 | -4.03357E+03 | -1.01745E+03 |
| 0.0001 | 0.0001 | -3.96211E-02 | 4.19596E-03 | 7.23770E-03 | -2.15188E-01 | -1.48277E+01 | 2.98693E+00 |
| 0.001 | 0.01 | 1.06713E+00 | 4.99078E-01 | 6.93456E-01 | 3.44967E+00 | -1.20464E+02 | 2.62176E-02 |

5 Discussion

The discussion chapter primarily contains interpretations and implications of the Results chapter.

5.1 Evaluation of visual odometry

The proposed algorithm showed promising results on KITTI Odometry: apart from being a few degrees off in the corners and producing a slightly too small scale estimate, it did well on the 1590-frame image sequence KITTI 9, and it did so without loop closure or bundle adjustment. The level of accuracy was still not good enough for the extrinsic calibration algorithm; loop closure and bundle adjustment might have lifted it to acceptable levels.

5.2 Evaluation of importance of more features

The availability of high quality features proved to be the most important factor. When applying the VO algorithm to the simulator, it found at most 300 features with FAST+BRIEF and 200 with SIFT. Among those matches there were also a few wrongful ones, caused by matching an identical object at a different location. The threshold for updating matches needed to be lowered for the tracking to have some continuity, which gave less jittery motion in this case. Fig. 4.1 clearly shows how important a large number of keypoints is for the scale estimate. Evenly distributed features are also important for a good motion estimate, and this was an issue on water, which cannot be interpreted as stationary in any way; as a result, sections of the image became featureless. These factors motivate enriching the simulator world with many more, and more varied, features. Running such a simulation at a decent frame rate would require a better computer, or recording offline. Another option besides enhancing the simulation is to set up a camera rig on a boat and record data in a real or natural harbor. The objects on land would be more varied there than in the simulator, and would provide more and better features.

5.3 Evaluation of feature extraction algorithms

This thesis primarily used SIFT and FAST+BRIEF, for compatibility with ROS. There were big differences in how they find features: FAST+BRIEF is a corner detector, while SIFT is more general.
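For reference, the SIFT front end with Lowe's ratio test, as evaluated in Fig. 4.2, can be sketched in OpenCV as follows; the FLANN index parameters shown are the usual KD-tree defaults rather than tuned values from this work.

```python
import cv2

def sift_matches(img1, img2, ratio=0.75):
    """SIFT keypoints matched with a FLANN k-NN search and Lowe's ratio test.
    A lower ratio keeps only the most distinctive matches (fewer but better)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # KD-tree index
                                  dict(checks=50))
    good = []
    for pair in flann.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return kp1, kp2, good
```

The FAST+BRIEF alternative replaces the detector and descriptor with a corner detector and a binary descriptor matched by Hamming distance, as in the VO sketch in Section 4.1, which is the main reason for its roughly tenfold speed advantage.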
The strict conditions on the matches found with SIFT ensured that the matches were of high quality, which gave SIFT the upper hand on the KITTI data set, while FAST+BRIEF seemed to have the upper hand in the simulator. Another positive aspect of the corner detector and binary descriptor is speed: in our experience they were about ten times faster to compute than SIFT.

5.4 Evaluation of masking

It was often unavoidable to have a part of the vehicle present in the image. This caused problems when few other features were present, namely that the features detected on the vessel were considered to be the inliers. To address this problem one can either apply a mask to the image or add a check that discards matches that are stationary over two frames. A Euclidean distance check is very general and can be applied to all cameras, whereas a mask needs to be tailor-made for each specific camera; it is also possible to add a maximum distance to exclude false matches. The distance check has the drawback of not handling a standstill situation, where all matches are generally stationary. Since the simulator had a low frame rate and the boat was always moving, the distance check was selected. SIFT detected the most features on the boat, which was very problematic. The effect of masking is shown in Fig. 5.1.

Figure 5.1: Screenshot of the vessel with and without a minimum Euclidean distance condition for SIFT. In the top picture there are approximately 40 matches, seen as red dots.

The resulting trajectory when the matches on the vessel were not discarded can be seen in Fig. 5.2. Therefore, naturally, a mask was always applied.

Figure 5.2: Resulting trajectory without the max/min Euclidean distance constraint between matches.

In Fig. 5.2 the trajectory estimation fails in the corner, where the features on the boat were considered to be inliers.

5.5 Evaluation of the addition of gyro

With just visual odometry, the algorithm tended to underestimate the rotation, and as a result the loops were not closed. In contrast to the linear acceleration from the accelerometer, the angular velocity around an axis is the same for the entire body, which means that the heading estimated from the gyro can be applied to all of the cameras. Taking the majority of the rotation influence from the gyroscope improved the estimated trajectory.

5.6 Quality evaluation of trajectories

The trajectories from all of the cameras behaved as expected, but they were simply not good enough for the extrinsic calibration. The worst part was usually the scale estimate, which was not consistent enough. If the scale error were the same for all cameras, the result would at least be defined up to scale, but this is not the case; a correct scale estimate would help a lot.

5.7 Evaluation of results with added noise

In Fig. 4.6, observe how the cumulative noise trajectories "derailed", such that the relative positions and orientations were no longer fixed relative to each other. Since the goal of the extrinsic calibration method we employ is to find the best solution for a fixed transform between trajectories, and trajectories that change differently break this fixed transform, the quality of the result deteriorated significantly with large amounts of this type of noise. It became apparent that high cumulative noise was not handled well by the translation estimation, and cumulative noise was also the most realistic scenario.
Tables 4.1 and 4.2 show clearly that the translational error along the Y axis was generally the greatest. This is due to the planar motion of the vessel on still water, which makes the estimate sensitive to noise in the Y direction of the camera. At the time of writing, the top tier KITTI odometry results were at 0.53 % translation error and 0.0009 deg/m rotation error; trajectories of that quality would be well acceptable for performing the calibration.

5.8 Determination of section of trajectory

It was discovered that the choice of trajectory section from which to estimate the extrinsics had a big impact on the result, so inspecting the trajectory before calibration can be beneficial. To avoid deterioration of the trajectory caused by cumulative noise, earlier poses should be preferred over later ones. The section of the trajectory chosen for calibration was very important: for degenerate motions (no turns), the pseudo-inverse of N_i from Eq. (3.20) diverges, and when this happened the translation estimate diverged equally much. Therefore, choosing sections of the trajectory where the vessel is constantly turning gave a much better calibration. In Fig. 4.6 one can see that the first part is dead straight, so for the best results a portion of the trajectory from the first corner onwards should be selected.

6 Conclusion

The main research question was how automatic extrinsic calibration of a multiple fisheye camera rig can be achieved in a marine environment, a question which proved to be more challenging than first expected. In the simulator, the absence of good features spread across the image was an issue when it came to automatically calibrating the cameras. Improvements are still needed in the creation of the trajectories in order to enable an accurate calibration. Another research question was whether better results would be achieved with VO or VIO. It was seen that VIO gave more accurate trajectories than VO when there were not sufficiently good features to track. The computation of trajectories with VO and VIO was in both cases done offline (not while recording the images). Both the VO and the VIO algorithm had real-time performance given binary features (FAST+BRIEF), but SIFT was far from that. The trajectory based calibration itself was very fast: a pair of cameras with 176 poses each was calibrated in less than half a second. Another question was how to determine whether or not the calibration was accurate enough. In the simulator, the ground truth poses were available for comparison. To determine whether results are good enough in general, one could check whether features attain subpixel accuracy at reasonable distances from the cameras, or compare against ground truth.

6.1 Future work

For future development of this work, the focus should be on developing a better VIO algorithm with the possibility of adding a GNSS sensor, since such sensors are regularly available on boats. With the addition of an accurate GNSS, the drift in position would be eliminated. New data sets should also be recorded on an actual boat, in a harbor with surroundings rich in features. The data recorded should come from several cameras, an IMU and GNSS. The new data sets should satisfy the requirements of top tier odometry algorithms (15+ FPS camera, 100 Hz IMU) to see how well they perform in marine applications.

Bibliography

[1] Gustaver, M. (2020). A Chalmers University of Technology Master's thesis template for LaTeX. Unpublished.
[2] C. Mei and P. Rives, "Single view point omnidirectional camera calibration from planar grids," in ICRA 2007.
[3] Davide Scaramuzza, Agostino Martinelli, Roland Siegwart. A Toolbox for Easily Calibrating Omni-directional Cameras. IROS 2006, Beijing, China. inria-00359941
[4] L. Heng, B. Li and M. Pollefeys, "CamOdoCal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry," 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 2013, pp. 1793-1800, doi: 10.1109/IROS.2013.6696592.
[5] L. Heng, M. Bürki, G. H. Lee, P. Furgale, R. Siegwart and M. Pollefeys, "Infrastructure-based calibration of a multi-camera rig," 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 2014, pp. 4912-4919, doi: 10.1109/ICRA.2014.6907579.
[6] M. Ferrera, A. Eudes, J. Moras, M. Sanfourche and G. Le Besnerais, "OV2SLAM: A Fully Online and Versatile Visual SLAM for Real-Time Applications," in IEEE Robotics and Automation Letters, IEEE 2021. arXiv:2102.04060 [cs.CV]
[7] Richard Szeliski, Computer Vision: Algorithms and Applications, draft: Springer, September 3, 2010, pp. 58-60.
[8] Azzam, R., Taha, T., Huang, S. et al. Feature-based visual simultaneous localization and mapping: a survey. SN Appl. Sci. 2, 224 (2020). https://doi.org/10.1007/s42452-020-2001-3
[9] Yang Liu and H. Zhang, "Indexing visual features: Real-time loop closure detection using a tree structure," 2012 IEEE International Conference on Robotics and Automation, Saint Paul, MN, 2012, pp. 3613-3618, doi: 10.1109/ICRA.2012.6224741.
[10] Bill Triggs, Philip McLauchlan, Richard Hartley, Andrew Fitzgibbon. Bundle Adjustment – A Modern Synthesis. International Workshop on Vision Algorithms, Sep 2000, Corfu, Greece. pp. 298-372, 10.1007/3-540-44480_21. inria-00548290
[11] D. Galvez-López and J. D. Tardos, "Bags of Binary Words for Fast Place Recognition in Image Sequences," in IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188-1197, Oct. 2012, doi: 10.1109/TRO.2012.2197158.
[12] Michel Goossens, Frank Mittelbach, and Alexander Samarin. The LaTeX Companion. Addison-Wesley, Reading, Massachusetts, 1993.
[13] Ferrera, Maxime et al. "OV2SLAM: A Fully Online and Versatile Visual SLAM for Real-Time Applications." (2021).
[14] T. Qin, P. Li and S. Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator," in IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004-1020, Aug. 2018, doi: 10.1109/TRO.2018.2853729.
[15] Mukherjee, D., Jonathan Wu, Q.M. & Wang, G. A comparative experimental study of image feature detectors and descriptors. Machine Vision and Applications 26, 443-466 (2015). https://doi.org/10.1007/s00138-015-0679-9
[16] Knuth: Computers and Typesetting, http://www-cs-faculty.stanford.edu/~uno/abcde.html
[17] MSW motor, 2020, 15 Types of Automotive Sensors Your Vehicle is Likely to Have. https://mzwmotor.com/automotive-sensor-types/
[18] Author(s) Initial(s). Surname(s), Title of the Book, xth ed. City of Publisher, (U.S. State or Country if the City is not 'well known'): Publisher, Year of Publication, pp. xxx-xxx
[19] G. Carrera, A. Angeli and A. J. Davison, "SLAM-based automatic extrinsic calibration of a multi-camera rig," 2011 IEEE International Conference on Robotics and Automation, Shanghai, 2011, pp. 2652-2659, doi: 10.1109/ICRA.2011.5980294.
[20] Hernandez-Juarez, D., Schneider, L., Cebrian, P. et al. Slanted Stixels: A Way to Represent Steep Streets. Int J Comput Vis 127, 1643-1658 (2019). https://doi.org/10.1007/s11263-019-01226-9
[21] D.
Nister, "An efficient solution to the five-point relative pose problem," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756-770, June 2004, doi: 10.1109/TPAMI.2004.17.
[22] P. Liu, L. Heng, T. Sattler, A. Geiger and M. Pollefeys, "Direct visual odometry for a fisheye-stereo camera," 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 2017, pp. 1746-1752, doi: 10.1109/IROS.2017.8205988.
[23] E. Fernández-Moral, J. González-Jiménez, P. Rives and V. Arévalo, "Extrinsic calibration of a set of range cameras in 5 seconds without pattern," 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 2014, pp. 429-435, doi: 10.1109/IROS.2014.6942595.
[24] F. Zhao, T. Tamaki, T. Kurita, B. Raytchev and K. Kaneda, "Marker-based non-overlapping camera calibration methods with additional support camera views," Image and Vision Computing, vol. 70, 2018, pp. 46-54, doi: https://doi.org/10.1016/j.imavis.2017.12.006.
[25] Ariane Spaenlehauer, Vincent Frémont, Ahmet Sekercioglu, Isabelle Fantoni. A Loosely-Coupled Approach for Metric Scale Estimation in Monocular Vision-Inertial Systems. IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI 2017), Nov 2017, Daegu, South Korea. pp. 137-143. hal-01678915
[26] Brian Bingham, Carlos Aguero, Michael McCarrin, Joseph Klamo, Joshua Malia, Kevin Allen, Tyler Lum, Marshall Rawson, & Rumman Waqar (2019). Toward Maritime Robotic Simulation in Gazebo. In Proceedings of MTS/IEEE OCEANS Conference.
[27] Fischler & Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. CACM 1981.
[28] Andreas Geiger, Philip Lenz, & Raquel Urtasun (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Kannala, Juho & Brandt, Sami. (2006). A Generic Camera Model and Calibration Method for Conventional, Wide-Angle, and Fish-Eye Lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28. 1335-40. 10.1109/TPAMI.2006.153.
[30] Flir, https://www.flir.eu/globalassets/support/iis/knowledge-base/modifiedstereoaccuracy.xls
[31] C. Forster, L. Carlone, F. Dellaert, D. Scaramuzza. (2015). IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation. Robotics: Science and Systems (RSS) conference.
[32] https://en.wikipedia.org/wiki/Scale-invariant_feature_transform#Feature_matching_and_indexing
[33] Figure of epipolar geometry: https://en.wikipedia.org/wiki/Epipolar_geometry#/media/File:Epipolar_geometry.svg
[34] D. G. Lowe. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision.
[35] E. Rosten and T. Drummond. (2006). Machine Learning for High-Speed Corner Detection. ECCV, 9th, Vol. 1. pp. 433-436.
[36] M. Calonder, V. Lepetit, C. Strecha and P. Fua. (2010). BRIEF: Binary Robust Independent Elementary Features. ECCV, 11th. pp. 778-792.
[37] M. Muja and D. G. Lowe. (2009). Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. International Conference on Computer Vision Theory and Applications.
[38] G.
Falco, M. Pini and G. Marucco. (2017). Loose and Tight GNSS/INS Integrations: Comparison of Performance Assessed in Real Urban Scenarios. 17(2):255.