Monocular Simultaneous Localisation and Mapping for Road Vehicles
Master's thesis in Systems, Control and Mechatronics
MATHIAS ERNST, SAMUEL SCHEIDEGGER
Department of Signals and Systems
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2015

Master's thesis EX027/2015

© MATHIAS ERNST, SAMUEL SCHEIDEGGER, 2015.
Supervisor: Erik Stenborg, Volvo Car Corporation
Examiner: Lennart Svensson, Department of Signals and Systems
Department of Signals and Systems
Signal processing and Biomedical engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Illustration of feature points tracked over a sequence of images.
Typeset in LaTeX
Printed by [Name of printing company]
Gothenburg, Sweden 2015

Abstract

Autonomous cars hold great promise for the future of transportation, enabling more efficient driving and safer vehicles for both the occupants and other road users. A key aspect of achieving autonomous driving is accurate positioning of the vehicle. One way of meeting this goal is to build a map and simultaneously perform localisation within that map. This problem is known as the Simultaneous Localisation and Mapping (SLAM) problem. To avoid each vehicle having to build its own map, one can also localise the vehicle's position within a previously made map, and such a map could be built using SLAM methods.

This thesis explores a few different approaches to solving the SLAM problem and describes the implementation of a graph-based SLAM system using monocular camera data. The system that has been developed uses feature tracking, epipolar geometry and local bundle adjustment for visual odometry, relying on velocity and time measurements to recover the information lost in the scale ambiguity of monocular image data. Topological loop closures are found using the FAB-MAP 2.0 algorithm. The developed system uses the OpenCV library for feature detection and description, the g2o and CHOLMOD libraries for graph representation and bundle adjustment, and the OpenFABMAP library for running the FAB-MAP 2.0 algorithm. The algorithm is tested on the public KITTI datasets and is shown to successfully build consistent maps and localise within them.

Keywords: Visual SLAM, Monocular, Visual Odometry, Bundle Adjustment.

Acknowledgements

We would like to thank our supervisors, Erik Stenborg and Anders Karlsson, and our examiner, Lennart Svensson, for their support during our work. We would also like to thank the authors of our sister project, Robin Lindholm and Carl-Johan Pålsson, for the interesting discussions, and the other participants in our study group, Lars Hammarstrand, Nicklas Gustafsson, Maryam Fatemi and Malin Lundgren, for their useful input. We would also like to thank Viktor Lindström for help debugging our code and Mikael Persson for his expert help on the subject. Finally, we would like to thank Volvo Car Corporation for the opportunity to perform this work.
Mathias Ernst, Samuel Scheidegger, Gothenburg, June 2015 vii Contents List of Figures xi List of Tables xiii List of Acronyms xv 1 Introduction 1 1.1 Problem background . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Theory 5 2.1 Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Camera model . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Camera calibration . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Feature based methods . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 Direct methods . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Simultaneous Localisation And Mapping . . . . . . . . . . . . . . . . 24 2.3.1 Loop closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 EKF SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.3 FastSLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.4 Graph-based SLAM . . . . . . . . . . . . . . . . . . . . . . . . 31 3 Implementation 35 3.1 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.1.1 Features and tracking . . . . . . . . . . . . . . . . . . . . . . . 35 3.1.2 Bundle Adjustment . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 Simultaneous Localisation And Mapping . . . . . . . . . . . . . . . . 37 3.2.1 FAB-MAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.2 Geometric loop closure . . . . . . . . . . . . . . . . . . . . . . 38 4 Results 41 4.1 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 ix Contents 5 Discussion 49 5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Bibliography 53 A Parameter List I x List of Figures 2.1 Basic pinhole camera model. Shows how the point, X, in space is projected onto the image plane, the principal point, p, and the camera centre, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 The principal point with offset from the image coordinate origin. . . . 7 2.3 Sketch of how a heavily distorted image of a perfect square would be corrected. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4 The projected parallel lines on a plane converges to the vanishing points, v1 and v2, and constructs the vanishing line, l. . . . . . . . . 10 2.5 The left figure is showing the two camera view centres, C and C′, the world point, X, the projected image points, x and x′, and the epipolar plane π. The right figure shows the epipolar points, e and e′, the line of the second view, l′, and how all world points in the direction of X must be projected onto this line. . . . . . . . . . . . . 12 2.6 Illustration of how the four different choices of camera matrices for a given essential matrix can be interpreted. . . . . . . . . . . . . . . . . 16 2.7 Illustration of how RANSAC would work to estimate a line. 
The figures shows two randomly selected points in blue, rejected outliers in red and the inliers in green for four different hypothesises. Here the inliers of the hypothesis in the bottom right image will be selected as the final inliers as they are more than in the other hypothesises. . . 20 2.8 A plot of the number of required iterations of RANSAC for a 99% probability of success at different percentage of outliers with different number of minimal points. . . . . . . . . . . . . . . . . . . . . . . . . 21 2.9 Illustrations of how a circular motion can be used describe the motion of a vehicle using the Ackermann steering principle. . . . . . . . . . . 22 2.10 Basic structure of the graph in graph-based SLAM. White circles are pose nodes, black squares are landmarks and solid and dashed edges are pose-to-pose and pose-to-landmark constraints respectively. At 20 and 21 loop closures are detected when old landmarks 5 and 7 are observed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.11 An illustration of the sparseness and symmetry of the information matrix, H, corresponding to the graph in figure 2.10. Only the shaded elements are non-zero, black elements are pose nodes, dark grey corresponds to landmarks and middle grey the edges between these. Loop closures manifest as off-diagonal elements at the sym- metric pairs 〈(20, 5), (5, 20)〉 and 〈(21, 7), (7, 21)〉. . . . . . . . . . . . 33 xi List of Figures 2.12 Comparison of the squared error function and the Huber error. . . . . 34 4.1 Plots of evaluation data for dataset 1 using VO. Average translation error: 1.22 %. Average rotation error: 0.00550 deg/m. . . . . . . . . . 42 4.2 Plots of evaluation data for dataset 2 using VO. Average translation error: 0.995 %. Average rotation error: 0.00415 deg/m. . . . . . . . . 43 4.3 Plots of evaluation data for dataset 3 using VO. Average translation error: 2.03 %. Average rotation error: 0.00709 deg/m. . . . . . . . . . 44 4.4 Plots of evaluation data for dataset 4 using VO. Average translation error: 1.82 %. Average rotation error: 0.00250 deg/m. . . . . . . . . . 45 4.5 Execution time for every 25th step of the different parts of the VO for dataset 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.6 Plots of evaluation data for dataset 1 using the full SLAM solu- tion. Average translation error: 1.21 %. Average rotation error: 0.00558 deg/m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.7 Plots of evaluation data for dataset 2 using the full SLAM solu- tion. Average translation error: 1.00 %. Average rotation error: 0.00405 deg/m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.8 Execution time for every 25th step of the different parts of the SLAM algorithm for dataset 1. . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.9 Plot of aligned trajectory and distance to ground truth for dataset 1. The dotted blue line in the left figure is the unaligned estimate. The RMS distance improved from 7.214 m to 2.435 m by rotating and 2.030 m with 6 DoF alignment . . . . . . . . . . . . . . . . . . . . . . 48 4.10 Plot of aligned trajectory and distance to ground truth for dataset 2. The dotted blue line in the left figure is the unaligned estimate. The RMS distance improved from 9.905 m to 4.044 m by rotating and 3.731 m with 6 DoF alignment . . . . . . . . . . . . . . . . . . . . . . 48 5.1 A frame from dataset 3, with the tracked feature points, which are approximately coplanar, shown in green. . . . . . . . . . . . . 
. . . . 50 xii List of Tables 2.1 Properties and performance of feature detectors [32]. . . . . . . . . . 18 4.1 Properties of the datasets used for evaluation of the implemented algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Number of detected loop closures before and after the two firewall conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Results of the evaluation of the datasets. . . . . . . . . . . . . . . . . 46 A.1 List of user tunable parameters and the values used in the result section. I xiii List of Tables xiv List of Acronyms BA Bundle Adjustment. BoW Bag-Of-Words. BRIEF Binary Robust Independent Elementary Features. BRISK Binary Robust Invariant Scalable Keypoints. CCD Charge-Coupled Device. CENSURE Center Surround Extrema. CMOS Complementary Metal Oxide Semiconductor. DoF Degrees Of Freedom. DoG Difference-Of-Gaussian. EIF Extended Information Filter. EKF Extended Kalman Filter. FAB-MAP Fast Appearance-Based Mapping. FAST Features from Accelerated Segment Test. GNSS Global Navigation Satellite System. GPS Global Positioning System. GPU Graphics Processing Unit. IAC Image of Absolute Conic. ICR Instantaneous Centre of Rotation. IF Information Filter. IMU Inertial Measurement Unit. KLT Kanade–Lucas–Tomasi. xv List of Acronyms LSD-SLAM Large Scale Direct SLAM. ORB Oriented FAST and Rotated BRIEF. PTAM Parallel Tracking And Mapping. RANSAC Random Sample Consensus. RMS Root Mean Square. SeqSLAM Sequence SLAM. SIFT Scale Invariant Feature Transform. SLAM Simultaneous Localisation And Mapping. SURF Speeded Up Robust Features. SVD Singular Value Decomposition. UKF Unscented Kalman Filter. V-SLAM Visual Simultaneous Localisation And Mapping. VCC Volvo Car Corporation. VO Visual Odometry. xvi 1 Introduction In 2011, an estimated 1.3 million people where killed in road traffic accidents; by 2020 this is expected to grow to 1.9 million people, annually [1]. In 2015, a study by the U.S. Department of Transportation showed that 94% of the road traffic accidents in the U.S. were caused by drivers [2]. The number of road traffic accidents could be reduced by identifying when a driver is about to make a mistake that will cause an accident, and take action to prevent the accident. Volvo Car Corporation (VCC) has such a system in production today, in form of the City Safety system, which intends to autonomously brake the car to avoid collisions at low speeds. This concept could be expanded, to further reduce accidents, with other systems, e.g. lane keeping or even fully autonomous vehicles. Considering the vast number of accidents caused by human error, it is clear that autonomous vehicles that are less prone to driver errors has the potential to drastically reduce the number of accidents. Furthermore, autonomous vehicles has the potential to free up billions of hours for people currently spent acting as the human control system of the vehicles The ever-increasing vehicle ownership and energy demand from the transport sector pose a concern for the environment. The worldwide transport sector, in 2010, stood for 28% of the total energy consumption and caused 23% of total energy-related CO2 emissions with a majority, 72%, coming from road vehicles [3, pp. 603]. Efforts have been made to develop more energy efficient engines, and to use alternative energy sources. For example, hybrid vehicles, using both internal combustion engines and electric motors, and purely electric vehicles, have been developed. 
No matter which propulsion system is used, further gains in energy efficiency can be made by adopting efficient driver behaviour. In a fully autonomous vehicle this can be done automatically. The energy consumption can be minimised by route planning, platooning, better control of the power components, and regenerative drive system [4, pp. 5] to a degree not achievable by a human driver. Autonomous vehicles is a hot research topic worldwide, as they can make traffic safer and more efficient in terms of both energy use and traffic flow and several experimental platforms have been tested. Google has been testing self driving cars since 2009, using expensive sensors such as LIDARs, on public roads [5]. Initially the test were limited to freeways, but later also included more complex city streets. In 2014 Daimler successfully tested a car on a route with a variety of difficult traf- fic situations, using close-to-production sensor hardware. VCC plans to put 100 cars on the roads of Gothenburg by 2017, that will be self-driving on a number of predetermined roads. 1 1. Introduction 1.1 Problem background To enable autonomous systems, sensor data input is required. For lane keeping, a system which estimates the road lane and the vehicle’s lateral position in the lane is necessary. For a fully autonomous vehicle, a map of the surroundings and other traffic is required [7]. The limited accuracy of Global Navigation Satellite Systems (GNSSes), such as the Global Positioning System (GPS), and the poor performance or non-existent signal in some environments, such as urban canyons or tunnels, make them insufficient for such applications [8]. Therefore, the use of other types of sensors is required to replace or complement GNSSes. Such other sensors could for example be wheel odometry, steering angle sensors, range sensors, such as LIDAR or radar detectors, or cameras. A problem in common for using these types of sensors to estimate position, is that the estimates are incremental and noisy, and thus subject to a drift. One consequence of the drift when building a map is that the representation of the location in the map depends on the measurement noise and the path taken to that location, causing inconsistent maps. To reduce the drift, information from a GPS, which is absolute in the global frame and thus does not suffer from drift, could be fused with sensors that are more accurate in the local frame. Another technique to reduce drift and enforce consistency is to adjust the estimated trajectory when visiting a location with previously known information about the position. To accomplish this, a map in which it is possible to find the ego-pose has to be built, and to build a map, a known position is required. This “chicken or the egg” problem is often referred to as the Simultaneous Localisation And Mapping (SLAM) problem [9, pp. 222]. 1.2 Purpose The purpose of this work is to evaluate how digital cameras mounted on a vehicle can be used to estimate the ego-motion of the vehicle and build a map, in which it is possible to localise the ego pose. This concept is commonly referred to as the Visual Simultaneous Localisation And Mapping (V-SLAM) problem. A review of the current literature on the subject should be done, and the most common and most promising methods to solve the SLAM problem should be in- vestigated. A solution to the V-SLAM problem should be proposed and evaluated. 
Further improvements to the system, unresolved issues, and areas and techniques that would be useful to explore further should be pointed out for future work. An evaluation of how the V-SLAM techniques could be deployed for building a large scale map should be considered. Aspects such as the accuracy of the mapping and localisation, the robustness in various road environments, and feasibility should be evaluated. Lastly, avenues for further investigation in the area of V-SLAM should be identified. 2 1. Introduction 1.3 Delimitations In this project a full 6 Degrees Of Freedom (DoF) SLAM system was built using monochrome monocular image data from a calibrated camera. Vehicle speed mea- surements were used to resolve the scale ambiguity inherent in monocular camera data. he system does not handle dynamic scenes by classifying moving objects, such as other traffic, and taking this into consideration in the map building. Even though computational complexity of the different algorithms is considered, the system is not intended to run in real-time. The implementation takes unrectified images and speed measurements as inputs, and produces an estimate of the trajectory and a map in the form of a point cloud of estimated feature positions. For loop closure detections, training data in the form of a set of images from environments similar to the ones to be driven in is also needed. 1.4 Methodology The thesis work began with a study of the SLAM problem based on the lectures and exercises from the course Robot Mapping1 held by Cyril Stachniss at Freiburg University. A further literature study regarding V-SLAM in particular was done, covering such topics as camera calibration, projective geometry, image features, Visual Odometry (VO), and appearance based loop closures. The first implementations of the VO system were made in Matlab, but the cho- sen solution was later reimplemented in C++ to be able to utilise certain third party libraries. The performance of the system was evaluated on datasets from the KITTI Vision Benchmark Suite2 [10]. The KITTI datasets are logs from sensors mounted on a car when different paths where driven. The sensors include 4 forward looking cameras, 1 64 layer LIDAR, and a highly accurate combined Inertial Mea- surement Unit (IMU) and GNSS. Evaluation was performed in the same manner as in the KITTI Benchmark, to enable comparison with other solutions, i.e. checking the average accumulated deviation from ground truth in subsequences of certain lengths. 1.5 Related work VO, or estimation of the ego-motion from consecutive camera images has previously been studied in the literature, and several different solutions have been proposed. One solution, proposed by Nistér et al. in 2004, uses a stereo camera and feature tracking, and achieves successful results from aerial, automotive and handheld plat- forms [11]. A lane-based solution, LaneLoc, purely for road vehicles, using pre-built maps in which on-line localisation is performed, was initially proposed by Schreiber et al. in 2013 [12] and was later used by Ziegler et al. in 2014 in Daimler’s Bertha 1http://ais.informatik.uni-freiburg.de/teaching/ws13/mapping/ 2http://www.cvlibs.net/datasets/kitti/ 3 http://ais.informatik.uni-freiburg.de/teaching/ws13/mapping/ http://www.cvlibs.net/datasets/kitti/ 1. Introduction project [6]. In 2013 Engel et al. proposed a semi-dense direct method, using image gradients, achieving real-time performance on a modern smartphone [13]. Different solutions to the V-SLAM problem has been proposed. 
In 2003 Davison used an Extended Kalman Filter (EKF) and feature points for mapping and pose estimation. The EKF based solutions were outperformed by a particle filter based solution by Montemerlo et al. in 2002 [15], FastSLAM, which was then improved, and named FastSLAM 2.0, by Montemerlo et al. in 2003 [16]. A solution inspired by the hippocampus of rodents, RatSLAM, was proposed by Milford et al. in 2004 [17]. In 2007, Klein and Murray proposed a solution, Parallel Tracking And Mapping (PTAM), which broke the barrier to real-time direct augmented reality with inexpensive cameras. A large scale real-time V-SLAM solution was proposed by Lim et al. in 2014 [19] using feature point matching and graph based SLAM. Solutions for pure loop closure detection have been proposed. Cummins and New- man in 2008 proposed Fast Appearance-Based Mapping (FAB-MAP) [20], which was improved in 2011 by Cummins and Newman to FAB-MAP 2.0 [21]. Both ver- sions detects loop closures by exploiting highly distinctive feature points. Milford and Wyeth in 2012 proposed Sequence SLAM (SeqSLAM), which uses sequences instead of single images. These loop closure detection algorithms have been used by various implementations for VO, to achieve a full V-SLAM solution, e.g. the semi-dense direct VO solution, by Engel et al. in 2013, was improved by Engel et al. in 2014 to Large Scale Direct SLAM (LSD-SLAM), a full V-SLAM solution by using FAB-MAP 2.0 for loop closure detection in real-time. 4 2 Theory In this chapter, the theory on which the implemented system relies is presented. Starting with the theory on how the camera maps the environment to an image, continuing with how to incrementally estimate the ego-motion from the images us- ing visual odometry, and ending with how SLAM algorithms can be used to localise and build large-scale, consistent maps of the environment. While the final imple- mentation is a graph-based SLAM system, EKF SLAM and the particle filter based FastSLAM algorithm were considered during the study of the SLAM problem and are also presented here. 2.1 Camera A camera maps reflected light from 3D objects in space onto a 2D image [24, pp. 153]. Typically, the light reflected by the environment is projected, using optical lenses, onto a Complementary Metal Oxide Semiconductor (CMOS) or Charge- Coupled Device (CCD) sensor, which both use the photoelectric effect [25] and analogue to digital converters to digitise the image. To use the images to build a map of the environment, the way the camera projects the environment to the image must be known. The maps considered here are locations of points on the surface of objects in the world. A mathematical camera model is used to represent the camera projection. 2.1.1 Camera model The basic pinhole camera model projects a point in space onto an image point on a plane, called the image plane or the focal plane. Let the centre of projection, the camera centre, C, be the origin of a Euclidean coordinate system. A point in space, X = (X, Y, Z)T , is projected to a point, x = (x, y)T , on the image plane, Z = f , where the line connecting the centre of projection and the point in space intersects the image plane, as shown in figure 2.1. It can be seen that the point in space, X, in R3 is mapped to the image point, x, in R2, as follows [24, pp. 154]:( X, Y, Z )T 7→ ( f ·X/Z, f · Y/Z )T . (2.1) The line which is perpendicular to the image plane and meets the camera centre, is called the principal axis. 
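As a concrete illustration of the mapping in (2.1), the short C++ sketch below projects a point given in the camera coordinate system onto the image plane; the focal length and the point are arbitrary example values, and the function name is ours. The full model, including the calibration matrix and the camera pose, is developed in the remainder of this section.

```cpp
#include <array>
#include <cstdio>

// Basic pinhole projection of (2.1): (X, Y, Z) -> (f*X/Z, f*Y/Z).
// The point is expressed in the camera coordinate system, with Z > 0 in
// front of the camera. The result is in the same units as f; pixel
// coordinates require the calibration matrix K, introduced below.
std::array<double, 2> projectPinhole(double f, double X, double Y, double Z) {
    return { f * X / Z, f * Y / Z };
}

int main() {
    const double f = 0.008;                      // example focal length, 8 mm
    auto x = projectPinhole(f, 1.0, 0.5, 10.0);  // a point 10 m in front of the camera
    std::printf("image point: (%g, %g)\n", x[0], x[1]);
    return 0;
}
```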
The point where the principal axis meets the image plane is called the principal point, p, or the optical centre. 5 2. Theory Y X xC x X Z y p Principal axis Image plane Camera centre Y C Zp f f · Y/Z Figure 2.1: Basic pinhole camera model. Shows how the point, X, in space is projected onto the image plane, the principal point, p, and the camera centre, C. If the world point and image point are represented as homogeneous coordinates, the mapping can be written as a matrix multiplication: X Y Z 1  7→ f ·Xf · Y Z  = f 0 f 0 1 0   X Y Z 1  . (2.2) Hereafter X will represent the world point as a homogeneous 4-vector (X, Y, Z, 1)T and x will represent the image point as a homogeneous 3-vector (x, y, 1)T , then (2.2) can be written as x = PXcam, (2.3) where P is the 3× 4 homogeneous camera projection matrix and Xcam is a point in space in the coordinate system with the origin in the camera centre. In (2.2) it is assumed that the principal point is in the origin of the image plane coordinate system, which it may not be, so in general the mapping is [24, pp. 155] ( X, Y, Z )T 7→ ( f ·X/Z + px, f · Y/Z + py )T = ( xcam, ycam )T , (2.4) where xcam and ycam are the image pixel coordinates with the principal point at p = (px, py)T , as in figure 2.2. This can be represented as homogeneous coordinates:  X Y Z 1  7→ f ·X + Z · px f · Y + Z · py Z  = f px 0 f py 0 1 0   X Y Z 1  . (2.5) By expressing K as K = f py f py 1  , (2.6) (2.5) can be written as x = K [ I 0 ] X, (2.7) 6 2. Theory x y p xcam ycam x0 y0 Figure 2.2: The principal point with offset from the image coordinate origin. where K is called the camera calibration matrix and [I|0] denotes I and 0 stacked horizontally. This model assumes square pixels. In the case of a camera, with a CCD or CMOS sensor, this is not a necessity. To compensate for this, unequal scale factors can be introduced, and the camera calibration matrix, K, can be generalised as K = mx my 1  f px f py 1  = αx x0 αy y0 1  . (2.8) For added generality, a skew parameter, s, can be added: K = αx s x0 αy y0 1  . (2.9) For normal cameras, s = 0. If s 6= 0, this can be interpreted as a skewing of the pixels in a sensor array, as if the x- and y-axes of the pixels were not perpendicular. This can occur if the image is a photography of an image itself, where the principal axis is not perpendicular to the plane of the image [24, pp. 164]. In general, the world points will be expressed in a world coordinate system, which usually will not coincide with the camera coordinate system. The relation between the two coordinate systems can be expressed as a rotation and a translation. If X̃ is an inhomogeneous 3-vector representing a point in space, expressed in world coordinates, the coordinates of the point in the camera coordinate system can be written as X̃cam = R(X̃ − C̃) [24, pp. 156], where R is a 3 × 3 rotational matrix representing the orientation of the camera coordinate system and C̃ the coordinates of the camera origin expressed in world coordinates. In homogeneous coordinates, this can be expressed as Xcam = [ R −RC̃ 0 1 ] X, (2.10) which, with (2.7), gives x = KR [ I −C̃ ] X, (2.11) 7 2. Theory and the camera matrix, P : P = KR [ I −C̃ ] . (2.12) For convenience, the camera centre is usually not expressed in world coordinates, but with t = −RC̃, which gives the camera matrix P = K [ R t ] . 
(2.13) K is often referred to as the internal camera parameters, or intrinsic matrix, and [R|t] as the external camera parameters, or the extrinsic matrix. Finding K is seen as a part of the calibration of the camera. If K has the form as in (2.9), the model is called finite projective camera and has 11 degrees of freedom [24, pp. 175]. This linear model will only hold for cameras with pinhole lenses. For cameras with optical lenses, i.e. most cameras, the world point will not be collinear with the projected point and the camera centre [24, pp. 189]. The most significant error is a radial distortion caused by the lens, as illustrated in figure 2.3, which becomes more severe with shorter focal length and lesser optical quality. To still be able to use a linear camera model, the image can be corrected to what it would have been with a perfect linear camera. The correction for radial distortion has to be applied as a first step in the image processing [24, pp. 190]. A world point in camera coordinates, Xcam, gets linearly projected onto an image plane at (x̃, ỹ)T in focal length units as ( x̃, ỹ, 1 )T = [ I 0 ] Xcam. (2.14) Correction Figure 2.3: Sketch of how a heavily distorted image of a perfect square would be cor- rected. The actual projected point (xd, yd)T can be related to the ideal point as( xd yd ) = L(r̃) ( x̃ ỹ ) , (2.15) where r̃ = √ x̃2 + ỹ2 is the radial distance from the centre of radial distortion and L(r̃) is a distortion factor. This can be written in pixel coordinates as x̂ = xc + L(r)(x− xc) ŷ = yc + L(r)(y − yc), (2.16) where (x, y)T are the measured coordinates, (x̂, ŷ)T are the corrected coordinates, (xc, yc)T the centre of radial distortion and r = √ (x− xc)2 + (y − yc)2 the radial 8 2. Theory distance from the centre of radial distortion. L(r) is defined only for positive values and L(0) = 1. L(r) is usually approximated by a Taylor expansion L(r) = 1 +κ1r+ κ2r 2 +κ3r 3 + .... Determining the coefficients, κn and the centre of radial distortion, (xc, yc)T , is usually considered as part of the calibration of the camera. 2.1.2 Camera calibration A camera is calibrated when the image can be corrected to linear projection and K is known. In this case, the projection of a world point on the image plane can be related to the direction to the world point from the camera centre, and can be written X̃ = λd. This would then map to the image point as x = K[I|0](λdT , 1)T = Kd, up to a scale, and the direction of an image point would thus be d [24, pp. 208]. The angle between two image points can, according to the cosine formula for the angle between two vectors, be written as cos θ = dT1 d2√ dT1 d1 √ dT2 d2 = (K−1x1)T (K−1x2)√ (K−1x1)T (K−1x1) √ (K−1x2)T (K−1x2) = xT1 (K−TK−1)x2√ xT1 (K−TK−1)x1 √ xT2 (K−TK−1)x2 . (2.17) This shows that the angle between the direction of two points can be measured if K is known. It can also be shown that a scene plane defined by an image line, l and the camera centre, with the normal, n, can be related by K, such that n = lK [24, pp. 209]. To find the parameters of the intrinsic matrix, K, the geometry of points projected from infinitely far away is used. Points on the plane at infinity, π∞, can be written as X∞ = (dT , 0)T , and are projected onto an image plane with H = KR as x = KR [ I −C̃ ] (d 0 ) = KRd = Hd. (2.18) This projection is independent of the camera position, C, and only depends on the internal parameters of the camera and rotation, and thus the calibration can be made independent of the position of the camera. 
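The calibration derivation continues below via the image of the absolute conic; before that, the following sketch summarises the chain from world point to pixel by combining the finite projective camera P = K[R|t] of (2.12) with the radial distortion model of (2.15). It is only an illustration under assumed example values for K, the distortion coefficients and the pose; Eigen is used for the linear algebra, and the function name is ours. In practice it is the inverse operation, correcting measured pixels as in (2.16), that is applied to the images.

```cpp
#include <Eigen/Dense>
#include <iostream>

// Sketch: world point -> camera coordinates via [R|t] -> normalised image
// coordinates -> radial distortion L(r) = 1 + k1*r + k2*r^2 + k3*r^3 as
// in (2.15) -> pixel coordinates via K. All numeric values are made-up
// examples, not calibrated values.
Eigen::Vector2d projectWithDistortion(const Eigen::Matrix3d& K,
                                      const Eigen::Matrix3d& R,
                                      const Eigen::Vector3d& t,
                                      const Eigen::Vector3d& kappa,
                                      const Eigen::Vector3d& X) {
    Eigen::Vector3d Xcam = R * X + t;          // world -> camera coordinates
    Eigen::Vector2d xn = Xcam.hnormalized();   // ideal normalised coordinates
    double r = xn.norm();                      // radial distance
    double L = 1.0 + kappa(0) * r + kappa(1) * r * r + kappa(2) * r * r * r;
    Eigen::Vector3d xd(L * xn(0), L * xn(1), 1.0);  // distorted coordinates
    return (K * xd).hnormalized();             // pixel coordinates
}

int main() {
    Eigen::Matrix3d K;
    K << 720,   0, 620,    // alpha_x, s = 0, x0
           0, 720, 180,    // alpha_y, y0
           0,   0,   1;
    Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
    Eigen::Vector3d t(0.0, 0.0, 0.0);
    Eigen::Vector3d kappa(0.0, -0.2, 0.0);     // example distortion coefficients
    Eigen::Vector3d X(2.0, -1.0, 15.0);        // world point in metres
    std::cout << projectWithDistortion(K, R, t, kappa, X).transpose() << "\n";
    return 0;
}
```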
A conic is a curve described by a second degree equation in the plane [24, pp. 30]: ax2 + bxy + cy2 + dx+ ey + f = 0, (2.19) or in homogeneous coordinates ax2 1 + bx1x2 + cx2 2 + dx1x3 + ex2x3 + fx2 3 = 0. (2.20) This can be written in matrix form as xTCx = 0, (2.21) 9 2. Theory where C =  a b/2 d/2 b/2 c e/2 d/2 e/2 f  . (2.22) Under the point transformation (2.18), a conic is transformed as C ′ = H−TCH−1, since xTCx = dTH−TCH−1d = dTC ′d. Points on the absolute conic, Ω∞, a conic on a plane at infinity, π∞, satisfies [24, pp. 81] x2 1 + x2 2 + x2 3 x4 } = 0, (2.23) which gives ( x1, x2, x3 ) I ( x1, x2, x3 )T = 0. (2.24) Thus, the C that describes Ω∞ has to be the identity matrix and Ω∞ is also a conic of purely imaginary points. The conic, Ω∞, maps to ω = H−TCH−1 = H−T IH−1 = (KR)−T )I(KR) = (KKT )−1. Thus, if ω is determined, K is also determined. ω is called the Image of Absolute Conic (IAC). It can be shown that all circles intersect Ω∞, and that the support plane for a circle, π, intersect π∞ in a line and this line intersects Ω∞ in two points [24, pp. 28]. These two points are the circular points of π. The projected circular points lie on ω, and are also vanishing points of π. The vanishing line of a plane π is the line where π intersects π∞. This line is the same for all parallel planes [24, pp. 82]. The line can be constructed by two vanishing points of a plane, the points where two parallel, projected, lines on the plane converge, as in figure 2.4. l v1 v2 Figure 2.4: The projected parallel lines on a plane converges to the vanishing points, v1 and v2, and constructs the vanishing line, l. Using the result from (2.17) gives cos θ = vT1ωv2√ vT1ωv1 √ vT2ωv2 , (2.25) 10 2. Theory which gives that the vanishing points of two perpendicular directions satisfy [24, pp. 219] vT1ωv2 = 0. (2.26) Writing the IAC as ω = ω1 ω2 ω4 ω2 ω3 ω5 ω4 ω5 ω6  , (2.27) it can be shown that cameras with zero skew will put a constraint on ω such that ω2 = 0 (2.28) and a camera with square pixels in addition give a constraint such that ω1 = ω3. (2.29) Representing ω as a 6-vector w = (ω1, ω2, ω3, ω4, ω5, ω6)T , constraints in (2.26), (2.28) and (2.29) can be written in the form of aTw = 0. For (2.26), with v = (v1, v2, v3)T and u = (u1, u2, u3)T as vanishing points of two perpendicular directions, a can be written as [24, pp. 225] a = ( v1u1, v1u2 + v2u1, v2u2, v1u3 + v3u1, v2u3 + v3u2, v3u3 )T , (2.30) for (2.28), a can be written as a = ( 0, 1, 0, 0, 0, 0 )T (2.31) and for (2.29), a can be written as a = ( 1, 0, −1, 0, 0, 0 )T . (2.32) By stacking aT from n of these constraints in a matrix, A, of size n× 6 and solving Aw = 0 using Singular Value Decomposition (SVD) [24, pp. 593], gives a least- squares solution for ω. ω can be decomposed into the intrinsic camera matrix, K, using matrix inversion and Cholesky factorisation [24, pp. 582]. A minimum of five independent constraints are required, which can be achieved by using multiple constraints from (2.26). This calibration assumes that the scene is projected to the image plane linearly. If this is not the case, the image has to be corrected as in (2.16). To do this the coefficients of L(r) and the centre of radial distortion have to be determined. This is usually done by an iterative minimisation of a cost function on the projection of a known pattern, e.g. a chessboard pattern or a Tsai grid [24, pp. 192]. Complete algorithms are proposed by Heikkilä and Silvén in 1997 [26] and Zhang in 2000 [27]. 
There are different tools for camera calibration available. Some of the most popular ones [28] are the Camera Calibration Toolbox for Matlab1 and tools available in the OpenCV library [29]. 1http://www.vision.caltech.edu/bouguetj/calib_doc/ 11 http://www.vision.caltech.edu/bouguetj/calib_doc/ 2. Theory 2.2 Visual Odometry VO is the process of estimating the ego-motion based on consecutive camera images. The classical approach is to detect distinctive projected world points, the points between images, and use the pixel coordinates of the points to determine the ego- motion. These methods are usually called feature based methods. 2.2.1 Feature based methods To relate the pose of two camera views, epipolar geometry can be used. To denote camera matrices, projected points and other quantities related to the second view, ′ will be used. All cameras are assumed to project world points onto the image plane linearly. Epipolar geometry Epipolar geometry is independent of the scene structure, and only depends on the internal and external parameters of the two cameras, i.e. the camera matrix from (2.12). A world point, X, is projected onto the image plane of two views at x and x′. The two camera centres, C and C′, the world point and the projected points will be coplanar [24, pp. 239], as can be seen in figure 2.5. Call this plane, π, the epipolar plane. Knowing that the projected points have to be on the epipolar plane, the search for a matching point for x in the second view, C′, is limited to the line where the image plane intersects the epipolar plane, the epipolar line, as is shown in figure 2.5. The epipoles, e and e′, are the points where the baseline, the line between the two camera centres, intersects each image plane. C C′ X x x′ π Baseline X x X? X? e e′ l′ Figure 2.5: The left figure is showing the two camera view centres, C and C′, the world point, X, the projected image points, x and x′, and the epipolar plane π. The right figure shows the epipolar points, e and e′, the line of the second view, l′, and how all world points in the direction of X must be projected onto this line. The fundamental matrix, F , is a central part of epipolar geometry. The fundamental matrix relates the two views as x′TFx = 0, where x and x′ is the projection of a world point, X, in the first and second view, respectively. The fundamental matrix 12 2. Theory is a 3× 3 matrix of rank 2 [24, pp. 239]. From the fundamental matrix the relative pose between two views can be calculated, up to a scale. The fundamental matrix is independent of scene structure, but can be calculated from scene points [24, pp. 239]. In a pair of images, any world point, X, projected onto one of the images, x, the matching point, x′, on the other image has to be on the corresponding epipolar line, l′. Thus there is a mapping x 7→ l′ [24, pp. 242]. According to (2.7) the world point, X, is projected to an image point, x, as x = PX. The ray, back-projected from x to X, can be written as [24, pp. 162] X(λ) = P+x + λC, (2.33) where P+ the pseudo-inverse of P , C is the null-vector of P and also the camera centre, defined by PC = 0, and λ is a parametrisation of the ray. The camera centre is back projected onto the line at C, when λ → ∞, and as the world point, P+x, when λ = 0. These two points are projected in the second view as P ′C and P ′P+x. These two points lie on the epipolar line of the second view, l′, thus l′ = (P ′C)× (P ′P+x). 
As the point P ′C is the epipole of the second view, e′, this can be written as l′ = [e′]×(P ′P+)x = Fx, which gives F = [e′]×(P ′P+), (2.34) and the map x 7→ l′ l′ = Fx. (2.35) [e′]× denotes the skew-symmetric of e′. The skew-symmetric matrix of a vector a = (a1, a2, a3)T is defined as [24, pp. 581] [a]× =  0 −a3 a2 a3 0 −a1 −a2 a1 0  , (2.36) which relates to the cross product as a × b = [a]×b. (2.37) To find a method for calculating F from image coordinates alone, consider two cameras in the coordinate system of the first camera, P = K[I|0] and P ′ = K ′[R|t], then [24, pp. 244] P+ = [ K−1 0T ] C = ( 0 1 ) (2.38) and, as the epipole of the second camera, e′, is the projection of the first camera centre, P ′C, F = [P ′C]×P ′P+ = [K ′t]×K ′RK−1 = K ′−T [t]×RK−1 = K ′−TR[RT t]×K−1 = K ′−TRKT [KRT t]×. (2.39) 13 2. Theory Then, as [24, pp. 244] e = P ( −RT t 1 ) = KRT t and e′ = P ′ ( 0 1 ) = K ′t, (2.40) F = [e′]×K ′RK−1 = K ′−T [t]×RK−1 = K ′−TR[RT t]×K−1 = K ′−TRKT [e]×. (2.41) If x and x′ correspond, x′ will lie on the epipolar line l′ = Fx, thus x′T l′ = 0 [24, pp. 245], which gives x′TFx = 0. (2.42) Writing x = (x, y, 1)T and x′ = (x′, y′, 1)T and denoting fnm as the elements of F , (2.42) gives rise to the linear equation x′xf11 + x′yf12 + x′f13 + y′xf21 + y′yf22 + y′f23 + xf31 + yf32 + f33 = 0, (2.43) which can be expressed by a vector multiplication as( x′x, x′y, x′, y′x, y′y, y′, x, y, 1 ) f = 0, (2.44) where f is a 9-vector of the elements in F in row-major order. From a set of n points, a set of linear equations can be written as Af =  x′1x1 x′1y1 x′1 y′1x1 y′1y1 y′1 x1 y1 1 ... ... ... ... ... ... ... ... ... x′nxn x′nyn x′n y′nxn y′nyn y′n xn yn 1  f = 0. (2.45) For an exact solution to exist for this, A has to be of rank 8, thus a minimum of 8 points are required to solve this equation. Since there is noise in the image coordinates from quantisation errors and other sources, more points can be used to find a better estimate of f . In this case A may be of rank 9. A solution can be found by SVD, A = UDV T , where the last column of V will be the singular vector, corresponding to the smallest singular value of A, and also the least-squares solution of f , which minimises ‖Af‖, subject to ‖f‖ = 1 [24, pp. 280]. The solution for f can then be decomposed to F . The fundamental matrix is a singular matrix of rank 2, and many applications of it relies on that fact [24, pp. 280]. Solving the linear equations of from (2.45) does not guarantee this, and steps to enforce this should be performed. The fundamental matrix obtained from the linear equations can be modified to satisfy this by replacing F with F ′ that minimises the Frobenius norm ‖F − F ′‖ subject to detF ′ = 0. This can be obtained by the SVD of F = UDV T , where D is a 3 × 3 diagonal matrix with the singular values of F in descending order on the diagonal. Replacing the smallest singular value in D with 0, calling it D′, and constructing F ′ = UD′V T , then F ′ will minimise the Frobenius norm ‖F − F ′‖ [24, pp. 281]. Image coordinates can come in different formats, some define the upper left corner of the image as the origin and others define the centre of the image as the origin. It can 14 2. Theory be shown that this, as well as the magnitude of the pixel coordinates, influence the result when retrieving the fundamental matrix using the 8-point algorithm described above [24, pp. 281]. 
This is due to the difference in magnitude of the elements of A which are two elements multiplied, like x′x and x′y, compared to single elements, like x and y. The condition number of A will be higher when the magnitude of the elements differs more, which will lead to bigger errors when enforcing rank 2 on the fundamental matrix [30]. Therefore it is considered an essential step to normalise the image points [24, pp. 281]. A suggested normalisation x̂i = Txi and x̂′i = T ′x′i is to construct the transformation T and T ′ such that it will place the points with the centroid at the origin and with the Root Mean Square (RMS) distance equal to √ 2 [24, pp. 282]. Then perform the steps of the 8-point algorithm with the normalised points and retrieve the normalised fundamental matrix F̂ ′. Finally retrieve the fundamental matrix corresponding to the original data as F = T ′T F̂ ′T [24, pp. 282]. This is called the normalised 8-point algorithm for F . A specialisation of the fundamental matrix is the essential matrix, E. The funda- mental matrix can be seen as a generalisation of the essential matrix, where the assumption of a calibrated camera is removed. Consider a camera matrix P = K[R|t], and a world point projected to an image point as x = PX. If the intrinsic matrix, K, is known, its inverse can be applied to the image point which gives x̂ = K−1x. This gives x̂ = [R|t]X, where x̂ is the image point in normalised coordinates [24, pp. 257]. The camera matrix K−1P = [R|t] is called the normalised camera matrix [24, pp. 257]. A pair of normalised camera matrices would then, according to (2.39), result in an essential matrix as E = [t]×R = R[RT t]×. (2.46) The essential matrix is defined as x̂′TEx̂ = 0. (2.47) Substituting x̂ and x̂′ with the corresponding points x↔ x′ gives x′TK ′−TEK−1x = 0, which compared to x′TFx = 0 gives E = K ′TFK. (2.48) Both the translation, t, and the rotation, R, in the essential matrix E = [t]×R has 3 DoF, however, the essential matrix has only five due to a scale ambiguity. Nistér in 2004 proposed an efficient solution to determine the essential matrix with five correspondences [31], which is the minimal set to fully solve the 5 DoF essential matrix. It can be shown that the essential matrix is a matrix where two of its singular values are equal and the third is equal to zero [24, pp. 258]. The two equal singular values can be chosen freely, due to the scale ambiguity, and are often set to 1. It can also be shown that for a given essential matrix, with the SVD E = Udiag(1, 1, 0)V T , 15 2. Theory and a first camera matrix P = [I|0], there are four possible solutions for the second camera matrix [24, pp. 259]: P ′1 = [ UWV T u3 ] P ′2 = [ UWV T −u3 ] P ′3 = [ UW TV T u3 ] P ′4 = [ UW TV T −u3 ] , (2.49) where W = 0 −1 0 1 0 0 0 0 1  . (2.50) Of the four different solutions, only one is correct. The different solutions can be interpreted as is shown in figure 2.6. To determine which of the solutions is correct, C C′ C C′ C C′ C C′ Figure 2.6: Illustration of how the four different choices of camera matrices for a given essential matrix can be interpreted. the respective translation and rotation can be used to triangulate the positions of the points used to determine the essential matrix. The solution which renders the points in front of both cameras is the correct one, top left in figure 2.6. A world point is projected through two cameras onto their image planes as x = PX and x′ = P ′X. These equations can be combined into a form AX = 0. 
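The linear triangulation system AX = 0 referred to here is derived in detail directly after this sketch. Assuming such a triangulation routine, the selection among the four candidate poses of (2.49) can be sketched as below. The code uses Eigen, assumes image points given in normalised coordinates x̂ = K⁻¹x, and the function names are ours; in practice, library routines such as OpenCV's cv::recoverPose perform the equivalent steps.

```cpp
#include <Eigen/Dense>
#include <utility>
#include <vector>

// Triangulation of a world point from two views using the linear system
// A X = 0 of (2.51)-(2.52), solved by SVD.
static Eigen::Vector3d triangulate(const Eigen::Matrix<double, 3, 4>& P,
                                   const Eigen::Matrix<double, 3, 4>& Pp,
                                   const Eigen::Vector2d& x,
                                   const Eigen::Vector2d& xp) {
    Eigen::Matrix4d A;
    A.row(0) = x(0)  * P.row(2)  - P.row(0);
    A.row(1) = x(1)  * P.row(2)  - P.row(1);
    A.row(2) = xp(0) * Pp.row(2) - Pp.row(0);
    A.row(3) = xp(1) * Pp.row(2) - Pp.row(1);
    Eigen::JacobiSVD<Eigen::Matrix4d> svd(A, Eigen::ComputeFullV);
    return Eigen::Vector4d(svd.matrixV().col(3)).hnormalized();
}

// Decompose E into the four (R, t) candidates of (2.49)-(2.50) and keep
// the one that places the most triangulated points in front of both
// cameras (the cheirality test).
std::pair<Eigen::Matrix3d, Eigen::Vector3d>
recoverPoseFromE(const Eigen::Matrix3d& E,
                 const std::vector<Eigen::Vector2d>& x,
                 const std::vector<Eigen::Vector2d>& xp) {
    Eigen::JacobiSVD<Eigen::Matrix3d> svd(E, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d U = svd.matrixU(), V = svd.matrixV();
    Eigen::Matrix3d W;
    W << 0, -1, 0,
         1,  0, 0,
         0,  0, 1;
    Eigen::Matrix3d R1 = U * W * V.transpose();
    Eigen::Matrix3d R2 = U * W.transpose() * V.transpose();
    if (R1.determinant() < 0) R1 = -R1;   // enforce proper rotations
    if (R2.determinant() < 0) R2 = -R2;
    Eigen::Vector3d u3 = U.col(2);

    Eigen::Matrix<double, 3, 4> P = Eigen::Matrix<double, 3, 4>::Zero();
    P.block<3, 3>(0, 0).setIdentity();    // first camera P = [I | 0]

    const Eigen::Matrix3d Rs[4] = {R1, R1, R2, R2};
    const Eigen::Vector3d ts[4] = {u3, -u3, u3, -u3};
    std::pair<Eigen::Matrix3d, Eigen::Vector3d> best;
    int bestCount = -1;
    for (int k = 0; k < 4; ++k) {
        Eigen::Matrix<double, 3, 4> Pp;
        Pp << Rs[k], ts[k];               // second camera P' = [R | t]
        int count = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            Eigen::Vector3d X = triangulate(P, Pp, x[i], xp[i]);
            if (X.z() > 0 && (Rs[k] * X + ts[k]).z() > 0) ++count;
        }
        if (count > bestCount) { bestCount = count; best = {Rs[k], ts[k]}; }
    }
    return best;
}
```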
This form can be obtained by the cross product x × (PX) = 0, which written out gives [24, pp. 312]

x(p3TX) − (p1TX) = 0
y(p3TX) − (p2TX) = 0
x(p2TX) − y(p1TX) = 0,    (2.51)

where piT denotes the i:th row of P. Combined with the analogue for x′ × (P′X) = 0, this can be written in the form AX = 0 with

A = [ x p3T − p1T
      y p3T − p2T
      x′ p′3T − p′1T
      y′ p′3T − p′2T ].    (2.52)

This can be solved using SVD, A = UDV^T, where the last column of V will correspond to the least-squares solution of X [24, pp. 312]. A more efficient algorithm for obtaining t and R is proposed by Nistér in 2004 [31].

Feature detection and description

To use the methods described, points in the image planes of successive images that are projections of the same point on a physical object must be identified. There are two generally used approaches to find image points corresponding to the same world point: feature matching and feature tracking.

Feature matching means independently detecting interesting feature points in two images, constructing a feature descriptor for each point, which should describe the point in a unique way but still be invariant to different views, and then matching the feature points between the two images using their descriptors.

Feature detectors can be divided into two groups: corner detectors and blob detectors. A corner is defined as a point where two or more edges intersect [32]. A blob is a pattern in the image which differs from its immediate neighbourhood in terms of intensity, colour and texture, and is neither an edge nor a corner [32]. Desired properties for feature detectors are localisation accuracy (the pixel coordinate of the detected feature should be precise); repeatability (the same feature should be detected in different images); computational efficiency; robustness to noise, compression artifacts and blur; distinctiveness (so that the feature can be matched accurately across different images); and invariance to both photometric and geometric changes [32]. Corner detectors are in general faster to compute and give a more accurate image position, but find less distinctive features than blob detectors. Blob detectors are usually less invariant to changes in scale and viewpoint [32]. The choice of feature detector depends on computational constraints, environment type and motion baseline, and should be carefully considered [32].

There are many feature detectors available. Commonly used corner detectors are Harris [33], Features from Accelerated Segment Test (FAST) [34] and Shi-Tomasi [35], and commonly used blob detectors are Scale Invariant Feature Transform (SIFT) [36], Speeded Up Robust Features (SURF) [37] and Center Surround Extrema (CENSURE) [38]. Feature detectors usually work in two steps. The first step is to apply a feature-response function on the entire image [32]. For instance, Harris uses the corner response function and SIFT uses the Difference-Of-Gaussian (DoG) operator [32]. The second step is to localise all local extrema points on the output of the first step by applying nonmaxima suppression [32]. To achieve invariance to scale, the feature detector is often applied to images of different scale; however, this leads to loss of accuracy in the pixel coordinate due to the lower resolution and is computationally heavy [38]. CENSURE takes another approach, using the Laplacian across scales as the feature response, which calculates the feature response at every pixel at all scales and achieves real-time performance [38]. A summary of the properties and performance of different feature detectors is given in table 2.1.

Detector     Corner det.  Blob det.  Rot. inv.  Scale inv.  Affine inv.(1)  Repeatability  Loc. accuracy  Robustness  Efficiency
Harris       ×            -          ×          -           -               +++            +++            ++          ++
Shi-Tomasi   ×            -          ×          -           -               +++            +++            ++          ++
FAST         ×            -          ×          ×           -               ++             ++             ++          ++++
SIFT         -            ×          ×          ×           ×               +++            ++             +++         +
SURF         -            ×          ×          ×           ×               +++            ++             ++          ++
CENSURE      -            ×          ×          ×           ×               +++            ++             +++         +++

Table 2.1: Properties and performance of feature detectors [32].
(1) Not truly affine invariant, but found to be invariant up to certain changes in viewpoint.

For each detected feature point a compact descriptor based on the region around the point is calculated. The simplest feature descriptor is a descriptor of the appearance of the point, i.e. the intensity of each pixel in a region around the feature point. Then the sum of squared differences or the normalised cross correlation can be used to compare the descriptors [32]. However, these descriptors are not very robust to changes in orientation, scale and viewpoint [32]. More elaborate descriptors are the SIFT descriptor [36], the SURF descriptor [37], Binary Robust Independent Elementary Features (BRIEF) [39], Oriented FAST and Rotated BRIEF (ORB) [40] and Binary Robust Invariant Scalable Keypoints (BRISK) [41]. Common to these descriptors is that they generate a vector, usually 64 or 128 elements long. SIFT and SURF produce vectors of real numbers, while, as their names indicate, BRISK, ORB and BRIEF produce binary vectors. The SIFT descriptor algorithm works by dividing the region around the feature point into a 4×4 grid; for each cell a histogram of gradients over eight orientations is calculated, and the histograms are concatenated into the 128 element descriptor vector [32]. It is desired for a descriptor to be invariant to rotation. In SIFT, this is achieved by calculating the image gradient orientation of a region around the feature point and rotating the calculation of the histograms accordingly [36].

The sets of features from two images can then be exhaustively matched, using a similarity measure on the feature descriptors. For SIFT and SURF the Euclidean distance can be used [32], and for BRIEF, ORB and BRISK the Hamming distance can be used [39][40][41]. The complexity of exhaustive matching is O(n2), which becomes impractical when the number of features becomes large [32]. Binary descriptors easily outperform floating point descriptors in speed of matching, as the Hamming distance of a binary vector can be calculated extremely fast on a modern CPU in the form of a bitwise XOR operation [41]. The matching process can also be sped up by using an indexing structure, such as a search tree or a hash table [32]. If there is a known motion model, where other sensors, like an IMU, can be used to estimate the motion, then, given the first image point, the epipolar line of the second image, on which the matching point should recur, can be calculated as shown in section 2.1.2, and the search for a matching feature point can be limited [32].
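As an illustration of descriptor-based matching, the sketch below detects ORB keypoints in two images and matches their binary descriptors with the Hamming distance using OpenCV (assuming OpenCV 3.x; the file names are placeholders). Note that this is only an example of the matching approach described above; the implementation presented in chapter 3 instead uses feature tracking.

```cpp
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Sketch of feature matching with a binary descriptor: ORB keypoints and
// descriptors are computed in two images and matched exhaustively with
// the Hamming distance. Cross-checking keeps only mutual best matches,
// which removes some outliers before the RANSAC step described next.
int main() {
    cv::Mat img1 = cv::imread("frame_000000.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("frame_000001.png", cv::IMREAD_GRAYSCALE);
    if (img1.empty() || img2.empty()) return 1;

    cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);   // up to 2000 features
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    orb->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    // Brute-force matcher with Hamming distance and cross-check enabled.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches);

    // matches[k].queryIdx / trainIdx index into kp1 / kp2; the resulting
    // point correspondences are the input to the epipolar geometry and
    // RANSAC steps described in the following paragraphs.
    return 0;
}
```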
Feature tracking is a different approach to find matching points. The tracker is initialised with a set of feature points, which are then tracked from image to image. Often the Kanade–Lucas–Tomasi (KLT) [35] feature tracker is used [32]. These methods require that the images are taken in close proximity [32]. The KLT tracker estimates the displacement of a point by estimating an affine image transformation of a region in the image around the feature point, the feature window. The affine image motion assumes that a point, x = (x, y)T, in the first image, I1, will move to the point Ax + d in the second image, I2, and that

I2(Ax + d) = I1(x),    (2.53)

where d is the feature window displacement, A = I + D, and

D = [ dxx  dxy
      dyx  dyy ].    (2.54)

The displacement, d, and the deformation matrix, D, are then found by minimising [35]

ε = ∫∫_W [I2(Ax + d) − I1(x)]^2 w(x) dx,    (2.55)

where W is the feature window and w(x) is a weighting function. The weighting function can e.g. be equal to 1, or a Gaussian-like function to emphasise the central area of the feature window [35]. The integral is linearised using a truncated Taylor expansion and then minimised using the Newton-Raphson method [35].

RANSAC

Both feature matching and feature tracking will produce some incorrect point correspondences between images, outliers, which will bias the result if included in the estimation of the motion. Random Sample Consensus (RANSAC) [42] has been established as the standard method for outlier rejection [32]. The idea behind RANSAC is to generate a number of hypothesis models by repeatedly sampling a randomly selected minimal set of correspondences, and to count the total number of correspondences in consensus with each estimated hypothesis [32]. The correspondences in consensus with the hypothesis generating the most matches are selected as inliers. An example of how the RANSAC sampling process could look when estimating a line in a plane is illustrated in figure 2.7.

Figure 2.7: Illustration of how RANSAC would work to estimate a line. The figure shows the two randomly selected points in blue, rejected outliers in red and inliers in green for four different hypotheses. Here the inliers of the hypothesis in the bottom right image are selected as the final inliers, as they are more numerous than in the other hypotheses.

The number of samples, N, required for finding a correct hypothesis with a probability of success, P, is determined by the fraction of outliers, ε, and the number of points in the minimal set, s [32]:

N = log(1 − P) / log(1 − (1 − ε)^s).    (2.56)

As can be seen in figure 2.8, the number of required samples increases exponentially with the number of points in the minimal set; it is thus desirable for the minimal set required to generate a hypothesis to be as small as possible.

[Plot: number of RANSAC iterations versus fraction of outliers, for 1-point, 5-point and 8-point minimal sets.]
Figure 2.8: A plot of the number of required RANSAC iterations for a 99% probability of success, at different fractions of outliers and for different numbers of points in the minimal set.
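In practice, five-point RANSAC estimation of the essential matrix and the subsequent pose recovery are available as library routines. The sketch below uses OpenCV (assuming OpenCV 3.x and a known calibration matrix K); the helper function simply evaluates (2.56) and reproduces the iteration counts discussed in the following paragraphs, and the function name, thresholds and probabilities are example choices.

```cpp
#include <opencv2/calib3d.hpp>
#include <cmath>
#include <cstdio>
#include <vector>

// Number of RANSAC iterations according to (2.56).
int ransacIterations(double P, double outlierRatio, int s) {
    return static_cast<int>(std::ceil(
        std::log(1.0 - P) / std::log(1.0 - std::pow(1.0 - outlierRatio, s))));
}

// Sketch: estimate the essential matrix from matched pixel coordinates
// with RANSAC and recover the relative pose. cv::findEssentialMat uses
// the five-point solver internally, and cv::recoverPose performs the
// decomposition and cheirality test described above.
void estimateRelativePose(const std::vector<cv::Point2f>& pts1,
                          const std::vector<cv::Point2f>& pts2,
                          const cv::Mat& K) {
    std::printf("5-point RANSAC iterations (P = 0.99, 50%% outliers): %d\n",
                ransacIterations(0.99, 0.5, 5));   // prints 145

    cv::Mat inlierMask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC,
                                     0.99 /*prob*/, 1.0 /*px threshold*/,
                                     inlierMask);
    cv::Mat R, t;   // rotation and unit-norm translation; scale is unresolved
    cv::recoverPose(E, pts1, pts2, K, R, t, inlierMask);
}
```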
287], JJT = (Fx)2 1 + (Fx)2 2 + (F Tx′)2 1 + (F Tx′)2 2, (2.58) where (Fx)j represents the jth element of the vector Fx. Thus, the Sampson cost function for a point correspondence distance to the epipolar line, given a fundamental matrix is (x′TFx)2 (Fx)2 1 + (Fx)2 2 + (F Tx′)2 1 + (F Tx′)2 2 . (2.59) As it is desired for the minimal set of corresponding points to be as small as pos- sible, to reduce the number of required RANSAC iterations, algorithms requiring less correspondences than the normalised 8-point algorithm, like the highly efficient 5-point algorithm proposed by Nistér in 2004, could be used. Using the 5-point algorithm, which requires 5 correspondences, compared to the normalised 8-point algorithm, which requires 8, would reduce the number of required iterations from 1177 to 145, with a 99% success rate and 50% outliers, according to (2.56). In 2009 Scaramuzza et al. proposed a solution for RANSAC for on-road vehicles, only requiring 1 point [44]. With 1 correspondence, according to (2.56), the number of required RANSAC iterations with a 99% success rate and 50% outliers is 7. The algorithm requires only 1 correspondence, as a restricted motion model, assuming circular motion, is used [44]. As can be seen in figure 2.9a, a planar motion can be described by three parameters, the yaw angle θ and the polar coordinates (ρ, ϕ)T 21 2. Theory [44]. Since, when using a monocular camera, the scale factor is unknown, ρ can be set to 1 [44]. Assuming the camera to move along a circumference, perpendicular to the Instantaneous Centre of Rotation (ICR), as in figure 2.9a, gives ϕ = θ/2, thus, only θ has to be estimated, which can be done with a 1 point correspondence [44]. The Ackermann steering configuration, which is common in automobiles [45, ρ θ ϕ = θ/2 θ π−θ 2 π−θ 2 ICR RICR C C′ (a) Description of circular motion. ICR RICR Copt Creal (b) The Ackermann steering principle. Figure 2.9: Illustrations of how a circular motion can be used describe the motion of a vehicle using the Ackermann steering principle. pp. 37], ensures an ICR will exist, located in the center of the rear axle [45, pp. 68]. Thus, the above assumptions made about the motion in the 1 point RANSAC are reasonable for cars, if the camera is placed as Copt in figure 2.9b [44]. However, in practice, the steering angle of cars is small and thus the radius of the curvature is big, which allows relaxation of constraint on the placement of the camera to the principal axis of the camera being perpendicular to the rear-wheel axle [44]. The circular motion can be described by a rotational matrix R and a translation matrix t as R =  cos θ 0 sin θ 0 1 0 − sin θ 0 cos θ  t = ρ · sinϕ 0 cosϕ  , (2.60) if rotation is about the y-axis and translation in the xz plane is considered. It is known from (2.47) that the essential matrix, E, relates two corresponding image points, x = (x, y, 1)T and x′ = (x′, y′, 1)T , as x′Ex = 0, and from (2.46), that E = [t]×R. Using (2.60), (2.46), and the constraint ϕ = θ/2, an expression for the essential matrix can be obtained: E = ρ ·  0 − cos θ 2 0 cos θ 2 0 sin θ 2 0 sin θ 2 0  . (2.61) 22 2. Theory Inserting this into (2.47) gives sin θ2(y′ + y) + cos θ2(y′x− x′y) = 0, (2.62) and the rotation angle can then be obtained: θ = −2 arctan ( y′x− x′y y′ + y ) . (2.63) Bundle Adjustment Bundle Adjustment (BA) is a method for minimising the reprojection error of fea- tures detected in an image by adjusting the camera pose estimate and feature 3D position estimates [24, pp. 434]. 
Bundle Adjustment

Bundle Adjustment (BA) is a method for minimising the reprojection error of features detected in an image by adjusting the camera pose estimates and the 3D positions of the features [24, pp. 434]. Consider a set of camera poses, described by camera matrices, P_i, and a set of 3D feature points, X_j, detected at image coordinates x_{i,j}. The reprojection error is the distance between the detected feature point in the image, x_{i,j}, and the image coordinates of the feature reprojected into the image given the camera pose and the 3D feature position, P_i X_j. Since this generally cannot be satisfied exactly due to noisy measurements, the objective is to find estimates of the camera matrices and feature point positions, \hat P_i and \hat X_j respectively, that minimise the total reprojection error. Assuming Gaussian noise on the measurements, the maximum likelihood estimates can be found by minimising the sum of the squared reprojection errors:

\min_{\hat P_i, \hat X_j} \sum_{i,j \in C} d(\hat P_i \hat X_j, x_{i,j})^2,    (2.64)

where d(a, b) is the geometric distance in the image between the homogeneous image points a and b, and C is the set of pairs i, j for which feature X_j has been seen from pose i, i.e. for which x_{i,j} exists.

BA needs a good initialisation [24, pp. 435]. This estimate can be provided by triangulating new features when they have been seen in two consecutive frames whose poses are estimated using epipolar geometry, as detailed above. It has been shown that BA is the optimal non-linear least-squares SLAM algorithm [46]. However, the complexity of minimising (2.64) with the Levenberg-Marquardt algorithm is cubic in the number of poses and features [46][24, pp. 435], which makes it impractical to optimise the full graph online. A way of reaping the benefits of the optimality of BA, while maintaining constant complexity and thereby making it possible to use BA for VO, is to partition the sequence of images by windowing the latest n frames and only performing BA for the corresponding poses and the features seen in those images [24, pp. 453]. This is called local BA.

2.2.2 Direct methods

Direct methods optimise the geometry directly on the image intensities, which makes it possible to use all information in the image [23]. In this way, direct methods circumvent the limitation of feature based methods that "only information that conforms to the feature type can be used" [23]. Direct methods offer higher accuracy and robustness than feature based methods, in particular in areas with few feature points, and in addition provide more information about the geometry of the environment [23]. Direct methods for visual odometry are well established for RGB-D and stereo cameras [23], such as the solution by Klose et al. in 2013 [47], and solutions for monocular cameras have been proposed by Pizzoli et al. in 2014 [48] and Stühmer et al. in 2010 [49]. All of these are computationally demanding and require a state-of-the-art Graphics Processing Unit (GPU) to run in real time [23]. In 2013, Engel et al. proposed a semi-dense solution for direct visual odometry with a monocular camera [13], which significantly reduces the computational complexity compared to previous work, but does not build a globally consistent map including loop closures [23]. The approach of the method is to spend computations where the information gain is maximised [13]. This is done by calculating a semi-dense inverse depth map only for the regions of the image with non-negligible gradient [13]. This solution for visual odometry was incorporated into a full solution of the SLAM problem by Engel et al. in 2014, which corrects accumulated scale drift and closes loops [23].
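To illustrate the basic ideas of operating directly on intensities and of restricting computation to pixels with non-negligible gradient, the OpenCV sketch below selects high-gradient pixels in a reference image and evaluates a photometric (intensity-difference) cost for a candidate motion. It is a deliberately simplified illustration, not the method of [13] or [23]: the warp is a pure 2D translation instead of a full rigid-body warp with inverse depth, the images are assumed to be 8-bit grayscale, and the function names are ours.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Select pixels whose gradient magnitude exceeds a threshold.
std::vector<cv::Point> highGradientPixels(const cv::Mat& gray, double thresh) {
    cv::Mat gx, gy;
    cv::Sobel(gray, gx, CV_32F, 1, 0);
    cv::Sobel(gray, gy, CV_32F, 0, 1);
    std::vector<cv::Point> pts;
    for (int y = 1; y < gray.rows - 1; ++y)
        for (int x = 1; x < gray.cols - 1; ++x) {
            float g = std::hypot(gx.at<float>(y, x), gy.at<float>(y, x));
            if (g > thresh) pts.emplace_back(x, y);
        }
    return pts;
}

// Photometric cost of a candidate translation (dx, dy): the sum of squared
// intensity differences over the selected pixels. A real direct method would
// instead warp each pixel using the camera motion and its inverse depth.
double photometricCost(const cv::Mat& I1, const cv::Mat& I2,
                       const std::vector<cv::Point>& pts, int dx, int dy) {
    double cost = 0.0;
    for (const cv::Point& p : pts) {
        int x2 = p.x + dx, y2 = p.y + dy;
        if (x2 < 0 || y2 < 0 || x2 >= I2.cols || y2 >= I2.rows) continue;
        double r = double(I2.at<uchar>(y2, x2)) - double(I1.at<uchar>(p.y, p.x));
        cost += r * r;
    }
    return cost;
}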
2.3 Simultaneous Localisation And Mapping

The SLAM problem consists of building a map and localising within that map simultaneously. There are several kinds of maps that can be used to represent the surroundings. Maps consisting of the locations of discrete point-like landmarks, i.e. point clouds, are well suited to camera data [50]. In 2001, Dissanayake et al. showed that it is possible to build accurate maps with only relative measurements, assuming Gaussian noise, linear models and static landmarks [51]. More specifically, they proved that the only lower limit on the variance of the estimates of the pose and landmarks is the variance of the initial pose.

The SLAM problem can be formulated as calculating, or approximating, the probability distribution

p(x_{1:t}, m \mid z_{1:t}, u_{1:t}, x_0),    (2.65)

where x_{1:t} is the trajectory of poses x_1 to x_t, m is the map, i.e. the positions of the observed landmarks, z_{1:t} are all measurements, u_{1:t} are all the motion commands issued and x_0 is the initial pose [9, pp. 310] [52]. Using a motion model appropriate for the given system, the motion commands can be translated into an odometry, and u is indeed sometimes referred to as odometry. Since the initial position is usually considered arbitrary, x_0 can be set to the origin with zero uncertainty. An equally important variant is the online SLAM problem of finding

p(x_t, m \mid z_{1:t}, u_{1:t}, x_0),    (2.66)

where only the current pose is sought together with the map, given all measurements and motion commands [9, pp. 309]. Factorising the right-hand side of (2.66) by applying Bayes' theorem, omitting x_0 for brevity, gives

p(x_t, m \mid z_{1:t}, u_{1:t}) = \frac{p(x_t, m \mid z_{1:t-1}, u_{1:t}) \, p(z_t \mid x_t, m)}{p(z_t \mid z_{1:t-1}, u_{1:t})}.    (2.67)

The observation model, p(z_t \mid x_t, m), is the probability of observing z_t at time t given a location x_t and a map. p(x_t, m \mid z_{1:t-1}, u_{1:t}) is the motion prediction, given the new controls, which can be written as

p(x_t, m \mid z_{1:t-1}, u_{1:t}) = \int p(x_{t-1}, m \mid z_{1:t-1}, u_{1:t-1}) \, p(x_t \mid x_{t-1}, u_t) \, dx_{t-1}.    (2.68)

This clearly factors the prediction into the motion model, p(x_t \mid x_{t-1}, u_t), and the previous posterior, p(x_{t-1}, m \mid z_{1:t-1}, u_{1:t-1}).

The main difference between VO using BA and V-SLAM is that V-SLAM in general tries to create a globally consistent map, as opposed to the local consistency of VO [28], thereby reducing the drift of the ego-motion. A way of ensuring the global consistency of the map is to detect the return to a previously visited location, a loop closure.

2.3.1 Loop closure

Because of the drift inherent in successively building a map from incremental, imperfect odometry, the map representation will never be perfectly aligned when returning to a previously visited location. To make the map globally consistent, the SLAM algorithm needs to explicitly connect the new location with the previously visited location to which it corresponds, and adjust the path in between to take the new connection into account, in order to make the parts of the map involved consistent. The process thus has two basic steps: loop detection and loop closing.

It is important to note that while loop closures are very useful in a SLAM system, introducing erroneous loop closures can cause catastrophic errors in the map and ego-motion estimates that are hard to recover from, if past data associations cannot be undone. It is thus very important not to introduce false positives when closing loops. A common way of measuring the performance of a loop detector is the recall at 100% precision, i.e.
the proportion of all true loop closures detected without accepting any false positives. With good VO, only occasional loop closures are needed to straighten out the map and trajectory.

In the case of monocular V-SLAM, there are three general categories of loop detection [53]: map-to-map, image-to-map and image-to-image.

Map-to-map loop detection consists of matching the features of two submaps, considering the feature descriptors as well as the relative geometric positions of the features. This method has been used in an EKF SLAM system by Clemente et al. in 2007 [54], and in a graph-based SLAM system by Eade and Drummond in 2008 [55]. When the submaps have too few features in common to match, 100% precision cannot be maintained [53].

Image-to-map loop detection aims to find correspondences between the latest image and features in a map. This method is used in a graph-based SLAM system by Lim et al. in 2014 [19] and in EKF SLAM with submaps by Williams et al. in 2008 [56]. Lim et al. used a vocabulary tree to find correspondences with earlier feature point locations, and when enough features match, geometric verification is performed using RANSAC to reject outliers [19].

Image-to-image loop detection, or appearance-only loop detection, works by matching the current image to previously seen images. One such algorithm is FAB-MAP 2.0, by Cummins and Newman in 2011 [21], further details of which are given below. Another variant of the image-to-image approach is matching a recent sequence of images to previous sequences, as implemented in SeqSLAM by Milford and Wyeth in 2012 [22]. In the case of vehicles travelling on roads, it is clear that sequences of images from a forward looking camera can provide useful information for resolving problems with visual aliasing, since vehicles in the same location will move along similar paths. However, the procedure used by Milford and Wyeth in 2012 [22], of taking the absolute difference between down-sampled and patch-normalised images, is highly sensitive to changes in view-point orientation [57]. A representation of the image differences that is less susceptible to such variations could improve the performance. One limitation common to the image-to-image systems examined here is that only a topological loop closure is achieved. If a geometric loop closure is sought, the relative pose between the recent image and the found match must also be computed.

FAB-MAP

FAB-MAP 2.0 [21], by Cummins and Newman in 2011, is the state-of-the-art method for loop closure detection [22][58]. It builds on FAB-MAP [20] by Cummins and Newman in 2008. Both use the Bag-of-Words (BoW) representation of images. The words consist of feature descriptor vectors detected in the image that have been quantised with respect to a vocabulary, i.e. each descriptor is replaced by the closest word, in the L2 sense, in the vocabulary. The vocabulary is built by clustering the descriptors found in a training set of images and choosing the centres of the clusters as the vocabulary words. The set of descriptors represented by each word corresponds to the Voronoi region in descriptor space around the word. An image is then a "bag" of the vocabulary words found in the image. The point of this procedure is to get a discrete representation of the visual words which can be indexed.
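A minimal sketch of the quantisation step is given below: each descriptor is assigned to its nearest vocabulary word in the L2 sense, and an image is then summarised by the set of word indices that occur in it. This is only meant to illustrate the BoW idea; the data layout (one descriptor per matrix row) and the function names are ours, and a practical system such as FAB-MAP 2.0 would use a large vocabulary with an efficient nearest-neighbour search rather than the linear scan used here.

#include <opencv2/opencv.hpp>
#include <set>

// Index of the vocabulary word (row of 'vocabulary') closest to 'descriptor'
// in the L2 sense. Linear scan for clarity.
int nearestWord(const cv::Mat& descriptor, const cv::Mat& vocabulary) {
    int best = 0;
    double bestDist = cv::norm(descriptor, vocabulary.row(0), cv::NORM_L2);
    for (int w = 1; w < vocabulary.rows; ++w) {
        double d = cv::norm(descriptor, vocabulary.row(w), cv::NORM_L2);
        if (d < bestDist) { bestDist = d; best = w; }
    }
    return best;
}

// Bag-of-words representation of an image: the set of vocabulary words whose
// Voronoi region contains at least one of the image's descriptors.
std::set<int> bagOfWords(const cv::Mat& descriptors, const cv::Mat& vocabulary) {
    std::set<int> words;
    for (int i = 0; i < descriptors.rows; ++i)
        words.insert(nearestWord(descriptors.row(i), vocabulary));
    return words;
}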
This allows for simple enumeration of the vocabulary words appearing in each image, as well as the creation of an inverted index describing in which locations each word appears. A common strategy for building a vocabulary is to initialise the clustering with uniformly distributed random words. Cummins and Newman in 2011 showed that a radius-based initialisation, which requires the initial guesses to be separated by a minimum distance, gives improved recall at a given precision [21].

The loop detection is done by matching the BoW description of the latest image to those previously seen, taking into account the uniqueness of the words and their co-visibility. FAB-MAP can either assume the features are independent of each other, the naive Bayes approximation, or condition each feature on at most one other feature, resulting in a tree-structured graphical model. The naive Bayes approach has one unique solution: the probability of observing a feature is proportional to its frequency in the training data. There are many possible tree structures for a given number of nodes; the one that most closely approximates the original distribution is the Chow-Liu tree [20]. Word occurrences are highly correlated between words, and accounting for the correlation in this approximate manner has been shown to improve results [20].

The probabilistic model underlying FAB-MAP is based on unobservable "scene elements", e_q, which give rise to visual word observations, z_q. These are binary variables that take the value 1 when the scene element is in view or when a word describing that scene element is detected, respectively. The detection of words is characterised by two parameters corresponding to the probabilities p(z_q = 1 | e_q = 1) and p(z_q = 1 | e_q = 0), the true and false positive probabilities. These two parameters, together with a training data set, are needed to initialise, i.e. build the vocabulary and tree, and run the FAB-MAP algorithm.

The observation derived from an image at time k is Z_k = {z_1, ..., z_v}, with v being the number of words in the vocabulary. A location L_i is modelled as the belief about the presence of the scene elements, p(e_1 = 1 | L_i), ..., p(e_v = 1 | L_i). This allows a formulation of the belief about the current location very similar to recursive Bayesian estimation. The probability of the ith location being the current location, given the observations Z_1 to Z_k, can be written

p(L_i \mid Z_{1:k}) = \frac{p(Z_k \mid L_i, Z_{1:k-1}) \, p(L_i \mid Z_{1:k-1})}{p(Z_k \mid Z_{1:k-1})},    (2.69)

where p(L_i | Z_{1:k-1}) is the location prior, p(Z_k | L_i, Z_{1:k-1}) the observation likelihood, and p(Z_k | Z_{1:k-1}) a normalising term. Commonly, the normalising term does not have to be calculated in Bayesian estimation, because the probability function covers all possible outcomes. In (2.69), however, the outcomes only cover previously visited locations, and the possibility that the measurement comes from an entirely new location remains. To determine this, the normalising term must also be evaluated [20]. FAB-MAP gives, for each previous location, a probability of it being a loop closure with the latest image, and those above a certain threshold are accepted as loop closures. FAB-MAP 2.0 also performs a verification of the epipolar geometry between the potential loop closures with RANSAC and weighs this into the ranking.

2.3.2 EKF SLAM

EKF SLAM is based on the EKF. The EKF applies the classic Kalman filter to a non-linear model by linearising the model using a first order Taylor expansion.
This form of SLAM was the first to be implemented. It is an online form of SLAM and uses feature-based maps. As EKF SLAM is based on the EKF, it assumes that the measurement noise and the motion model uncertainty have Gaussian distributions. EKFs tend to handle large uncertainties in the posterior poorly, since the linearisation may introduce errors too large to handle [9, pp. 312]. The EKF SLAM algorithm is also limited to using only positive information; it cannot take into account the absence of landmarks [9, pp. 313].

The state vector in EKF SLAM consists of the current pose, an n-dimensional vector, x_t, and the landmark positions, m. Call the combined M-dimensional state vector y_t = (x_t^T, m^T)^T. EKF SLAM calculates an approximation of the posterior p(y_t | z_{1:t}, u_{1:t}). This is done by following the standard EKF procedure outlined below. The mean and covariance of y_t are initialised in the origin, with zero variance for the pose states and infinite variance for the landmarks,

\mu_0 = \begin{pmatrix} 0 & \dots & 0 \end{pmatrix}^T \quad \text{and} \quad \Sigma_0 = \begin{bmatrix} 0 & 0 \\ 0 & \infty \cdot I \end{bmatrix}    (2.70)

for the mean and covariance matrix respectively. \Sigma_0 is an M \times M matrix in which the diagonal zero block is n \times n, and \infty \cdot I is a matrix with \infty on the diagonal. Then the predicted mean and covariance are calculated:

\bar\mu_t = g(\mu_{t-1}, u_t),    (2.71)

where \bar\mu_t is the predicted mean, \mu_{t-1} is the previous estimate of the mean, and g(\mu_{t-1}, u_t) is the motion model.

\bar\Sigma_t = G_t \Sigma_{t-1} G_t^T + R_t    (2.72)

Here, \bar\Sigma_t is the covariance of the prediction, G_t = \partial g(\mu_{t-1}, u_t) / \partial y_{t-1} is the Jacobian of the motion model, and R_t is the motion noise. This is how the covariance propagates in the linearised model [9, pp. 314].

The following assumes that the data association problem is solved by a feature tracker, loop closure algorithm or other method, so that the correspondences of the features are known and stored in a correspondence vector, c_t, where the element c_t^i is the index of the ith landmark seen at time t, 1 ≤ i ≤ N_t, with N_t the number of landmarks observed at time t. If a landmark has never been seen before, the estimated mean and covariance of the landmark position are appended to the state vector. In the case of bearing-only measurements, there is no sensible estimate of the position of the landmark until it has been seen multiple times and its position can be triangulated.

The mean and covariance can then be updated by calculating the estimated measurement for each individual landmark seen at time t, \hat z_t^i = h_j(\bar\mu_t), where h_j is the measurement function and j = c_t^i, that is, the expected measurement given what is known about the landmarks so far. For a camera, this is the pixel coordinates of the landmark when projecting it into the camera at the current pose. For each landmark the update is then carried out:

K_t^i = \bar\Sigma_t^{i-1} H_t^{i\,T} \left( H_t^i \bar\Sigma_t^{i-1} H_t^{i\,T} + Q_t \right)^{-1}
\bar\mu_t^i = \bar\mu_t^{i-1} + K_t^i (z_t^i - \hat z_t^i)    (2.73)
\bar\Sigma_t^i = (I - K_t^i H_t^i) \bar\Sigma_t^{i-1},

where K_t^i is the Kalman gain and H_t^i is the Jacobian of the measurement function. Finally, setting \mu_t = \bar\mu_t^{N_t} and \Sigma_t = \bar\Sigma_t^{N_t} gives the estimate of the mean and covariance for the entire state.

One of the drawbacks of EKF SLAM is the poor scaling with the number of landmarks. The complexity of EKF SLAM is O(l^2) for l landmarks. Since the number of landmarks increases with the distance driven, the quadratic complexity means the computational problem becomes intractable for large maps.
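As an illustration of the per-landmark update in (2.73), the sketch below applies one EKF measurement update to the full mean and covariance using Eigen. It is a schematic fragment under simplifying assumptions: the measurement Jacobian H (of size 2 × M for a pixel measurement), the measurement noise Q and the predicted measurement ẑ are assumed to be computed elsewhere, and all names are our own.

#include <Eigen/Dense>

// One EKF measurement update, cf. (2.73): the state mean 'mu' (M x 1) and
// covariance 'Sigma' (M x M) are corrected using a single landmark
// observation z (2 x 1 pixel coordinates), its prediction zHat, the
// measurement Jacobian H (2 x M) and the measurement noise Q (2 x 2).
void ekfLandmarkUpdate(Eigen::VectorXd& mu, Eigen::MatrixXd& Sigma,
                       const Eigen::Vector2d& z, const Eigen::Vector2d& zHat,
                       const Eigen::MatrixXd& H, const Eigen::Matrix2d& Q) {
    const Eigen::MatrixXd S = H * Sigma * H.transpose() + Q;        // innovation covariance
    const Eigen::MatrixXd K = Sigma * H.transpose() * S.inverse();  // Kalman gain
    mu += K * (z - zHat);                                           // mean update
    const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(mu.size(), mu.size());
    Sigma = (I - K * H) * Sigma;                                    // covariance update
}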
There are several variants of Kalman filter based SLAM algorithms, among them other extensions of the basic Kalman filter to non-linear filtering, such as the Unscented Kalman Filter (UKF). There are also similar filters that, unlike the EKF and other direct descendants of the Kalman filter, which use covariance matrices to quantify the uncertainty, instead use the inverse of the covariance matrix, the information matrix. These are called Information Filters (IFs). The Extended Information Filter (EIF) is one such filter, extended to handle the non-linear case by linearisation, as in the EKF [9, pp. 75]. IFs trade the simple mean calculation and complex covariance calculation of Kalman filters for a complex mean calculation and a simple information matrix calculation [9, pp. 73]. One major benefit of using IFs is that, unlike the covariance matrices, the information matrices in SLAM are usually sparse, since there is only odometry between consecutive poses and only a subset of the landmarks is seen from each pose. An element of the information matrix is only non-zero when there is a connection between the corresponding pose-pose or pose-landmark pair [9, pp. 79]. Sparse problems can be solved significantly faster than dense ones.

2.3.3 FastSLAM

FastSLAM is the result of applying particle filters to the SLAM problem. A particle filter represents a probability distribution by a finite number of random samples of the distribution, instead of by a parametric description of a certain probability function, such as the mean and covariance parameters used to represent Gaussian distributions [9, pp. 85]. For this reason particle filters are sometimes called non-parametric filters. While the sample based representation used in particle filters is only approximate, it can represent a much broader class of distributions than the commonly used Gaussian distribution, e.g. multimodal distributions. Non-linear transformations of distributions are also easily applied to sets of samples, without the need for linearisation, to generate a set of samples from the resulting distribution [9, pp. 97].

To maintain performance, particle filters need a certain density of particles in the state space, to ensure that some particles are close to the correct state [9, pp. 112]. As the "volume" of the state space grows exponentially with its dimension, and the dimension of the state space grows with the number of landmarks, using a particle filter to estimate the entire state, including landmark positions, would quickly become intractable. Fortunately, since there are no measurements between landmarks, the correlations between landmarks arise solely through the uncertainty in the pose estimates [9, pp. 437]. Thus, if the poses are known, the landmarks are independent of each other. Each individual particle assumes that the trajectory is known (the uncertainty is instead described by the spread of the particles' pose estimates), and can therefore estimate the positions of the landmarks independently of each other. This can be described as factorising the SLAM posterior (2.65) into

p(x_{1:t}, m \mid z_{1:t}, u_{1:t}, c_{1:t}) = p(x_{1:t} \mid z_{1:t}, u_{1:t}, c_{1:t}) \prod_{n=1}^{N} p(m_n \mid x_{1:t}, z_{1:t}, c_{1:t}),    (2.74)

for a trajectory x_{1:t}, a map m of N landmarks with the correspondences c_{1:t}, and odometry and measurements u_{1:t} and z_{1:t}. Thus, each landmark can be independently estimated with a parametric filter like the EKF.
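The factorisation in (2.74) translates directly into a data structure: each particle carries one pose hypothesis (or trajectory) together with an independent small Gaussian, i.e. a mean and covariance, per landmark. The sketch below shows such a layout in C++ with Eigen; the type and member names are ours, and only the structure, not the filtering logic, is shown.

#include <Eigen/Dense>
#include <vector>

// Independent Gaussian estimate of one landmark position, maintained by a
// small per-landmark EKF inside each particle.
struct LandmarkEstimate {
    Eigen::Vector3d mean;        // estimated 3D landmark position
    Eigen::Matrix3d covariance;  // 3x3 uncertainty of that position
};

// One particle: a sampled pose hypothesis plus its own map. Because the
// trajectory is treated as known within the particle, the landmark
// estimates are independent of each other, cf. (2.74).
struct Particle {
    Eigen::Matrix4d pose = Eigen::Matrix4d::Identity();  // homogeneous 4x4 pose
    std::vector<LandmarkEstimate> landmarks;             // one Gaussian per landmark
    double weight = 1.0;                                 // importance weight
};

// The filter itself is simply a set of such particles.
using ParticleSet = std::vector<Particle>;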
This factorisation greatly reduces the number of states whose posterior distribution has to be estimated with particles to a constant number: 6 in the case of 3D poses. The resulting particle filter is called a Rao-Blackwellised particle filter. In FastSLAM, each particle has an estimate of the pose and estimates the locations of the landmarks with one EKF per landmark [9, pp. 439]. A particle with index [k] in FastSLAM thus contains an estimate of its trajectory, x_{1:t}^{[k]}, and a set of parametric, specifically Gaussian, estimates of all the landmarks seen so far, \{\mu_1^{[k]}, \Sigma_1^{[k]}, \dots, \mu_N^{[k]}, \Sigma_N^{[k]}\}. The FastSLAM algorithm iterates over each of the M particles and does a motion prediction and a measurement update.

The motion prediction differs between FastSLAM 1.0 and 2.0. In the 1.0 algorithm, the proposal distribution sampled is simply the motion prediction p(x_t | x_{t-1}, u_t), while the 2.0 version samples a distribution p(x_t | x_{1:t-1}, u_{1:t}, z_{1:t}). The benefit of the 2.0 approach is that the proposal distribution better covers the target distribution [9, pp. 451]. If the control and odometry are poor predictors of the posterior, few of the samples drawn from the proposal distribution will lie in the regions of high probability density in the target distribution, the posterior. This means that the distribution of particles diverges from the posterior. A way of coping with this effect is resampling [9, pp. 447], which is done as a last step.

Before resampling, the measurement update is carried out, which consists of doing an EKF update of the mean and covariance of each measured feature, in each particle. If a feature has not been seen, its estimate remains unchanged. When resampling, M new particles are drawn with replacement from the current set of particles, and these become the new set of particles used to represent the posterior. The probability of a particle being drawn is proportional to its importance weight. The importance weight, w_t^{[k]}, is the quotient of the target distribution and the proposal [9, pp. 448]:

w_t^{[k]} = \frac{\text{target}}{\text{proposal}} = \frac{p(x_{1:t}^{[k]} \mid z_{1:t}, u_{1:t}, c_{1:t})}{p(x_{1:t}^{[k]} \mid z_{1:t-1}, u_{1:t}, c_{1:t-1})}
= \eta \, \frac{p(z_t \mid x_t^{[k]}, z_{1:t-1}, u_{1:t}, c_{1:t}) \, p(x_{1:t}^{[k]} \mid z_{1:t-1}, u_{1:t}, c_{1:t})}{p(x_{1:t}^{[k]} \mid z_{1:t-1}, u_{1:t}, c_{1:t-1})}
= \eta \, p(z_t \mid x_t^{[k]}, z_{1:t-1}, u_{1:t}, c_{1:t}) = \eta \, p(z_t \mid x_t^{[k]}, c_t),    (2.75)

which is the measurement likelihood. \eta is a normalising factor that can be ignored, since the probability of drawing a particle only needs to be proportional to w_t^{[k]}. Resampling is done as a last step in FastSLAM, after the measurement update.

One of the disadvantages of particle filters is their inflexibility under loop closure. When the particles are resampled, fewer and fewer estimates of "old" poses remain represented among the particles. Since only the estimates still represented by one of the particles are available for the filter to choose from, it might be that the "locally" best approximations no longer contain the "globally" best match when a loop closure is included.

2.3.4 Graph-based SLAM

Graph-based SLAM has a very intuitive structure: each pose and each landmark is a node in a graph, and between the nodes there are edges that represent measurements. Typically there are edges between two successive pose nodes, and between a pose node and each node corresponding to a landmark seen from that pose, as illustrated in figure 2.10. These measurements form constraints on the locations of the poses and landmarks involved. Since no real measurement is exact, these constraints will not be perfectly consistent.
The goal of graph-based SLAM is to find a configuration of the nodes which minimises this inconsistency.

Figure 2.10: Basic structure of the graph in graph-based SLAM. White circles are pose nodes, black squares are landmarks, and solid and dashed edges are pose-to-pose and pose-to-landmark constraints respectively. At poses 20 and 21, loop closures are detected when the old landmarks 5 and 7 are observed.

A graph-based SLAM system can be split into a front-end and a back-end. The front-end builds a graph from the measurements and finds loop closures when they arise, setting the constraints of the graph. The back-end takes the graph built by the front-end and optimises it with respect to the constraints in the graph. The front-end deals directly with the sensor data, and is thus highly dependent on the kind of sensors used, the method of finding loop closures and the specifics of the implementation, while the back-end works on the more general and abstract representation of the graph.

Each edge in the graph corresponds to an error function. The type of error function depends on what type of measurement the edge represents: a pose-pose edge could, for example, represent the Euclidean distance between two poses or the 6 DoF relative transform between the nodes, while a pose-landmark edge could represent the reprojection error in an image in which the landmark appears. The error is the difference between the actual measurement and the expected measurement, given the current values in the graph:

e_{ij}(x_i, x_j) = z_{ij} - \hat z_{ij}(x_i, x_j).    (2.76)

Here, e_{ij} is the error associated with the edge between node x_i and node x_j, z_{ij} is the measurement between the nodes with indices i and j, and \hat z_{ij}(x_i, x_j) is the expected measurement for the pair of nodes x_i, x_j.

A common way of finding the parameters that best explain a set of measurements is to minimise the square of the errors over the set of possible parameters, which results in the maximum likelihood estimate. Assuming the measurements are affected independently by Gaussian noise, this is equivalent to finding the node states that minimise the negative log-likelihood of the observations [59]. Let x = (x_1^T, \dots, x_T^T)^T be a vector of the node states x_i, and \Omega_{ij} be the information matrix of the measurement from node i to node j. The likelihood of the observations given the poses and landmarks is

p(z \mid x) = \prod_{i,j \in C} \eta \, e^{-(z_{ij} - \hat z_{ij})^T \Omega_{ij} (z_{ij} - \hat z_{ij})} = \prod_{i,j \in C} \eta \, e^{-e_{ij}^T \Omega_{ij} e_{ij}},    (2.77)

where C is the set of all pairs i, j for which a measurement exists. Taking the logarithm of this gives

\ln \left( \prod_{i,j \in C} \eta \, e^{-e_{ij}^T \Omega_{ij} e_{ij}} \right) = \sum_{i,j \in C} \left( \ln \eta - e_{ij}^T \Omega_{ij} e_{ij} \right) = A - \sum_{i,j \in C} e_{ij}^T \Omega_{ij} e_{ij},    (2.78)

where A is a constant that does not depend on the node states. Maximising the right-hand side, i.e. finding the maximum likelihood estimate, is equivalent to minimising

F(x) = \sum_{i,j \in C} e_{ij}^T \Omega_{ij} e_{ij},    (2.79)

which is the sum of the squared error terms weighted by their information matrices. This can be written as finding the solution

x^* = \arg\min_x F(x).    (2.80)

With a good initial guess, \check x, of the poses and landmark positions, the Gauss-Newton or Levenberg-Marquardt algorithms can be used for minimising F [50]. These algorithms rely on iteratively refining \check x by linearising the error functions and solving the resulting linear minimisation problem to find a better state vector.
Simplifying the notation from e_{ij}(x_i, x_j) to e_{ij}(x), and writing e_{ij}(\check x) as \check e_{ij}, the linear approximation of the error function around the initial guess can be written as its first order Taylor expansion,

e_{ij}(x) = e_{ij}(\check x + \Delta x) \approx \check e_{ij} + J_{ij} \Delta x,    (2.81)

where J_{ij} is the Jacobian of e_{ij}(x) evaluated at \check x. The terms in (2.79) can then be expressed as

e_{ij}^T \Omega_{ij} e_{ij} \approx (\check e_{ij} + J_{ij} \Delta x)^T \Omega_{ij} (\check e_{ij} + J_{ij} \Delta x)
= \check e_{ij}^T \Omega_{ij} \check e_{ij} + 2 \check e_{ij}^T \Omega_{ij} J_{ij} \Delta x + \Delta x^T J_{ij}^T \Omega_{ij} J_{ij} \Delta x
= c_{ij} + 2 b_{ij}^T \Delta x + \Delta x^T H_{ij} \Delta x,    (2.82)

where c_{ij} = \check e_{ij}^T \Omega_{ij} \check e_{ij}, b_{ij}^T = \check e_{ij}^T \Omega_{ij} J_{ij}, and H_{ij} = J_{ij}^T \Omega_{ij} J_{ij}. Now the approximation of F(x) close to \check x can be written as

F(\check x + \Delta x) \approx \sum_{i,j \in C} \left( c_{ij} + 2 b_{ij}^T \Delta x + \Delta x^T H_{ij} \Delta x \right) = c + 2 b^T \Delta x + \Delta x^T H \Delta x.    (2.83)

This quadratic form has a global minimum if H is positive semi-definite. H is an information matrix [50] and thus at least positive semi-definite [60]. The global minimum occurs where the gradient of the quadratic form is zero, that is,

2b + 2H \Delta x = 0.    (2.84)

Thus, solving H \Delta x^* = -b gives \Delta x^*. Setting the new initial guess to \check x + \Delta x^*, and iterating until some convergence criterion is reached, yields an approximation of x^*.

In V-SLAM it is more common to use the Levenberg-Marquardt algorithm for finding \Delta x^*. It is very similar to Gauss-Newton, the difference being that instead of solving (2.84),

(H + \lambda I) \Delta x = -b    (2.85)

is solved. Here, \lambda is a parameter to be chosen for each iteration. If \lambda is "large enough", H + \lambda I will have full rank and be positive definite. This is helpful in the V-SLAM context, since H can lack positive definiteness due to scale ambiguity.

Solving (2.85) naively requires inverting H + \lambda I, and matrix inversion has a complexity of O(n^3), which quickly becomes intractable for matrices of the sizes relevant to graph-based SLAM with landmarks. Luckily, H is sparse and symmetric. The sparsity stems from the fact that each landmark is only seen from a few poses and the only pose-pose edges are between consecutive poses, and the symmetry from the terms J_{ij}^T \Omega_{ij} J_{ij} being symmetric, as illustrated in figure 2.11. There are very efficient solvers for sparse symmetric systems, the complexity of which depends on the structure of H.

Figure 2.11: An illustration of the sparseness and symmetry of the information matrix, H, corresponding to the graph in figure 2.10. Only the shaded elements are non-zero; black elements are pose nodes, dark grey corresponds to landmarks and middle grey to the edges between these. Loop closures manifest as off-diagonal elements at the symmetric pairs ⟨(20, 5), (5, 20)⟩ and ⟨(21, 7), (7, 21)⟩.

When tracking features in the images, erroneous associations are sometimes obtained, resulting in outliers when triangulating or when adding edge constraints to the graph. These can have a large detrimental impact on the final result, since the square of an already large error can overwhelm the smaller, inlier errors. Seen in a Gaussian light, the outliers are deemed very unlikely measurements, so unrealistically large adjustments to the other estimates are made to accommodate them. To lessen this effect one can use a robust error function. Such functions usually mimic the squared error function for small errors, to stay close to the Gaussian maximum likelihood estimate, but grow more slowly than quadratically for larger errors, to lessen the influence of the outliers. One such robust error function is the Huber error, which above a set magnitude of the error grows linearly instead of quadratically, as shown in figure 2.12.
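The Huber error can be written compactly: it equals the squared error for residuals whose magnitude is below a threshold δ, and continues linearly, with matched value and slope, above it. The small sketch below is a straightforward implementation of this definition, using the plain squared error e^2 as the inlier branch to match the squared errors in F(x) above; the threshold value used in an actual back-end is a tuning parameter, and the function names are ours.

#include <cmath>

// Huber error: quadratic for |e| <= delta, linear (with matching value and
// slope) for |e| > delta.
double huberError(double e, double delta) {
    const double a = std::abs(e);
    if (a <= delta)
        return e * e;                        // behaves like the squared error
    return 2.0 * delta * a - delta * delta;  // grows only linearly for outliers
}

// For comparison: the plain squared error that the Huber error replaces.
double squaredError(double e) { return e * e; }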
Figure 2.12: Comparison of the squared error function and the Huber error, plotted as functions of the residual z − ẑ.

3 Implementation

To evaluate the performance of the methods described in chapter 2, a solution to the 6 DoF SLAM problem has been implemented. The rest of this chapter details the implementation of the solution.

The implementation is written in C++ and makes use of third-party libraries. Non-primitive data types, mathematical methods and feature point related methods from the OpenCV library [29] are used. OpenCV was chosen since it is a widely used, cross-platform library for computer vision tasks with a C++ interface, and it includes most of the functionality that was provided by Matlab in the first implementation of the VO system. For handling the graph structure and optimisation, the g2o library [60] is used, as it is specifically tailored to graph-based SLAM problems and bundle adjustment. For loop closure detection, OpenFABMAP [58], as provided in OpenCV, is used. OpenFABMAP is an open source implementation of FAB-MAP, the loop closure method for V-SLAM that was chosen for this implementation. Tunable parameters are presented in appendix A.

3.1 Visual Odometry

The implementation of the VO assumes known camera calibration parameters. The system is divided into two parts. The first takes raw camera images, rectifies the images and tracks features in consecutive images. This enables the use of epipolar geometry for an initial estimate of the incremental ego-motion and triangulation of the detected features. The second part takes the initial estimate and further refines it with windowed bundle adjustment.

3.1.1 Features and tracking

To extract corresponding feature points between two consecutive images, a KLT tracker [35] is used. The tracker is initialised with a specified number, KLTNPoints, of FAST features [34], detected with an intensity threshold KLTFASTThreshold in the first image, which are then tracked in the second image. Points for which no corresponding point is found in the second image are removed. RANSAC, using the normalised 8-point algorithm, is used to remove outliers with a Sampson distance to the epipolar line greater than KLTRANSAC8PointThreshold. The number of RANSAC iterations is specified by KLTRANSAC8PointIterations. If points are lost, new FAST features are added in the second image, to later be tracked in the next image.

VO gives better results when the features are evenly distributed in the image [32]. To achieve an even distribution, the image is divided into a grid of KLTNWin fields, with margins, specified by KLTWinWidthMargin and KLTWinBottomMargin, in which no grid fields are created, to avoid creating features that are likely to disappear out of view in the next image. The number of feature points is counted in each field, and new FAST feature points are added only in the fraction, KLTWinThreshold, of the fields containing the fewest feature points.

As suggested by Scaramuzza et al. in 2009, if the distance between the matched points is less than KLTMotionDistanceThreshold pixels for a fraction KLTMotionRatioThreshold of the points, no motion is assumed and the second image is discarded. This is repeated until an image not fulfilling this condition is found.

The fundamental matrix is calculated using the feature points remaining after tracking and RANSAC with the normalised 8-point algorithm described in section 2.2.1. Using the known camera calibration matrix, the essential matrix is obtained according to (2.48).
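A condensed sketch of this front-end step with OpenCV is shown below: FAST corners are detected in the previous frame, tracked into the current frame with the pyramidal KLT tracker, outliers are rejected while estimating F with RANSAC, and the essential matrix is formed from the calibration matrix K. The parameter values are placeholders rather than the thesis parameters, the grid-based redistribution and the no-motion check are omitted, the calibration matrix is assumed to be a 3x3 CV_64F matrix, and the function name is ours.

#include <opencv2/opencv.hpp>
#include <vector>

// One front-end step: track FAST features from 'prev' to 'curr' with KLT,
// reject outliers with RANSAC on the fundamental matrix, and compute the
// essential matrix E = K^T F K, cf. (2.48).
cv::Mat trackAndEstimateEssential(const cv::Mat& prev, const cv::Mat& curr,
                                  const cv::Mat& K) {
    // Detect FAST keypoints in the previous (grayscale) image.
    std::vector<cv::KeyPoint> keypoints;
    cv::FAST(prev, keypoints, /*threshold=*/20, /*nonmaxSuppression=*/true);
    std::vector<cv::Point2f> p0, p1;
    cv::KeyPoint::convert(keypoints, p0);

    // Track the points into the current image with the pyramidal KLT tracker.
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prev, curr, p0, p1, status, err);

    // Keep only the successfully tracked correspondences.
    std::vector<cv::Point2f> q0, q1;
    for (size_t i = 0; i < status.size(); ++i)
        if (status[i]) { q0.push_back(p0[i]); q1.push_back(p1[i]); }

    // Estimate F with RANSAC; 'mask' marks the inlier correspondences.
    std::vector<uchar> mask;
    cv::Mat F = cv::findFundamentalMat(q0, q1, cv::FM_RANSAC,
                                       /*ransacReprojThreshold=*/1.0,
                                       /*confidence=*/0.99, mask);

    // Essential matrix from the known calibration.
    return K.t() * F * K;
}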
From the essential matrix, two possible rotation matrices and two possible unit-length translation vectors between the two frames can be obtained, as shown in (2.49). The equation shows that the two possible translation vectors are parallel but point in opposite directions. To determine the correct translation vector, the speed of the vehicle, which can be obtained from wheel odometry or an IMU, is used. The speed is also used to resolve the scale ambiguity of the monocular camera. This is done by setting the length of the translation vector to the distance travelled between the two images, according to the speed. To find the correct rotation matrix, the matrix describing a rotation of less than 180° about all axes is selected.

The above steps are repeated for each consecutive pair of frames. The rotation matrix, R, and the translation, t, are combined and expressed as a relative homogeneous transformation matrix:

T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}    (3.1)

The current pose, expressed as a homogeneous transformation matrix, is found by right-multiplying the relative transformation matrices of all frames up to the current one. The first pose is initialised at the origin of the camera coordinate system, with the identity matrix.

3.1.2 Bundle Adjustment

Each pose is added as a pose node, VertexCam, in a g2o [60] graph, with an SBACam estimate. An SBACam estimate contains a fixed camera calibration matrix and an extrinsic camera matrix, a pose, to be optimised. The camera calibration matrix is known beforehand and set accordingly, and the pose is set to the pose obtained from the estimated essential matrix.

For each new feature point tracked across two consecutive frames, a point node, VertexSBAPointXYZ, is added to the graph. The point node has an estimate of a position in R^3 to be optimised. The initial estimate is obtained by triangulating the point from the first two poses it is seen from. For each pose where the point is tracked, a projection edge, EdgeProjectP2MC, is added. The projection edge connects a pose node with a point node. The measurement is the pixel coordinate of the observed point in the image from the relevant pose. The error function is the reprojection error of the point into the image according to (2.11). The information matrix of the measurement is set to a diagonal matrix \sigma^{-2} \cdot I, where \sigma^2 is PixelVar.

To fix the scale ambiguity of the monocular camera, a distance edge, EdgeSBAScale, is added between each pair of consecutive pose nodes. The measurement is the geometric distance between the two poses, and the error function is the offset between the estimate and the measurement. The measurement is set to the distance between the poses according to the speed of the vehicle, and the information to \sigma^{-2}, where \sigma^2 is DistanceVar.

Between each pair of consecutive pose nodes, a pose-pose edge, EdgeSBACam, is also created and added to the graph. These edges are not considered during the incremental VO, but are instead used when loop closures are detected, to avoid doing costly BA optimisation on the entire graph. They are updated until the poses they connect fall out of the BA window. The measurement is the pose-to-pose transformation between the connected nodes, expressed as a 6-vector with Euclidean coordinates describing the translation and a unit quaternion with the real part omitted (assumed positive) describing the rotation, x = (x, y, z, q_0, q_1, q_2)^T. Using homogeneous transformation matrices, the error is calculated as

T_{err} = T_{meas}^{-1} T_1^{-1} T_2.    (3.2)
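A small sketch of this error computation with Eigen is given below. It evaluates (3.2) with homogeneous transformation matrices and extracts the 6-vector of translation and imaginary quaternion components; it is an illustration with our own function name, not the g2o implementation of EdgeSBACam.

#include <Eigen/Dense>
#include <Eigen/Geometry>

// Error of a pose-pose edge, cf. (3.2): the deviation between the measured
// relative transform T_meas and the relative transform implied by the two
// current pose estimates T1 and T2, expressed as a translation plus the
// imaginary part of a unit quaternion with positive real part.
Eigen::Matrix<double, 6, 1> posePoseError(const Eigen::Isometry3d& T_meas,
                                          const Eigen::Isometry3d& T1,
                                          const Eigen::Isometry3d& T2) {
    const Eigen::Isometry3d T_err = T_meas.inverse() * T1.inverse() * T2;

    Eigen::Quaterniond q(T_err.rotation());
    if (q.w() < 0.0) q.coeffs() *= -1.0;  // enforce a positive real part

    Eigen::Matrix<double, 6, 1> e;
    e.head<3>() = T_err.translation();
    e.tail<3>() = q.vec();                // imaginary quaternion part, real part omitted
    return e;
}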
The information matrix is set to a diagonal matrix with (\sigma_t^{-2}, \sigma_t^{-2}, \sigma_t^{-2}, \sigma_r^{-2}, \sigma_r^{-2}, \sigma_r^{-2})^T on the diagonal, where \sigma_t^2 is set to TranslVar and \sigma_r^2 is set to RotVar.

For each iteration, a graph with the projection edges and distance edges connecting the latest BAWindow pose nodes is optimised for BAIterations iterations with the Levenberg-Marquardt algorithm, OptimizationAlgorithmLevenberg, using a Huber error function and a solver based on sparse Cholesky factorisation, LinearSolverCholmod.

3.2 Simultaneous Localisation And Mapping

To get a full SLAM implementation, the estimate of the trajectory and map from the bundle-adjusted VO is adjusted again when loop closures are detected, to ensure that the map is consistent. This is divided into detecting loops using OpenFABMAP and finding the geometric relation between the current and the old pose in a step similar to the first part of the VO.

3.2.1 FAB-MAP

The FAB-MAP 2.0 [21] algorithm is set up using a vocabulary and a Chow-Liu tree [20], which have been created offline, as well as a training data set. The probability of a feature being detected, p(z = 1 | e = 1), is set to FMPzGe, and the probability of a false positive, p(z = 1 | e = 0), is set to FMPzGNe.

For each new image, a set of Star feature points, from the OpenCV library [29], which is a variant of the CENSURE [38] feature detector, is extracted. The threshold of the detector is dynamically adjusted, in at most FMStarIterations iterations, to give between FMStarNMin and FMStarNMax feature points, and is initialised at FMStarThreshold. For the