Automatic LiDAR-camera calibration
Extrinsic calibration for a LiDAR-camera pair using structure from motion and stochastic optimization

Master's thesis in Systems, Control and Mechatronics

THERESE DAHLBERG
VALTER STRÖMBERG

Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems
Chalmers University of Technology
Gothenburg, Sweden 2022
www.chalmers.se

Master's Thesis 2022:27

© THERESE DAHLBERG, VALTER STRÖMBERG 2022.

Advisor: Torgeir Langfjord Nordgård, CPAC Systems AB
Examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences

Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Point clouds generated from structure from motion and LiDAR visualized together.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2022

Abstract

This thesis presents an approach to automatically and simultaneously perform extrinsic calibration of a LiDAR and a camera. Nowadays, a multitude of sensors are used in a majority of vehicles. Having correctly calibrated sensors is essential for attaining accurate data to use in various sensor-dependent applications. Today's LiDAR-camera calibration methods are often performed manually or require externally introduced calibration objects. The method proposed in this thesis, however, depends only on 3D LiDAR point clouds and camera images. The method consists of two major parts. Firstly, the camera images were converted to 3D point clouds using a structure from motion pipeline, ensuring that the data from both sensors were comparable. Secondly, a genetic algorithm with an objective function based upon a 3D voxel grid filter was used to iteratively compare the overlap of the point clouds until convergence. The method proved to be successful in creating 3D point clouds from camera images and accurately estimating the rotational parameters for both sensors. However, it was not as robust and accurate as anticipated when estimating the sensor positions.

Keywords: LiDAR-camera calibration, stochastic optimization, genetic algorithm, structure from motion, point clouds.

Acknowledgements

This thesis was carried out in the spring of 2022. We would like to thank our supervisor Torgeir Langfjord Nordgård for the many interesting and insightful conversations during the period. We would also like to thank our examiner Peter Forsberg for the rewarding tips that he provided. Lastly, we would like to thank CPAC Systems AB for the opportunity to do this thesis.
Therese Dahlberg & Valter Strömberg, Gothenburg, June 2022

Thesis advisor: Torgeir Langfjord Nordgård, CPAC Systems AB
Thesis examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences

List of acronyms

2D      Two dimensions
3D      Three dimensions
DoF     Degrees of freedom
FoV     Field of view
GA      Genetic algorithm
GNSS    Global navigation satellite system
GPS     Global positioning system
IMU     Inertial measurement unit
LiDAR   Light detection and ranging
Radar   Radio detection and ranging
RANSAC  Random sample consensus
SIFT    Scale-invariant feature transform
SfM     Structure from motion

Contents

List of Acronyms
1 Introduction
  1.1 Calibration
  1.2 Aim
  1.3 Related work
    1.3.1 LiDAR-camera calibration
    1.3.2 Biologically inspired optimization algorithms
  1.4 Research questions and limitations
    1.4.1 Limitations
2 Theory
  2.1 LiDAR properties
  2.2 Camera properties
  2.3 SIFT
  2.4 RANSAC
  2.5 Epipolar geometry
  2.6 Essential and fundamental matrix
  2.7 Structure from motion
    2.7.1 N-point algorithms
    2.7.2 Triangulation
  2.8 Z-score
  2.9 Genetic algorithm
    2.9.1 Initialization
    2.9.2 Evaluation
    2.9.3 Evolution
3 Methods
  3.1 Simulation and collection of data
  3.2 Process and combine data
    3.2.1 Conversion of camera images to 3D point clouds
    3.2.2 Scale estimation
  3.3 Fitness model
  3.4 Objective function
  3.5 Optimization
  3.6 Sensor poses
4 Results
  4.1 Structure from motion pipeline
  4.2 GA parameters
  4.3 GA calibration results
  4.4 Defect objective function
5 Discussion
  5.1 Structure from motion generated point clouds
  5.2 Objective function
  5.3 Overall performance
6 Conclusion
  6.1 Future work
A Boxplots
  A.1 Small solution space
  A.2 Large solution space
1 Introduction

Nowadays, a multitude of sensors is used in autonomous vehicles to collect data about the state of the vehicle and its surroundings. The sensor data serves many purposes and is used in various applications and systems of the vehicle. Ranging sensors are widely used in adaptive cruise control, cameras have become an extensive aid in parking assistance, and positioning sensors play a huge role in navigation. Even more critical applications, such as active safety features, depend heavily on the use of various sensors.

The most prominent sensors used today are: ranging sensors - such as LiDARs and radars, imaging sensors - such as cameras, inertial sensors - such as IMUs, and positioning sensors - such as GNSS and GPS. Different types of sensors have different purposes. However, they all have one important aspect in common: the requirement of delivering accurate and reliable data for further data manipulation. For example, if the ranging sensor in an active safety system indicates incorrect distances to surrounding objects, serious complications could follow, such as activating the emergency brake in the wrong situation, consequently counteracting the purpose of the active safety system.

To ensure that the sensors correctly record the surrounding data they need to be calibrated. Properly calibrated sensors are the basis for all further data manipulation and fusing, such as the examples mentioned above, but also object detection, image segmentation, simultaneous localization and mapping (SLAM) [1], adaptive cruise control, various safety functions and more.

This thesis will focus on two types of sensors - cameras and LiDARs. Cameras and LiDARs have become increasingly popular, especially in the automotive industry. Nowadays, even other industries are starting to make use of cameras and LiDARs in their products, for example, robotic lawn mowers and autonomous vacuum cleaners. Even the boat industry is starting to implement adaptive cruise control and assisted docking algorithms, both highly dependent on being aware of the surroundings. Without a human to gather information, a camera and LiDAR can be seen as replacements for the human eye.

1.1 Calibration

Sensor calibration can be divided into two categories, intrinsic and extrinsic calibration. Both methods are quite different but equally crucial for reliably being able to use the sensor data.

Intrinsic calibration aims to determine the internal parameters of a sensor. What these parameters are differs depending on the sensor. For a camera, it is related to how the image is altered in relation to reality, for example, distortion or resolution. For a LiDAR, it is for instance the range it can detect or the field of view (FoV). However, the shared property of intrinsic parameters is that they are often used in the extrinsic calibration process. Intrinsic parameters are often given by the manufacturer of the sensor, but can sometimes change over time.

Extrinsic calibration, on the other hand, intends to determine a sensor's pose, i.e. the calibration parameters describing how it is placed in relation to some other actor, for example, its host vehicle.
In Figure 1.1 an intuition of extrinsic calibration is depicted, where the aim is to find the transformation T (rotation R and translation t) that relates the sensor coordinate system S to the vehicle coordinate system W. Knowing T, the sensor data can be transformed into the vehicle's coordinate system W, and consequently turned into usable information for the vehicle. For example, if a LiDAR gives a distance to a certain object, the measurement is given in relation to the LiDAR itself, i.e. in the LiDAR's coordinate system S. However, when transforming the LiDAR measurements into the vehicle coordinate system W, the distance to the object possibly changes, which can be explained by the measurement now being in relation to the vehicle and not the LiDAR. In most cases, it is only of interest to have information in relation to the vehicle coordinate system, since it is the vehicle that has to avoid obstacles. Also, all integrated systems and applications are most commonly based upon the vehicle's coordinate system.

Figure 1.1: Calibration is the process of finding the transformation T = (R | t) that maps the sensor's coordinate system S to the vehicle coordinate system W.

Extrinsic calibration can be done manually, i.e. the distance and rotation of the sensor from the origin of the vehicle are measured by hand. The calibration is performed either during the manufacturing of the vehicle or afterward. However, it is not an easy task due to the usually obscure and inconvenient sensor placements. Extrinsic calibration can also be done automatically, i.e. using an algorithm to calculate the sensor positions. Automatic sensor calibration algorithms can run either online, i.e. the calibration happens as the vehicle is operated, or offline, i.e. the sensor data is collected as the vehicle is operated but the calibration is done afterward. An advantage of online calibration is that the sensors can be re-calibrated if their pose is altered due to external influences. However, offline calibration enables the use of more powerful computers, and thus more computationally heavy algorithms can be used. Another difference is the time constraint imposed on the algorithm in online calibration, while this is usually not an issue for algorithms that run offline.

1.2 Aim

The aim of this thesis is to develop, test and validate a fully automatic calibration algorithm for LiDAR-camera extrinsic co-calibration, thus eliminating the need for manual interference. Co-calibration means that only one calibration algorithm is used to simultaneously calibrate both camera and LiDAR. Given a rough estimation of the calibration parameters, the algorithm should be able to find the relation between an arbitrary number of cameras and LiDARs that have overlapping FoVs.

1.3 Related work

The subject of extrinsic sensor calibration is well studied and there is a great variety of approaches to take inspiration from. The biggest complication when it comes to LiDAR-camera calibration is to properly process and compare the data collected by each of the sensor types. LiDAR data consists of clusters of points in 3D, point clouds, whereas camera data is depicted in 2D. This section will introduce the most relevant sources on LiDAR-camera calibration along with some content on relevant optimization algorithms.
1.3.1 LiDAR-camera calibration

The most prominent dilemma when working with cameras and LiDARs simultaneously is that they operate in different dimensions, making it hard to compare the raw data of the two sensor types. A solution to this problem is to either remove one dimension from the LiDAR data or add one dimension to the camera data. It is important in both cases to find shared information, as it is the baseline for producing a satisfactory calibration result. A common way to find shared information is to extract interesting regions in the scene called features.

3D to 2D conversion of LiDAR data

The technique of converting 3D point clouds to 2D starts by finding the same features in the point cloud and the image. The point cloud is then projected down onto the image, thus removing one dimension. The features in the image and the projected LiDAR data can now be compared in the same dimension. To make sure that the data align as well as possible, the alignment is refined with an optimization algorithm.

There exist different methods for extracting equivalent features in both point clouds and images. Firstly, edge detection is a method for finding so-called edges, e.g. corners or naturally occurring lines resulting from shadows in an image. The edges are then converted to lines and can be compared with the same occurring lines in the point cloud data. By applying an optimization algorithm the best match of overlapping lines can be found, which in turn can be used to find the extrinsic calibration parameters [2]. Secondly, [3] describes a line correspondence calibration method, which utilizes the concept of so-called infinity lines to achieve a similar comparison as [2]. Lastly, calibration objects or targets can be used as a fixed feature with known dimensions. Calibration targets introduce known information in both the LiDAR and camera data, thereby simplifying the optimization process [4].

2D to 3D reconstruction of camera data

The method of converting the 2D camera data to 3D requires more than one image of the scene. The reason for this is that it is possible to extract the depth information given two or more images of the same scene. In stereo vision, a stereo camera - two parallel cameras mounted on a rigid frame - produces image pairs of the scene, whereupon the depth can be extracted. The depth data is then used to create 3D points depicting the scene. These points can then be compared with the LiDAR 3D points. A common approach in calibration using 2D to 3D reconstruction is to introduce calibration targets to ensure that the LiDAR and camera have one common known object in their vicinity. The calibration target is then used to perfectly overlap the point clouds. From there the sensor poses can be extracted [5], [6]. Optimization is then used to find the most accurate overlap of the compared sensor data.

Another method is structure from motion (SfM), which performs a 3D reconstruction from camera images taken of the same scene. The approach is similar to stereo vision: images are used to construct a 3D point cloud of a scene by estimating depth data from two or more images. It is, however, different from stereo vision since only one camera is needed to collect data. In [7] an SfM pipeline was used to reconstruct camera data into 3D points for comparison with LiDAR data on the fly. One advantage of SfM is that it can be done with only one camera and thus the calibration is not dependent on including a stereo camera.
It is also worth noting that in [7] a densification algorithm was used on the reconstructed point clouds to increase the amount of information extracted from the images. An example of the SfM pipeline is presented in [8], where the authors reconstructed the city of Rome in 3D based on images only.

1.3.2 Biologically inspired optimization algorithms

Previous work, namely [9], shows that a calibration method for 2D LiDARs has been successfully developed using a genetic algorithm (GA). This algorithm was compared with the more common method of iterative closest point (ICP) and was shown to be superior in almost all cases. In [10], a fully automatic calibration method using 3D LiDARs and biologically inspired algorithms was developed. It produced satisfactory results using both particle swarm optimization (PSO) and a genetic algorithm. Moreover, the method should work in any environment and with any number of sensors, and all sensors can be calibrated simultaneously. [10] will be used as a baseline for this thesis.

1.4 Research questions and limitations

Since biologically inspired optimization has proven effective in previous work, mainly [9] and [10], it is a suitable foundation for this thesis as well. Based upon this reasoning, the following research questions are investigated:

• How can an SfM pipeline be used to convert camera images to 3D point clouds of a standard satisfactory for comparison with 3D LiDAR data?
• How will an objective function that has been proven successful in LiDAR-LiDAR calibration [10] perform if used in a LiDAR-camera co-calibration built on the same optimization method?
• How does the calibration algorithm perform with regard to robustness, specifically to disturbances, and how do these affect the quality of the estimated parameters?

1.4.1 Limitations

This work will assume that all sensors produce noise-free data. Additionally, it is assumed that vehicle data collected from IMU and GNSS sensors are available, meaning that ground truth values for simplifying the testing are accessible. All sensor data will be collected from simulations exclusively. It is also assumed that all sensors used for calibration have the same number of degrees of freedom (DoF). Moreover, the sensors are assumed to already be intrinsically calibrated and there will be no need to re-calibrate during simulations and tests. Additionally, it is assumed that the scenes used to collect data are rich in features and that objects are in range of both sensors. Furthermore, the simulated data will only be tested during linear movement, i.e. no sharp turns or fast changes in direction will be enforced on the sensors used. Only an offline calibration algorithm will be in focus in this thesis, removing time constraints and computational complexity from the implementation. Lastly, the calibration parameters will only be determined in relation to each other and not the vehicle.

2 Theory

In this chapter, the relevant theory is presented. First, each of the sensors used - LiDAR and camera - is introduced, followed by methods for converting 2D camera images to 3D points. Lastly, the optimization method used, a genetic algorithm, and all its related elements are presented.

2.1 LiDAR properties

LiDAR stands for Light Detection And Ranging. By emitting and detecting light at a specific wavelength only, LiDARs measure distances by timing the return of the reflected light rays.
Therefore, LiDARs only capture specific points, meaning that the more points captured, the higher the resolution of the collected data. The clusters of points generated by LiDARs are often referred to as point clouds. A single laser will produce a point cloud in 2D, showing the distance to objects from the sensor but not their height. If multiple lasers at different angles are used, the point cloud will be in 3D.

Rotating LiDARs have lasers mounted on a revolving frame. The rotational element is used to collect data from a 360° field of view. One important aspect to take into consideration when it comes to rotating LiDARs is that a time distortion will be introduced in the data if the LiDAR is moving while capturing data. Solid state LiDARs, on the other hand, use a single laser beam to illuminate the surroundings. An array of receivers mounted both horizontally and vertically detects the reflected beams. The removal of rotating parts makes the sensor cheaper to manufacture and more robust [11]. However, solid state LiDARs cannot capture a 360° field of view like the rotating LiDAR.

2.2 Camera properties

There are plenty of different types of cameras, but all of them share the same basic characteristics. To illustrate and introduce some important concepts, a simplification of a pinhole camera is shown in Figure 2.1. A camera captures light rays from the vicinity to create a 2D image. The rays come from different angles but will eventually meet at one point - the projection centre O of the camera. The light beam perpendicular to the image plane and with a zero-degree angle to the projection centre will pass through the principal point p of the image plane. The distance between the image plane and the projection centre is called the focal length f and is often measured in pixels. The focal length can be found through the following equation

    f = \frac{w}{2 \tan\left(\frac{fov}{2}\right)},    (2.1)

where fov is the field of view and w is the width of the image in pixels. The camera properties are expressed in the intrinsic matrix K, shown in equation (2.2). K is important in a lot of applications where cameras are used and is usually estimated by intrinsic calibration.

    K = \begin{bmatrix} f & \gamma & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix}    (2.2)

p_x and p_y are the x and y coordinates of the principal point and \gamma is a skew coefficient, which often is 0.

Figure 2.1: Basic illustration of camera mechanics. x is the point in the image plane where X projects to. X is a point in the world and is represented by Cartesian coordinates. x, on the other hand, is described with image coordinates. The image coordinate system often has its origin in the middle of the image plane.
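As an illustration of equations (2.1)-(2.2), the following is a minimal sketch of how the intrinsic matrix of the simulated camera could be assembled. The centred principal point and zero skew are assumptions for this example, not a statement about the thesis implementation.

```python
import numpy as np

def intrinsic_matrix(width_px, height_px, fov_deg):
    """Build a pinhole intrinsic matrix K from the image size and the
    horizontal FoV, following equations (2.1)-(2.2). Zero skew and a
    principal point in the middle of the image are assumed."""
    fov = np.deg2rad(fov_deg)
    f = width_px / (2.0 * np.tan(fov / 2.0))   # focal length in pixels, eq. (2.1)
    px, py = width_px / 2.0, height_px / 2.0   # assumed principal point
    return np.array([[f,   0.0, px],
                     [0.0, f,   py],
                     [0.0, 0.0, 1.0]])

# The simulated camera used later in the thesis: 1920x1080 with a 90 degree FoV.
K = intrinsic_matrix(1920, 1080, 90.0)         # gives f = 960 pixels
```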
2.3 SIFT

Scale-invariant feature transform, or SIFT for short, is an algorithm to extract image features [12]. SIFT features are invariant to some basic image manipulations, such as scale, rotation and illumination, which makes it a robust method for feature extraction. In addition to key feature points, e.g. corners, certain shapes, or other distinct features in an image, SIFT also provides some useful quantitative information about the extracted features.

The steps in SIFT are described here briefly. The first step is to apply a difference of Gaussians on the image, followed by extracting local extrema to create a scale-space. Next, gradients are computed around each found extremum. The orientations of the gradients are then downsampled and put into a histogram. The most prominent direction and the corresponding scale are chosen as the descriptor of a particular key feature point. The next step after having extracted SIFT features is often to match the features, which can be done by comparing descriptors.

2.4 RANSAC

Random sample consensus (RANSAC), first presented in [13], is an algorithm that performs binary characterization (inliers and outliers) on data points to fit the best possible model. RANSAC consists of three main steps. The first is to randomly sample the number of data points needed to fit the model. The next step is to fit the sampled points to the model and, lastly, depending on how accurate the resulting match is, assign a score. A data point is considered an inlier if it lies within a certain predefined threshold; otherwise, it is considered an outlier. These steps are iterated until the best model has most likely been found. RANSAC is often used in line-fitting problems.
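To make the connection between Sections 2.3 and 2.4 concrete, the sketch below detects and matches SIFT features and then lets RANSAC reject outlier correspondences while estimating a fundamental matrix. It assumes OpenCV (opencv-python, version 4.4 or newer where SIFT is in the main module) and is an illustration of the general technique, not the thesis's own code.

```python
import cv2
import numpy as np

def match_sift_ransac(img1, img2):
    """Detect SIFT features in two grayscale images, match them with a
    ratio test, and reject outliers with RANSAC while estimating F."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching with Lowe's ratio test on the two nearest neighbours.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # RANSAC keeps only correspondences consistent with one epipolar geometry.
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    inliers = inlier_mask.ravel().astype(bool)
    return F, pts1[inliers], pts2[inliers]
```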
2.5 Epipolar geometry

The material presented in this section is taken from [14, Ch. 12] and [15, Ch. 7] and concerns the subject of epipolar geometry. Two cameras in 3D space give rise to certain geometrical constraints. If the projection centres of the two cameras are at O and O' respectively, the geometry shown in Figure 2.2 is valid. If X is a point in 3D space, it will together with the two camera projection centres O and O' form the epipolar plane. Epipolar geometry provides the basic tools for computing the projection of X onto the second image plane (i_2 in Figure 2.2) from the projection of X in the first image plane i_1, called x' and x respectively.

Figure 2.2: Visualization of epipolar geometry. O and O' are the projection centres of the first and second cameras respectively.

The viewing ray from the camera projection centre O, i.e. the vector X − O, will pass all possible scene points that project onto x. In Figure 2.2, X_1, X_2 and X_3 are examples of this. The vector X − O projects onto the second image plane (the right image plane in Figure 2.2) as the epipolar line l', marked in red. Since the points X_1, X_2 and X_3 all lie on X − O, their projections in the second image have to lie somewhere on l', thus limiting the search space from the entire image to the epipolar line exclusively. The projection of O onto i_2 is known as the epipole e', and it also lies on the epipolar line l'. In fact, for all points on i_1, their formed epipolar lines will pass through the epipole on i_2 (and vice versa). Moreover, the condition that all possible points - which can correspond to some point in the other image plane - have to lie on the epipolar line is called the epipolar constraint.

A parameterization describing the epipolar line can be formulated as

    d R x + t \sim l,    (2.3)

where d is a scalar, R a rotation and t a translation. The epipolar line of x could alternatively be represented by the line equation

    l^\top x = 0.    (2.4)

To find the equation of this line, two points on the line must be chosen and inserted in the line equation, i.e.

    \begin{cases} l^\top (R x + t) = 0 \\ l^\top t = 0 \end{cases}.    (2.5)

Equation (2.5) can be rewritten as

    \begin{cases} l \cdot (R x + t) = 0 \\ l \cdot t = 0 \end{cases},    (2.6)

indicating that l needs to be perpendicular to both points, since if the dot product of two vectors is zero they are perpendicular. Taking the cross product of the two chosen points, a third vector perpendicular to the two points is created as

    l = t \times (R x + t) = \underbrace{t \times t}_{=0} + t \times (R x),    (2.7)

and thus the sought-after l is found. Additionally, since t \sim e', equation (2.7) can be formulated as

    l = e' \times (R x) = [e']_\times R x,    (2.8)

where [e']_\times is the cross product matrix operator of e':

    [e']_\times = \begin{bmatrix} 0 & -e'_3 & e'_2 \\ e'_3 & 0 & -e'_1 \\ -e'_2 & e'_1 & 0 \end{bmatrix}.    (2.9)

The epipolar constraint can now be written as

    x'^\top [e']_\times R x = 0,    (2.10)

if x' is the corresponding projection of x in i_2. What [e']_\times R actually does is map any point x in the first image to the corresponding epipolar line in the second image. The matrix [e']_\times R is called the fundamental matrix and is further explained in Section 2.6.

2.6 Essential and fundamental matrix

The essential matrix E, introduced in [16], and the fundamental matrix F are both widely used in camera geometry. It is with E or F that the epipolar lines can be found, as described in Section 2.5. Both E and F fulfill the epipolar constraint, as x'^\top E x = 0 or x'^\top F x = 0; compare with equation (2.10).

The essential matrix depends only on the extrinsic camera parameters, meaning that to find E, the intrinsic matrix K of both cameras must be known. If the intrinsic parameters of the cameras are unknown, it is not possible to perform normalization of the image coordinates, which is a requirement for using E. E is defined by

    E = [t]_\times R    (2.11)

or alternatively

    E = [e']_\times R    (2.12)

to coincide with the notation used in Section 2.5. F, on the other hand, can be seen as a generalization of E since it does not require the intrinsic parameters to be known. F depends on both the intrinsic and extrinsic parameters of the cameras, which means that the image points will be in pixel coordinates. It is also pixels that are related to epipolar lines in the epipolar constraint. F is related to E through equation (2.13),

    E = K'^\top F K,    (2.13)

and the epipoles are the left and right null spaces of both F and E. Both E and F are rank deficient and have a determinant of zero. However, unlike the essential matrix, which has five degrees of freedom, the fundamental matrix has seven. To find F and E it is common to use the N-point algorithm described in Section 2.7.1.
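The following small sketch expresses the relations of Sections 2.5-2.6 in code: the mapping from a point to its epipolar line, the algebraic epipolar residual, and the F-to-E conversion of equation (2.13). It is an illustration only; the point arguments are assumed to be homogeneous 3-vectors.

```python
import numpy as np

def epipolar_line(F, x):
    """Map a homogeneous point x in the first image to its epipolar line
    l' = F x in the second image (compare equation (2.10))."""
    return F @ x

def epipolar_residual(F, x1, x2):
    """Algebraic residual x2^T F x1; close to zero for a true correspondence."""
    return float(x2 @ F @ x1)

def essential_from_fundamental(F, K1, K2):
    """Relate F and E as in equation (2.13): E = K2^T F K1."""
    return K2.T @ F @ K1
```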
2.7 Structure from motion

Structure from motion (SfM) is a technique used to reconstruct 3D point clouds from a set of 2D image correspondences along with recovering the 3D camera poses, also known as Euclidean reconstruction. One important property of SfM is that the method can only reconstruct 3D views up to a scale, meaning that only the ratio and direction of the reconstructed points are given. If the scale parameter is to be decided, further information about the environment is needed. A simple example is shown in Figure 2.3. The information presented in the following section is retrieved from [14, Ch. 11].

Figure 2.3: A simple structure from motion example. The blue moon object projects to different places in the image plane depending on the camera pose at the moment the image is taken. This information can be used to find the camera movement and a set of 3D points of the object.

The first step of performing SfM is to find corresponding key feature points between images, which can be done by using any feature extraction method, e.g. SIFT, as mentioned in Section 2.3. The key feature points are then matched and the outliers are often found and rejected by using RANSAC, described in Section 2.4. Knowing a set of matching key feature points in two images, the relative pose between the cameras can be established by forming the following camera matrix P,

    P = K \begin{bmatrix} R & t \end{bmatrix}.    (2.14)

K is the intrinsic matrix, R a rotation and t a translation to the other camera. More specifically, P relates 3D points to 2D image points as

    x = P X,    (2.15)

a property used to establish the relative relationship between two cameras. Assuming two cameras in R^3, the respective camera matrices, P and P', can be represented by

    P = \begin{bmatrix} I & 0 \end{bmatrix}    (2.16)

and

    P' = \begin{bmatrix} R & t \end{bmatrix},    (2.17)

since the pose between the cameras is relative. Using equation (2.3), a point x in the first image is projected onto the second image by

    x' = d R x + t,    (2.18)

where d is a scalar. Furthermore, equation (2.18) indicates that x' has to lie somewhere on the epipolar line of the second image, as described in Section 2.5. Now, if both sides of equation (2.18) are multiplied with the cross product matrix operator of t,

    [t]_\times = \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix},    (2.19)

the following equation is formed:

    [t]_\times x' = d [t]_\times R x + \underbrace{[t]_\times t}_{=0}.    (2.20)

Multiplying both sides with x'^\top gives

    0 = x'^\top [t]_\times x' = d\, x'^\top ([t]_\times R) x,    (2.21)

which follows from [t]_\times being a skew-symmetric matrix, and therefore the left-hand side of the equation above cancels out to zero. What is left on the right-hand side of the equation is now the epipolar constraint, see equation (2.10) for reference,

    x'^\top \underbrace{([t]_\times R)}_{=F} x = 0,    (2.22)

with d omitted since it is only a scalar. From the above derivations it is clear that to find P and P', F, or alternatively E, has to be known.

2.7.1 N-point algorithms

The task of finding F or E is done by using some N-point algorithm, where N is the minimal number of points needed to solve an equation system. Expanding equation (2.22) with x_i = [x_i\; y_i\; z_i]^\top and x'_i = [x'_i\; y'_i\; z'_i]^\top gives

    x'^\top_i F x_i = F_{11} x'_i x_i + F_{12} x'_i y_i + F_{13} x'_i z_i + F_{21} y'_i x_i + F_{22} y'_i y_i + F_{23} y'_i z_i + F_{31} z'_i x_i + F_{32} z'_i y_i + F_{33} z'_i z_i,    (2.23)

which can be written on matrix form for N entries as

    \underbrace{\begin{bmatrix} x'_1 x_1 & x'_1 y_1 & x'_1 z_1 & \cdots & z'_1 z_1 \\ x'_2 x_2 & x'_2 y_2 & x'_2 z_2 & \cdots & z'_2 z_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x'_n x_n & x'_n y_n & x'_n z_n & \cdots & z'_n z_n \end{bmatrix}}_{A} \begin{bmatrix} F_{11} \\ F_{12} \\ F_{13} \\ \vdots \\ F_{33} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.    (2.24)

Generally, the number of points needed to solve the system of equations for F and E has to be at least the same as the number of unknown parameters in each respective matrix. However, even if the fundamental matrix has nine entries, it only takes eight different matched points, i.e. N ≥ 8. The eight degrees of freedom derive from the camera poses only being determined up to a scale factor, i.e. only the direction and ratio between the reconstructed points and camera poses can be found. To solve for E, only five points have to be used, since the intrinsic parameters are already known and thus do not need to be estimated.

By using singular value decomposition (SVD) the system of equations in equation (2.24) can be solved. Equation (2.25) describes the SVD of A, and the entries of F can be found in the last column of V.

    A = U S V^\top    (2.25)

Having found F (or E), the next step is to retrieve the camera matrices. Since e' and t are equivalent, P' can also be described as

    P' = \begin{bmatrix} R & e' \end{bmatrix}.    (2.26)

The next to last step is to calculate e', which is done by solving for the null space of F, and then R can finally be found by using equations (2.12) and (2.13). When P and P' are known, the 3D points can be recovered with triangulation as described in Section 2.7.2.
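A minimal sketch of the N-point idea follows, using the eight-point case for F since it maps directly onto equations (2.24)-(2.25); the five-point solver for E used elsewhere in the thesis is more involved and usually taken from a library. Coordinate normalisation is deliberately omitted here, although it is needed for numerical stability in practice.

```python
import numpy as np

def eight_point_fundamental(pts1, pts2):
    """Direct linear solution of equation (2.24) for F from N >= 8
    correspondences given as (N, 2) pixel coordinates."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])   # homogeneous points
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])

    # Each correspondence contributes one row of A, equation (2.24).
    A = np.array([np.outer(x2[i], x1[i]).ravel() for i in range(len(x1))])

    # SVD of A, equation (2.25); the entries of F are the right singular
    # vector belonging to the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)

    # Enforce the rank-2 constraint (det F = 0) mentioned in Section 2.6.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```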
2.7.2 Triangulation

Triangulation is a method used to find a 3D point given a set of images of the same object [17]. To use this method, the camera matrices P and P' for the cameras involved must be known, see Section 2.7. In its simplest form, triangulation can be described by calculating where the projection rays of the image points x and x' intersect. Figure 2.4 shows an illustration of triangulation. Equations (2.27)-(2.30) show an example of how to mathematically solve for the sought-after 3D position. This simple form of triangulation is only applicable when the images have epipolar geometry.

Figure 2.4: By projecting two points x and x' from two cameras it is possible to find the corresponding 3D point X by calculating their intersection point.

Expanding equation (2.15) in Section 2.7 for both x and x' yields

    x = P X = \begin{bmatrix} p_1 X \\ p_2 X \\ p_3 X \end{bmatrix}, \qquad x' = P' X = \begin{bmatrix} p'_1 X \\ p'_2 X \\ p'_3 X \end{bmatrix},    (2.27)

where p_1-p_3 are the row vectors of P. The goal is to find the point X that satisfies equation (2.27). The first step to solve for X is to take the cross product with the image point x on both sides of equation (2.27), and since the vectors x and PX have the same direction the cross product is equal to zero, i.e.

    x \times \begin{bmatrix} p_1 X \\ p_2 X \\ p_3 X \end{bmatrix} = \begin{bmatrix} x_2 p_3 X - p_2 X \\ -x_1 p_3 X + p_1 X \\ x_1 p_2 X - x_2 p_1 X \end{bmatrix} = 0.    (2.28)

Since the last row is a linear combination of the first two, it can be left out. With the last row removed and after a simplification, the final equation can be written as

    \begin{bmatrix} x_1 p_3 - p_1 \\ x_2 p_3 - p_2 \end{bmatrix} X = 0.    (2.29)

The same steps can be applied to x', resulting in the final relationship shown in equation (2.30), from which X can be solved:

    \begin{bmatrix} x_1 p_3 - p_1 \\ x_2 p_3 - p_2 \\ x'_1 p'_3 - p'_1 \\ x'_2 p'_3 - p'_2 \end{bmatrix} X = 0.    (2.30)

Cheirality

When it comes to camera geometry, all points found on the epipolar lines are not necessarily projected from in front of the cameras. In [18] it is stated that for a point X to be in front of the cameras it has to fulfill the following condition:

    r_3 (X - O) > 0,    (2.31)

where r_3 is the third row of the rotation matrix R contained in P. Cheirality is an important aspect to take into consideration, for example in SfM, since points found projecting from behind the camera are obviously outliers and should not be taken into consideration.
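To make equation (2.30) concrete, here is a minimal sketch of linear (DLT) triangulation of one point from two views. It assumes the camera matrices are 3x4 numpy arrays and the image points are given in the same coordinates as P and P'.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear triangulation of one 3D point from two views, following
    equation (2.30). x1 and x2 are (x, y) image points."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector belonging to the
    # smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # back to Cartesian coordinates
```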
2.8 Z-score

The Z-score, or the standard score, is used to measure how many standard deviations away a certain measurement is from the mean. The Z-score is useful for identifying outliers in a large data set. In a normal distribution, 99.7% of the values lie within the ±3σ interval, meaning that a Z-score greater than three is highly implausible and can therefore be considered an outlier. The Z-score is calculated as

    Z = \frac{x - \mu}{\sigma},    (2.32)

where x is the measurement, \mu is the mean of the collected measurements and \sigma is the standard deviation.

2.9 Genetic algorithm

A genetic algorithm, or GA, is an optimization algorithm inspired by evolution and natural selection. The theory in this section is based on [19, Ch. 3]. The GA can be summarized in three distinctive steps - initialization, evaluation, and evolution, where the latter two are repeated until termination. All steps will be further explained in this section.

2.9.1 Initialization

As in all optimization, there is a set of parameters to be optimized, x_1, x_2, ..., x_n, where n is the number of parameters. In a genetic algorithm, one such set, i.e. x_1 ... x_n, is called an individual. Each individual has a chromosome made out of genes that describes the set of parameters. The value of a gene can be described with equation (2.33),

    x_i = C(g_i) \quad \forall i \in \{1, \dots, n\},    (2.33)

where C is a chromosome and g_i is the gene that represents the i:th parameter in x_1 ... x_n. The values that the genes are allowed to take are restricted to a pre-defined range, called the search space. It is within this range that the algorithm will search for solutions. All genes do not need to have the same search space.

The algorithm starts by initializing a population of individuals with a population size N_s. A higher population size yields a higher computational complexity but does not necessarily lead to better results. The optimal population size is dependent on the application. Furthermore, to initialize all the individuals' chromosomes, each gene is given a value by uniformly drawing a random value from the specific gene's search space. The initialization is described in equation (2.34),

    C_i(g) \sim U(lb(g), ub(g)) \quad \forall i \in \{1, \dots, N_s\},    (2.34)

where ub and lb are the upper and lower bounds of the search space for a specific gene. It is crucial to choose a proper search space. If set too small, the solution might fall outside the search space; if too big, the algorithm might never converge.

2.9.2 Evaluation

The quality of each individual has to be evaluated during the optimization. The evaluation occurs after the individuals have evolved. The evolution process will be described in Section 2.9.3. The quality measurement is called fitness and is calculated with a fitness function. When all individuals have been tested for fitness, the value can be used to determine the strongest individual and thus which to favor in the next evolution process.

2.9.3 Evolution

After the evaluation step, all individuals are evolved to create a new population for the next generation. The evolution process usually consists of the following steps.

Selection

Selection is the process of determining which individuals are to be used when forming the new generation. There are many methods to perform selection, but roulette-wheel selection and tournament selection are the two most common.

In roulette-wheel selection, an individual is chosen with a probability proportional to its fitness. An individual with higher fitness will have a higher chance of being selected. The method has gotten its name from the similarities with a roulette wheel, as illustrated in Figure 2.5. The probability P_i of choosing an individual is calculated as

    P_i = \frac{F_i}{\sum_{j=1}^{N} F_j},    (2.35)

where F_i is the fitness of individual i.

Figure 2.5: Roulette wheel selection can be interpreted as a spin of the wheel, where the size of each section is determined by the fitness of the individuals participating in the selection process.

Tournament selection, on the other hand, chooses individuals randomly to participate in a tournament. The number of individuals selected is determined by the tournament size, j - a pre-defined parameter. From the selected individuals, a winner is determined based on its fitness. However, to give individuals with lower fitness a chance, the winner is selected with the probability pattern described in equation (2.36). P_j is the probability of choosing individual j, where j = 1 represents the individual with the highest fitness among the selected individuals and so on. p_t is the probability of choosing the individual with the highest fitness. Figure 2.6 is a visualization of tournament selection.

    \begin{aligned} P_1 &= p_t \\ P_2 &= p_t (1 - p_t) \\ P_3 &= p_t (1 - p_t)^2 \\ &\;\;\vdots \\ P_j &= p_t (1 - p_t)^{j-1} \end{aligned}    (2.36)

Figure 2.6: Tournament selection with tournament size 3. Note that the best individual does not always win.
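As an illustration of the probability pattern in equation (2.36), the following is a minimal sketch of tournament selection. The names and the handling of the residual probability (which here falls on the weakest competitor) are choices made for this example.

```python
import random

def tournament_select(population, fitness, tournament_size, p_t):
    """Pick one individual by tournament selection. `fitness[i]` is the
    fitness of population[i] and p_t is the probability of picking the
    fittest remaining competitor, following equation (2.36)."""
    competitors = random.sample(range(len(population)), tournament_size)
    competitors.sort(key=lambda i: fitness[i], reverse=True)   # best first
    for i in competitors[:-1]:
        if random.random() < p_t:
            return population[i]
    return population[competitors[-1]]   # fall-through: weakest competitor
```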
Crossover

Crossover is a process that mimics the reproduction process in nature. The idea is that two individuals' genes are mixed up in the hope that the best of each carries over to the new individual. Crossover is a very effective method, but can easily lead to local optima if not regulated. Therefore, crossover is only applied to a portion of the individuals, determined by the probability p_c. To perform crossover, the chromosomes of the selected individuals are split at a random gene. The chromosomes then swap genes at the splitting point, see Figure 2.7, where the line represents the splitting point.

Figure 2.7: Visualisation of crossover. The line represents the crossover point where the individuals swap genes. The result is two new individuals.

Mutation

The process of mutation makes sure that the individuals evolve and new solutions are explored. Even though mutation does not always lead to better fitness, it is important for the purpose of exploration. Mutation works by randomly changing some of the genes to another value in the valid search space. The probability p_m that a gene mutates is often set to m/n, where m is the number of mutations per individual and n is the number of genes in one individual. A constant mutation rate is often used since it is hard to decide how the mutation rate should vary.

When a mutation occurs, the new value is drawn from a uniform distribution. The width of the distribution is determined by the creep rate c_rate, which regulates how far from its former value a gene can mutate. If the new mutated value would lie outside the search space, it is important to limit the value and instead set the new mutated value to the search space limit. The mutation process is illustrated in Figure 2.8.

Figure 2.8: Illustration of mutation. The red arrows represent the genes being mutated. In this example the number of mutated genes is two.

Elitism

Even though the best individual is likely to be selected for the next generation, it is not a certainty. Even if the best individual is included in the next generation, it cannot be guaranteed that the individual will not undergo mutation. Elitism fixes this problem by always including the best individual in the next generation without any changes, ensuring that the best individual of each generation will not be lost.

Termination

The evaluation-evolution cycle will continue until a termination criterion has been fulfilled, which for instance can be a certain number of generations. Finally, the individual with the highest fitness in the last generation is returned as the solution of the optimization.
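The sketch below ties the evolution operators of Section 2.9.3 together into one generation step, reusing the tournament_select sketch shown above. The parameter values, per-gene bounds and creep rates are placeholders for illustration, not the thesis's tuning.

```python
import random

def next_generation(population, fitness, p_crossover, p_mutation, creep_rate, bounds):
    """One evolution cycle: elitism, tournament selection, single-point
    crossover and creep mutation. `bounds[g]` holds the (lower, upper)
    limits of the search space for gene g, and `creep_rate[g]` the maximum
    mutation step for that gene."""
    best = max(range(len(population)), key=lambda i: fitness[i])
    new_pop = [list(population[best])]                     # elitism: keep the best unchanged

    while len(new_pop) < len(population):
        p1 = tournament_select(population, fitness, tournament_size=5, p_t=0.7)
        p2 = tournament_select(population, fitness, tournament_size=5, p_t=0.7)
        c1, c2 = list(p1), list(p2)

        if random.random() < p_crossover:                  # single-point crossover
            cut = random.randrange(1, len(c1))
            c1[cut:], c2[cut:] = p2[cut:], p1[cut:]

        for child in (c1, c2):                             # creep mutation
            for g, (lb, ub) in enumerate(bounds):
                if random.random() < p_mutation:
                    child[g] += random.uniform(-creep_rate[g], creep_rate[g])
                    child[g] = min(max(child[g], lb), ub)  # clamp to the search space
        new_pop.extend([c1, c2])

    return new_pop[:len(population)]
```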
3 Methods

This chapter introduces and explains all the methods used to arrive at the result. The overarching task of the method was to find the poses of one camera and one LiDAR mounted on the same car with overlapping FoV. The method can briefly be described by the following steps:

1. Collection of simulated sensor data.
2. Conversion of camera images to 3D point clouds to be able to accurately combine and compare with the LiDAR data in a fitness model.
3. Definition of a fitness model.
4. Definition of an objective function to maximize for the highest fitness and thus also the best sensor poses.
5. Optimization.

The flow from beginning to end can also be seen in Figure 3.1.

Figure 3.1: Overview of the method: simulation and collection of data, processing and combining the data, the fitness model, the objective function, the optimization, and finally the sensor poses.

3.1 Simulation and collection of data

To simulate and collect useful data for the project, the CARLA simulator was used [20]. CARLA is a simulator built upon Unreal Engine for vehicle simulation and has a good interface for integrating different sensors. There were two main reasons why a simulator was needed and used, the first one being that it was fast and convenient for producing different scenarios and tests. Secondly, no real sensor data that satisfied our requirements was available.

The LiDAR type used was solid state. Based on the theory presented in Section 2.1, introducing calibration methods for solid state LiDARs will be of great importance in the near future. The FoV of the LiDAR used was 90°. Since the only LiDAR model included in the CARLA library was a rotating LiDAR, a solid state one had to be emulated. It was possible to emulate solid state LiDAR characteristics in CARLA by using depth cameras. The same approach was used in [10] with great success. The range of the LiDAR was set to 50 meters. In addition to a LiDAR, a camera was used in the simulation. The camera was an RGB pinhole camera with a resolution of 1920 × 1080 and an FoV of 90°. The camera had no distortion. The sensors' ground truth position in relation to the vehicle coordinate system can be seen in Table 3.1.

Table 3.1: Values of the true calibration parameters for the LiDAR and the camera.

Sensor | x (m) | y (m) | z (m) | ϕ (°) | θ (°) | ψ (°)
Camera | 1     | -1.6  | 2.8   | 0     | 0     | -90
LiDAR  | 1.25  | -1.6  | 2.8   | 0     | 0     | -90

To evaluate the algorithm's robustness to disturbances while collecting data, we decided to simulate five different scenarios where we release some of the vehicle's DoF to account for human imperfection and uneven ground. The different scenarios are listed in Table 3.2. During the simulation a random number was drawn from a uniform distribution, and the vehicle's pose was then altered accordingly. The distribution bounds were set to [−2, 2]° since we considered this a fair approximation of human error.
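For orientation, the following is a minimal sketch of how an RGB camera and a depth camera (used to emulate the solid state LiDAR, as described above) can be attached to a vehicle through the CARLA 0.9.x Python API. It is not the thesis's actual data-collection script: the vehicle lookup, the mapping of the (ϕ, θ, ψ) values in Table 3.1 onto CARLA's rotation convention, and the output paths are assumptions made for this example.

```python
import carla

# Connect to a running CARLA server and fetch the simulated world.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
blueprints = world.get_blueprint_library()

# RGB camera matching the configuration in Table 3.1 (1920x1080, 90 degree FoV).
cam_bp = blueprints.find("sensor.camera.rgb")
cam_bp.set_attribute("image_size_x", "1920")
cam_bp.set_attribute("image_size_y", "1080")
cam_bp.set_attribute("fov", "90")
cam_tf = carla.Transform(carla.Location(x=1.0, y=-1.6, z=2.8),
                         carla.Rotation(yaw=-90.0))

# Depth camera used to emulate a solid state LiDAR with a 90 degree FoV.
depth_bp = blueprints.find("sensor.camera.depth")
depth_bp.set_attribute("fov", "90")
depth_tf = carla.Transform(carla.Location(x=1.25, y=-1.6, z=2.8),
                           carla.Rotation(yaw=-90.0))

# Attach both sensors to an already spawned vehicle (assumed to exist).
vehicle = world.get_actors().filter("vehicle.*")[0]
camera = world.spawn_actor(cam_bp, cam_tf, attach_to=vehicle)
depth = world.spawn_actor(depth_bp, depth_tf, attach_to=vehicle)

# Save the raw frames to disk; further processing happens offline.
camera.listen(lambda image: image.save_to_disk("out/rgb_%06d.png" % image.frame))
depth.listen(lambda image: image.save_to_disk("out/depth_%06d.png" % image.frame))
```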
3.2 Process and combine data

After having collected the simulated data, the next step was to perform data fusion of the collected point clouds and images. In this thesis, it was desirable to keep the 3D properties of the LiDAR point cloud, and therefore the camera data was converted to 3D point clouds with an SfM pipeline. To make the collected data comparable, some pre-processing was needed. The goal of this section is to explain how the 2D images collected by the camera were converted to 3D points, enabling 3D to 3D comparison of two point clouds. Figure 3.2 displays the steps presented in this section.

Figure 3.2: The steps performed in order to be able to fairly compare the data from the camera and the LiDAR: N camera frames are extracted from the simulation and passed through SfM, the GPS/IMU data is used to calculate the scale and rescale the SfM points, M LiDAR frames are extracted, and the resulting point clouds are passed to the objective function.

The camera images were converted to 3D point clouds by using SfM. Moreover, the number of camera frames and the scale factor to use in the SfM pipeline had to be chosen. The scale factor was needed to ensure that the SfM point cloud and the LiDAR point cloud had the same size. In the case of the LiDAR data, only the number of frames to use had to be decided.

Table 3.2: The different car test case scenarios used in the simulations. The rotational parameters are offsets from their respective ground truth values. The translation parameters (x, y, z) were not exposed to any external disturbances. Note that the parameters are related to the car and not the sensors.

Scenario | ϕ (°) | θ (°) | ψ (°)
Base     | 0  | 0  | 0
Roll     | ±2 | 0  | 0
Pitch    | 0  | ±2 | 0
Yaw      | 0  | 0  | ±2
Combined | ±2 | ±2 | ±2

3.2.1 Conversion of camera images to 3D point clouds

SfM, described in Section 2.7, was used to convert the camera images to 3D points, see Figure 3.3 and Algorithm 1 for reference. After the 3D points had been reconstructed, the points behind the camera were identified and removed by utilizing the cheirality condition mentioned in Section 2.7.2. Finally, an outlier filter was applied based on the Z-score, presented in Section 2.8, of each point's distance to the estimated camera position. A point is considered an outlier if it has a Z-score of Z > 3 or Z < −3, i.e. if it lies beyond the ±3σ interval of the normal distribution of the reconstructed 3D points.

Figure 3.3: Flow chart displaying the most prominent steps of SfM: choose two images from the set of k images, extract SIFT features, match the found features, find the essential matrix E, estimate the camera poses P and P', and triangulate points using P and P'.

Algorithm 1: Structure from motion
Input: imageSet ← set of k images from the simulation
Output: 3Dpoints, a 3D point cloud constructed from the input image set

for I_k ∈ imageSet do
    if first iteration then
        SIFTpoints_1 ← detect SIFT key feature points and extract descriptors for the first image I_1
    else
        SIFTpoints_k ← detect SIFT key feature points and descriptors for the current image I_k
        matches ← match features for the current and previous image, I_k and I_{k−1}
        E ← estimate the essential matrix with the 5-point algorithm
        relPose ← estimate the relative pose of the camera from E
        points ← find corresponding points across all previous matches
        3Dpoints ← Triangulate(points)
        3Dpoints ← remove points behind the camera and outliers
    end
    SIFTpoints_{k−1} ← SIFTpoints_k
end

3.2.2 Scale estimation

Since SfM only reconstructs 3D points up to ratio and direction but not scale, a scale factor needs to be estimated to be able to properly compare the LiDAR point cloud with the one constructed by SfM.

As mentioned in Section 1.4, it is assumed that the vehicle is equipped with positioning sensors such as GPS/GNSS and IMU. The data collected by these sensors was used to estimate the car pose in the simulation, and thereafter to calculate an estimation of the pose of the camera in world coordinates for each frame. Assuming that the resulting camera poses from SfM are correct in direction and ratio, they can be used to calculate the scale factor between the point cloud generated by SfM and the point cloud from the LiDAR sensor. The scale factor between frame i and i+1 is calculated as

    s = \frac{\| t_{i+1} - t_i \|}{\| C_{i+1} - C_i \|},    (3.1)

where C_i is the position of the camera extracted from the camera matrix P from SfM, and t_i is the position of the camera calculated from the car pose as

    t_i = t_{c,i} + R_{c,i} v.    (3.2)

t_{c,i} is the position of the car, R_{c,i} is the rotation matrix of the car at frame i, and v is the translation between the origin of the car and the camera. Theoretically, the scale factor should not change between frames, but to ensure robustness the scale factor is chosen as the median of all calculated scale factors.
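The scale estimation of equations (3.1)-(3.2) is small enough to show directly; the sketch below is a straightforward transcription under the assumption that the car poses are given as (position, rotation matrix) pairs.

```python
import numpy as np

def estimate_scale(cam_positions_sfm, car_poses, v):
    """Estimate the SfM scale factor as in equations (3.1)-(3.2).
    cam_positions_sfm: (N, 3) camera centres C_i from the SfM pipeline.
    car_poses: list of (t_c, R_c) pairs from GPS/IMU for each frame i.
    v: translation from the car origin to the camera (rough estimate)."""
    # Camera position in world coordinates for every frame, equation (3.2).
    t = np.array([t_c + R_c @ v for t_c, R_c in car_poses])

    # Per-frame scale factors, equation (3.1).
    scales = [np.linalg.norm(t[i + 1] - t[i]) /
              np.linalg.norm(cam_positions_sfm[i + 1] - cam_positions_sfm[i])
              for i in range(len(t) - 1)]

    # The median is used for robustness, as described above.
    return float(np.median(scales))
```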
3.3 Fitness model

A fitness model was needed for measuring how well two point clouds overlap: the more overlap, the higher the fitness score given out. The fitness score was important for defining the objective function of the optimization described in Sections 3.4 and 3.5.

The fitness model used was based on a voxel grid filter, i.e. a cubical grid filter that downsamples a point cloud. Each voxel that contained at least one 3D point was reduced to only having one 3D point in the center of the voxel, as shown in Figure 3.4.

Figure 3.4: Illustration of applying a voxel grid filter to a simple point cloud: (a) a simple point cloud before filtering, (b) applying the filter on the point cloud, (c) the point cloud after filtering. Images are taken from [10] with permission from the authors.

The downsampling feature of the voxel grid filter was utilized when defining the fitness score as

    f_{score} = \text{points before filtering} - \text{points after filtering}.    (3.3)

It follows from equation (3.3) that a greater overlap between point clouds results in a higher fitness score.

3.4 Objective function

Since the calibration parameters were to be found by optimization, an objective function was needed. The objective was to find the best possible overlap of the LiDAR and SfM point clouds. Therefore, the objective function had to be defined in a way that takes the overlap into consideration. The fitness model from Section 3.3 was used to define the objective function as

    \max f(\text{point clouds}, \text{calibration parameters}).    (3.4)

The calibration parameters consisted of the pose of each sensor used, see Figure 3.5 for reference. The function f was defined as the fitness score f_{score} in equation (3.3), meaning the objective function maximizes the fitness score.

Figure 3.5: An array [x_l, y_l, z_l, ϕ_l, θ_l, ψ_l, x_c, y_c, z_c, ϕ_c, θ_c, ψ_c] holding the calibration parameters of the sensors. The subscript l or c indicates a LiDAR or camera sensor respectively.

An important part of choosing the objective function was to choose the size of the voxels. The point cloud registered from the LiDAR was bounded by its lowest resolution and thus the voxel size had to be decided with this taken into consideration. Equation (3.5) states that at a distance r from the LiDAR, objects with a width greater than or equal to res_{low} will be detected, which is the lowest resolution of the LiDAR. α is the angular resolution of the LiDAR. The size of the voxels was set to res_{low}, to correctly filter the points far away.

    res_{low} = 2 \pi r \frac{\alpha}{360}    (3.5)
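A minimal sketch of the fitness score and voxel-size choice follows. Open3D's voxel downsampling is used here as a stand-in for whatever voxel grid filter the thesis implementation relied on; the merged LiDAR-plus-SfM cloud is assumed to be passed in as a single (N, 3) array.

```python
import numpy as np
import open3d as o3d

def fitness_score(points, voxel_size):
    """Voxel grid fitness from equation (3.3): the number of points removed
    when the combined cloud is downsampled. A larger overlap between the
    LiDAR and SfM clouds removes more points and therefore scores higher."""
    cloud = o3d.geometry.PointCloud()
    cloud.points = o3d.utility.Vector3dVector(points)
    filtered = cloud.voxel_down_sample(voxel_size=voxel_size)
    return len(points) - len(filtered.points)

def lowest_resolution(r, alpha_deg):
    """Voxel size choice from equation (3.5): the lowest LiDAR resolution at
    range r for an angular resolution alpha, given in degrees."""
    return 2.0 * np.pi * r * alpha_deg / 360.0
```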
3.5 Optimization

The optimization problem was solved using a GA, as described in Section 2.9. Each degree of freedom of each sensor was represented by a gene, illustrated in Figure 3.5, adding up to twelve genes per chromosome and individual. The SfM point cloud had to be re-scaled according to equation (3.1) before being used in the fitness function. The objective function used was the one described in Section 3.4, and after the GA had terminated the best individual of all generations was returned. Termination was decided to occur after a certain number of generations. The algorithm is described in Algorithm 2.

3.6 Sensor poses

After the optimization had terminated, the fittest individual's genes were returned as the sought-after sensor poses. The sensor poses were, however, only defined in relation to each other, since no information about the car, or how it was related to the sensors, was incorporated into the optimization. Thus, one sensor had to be constrained to its ground truth values, thereby enabling the comparison between the pose of the other sensor and its ground truth value.

Algorithm 2: GA optimization
Input: point_cloud_list ← all point clouds
Output: overall_best, the resulting best sensor poses

Initialization: generate N tuples of chromosomes, each of length k·n, where k is the number of DoF for a sensor and n is the number of sensors used; each tuple represents an individual in the population.
termination ← FALSE
n_gen ← 0    // counter holding the number of generations evolved
while ¬termination do
    for i ∈ individuals do
        transformed_point_clouds ← transform the point clouds according to the pose encoded in the current individual i
        f_score ← evaluate(transformed_point_clouds)    // save score
    end
    if n_gen ≥ max_gen then
        termination ← TRUE    // max_gen is the number of generations to run before termination
    else
        while n_iter < N do
            j_1, j_2 ← selection of two individuals from the evaluated population
            j_1, j_2 ← crossover of the two chromosomes of the selected individuals
            j_1, j_2 ← mutation of the newly crossed chromosomes
            N_new ← add the generated individuals
        end
    end
    N ← N_new
    best_individuals ← save the best individual
    overall_best ← max(best_individuals)
    n_gen++
end
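To show how the evaluate step of Algorithm 2 can turn one individual's genes into transformed point clouds, here is a small sketch that applies the 6-DoF pose halves of the chromosome in Figure 3.5 and scores the overlap with the fitness_score sketch given after Section 3.4. The xyz Euler-angle convention is an assumption for this example; the thesis does not state its convention.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def apply_pose(points, genes):
    """Transform a point cloud with the 6-DoF pose encoded by one sensor's
    genes (x, y, z, roll, pitch, yaw in degrees)."""
    x, y, z, roll, pitch, yaw = genes
    R = Rotation.from_euler("xyz", [roll, pitch, yaw], degrees=True).as_matrix()
    return points @ R.T + np.array([x, y, z])

def evaluate(individual, lidar_points, sfm_points, voxel_size):
    """Fitness of one individual: transform both clouds by their respective
    encoded poses and score the overlap with the voxel grid filter."""
    lidar_t = apply_pose(lidar_points, individual[:6])
    sfm_t = apply_pose(sfm_points, individual[6:])
    return fitness_score(np.vstack([lidar_t, sfm_t]), voxel_size)
```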
4 Results

This chapter presents the results achieved from implementing the method. The results were produced by simulating the five different cases presented in Section 3.1, and the cases are compared to each other with respect to robustness, accuracy and overall performance. First, the results of using the SfM pipeline to reconstruct camera images in 3D are presented. Thereafter, the results produced by applying the GA to the different cases are presented along with the tuning parameters used.

4.1 Structure from motion pipeline

The SfM result was visually evaluated due to the lack of a proper statistical validation method. The point clouds generated with SfM were compared to an image of the scene to evaluate whether a satisfying representation of the scene had been achieved. Another validation test for SfM was to compare the SfM point cloud and the LiDAR point cloud with respect to depth and scale. The relevant plots can be seen in Figures 4.1 and 4.2.

Figure 4.1: SfM point cloud (blue) and LiDAR point cloud (red) in the same plot, with correct overlap. It is visible that the proportions and depth of the SfM point cloud match the LiDAR point cloud.

The SfM point clouds consisted of fewer points than the LiDAR point clouds.

Figure 4.2: (a) LiDAR point cloud only; (b) SfM and LiDAR point clouds viewed from the side.

Each point cloud was generated from a set of ten images. The size of the image set was chosen both with consideration of the size of the LiDAR point cloud and to ensure that a large number of features were detected. Furthermore, only one LiDAR frame was used so as not to increase the complexity of the problem.

The resulting point clouds generated from SfM, as well as an image of the scene, are depicted in Figure 4.3. The base case was used as a baseline of what the SfM point cloud should look like and was compared with all other cases. The roll case showed many similarities to the base case, a slight amount of noise being the only exception. However, a significant amount of noise was observed in the pitch case, resulting in a less well reconstructed scene. The yaw case had a slightly lower level of noise than the pitch case. The combined case suffered from the bad characteristics of both the pitch and yaw cases, creating a noisy point cloud in the process. Moreover, all point clouds had a tendency to duplicate objects. The duplications were most prominent in the yaw, pitch and combined cases.

Figure 4.3: The result from SfM: (a) picture of the scene; (b) base case; (c) roll case; (d) pitch case; (e) yaw case; (f) combined case. The same scene was used to test all five cases, generating five different image sequences over the same scene. Each point cloud was then generated from the corresponding sequence of images.

4.2 GA parameters

The tuning of the GA proved to be hard, not only due to the large number of parameters but also because it was difficult to determine whether the GA had converged. Without knowledge about the GA's convergence, we decided to base the tuning on the calibration result.

The algorithm consistently found better results when using tournament selection rather than roulette wheel selection, indicating that a better result correlates with the increased exploration that tournament selection gives. A high tournament probability and a low crossover probability proved to yield the best results. The tournament size was found to be best at around one third of the population size. We also decided to use a decreasing creep rate, enabling exploration at the beginning and exploitation in the later stages. Changing the mutation probability did not result in any drastic changes, so we decided to use the standard value of 1/n, where n is the number of genes. A population size of 16 together with 4000 generations was used, giving a total of 64,000 evaluations. The remaining tuning parameters are presented in Table 4.1.

Table 4.1: Chosen GA parameters
Parameter               Value
Population size         16
Generations             4000
Selection method        Tournament
Tournament prob.        0.7
Tournament size         5
Crossover prob.         0.25
Mutation prob.          1/n
Small creep rate pos.   0.2 m
Small creep rate ang.   5°
Large creep rate pos.   0.4 m
Large creep rate ang.   10°
Final creep rate pos.   0.02 m
Final creep rate ang.   0.3°
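For reference, the tournament selection and creep mutation operators parameterized in Table 4.1 could be sketched roughly as follows. The code is illustrative only, with hypothetical names and a simplified tournament rule, and is not the implementation used in this thesis; it assumes population and scores are NumPy arrays as in the GA skeleton after Algorithm 2.

```python
import numpy as np

def tournament_select(population, scores, rng, tournament_size=5, p_tour=0.7):
    """Draw tournament_size random individuals; with probability p_tour return
    the fittest contestant, otherwise a random contestant (cf. Table 4.1)."""
    idx = rng.choice(len(population), size=tournament_size, replace=False)
    if rng.random() < p_tour:
        winner = idx[np.argmax(scores[idx])]
    else:
        winner = rng.choice(idx)
    return population[winner].copy()

def creep_mutate(individual, rng, creep_rates, p_mut=None):
    """Creep mutation: with probability p_mut = 1/n per gene (Table 4.1), add
    uniform noise of magnitude creep_rates[i] to gene i."""
    n = len(individual)
    p_mut = 1.0 / n if p_mut is None else p_mut
    mutated = individual.copy()
    for i in range(n):
        if rng.random() < p_mut:
            mutated[i] += rng.uniform(-creep_rates[i], creep_rates[i])
    return mutated
```

In a decreasing creep-rate scheme, creep_rates would start at the large values in Table 4.1 and shrink toward the final values as the generations progress, for example by binding the current rates to the mutate callable with functools.partial.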
4.3 GA calibration results

The five simulated scenes mentioned in Section 3.1 were tested with a small and a large search space, shown in Table 4.2. A GA optimization was deemed successful if the translation error in each direction was ≤ 0.1 m and the rotation error about each axis was ≤ 1.5°, for both search spaces. The average error for each scenario is presented in Tables 4.3 and 4.4 for the small and large search space respectively. Table 4.5 shows the number of successful calibrations for each simulated case. Lastly, Tables 4.6 and 4.7 present the best individuals and their gene errors for the different cases and search spaces.

Table 4.2: Ranges of the two different search spaces used. Each range is defined as an offset from the ground truth values.
                   Small search space   Large search space
Translation (m)    -0.2 to 0.2          -0.4 to 0.4
Rotation (°)       -5 to 5              -10 to 10

Table 4.3: Average gene error over 50 iterations for the small search space.
Case       x (m)     y (m)     z (m)     ϕ (°)     θ (°)     ψ (°)
Base       0.10271   0.10214   0.0917    0.5186    0.43337   0.36453
Roll       0.1072    0.06835   0.10608   0.62706   0.53484   0.637
Pitch      0.10232   0.19293   0.13557   0.9062    0.54228   0.43143
Yaw        0.11988   0.24481   0.11443   0.41859   0.95017   0.3637
Combined   0.11757   0.11608   0.14224   0.33499   1.26116   0.64808

Table 4.4: Average gene error over 50 iterations for the large search space.
Case       x (m)     y (m)     z (m)     ϕ (°)     θ (°)     ψ (°)
Base       0.15648   0.11302   0.16834   0.55368   0.56454   0.47714
Roll       0.20837   0.18403   0.23106   0.67023   0.93962   0.77826
Pitch      0.24845   0.35643   0.24677   0.96736   1.03174   2.82658
Yaw        0.19258   0.3064    0.17277   0.44325   0.64249   0.50601
Combined   0.19852   0.24436   0.21896   0.4234    1.42513   2.58035

Table 4.5: Number of successful calibrations for each search space and case. Each case was run for 50 iterations in both search spaces.
Case       Success, small search space   Success, large search space
Base       8                             3
Roll       9                             3
Pitch      2                             0
Yaw        0                             0
Combined   5                             2

Table 4.6: Best gene error of 50 iterations in the small search space. The values in parentheses are the defined success bounds for each parameter.
Case       x (±0.1 m)   y (±0.1 m)   z (±0.1 m)   ϕ (±1.5°)   θ (±1.5°)   ψ (±1.5°)   Success
Base       -0.05756     0.04017      -0.03869     -0.11712    0.04437     -0.05622    Yes
Roll       -0.00014     -0.03709     0.00521      0.69075     0.65235     -0.6388     Yes
Pitch      0.00161      0.07728      -0.15436     -0.14994    0.92051     0.04268     No
Yaw        -0.04373     0.11843      -0.08408     0.42493     0.04871     0.05618     No
Combined   -0.05228     -0.03663     0.02607      0.00133     1.03496     -0.82478    Yes

Table 4.7: Best gene error of 50 iterations in the large search space. The values in parentheses are the defined success bounds for each parameter.
Case       x (±0.1 m)   y (±0.1 m)   z (±0.1 m)   ϕ (±1.5°)   θ (±1.5°)   ψ (±1.5°)   Success
Base       0.01993      0.05776      0.02387      0.66528     0.64115     -0.27976    Yes
Roll       -0.01065     -0.05963     0.04416      0.93475     0.79486     -0.44219    Yes
Pitch      -0.20421     0.09324      -0.04029     1.1714      -0.26305    0.52156     No
Yaw        -0.12259     0.16861      -0.1152      0.01276     -0.21716    0.15345     No
Combined   -0.06394     0.02919      -0.01791     0.06172     1.44898     -0.89724    Yes

4.4 Defect objective function

For an objective function to work as intended, the ground truth parameters must generate the highest possible fitness score. However, the proposed objective function could generate higher fitness scores than the ground truth, as can be seen in Figure 4.4. This phenomenon caused the algorithm to never converge at ground truth, as it should, and made it hard to determine whether convergence had been achieved.

Figure 4.4: Example of the GA finding a higher fitness than ground truth ("Max fitness and fitness using ground truth"). The red line is the ground truth fitness and the blue line is the maximum fitness found by the GA. The vertical axis is fitness and the horizontal axis is the number of generations.
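A simple diagnostic of this defect is to compare the fitness at the ground-truth pose with the fitness of the best pose returned by the GA. The sketch below is illustrative only, with hypothetical function and variable names following the earlier sketches.

```python
def objective_defect(evaluate, point_cloud_list, ground_truth_pose, best_found_pose):
    """Return how much the best pose found by the GA outscores ground truth.
    A positive value means the objective function is defective: it is not
    maximized at the true calibration."""
    gt_fitness = evaluate(point_cloud_list, ground_truth_pose)
    best_fitness = evaluate(point_cloud_list, best_found_pose)
    return best_fitness - gt_fitness
```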
5 Discussion

In this chapter the results are discussed based upon the research questions stated in Section 1.4. The methods used to achieve the results are also discussed and their performance evaluated.

5.1 Structure from motion generated point clouds

One of the main difficulties when designing the LiDAR-camera calibration algorithm was how to represent the camera data in 3D. As stated in the first research question, this thesis has investigated how an SfM pipeline could contribute to constructing a 3D point cloud from camera data.

As presented in Section 4.1, Figure 4.1, Figure 4.2b and Figure 4.3, the algorithm was successful to the degree that the SfM point cloud was recognizable as a reconstruction of the image data. However, the algorithm worked better for some of the cases. The noise, or duplication phenomenon, characterizing the pitch, yaw and combined cases appears to be a direct consequence of SfM not being able to align points when the camera images were converted to 3D point clouds. The phenomenon can also be seen in the roll case, but not to the same extent as in the pitch, yaw and combined cases. Theoretically, SfM should not be affected by the camera movement during the simulation. However, the camera pose estimation used in SfM, or the triangulation, might have been flawed, which would explain the duplication of points observed in some of the cases.

The accuracy and refinement of the constructed point cloud also depend on the images used. One interesting question is whether the results would have been better if the camera had captured images more frequently and at a higher resolution, thus generating both more and better images and, in turn, point clouds with less noise. Additionally, images of higher resolution would increase the number of features that can be detected, hence increasing the chance of denser point clouds.

The second part of the first research question refers to the comparability between the two different types of point clouds. First and foremost, the two types of point clouds were based on different concepts. The LiDAR point cloud was based on raw data and was therefore more reliable, while the camera point cloud was constructed in several steps and was therefore more sensitive to errors. Since the point clouds were generated by different methods, it was not straightforward to determine whether a generated SfM point cloud was of a satisfactory standard in relation to the LiDAR point cloud. Nonetheless, a smaller number of points was a pervading trait of the SfM point clouds. As explained in Section 2.7, the number of points used in SfM was based on the quantity of SIFT features found. In earlier works, mainly [10], the point clouds used were produced only by LiDARs, meaning that the same object was represented by similar clusters of LiDAR points in all point clouds intercepting the object. Since the SfM point clouds were generated from features, the same object did not necessarily consist of a similar 3D point cluster as in the LiDAR data, and it was not guaranteed that an object would be visible if it did not contain enough features. Conversely, some objects or features appearing in the SfM point cloud did not appear in the LiDAR point cloud.

Finally, implementing Bundle Adjustment (BA) might help improve the results of the SfM point clouds by reducing noise. BA was tested in this thesis but was not used, as the test introduced more noise rather than compensating for it.

5.2 Objective function

The objective function did not work as intended, as shown by the results in Section 4.4. We knew from [10] that the objective function was not perfect; for LiDAR-LiDAR calibration, however, it was good enough to obtain a satisfactory calibration. One explanation for why the objective function performed worse for LiDAR-camera calibration is that the point clouds are fundamentally different. In LiDAR-LiDAR calibration, the point clouds are built the same way and should in theory be able to overlap perfectly.
However, since LiDAR point clouds are more evenly distributed while SfM point clouds are generated from clusters in feature-heavy areas, an overlap as precise as in the LiDAR-LiDAR case cannot necessarily be expected. This difference in performance could indicate that the objective function works better with similar types of point clouds, or with denser point clouds.

Since the fitness depends on a voxel grid filter, it can be biased towards voxels containing many points, regardless of whether those points are in the correct place. Another way to view it is that the algorithm favors configurations with many empty voxels, since concentrating points into fewer occupied voxels reduces the number of points remaining after downsampling and thus increases the score. For LiDAR-LiDAR calibration, the highest number of empty voxels should occur when the point clouds overlap perfectly, making ground truth the pose with the highest fitness. However, since the LiDAR point cloud and the SfM point cloud are different, ground truth may not be the pose that maximizes the number of empty voxels, and hence the fitness. The result of using this type of filtering as a fitness function was that the algorithm did not converge toward ground truth, which explains the insufficient quality of the results.

To better determine whether LiDAR-camera calibration is feasible with the method we have proposed, a new objective function needs to be formulated. The new objective function would preferably match regions of points rather than specific points, since the point clouds are built differently. One idea was to base a new fitness function on plane surfaces and use RANSAC with plane fitting. However, such an approach would be heavily reliant on planar features such as buildings or the ground, meaning that it would not work in a non-planar environment, such as a forest.
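As an illustration of this direction, and not something implemented in this work, a region-based fitness using a simple RANSAC plane fit could be sketched roughly as follows; all names and thresholds are hypothetical, and the point clouds are assumed to be NumPy arrays of shape (N, 3).

```python
import numpy as np

def ransac_plane(points, n_iter=200, threshold=0.05, rng=None):
    """Minimal RANSAC plane fit: return (unit normal, d) of the plane
    n·x + d = 0 with the most inliers within `threshold` metres."""
    rng = rng or np.random.default_rng()
    best_inliers, best_plane = 0, None
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:          # degenerate (collinear) sample, try again
            continue
        normal /= norm
        d = -normal @ sample[0]
        inliers = int(np.sum(np.abs(points @ normal + d) < threshold))
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane

def plane_fitness(lidar_points, sfm_points, threshold=0.05):
    """Hypothetical region-based fitness: fit a dominant plane to the LiDAR
    cloud and count how many SfM points lie close to that plane."""
    normal, d = ransac_plane(lidar_points)
    return int(np.sum(np.abs(sfm_points @ normal + d) < threshold))
```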
5.3 Overall performance

The results of the GA were not as accurate as anticipated. There are nonetheless some interesting aspects to discuss. The results presented in Section 4.3, and the spread of the boxplots in Appendix A, clearly showed that a smaller search space was beneficial, which could be explained by the increased probability of sampling sensor poses inside the success bounds. The larger search space allowed the GA to explore further away from the ground truth, decreasing the probability of finding successful parameters and requiring more generations to reach optimal values.

Regarding the errors in the parameter estimates, one interesting aspect was the difference in error between the estimated rotation and translation. The angular parameters could be successfully estimated for all five test cases when the small search space was used, see Table 4.3. Moreover, the error distribution over 50 iterations was inside the success bounds with a great margin for the base and roll cases, which can be seen in the boxplots in Appendix A and in Table 4.3. The algorithm can thus be used if the aim is to calibrate the angular parameters of a LiDAR-camera pair.

One explanation for the algorithm being able to successfully estimate rotation is that rotational offsets give rise to larger distributions of points in the voxel grid filter. There will then be fewer empty voxels, which decreases the fitness, and even small angular offsets can cause large distances between points that should overlap. Positional offsets, however, may not matter as much, which coincides with the positional estimation not being as adequate as the rotational. If points in the SfM point cloud are still close to coinciding with other parts of the same object, the pose can still generate a high fitness, since many points may still overlap.

Since the angular parameters could be successfully estimated in all five cases with the small search space, the algorithm can be considered robust to rotational disturbances applied to the vehicle when estimating the sensors' roll, pitch and yaw angles. Disturbances in the roll angle of the vehicle gave the best results, the undisturbed base case excluded. The algorithm is thus regarded as most robust to vehicle disturbances in roll.

6 Conclusion

This thesis has investigated a concept for automatic LiDAR-camera co-calibration. One LiDAR and one camera were used, and the objective was to find the extrinsic parameters of both sensors. The project consisted of two main parts: the first was to convert camera images to 3D; the second was to align the 3D point clouds from the camera and the LiDAR and optimize for the best overlap, thus retrieving the sought-after sensor poses.

Camera images were converted from 2D to 3D using a structure from motion pipeline. Five different disturbance scenarios were tested. Overall, the quality of the generated point clouds varied in resolution and alignment. The base case, where no rotational disturbance was introduced, produced the best point cloud in terms of refinement and accuracy. As the camera and LiDAR point clouds were not generated by the same method, objects in the point clouds were not represented by the same number of points. Some objects were only visible in the LiDAR point cloud, others only in the SfM point cloud, leading to difficulties regarding comparability.

A genetic algorithm was used to optimize the objective function and thus calibrate the sensors. However, the result did not coincide with what we considered a satisfactory calibration. The primary reason for the somewhat poor results was the flawed objective function: it could achieve a better fitness than ground truth, making the algorithm search for a solution that was not ground truth, which proved problematic. However, the calibration results for the angular parameters showed robustness to rotational disturbances in almost all cases, indicating that the method still works for calibrating roll, pitch and yaw. Together with the use of the small search space, the calibration method can be used to thoroughly calibrate the roll, pitch and yaw parameters of a LiDAR-camera pair.

6.1 Future work

The proposed method is not perfect, and some future work is needed for it to function properly. Most importantly, the objective function, and possibly the fitness function, should be revised and improved. Furthermore, densification of point clouds should be investigated, to see whether denser SfM point clouds increase the number of successful calibrations. Another recommendation is to investigate more thoroughly whether Bundle Adjustment can improve point cloud alignment and noise reduction.

Ultimately, additional sensors should be introduced, along with the possibility to constrain the sensors to the vehicle frame already in the optimization, and not only in relation to each other. A greater variety of simulation parameters and extended testing of different scenarios should also be included in further development of the method.

Bibliography
[1] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: part I," IEEE Robotics & Automation Magazine, vol. 13, no. 2, pp. 99–110, Jun. 2006, issn: 1070-9932. doi: 10.1109/MRA.2006.1638022.
[2] X. Liu, C. Yuan, and F. Zhang, "Fast and Accurate Extrinsic Calibration for Multiple LiDARs and Cameras," Sep. 2021.
[3] Z. Bai, G. Jiang, and A. Xu, "LiDAR-Camera Calibration Using Line Correspondences," Sensors, vol. 20, no. 21, p. 6319, Nov. 2020, issn: 1424-8220. doi: 10.3390/s20216319.
[4] L. Zhou, Z. Li, and M. Kaess, "Automatic Extrinsic Calibration of a Camera and a 3D LiDAR Using Line and Plane Correspondences," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, Oct. 2018, pp. 5562–5569, isbn: 978-1-5386-8094-0. doi: 10.1109/IROS.2018.8593660.
[5] A. Dhall, K. Chelani, V. Radhakrishnan, and K. M. Krishna, "LiDAR-Camera Calibration using 3D-3D Point correspondences," May 2017.
[6] S. A. Rodriguez F., V. Fremont, and P. Bonnifait, "Extrinsic calibration between a multi-layer lidar and a camera," in 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, IEEE, Aug. 2008, pp. 214–219, isbn: 978-1-4244-2143-5. doi: 10.1109/MFI.2008.4648067.
[7] B. Nagy and C. Benedek, "On-the-Fly Camera and Lidar Calibration," Remote Sensing, vol. 12, no. 7, p. 1137, Apr. 2020, issn: 2072-4292. doi: 10.3390/rs12071137.
[8] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, "Building Rome in a day," in 2009 IEEE 12th International Conference on Computer Vision, IEEE, Sep. 2009, pp. 72–79, isbn: 978-1-4244-4420-5. doi: 10.1109/ICCV.2009.5459148.
[9] L. Brown and G. Lindberg, "Extrinsic calibration of multiple 2D Lidars using a Genetic Algorithm," M.S. thesis, Chalmers University of Technology, Gothenburg, 2019.
[10] T. Langfjord Nordgård and O. Ravndal Skjølingstad, "Calibration of 3D lidars: A fully automatic and robust method for calibrating multiple 3D lidars using only point cloud data," M.S. thesis, Chalmers University of Technology, Gothenburg, 2020.
[11] M. Khader and S. Cherian, "An Introduction to Automotive LIDAR," Dallas, Tech. Rep., 2020. [Online]. Available: https://www.ti.com/lit/pdf/slyy150.
[12] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Tech. Rep., 2004.
[13] M. A. Fischler and R. C. Bolles, "Random sample consensus," Communications of the ACM, vol. 24, no. 6, pp. 381–395, Jun. 1981, issn: 0001-0782. doi: 10.1145/358669.358692.
[14] R. Szeliski, Computer Vision. Cham: Springer International Publishing, 2022, isbn: 978-3-030-34371-2. doi: 10.1007/978-3-030-34372-9.
[15] R. Klette, Concise Computer Vision. London: Springer London, 2014, isbn: 978-1-4471-6319-0. doi: 10.1007/978-1-4471-6320-6.
[16] H. C. Longuet-Higgins, "A computer algorithm for reconstructing a scene from two projections," Nature, vol. 293, no. 5828, pp. 133–135, Sep. 1981, issn: 0028-0836. doi: 10.1038/293133a0.
[17] R. I. Hartley and P. Sturm, "Triangulation," Computer Vision and Image Understanding, vol. 68, no. 2, pp. 146–157, Nov. 1997, issn: 1077-3142. doi: 10.1006/cviu.1997.0547.
[18] R. I. Hartley, "Chirality," International Journal of Computer Vision, vol. 26, no. 1, pp. 41–61, 1998, issn: 0920-5691. doi: 10.1023/A:1007984508483.
[19] M. Wahde, Biologically Inspired Optimization Methods: An Introduction. Southampton, UK: WIT Press, 2008.
[20] CARLA Team, CARLA, 2022. [Online]. Available: https://carla.org/.

A Boxplots

The error of each simulated case is presented below in boxplots. In each boxplot, the area between the red lines marks the success bounds, defined as offsets from the true value. The blue line at zero is the desired error.

A.1 Small solution space

Figure A.1: Base case with small solution space: (a) angular errors; (b) positional errors.
Figure A.2: Roll case with small solution space: (a) angular errors; (b) positional errors.
Figure A.3: Pitch case with small solution space: (a) angular errors; (b) positional errors.
Figure A.4: Yaw case with small solution space: (a) angular errors; (b) positional errors.
Figure A.5: Combined case with small solution space: (a) angular errors; (b) positional errors.

A.2 Large solution space

Figure A.6: Base case with large solution space: (a) angular errors; (b) positional errors.
Figure A.7: Roll case with large solution space: (a) angular errors; (b) positional errors.
Figure A.8: Pitch case with large solution space: (a) angular errors; (b) positional errors.
Figure A.9: Yaw case with large solution space: (a) angular errors; (b) positional errors.
Figure A.10: Combined case with large solution space: (a) angular errors; (b) positional errors.