Monocular Simultaneous Localisation and Mapping for Road Vehicles
Master's thesis in Systems, Control and Mechatronics
MATHIAS ERNST, SAMUEL SCHEIDEGGER
Department of Signals and Systems
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2015

Master's thesis EX027/2015

© MATHIAS ERNST, SAMUEL SCHEIDEGGER, 2015.
Supervisor: Erik Stenborg, Volvo Car Corporation
Examiner: Lennart Svensson, Department of Signals and Systems
Department of Signals and Systems
Signal processing and Biomedical engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Illustration of feature points tracked over a sequence of images.
Typeset in LaTeX
Printed by [Name of printing company]
Gothenburg, Sweden 2015

Abstract

Autonomous cars hold great promise for the future of transportation, enabling more efficient driving and safer vehicles for both the occupants and other road users. A key aspect of achieving autonomous driving is accurate positioning of the vehicle. One way of meeting this goal is to build a map and simultaneously perform localisation within that map. This problem is known as the Simultaneous Localisation and Mapping (SLAM) problem. To avoid each vehicle having to build its own map, one can also localise the vehicle's position within a previously made map, and such a map could be built using SLAM methods.

This thesis explores a few different approaches to solving the SLAM problem and describes the implementation of a graph-based SLAM system using monocular camera data. The system that has been developed uses feature tracking, epipolar geometry and local bundle adjustment for visual odometry, relying on velocity and time measurements to recover the information lost in the scale ambiguity of monocular image data. Topological loop closures are found using the FAB-MAP 2.0 algorithm. The developed system uses the OpenCV library for feature detection and description, the g2o and CHOLMOD libraries for graph representation and bundle adjustment, and the OpenFABMAP library for running the FAB-MAP 2.0 algorithm. The algorithm is tested on the public KITTI datasets and is shown to successfully build consistent maps and localise within them.

Keywords: Visual SLAM, Monocular, Visual Odometry, Bundle Adjustment.

Acknowledgements

We would like to thank our supervisors, Erik Stenborg and Anders Karlsson, and our examiner, Lennart Svensson, for their support during our work. We would also like to thank the authors of our sister project, Robin Lindholm and Carl-Johan Pålsson, for the interesting discussions, and the other participants in our study group, Lars Hammarstrand, Nicklas Gustafsson, Maryam Fatemi and Malin Lundgren, for their useful input. We would also like to thank Viktor Lindström for help debugging our code and Mikael Persson for his expert help on the subject. Finally, we would like to thank Volvo Car Corporation for the opportunity to perform this work.
Mathias Ernst, Samuel Scheidegger, Gothenburg, June 2015 vii Contents List of Figures xi List of Tables xiii List of Acronyms xv 1 Introduction 1 1.1 Problem background . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Theory 5 2.1 Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Camera model . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Camera calibration . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Feature based methods . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 Direct methods . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Simultaneous Localisation And Mapping . . . . . . . . . . . . . . . . 24 2.3.1 Loop closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 EKF SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.3 FastSLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.4 Graph-based SLAM . . . . . . . . . . . . . . . . . . . . . . . . 31 3 Implementation 35 3.1 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.1.1 Features and tracking . . . . . . . . . . . . . . . . . . . . . . . 35 3.1.2 Bundle Adjustment . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 Simultaneous Localisation And Mapping . . . . . . . . . . . . . . . . 37 3.2.1 FAB-MAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.2 Geometric loop closure . . . . . . . . . . . . . . . . . . . . . . 38 4 Results 41 4.1 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 ix Contents 5 Discussion 49 5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Bibliography 53 A Parameter List I x List of Figures 2.1 Basic pinhole camera model. Shows how the point, X, in space is projected onto the image plane, the principal point, p, and the camera centre, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 The principal point with offset from the image coordinate origin. . . . 7 2.3 Sketch of how a heavily distorted image of a perfect square would be corrected. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4 The projected parallel lines on a plane converges to the vanishing points, v1 and v2, and constructs the vanishing line, l. . . . . . . . . 10 2.5 The left figure is showing the two camera view centres, C and C′, the world point, X, the projected image points, x and x′, and the epipolar plane π. The right figure shows the epipolar points, e and e′, the line of the second view, l′, and how all world points in the direction of X must be projected onto this line. . . . . . . . . . . . . 12 2.6 Illustration of how the four different choices of camera matrices for a given essential matrix can be interpreted. . . . . . . . . . . . . . . . . 16 2.7 Illustration of how RANSAC would work to estimate a line. 
The figures shows two randomly selected points in blue, rejected outliers in red and the inliers in green for four different hypothesises. Here the inliers of the hypothesis in the bottom right image will be selected as the final inliers as they are more than in the other hypothesises. . . 20 2.8 A plot of the number of required iterations of RANSAC for a 99% probability of success at different percentage of outliers with different number of minimal points. . . . . . . . . . . . . . . . . . . . . . . . . 21 2.9 Illustrations of how a circular motion can be used describe the motion of a vehicle using the Ackermann steering principle. . . . . . . . . . . 22 2.10 Basic structure of the graph in graph-based SLAM. White circles are pose nodes, black squares are landmarks and solid and dashed edges are pose-to-pose and pose-to-landmark constraints respectively. At 20 and 21 loop closures are detected when old landmarks 5 and 7 are observed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.11 An illustration of the sparseness and symmetry of the information matrix, H, corresponding to the graph in figure 2.10. Only the shaded elements are non-zero, black elements are pose nodes, dark grey corresponds to landmarks and middle grey the edges between these. Loop closures manifest as off-diagonal elements at the sym- metric pairs 〈(20, 5), (5, 20)〉 and 〈(21, 7), (7, 21)〉. . . . . . . . . . . . 33 xi List of Figures 2.12 Comparison of the squared error function and the Huber error. . . . . 34 4.1 Plots of evaluation data for dataset 1 using VO. Average translation error: 1.22 %. Average rotation error: 0.00550 deg/m. . . . . . . . . . 42 4.2 Plots of evaluation data for dataset 2 using VO. Average translation error: 0.995 %. Average rotation error: 0.00415 deg/m. . . . . . . . . 43 4.3 Plots of evaluation data for dataset 3 using VO. Average translation error: 2.03 %. Average rotation error: 0.00709 deg/m. . . . . . . . . . 44 4.4 Plots of evaluation data for dataset 4 using VO. Average translation error: 1.82 %. Average rotation error: 0.00250 deg/m. . . . . . . . . . 45 4.5 Execution time for every 25th step of the different parts of the VO for dataset 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.6 Plots of evaluation data for dataset 1 using the full SLAM solu- tion. Average translation error: 1.21 %. Average rotation error: 0.00558 deg/m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.7 Plots of evaluation data for dataset 2 using the full SLAM solu- tion. Average translation error: 1.00 %. Average rotation error: 0.00405 deg/m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.8 Execution time for every 25th step of the different parts of the SLAM algorithm for dataset 1. . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.9 Plot of aligned trajectory and distance to ground truth for dataset 1. The dotted blue line in the left figure is the unaligned estimate. The RMS distance improved from 7.214 m to 2.435 m by rotating and 2.030 m with 6 DoF alignment . . . . . . . . . . . . . . . . . . . . . . 48 4.10 Plot of aligned trajectory and distance to ground truth for dataset 2. The dotted blue line in the left figure is the unaligned estimate. The RMS distance improved from 9.905 m to 4.044 m by rotating and 3.731 m with 6 DoF alignment . . . . . . . . . . . . . . . . . . . . . . 48 5.1 A frame from dataset 3, with the tracked feature points, which are approximately coplanar, shown in green. . . . . . . . . . . . . 
. . . . 50 xii List of Tables 2.1 Properties and performance of feature detectors [32]. . . . . . . . . . 18 4.1 Properties of the datasets used for evaluation of the implemented algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Number of detected loop closures before and after the two firewall conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Results of the evaluation of the datasets. . . . . . . . . . . . . . . . . 46 A.1 List of user tunable parameters and the values used in the result section. I xiii List of Tables xiv List of Acronyms BA Bundle Adjustment. BoW Bag-Of-Words. BRIEF Binary Robust Independent Elementary Features. BRISK Binary Robust Invariant Scalable Keypoints. CCD Charge-Coupled Device. CENSURE Center Surround Extrema. CMOS Complementary Metal Oxide Semiconductor. DoF Degrees Of Freedom. DoG Difference-Of-Gaussian. EIF Extended Information Filter. EKF Extended Kalman Filter. FAB-MAP Fast Appearance-Based Mapping. FAST Features from Accelerated Segment Test. GNSS Global Navigation Satellite System. GPS Global Positioning System. GPU Graphics Processing Unit. IAC Image of Absolute Conic. ICR Instantaneous Centre of Rotation. IF Information Filter. IMU Inertial Measurement Unit. KLT Kanade–Lucas–Tomasi. xv List of Acronyms LSD-SLAM Large Scale Direct SLAM. ORB Oriented FAST and Rotated BRIEF. PTAM Parallel Tracking And Mapping. RANSAC Random Sample Consensus. RMS Root Mean Square. SeqSLAM Sequence SLAM. SIFT Scale Invariant Feature Transform. SLAM Simultaneous Localisation And Mapping. SURF Speeded Up Robust Features. SVD Singular Value Decomposition. UKF Unscented Kalman Filter. V-SLAM Visual Simultaneous Localisation And Mapping. VCC Volvo Car Corporation. VO Visual Odometry. xvi 1 Introduction In 2011, an estimated 1.3 million people where killed in road traffic accidents; by 2020 this is expected to grow to 1.9 million people, annually [1]. In 2015, a study by the U.S. Department of Transportation showed that 94% of the road traffic accidents in the U.S. were caused by drivers [2]. The number of road traffic accidents could be reduced by identifying when a driver is about to make a mistake that will cause an accident, and take action to prevent the accident. Volvo Car Corporation (VCC) has such a system in production today, in form of the City Safety system, which intends to autonomously brake the car to avoid collisions at low speeds. This concept could be expanded, to further reduce accidents, with other systems, e.g. lane keeping or even fully autonomous vehicles. Considering the vast number of accidents caused by human error, it is clear that autonomous vehicles that are less prone to driver errors has the potential to drastically reduce the number of accidents. Furthermore, autonomous vehicles has the potential to free up billions of hours for people currently spent acting as the human control system of the vehicles The ever-increasing vehicle ownership and energy demand from the transport sector pose a concern for the environment. The worldwide transport sector, in 2010, stood for 28% of the total energy consumption and caused 23% of total energy-related CO2 emissions with a majority, 72%, coming from road vehicles [3, pp. 603]. Efforts have been made to develop more energy efficient engines, and to use alternative energy sources. For example, hybrid vehicles, using both internal combustion engines and electric motors, and purely electric vehicles, have been developed. 
No matter which propulsion system is used, further gains in energy efficiency can be made by adopting efficient driver behaviour. In a fully autonomous vehicle this can be done automatically. The energy consumption can be minimised by route planning, platooning, better control of the power components, and regenerative drive system [4, pp. 5] to a degree not achievable by a human driver. Autonomous vehicles is a hot research topic worldwide, as they can make traffic safer and more efficient in terms of both energy use and traffic flow and several experimental platforms have been tested. Google has been testing self driving cars since 2009, using expensive sensors such as LIDARs, on public roads [5]. Initially the test were limited to freeways, but later also included more complex city streets. In 2014 Daimler successfully tested a car on a route with a variety of difficult traf- fic situations, using close-to-production sensor hardware. VCC plans to put 100 cars on the roads of Gothenburg by 2017, that will be self-driving on a number of predetermined roads. 1 1. Introduction 1.1 Problem background To enable autonomous systems, sensor data input is required. For lane keeping, a system which estimates the road lane and the vehicle’s lateral position in the lane is necessary. For a fully autonomous vehicle, a map of the surroundings and other traffic is required [7]. The limited accuracy of Global Navigation Satellite Systems (GNSSes), such as the Global Positioning System (GPS), and the poor performance or non-existent signal in some environments, such as urban canyons or tunnels, make them insufficient for such applications [8]. Therefore, the use of other types of sensors is required to replace or complement GNSSes. Such other sensors could for example be wheel odometry, steering angle sensors, range sensors, such as LIDAR or radar detectors, or cameras. A problem in common for using these types of sensors to estimate position, is that the estimates are incremental and noisy, and thus subject to a drift. One consequence of the drift when building a map is that the representation of the location in the map depends on the measurement noise and the path taken to that location, causing inconsistent maps. To reduce the drift, information from a GPS, which is absolute in the global frame and thus does not suffer from drift, could be fused with sensors that are more accurate in the local frame. Another technique to reduce drift and enforce consistency is to adjust the estimated trajectory when visiting a location with previously known information about the position. To accomplish this, a map in which it is possible to find the ego-pose has to be built, and to build a map, a known position is required. This “chicken or the egg” problem is often referred to as the Simultaneous Localisation And Mapping (SLAM) problem [9, pp. 222]. 1.2 Purpose The purpose of this work is to evaluate how digital cameras mounted on a vehicle can be used to estimate the ego-motion of the vehicle and build a map, in which it is possible to localise the ego pose. This concept is commonly referred to as the Visual Simultaneous Localisation And Mapping (V-SLAM) problem. A review of the current literature on the subject should be done, and the most common and most promising methods to solve the SLAM problem should be in- vestigated. A solution to the V-SLAM problem should be proposed and evaluated. 
Further improvements to the system, unresolved issues, and areas and techniques that would be useful to explore further should be pointed out for future work. An evaluation of how the V-SLAM techniques could be deployed for building a large scale map should be considered. Aspects such as the accuracy of the mapping and localisation, the robustness in various road environments, and feasibility should be evaluated. Lastly, avenues for further investigation in the area of V-SLAM should be identified. 2 1. Introduction 1.3 Delimitations In this project a full 6 Degrees Of Freedom (DoF) SLAM system was built using monochrome monocular image data from a calibrated camera. Vehicle speed mea- surements were used to resolve the scale ambiguity inherent in monocular camera data. he system does not handle dynamic scenes by classifying moving objects, such as other traffic, and taking this into consideration in the map building. Even though computational complexity of the different algorithms is considered, the system is not intended to run in real-time. The implementation takes unrectified images and speed measurements as inputs, and produces an estimate of the trajectory and a map in the form of a point cloud of estimated feature positions. For loop closure detections, training data in the form of a set of images from environments similar to the ones to be driven in is also needed. 1.4 Methodology The thesis work began with a study of the SLAM problem based on the lectures and exercises from the course Robot Mapping1 held by Cyril Stachniss at Freiburg University. A further literature study regarding V-SLAM in particular was done, covering such topics as camera calibration, projective geometry, image features, Visual Odometry (VO), and appearance based loop closures. The first implementations of the VO system were made in Matlab, but the cho- sen solution was later reimplemented in C++ to be able to utilise certain third party libraries. The performance of the system was evaluated on datasets from the KITTI Vision Benchmark Suite2 [10]. The KITTI datasets are logs from sensors mounted on a car when different paths where driven. The sensors include 4 forward looking cameras, 1 64 layer LIDAR, and a highly accurate combined Inertial Mea- surement Unit (IMU) and GNSS. Evaluation was performed in the same manner as in the KITTI Benchmark, to enable comparison with other solutions, i.e. checking the average accumulated deviation from ground truth in subsequences of certain lengths. 1.5 Related work VO, or estimation of the ego-motion from consecutive camera images has previously been studied in the literature, and several different solutions have been proposed. One solution, proposed by Nistér et al. in 2004, uses a stereo camera and feature tracking, and achieves successful results from aerial, automotive and handheld plat- forms [11]. A lane-based solution, LaneLoc, purely for road vehicles, using pre-built maps in which on-line localisation is performed, was initially proposed by Schreiber et al. in 2013 [12] and was later used by Ziegler et al. in 2014 in Daimler’s Bertha 1http://ais.informatik.uni-freiburg.de/teaching/ws13/mapping/ 2http://www.cvlibs.net/datasets/kitti/ 3 http://ais.informatik.uni-freiburg.de/teaching/ws13/mapping/ http://www.cvlibs.net/datasets/kitti/ 1. Introduction project [6]. In 2013 Engel et al. proposed a semi-dense direct method, using image gradients, achieving real-time performance on a modern smartphone [13]. Different solutions to the V-SLAM problem has been proposed. 
In 2003 Davison used an Extended Kalman Filter (EKF) and feature points for mapping and pose estimation. The EKF based solutions were outperformed by a particle filter based solution by Montemerlo et al. in 2002 [15], FastSLAM, which was then improved, and named FastSLAM 2.0, by Montemerlo et al. in 2003 [16]. A solution inspired by the hippocampus of rodents, RatSLAM, was proposed by Milford et al. in 2004 [17]. In 2007, Klein and Murray proposed a solution, Parallel Tracking And Mapping (PTAM), which broke the barrier to real-time direct augmented reality with inexpensive cameras. A large scale real-time V-SLAM solution was proposed by Lim et al. in 2014 [19] using feature point matching and graph based SLAM. Solutions for pure loop closure detection have been proposed. Cummins and New- man in 2008 proposed Fast Appearance-Based Mapping (FAB-MAP) [20], which was improved in 2011 by Cummins and Newman to FAB-MAP 2.0 [21]. Both ver- sions detects loop closures by exploiting highly distinctive feature points. Milford and Wyeth in 2012 proposed Sequence SLAM (SeqSLAM), which uses sequences instead of single images. These loop closure detection algorithms have been used by various implementations for VO, to achieve a full V-SLAM solution, e.g. the semi-dense direct VO solution, by Engel et al. in 2013, was improved by Engel et al. in 2014 to Large Scale Direct SLAM (LSD-SLAM), a full V-SLAM solution by using FAB-MAP 2.0 for loop closure detection in real-time. 4 2 Theory In this chapter, the theory on which the implemented system relies is presented. Starting with the theory on how the camera maps the environment to an image, continuing with how to incrementally estimate the ego-motion from the images us- ing visual odometry, and ending with how SLAM algorithms can be used to localise and build large-scale, consistent maps of the environment. While the final imple- mentation is a graph-based SLAM system, EKF SLAM and the particle filter based FastSLAM algorithm were considered during the study of the SLAM problem and are also presented here. 2.1 Camera A camera maps reflected light from 3D objects in space onto a 2D image [24, pp. 153]. Typically, the light reflected by the environment is projected, using optical lenses, onto a Complementary Metal Oxide Semiconductor (CMOS) or Charge- Coupled Device (CCD) sensor, which both use the photoelectric effect [25] and analogue to digital converters to digitise the image. To use the images to build a map of the environment, the way the camera projects the environment to the image must be known. The maps considered here are locations of points on the surface of objects in the world. A mathematical camera model is used to represent the camera projection. 2.1.1 Camera model The basic pinhole camera model projects a point in space onto an image point on a plane, called the image plane or the focal plane. Let the centre of projection, the camera centre, C, be the origin of a Euclidean coordinate system. A point in space, X = (X, Y, Z)T , is projected to a point, x = (x, y)T , on the image plane, Z = f , where the line connecting the centre of projection and the point in space intersects the image plane, as shown in figure 2.1. It can be seen that the point in space, X, in R3 is mapped to the image point, x, in R2, as follows [24, pp. 154]:( X, Y, Z )T 7→ ( f ·X/Z, f · Y/Z )T . (2.1) The line which is perpendicular to the image plane and meets the camera centre, is called the principal axis. 
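As a concrete illustration of the mapping in (2.1), the short C++ sketch below projects a point given in the camera coordinate system onto the image plane; the focal length and the point are arbitrary example values, and the function name is ours. The full model, including the calibration matrix and the camera pose, is developed in the remainder of this section.

```cpp
#include <array>
#include <cstdio>

// Basic pinhole projection of (2.1): (X, Y, Z) -> (f*X/Z, f*Y/Z).
// The point is expressed in the camera coordinate system, with Z > 0 in
// front of the camera. The result is in the same units as f; pixel
// coordinates require the calibration matrix K, introduced below.
std::array<double, 2> projectPinhole(double f, double X, double Y, double Z) {
    return { f * X / Z, f * Y / Z };
}

int main() {
    const double f = 0.008;                      // example focal length, 8 mm
    auto x = projectPinhole(f, 1.0, 0.5, 10.0);  // a point 10 m in front of the camera
    std::printf("image point: (%g, %g)\n", x[0], x[1]);
    return 0;
}
```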
The point where the principal axis meets the image plane is called the principal point, p, or the optical centre. 5 2. Theory Y X xC x X Z y p Principal axis Image plane Camera centre Y C Zp f f · Y/Z Figure 2.1: Basic pinhole camera model. Shows how the point, X, in space is projected onto the image plane, the principal point, p, and the camera centre, C. If the world point and image point are represented as homogeneous coordinates, the mapping can be written as a matrix multiplication: X Y Z 1  7→ f ·Xf · Y Z  = f 0 f 0 1 0   X Y Z 1  . (2.2) Hereafter X will represent the world point as a homogeneous 4-vector (X, Y, Z, 1)T and x will represent the image point as a homogeneous 3-vector (x, y, 1)T , then (2.2) can be written as x = PXcam, (2.3) where P is the 3× 4 homogeneous camera projection matrix and Xcam is a point in space in the coordinate system with the origin in the camera centre. In (2.2) it is assumed that the principal point is in the origin of the image plane coordinate system, which it may not be, so in general the mapping is [24, pp. 155] ( X, Y, Z )T 7→ ( f ·X/Z + px, f · Y/Z + py )T = ( xcam, ycam )T , (2.4) where xcam and ycam are the image pixel coordinates with the principal point at p = (px, py)T , as in figure 2.2. This can be represented as homogeneous coordinates:  X Y Z 1  7→ f ·X + Z · px f · Y + Z · py Z  = f px 0 f py 0 1 0   X Y Z 1  . (2.5) By expressing K as K = f py f py 1  , (2.6) (2.5) can be written as x = K [ I 0 ] X, (2.7) 6 2. Theory x y p xcam ycam x0 y0 Figure 2.2: The principal point with offset from the image coordinate origin. where K is called the camera calibration matrix and [I|0] denotes I and 0 stacked horizontally. This model assumes square pixels. In the case of a camera, with a CCD or CMOS sensor, this is not a necessity. To compensate for this, unequal scale factors can be introduced, and the camera calibration matrix, K, can be generalised as K = mx my 1  f px f py 1  = αx x0 αy y0 1  . (2.8) For added generality, a skew parameter, s, can be added: K = αx s x0 αy y0 1  . (2.9) For normal cameras, s = 0. If s 6= 0, this can be interpreted as a skewing of the pixels in a sensor array, as if the x- and y-axes of the pixels were not perpendicular. This can occur if the image is a photography of an image itself, where the principal axis is not perpendicular to the plane of the image [24, pp. 164]. In general, the world points will be expressed in a world coordinate system, which usually will not coincide with the camera coordinate system. The relation between the two coordinate systems can be expressed as a rotation and a translation. If X̃ is an inhomogeneous 3-vector representing a point in space, expressed in world coordinates, the coordinates of the point in the camera coordinate system can be written as X̃cam = R(X̃ − C̃) [24, pp. 156], where R is a 3 × 3 rotational matrix representing the orientation of the camera coordinate system and C̃ the coordinates of the camera origin expressed in world coordinates. In homogeneous coordinates, this can be expressed as Xcam = [ R −RC̃ 0 1 ] X, (2.10) which, with (2.7), gives x = KR [ I −C̃ ] X, (2.11) 7 2. Theory and the camera matrix, P : P = KR [ I −C̃ ] . (2.12) For convenience, the camera centre is usually not expressed in world coordinates, but with t = −RC̃, which gives the camera matrix P = K [ R t ] . 
(2.13) K is often referred to as the internal camera parameters, or intrinsic matrix, and [R|t] as the external camera parameters, or the extrinsic matrix. Finding K is seen as a part of the calibration of the camera. If K has the form as in (2.9), the model is called finite projective camera and has 11 degrees of freedom [24, pp. 175]. This linear model will only hold for cameras with pinhole lenses. For cameras with optical lenses, i.e. most cameras, the world point will not be collinear with the projected point and the camera centre [24, pp. 189]. The most significant error is a radial distortion caused by the lens, as illustrated in figure 2.3, which becomes more severe with shorter focal length and lesser optical quality. To still be able to use a linear camera model, the image can be corrected to what it would have been with a perfect linear camera. The correction for radial distortion has to be applied as a first step in the image processing [24, pp. 190]. A world point in camera coordinates, Xcam, gets linearly projected onto an image plane at (x̃, ỹ)T in focal length units as ( x̃, ỹ, 1 )T = [ I 0 ] Xcam. (2.14) Correction Figure 2.3: Sketch of how a heavily distorted image of a perfect square would be cor- rected. The actual projected point (xd, yd)T can be related to the ideal point as( xd yd ) = L(r̃) ( x̃ ỹ ) , (2.15) where r̃ = √ x̃2 + ỹ2 is the radial distance from the centre of radial distortion and L(r̃) is a distortion factor. This can be written in pixel coordinates as x̂ = xc + L(r)(x− xc) ŷ = yc + L(r)(y − yc), (2.16) where (x, y)T are the measured coordinates, (x̂, ŷ)T are the corrected coordinates, (xc, yc)T the centre of radial distortion and r = √ (x− xc)2 + (y − yc)2 the radial 8 2. Theory distance from the centre of radial distortion. L(r) is defined only for positive values and L(0) = 1. L(r) is usually approximated by a Taylor expansion L(r) = 1 +κ1r+ κ2r 2 +κ3r 3 + .... Determining the coefficients, κn and the centre of radial distortion, (xc, yc)T , is usually considered as part of the calibration of the camera. 2.1.2 Camera calibration A camera is calibrated when the image can be corrected to linear projection and K is known. In this case, the projection of a world point on the image plane can be related to the direction to the world point from the camera centre, and can be written X̃ = λd. This would then map to the image point as x = K[I|0](λdT , 1)T = Kd, up to a scale, and the direction of an image point would thus be d [24, pp. 208]. The angle between two image points can, according to the cosine formula for the angle between two vectors, be written as cos θ = dT1 d2√ dT1 d1 √ dT2 d2 = (K−1x1)T (K−1x2)√ (K−1x1)T (K−1x1) √ (K−1x2)T (K−1x2) = xT1 (K−TK−1)x2√ xT1 (K−TK−1)x1 √ xT2 (K−TK−1)x2 . (2.17) This shows that the angle between the direction of two points can be measured if K is known. It can also be shown that a scene plane defined by an image line, l and the camera centre, with the normal, n, can be related by K, such that n = lK [24, pp. 209]. To find the parameters of the intrinsic matrix, K, the geometry of points projected from infinitely far away is used. Points on the plane at infinity, π∞, can be written as X∞ = (dT , 0)T , and are projected onto an image plane with H = KR as x = KR [ I −C̃ ] (d 0 ) = KRd = Hd. (2.18) This projection is independent of the camera position, C, and only depends on the internal parameters of the camera and rotation, and thus the calibration can be made independent of the position of the camera. 
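The calibration derivation continues below via the image of the absolute conic; before that, the following sketch summarises the chain from world point to pixel by combining the finite projective camera P = K[R|t] of (2.12) with the radial distortion model of (2.15). It is only an illustration under assumed example values for K, the distortion coefficients and the pose; Eigen is used for the linear algebra, and the function name is ours. In practice it is the inverse operation, correcting measured pixels as in (2.16), that is applied to the images.

```cpp
#include <Eigen/Dense>
#include <iostream>

// Sketch: world point -> camera coordinates via [R|t] -> normalised image
// coordinates -> radial distortion L(r) = 1 + k1*r + k2*r^2 + k3*r^3 as
// in (2.15) -> pixel coordinates via K. All numeric values are made-up
// examples, not calibrated values.
Eigen::Vector2d projectWithDistortion(const Eigen::Matrix3d& K,
                                      const Eigen::Matrix3d& R,
                                      const Eigen::Vector3d& t,
                                      const Eigen::Vector3d& kappa,
                                      const Eigen::Vector3d& X) {
    Eigen::Vector3d Xcam = R * X + t;          // world -> camera coordinates
    Eigen::Vector2d xn = Xcam.hnormalized();   // ideal normalised coordinates
    double r = xn.norm();                      // radial distance
    double L = 1.0 + kappa(0) * r + kappa(1) * r * r + kappa(2) * r * r * r;
    Eigen::Vector3d xd(L * xn(0), L * xn(1), 1.0);  // distorted coordinates
    return (K * xd).hnormalized();             // pixel coordinates
}

int main() {
    Eigen::Matrix3d K;
    K << 720,   0, 620,    // alpha_x, s = 0, x0
           0, 720, 180,    // alpha_y, y0
           0,   0,   1;
    Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
    Eigen::Vector3d t(0.0, 0.0, 0.0);
    Eigen::Vector3d kappa(0.0, -0.2, 0.0);     // example distortion coefficients
    Eigen::Vector3d X(2.0, -1.0, 15.0);        // world point in metres
    std::cout << projectWithDistortion(K, R, t, kappa, X).transpose() << "\n";
    return 0;
}
```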
A conic is a curve described by a second degree equation in the plane [24, pp. 30]: ax2 + bxy + cy2 + dx+ ey + f = 0, (2.19) or in homogeneous coordinates ax2 1 + bx1x2 + cx2 2 + dx1x3 + ex2x3 + fx2 3 = 0. (2.20) This can be written in matrix form as xTCx = 0, (2.21) 9 2. Theory where C =  a b/2 d/2 b/2 c e/2 d/2 e/2 f  . (2.22) Under the point transformation (2.18), a conic is transformed as C ′ = H−TCH−1, since xTCx = dTH−TCH−1d = dTC ′d. Points on the absolute conic, Ω∞, a conic on a plane at infinity, π∞, satisfies [24, pp. 81] x2 1 + x2 2 + x2 3 x4 } = 0, (2.23) which gives ( x1, x2, x3 ) I ( x1, x2, x3 )T = 0. (2.24) Thus, the C that describes Ω∞ has to be the identity matrix and Ω∞ is also a conic of purely imaginary points. The conic, Ω∞, maps to ω = H−TCH−1 = H−T IH−1 = (KR)−T )I(KR) = (KKT )−1. Thus, if ω is determined, K is also determined. ω is called the Image of Absolute Conic (IAC). It can be shown that all circles intersect Ω∞, and that the support plane for a circle, π, intersect π∞ in a line and this line intersects Ω∞ in two points [24, pp. 28]. These two points are the circular points of π. The projected circular points lie on ω, and are also vanishing points of π. The vanishing line of a plane π is the line where π intersects π∞. This line is the same for all parallel planes [24, pp. 82]. The line can be constructed by two vanishing points of a plane, the points where two parallel, projected, lines on the plane converge, as in figure 2.4. l v1 v2 Figure 2.4: The projected parallel lines on a plane converges to the vanishing points, v1 and v2, and constructs the vanishing line, l. Using the result from (2.17) gives cos θ = vT1ωv2√ vT1ωv1 √ vT2ωv2 , (2.25) 10 2. Theory which gives that the vanishing points of two perpendicular directions satisfy [24, pp. 219] vT1ωv2 = 0. (2.26) Writing the IAC as ω = ω1 ω2 ω4 ω2 ω3 ω5 ω4 ω5 ω6  , (2.27) it can be shown that cameras with zero skew will put a constraint on ω such that ω2 = 0 (2.28) and a camera with square pixels in addition give a constraint such that ω1 = ω3. (2.29) Representing ω as a 6-vector w = (ω1, ω2, ω3, ω4, ω5, ω6)T , constraints in (2.26), (2.28) and (2.29) can be written in the form of aTw = 0. For (2.26), with v = (v1, v2, v3)T and u = (u1, u2, u3)T as vanishing points of two perpendicular directions, a can be written as [24, pp. 225] a = ( v1u1, v1u2 + v2u1, v2u2, v1u3 + v3u1, v2u3 + v3u2, v3u3 )T , (2.30) for (2.28), a can be written as a = ( 0, 1, 0, 0, 0, 0 )T (2.31) and for (2.29), a can be written as a = ( 1, 0, −1, 0, 0, 0 )T . (2.32) By stacking aT from n of these constraints in a matrix, A, of size n× 6 and solving Aw = 0 using Singular Value Decomposition (SVD) [24, pp. 593], gives a least- squares solution for ω. ω can be decomposed into the intrinsic camera matrix, K, using matrix inversion and Cholesky factorisation [24, pp. 582]. A minimum of five independent constraints are required, which can be achieved by using multiple constraints from (2.26). This calibration assumes that the scene is projected to the image plane linearly. If this is not the case, the image has to be corrected as in (2.16). To do this the coefficients of L(r) and the centre of radial distortion have to be determined. This is usually done by an iterative minimisation of a cost function on the projection of a known pattern, e.g. a chessboard pattern or a Tsai grid [24, pp. 192]. Complete algorithms are proposed by Heikkilä and Silvén in 1997 [26] and Zhang in 2000 [27]. 
There are different tools for camera calibration available. Some of the most popular ones [28] are the Camera Calibration Toolbox for Matlab1 and tools available in the OpenCV library [29]. 1http://www.vision.caltech.edu/bouguetj/calib_doc/ 11 http://www.vision.caltech.edu/bouguetj/calib_doc/ 2. Theory 2.2 Visual Odometry VO is the process of estimating the ego-motion based on consecutive camera images. The classical approach is to detect distinctive projected world points, the points between images, and use the pixel coordinates of the points to determine the ego- motion. These methods are usually called feature based methods. 2.2.1 Feature based methods To relate the pose of two camera views, epipolar geometry can be used. To denote camera matrices, projected points and other quantities related to the second view, ′ will be used. All cameras are assumed to project world points onto the image plane linearly. Epipolar geometry Epipolar geometry is independent of the scene structure, and only depends on the internal and external parameters of the two cameras, i.e. the camera matrix from (2.12). A world point, X, is projected onto the image plane of two views at x and x′. The two camera centres, C and C′, the world point and the projected points will be coplanar [24, pp. 239], as can be seen in figure 2.5. Call this plane, π, the epipolar plane. Knowing that the projected points have to be on the epipolar plane, the search for a matching point for x in the second view, C′, is limited to the line where the image plane intersects the epipolar plane, the epipolar line, as is shown in figure 2.5. The epipoles, e and e′, are the points where the baseline, the line between the two camera centres, intersects each image plane. C C′ X x x′ π Baseline X x X? X? e e′ l′ Figure 2.5: The left figure is showing the two camera view centres, C and C′, the world point, X, the projected image points, x and x′, and the epipolar plane π. The right figure shows the epipolar points, e and e′, the line of the second view, l′, and how all world points in the direction of X must be projected onto this line. The fundamental matrix, F , is a central part of epipolar geometry. The fundamental matrix relates the two views as x′TFx = 0, where x and x′ is the projection of a world point, X, in the first and second view, respectively. The fundamental matrix 12 2. Theory is a 3× 3 matrix of rank 2 [24, pp. 239]. From the fundamental matrix the relative pose between two views can be calculated, up to a scale. The fundamental matrix is independent of scene structure, but can be calculated from scene points [24, pp. 239]. In a pair of images, any world point, X, projected onto one of the images, x, the matching point, x′, on the other image has to be on the corresponding epipolar line, l′. Thus there is a mapping x 7→ l′ [24, pp. 242]. According to (2.7) the world point, X, is projected to an image point, x, as x = PX. The ray, back-projected from x to X, can be written as [24, pp. 162] X(λ) = P+x + λC, (2.33) where P+ the pseudo-inverse of P , C is the null-vector of P and also the camera centre, defined by PC = 0, and λ is a parametrisation of the ray. The camera centre is back projected onto the line at C, when λ → ∞, and as the world point, P+x, when λ = 0. These two points are projected in the second view as P ′C and P ′P+x. These two points lie on the epipolar line of the second view, l′, thus l′ = (P ′C)× (P ′P+x). 
As the point P ′C is the epipole of the second view, e′, this can be written as l′ = [e′]×(P ′P+)x = Fx, which gives F = [e′]×(P ′P+), (2.34) and the map x 7→ l′ l′ = Fx. (2.35) [e′]× denotes the skew-symmetric of e′. The skew-symmetric matrix of a vector a = (a1, a2, a3)T is defined as [24, pp. 581] [a]× =  0 −a3 a2 a3 0 −a1 −a2 a1 0  , (2.36) which relates to the cross product as a × b = [a]×b. (2.37) To find a method for calculating F from image coordinates alone, consider two cameras in the coordinate system of the first camera, P = K[I|0] and P ′ = K ′[R|t], then [24, pp. 244] P+ = [ K−1 0T ] C = ( 0 1 ) (2.38) and, as the epipole of the second camera, e′, is the projection of the first camera centre, P ′C, F = [P ′C]×P ′P+ = [K ′t]×K ′RK−1 = K ′−T [t]×RK−1 = K ′−TR[RT t]×K−1 = K ′−TRKT [KRT t]×. (2.39) 13 2. Theory Then, as [24, pp. 244] e = P ( −RT t 1 ) = KRT t and e′ = P ′ ( 0 1 ) = K ′t, (2.40) F = [e′]×K ′RK−1 = K ′−T [t]×RK−1 = K ′−TR[RT t]×K−1 = K ′−TRKT [e]×. (2.41) If x and x′ correspond, x′ will lie on the epipolar line l′ = Fx, thus x′T l′ = 0 [24, pp. 245], which gives x′TFx = 0. (2.42) Writing x = (x, y, 1)T and x′ = (x′, y′, 1)T and denoting fnm as the elements of F , (2.42) gives rise to the linear equation x′xf11 + x′yf12 + x′f13 + y′xf21 + y′yf22 + y′f23 + xf31 + yf32 + f33 = 0, (2.43) which can be expressed by a vector multiplication as( x′x, x′y, x′, y′x, y′y, y′, x, y, 1 ) f = 0, (2.44) where f is a 9-vector of the elements in F in row-major order. From a set of n points, a set of linear equations can be written as Af =  x′1x1 x′1y1 x′1 y′1x1 y′1y1 y′1 x1 y1 1 ... ... ... ... ... ... ... ... ... x′nxn x′nyn x′n y′nxn y′nyn y′n xn yn 1  f = 0. (2.45) For an exact solution to exist for this, A has to be of rank 8, thus a minimum of 8 points are required to solve this equation. Since there is noise in the image coordinates from quantisation errors and other sources, more points can be used to find a better estimate of f . In this case A may be of rank 9. A solution can be found by SVD, A = UDV T , where the last column of V will be the singular vector, corresponding to the smallest singular value of A, and also the least-squares solution of f , which minimises ‖Af‖, subject to ‖f‖ = 1 [24, pp. 280]. The solution for f can then be decomposed to F . The fundamental matrix is a singular matrix of rank 2, and many applications of it relies on that fact [24, pp. 280]. Solving the linear equations of from (2.45) does not guarantee this, and steps to enforce this should be performed. The fundamental matrix obtained from the linear equations can be modified to satisfy this by replacing F with F ′ that minimises the Frobenius norm ‖F − F ′‖ subject to detF ′ = 0. This can be obtained by the SVD of F = UDV T , where D is a 3 × 3 diagonal matrix with the singular values of F in descending order on the diagonal. Replacing the smallest singular value in D with 0, calling it D′, and constructing F ′ = UD′V T , then F ′ will minimise the Frobenius norm ‖F − F ′‖ [24, pp. 281]. Image coordinates can come in different formats, some define the upper left corner of the image as the origin and others define the centre of the image as the origin. It can 14 2. Theory be shown that this, as well as the magnitude of the pixel coordinates, influence the result when retrieving the fundamental matrix using the 8-point algorithm described above [24, pp. 281]. 
This is due to the difference in magnitude of the elements of A which are two elements multiplied, like x′x and x′y, compared to single elements, like x and y. The condition number of A will be higher when the magnitude of the elements differs more, which will lead to bigger errors when enforcing rank 2 on the fundamental matrix [30]. Therefore it is considered an essential step to normalise the image points [24, pp. 281]. A suggested normalisation x̂i = Txi and x̂′i = T ′x′i is to construct the transformation T and T ′ such that it will place the points with the centroid at the origin and with the Root Mean Square (RMS) distance equal to √ 2 [24, pp. 282]. Then perform the steps of the 8-point algorithm with the normalised points and retrieve the normalised fundamental matrix F̂ ′. Finally retrieve the fundamental matrix corresponding to the original data as F = T ′T F̂ ′T [24, pp. 282]. This is called the normalised 8-point algorithm for F . A specialisation of the fundamental matrix is the essential matrix, E. The funda- mental matrix can be seen as a generalisation of the essential matrix, where the assumption of a calibrated camera is removed. Consider a camera matrix P = K[R|t], and a world point projected to an image point as x = PX. If the intrinsic matrix, K, is known, its inverse can be applied to the image point which gives x̂ = K−1x. This gives x̂ = [R|t]X, where x̂ is the image point in normalised coordinates [24, pp. 257]. The camera matrix K−1P = [R|t] is called the normalised camera matrix [24, pp. 257]. A pair of normalised camera matrices would then, according to (2.39), result in an essential matrix as E = [t]×R = R[RT t]×. (2.46) The essential matrix is defined as x̂′TEx̂ = 0. (2.47) Substituting x̂ and x̂′ with the corresponding points x↔ x′ gives x′TK ′−TEK−1x = 0, which compared to x′TFx = 0 gives E = K ′TFK. (2.48) Both the translation, t, and the rotation, R, in the essential matrix E = [t]×R has 3 DoF, however, the essential matrix has only five due to a scale ambiguity. Nistér in 2004 proposed an efficient solution to determine the essential matrix with five correspondences [31], which is the minimal set to fully solve the 5 DoF essential matrix. It can be shown that the essential matrix is a matrix where two of its singular values are equal and the third is equal to zero [24, pp. 258]. The two equal singular values can be chosen freely, due to the scale ambiguity, and are often set to 1. It can also be shown that for a given essential matrix, with the SVD E = Udiag(1, 1, 0)V T , 15 2. Theory and a first camera matrix P = [I|0], there are four possible solutions for the second camera matrix [24, pp. 259]: P ′1 = [ UWV T u3 ] P ′2 = [ UWV T −u3 ] P ′3 = [ UW TV T u3 ] P ′4 = [ UW TV T −u3 ] , (2.49) where W = 0 −1 0 1 0 0 0 0 1  . (2.50) Of the four different solutions, only one is correct. The different solutions can be interpreted as is shown in figure 2.6. To determine which of the solutions is correct, C C′ C C′ C C′ C C′ Figure 2.6: Illustration of how the four different choices of camera matrices for a given essential matrix can be interpreted. the respective translation and rotation can be used to triangulate the positions of the points used to determine the essential matrix. The solution which renders the points in front of both cameras is the correct one, top left in figure 2.6. A world point is projected through two cameras onto their image planes as x = PX and x′ = P ′X. These equations can be combined into a form AX = 0. 
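The linear triangulation system AX = 0 referred to here is derived in detail directly after this sketch. Assuming such a triangulation routine, the selection among the four candidate poses of (2.49) can be sketched as below. The code uses Eigen, assumes image points given in normalised coordinates x̂ = K⁻¹x, and the function names are ours; in practice, library routines such as OpenCV's cv::recoverPose perform the equivalent steps.

```cpp
#include <Eigen/Dense>
#include <utility>
#include <vector>

// Triangulation of a world point from two views using the linear system
// A X = 0 of (2.51)-(2.52), solved by SVD.
static Eigen::Vector3d triangulate(const Eigen::Matrix<double, 3, 4>& P,
                                   const Eigen::Matrix<double, 3, 4>& Pp,
                                   const Eigen::Vector2d& x,
                                   const Eigen::Vector2d& xp) {
    Eigen::Matrix4d A;
    A.row(0) = x(0)  * P.row(2)  - P.row(0);
    A.row(1) = x(1)  * P.row(2)  - P.row(1);
    A.row(2) = xp(0) * Pp.row(2) - Pp.row(0);
    A.row(3) = xp(1) * Pp.row(2) - Pp.row(1);
    Eigen::JacobiSVD<Eigen::Matrix4d> svd(A, Eigen::ComputeFullV);
    return Eigen::Vector4d(svd.matrixV().col(3)).hnormalized();
}

// Decompose E into the four (R, t) candidates of (2.49)-(2.50) and keep
// the one that places the most triangulated points in front of both
// cameras (the cheirality test).
std::pair<Eigen::Matrix3d, Eigen::Vector3d>
recoverPoseFromE(const Eigen::Matrix3d& E,
                 const std::vector<Eigen::Vector2d>& x,
                 const std::vector<Eigen::Vector2d>& xp) {
    Eigen::JacobiSVD<Eigen::Matrix3d> svd(E, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d U = svd.matrixU(), V = svd.matrixV();
    Eigen::Matrix3d W;
    W << 0, -1, 0,
         1,  0, 0,
         0,  0, 1;
    Eigen::Matrix3d R1 = U * W * V.transpose();
    Eigen::Matrix3d R2 = U * W.transpose() * V.transpose();
    if (R1.determinant() < 0) R1 = -R1;   // enforce proper rotations
    if (R2.determinant() < 0) R2 = -R2;
    Eigen::Vector3d u3 = U.col(2);

    Eigen::Matrix<double, 3, 4> P = Eigen::Matrix<double, 3, 4>::Zero();
    P.block<3, 3>(0, 0).setIdentity();    // first camera P = [I | 0]

    const Eigen::Matrix3d Rs[4] = {R1, R1, R2, R2};
    const Eigen::Vector3d ts[4] = {u3, -u3, u3, -u3};
    std::pair<Eigen::Matrix3d, Eigen::Vector3d> best;
    int bestCount = -1;
    for (int k = 0; k < 4; ++k) {
        Eigen::Matrix<double, 3, 4> Pp;
        Pp << Rs[k], ts[k];               // second camera P' = [R | t]
        int count = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            Eigen::Vector3d X = triangulate(P, Pp, x[i], xp[i]);
            if (X.z() > 0 && (Rs[k] * X + ts[k]).z() > 0) ++count;
        }
        if (count > bestCount) { bestCount = count; best = {Rs[k], ts[k]}; }
    }
    return best;
}
```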
This form can be obtained by the cross product x × (PX) = 0, which written out gives [24, pp. 312]

x(p3TX) − (p1TX) = 0
y(p3TX) − (p2TX) = 0
x(p2TX) − y(p1TX) = 0,    (2.51)

where piT denotes the i:th row of P. Combined with the analogue for x′ × (P′X) = 0, this can be written in the form AX = 0 with

A = [ x p3T − p1T
      y p3T − p2T
      x′ p′3T − p′1T
      y′ p′3T − p′2T ].    (2.52)

This can be solved using SVD, A = UDV^T, where the last column of V will correspond to the least-squares solution of X [24, pp. 312]. A more efficient algorithm for obtaining t and R is proposed by Nistér in 2004 [31].

Feature detection and description

To use the methods described, points in the image planes of successive images that are projections of the same point on a physical object must be identified. There are two generally used approaches to find image points corresponding to the same world point: feature matching and feature tracking.

Feature matching means independently detecting interesting feature points in two images, constructing a feature descriptor for each point, which should describe the point in a unique way but still be invariant to different views, and then matching the feature points between the two images using their descriptors.

Feature detectors can be divided into two groups: corner detectors and blob detectors. A corner is defined as a point where two or more edges intersect [32]. A blob is a pattern in the image which differs from its immediate neighbourhood in terms of intensity, colour and texture, and is neither an edge nor a corner [32]. Desired properties for feature detectors are localisation accuracy (the pixel coordinate of the detected feature should be precise); repeatability (the same feature should be detected in different images); computational efficiency; robustness to noise, compression artifacts and blur; distinctiveness (so that the feature can be matched accurately across different images); and invariance to both photometric and geometric changes [32]. Corner detectors are in general faster to compute and give a more accurate image position, but find less distinctive features than blob detectors. Blob detectors are usually less invariant to changes in scale and viewpoint [32]. The choice of feature detector depends on computational constraints, environment type and motion baseline, and should be carefully considered [32].

There are many feature detectors available. Commonly used corner detectors are Harris [33], Features from Accelerated Segment Test (FAST) [34] and Shi-Tomasi [35], and commonly used blob detectors are Scale Invariant Feature Transform (SIFT) [36], Speeded Up Robust Features (SURF) [37] and Center Surround Extrema (CENSURE) [38]. Feature detectors usually work in two steps. The first step is to apply a feature-response function on the entire image [32]. For instance, Harris uses the corner response function and SIFT uses the Difference-Of-Gaussian (DoG) operator [32]. The second step is to localise all local extrema points on the output of the first step by applying nonmaxima suppression [32]. To achieve invariance to scale, the feature detector is often applied to images of different scale; however, this leads to loss of accuracy in the pixel coordinate due to the lower resolution and is computationally heavy [38]. CENSURE takes another approach, using the Laplacian across scales as the feature response, which calculates the feature response at every pixel at all scales and achieves real-time performance [38]. A summary of the properties and performance of different feature detectors is given in table 2.1.

Detector     Corner det.  Blob det.  Rot. inv.  Scale inv.  Affine inv.(1)  Repeatability  Loc. accuracy  Robustness  Efficiency
Harris       ×            -          ×          -           -               +++            +++            ++          ++
Shi-Tomasi   ×            -          ×          -           -               +++            +++            ++          ++
FAST         ×            -          ×          ×           -               ++             ++             ++          ++++
SIFT         -            ×          ×          ×           ×               +++            ++             +++         +
SURF         -            ×          ×          ×           ×               +++            ++             ++          ++
CENSURE      -            ×          ×          ×           ×               +++            ++             +++         +++

Table 2.1: Properties and performance of feature detectors [32].
(1) Not truly affine invariant, but found to be invariant up to certain changes in viewpoint.

For each detected feature point a compact descriptor based on the region around the point is calculated. The simplest feature descriptor is a descriptor of the appearance of the point, i.e. the intensity of each pixel in a region around the feature point. Then the sum of squared differences or the normalised cross correlation can be used to compare the descriptors [32]. However, these descriptors are not very robust to changes in orientation, scale and viewpoint [32]. More elaborate descriptors are the SIFT descriptor [36], the SURF descriptor [37], Binary Robust Independent Elementary Features (BRIEF) [39], Oriented FAST and Rotated BRIEF (ORB) [40] and Binary Robust Invariant Scalable Keypoints (BRISK) [41]. Common to these descriptors is that they generate a vector, usually 64 or 128 elements long. SIFT and SURF produce vectors of real numbers, while, as their names indicate, BRISK, ORB and BRIEF produce binary vectors. The SIFT descriptor algorithm works by dividing the region around the feature point into a 4×4 grid; for each cell a histogram of gradients over eight orientations is calculated, and the histograms are concatenated into the 128 element descriptor vector [32]. It is desired for a descriptor to be invariant to rotation. In SIFT, this is achieved by calculating the image gradient orientation of a region around the feature point and rotating the calculation of the histograms accordingly [36].

The sets of features from two images can then be exhaustively matched, using a similarity measure on the feature descriptors. For SIFT and SURF the Euclidean distance can be used [32], and for BRIEF, ORB and BRISK the Hamming distance can be used [39][40][41]. The complexity of exhaustive matching is O(n2), which becomes impractical when the number of features becomes large [32]. Binary descriptors easily outperform floating point descriptors in speed of matching, as the Hamming distance of a binary vector can be calculated extremely fast on a modern CPU in the form of a bitwise XOR operation [41]. The matching process can also be sped up by using an indexing structure, such as a search tree or a hash table [32]. If there is a known motion model, where other sensors, like an IMU, can be used to estimate the motion, then, given the first image point, the epipolar line of the second image, on which the matching point should recur, can be calculated as shown in section 2.1.2, and the search for a matching feature point can be limited [32].
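As an illustration of descriptor-based matching, the sketch below detects ORB keypoints in two images and matches their binary descriptors with the Hamming distance using OpenCV (assuming OpenCV 3.x; the file names are placeholders). Note that this is only an example of the matching approach described above; the implementation presented in chapter 3 instead uses feature tracking.

```cpp
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Sketch of feature matching with a binary descriptor: ORB keypoints and
// descriptors are computed in two images and matched exhaustively with
// the Hamming distance. Cross-checking keeps only mutual best matches,
// which removes some outliers before the RANSAC step described next.
int main() {
    cv::Mat img1 = cv::imread("frame_000000.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("frame_000001.png", cv::IMREAD_GRAYSCALE);
    if (img1.empty() || img2.empty()) return 1;

    cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);   // up to 2000 features
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    orb->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    // Brute-force matcher with Hamming distance and cross-check enabled.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches);

    // matches[k].queryIdx / trainIdx index into kp1 / kp2; the resulting
    // point correspondences are the input to the epipolar geometry and
    // RANSAC steps described in the following paragraphs.
    return 0;
}
```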
Feature tracking is a different approach to find matching points. The tracker is initialised with a set of feature points, which are then tracked from image to image. Often the Kanade–Lucas–Tomasi (KLT) [35] feature tracker is used [32]. These methods require that the images are taken in close proximity [32]. The KLT tracker estimates the displacement of a point by estimating an affine image transformation of a region in the image around the feature point, the feature window. The affine image motion assumes that a point, x = (x, y)T, in the first image, I1, will move to the point Ax + d in the second image, I2, and that

I2(Ax + d) = I1(x),    (2.53)

where d is the feature window displacement, A = I + D, and

D = [ dxx  dxy
      dyx  dyy ].    (2.54)

The displacement, d, and the deformation matrix, D, are then found by minimising [35]

ε = ∫∫_W [I2(Ax + d) − I1(x)]^2 w(x) dx,    (2.55)

where W is the feature window and w(x) is a weighting function. The weighting function can e.g. be equal to 1, or a Gaussian-like function to emphasise the central area of the feature window [35]. The integral is linearised using a truncated Taylor expansion and then minimised using the Newton-Raphson method [35].

RANSAC

Both feature matching and feature tracking will produce some incorrect point correspondences between images, outliers, which will bias the result if included in the estimation of the motion. Random Sample Consensus (RANSAC) [42] has been established as the standard method for outlier rejection [32]. The idea behind RANSAC is to generate a number of hypothesis models by repeatedly sampling a randomly selected minimal set of correspondences, and to count the total number of correspondences in consensus with each estimated hypothesis [32]. The correspondences in consensus with the hypothesis generating the most matches are selected as inliers. An example of how the RANSAC sampling process could look when estimating a line in a plane is illustrated in figure 2.7.

Figure 2.7: Illustration of how RANSAC would work to estimate a line. The figure shows the two randomly selected points in blue, rejected outliers in red and inliers in green for four different hypotheses. Here the inliers of the hypothesis in the bottom right image are selected as the final inliers, as they are more numerous than in the other hypotheses.

The number of samples, N, required for finding a correct hypothesis with a probability of success, P, is determined by the fraction of outliers, ε, and the number of points in the minimal set, s [32]:

N = log(1 − P) / log(1 − (1 − ε)^s).    (2.56)

As can be seen in figure 2.8, the number of required samples increases exponentially with the number of points in the minimal set; it is thus desirable for the minimal set required to generate a hypothesis to be as small as possible.

[Plot: number of RANSAC iterations versus fraction of outliers, for 1-point, 5-point and 8-point minimal sets.]
Figure 2.8: A plot of the number of required RANSAC iterations for a 99% probability of success, at different fractions of outliers and for different numbers of points in the minimal set.
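In practice, five-point RANSAC estimation of the essential matrix and the subsequent pose recovery are available as library routines. The sketch below uses OpenCV (assuming OpenCV 3.x and a known calibration matrix K); the helper function simply evaluates (2.56) and reproduces the iteration counts discussed in the following paragraphs, and the function name, thresholds and probabilities are example choices.

```cpp
#include <opencv2/calib3d.hpp>
#include <cmath>
#include <cstdio>
#include <vector>

// Number of RANSAC iterations according to (2.56).
int ransacIterations(double P, double outlierRatio, int s) {
    return static_cast<int>(std::ceil(
        std::log(1.0 - P) / std::log(1.0 - std::pow(1.0 - outlierRatio, s))));
}

// Sketch: estimate the essential matrix from matched pixel coordinates
// with RANSAC and recover the relative pose. cv::findEssentialMat uses
// the five-point solver internally, and cv::recoverPose performs the
// decomposition and cheirality test described above.
void estimateRelativePose(const std::vector<cv::Point2f>& pts1,
                          const std::vector<cv::Point2f>& pts2,
                          const cv::Mat& K) {
    std::printf("5-point RANSAC iterations (P = 0.99, 50%% outliers): %d\n",
                ransacIterations(0.99, 0.5, 5));   // prints 145

    cv::Mat inlierMask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC,
                                     0.99 /*prob*/, 1.0 /*px threshold*/,
                                     inlierMask);
    cv::Mat R, t;   // rotation and unit-norm translation; scale is unresolved
    cv::recoverPose(E, pts1, pts2, K, R, t, inlierMask);
}
```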
287], JJT = (Fx)2 1 + (Fx)2 2 + (F Tx′)2 1 + (F Tx′)2 2, (2.58) where (Fx)j represents the jth element of the vector Fx. Thus, the Sampson cost function for a point correspondence distance to the epipolar line, given a fundamental matrix is (x′TFx)2 (Fx)2 1 + (Fx)2 2 + (F Tx′)2 1 + (F Tx′)2 2 . (2.59) As it is desired for the minimal set of corresponding points to be as small as pos- sible, to reduce the number of required RANSAC iterations, algorithms requiring less correspondences than the normalised 8-point algorithm, like the highly efficient 5-point algorithm proposed by Nistér in 2004, could be used. Using the 5-point algorithm, which requires 5 correspondences, compared to the normalised 8-point algorithm, which requires 8, would reduce the number of required iterations from 1177 to 145, with a 99% success rate and 50% outliers, according to (2.56). In 2009 Scaramuzza et al. proposed a solution for RANSAC for on-road vehicles, only requiring 1 point [44]. With 1 correspondence, according to (2.56), the number of required RANSAC iterations with a 99% success rate and 50% outliers is 7. The algorithm requires only 1 correspondence, as a restricted motion model, assuming circular motion, is used [44]. As can be seen in figure 2.9a, a planar motion can be described by three parameters, the yaw angle θ and the polar coordinates (ρ, ϕ)T 21 2. Theory [44]. Since, when using a monocular camera, the scale factor is unknown, ρ can be set to 1 [44]. Assuming the camera to move along a circumference, perpendicular to the Instantaneous Centre of Rotation (ICR), as in figure 2.9a, gives ϕ = θ/2, thus, only θ has to be estimated, which can be done with a 1 point correspondence [44]. The Ackermann steering configuration, which is common in automobiles [45, ρ θ ϕ = θ/2 θ π−θ 2 π−θ 2 ICR RICR C C′ (a) Description of circular motion. ICR RICR Copt Creal (b) The Ackermann steering principle. Figure 2.9: Illustrations of how a circular motion can be used describe the motion of a vehicle using the Ackermann steering principle. pp. 37], ensures an ICR will exist, located in the center of the rear axle [45, pp. 68]. Thus, the above assumptions made about the motion in the 1 point RANSAC are reasonable for cars, if the camera is placed as Copt in figure 2.9b [44]. However, in practice, the steering angle of cars is small and thus the radius of the curvature is big, which allows relaxation of constraint on the placement of the camera to the principal axis of the camera being perpendicular to the rear-wheel axle [44]. The circular motion can be described by a rotational matrix R and a translation matrix t as R =  cos θ 0 sin θ 0 1 0 − sin θ 0 cos θ  t = ρ · sinϕ 0 cosϕ  , (2.60) if rotation is about the y-axis and translation in the xz plane is considered. It is known from (2.47) that the essential matrix, E, relates two corresponding image points, x = (x, y, 1)T and x′ = (x′, y′, 1)T , as x′Ex = 0, and from (2.46), that E = [t]×R. Using (2.60), (2.46), and the constraint ϕ = θ/2, an expression for the essential matrix can be obtained: E = ρ ·  0 − cos θ 2 0 cos θ 2 0 sin θ 2 0 sin θ 2 0  . (2.61) 22 2. Theory Inserting this into (2.47) gives sin θ2(y′ + y) + cos θ2(y′x− x′y) = 0, (2.62) and the rotation angle can then be obtained: θ = −2 arctan ( y′x− x′y y′ + y ) . (2.63) Bundle Adjustment Bundle Adjustment (BA) is a method for minimising the reprojection error of fea- tures detected in an image by adjusting the camera pose estimate and feature 3D position estimates [24, pp. 434]. 
Bundle Adjustment

Bundle Adjustment (BA) is a method for minimising the reprojection error of features detected in an image by adjusting the camera pose estimates and the 3D positions of the features [24, pp. 434]. Consider a set of camera poses, described by camera matrices, P_i, and a set of 3D feature points, X_j, detected at image coordinates x_{i,j}. The reprojection error is the distance between the detected feature point in the image, x_{i,j}, and the image coordinates of the feature reprojected into the image given the camera pose and the 3D feature position, P_i X_j. Since this generally cannot be satisfied exactly due to noisy measurements, the objective is to find estimates of the camera matrices and feature point positions, \hat P_i and \hat X_j respectively, that minimise the total reprojection error. Assuming Gaussian noise on the measurements, the maximum likelihood estimates can be found by minimising the sum of the squared reprojection errors:

\min_{\hat P_i, \hat X_j} \sum_{i,j \in C} d(\hat P_i \hat X_j, x_{i,j})^2,    (2.64)

where d(a, b) is the geometric distance in the image between the homogeneous image points a and b, and C is the set of pairs i, j for which feature X_j has been seen from pose i, i.e. for which x_{i,j} exists.

BA needs a good initialisation [24, pp. 435]. This estimate can be provided by triangulating new features when they have been seen in two consecutive frames whose poses are estimated using epipolar geometry, as detailed above. It has been shown that BA is the optimal non-linear least-squares SLAM algorithm [46]. However, the complexity of minimising (2.64) with the Levenberg-Marquardt algorithm is cubic in the number of poses and features [46][24, pp. 435], which makes it impractical to optimise the full graph online. A way of reaping the benefits of the optimality of BA, while maintaining constant complexity and thereby making it possible to use BA for VO, is to partition the sequence of images by windowing the latest n frames and only performing BA for the corresponding poses and the features seen in those images [24, pp. 453]. This is called local BA.

2.2.2 Direct methods

Direct methods optimise the geometry directly on the image intensities, which makes it possible to use all information in the image [23]. In this way, direct methods circumvent the limitation of feature based methods that "only information that conforms to the feature type can be used" [23]. Direct methods offer higher accuracy and robustness than feature based methods, in particular in areas with few feature points, and in addition provide more information about the geometry of the environment [23]. Direct methods for visual odometry are well established for RGB-D and stereo cameras [23], such as the solution by Klose et al. in 2013 [47], and solutions for monocular cameras have been proposed by Pizzoli et al. in 2014 [48] and Stühmer et al. in 2010 [49]. All of these are computationally demanding and require a state-of-the-art Graphics Processing Unit (GPU) to run in real time [23]. In 2013, Engel et al. proposed a semi-dense solution for direct visual odometry with a monocular camera [13], which significantly reduces the computational complexity compared to previous work, but does not build a globally consistent map including loop closures [23]. The approach of the method is to spend computations where the information gain is maximised [13]. This is done by calculating a semi-dense inverse depth map only for the regions of the image with non-negligible gradient [13]. This solution for visual odometry was incorporated into a full solution of the SLAM problem by Engel et al. in 2014, which corrects accumulated scale drift and closes loops [23].
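To illustrate the basic ideas of operating directly on intensities and of restricting computation to pixels with non-negligible gradient, the OpenCV sketch below selects high-gradient pixels in a reference image and evaluates a photometric (intensity-difference) cost for a candidate motion. It is a deliberately simplified illustration, not the method of [13] or [23]: the warp is a pure 2D translation instead of a full rigid-body warp with inverse depth, the images are assumed to be 8-bit grayscale, and the function names are ours.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Select pixels whose gradient magnitude exceeds a threshold.
std::vector<cv::Point> highGradientPixels(const cv::Mat& gray, double thresh) {
    cv::Mat gx, gy;
    cv::Sobel(gray, gx, CV_32F, 1, 0);
    cv::Sobel(gray, gy, CV_32F, 0, 1);
    std::vector<cv::Point> pts;
    for (int y = 1; y < gray.rows - 1; ++y)
        for (int x = 1; x < gray.cols - 1; ++x) {
            float g = std::hypot(gx.at<float>(y, x), gy.at<float>(y, x));
            if (g > thresh) pts.emplace_back(x, y);
        }
    return pts;
}

// Photometric cost of a candidate translation (dx, dy): the sum of squared
// intensity differences over the selected pixels. A real direct method would
// instead warp each pixel using the camera motion and its inverse depth.
double photometricCost(const cv::Mat& I1, const cv::Mat& I2,
                       const std::vector<cv::Point>& pts, int dx, int dy) {
    double cost = 0.0;
    for (const cv::Point& p : pts) {
        int x2 = p.x + dx, y2 = p.y + dy;
        if (x2 < 0 || y2 < 0 || x2 >= I2.cols || y2 >= I2.rows) continue;
        double r = double(I2.at<uchar>(y2, x2)) - double(I1.at<uchar>(p.y, p.x));
        cost += r * r;
    }
    return cost;
}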
2.3 Simultaneous Localisation And Mapping

The SLAM problem consists of building a map and localising within that map simultaneously. There are several kinds of maps that can be used to represent the surroundings. Maps consisting of the locations of discrete point-like landmarks, i.e. point clouds, are well suited to camera data [50]. In 2001, Dissanayake et al. showed that it is possible to build accurate maps with only relative measurements, assuming Gaussian noise, linear models and static landmarks [51]. More specifically, they proved that the only lower limit on the variance of the estimates of the pose and landmarks is the variance of the initial pose.

The SLAM problem can be formulated as calculating, or approximating, the probability distribution

p(x_{1:t}, m \mid z_{1:t}, u_{1:t}, x_0),    (2.65)

where x_{1:t} is the trajectory of poses x_1 to x_t, m is the map, i.e. the positions of the observed landmarks, z_{1:t} are all measurements, u_{1:t} are all the motion commands issued and x_0 is the initial pose [9, pp. 310] [52]. Using a motion model appropriate for the given system, the motion commands can be translated into an odometry, and u is indeed sometimes referred to as odometry. Since the initial position is usually considered arbitrary, x_0 can be set to the origin with zero uncertainty. An equally important variant is the online SLAM problem of finding

p(x_t, m \mid z_{1:t}, u_{1:t}, x_0),    (2.66)

where only the current pose is sought together with the map, given all measurements and motion commands [9, pp. 309]. Factorising the right-hand side of (2.66) by applying Bayes' theorem, omitting x_0 for brevity, gives

p(x_t, m \mid z_{1:t}, u_{1:t}) = \frac{p(x_t, m \mid z_{1:t-1}, u_{1:t}) \, p(z_t \mid x_t, m)}{p(z_t \mid z_{1:t-1}, u_{1:t})}.    (2.67)

The observation model, p(z_t \mid x_t, m), is the probability of observing z_t at time t given a location x_t and a map. p(x_t, m \mid z_{1:t-1}, u_{1:t}) is the motion prediction, given the new controls, which can be written as

p(x_t, m \mid z_{1:t-1}, u_{1:t}) = \int p(x_{t-1}, m \mid z_{1:t-1}, u_{1:t-1}) \, p(x_t \mid x_{t-1}, u_t) \, dx_{t-1}.    (2.68)

This clearly factors the prediction into the motion model, p(x_t \mid x_{t-1}, u_t), and the previous posterior, p(x_{t-1}, m \mid z_{1:t-1}, u_{1:t-1}).

The main difference between VO using BA and V-SLAM is that V-SLAM in general tries to create a globally consistent map, as opposed to the local consistency of VO [28], thereby reducing the drift of the ego-motion. A way of ensuring the global consistency of the map is to detect the return to a previously visited location, a loop closure.

2.3.1 Loop closure

Because of the drift inherent in successively building a map from incremental, imperfect odometry, the map representation will never be perfectly aligned when returning to a previously visited location. To make the map globally consistent, the SLAM algorithm needs to explicitly connect the new location with the previously visited location to which it corresponds, and adjust the path in between to take the new connection into account, in order to make the parts of the map involved consistent. The process thus has two basic steps: loop detection and loop closing.

It is important to note that while loop closures are very useful in a SLAM system, introducing erroneous loop closures can cause catastrophic errors in the map and ego-motion estimates that are hard to recover from, if past data associations cannot be undone. It is thus very important not to introduce false positives when closing loops. A common way of measuring the performance of a loop detector is the recall at 100% precision, i.e.
the proportion of all true loop closures detected without accepting any false positives. With good VO, only occasional loop closures are needed to straighten out the map and trajectory.

In the case of monocular V-SLAM, there are three general categories of loop detection [53]: map-to-map, image-to-map and image-to-image.

Map-to-map loop detection consists of matching the features of two submaps, considering the feature descriptors as well as the relative geometric positions of the features. This method has been used in an EKF SLAM system by Clemente et al. in 2007 [54], and in a graph-based SLAM system by Eade and Drummond in 2008 [55]. When the submaps have too few features in common to match, 100% precision cannot be maintained [53].

Image-to-map loop detection aims to find correspondences between the latest image and features in a map. This method is used in a graph-based SLAM system by Lim et al. in 2014 [19] and in EKF SLAM with submaps by Williams et al. in 2008 [56]. Lim et al. used a vocabulary tree to find correspondences with earlier feature point locations, and when enough features match, geometric verification is performed using RANSAC to reject outliers [19].

Image-to-image loop detection, or appearance-only loop detection, works by matching the current image to previously seen images. One such algorithm is FAB-MAP 2.0, by Cummins and Newman in 2011 [21], further details of which are given below. Another variant of the image-to-image approach is matching a recent sequence of images to previous sequences, as implemented in SeqSLAM by Milford and Wyeth in 2012 [22]. In the case of vehicles travelling on roads, it is clear that sequences of images from a forward looking camera can provide useful information for resolving problems with visual aliasing, since vehicles in the same location will move along similar paths. However, the procedure used by Milford and Wyeth in 2012 [22], of taking the absolute difference between down-sampled and patch-normalised images, is highly sensitive to changes in view-point orientation [57]. A representation of the image differences that is less susceptible to such variations could improve the performance. One limitation common to the image-to-image systems examined here is that only a topological loop closure is achieved. If a geometric loop closure is sought, the relative pose between the recent image and the found match must also be computed.

FAB-MAP

FAB-MAP 2.0 [21], by Cummins and Newman in 2011, is the state-of-the-art method for loop closure detection [22][58]. It builds on FAB-MAP [20] by Cummins and Newman in 2008. Both use the Bag-of-Words (BoW) representation of images. The words consist of feature descriptor vectors detected in the image that have been quantised with respect to a vocabulary, i.e. each descriptor is replaced by the closest word, in the L2 sense, in the vocabulary. The vocabulary is built by clustering the descriptors found in a training set of images and choosing the centres of the clusters as the vocabulary words. The set of descriptors represented by each word corresponds to the Voronoi region in descriptor space around the word. An image is then a "bag" of the vocabulary words found in the image. The point of this procedure is to get a discrete representation of the visual words which can be indexed.
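A minimal sketch of the quantisation step is given below: each descriptor is assigned to its nearest vocabulary word in the L2 sense, and an image is then summarised by the set of word indices that occur in it. This is only meant to illustrate the BoW idea; the data layout (one descriptor per matrix row) and the function names are ours, and a practical system such as FAB-MAP 2.0 would use a large vocabulary with an efficient nearest-neighbour search rather than the linear scan used here.

#include <opencv2/opencv.hpp>
#include <set>

// Index of the vocabulary word (row of 'vocabulary') closest to 'descriptor'
// in the L2 sense. Linear scan for clarity.
int nearestWord(const cv::Mat& descriptor, const cv::Mat& vocabulary) {
    int best = 0;
    double bestDist = cv::norm(descriptor, vocabulary.row(0), cv::NORM_L2);
    for (int w = 1; w < vocabulary.rows; ++w) {
        double d = cv::norm(descriptor, vocabulary.row(w), cv::NORM_L2);
        if (d < bestDist) { bestDist = d; best = w; }
    }
    return best;
}

// Bag-of-words representation of an image: the set of vocabulary words whose
// Voronoi region contains at least one of the image's descriptors.
std::set<int> bagOfWords(const cv::Mat& descriptors, const cv::Mat& vocabulary) {
    std::set<int> words;
    for (int i = 0; i < descriptors.rows; ++i)
        words.insert(nearestWord(descriptors.row(i), vocabulary));
    return words;
}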
This allows for simple enumeration of the vocabulary words appearing in each image, as well as the creation of an inverted index describing in which locations each word appears. A common strategy for building a vocabulary is to initialise the clustering with uniformly distributed random words. Cummins and Newman in 2011 showed that a radius-based initialisation, which requires the initial guesses to be separated by a minimum distance, gives improved recall at a given precision [21].

The loop detection is done by matching the BoW description of the latest image to those previously seen, taking into account the uniqueness of the words and their co-visibility. FAB-MAP can either assume the features are independent of each other, the naive Bayes approximation, or condition each feature on at most one other feature, resulting in a tree-structured graphical model. The naive Bayes approach has one unique solution: the probability of observing a feature is proportional to its frequency in the training data. There are many possible tree structures for a given number of nodes; the one that most closely approximates the original distribution is the Chow-Liu tree [20]. Word occurrences are highly correlated between words, and accounting for the correlation in this approximate manner has been shown to improve results [20].

The probabilistic model underlying FAB-MAP is based on unobservable "scene elements", e_q, which give rise to visual word observations, z_q. These are binary variables that take the value 1 when the scene element is in view or when a word describing that scene element is detected, respectively. The detection of words is characterised by two parameters corresponding to the probabilities p(z_q = 1 | e_q = 1) and p(z_q = 1 | e_q = 0), the true and false positive probabilities. These two parameters, together with a training data set, are needed to initialise, i.e. build the vocabulary and tree, and run the FAB-MAP algorithm.

The observation derived from an image at time k is Z_k = {z_1, ..., z_v}, with v being the number of words in the vocabulary. A location L_i is modelled as the belief about the presence of the scene elements, p(e_1 = 1 | L_i), ..., p(e_v = 1 | L_i). This allows a formulation of the belief about the current location very similar to recursive Bayesian estimation. The probability of the ith location being the current location, given the observations Z_1 to Z_k, can be written

p(L_i \mid Z_{1:k}) = \frac{p(Z_k \mid L_i, Z_{1:k-1}) \, p(L_i \mid Z_{1:k-1})}{p(Z_k \mid Z_{1:k-1})},    (2.69)

where p(L_i | Z_{1:k-1}) is the location prior, p(Z_k | L_i, Z_{1:k-1}) the observation likelihood, and p(Z_k | Z_{1:k-1}) a normalising term. Commonly, the normalising term does not have to be calculated in Bayesian estimation, because the probability function covers all possible outcomes. In (2.69), however, the outcomes only cover previously visited locations, and the possibility that the measurement comes from an entirely new location remains. To determine this, the normalising term must also be evaluated [20]. FAB-MAP gives, for each previous location, a probability of it being a loop closure with the latest image, and those above a certain threshold are accepted as loop closures. FAB-MAP 2.0 also performs a verification of the epipolar geometry between the potential loop closures with RANSAC and weighs this into the ranking.

2.3.2 EKF SLAM

EKF SLAM is based on the EKF. The EKF applies the classic Kalman filter to a non-linear model by linearising the model using a first order Taylor expansion.
This form of SLAM was the first to be implemented. It is an online form of SLAM and uses feature-based maps. As EKF SLAM is based on the EKF, it assumes that the measurement noise and the motion model uncertainty have Gaussian distributions. EKFs tend to handle large uncertainties in the posterior poorly, since the linearisation may introduce errors too large to handle [9, pp. 312]. The EKF SLAM algorithm is also limited to using only positive information; it cannot take into account the absence of landmarks [9, pp. 313].

The state vector in EKF SLAM consists of the current pose, an n-dimensional vector, x_t, and the landmark positions, m. Call the combined M-dimensional state vector y_t = (x_t^T, m^T)^T. EKF SLAM calculates an approximation of the posterior p(y_t | z_{1:t}, u_{1:t}). This is done by following the standard EKF procedure outlined below. The mean and covariance of y_t are initialised in the origin, with zero variance for the pose states and infinite variance for the landmarks,

\mu_0 = \begin{pmatrix} 0 & \dots & 0 \end{pmatrix}^T \quad \text{and} \quad \Sigma_0 = \begin{bmatrix} 0 & 0 \\ 0 & \infty \cdot I \end{bmatrix}    (2.70)

for the mean and covariance matrix respectively. \Sigma_0 is an M \times M matrix in which the diagonal zero block is n \times n, and \infty \cdot I is a matrix with \infty on the diagonal. Then the predicted mean and covariance are calculated:

\bar\mu_t = g(\mu_{t-1}, u_t),    (2.71)

where \bar\mu_t is the predicted mean, \mu_{t-1} is the previous estimate of the mean, and g(\mu_{t-1}, u_t) is the motion model.

\bar\Sigma_t = G_t \Sigma_{t-1} G_t^T + R_t    (2.72)

Here, \bar\Sigma_t is the covariance of the prediction, G_t = \partial g(\mu_{t-1}, u_t) / \partial y_{t-1} is the Jacobian of the motion model, and R_t is the motion noise. This is how the covariance propagates in the linearised model [9, pp. 314].

The following assumes that the data association problem is solved by a feature tracker, loop closure algorithm or other method, so that the correspondences of the features are known and stored in a correspondence vector, c_t, where the element c_t^i is the index of the ith landmark seen at time t, 1 ≤ i ≤ N_t, with N_t the number of landmarks observed at time t. If a landmark has never been seen before, the estimated mean and covariance of the landmark position are appended to the state vector. In the case of bearing-only measurements, there is no sensible estimate of the position of the landmark until it has been seen multiple times and its position can be triangulated.

The mean and covariance can then be updated by calculating the estimated measurement for each individual landmark seen at time t, \hat z_t^i = h_j(\bar\mu_t), where h_j is the measurement function and j = c_t^i, that is, the expected measurement given what is known about the landmarks so far. For a camera, this is the pixel coordinates of the landmark when projecting it into the camera at the current pose. For each landmark the update is then carried out:

K_t^i = \bar\Sigma_t^{i-1} H_t^{i\,T} \left( H_t^i \bar\Sigma_t^{i-1} H_t^{i\,T} + Q_t \right)^{-1}
\bar\mu_t^i = \bar\mu_t^{i-1} + K_t^i (z_t^i - \hat z_t^i)    (2.73)
\bar\Sigma_t^i = (I - K_t^i H_t^i) \bar\Sigma_t^{i-1},

where K_t^i is the Kalman gain and H_t^i is the Jacobian of the measurement function. Finally, setting \mu_t = \bar\mu_t^{N_t} and \Sigma_t = \bar\Sigma_t^{N_t} gives the estimate of the mean and covariance for the entire state.

One of the drawbacks of EKF SLAM is the poor scaling with the number of landmarks. The complexity of EKF SLAM is O(l^2) for l landmarks. Since the number of landmarks increases with the distance driven, the quadratic complexity means the computational problem becomes intractable for large maps.
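As an illustration of the per-landmark update in (2.73), the sketch below applies one EKF measurement update to the full mean and covariance using Eigen. It is a schematic fragment under simplifying assumptions: the measurement Jacobian H (of size 2 × M for a pixel measurement), the measurement noise Q and the predicted measurement ẑ are assumed to be computed elsewhere, and all names are our own.

#include <Eigen/Dense>

// One EKF measurement update, cf. (2.73): the state mean 'mu' (M x 1) and
// covariance 'Sigma' (M x M) are corrected using a single landmark
// observation z (2 x 1 pixel coordinates), its prediction zHat, the
// measurement Jacobian H (2 x M) and the measurement noise Q (2 x 2).
void ekfLandmarkUpdate(Eigen::VectorXd& mu, Eigen::MatrixXd& Sigma,
                       const Eigen::Vector2d& z, const Eigen::Vector2d& zHat,
                       const Eigen::MatrixXd& H, const Eigen::Matrix2d& Q) {
    const Eigen::MatrixXd S = H * Sigma * H.transpose() + Q;        // innovation covariance
    const Eigen::MatrixXd K = Sigma * H.transpose() * S.inverse();  // Kalman gain
    mu += K * (z - zHat);                                           // mean update
    const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(mu.size(), mu.size());
    Sigma = (I - K * H) * Sigma;                                    // covariance update
}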
There are several variants of Kalman filter based SLAM algorithms, among them other extensions of the basic Kalman filter to non-linear filtering, such as the Unscented Kalman Filter (UKF). There are also similar filters that, unlike the EKF and other direct descendants of the Kalman filter, which use covariance matrices to quantify the uncertainty, instead use the inverse of the covariance matrix, the information matrix. These are called Information Filters (IFs). The Extended Information Filter (EIF) is one such filter, extended to handle the non-linear case by linearisation, as in the EKF [9, pp. 75]. IFs trade the simple mean calculation and complex covariance calculation of Kalman filters for a complex mean calculation and a simple information matrix calculation [9, pp. 73]. One major benefit of using IFs is that, unlike the covariance matrices, the information matrices in SLAM are usually sparse, since there is only odometry between consecutive poses and only a subset of the landmarks is seen from each pose. An element of the information matrix is only non-zero when there is a connection between the corresponding pose-pose or pose-landmark pair [9, pp. 79]. Sparse problems can be solved significantly faster than dense ones.

2.3.3 FastSLAM

FastSLAM is the result of applying particle filters to the SLAM problem. A particle filter represents a probability distribution by a finite number of random samples of the distribution, instead of by a parametric description of a certain probability function, such as the mean and covariance parameters used to represent Gaussian distributions [9, pp. 85]. For this reason particle filters are sometimes called non-parametric filters. While the sample based representation used in particle filters is only approximate, it can represent a much broader class of distributions than the commonly used Gaussian distribution, e.g. multimodal distributions. Non-linear transformations of distributions are also easily applied to sets of samples, without the need for linearisation, to generate a set of samples from the resulting distribution [9, pp. 97].

To maintain performance, particle filters need a certain density of particles in the state space, to ensure that some particles are close to the correct state [9, pp. 112]. As the "volume" of the state space grows exponentially with its dimension, and the dimension of the state space grows with the number of landmarks, using a particle filter to estimate the entire state, including landmark positions, would quickly become intractable. Fortunately, since there are no measurements between landmarks, the correlations between landmarks arise solely through the uncertainty in the pose estimates [9, pp. 437]. Thus, if the poses are known, the landmarks are independent of each other. Each individual particle assumes that the trajectory is known (the uncertainty is instead described by the spread of the particles' pose estimates), and can therefore estimate the positions of the landmarks independently of each other. This can be described as factorising the SLAM posterior (2.65) into

p(x_{1:t}, m \mid z_{1:t}, u_{1:t}, c_{1:t}) = p(x_{1:t} \mid z_{1:t}, u_{1:t}, c_{1:t}) \prod_{n=1}^{N} p(m_n \mid x_{1:t}, z_{1:t}, c_{1:t}),    (2.74)

for a trajectory x_{1:t}, a map m of N landmarks with the correspondences c_{1:t}, and odometry and measurements u_{1:t} and z_{1:t}. Thus, each landmark can be independently estimated with a parametric filter like the EKF.
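The factorisation in (2.74) translates directly into a data structure: each particle carries one pose hypothesis (or trajectory) together with an independent small Gaussian, i.e. a mean and covariance, per landmark. The sketch below shows such a layout in C++ with Eigen; the type and member names are ours, and only the structure, not the filtering logic, is shown.

#include <Eigen/Dense>
#include <vector>

// Independent Gaussian estimate of one landmark position, maintained by a
// small per-landmark EKF inside each particle.
struct LandmarkEstimate {
    Eigen::Vector3d mean;        // estimated 3D landmark position
    Eigen::Matrix3d covariance;  // 3x3 uncertainty of that position
};

// One particle: a sampled pose hypothesis plus its own map. Because the
// trajectory is treated as known within the particle, the landmark
// estimates are independent of each other, cf. (2.74).
struct Particle {
    Eigen::Matrix4d pose = Eigen::Matrix4d::Identity();  // homogeneous 4x4 pose
    std::vector<LandmarkEstimate> landmarks;             // one Gaussian per landmark
    double weight = 1.0;                                 // importance weight
};

// The filter itself is simply a set of such particles.
using ParticleSet = std::vector<Particle>;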
This factorisation greatly reduces the number of states whose posterior distribution has to be estimated with particles to a constant number: 6 in the case of 3D poses. The resulting particle filter is called a Rao-Blackwellised particle filter. In FastSLAM, each particle has an estimate of the pose and estimates the locations of the landmarks with one EKF per landmark [9, pp. 439]. A particle with index [k] in FastSLAM thus contains an estimate of its trajectory, x_{1:t}^{[k]}, and a set of parametric, specifically Gaussian, estimates of all the landmarks seen so far, \{\mu_1^{[k]}, \Sigma_1^{[k]}, \dots, \mu_N^{[k]}, \Sigma_N^{[k]}\}. The FastSLAM algorithm iterates over each of the M particles and does a motion prediction and a measurement update.

The motion prediction differs between FastSLAM 1.0 and 2.0. In the 1.0 algorithm, the proposal distribution sampled is simply the motion prediction p(x_t | x_{t-1}, u_t), while the 2.0 version samples a distribution p(x_t | x_{1:t-1}, u_{1:t}, z_{1:t}). The benefit of the 2.0 approach is that the proposal distribution better covers the target distribution [9, pp. 451]. If the control and odometry are poor predictors of the posterior, few of the samples drawn from the proposal distribution will lie in the regions of high probability density in the target distribution, the posterior. This means that the distribution of particles diverges from the posterior. A way of coping with this effect is resampling [9, pp. 447], which is done as a last step.

Before resampling, the measurement update is carried out, which consists of doing an EKF update of the mean and covariance of each measured feature, in each particle. If a feature has not been seen, its estimate remains unchanged. When resampling, M new particles are drawn with replacement from the current set of particles, and these become the new set of particles used to represent the posterior. The probability of a particle being drawn is proportional to its importance weight. The importance weight, w_t^{[k]}, is the quotient of the target distribution and the proposal [9, pp. 448]:

w_t^{[k]} = \frac{\text{target}}{\text{proposal}} = \frac{p(x_{1:t}^{[k]} \mid z_{1:t}, u_{1:t}, c_{1:t})}{p(x_{1:t}^{[k]} \mid z_{1:t-1}, u_{1:t}, c_{1:t-1})}
= \eta \, \frac{p(z_t \mid x_t^{[k]}, z_{1:t-1}, u_{1:t}, c_{1:t}) \, p(x_{1:t}^{[k]} \mid z_{1:t-1}, u_{1:t}, c_{1:t})}{p(x_{1:t}^{[k]} \mid z_{1:t-1}, u_{1:t}, c_{1:t-1})}
= \eta \, p(z_t \mid x_t^{[k]}, z_{1:t-1}, u_{1:t}, c_{1:t}) = \eta \, p(z_t \mid x_t^{[k]}, c_t),    (2.75)

which is the measurement likelihood. \eta is a normalising factor that can be ignored, since the probability of drawing a particle only needs to be proportional to w_t^{[k]}. Resampling is done as a last step in FastSLAM, after the measurement update.

One of the disadvantages of particle filters is their inflexibility under loop closure. When the particles are resampled, fewer and fewer estimates of "old" poses remain represented among the particles. Since only the estimates still represented by one of the particles are available for the filter to choose from, it might be that the "locally" best approximations no longer contain the "globally" best match when a loop closure is included.

2.3.4 Graph-based SLAM

Graph-based SLAM has a very intuitive structure: each pose and each landmark is a node in a graph, and between the nodes there are edges that represent measurements. Typically there are edges between two successive pose nodes, and between a pose node and each node corresponding to a landmark seen from that pose, as illustrated in figure 2.10. These measurements form constraints on the locations of the poses and landmarks involved. Since no real measurement is exact, these constraints will not be perfectly consistent.
The goal of graph-based SLAM is to find a configuration of the nodes which minimises this inconsistency.

Figure 2.10: Basic structure of the graph in graph-based SLAM. White circles are pose nodes, black squares are landmarks, and solid and dashed edges are pose-to-pose and pose-to-landmark constraints respectively. At poses 20 and 21, loop closures are detected when the old landmarks 5 and 7 are observed.

A graph-based SLAM system can be split into a front-end and a back-end. The front-end builds a graph from the measurements and finds loop closures when they arise, setting the constraints of the graph. The back-end takes the graph built by the front-end and optimises it with respect to the constraints in the graph. The front-end deals directly with the sensor data, and is thus highly dependent on the kind of sensors used, the method of finding loop closures and the specifics of the implementation, while the back-end works on the more general and abstract representation of the graph.

Each edge in the graph corresponds to an error function. The type of error function depends on what type of measurement the edge represents: a pose-pose edge could, for example, represent the Euclidean distance between two poses or the 6 DoF relative transform between the nodes, while a pose-landmark edge could represent the reprojection error in an image in which the landmark appears. The error is the difference between the actual measurement and the expected measurement, given the current values in the graph:

e_{ij}(x_i, x_j) = z_{ij} - \hat z_{ij}(x_i, x_j).    (2.76)

Here, e_{ij} is the error associated with the edge between node x_i and node x_j, z_{ij} is the measurement between the nodes with indices i and j, and \hat z_{ij}(x_i, x_j) is the expected measurement for the pair of nodes x_i, x_j.

A common way of finding the parameters that best explain a set of measurements is to minimise the square of the errors over the set of possible parameters, which results in the maximum likelihood estimate. Assuming the measurements are affected independently by Gaussian noise, this is equivalent to finding the node states that minimise the negative log-likelihood of the observations [59]. Let x = (x_1^T, \dots, x_T^T)^T be a vector of the node states x_i, and \Omega_{ij} be the information matrix of the measurement from node i to node j. The likelihood of the observations given the poses and landmarks is

p(z \mid x) = \prod_{i,j \in C} \eta \, e^{-(z_{ij} - \hat z_{ij})^T \Omega_{ij} (z_{ij} - \hat z_{ij})} = \prod_{i,j \in C} \eta \, e^{-e_{ij}^T \Omega_{ij} e_{ij}},    (2.77)

where C is the set of all pairs i, j for which a measurement exists. Taking the logarithm of this gives

\ln \left( \prod_{i,j \in C} \eta \, e^{-e_{ij}^T \Omega_{ij} e_{ij}} \right) = \sum_{i,j \in C} \left( \ln \eta - e_{ij}^T \Omega_{ij} e_{ij} \right) = A - \sum_{i,j \in C} e_{ij}^T \Omega_{ij} e_{ij},    (2.78)

where A is a constant that does not depend on the node states. Maximising the right-hand side, i.e. finding the maximum likelihood estimate, is equivalent to minimising

F(x) = \sum_{i,j \in C} e_{ij}^T \Omega_{ij} e_{ij},    (2.79)

which is the sum of the squared error terms weighted by their information matrices. This can be written as finding the solution

x^* = \arg\min_x F(x).    (2.80)

With a good initial guess, \check x, of the poses and landmark positions, the Gauss-Newton or Levenberg-Marquardt algorithms can be used for minimising F [50]. These algorithms rely on iteratively refining \check x by linearising the error functions and solving the resulting linear minimisation problem to find a better state vector.
Simplifying the notation from e_{ij}(x_i, x_j) to e_{ij}(x), and writing e_{ij}(\check x) as \check e_{ij}, the linear approximation of the error function around the initial guess can be written as its first order Taylor expansion,

e_{ij}(x) = e_{ij}(\check x + \Delta x) \approx \check e_{ij} + J_{ij} \Delta x,    (2.81)

where J_{ij} is the Jacobian of e_{ij}(x) evaluated at \check x. The terms in (2.79) can then be expressed as

e_{ij}^T \Omega_{ij} e_{ij} \approx (\check e_{ij} + J_{ij} \Delta x)^T \Omega_{ij} (\check e_{ij} + J_{ij} \Delta x)
= \check e_{ij}^T \Omega_{ij} \check e_{ij} + 2 \check e_{ij}^T \Omega_{ij} J_{ij} \Delta x + \Delta x^T J_{ij}^T \Omega_{ij} J_{ij} \Delta x
= c_{ij} + 2 b_{ij}^T \Delta x + \Delta x^T H_{ij} \Delta x,    (2.82)

where c_{ij} = \check e_{ij}^T \Omega_{ij} \check e_{ij}, b_{ij}^T = \check e_{ij}^T \Omega_{ij} J_{ij}, and H_{ij} = J_{ij}^T \Omega_{ij} J_{ij}. Now the approximation of F(x) close to \check x can be written as

F(\check x + \Delta x) \approx \sum_{i,j \in C} \left( c_{ij} + 2 b_{ij}^T \Delta x + \Delta x^T H_{ij} \Delta x \right) = c + 2 b^T \Delta x + \Delta x^T H \Delta x.    (2.83)

This quadratic form has a global minimum if H is positive semi-definite. H is an information matrix [50] and thus at least positive semi-definite [60]. The global minimum occurs where the gradient of the quadratic form is zero, that is,

2b + 2H \Delta x = 0.    (2.84)

Thus, solving H \Delta x^* = -b gives \Delta x^*. Setting the new initial guess to \check x + \Delta x^*, and iterating until some convergence criterion is reached, yields an approximation of x^*.

In V-SLAM it is more common to use the Levenberg-Marquardt algorithm for finding \Delta x^*. It is very similar to Gauss-Newton, the difference being that instead of solving (2.84),

(H + \lambda I) \Delta x = -b    (2.85)

is solved. Here, \lambda is a parameter to be chosen for each iteration. If \lambda is "large enough", H + \lambda I will have full rank and be positive definite. This is helpful in the V-SLAM context, since H can lack positive definiteness due to scale ambiguity.

Solving (2.85) naively requires inverting H + \lambda I, and matrix inversion has a complexity of O(n^3), which quickly becomes intractable for matrices of the sizes relevant to graph-based SLAM with landmarks. Luckily, H is sparse and symmetric. The sparsity stems from the fact that each landmark is only seen from a few poses and the only pose-pose edges are between consecutive poses, and the symmetry from the terms J_{ij}^T \Omega_{ij} J_{ij} being symmetric, as illustrated in figure 2.11. There are very efficient solvers for sparse symmetric systems, the complexity of which depends on the structure of H.

Figure 2.11: An illustration of the sparseness and symmetry of the information matrix, H, corresponding to the graph in figure 2.10. Only the shaded elements are non-zero; black elements are pose nodes, dark grey corresponds to landmarks and middle grey to the edges between these. Loop closures manifest as off-diagonal elements at the symmetric pairs ⟨(20, 5), (5, 20)⟩ and ⟨(21, 7), (7, 21)⟩.

When tracking features in the images, erroneous associations are sometimes obtained, resulting in outliers when triangulating or when adding edge constraints to the graph. These can have a large detrimental impact on the final result, since the square of an already large error can overwhelm the smaller, inlier errors. Seen in a Gaussian light, the outliers are deemed very unlikely measurements, so unrealistically large adjustments to the other estimates are made to accommodate them. To lessen this effect one can use a robust error function. Such functions usually mimic the squared error function for small errors, to stay close to the Gaussian maximum likelihood estimate, but grow more slowly than quadratically for larger errors, to lessen the influence of the outliers. One such robust error function is the Huber error, which above a set magnitude of the error grows linearly instead of quadratically, as shown in figure 2.12.
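The Huber error can be written compactly: it equals the squared error for residuals whose magnitude is below a threshold δ, and continues linearly, with matched value and slope, above it. The small sketch below is a straightforward implementation of this definition, using the plain squared error e^2 as the inlier branch to match the squared errors in F(x) above; the threshold value used in an actual back-end is a tuning parameter, and the function names are ours.

#include <cmath>

// Huber error: quadratic for |e| <= delta, linear (with matching value and
// slope) for |e| > delta.
double huberError(double e, double delta) {
    const double a = std::abs(e);
    if (a <= delta)
        return e * e;                        // behaves like the squared error
    return 2.0 * delta * a - delta * delta;  // grows only linearly for outliers
}

// For comparison: the plain squared error that the Huber error replaces.
double squaredError(double e) { return e * e; }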
Figure 2.12: Comparison of the squared error function and the Huber error, plotted as functions of the residual z − ẑ.

3 Implementation

To evaluate the performance of the methods described in chapter 2, a solution to the 6 DoF SLAM problem has been implemented. The rest of this chapter details the implementation of the solution.

The implementation is written in C++ and makes use of third-party libraries. Non-primitive data types, mathematical methods and feature point related methods from the OpenCV library [29] are used. OpenCV was chosen since it is a widely used, cross-platform library for computer vision tasks with a C++ interface, and it includes most of the functionality that was provided by Matlab in the first implementation of the VO system. For handling the graph structure and optimisation, the g2o library [60] is used, as it is specifically tailored to graph-based SLAM problems and bundle adjustment. For loop closure detection, OpenFABMAP [58], as provided in OpenCV, is used. OpenFABMAP is an open source implementation of FAB-MAP, the loop closure method for V-SLAM that was chosen for this implementation. Tunable parameters are presented in appendix A.

3.1 Visual Odometry

The implementation of the VO assumes known camera calibration parameters. The system is divided into two parts. The first takes raw camera images, rectifies the images and tracks features in consecutive images. This enables the use of epipolar geometry for an initial estimate of the incremental ego-motion and triangulation of the detected features. The second part takes the initial estimate and further refines it with windowed bundle adjustment.

3.1.1 Features and tracking

To extract corresponding feature points between two consecutive images, a KLT tracker [35] is used. The tracker is initialised with a specified number, KLTNPoints, of FAST features [34], detected with an intensity threshold KLTFASTThreshold in the first image, which are then tracked in the second image. Points for which no corresponding point is found in the second image are removed. RANSAC, using the normalised 8-point algorithm, is used to remove outliers with a Sampson distance to the epipolar line greater than KLTRANSAC8PointThreshold. The number of RANSAC iterations is specified by KLTRANSAC8PointIterations. If points are lost, new FAST features are added in the second image, to later be tracked in the next image.

VO gives better results when the features are evenly distributed in the image [32]. To achieve an even distribution, the image is divided into a grid of KLTNWin fields, with margins, specified by KLTWinWidthMargin and KLTWinBottomMargin, in which no grid fields are created, to avoid creating features that are likely to disappear out of view in the next image. The number of feature points is counted in each field, and new FAST feature points are added only in the fraction, KLTWinThreshold, of the fields containing the fewest feature points.

As suggested by Scaramuzza et al. in 2009, if the distance between the matched points is less than KLTMotionDistanceThreshold pixels for a fraction KLTMotionRatioThreshold of the points, no motion is assumed and the second image is discarded. This is repeated until an image not fulfilling this condition is found.

The fundamental matrix is calculated using the feature points remaining after tracking and RANSAC with the normalised 8-point algorithm described in section 2.2.1. Using the known camera calibration matrix, the essential matrix is obtained according to (2.48).
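A condensed sketch of this front-end step with OpenCV is shown below: FAST corners are detected in the previous frame, tracked into the current frame with the pyramidal KLT tracker, outliers are rejected while estimating F with RANSAC, and the essential matrix is formed from the calibration matrix K. The parameter values are placeholders rather than the thesis parameters, the grid-based redistribution and the no-motion check are omitted, the calibration matrix is assumed to be a 3x3 CV_64F matrix, and the function name is ours.

#include <opencv2/opencv.hpp>
#include <vector>

// One front-end step: track FAST features from 'prev' to 'curr' with KLT,
// reject outliers with RANSAC on the fundamental matrix, and compute the
// essential matrix E = K^T F K, cf. (2.48).
cv::Mat trackAndEstimateEssential(const cv::Mat& prev, const cv::Mat& curr,
                                  const cv::Mat& K) {
    // Detect FAST keypoints in the previous (grayscale) image.
    std::vector<cv::KeyPoint> keypoints;
    cv::FAST(prev, keypoints, /*threshold=*/20, /*nonmaxSuppression=*/true);
    std::vector<cv::Point2f> p0, p1;
    cv::KeyPoint::convert(keypoints, p0);

    // Track the points into the current image with the pyramidal KLT tracker.
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prev, curr, p0, p1, status, err);

    // Keep only the successfully tracked correspondences.
    std::vector<cv::Point2f> q0, q1;
    for (size_t i = 0; i < status.size(); ++i)
        if (status[i]) { q0.push_back(p0[i]); q1.push_back(p1[i]); }

    // Estimate F with RANSAC; 'mask' marks the inlier correspondences.
    std::vector<uchar> mask;
    cv::Mat F = cv::findFundamentalMat(q0, q1, cv::FM_RANSAC,
                                       /*ransacReprojThreshold=*/1.0,
                                       /*confidence=*/0.99, mask);

    // Essential matrix from the known calibration.
    return K.t() * F * K;
}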
From the essential matrix, two possible rotation matrices and two possible unit-length translation vectors between the two frames can be obtained, as shown in (2.49). The equation shows that the two possible translation vectors are parallel but point in opposite directions. To determine the correct translation vector, the speed of the vehicle, which can be obtained from wheel odometry or an IMU, is used. The speed is also used to resolve the scale ambiguity of the monocular camera. This is done by setting the length of the translation vector to the distance travelled between the two images, according to the speed. To find the correct rotation matrix, the matrix describing a rotation of less than 180° about all axes is selected.

The above steps are repeated for each consecutive pair of frames. The rotation matrix, R, and the translation, t, are combined and expressed as a relative homogeneous transformation matrix:

T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}    (3.1)

The current pose, expressed as a homogeneous transformation matrix, is found by right-multiplying the relative transformation matrices of all frames up to the current one. The first pose is initialised at the origin of the camera coordinate system, with the identity matrix.

3.1.2 Bundle Adjustment

Each pose is added as a pose node, VertexCam, in a g2o [60] graph, with an SBACam estimate. An SBACam estimate contains a fixed camera calibration matrix and an extrinsic camera matrix, a pose, to be optimised. The camera calibration matrix is known beforehand and set accordingly, and the pose is set to the pose obtained from the estimated essential matrix.

For each new feature point tracked across two consecutive frames, a point node, VertexSBAPointXYZ, is added to the graph. The point node has an estimate of a position in R^3 to be optimised. The initial estimate is obtained by triangulating the point from the first two poses it is seen from. For each pose where the point is tracked, a projection edge, EdgeProjectP2MC, is added. The projection edge connects a pose node with a point node. The measurement is the pixel coordinate of the observed point in the image from the relevant pose. The error function is the reprojection error of the point into the image according to (2.11). The information matrix of the measurement is set to a diagonal matrix \sigma^{-2} \cdot I, where \sigma^2 is PixelVar.

To fix the scale ambiguity of the monocular camera, a distance edge, EdgeSBAScale, is added between each pair of consecutive pose nodes. The measurement is the geometric distance between the two poses, and the error function is the offset between the estimate and the measurement. The measurement is set to the distance between the poses according to the speed of the vehicle, and the information to \sigma^{-2}, where \sigma^2 is DistanceVar.

Between each pair of consecutive pose nodes, a pose-pose edge, EdgeSBACam, is also created and added to the graph. These edges are not considered during the incremental VO, but are instead used when loop closures are detected, to avoid doing costly BA optimisation on the entire graph. They are updated until the poses they connect fall out of the BA window. The measurement is the pose-to-pose transformation between the connected nodes, expressed as a 6-vector with Euclidean coordinates describing the translation and a unit quaternion with the real part omitted (assumed positive) describing the rotation, x = (x, y, z, q_0, q_1, q_2)^T. Using homogeneous transformation matrices, the error is calculated as

T_{err} = T_{meas}^{-1} T_1^{-1} T_2.    (3.2)
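A small sketch of this error computation with Eigen is given below. It evaluates (3.2) with homogeneous transformation matrices and extracts the 6-vector of translation and imaginary quaternion components; it is an illustration with our own function name, not the g2o implementation of EdgeSBACam.

#include <Eigen/Dense>
#include <Eigen/Geometry>

// Error of a pose-pose edge, cf. (3.2): the deviation between the measured
// relative transform T_meas and the relative transform implied by the two
// current pose estimates T1 and T2, expressed as a translation plus the
// imaginary part of a unit quaternion with positive real part.
Eigen::Matrix<double, 6, 1> posePoseError(const Eigen::Isometry3d& T_meas,
                                          const Eigen::Isometry3d& T1,
                                          const Eigen::Isometry3d& T2) {
    const Eigen::Isometry3d T_err = T_meas.inverse() * T1.inverse() * T2;

    Eigen::Quaterniond q(T_err.rotation());
    if (q.w() < 0.0) q.coeffs() *= -1.0;  // enforce a positive real part

    Eigen::Matrix<double, 6, 1> e;
    e.head<3>() = T_err.translation();
    e.tail<3>() = q.vec();                // imaginary quaternion part, real part omitted
    return e;
}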
The information matrix is set to a diagonal matrix with (\sigma_t^{-2}, \sigma_t^{-2}, \sigma_t^{-2}, \sigma_r^{-2}, \sigma_r^{-2}, \sigma_r^{-2})^T on the diagonal, where \sigma_t^2 is set to TranslVar and \sigma_r^2 is set to RotVar.

For each iteration, a graph with the projection edges and distance edges connecting the latest BAWindow pose nodes is optimised for BAIterations iterations with the Levenberg-Marquardt algorithm, OptimizationAlgorithmLevenberg, using a Huber error function and a solver based on sparse Cholesky factorisation, LinearSolverCholmod.

3.2 Simultaneous Localisation And Mapping

To get a full SLAM implementation, the estimate of the trajectory and map from the bundle-adjusted VO is adjusted again when loop closures are detected, to ensure that the map is consistent. This is divided into detecting loops using OpenFABMAP and finding the geometric relation between the current and the old pose in a step similar to the first part of the VO.

3.2.1 FAB-MAP

The FAB-MAP 2.0 [21] algorithm is set up using a vocabulary and a Chow-Liu tree [20], which have been created offline, as well as a training data set. The probability of a feature being detected, p(z = 1 | e = 1), is set to FMPzGe, and the probability of a false positive, p(z = 1 | e = 0), is set to FMPzGNe.

For each new image, a set of Star feature points, from the OpenCV library [29], which is a variant of the CENSURE [38] feature detector, is extracted. The threshold of the detector is dynamically adjusted, in at most FMStarIterations iterations, to give between FMStarNMin and FMStarNMax feature points, and is initialised at FMStarThreshold. For the