6DOF camera localization through a short image sequence Master’s thesis in Signals, Systems and Mechatronics Jonas Garsten & Ivar Wikenstedt Department of Electrical Engineering CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden, February 2, 2020 Master’s thesis February 2, 2020 6DOF camera localization through a short image sequence Jonas Garsten & Ivar Wikenstedt Department of Electrical Engineering Signal processing and Biomedical engineering Computer vision and medical image analysis Chalmers University of Technology Gothenburg, Sweden, February 2, 2020 6DOF camera localization through a short image sequence Jonas Garsten & Ivar Wikenstedt © Jonas Garsten & Ivar Wikenstedt, February 2, 2020. Supervisor: Torsten Sattler, Department of Electrical Engineering Examiner: Torsten Sattler, Department of Electrical Engineering Master’s Thesis February 2, 2020 Department of Electrical Engineering Division of Signal processing and Biomedical engineering Computer vision and medical image analysis Chalmers University of Technology SE-412 96 Gothenburg Cover: Visualization of the full reconstruction of CMU-Seasons-Extended [51] 3D- models using Meshlab [12], where each slice is represented by a unique colorcode. Typeset in LATEX Gothenburg, Sweden, February 2, 2020 iv 6DOF camera localization through a short image sequence Jonas Garsten & Ivar Wikenstedt Department of Electrical Engineering Chalmers University of Technology Keywords: Visual localization, Structure from motion, Computer vision, Sequential localization v Acknowledgements Thank you to Torsten Sattler our supervisor and examiner for all the discussions and feedback he assisted with. Test early and test often. To Rasmus Lundin for the endless supply of coffee and Linux support. To David Frisk for creating the LATEX-template used in this report, licensed under CC BY 4.0. Authors advice: When in doubt, run RANSAC! Jonas Garsten & Ivar Wikenstedt, Gothenburg, February 2, 2020 vii Contents Abstract xi List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Ways to localize images . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Point Cloud Compression . . . . . . . . . . . . . . . . . . . . 4 1.2 Content of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Theory 7 2.1 Camera Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Finding point coordinates and projection matrices . . . . . . . . . . . 8 2.3 A Structure from Motion Pipeline . . . . . . . . . . . . . . . . . . . . 10 2.4 The Generalized Camera . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.1 Pose estimation using the Generalized Camera . . . . . . . . . 14 2.5 Random Sample Consensus . . . . . . . . . . . . . . . . . . . . . . . 15 2.5.1 RANSAC with Td,d test . . . . . . . . . . . . . . . . . . . . . 16 3 Method 17 3.1 Query Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2 Compute 3D Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3 Matching 3D Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Camera Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.5 Compressing 3D Point Cloud . . . . . . . . . . . . . . . . . . . . . . 21 3.6 Special case: One Image . . . . . . . . . . . . . . . . . . . . . . . . . 21 4 Results 23 4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.1.1 Cambridge Landmarks St Mary´s Church . . . . . . . . . . . 23 4.1.2 Cambrige Landmarks Street . . . . . . . . . . . . . . . . . . . 23 4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2.1 Results Model compression . . . . . . . . . . . . . . . . . . . . 28 ix Contents 5 Conclusion 31 6 Future Work 33 Bibliography 35 x Abstract Camera localization is a core problem in the computer vision field and is a hot topic as cameras are more commonly integrated into navigation applications. With re- cent technological advancements it is now possible to store and process large data streams locally or in the cloud. One such example would be video streams. Common localization practices only localize one image at a time. The generalized camera de- scribes image observations as ray origins and ray directions in 3D space. Gaining the advantage of localizing many images at the same time and can use information from many images to do so. In this paper we develop a pipeline for camera localization using the generalized camera and evaluate how localization accuracy is affected by different sequence lengths. We present results showing our pipeline outperforming state of the art localization pipelines on the Cambridge Landmarks Street and St Mary’s Church datasets. xi Contents xii List of Figures 2.1 Camera extrincics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Camera intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Visualization of point triangulation . . . . . . . . . . . . . . . . . . . 10 2.4 Extracted features and matches . . . . . . . . . . . . . . . . . . . . . 11 2.5 Image gradient computation into keypoint descriptors. . . . . . . . . 12 2.6 Graphical visualization of cameras viewing the same points . . . . . . 14 2.7 Graphical visualization of the generalized camera model . . . . . . . . 14 3.1 Overview pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.1 Comparison methods, St Mary’s Church dataset . . . . . . . . . . . . 30 4.2 Comparison methods, Street dataset . . . . . . . . . . . . . . . . . . 30 xiii List of Figures xiv List of Tables 4.1 Localization results presented as bins with thresholds . . . . . . . . . 25 4.2 Localization results presented as median values . . . . . . . . . . . . 26 4.3 Reconstruction drift for sequence lengths . . . . . . . . . . . . . . . . 27 4.4 Localization results for overlapping sequences . . . . . . . . . . . . . 28 4.5 Localization results Street, presented as median values . . . . . . . . 29 4.6 Localization results from state of the art localization pipelines . . . . 29 xv List of Tables xvi 1 Introduction Visual localization is a method of identifying a camera’s location and orientation given an image from the camera, also known as the camera pose. This problem is core to computer vision with applications such as Structure from Motion (SfM) [1, 53], augmented reality (AR) [42] and simultaneous localization and mapping (SLAM) [13,63] to name a few. Visual localization can be done in a multitude of ways depending on the application and aim. A common approach is to find local 2D features in an image which can be compared and matched to features in a 3D map. By matching enough features between the two, the camera can be localized [32,59]. A more recent approach that has gained popularity with the development of machine learning is to substitute conventional methods with machine learning. One can use machine learning to generate 2D features [56], match 2D and 3D features [14] or to skip these steps and localize the camera directly [29]. In this paper, we will describe a visual localization pipeline that differentiates itself by using the generalized camera model. The generalized camera model describes a match between a 2D image point and a 3D model point, hereafter referred to as 2D-3D match, as a ray origin and ray direction. As matches are described in 3D space it is possible to use matches from different images to localize many images at a time. This makes it possible to localize more than one camera at a time by combining the information of several images. This pipeline takes an image sequence as input and estimates the camera trajectory as a local point cloud by doing Structure from Motion (SfM). The points in the local point cloud are then matched to points in a global point cloud, which is commonly referred to as a structure-based method. To find the camera trajectory in the image sequence, all cameras in the sequence are transformed into one generalized camera model. Then we localize the sequence by finding a similarity transform that maps the generalized camera to the global point cloud, based on our point correspondences. The main question of this research is to draw a conclusion whether or not using an image sequence can improve camera pose estimation results using the generalized camera model. Additional questions that we want to answer is the impact of memory compressed 3D-point model and splitting up the query sequence into fixed sequence length. In this report we make the following contributions: (i) A localization pipeline us- ing a 3D structure-based method with the properties of a sequential-based method. This will use a 3D-point map and an image sequence query or otherwise known as a video to compute a trajectory. (ii) An evaluation of our sequential localization algorithm on datasets commonly used for localization comparison. (iii) An exper- 1 1. Introduction imental evaluation of the effects using different sequence lengths and compressed 3D-point clouds on our pipeline. 1.1 Related Work This thesis main objective is to implement and evaluate how the use of generalized cameras can improve visual localization. Therefor we make use of already existing methods and theory when implementing our pipeline. 1.1.1 Ways to localize images Camera localization has been a developing area for many years where different meth- ods have emerged based on new areas of application and current technological ad- vances. An intuitive localization type is 2D image-based localization, also known as image retrieval. This localization method makes use of existing information associ- ated with images such as GPS location [7, 45] and either assigns the same position to the query images or computes the position [44] by matching features present in both images and finding a transformation between them. The challenge is to find the best matching image from the library in an efficient way. When libraries grow to hundreds of thousands or millions of images [10, 11] exhaustive matching between images becomes to slow and tasks that require localization in real time become impossible to solve. This has made the use of Bag of Words (BoW) and tree structures [7,24,54,55] increasingly popular because of the increased computa- tional speed compared to exhaustive methods. One of these real time applications is simultaneous localization and mapping (SLAM). A common problem for SLAM is compounding errors. A small error in position estimation will grow larger with time as more and more estimations are done. Image retrieval has had great success in dealing with this through loop closure [22,65]. Loop closure occurs when a place is visited a second time and is recognized as a previously visited location. 2D image based localization relies on images with location information such as GPS tags to estimate camera poses for new image. 3D structure-based localiza- tion removes the need for poses to be attached to images and calculates the camera pose from a 3D reference model instead. Most structure-based localization algo- rithms make use of the same general structure of extracting features from images and matching these to points in a point cloud model where with enough matches, a position and pose can be computed. This is the same independent of geograph- ical scope, ranging from neighborhood [60] to world wide [30] localization. Even tough geographical scope is completely different, the same localization challenges appear. In urban environments, many of the structures have similar patterns and colors such as brick walls. This occurs in rural areas as well but instead of structures it is vegetation that looks similar. With changing seasons and lighting conditions added on top, the task of finding robust and unique matches become increasingly challenging [30,49,66]. A wide variety of methods have been proposed to increase localization accuracy on increasingly challenging data sets. The most common way of matching features is 2 1. Introduction to match image-to-model. In [31] it is proposed that doing the opposite, matching model-to-image will increase localization accuracy. Since this method utilizes that similar images will have similar features. This is done by creating different scenes that images can be matched against. This has the potential of greatly reducing the number of feature comparisons. Similarly, [32] makes use of co-visibility between 3D points to not consider every match in isolation instead of evaluating if an image has many closely related matches. In a co-visibility graph the nodes are 3D points where two co-visible points, pi, pj, in the graph are bi-directional connected with edges eij and eji. The wight of the edge is calculated with cij = |Ai ∩ Aj| |Aj| (1.1) where cij is the weight if edge eij, Ai and Aj is the set if database images containing point pi and pj. As with image retrieval, there have been suggestions to use BoW for structure-based localization. BoW is used in computer vision to index similar images or features un- der words. This will make it easier to find which images are similar to one another. One proposal [48] is to use a fine vocabulary, where a BoWs vocabulary refers to the resolution of the grouping of images, in combination with co-visibility constraints to localize cameras and present results of greatly reduced memory requirements for large datasets. A more recent development in the computer vision field is to switch out traditional practices for machine learning alternatives. Machine learning uses neural networks that are modeled after neurons in the human brain. These networks are trained on massive amounts of data to make certain connection between input data and desired output. A simple example would be that given an image as input, the network will output the probabilities of which country it was taken in. The two main applica- tions of neural networks to camera localization are (i), to directly predict a cameras pose from the input image. (ii), to replace some part of the camera localization pipeline with a neural network. In the case of (i), [29] uses a convolutional neural network (CNN) to estimate camera poses and displays result of more robust pose estimation. Walch et al. [64] built on this idea by adding a long short-term memory (LSTM) [23]. Adding a LSTM helps in preventing overfitting, a phenomenon in machine learning where the network becomes adapted to the training data and its ability to make correct predictions on new data decrease. As for ii, in [39] normal feature detection is replaced by synthetic view point gen- eration that is then evaluated with the SeqSlam [40] algorithm. There is a problem in camera localization where a change in view point, such as lateral shifts, can neg- atively affect localization results. The proposed solution has a network trained to estimate depth maps, assigning how far objects are from the camera. With this depth map, it is possible to generate new view points that are shifted laterally. All views are then evaluated separately for better results. In many modern application of camera localization, the images to be localized are not disordered but come from a video stream where each consecutive image should 3 1. Introduction have a lot in common. This sequential data flow is common in robots, autonomous cars and SLAM. The sequential nature of query images can be exploited by estimat- ing movement trajectories [13] from the most recent pose estimations or to not only match one image at a time but match an entire sequence of images [40] to increase the likelihood of good matches. This method is referred to as sequential-based. In most of the previous works, there has only been cases where a single camera is used. This does imply that research into utilizing multi-camera systems have been overlooked. Multi-camera systems are not a new idea. The topic has been researched since the early two thousands [16, 26]. Most of the work up till now have focused on using multiple cameras for SLAM [9, 26, 63] where the benefit of more cameras is a wider filed of view, enabling the multi-camera rig to analyze more structure at the same time. In image localization where there is no rig or robot to add more cameras to the approach has instead been to try and represent cameras in a more general way to localize multiple images as if they were only a single image. This was introduced by Grossberg and Nayar [17] who proposed a way of modeling pixels in an image as raxels. A Raxel is the virtual representation of a photo sensitive element measur- ing the light intensity a pixel is exposed to.The idea of modeling light sensors was then further expanded upon by [43] who introduced the generalized camera model. The generalized camera model switched out raxels for Plücker vector, simplifying the camera model by, for example, removing the modulation of light intensity. Recently, Sweeney et al. [59] applied the generalized camera to large scale SfM and displayed impressive results reconstructing the city of Rome in only twenty two minutes. 1.1.2 Point Cloud Compression With generalized cameras having the ability to localize more than one image at a time by using the combined information present in all images, it is of interest to know how this added information can be used for camera localization. One applica- tion is point cloud compression. As camera localization moves towards applications requiring real-time performance [18,34] and the hardware in current mobile devices is not yet powerful enough for running high preforming localization algorithms in real-time. An alternative is then to offload the localization to a server [38]. The issue with this is that visual information captured by the mobile device has to be sent to the server requiring fast transfer speeds. To minimize transfer speed require- ments one can use point cloud compression to reduce the total data that has to be transmitted. Point cloud compression is generally done in two ways. Either the number of points are reduced [8, 31] or the point descriptor is compressed [18, 48]. Both approaches try to achieve the same result in reducing the memory require- ments for point cloud models. When selecting a subset of points the goal is to create a more compact description of a model without losing significant localization performance. In the method by S. Cao and N. Snavely [8] compression is done by incrementally selecting points which have good coverage and are distinct until all images can see k points. How good coverage a point has is not determined by how many images it is visible in but rather a probability pij of point i being visible in 4 1. Introduction image j is defined. This is similar to simply selecting the points with best coverage in the already registered images. However, it was found that using a probabilistic approach contributed to finding good feature matches. Points are then selected to maximize the probability of each images seeing k points. To increase the likelihood of picking distinct points the distance between a point and already picked points in descriptor space is also considered. Compressing the descriptors is tricky as the objective is to reduce the size while maintaining the distinctiveness of the descrip- tors. Intuitively, reducing the size of all descriptors would result in a less descriptive model representation and result in poor localization and [34] is an example of com- pression affecting localization accuracy. However, the application here is a SLAM system and the worse image localization accuracy did not significantly impact its pose estimation performance.They employ product quantization [25] which reduces a high dimensional descriptor into several descriptors of lower dimensionality. The low dimensional descriptors are clustered using k-means clustering and each cluster is then represented with a single descriptor. In the original descriptor each low di- mensional descriptor will be exchanged with a descriptor representing the cluster it belongs to. Given the increase in demand for real-time solutions in mobile applications, viewing the effects of using a compressed point cloud model in our pipeline is in our interest. 1.2 Content of thesis This thesis is split into four sections: Theory, Method, Results and Conclusion. In Theory we will explain how camera localization works and what a generalized camera is. Method will explain the different parts of our pipeline and specify what software and methods we use. Results will include our findings and discussion regarding our results. Finally we present our findings in short in the Conclusion section and will discuss what more can be done in Future Work. 5 1. Introduction 6 2 Theory Visual localization has been a research topic for decades where many methods have been developed to improve localization accuracy. In this chapter the underlying the- ory for visual localization will be presented to give the reader information necessary to understand the topic of visual localization. Our pipeline boils down to two steps, creating models using Structure from Motion and localizing images by comparing Structure from Motion models. The first chap- ters will explain how cameras are modeled and how these models can be used to calculate camera positions as well as to build 3D models of a scene. The later chap- ters describe how the generalized camera works and how images can be localized using this model. Lastly a method called RANSAC will be explained and why it is necessary to use RANSAC when the data is noisy or contain outliers. 2.1 Camera Model A camera model describes the transformation of points in the world into an image. There are many different types of camera models depending on for example which type of lens a camera is using. The simplest camera model is the pinhole camera model. This camera model has a camera center C where all image rays intersect. Global points are projected, transformed to the image plane, by following the ray connecting the point to the camera center until it intersects the image plane. The transformation from world coordinates into a pinhole camera can be expressed with the projection matrix P = K [ R t ] , (2.1) where R is a 3x3 rotation matrix, t is a 3x1 translation vector and K is a 3x3 calibration matrix. R and t are called the extrinsic parameters of a projection matrix containing information about camera orientation and position while K is the intrinsic parameters describing the camera. As seen in Figure 2.1 the rotation R and translation t transform global point into a camera coordinate system. From a camera coordinate system, points can be projected by applying the camera calibration matrix K as seen in Figure 2.2. The calibration matrix for a pinhole camera looks like K = γf sf x0 0 f y0 0 0 1  , (2.2) where the parameters are; focal length f, the distance between the center of projec- tion and the image plane. Skew s, which describes tilt of the pixels in the image. 7 2. Theory Aspect ratio γ, determining the pixel x-y ratio and finally x0 and y0 maps the origin from where the z-axis intersects the image plane to the upper left pixel. When ap- plying the calibration matrix, the distances are transformed into pixel values. With this, an image pixel coordinate can be found by reprojecting its corresponding global point [ X Y Z ]T into the image. The projection matrix would be applied as λ xy 1  ︸︷︷︸ x̄ = P  X Y Z 1  ︸ ︷︷ ︸ X̄ , (2.3) where [x y]T is an image pixel coordinate and λ is a scaling parameter. This compact way of applying R,t and K requires the use of homogeneous coordinates. Homogeneous coordinates lift a N-dimensional point to a N+1 dimensional line. This allows for translations and projections to be expressed as matrix multiplications. The N-dimensional point can after a transformation be retrieved by dividing by the last entry and then dropping the last entry. If we instead want to transform a point from an image into world coordinates, a problem occurs. To project a point, we apply P and divide with the last entry. But when transforming a point from the image into the world the last entry is not known. Calculating the last entry is not possible with a single image. For this a second image has to be added. 2.2 Finding point coordinates and projection ma- trices Given two cameras with different position, if both cameras can see the same 3D point its exact position can be computed. Because the point is present in both images, there are two different lines describing where it is located in 3D space. Its position is Figure 2.1: Visualization of applying the extrinsic parameters of camera matrix P will transform a 3D point from global coordinates into the camera coordinate system. 8 2. Theory then where both lines intersect (Figure 2.3). In essence, an equation system is solved by inserting image and 3D point information in Equation (2.3). The same principle can be applied when solving camera matrices. A camera matrix has the dimensions of 3x4, containing twelve variables. We only have to solve eleven variables as it is determined up to an arbitrary scale. To solve eleven variables, eleven equations are required. The camera equation (2.3) gives three equations for every point but λ also has to be solved for each new point. The number of points required then becomes 3n ≥ 11 + n⇒ n ≥ 5.5 (2.4) A simple approach to solve systems of linear equations is called direct linear trans- form (DLT) [20]. This approach expresses all equations in terms of a matrix mul- tiplication and finds an approximate nullspace for the matrix. The DLT algorithm explained below is a slightly modified version [27] of the one in Hartley and Zisser- mans book [20]. First, lets express P as P = p1 p2 p3  (2.5) where pi, i = 1, 2, 3, are 1x4 row vectors from P . Insert this representation of P into (2.3) and it can be written as p1X − λx = 0 (2.6) p2X − λy = 0 (2.7) p3X − λ = 0 (2.8) which in matrix form looks like X T 0 0 −x 0 XT 0 −y 0 0 XT −1   pT 1 pT 2 pT 3 λ  = 0 0 0  . (2.9) Figure 2.2: Applying the calibration matrix K to a point in camera coordinates will reproject it into the image as a pixel coordinate. 9 2. Theory Figure 2.3: Visualization of how 3D point locations can be determined by finding correspondences in two separate image and evaluate where the rays intersect. This is the matrix for only a single point. To find P at least 5.5 points are required according to equation 2.4. Since you can not use partial points, six points are required. If we include more points in equation 2.9 the matrix will be on the form XT 1 0 0 −x1 0 0 · · · 0 XT 1 0 −y1 0 0 · · · 0 0 XT 1 −1 0 0 · · · XT 2 0 0 0 −x2 0 · · · 0 XT 2 0 0 −y2 0 · · · 0 0 XT 2 0 −1 0 · · · XT 3 0 0 0 0 −x3 · · · 0 XT 3 0 0 0 −y3 · · · 0 0 XT 3 0 0 −1 · · · ... ... ... ... ... ... . . .  ︸ ︷︷ ︸ M  pT 1 pT 2 pT 3 λ1 λ2 λ3 ...  ︸ ︷︷ ︸ v =  0 0 0 0 0 0 ...  . (2.10) This equation can be solved by finding a non-zero vector in the nullspace of M . Since the scale of the solution is arbitrary the constraint ||v||2 = 1 can be applied. This system will most likely not have an exact solution and a least squares solution is found by minimizing min ||v||2=1 ||Mv||2. (2.11) 2.3 A Structure from Motion Pipeline A Structure from Motion (SfM) pipeline is widely used to reconstruct 3D point models from images [1, 53, 59]. A SfM pipeline is divided into the following steps: 10 2. Theory 1. Extract image features and compute descriptors. To compare images against each other, it is possible to simply use individual pixel values or a blurred version to see if general patterns match. However, this yields poor results and unique features are instead computed. 2. Match descriptors. With sufficiently unique features, they can now be matched to determine if the same feature is present in multiple images. 3. Estimate camera positions and create point cloud model. Matched features can now be triangulated into 3D points. This can be done either by Incremental SfM [1] or Global SfM [41]. Feature extraction is commonly done by using SIFT [33] descriptors. This is done by sliding a Difference of Gaussian filter over the image and detecting local extrema for the filter. Descriptors [3,33,46] are used to describe feature points. A descriptor does not have to be made in any particular way but a common descriptor such as SIFT [33] uses gradient vectors to describe features. As seen in Figure 2.5, gradients are created by analyzing a small portion of the image and creating gradients to nearby pixels based on changes in light intensity. All gradients are then placed into the bins of multiple histograms which are concatenate to form a descriptor. When creating a descriptor, the goal is to make it as robust and unique as possible. A robust descriptor will not change in different lighting and weather conditions. This is an important characteristic for matching features in images taken over long periods of time. Uniqueness in the other hand is important as to not match descriptors which do not represent the same point, so called false positives. Figure 2.4: Extracted features (red dots) and the matched features between images (blue line). Both images are from the dataset Cambridge Landmarks St Mary’s Church [29]. Feature extraction and matching was done with COLMAP [53]. 11 2. Theory Figure 2.5: Image gradient computation into keypoint descriptors. Matching descriptors to one another is a simple task in theory. As with simple 3D points, comparing descriptors is done by computing the Euclidean distance between them. What may not be obvious is that with high dimensional data such as descrip- tors, this distance will be large even to close matches [2]. Close matches will have many dimensions that match well but some will not and this is enough for distances to grow large. High dimensional data is also usually gathered in large quantities because of the larger space they reside in. This makes matching all descriptors a non trivial task that can take lots of time. To decrease the time spent matching, approximate search [35] is used instead of exhaustively looking for the best match. Approximate search can be done by arranging all data into a k-d tree (k-dimensional tree) [4]. Arranging the data in a tree structure makes it easier to only search data that is similar. Starting from the top of the tree, similar data can be found by following the path down the tree which has the best matches. To further ensure that good matches are found it is common to use Lowe´s ratio test [33] to discard matches that are probable to be false. The ratio test takes the ratio of the distance between the first and second nearest neighbors of the descriptor to be matched and evaluates if the ratio is less than a pre-defined threshold. The first nearest neighbor is the most similar descriptor to the matched descriptor while the second nearest neighbor is the second most similar descriptor. This is expressed as |di − dN1|2 |di − dN2|2 ≤ tol (2.12) where di is the descriptor to be matched, dN1 and dN2 are the first and second nearest neighbor matches and tol is the predefined tolerance. A match passes the ratio test if the ratio is below the threshold. Employing the ratio test ensures that matched descriptors are more similar to each other than to other descriptors. This filters out matches that are unlikely to be correct. Creating a point cloud model can be done in two ways, incrementally or globally. Incremental SfM [53] adds one image at a time to the reconstruction, incrementally expanding the model. Global SfM [41] estimates all camera positions at the same time by first constructing a graph connecting cameras viewing the same scene. Only incremental SfM will be explained here. 12 2. Theory Incremental SfM first finds a good starting image pair from which an initial model can be created. Given this initial model, one by one images are added to the model. Each new image has to see n 3D points already in the model for a camera pose to be computed. The number of points depend on how much information is available about the camera and also determines which PnP solver to be used [67]. New point correspondences can then be triangulated and added to the point cloud. As more images are added, small errors will accumulate distorting the model. It is therefor common to perform optimization on both 3D point and camera poses with set intervals. Optimization or more specifically bundle adjustment [62] is performed to minimize growing errors by shifting the position of 3D points and, depending on how much information about the camera is known, tuning the projection matrix. Localization error is usually quantified as the reprojection error of 3D points. It is computed by taking the square of all 2D image points subtracted by their respectively reprojected 3D point correspondence as e = n∑ i=1 m∑ j=1 ∣∣∣∣∣ ∣∣∣∣∣ ( xij − p1 i Xj p3 i Xj , yij − p2 i Xj p3 i Xj )∣∣∣∣∣ ∣∣∣∣∣ 2 (2.13) where [ xij yij ] is the 2D image point in camera i corresponding to 3D point j. pk i , k = 1, 2, 3 are the rows of the projection matrix P for camera i in Equation (2.5). Unfortunately, minimizing Equation (2.13) is not simple as this is a non-linear least squares problem. Therefor gradient decent methods are used for local optimization. 2.4 The Generalized Camera The generalized camera describes image observations as rays of light. These rays have two components, one vector for the direction in which the ray is pointing and and a point which lies on the ray. This parameterization is not unique to generalized cameras and can be used for pinhole cameras as well. Though, there are distinct differences between them. A pinhole camera has a center of projection/camera center. In Figure 2.6 this would be Ci, i = 1, 2, 3. A center of projection is where all rays intersect. A generalized camera does not have this constraint. In theory, every single ray could have its own center of projection. Another distinguishing feature is the lack of individual camera coordinate systems. For normal multi- camera systems such as the one in Figure 2.6, the pose of each individual camera has to be computed separately because the same rotation and translation will not transform 3D point correctly into all different camera coordinate systems. If the system is instead formulated as a generalized camera (Figure 2.7), all rays will be in the same coordinate frame and a single transformation is needed to find a solution. 13 2. Theory Figure 2.6: A graphical visualization of multiple cameras viewing the same points. 2.4.1 Pose estimation using the Generalized Camera As mentioned before, generalized cameras represent image observations as rays of light. A ray-point correspondence can be expressed as sqi + αir̄i = Rri + t, i = 1, 2, 3, ... (2.14) where s, R, t is the scale, rotation and translation for the sought transformation from world coordinates into the generalized camera coordinate system [58]. The ray has ray origin qi and ray direction r̄i. αi stretches the rays to meet the 3D point ri matched to ray i such that αi = ||Rri + t− sqi||. The rotation can be parameterized using Cayley-Gibbs-Rodriguez to be represented with three unknowns. Translation plus scale are four more unknowns. When the query is constructed it is arbitrary scaled. Since the task is to localize in regards to the model, the scale needs to be solved for. For each correspondence an α also has to be solved. This result in 7 + n total unknown variables, where n is the amount of points used. The problem of finding a transformation which lets 3D points coincide with rays can be expressed Figure 2.7: A graphical visualization of how the cameras in Figure 2.6 could be represented as a single generalized camera. 14 2. Theory as minimizing the cost function C(R, t, s) = n∑ i=1 ||r̄i − 1 αi (Rri + t− sqi)||2 . (2.15) We have one slight problem. For every correspondence added a new equation is introduced but so is a new unknown variable. Fortunately, it is possible to exploit linear dependence between unknowns and rewrite the equation in terms of only the rotation R. Equation (2.14) can be rewritten as  r̄1 q1 −I . . . ... ... r̄n qn −I  ︸ ︷︷ ︸ A  α1 ... αn s t  ︸ ︷︷ ︸ x =  R . . . R  ︸ ︷︷ ︸ W  ri ... rn  ︸ ︷︷ ︸ b (2.16) ⇔ Ax = Wb (2.17) where x contain all linearly dependent variables to be eliminated. Manipulating Equation (2.17) it is possible to express x as x = (ATA)−1ATWb = US V Wb (2.18) where U,S,V are constant matrices constructed from (ATA)−1AT . By expressing t, s, α in terms of U, S, V,W, b it is now possible to rewrite Equation (2.14) and (2.15) to only include the unknown rotation. There are now more equations than unknowns and it is possible to find a least squares solution. Also observe that the number of unknowns now is independent from the number of observations making it scalable in terms of number of points used by the solver. For a more detailed explanation, read [58]. 2.5 Random Sample Consensus Random Sample Consensus (RANSAC) [15] is a method for picking random data samples in a way that ensures an outlier free subset of data to a certain degree. In many optimization tasks, the data is not perfect. It contains noisy measurements and outliers. Optimizing for all points can then be challenging. Outliers can be far off from a correct measurement, contributing with large errors to the cost function, drowning the effect of errors from correct measurements. The RANSAC method is widely used to try and find a subset of data that only contains inliers. Given a set of data points and a solver requiring s points to compute a solution, RANSAC randomly picks a subset of s data points N number of times. The idea is that in a set of points some will be outliers and some will be inliers. It is desired 15 2. Theory to pick a subset of points which is outlier free, resulting in the computation of a correct solution. The number of times a subset has to be picked is described by N = log(1− p) log(1− (1− e)s) (2.19) where N is the number of iteration, s is how many points are needed to find a solution, e is the ratio between outliers and all matched points and p represents the desired probability of finding the best model. The outlier ratio e is hard to determine and scarcely available. It can therefor be necessary to update the distribution based on inlier results calculated during runtime. The process of calculating inliers is done every iteration of RANSAC to evaluate if the subset of points picked resulted in a good solution. To determine if a 3D point is an inlier, the point is reprojected using (2.3). Then the distance between the reprojected point and the image feature which the point is matched against is computed. If this distance is smaller that a pre-determined threshold, it is considered an inlier. This procedure is done for all points and will give the number of inliers for a given solution. The number of inliers is used to calculate a new outlier ratio and is then inserted back in (2.19) to calculate the number of iterations needed. 2.5.1 RANSAC with Td,d test The Td,d test [36] is a small addition to the normal RANSAC procedure. In normal RANSAC each solution will be evaluated on all data points. This is computationally expensive and shown to be unnecessary. Instead, the Td,d test uses a very smaller subset d of data points for a first preliminary evaluation. If the preliminary evalu- ation suggests that a good solution has been found by all points in d being inliers, then all other data is also evaluated to determine if a new best solution has been found. The number of iterations needed can be formulated as picking s + d inlier points as N = log(1− p) log(1− (1− e)s+d) (2.20) where d is the number of points to evaluate the solution on. It was shown in [36] that choosing d << N is enough to confidently tell if a solution is correct or not. 16 3 Method The task at hand is to create a localization pipeline using a 3D structure-based method with the properties of a sequential-based method. By using a 3D-point model and a query image sequence, multiple camera pose estimates are computed and evaluated. The pipeline is built with a modular structure in mind so that if desired, a part can be swapped without influencing the remaining modules. Given a query of sequential images and a model, a camera pose estimations are computed for every image in the sequence. This chapter will cover how the pipeline is constructed as shown in figure 3.1. Figure 3.1: This is an overview of the pipeline. Given a sequence of images a query reconstruction is created. From the reconstruction, a 3D descriptor is computed and matched against the models 3D descriptors. Query reconstruction is a collective term for the process of computing a 3D SfM reconstruction. Given a set of images, features are extracted from all images. All features are matched against each other and then computed into a Structure from Motion reconstruction. Feature extraction, feature 2D-3D matching and model gen- eration is done with COLMAP [53]. Since COLMAP outputs descriptors for 2D features and not for the reconstructed 3D points, a descriptor for each 3D point is calculated by creating an elementwise mean descriptor from all feature descriptors mapped to each 3D point. The 3D descriptor is then matched with the precom- puted model 3D descriptor where a ratio test is used for outlier rejection. 3D-3D 17 3. Method matching is done using a kd-tree search algorithm for nearest neighbor matching. Using the matched points, a camera pose is computed using the generalized camera solver gDLS [58]. It is also possible to include compression in the pipeline. Either the pipeline is run from start to finish with an added step of compressing the model point cloud before the matching step or using previous query reconstruction to only re-run the descrip- tor matching and camera pose estimation steps in Figure 3.1 with a compressed point cloud for the model. No implementation is available for running compressed descriptors. Below we will cover each module in detail starting with query reconstruction followed by computing 3D descriptors, matching 3D descriptors, camera pose estimation, compression of 3D model and the special case of using a sequence length of one. 3.1 Query Reconstruction The query is reconstructed using COLMAP [53]. Given a sequence of images COLMAP extracts feature points and 128-dimensional descriptors that describes the respective point. COLMAP implements SIFT [33] to extract feature descriptors in each image. This process is represented as Feature Extraction in Figure 3.1. The data is then processed in Feature Matching by matching the descriptors between the images, illustrated in Figure 2.4. The final step of the reconstruction is computing the Structure from Motion model using the matched data. Using the matched fea- tures between images 3D points are triangulated, resulting in a 3D reconstruction and estimates of the cameras pose for each query image in the query-coordinate frame obtained from the local reconstruction. The query-coordinate frame will from here on be referred to as query frame. The resulting reconstruction is defined up to an arbitrary scaling factor in its own query frame. As a last step, all 2D features which are not assigned to a 3D point after the COLMAP reconstruction will be assigned to a 3D point on the image plane and transformed into the query model using the poses estimated from SfM. If only the points that were triangulated from SfM were to be used in the estimation of the generalized camera pose, it would result in to few rays and the solver would estimate significantly less accurate camera poses. As the solver utilizes ray origins and ray directions the depth of a point is not needed. However, the COLMAP reconstruction is still necessary to determine all camera poses in the query frame. 3.2 Compute 3D Descriptors The resulting data from the reconstructions provides 3D points in query coordinate frame. The model and query are defined in their own frames from here on referred to as world and query coordinate frame. These 3D points are only described in space (X,Y,Z) and do not have a descriptor assigned to them. This is why the 3D descriptor d̄ is computed, by taking the mean of the feature point descriptors di that have been matched to the 3D point. The resulting descriptor d̄ = 1 n ∑n i=0 di 18 3. Method is assigned to the 3D point. The mean is used since the corresponding feature descriptors should be similar due to matching. This results in a 128-long vector describing the 3D-point. This is done for the query and model, where the model is computed beforehand. 3.3 Matching 3D Descriptors With 3D descriptors created for both query and model the task is to find matches between these two models. We use the existing library FLANN, Fast Library for Approximate Nearest Neighbors [35]. FLANN is used to find the most similar model descriptor to all query descriptors, however due to the amount and dimensionality of descriptors, a kd-tree search is implemented to reduce computation time. FLANN outputs the indexes that correspond to the most similar descriptors in the model and the distance between the descriptors, which is the vector norm of the difference between the query d̄i and model dNN descriptors. Since there is a possibility that some matches are outliers Lowe’s ratio test [33] is implemented as described in Equation (2.12). 3.3.1 Implementation The FLANN library was chosen as it is highly customizable, in our method a kd-tree search is used. The default settings are used, with the exceptions of trees changed from 1 to 10 and branches from 32 to 100. FLANN is set to search for the two nearest neighbors. The tolerance used in Lowe’s ratio test is 0.75. 3.4 Camera Pose Estimation Using the matched data and the camera poses from the query reconstruction, the final step is to compute the camera pose estimation in world coordinate frame. For this we use primarily the gDLS (generalized Direct Least Squares) solver [58] and in the unique case when the image sequence is one image, the DlsPnp (Direct least squares Perspective n Point) [21] solver is used. The solvers implementations in Theia SfM library [57] are used. To handle outliers in the matches RANSAC with Td,d test [37] is implemented. 3.4.1 Implementation To use the solvers the matches needs to be converted into ray origins and ray direc- tions. This is done by taking the difference between the point and camera center qi,j = pi − Cj (3.1) where qi,j is the ray origin, pi is the 3D point coordinate and Cj is the camera center/ray origin, the light ray is then normalized. At least 4 light rays and 4 ray origins are necessary to solve with both gDLS and DlsPnp. In the description for the solver in the Theia SfM library [57] it says "Theoretically, up to 27 solutions 19 3. Method may be returned, but in practice only 4 real solutions arise". Each solution contains a scale s, rotation R and translation t representing a similarity transform between query frame and world frame. Since there are outliers in the matching RANSAC with Td,d test [37] is implemented to find the best solution. For a set amount of iterations RANSAC samples five light rays and their ray origins where four of these light rays and ray origins are used to compute solutions using the gDLS solver. The resulting solutions are then applied to the fifth rays corresponding 3D point, the point is transformed and reprojected down onto the image. If the reprojection error is less then a set tolerance (we used 10 pixels) the light ray and ray origin is considered good enough for the solution to be evaluated on all rays. By transforming the query point cloud and reprojecting all points we evaluate how many inliers there are. The solution that produces the most inliers after all RANSAC iterations will be presented as best_solution. The choice of using RANSAC was due to the amount of light rays and ray origins, to decreased the computation time for each iteration RANSAC with Td,d test is used. However since the amount of sampled light rays increase, based on Equation 2.19 the amount of iterations necessary for a given probability of picking an outlier free subset is also increased. After the iterations, the best_solution defined as (s, R, t) is then applied to the camera pose estimates (Rq, tq) in query frame, resulting in the desired output (Rest, test) where Rest = RqR and test = Restt+ 1 s tq. The code used for computing the camera poses are presented as pseudo-code in Algorithm 1. Algorithm 1 Solver with RANSAC 1: Convert matches into light rays and ray origins 2: for Amount of RANSAC iterations do 3: Sample 5 random query rays and corresponding model 3D points 4: Use rays 1-4 in gDLS 5: for all solutions do 6: Apply solution to fifth point 7: Compute reprojection error of transformed fifth point 8: if reprojection error ≤ threshold then 9: inliers = 0 10: for all query points do 11: Apply solution 12: Compute reprojection error 13: if reprojection error ≤ threshold then 14: inliers = inliers + 1 15: if inliers > best_inliers then 16: update best_inliers 17: update best_solution 18: Apply best_solution to query 20 3. Method 3.5 Compressing 3D Point Cloud When compressing the 3D model the total amount of 3D points are reduced. We specify the compression as how many points are left compared to the original amount. For example if there originally are one hundred points and after com- pression only ten are left, this would be a compression rate of ten percent, thus reducing the necessary static storage for the model into 1/10 of the original size. It would be preferable if this percentage could be specified an input to the compression algorithm but it introduces some unwanted problems. When compressing data the goal is to preserve as much information as possible in a small set of all data. We could simply pick the first x% of points based on how many images observe the point as a measure. However, this could lead to the points being picked all ending up in the same region of the model. This could happen if there are more images depicting the same scene or if one scene contain more structural variance in the model dataset. This would result in a compressed model with partial or no coverage for some areas. We prevent this from happening by instead specifying a minimum amount of points that all images must see, similar to the method by Li et al. [31]. First, points are sorted and picked after how many images they are present in. As points are being picked and an image reaches the minimum number of points required, it will no longer count toward the number of images a point is present in and all points are resorted. This continues until all images can see the minimum amount of points or all points have been evaluated since it’s possible that some images view less then the minimum requested amount of points. 3.6 Special case: One Image When using a single image COLMAP can not create a query reconstruction. Instead, a 3D point is created for every image feature. This point is placed in the image plane, a z-distance of one, and the same x and y pixel value as the image feature. It is then transformed with the inverse calibration matrix to be placed in the query coordinate system. Using one images also results in all rays having the same ray origin. gDLS can not find a solution if all rays have the same origin. Instead DlsPnP is used. 21 3. Method 22 4 Results In this chapter we will present our localization results for different sequence lengths as as well as how overlapping images and compressed data affect localization using generalized cameras. 4.1 Datasets To verify that our pipeline achieves significant results we have evaluated our pipeline on the Cambridge Landmarks dataset [29] as it contains scenes of different sizes and is widely used in the visual localization community. All scenes in the dataset were recorded roughly under the same weather conditions. 4.1.1 Cambridge Landmarks St Mary´s Church In the St Mary’s Church dataset all images come in a number of image sequences. It contains a prebuilt model reconstructed from images intended to be used for training of neural networks. It also contains test images that have not been used to create the model. In total there are around 1500 training images and 500 test images to evaluate on. As St Mary’s Church is a small dataset of a structure with well defined features it is used as a first benchmark to determine if a localization algorithm is able to localize images in a relatively easy scene. 4.1.2 Cambrige Landmarks Street The Street scene shows King’s Parade street in Cambridge. It contains roughly 3000 training images which the included model is reconstructed from. There are also nearly 3000 test images. The training images are divided into four sequences; west, east, north and south. In each sequence the camera is only pointed in the direction of the sequence name. The test images are instead taken by walking up and down King’s Parade constantly rotating. Street poses a more challenging localization problem as it features many repeated features over a larger area and a greater variety of depth at which structure are located. 4.2 Evaluation We evaluate our pipeline implementation on the above described datasets. The re- sults will be presented in two ways. First to evaluate the impact of different sequence 23 4. Results length our results from the St Mary’s and Street will be presented in the same way as [51] to observe the trends of increasing sequence length. Our results will also be presented as median positional and orientation error of the pose estimation to com- pare results against other camera localization methods [52]. More specifically we will compare against some deep learning approaches, PoseNet with geometric loss function [28], MapNet [6], AnchorNet [47]; and some structure based approaches, DSAC++ [5], and ActiveSearch [50] as well as the image retrieval method Den- seVLAD [61] as this will be a good selection of methods highlighting localization results from different approaches. Additionally to this we will evaluate the impact of using a compressed 3D point model with the initial assumption that by using more images the point cloud can be compressed to a fraction of its initial size with- out significantly impacting localization results. In [51] each image is placed in bins depending how accurately it was localized in posi- tional error computed by the Euclidean distance and the orientation error (Xm,Y o). There are three bins with thresholds (0.25m, 2o), (0.5m, 5o) and (5m, 10o) labeled fine-, medium- and coarse-precision. An image is placed in all or none of the bins depending on how accurately it was localized. We will measure pose accuracy as the absolute orientation and position error between the estimation and the ground truth. The orientation error |α| is measured by standard practice [19] and is com- puted as 2 cos |α| = trace(R−1 gt Rest) where |α| is the minimum rotation angle to align the estimated rotation matrix Rest to the ground truth rotation Rgt. The position error is measured as the Euclidean distance between the ground truth camera origin cgt and cest the estimated camera origin, ||cest − cgt||2. The localization results for our pipeline on the Street and St Mary’s dataset shown in bins can be seen in Table 4.1 where different sequence lengths are evaluated on the two datasets. Of note is that the St Mary’s dataset contains three sequences of images with two of them consisting of less than a hundred images. When localizing images with sequence lengths of eighty five and above, the entire sequence was then localized at the same time. There is a clear trend between sequence lengths where results improve with each step up in sequence length until a certain point where results start to fluctuate up and down. For Street this happens for longer sequence lengths than St Mary’s. For both datasets the use of sequences vastly outperforms the results from the single image solver. However for some sequence lengths the single image case performs better. This is likely the result of some sequences not being reconstructed and therefor the images in the sequence are not localized. In Table 4.2 the results are presented as median position and orientation error to be compared with state of the art localization results presented in [52]. We manage to outperform the best Street localization result in position while our pipeline is only surpassed by ActiveSearch in orientation error for a sequence length of eighty. This is presented in Table 4.6 and visualized in Figure 4.1 and 4.2. An advantage of localizing sequences of images is that we can overlap sequences with each other thus the images exist in multiple queries. This increases the likelihood of the image existing in a reconstruction and therefor being localized. When presented with two or more poses we take the one with the most inliers, computed by pro- 24 4. Results Street St Mary’s m deg .25/.50/5.0 2/5/10 .25/.50/5.0 2/5/10 1 12.8/23.4/37.9 0/3.9/97.8 5 2.2/5.7/25.2 5.5/15.8/72.3 10 4.3/9.9/33.7 9.8/27.1/89.8 15 4.0/9.0/26.8 8.1/23.1/89.3 20 2.8/8.6/33.5 4.9/18.6/88.7 25 4.1/11.7/34.5 17.3/34.5/88.7 30 4.3/12.8/38.2 15.0/38.2/94.5 35 7.4/17.4/42.0 16.2/46.2/90.6 40 5.2/11.3/20.2 23.1/54.5/98.3 45 11.3/26.1/56.3 21.2/48.5/99.2 50 10.8/23.6/43.0 27.8/60.0/99.4 55 6.6/19.1/46.5 35.3/69.4/99.6 60 8.7/21.7/51.7 25.4/59.2/91.5 65 5.3/17.1/46.5 25.2/62.2/99.4 70 5.6/19.8/52.9 24.2/62.8/99.1 75 13.1/29.5/62.0 28.9/69.0/99.2 80 11.0/30.3/67.1 28.4/64.5/99.1 85 11.3/29.0/56.7 28.9/63.3/99.4 90 11.1/21.9/48.2 20.5/54.3/95.3 95 12.6/28.6/60.5 20.7/61.3/99.2 100 10.7/23.9/52.4 22.4/63.0/98.9 Table 4.1: This table shows localization results on the Street and St Mary’s datasets for different sequence lengths where all images are assigned to bins de- pending on how accurately they were localized. Presented is the percentage of images clearing the threshold for each bin. In bold are the best localization results for Street and St Mary’s dataset respectively. 25 4. Results Street St Mary’s Median position [m]/ orientation [o] error Median position [m]/ orientation [o] error 1 17.690/68.290 0.82/1.05 5 30.249/54.616 2.11/2.07 10 15.911/38.968 0.91/1.44 15 16.190/41.723 1.03/1.43 20 19.448/38.515 1.19/1.75 25 6.601/11.711 0.67/1.22 30 11.087/14.759 0.66/0.99 35 4.521/8.274 0.54/1.16 40 3.443/12.327 0.43/0.80 45 3.077/6.326 0.51/0.84 50 3.246/6.570 0.40/0.71 55 1.956/3.302 0.35/0.60 60 1.459/4.147 0.40/0.76 65 1.945/4.068 0.39/0.82 70 1.586/4.983 0.40/0.77 75 0.968/2.066 0.36/0.69 80 0.82/2.458 0.39/0.81 85 0.853/2.530 0.39/0.71 90 2.856/5.467 0.46/0.89 95 1.096/3.386 0.40/1.10 100 1.486/3.390 0.40/0.84 Table 4.2: Localization results from our pipeline on Street and St Mary’s datasets shown as median position and orientation error. jecting the matched points onto the query image. In Table 4.4 we see that by using overlapped sequences we improve our previous best results by over ten percentage points in all bins compared to not using overlapped sequences. The best median position and orientation error is also halved in Table 4.5. When comparing our pipeline on the St Mary’s dataset, Figure 4.1, our method preforms great in comparison to PoseNet, MapNet and DenseVLAD, but are out- performed by ActiveSearch and DSAC++. As for the Street dataset, Figure 4.2, our method outperforms both AnchorNet and DenseVLAD, using sequences with no overlap our method reaches the same level of accuracy in position error as ActiveSearch but is outperformed in orientation error. Using overlapping sequences our method reaches an equal oritentional accuracy as that of ActiveSearch and outperforms all methods in positional accuracy. In Table 4.3 we have also evaluated if our localization results are correlating with the reconstruction errors from COLMAP. If these are compared to the localization 26 4. Results Street St Mary’s Median position error [m] Median position error [m] 5 15.8 1.3 10 14.3 1.8 15 65.7 1.6 20 29.9 2.5 25 12.4 2.1 30 20.3 1.8 35 13.8 1.8 40 19.2 1.8 45 6.7 3.4 50 16.8 2.1 55 9.1 1.9 60 6.4 1.9 65 16.2 2.4 70 6.5 2.5 75 3.6 2.2 80 6.7 2.7 85 8.2 3.6 90 11.3 5.0 95 9.3 3.9 100 13.3 3.1 Table 4.3: Pipeline reconstructions compared to ground truth poses for recon- struction drift evaluation. The camera with the largest pose drift is selected for each sequence. Reconstruction drift is then determined as the median of these val- ues for each sequence length. 27 4. Results Street - Overlap Street m deg .25/.50/5.0 2/5/10 .25/.50/5.0 2/5/10 10 9.1/22.61/54.47 4.3/9.9/33.7 20 8.52/23.20/60.90 2.8/8.6/33.5 30 10.26/25.08/61.31 4.3/12.8/38.2 40 14.27/31.95/64.49 5.2/11.3/20.2 50 19.30/39.28/73.25 10.8/23.6/43.0 60 22.20/44.13/75.13 8.7/21.7/51.7 70 23.30/46.9/82.83 5.6/19.8/52.9 80 21.35/46.73/80.6 11.0/30.3/67.1 90 24.50/49.74/85.63 11.1/21.9/48.2 100 29.73/54.36/87.92 10.7/23.9/52.4 Table 4.4: This table shows localization results for the Street dataset where a sequence overlaps the previous sequence with n/2 images if the sequence length is n. results we see that reconstructions for Street have reconstruction errors roughly in- creasing and decreasing in a similar pattern as the localization errors for different sequence lengths. This does however not seem to be the case for St Mary’s Church as the smallest reconstruction errors are for shorter sequence lengths in comparison the sequence lengths achieving the best localization results. Except for the sequence length five, the reconstruction errors are similar until the start to grow for longer sequence lengths. The observed trends indicate that reconstruction errors is a fac- tor that can’t be ignored if one want to achieve good localization results for larger datasets. 4.2.1 Results Model compression When testing our pipeline on compressed 3d point clouds as described in Section 3.5 we never managed to get results even in the vicinity of our other localization results. Our tests were performed on the Street dataset which 3d point cloud had been compressed to 50%. To try and get good matches with the compressed point cloud different values between 0.3 and 0.8 for the threshold in Lowe’s ratio test was tested but none of the values tried managed to achieve presentable localization results. 28 4. Results Street - Overlap Street Median position [m]/ orientation [o] error Median position [m]/ orientation [o] error 10 2.750/3.210 15.911/38.968 20 1.730/2.810 19.448/38.515 30 1.605/3.042 11.087/14.759 40 1.163/2.312 3.443/12.327 50 0.726/1.794 3.246/6.570 60 0.620/1.419 1.459/4.147 70 0.541/1.239 1.586/4.983 80 0.555/1.382 0.82/2.458 90 0.500/1.164 2.856/5.467 100 0.435/0.983 1.486/3.4 Table 4.5: Same localization results as Table 4.4 but shown as median position/ori- entation error. Median error Street St Mary’s position [m] orientation [o] position [m] orientation [o] PoseNet 20.3 25.5 1.57 3.32 MapNet - - 2.00 4.53 DenseVLAD 5.16 23.5 2.31 8.00 ActiveSearch 0.85 0.8 0.19 0.54 DSAC++ - - 0.13 0.4 AnchorNet 7.86 24.2 1.04 2.69 Our method 0.82 2.46 0.35 0.60 Our method w. overlap 0.435 0.983 - - Table 4.6: This table shows the median errors of the state of the art localization pipelines, PoseNet [28], MapNet [6], DenseVLAD [61], ActiveSearch [50], DSAC++ [5], AnchorNet [47], DenseVLAD [61], that we compare against for the Cambridge Landmarks data sets 29 4. Results Figure 4.1: Comparison on the dataset Cambridge Landmarks: St Mary’s Church dataset [29] between our method and PoseNet [28], MapNet [6], DenseVLAD [61], ActiveSearch [50], DSAC++ [5]. Figure 4.2: Comparison on the dataset Cambridge Landmarks: Street dataset [29] between our method and AnchorNet [47], DenseVLAD [61] and ActiveSearch [50]. 30 5 Conclusion A localization pipeline has been implemented that utilizes a generalized camera model for image localization. As the pipeline makes use of generalized cameras it can localize an entire image sequence at once which has become highly relevant in many modern localization applications that make use of video streams for capturing visual information. It achieves significant results on the Street dataset by localizing images more accurately than state of the art localization pipelines when using over- lapping sequences. On the St Mary’s dataset it perform as well but not better than other pipelines when using non overlapping sequences. This is likely due to that the dataset covers only one primary structure. So when using a short image sequence some queries will not be reconstructed and thus result in a decresed accuracy. By testing different sequence lengths it is clear that sequence lengths of fifty or more is desirable for accurate localization for this dataset. On the Street dataset a sequence length of eighty images was used to reach the peak accuracy. Using overlapping image sequences increased our pipelines accuracy and more accurate localization could be achieved for shorter sequences. This is likely due to the images existing in additional subsets and increases the likelihood that the image exists in a com- plete reconstruction and thus computed a camera pose estimate. When testing the pipeline in compressed datasets no presentable results were gathered. 31 5. Conclusion 32 6 Future Work In this thesis we have presented some results for camera localization with generalized cameras. The developed pipeline has a lot of potential for improvements as it is only a first implementation where the amount of optimization has been minimal. The gDLS solver is implemented in a way to maximize the likelihood of good results at the cost of long run times. We believe that the run time could be significantly shorter with a relatively small amount of adjustments and changes. The gDLS solver solutions did not receive any bundle adjustment which should improve results to some degree. We hypothesized that using a generalized camera could help in localization with compressed databases but we did not achieve any presentable results when testing our pipeline on a compressed model of the Street dataset. This could be explored further to narrow down the issue. It is possible that the fault is in our implementa- tion of selecting 3D points or that Street simply is a challenging dataset to achieve good results using model compression. As applications for localization using image sequences usually require real-time per- formance from the localization pipeline, the next step would be to create a pipeline capable of real-time localization and record benchmarks for comparison to other localization methods. We believe this could be possible by incrementally including new images and discarding old ones from a single query reconstruction. By doing so the SfM calculation would be significantly faster and previously good matches between the query and reference model can be utilized when calculating the new pose. 33 6. Future Work 34 Bibliography [1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In 2009 IEEE 12th International Conference on Computer Vision, pages 72–79. IEEE, 9 2009. [2] C. Aggarwal, A. Hinneburg, and D. Keim. On the Surprising Behavior of Dis- tance Metrics in High Dimensional Space. In J. Van den Bussche and V. Vianu, editors, Database Theory — ICDT 2001, pages 420–434. Springer Berlin Hei- delberg, Berlin, Heidelberg, 2001. [3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 6 2008. [4] J. L. Bentley. Multidimensional binary search trees used for associative search- ing. Communications of the ACM, 1975. [5] E. Brachmann and C. Rother. Learning Less is More - 6D Camera Localiza- tion via 3D Surface Regression. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018. [6] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz. Geometry-Aware Learn- ing of Maps for Camera Localization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018. [7] S. Cao and N. Snavely. Graph-Based Discriminative Learning for Location Recognition. 2013 IEEE Conference on Computer Vision and Pattern Recog- nition, pages 700–707, 2013. [8] S. Cao and N. Snavely. Minimal scene descriptions from structure from motion models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014. [9] G. Carrera, A. Angeli, and A. J. Davison. SLAM-based automatic extrinsic cal- ibration of a multi-camera rig. In Proceedings - IEEE International Conference on Robotics and Automation, 2011. [10] W.-T. Chu, X.-Y. Zheng, and D.-S. Ding. Image2Weather: A Large-Scale Image Dataset for Weather Property Estimation. In 2016 IEEE Second International Conference on Multimedia Big Data (BigMM), pages 137–144, 5 2016. [11] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Auto- matic Query Expansion with a Generative Feature Model for Object Retrieval. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 5 2007. [12] P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, and G. Ranzuglia. MeshLab: an Open-Source Mesh Processing Tool. In V. Scarano, R. D. Chiara, and U. Erra, editors, Eurographics Italian Chapter Conference. The Eurographics Association, 2008. 35 Bibliography [13] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-Time Single Camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 6 2007. [14] M. Feng, S. Hu, M. H. Ang, and G. H. Lee. 2D3D-MatchNet: Learning to Match Keypoints Across 2D Image and 3D Point Cloud. 2019 International Conference on Robotics and Automation (ICRA), pages 4790–4796, 2019. [15] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 2002. [16] J.-M. Frahm, K. Köser, and R. Koch. Pose Estimation for Multi-camera Sys- tems. In H. H. Rasmussen, Carl Edward B{\"u}lthoff, B. Sch{\"o}lkopf, and M. A. Giese, editors, Pattern Recognition, volume 3175, pages 286–293, Berlin, 5 2004. Springer Berlin Heidelberg. [17] M. Grossberg and S. Nayar. A general imaging model and a method for find- ing its parameters. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, pages 108–115, 2001. [18] C. Häne, L. Heng, G. H. Lee, F. Fraundorfer, P. Furgale, T. Sattler, and M. Pollefeys. 3D visual perception for self-driving cars using a multi-camera system: Calibration, mapping, localization, and obstacle detection. Image and Vision Computing, 2017. [19] R. Hartley, J. Trumpf, Y. Dai, and H. Li. Rotation Averaging. International Journal of Computer Vision, 103(3):267–305, 7 2013. [20] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2 edition, 2004. [21] J. A. Hesch and S. I. Roumeliotis. A Direct Least-Squares (DLS) method for PnP. In Proceedings of the IEEE International Conference on Computer Vision, 2011. [22] K. L. Ho and P. Newman. Loop closure detection in SLAM by combining visual and spatial appearance. Robotics and Autonomous Systems, 54(9):740–749, 9 2006. [23] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Compu- tation, 1997. [24] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2599–2606, 2009. [25] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. [26] M. Kaess and F. Dellaert. Visual SLAM with a Multi-Camera Rig. Technology, 2006. [27] F. Kahl. Lecture 3: Camera Calibration, DLT, SVD, 2018. [28] A. Kendall and R. Cipolla. Geometric Loss Functions for Camera Pose Regres- sion With Deep Learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7 2017. [29] A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In 2015 IEEE International Confer- ence on Computer Vision (ICCV), pages 2938–2946. IEEE, 12 2015. 36 Bibliography [30] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide pose estimation us- ing 3D point clouds. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). ECCV, 2012. [31] Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2010. [32] L. Liu, H. Li, and Y. Dai. Efficient Global 2D-3D Matching for Camera Lo- calization in a Large-Scale 3D Map. In Proceedings of the IEEE International Conference on Computer Vision, 2017. [33] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Inter- national Journal of Computer Vision, 60(2):91–110, 11 2004. [34] S. Lynen, T. Sattler, M. Bosse, J. Hesch, M. Pollefeys, and R. Siegwart. Get Out of My Lab: Large-scale, Real-Time Visual-Inertial Localization. In Robotics: Science and Systems XI. Robotics: Science and Systems Foundation, 7 2015. [35] Marius Muja and David G. Lowe. Scalable Nearest Neighbor Algorithms for High Dimensional Data. Pattern Analysis and Machine Intelligence, IEEE Transactions, 36(11):2227–2240, 2014. [36] J. Matas and O. Chum. Randomized RANSAC with Td,d test. In Image and Vision Computing, 2004. [37] J. r. Matas and O. r. Chum. Randomized RANSAC. In Proceedings of the CVWW’02, pages 49–58, 2002. [38] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt. Scalable 6-DOF lo- calization on mobile devices. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinfor- matics), 2014. [39] M. Milford, S. Lowry, N. Sunderhauf, S. Shirazi, E. Pepperell, B. Upcroft, C. Shen, G. Lin, F. Liu, C. Cadena, and I. Reid. Sequence searching with deep- learnt depth for condition- and viewpoint-invariant route-based place recogni- tion. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 18–25. IEEE, 6 2015. [40] M. J. Milford and G. F. Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In 2012 IEEE International Conference on Robotics and Automation, pages 1643–1649. IEEE, 5 2012. [41] P. Moulon, P. Monasse, and R. Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Proceedings of the IEEE International Conference on Computer Vision, 2013. [42] R. Paucher and M. Turk. Location-based augmented reality on mobile phones. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, CVPRW 2010, 2010. [43] R. Pless. Using many cameras as one. In 2003 IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition, 2003. Proceedings., vol- ume 2, pages 587–93. IEEE Comput. Soc, 2003. [44] D. Robertson and R. Cipolla. An Image-Based System for Urban Navigation. In A. Hoppe, S. Barman, and T. Ellis, editors, Proceedings of the British Machine Vision Conference, pages 1–84. BMVA Press, 2004. 37 Bibliography [45] A. Roshan Zamir and M. Shah. Accurate Image Localization Based on Google Maps Street View. In K. Daniilidis, P. Maragos, and N. Paragios, editors, Computer Vision – ECCV 2010, pages 255–268, Berlin, 2010. Springer Berlin Heidelberg. [46] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alterna- tive to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, 11 2011. [47] S. Saha, G. Varma, and C. V. Jawahar. Improved Visual Relocalization by Discovering Anchor Points. In British Machine Vision Conference 2018, page 164, 2018. [48] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys. Hyper- points and fine vocabularies for large-scale location recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2015. [49] T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys. Large-Scale Location Recognition and the Geometric Burstiness Problem. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1582–1590. IEEE, 6 2016. [50] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & Effective Prioritized Match- ing for Large-Scale Image-Based Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. [51] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla. Bench- marking 6DOF Outdoor Visual Localization in Changing Conditions. CVPR 2018, 2018. [52] T. Sattler, Q. Zhou, M. Pollefeys, and L. Leal-Taixe. Understanding the Lim- itations of CNN-Based Absolute Camera Pose Regression. In The IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR), 6 2019. [53] J. L. Schonberger and J.-M. Frahm. Structure-from-Motion Revisited. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [54] X. Shen, Z. L. Lin, J. Brandt, S. Avidan, and Y. Wu. Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3013–3020, 2012. [55] Sivic and Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proceedings Ninth IEEE International Conference on Computer Vision, 2003. [56] N. Sünderhauf, F. Dayoub, S. Shirazi, B. Upcroft, and M. Milford. On the Performance of ConvNet Features for Place Recognition. IEEE International Conference on Intelligent Robots and Systems (IROS), 1 2015. [57] C. Sweeney. Theia Multiview Geometry Library: Tutorial & Reference. http://theia-sfm.org. [58] C. Sweeney, V. Fragoso, T. Höllerer, and M. Turk. gDLS: A Scalable Solution to the Generalized Pose and Scale Problem. In Computer Vision – ECCV 2014, pages 16–31, 2014. [59] C. Sweeney, V. Fragoso, T. Hollerer, and M. Turk. Large scale SfM with the distributed camera model. In Proceedings - 2016 4th International Conference on 3D Vision, 3DV 2016, 2016. 38 Bibliography [60] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor Visual Localization with Dense Matching and View Synthesis. Proc. CVPR, 3 2018. [61] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1808–1817, 10 2015. [62] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle Adjustment — A Modern Synthesis. Workshop on Vision Algorithms, 1999. [63] S. Urban and S. Hinz. MultiCol-SLAM - A Modular Real-Time Multi-Camera SLAM System. arXiv preprint arXiv:1610.07336, 2016. [64] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization using LSTMs for structured feature correlation. ICCV, 2017. [65] Z. Xin, X. Cui, J. Zhang, Y. Yang, and Y. Wang. Visual place recognition with CNNs: From global to partial. In 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pages 1–6. IEEE, 11 2017. [66] B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for large-scale image- based localization. In Proceedings of the IEEE International Conference on Computer Vision, 2015. [67] Y. Zheng, Y. Kuang, S. Sugimoto, K. Astrom, and M. Okutomi. Revisiting the PnP problem: A fast, general and optimal solution. In Proceedings of the IEEE International Conference on Computer Vision, 2013. 39 Bibliography 40 Abstract List of Figures List of Tables Introduction Related Work Ways to localize images Point Cloud Compression Content of thesis Theory Camera Model Finding point coordinates and projection matrices A Structure from Motion Pipeline The Generalized Camera Pose estimation using the Generalized Camera Random Sample Consensus RANSAC with Td,d test Method Query Reconstruction Compute 3D Descriptors Matching 3D Descriptors Implementation Camera Pose Estimation Implementation Compressing 3D Point Cloud Special case: One Image Results Datasets Cambridge Landmarks St Marys Church Cambrige Landmarks Street Evaluation Results Model compression Conclusion Future Work Bibliography