Geometry Identification from Point Cloud Data
Master's thesis in Engineering Mathematics and Computational Sciences

MAITREYA DEEPAK DAVE

Department of Mathematical Sciences
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2021

Master's Thesis 2021
© MAITREYA DAVE, 2021.

Supervisor: Ludvig Friborg, Wiretronic AB
Academic Supervisor: Mats Rudemo, Department of Mathematical Sciences
Examiner: Aila Särkkä, Department of Mathematical Sciences

Department of Mathematical Sciences
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Part segmentation result on a point cloud representation of a lamp. Left: point cloud coloured with true labels; right: point cloud coloured with predicted labels.

Typeset in LaTeX, template by David Frisk
Printed by Chalmers Reproservice
Gothenburg, Sweden 2021

Geometry Identification from Point Cloud Data
MAITREYA DEEPAK DAVE
Department of Mathematical Sciences
Chalmers University of Technology

Abstract
The aim of this thesis is to conduct a comparative study of different algorithms which can learn from 3D (x, y, z) point cloud data and are able to perform per-point classification. The point clouds of interest for this thesis are those which represent a 360-degree view of an object. For example, the input is a point cloud representing a knife with two parts: the blade and the handle. For such an input, the algorithms must be able to classify which points of the point cloud belong to the blade and which to the handle. To solve this task, different types of deep-learning-based models, such as graph-based neural networks and non-grid point convolution networks, were implemented and analyzed. These models perform with over 80% accuracy, which is quite remarkable given that the input is only raw 3D coordinates with no need for voxelization. The models have also been tested on synthetic datasets as well as on an object scanned using a LiDAR camera. These algorithms can be applied in autonomous driving, augmented/virtual reality applications, medical data processing and the 3D animation industry.

Keywords: point cloud, part segmentation, PointNet, DGCNN, KPConv, 3D, deep learning, supervised

Acknowledgements
I would like to thank my supervisor Mats Rudemo and examiner Aila Särkkä for their inputs and guidance throughout the thesis. I would also like to thank Wiretronic AB for giving me this opportunity and aiding me throughout the thesis. Thank you to both these parties for supporting me throughout this thesis!

Secondly, I would like to thank Chalmers University of Technology for its academic resources and for providing support during these unprecedented times. The courses offered by Chalmers University of Technology have helped me gain valuable experience in both theory and practice. A special thanks to the Chalmers Centre for Computational Science and Engineering (C3SE) for providing me access to the computer hardware needed to perform this thesis.

Finally, I would like to thank my family and every close friend of mine for their continued unconditional support during the whole process, and for being there all the way through everything. Thank you, everyone!

Maitreya Deepak Dave, Gothenburg, June 2021

Contents
List of Figures
List of Tables
1 Introduction
1.1 Objectives
1.2 Previous Work
1.3 Structure of report
2 Background
2.1 Introduction to 3D Representations
2.1.1 Point Cloud Theory
2.1.2 Challenges with Point Clouds
2.2 Deep Learning Primer
2.2.1 Activation Functions
2.2.2 Multi-layer perceptron
2.2.3 Loss Function and Optimization
2.2.4 Convolutional Neural Networks
2.3 Introduction to 3D Deep Learning
2.3.1 Projection Networks
2.3.2 Point-wise Networks
2.3.3 Graph-based Convolution Networks
2.3.4 Point Convolution Networks
3 Methodology
3.1 Introduction
3.2 PointNet
3.3 PointNet++
3.4 Dynamic Graph CNN
3.5 Kernel Point Convolutions
3.6 Evaluation Metrics
4 Development
4.1 Hardware
4.1.1 Data Capture Hardware
4.1.2 Computation System
4.2 Dataset & Preprocessing
4.2.1 Synthetic Dataset
4.2.2 Test Data Collection
5 Results
5.1 Results on Synthetic Dataset
5.2 Results on Collected Test Data
6 Discussion
6.1 Invariance to geometric transformations
6.2 Input point cloud structure
6.2.1 Varying the number of input points
6.2.2 Missing points and outliers
6.3 Drawbacks
6.3.1 Mislabelling of true data
6.3.2 Scaling issues
7 Conclusion
Bibliography
A Appendix 1: Algorithms
A.1 Farthest Point Sampling
A.2 Ball Query Search
B Appendix 2: Test Data Collection Pipeline
B.1 RANSAC-based Plane segmentation
B.2 Point cloud fusion
B.2.1 ICP registration
B.2.2 Multiway registration
C Appendix 3: Intel Realsense L515 Operating Modes

List of Figures
2.1 Image depicting a camera taking images of an object from multiple viewpoints
2.2 Voxelization results on a point cloud representation of the Stanford Bunny
2.3 Mesh representation of the Stanford Bunny
2.4 Corresponding color and depth images captured using Intel's Realsense L515 Camera
2.5 SDF learned via DeepSDF [20] applied on the Stanford Bunny: (a) depiction of the decision boundary of the underlying implicit surface, where SDF < 0 and SDF > 0 mean inside and outside of the surface respectively, (b) 2D cross-section of the signed distance function, (c) rendered 3D surface recovered from SDF = 0
2.6 Point cloud representation of the Stanford Bunny
2.7 Perceptron
2.8 Visualization of Activation Functions: ReLU and Leaky ReLU
2.9 Network graph for a 3-layer perceptron
2.10 Convolution operation
2.11 Demonstration of max pooling in CNN
2.12 High level view of a simple CNN architecture
3.1 Network Architecture of PointNet
3.2 Network Architecture of PointNet++
3.3 Visual representation of the EdgeConv Operation of DGCNN
3.4 Network Architecture of Dynamic Graph CNN
3.5 Visualization Comparison: Left: Regular Convolution on 2D Image; Right: Kernel Point Convolutions on Un-ordered 2D Points
3.6 Visualization of kernel points, center and points of a point cloud in a neighborhood
3.7 Network Architecture of KPConv KP-FCNN: Kernel Point - Fully Connected Neural Network used for Segmentation
4.1 Visual results of test data collection process on object 1
5.1 Training results on ShapeNetPart: training dataset
5.2 Transformed view of point cloud with color information modified based on true labels and predicted labels for Test 1 of object 1
5.3 Transformed view of point cloud with color information modified based on true labels and predicted labels for Test 2 of object 1
5.4 Transformed view of point cloud with color information modified based on true labels and predicted labels for Test 1 of object 2
5.5 Transformed view of point cloud with color information modified based on true labels and predicted labels for Test 2 of object 2
5.6 Testing models on scanned data with varying number of input points
5.7 Untransformed view of point cloud with color information modified based on true labels and predicted labels for test object 1 (Number of points = 2048)
5.8 Untransformed view of point cloud with color information modified based on true labels and predicted labels for test object 2 (Number of points = 2048)
5.9 Untransformed view of point cloud with color information modified based on true labels and predicted labels for test object 1 (Number of points = 1024)
5.10 Untransformed view of point cloud with color information modified based on true labels and predicted labels for test object 2 (Number of points = 1024)
5.11 Untransformed view of point cloud with color information modified based on true labels and predicted labels for test object 1 (Number of points = 4096)
5.12 Untransformed view of point cloud with color information modified based on true labels and predicted labels for test object 2 (Number of points = 4096)
5.13 Orthogonal view of point cloud with color information modified based on true labels and predicted labels for test object 1 (Number of points = 4096)
5.14 Top view of point cloud with color information modified based on true labels and predicted labels for test object 2 (Number of points = 4096)

List of Tables
4.1 L515 operating modes
4.2 L515 device specifications
5.1 Training duration of all models on the ShapeNetPart training dataset
5.2 Comparison of the models on Test 1: untransformed test dataset and Test 2: transformed test dataset
5.3 Object-wise training and testing IoUs for KPConv on the ShapeNetPart dataset
C.1 Stream profiles supported by the L515 Depth Sensor
C.2 Stream profiles supported by the Motion Module
C.3 Stream profiles supported by the RGB Camera

1 Introduction

The geometric shape of an object can be described as what remains of the object's description when its scale, orientation and reflection are disregarded. This means that if we apply any of these operations to an object, the geometrical shape or structure of the object should remain the same and not turn into a new, distinct shape. In recent years, object identification with point cloud data as an input has turned into a growing topic of research, as it leads to many applications in vision perception, virtual/augmented reality, robotics, and medical imaging. The idea of this thesis is to investigate AI-based methods to segment out geometrical shapes from 3D point clouds with as little human intervention as possible. Tasks associated with 3D data are:

• 3D Geometry Analysis, which consists of tasks like object classification, segmentation of a given object or a scene, and finding correspondences between different but similar data.
• 3D Synthesis, which consists of tasks like reconstruction of a partial point cloud, automatic shape completion or predictive shape modelling.

1.1 Objectives
The objective of this thesis is to conduct a comparative study of different algorithms which segment geometrical shapes from point cloud data. By achieving this, one could use the results for further tasks like component verification, synthetic data generation, visualization and more.
The research questions to investigate are as follows:

Question 1
What possible deep learning based models exist to perform part segmentation of point clouds? How well do they perform?

Question 2
How efficiently do the models handle raw point clouds as an input? Since point cloud data can be obtained either from sensors directly or estimated from images, there can be quite a lot of variation in the point density, missing points, outliers and accuracy of the obtained point cloud. To answer this question, the models are trained on synthetically available data and tested both on a test dataset, which is an unseen subset of the training dataset, and on point clouds obtained by scanning an object using a LiDAR-based camera. The input point clouds during the training stage are obtained by uniformly sampling a 3D mesh made in a CAD program. However, when testing on scanned data, the point cloud has outliers, missing points and randomly jittered points which need to be taken care of.

Question 3
Are all geometric shapes recognized with the same accuracy? Which geometric shapes are easier to identify? The visualizations of predictions made on the test datasets will be analyzed.

1.2 Previous Work
Geometry identification from a point cloud can be seen as a point cloud segmentation task. There has been significant research in this area in recent years; the credit for this advancement goes to machine learning techniques [22, 23, 17, 33]. Methods before machine learning [18, 25] relied on hand-crafted features for specific tasks, but it may not always be easy to find the optimal set of features by hand. Deep learning models like [22] showed remarkable results in 3D object classification, part segmentation, and semantic segmentation of point clouds. These results motivated more research in this area, which resulted in the emergence of models based on 3D point-based networks [17, 31], graph-based networks [36, 34], networks based on projecting data to a different dimensional space [28, 1], and capsule networks [3]. A common approach in all the models has been down-sampling the input point cloud, as the cost of computation is directly proportional to the number of input points. Hence, there is an inevitable loss of information. The task of segmentation without downsampling a point cloud is called fine-grained segmentation of point clouds; while [37] delivered very good results for this task, it has been approached in a supervised manner, which makes it less practical for developing an application with. [9] provides a detailed account of deep-learning based methods for various point cloud related tasks.

1.3 Structure of report
The thesis begins with an introduction of the problem statement, the objective of this thesis, and the previous work done in this area. We start with the background, which introduces different types of representations of 3D data and looks in depth at point clouds, the focus of this thesis. Then we provide a basic overview of deep learning, followed by the different types of deep-learning approaches taken by researchers for 3D point cloud analysis. The methodology chapter then provides important details of each of the implemented models, explaining how they work. The development chapter introduces the hardware setup used to train the models, the hardware used to capture real-world test data, and how this captured data was processed. The results chapter presents numerical and visual results of training and testing the models.
The results are examined in the discussion chapter. Finally, the conclusion speaks of what was achieved and proposes possible future work directions.

2 Background

2.1 Introduction to 3D Representations
To represent 3D data, we have multiple representations. Here I briefly introduce a few of the most commonly used representations and then dive deeper into point clouds, which are relevant for this thesis.

Multi-view representation
A multi-view representation is usually a 2D image based representation method where multiple 2D images of the same object/scene are captured from multiple viewpoints of the 3D object and then processed with 2D convolutional neural network (CNN) models. Usually some kind of localization information about the capture device should be available, or possible to estimate, for accurate processing of these images. Multi-view representations are highly convenient because we already have state-of-the-art 2D CNN models which are capable of object detection and semantic segmentation. Based on this representation, networks like the one in [29] were developed to learn 3D knowledge of shapes from their 2D views.

Figure 2.1: Image depicting a camera taking images of an object from multiple viewpoints

Voxel Grid Representation
A voxel grid representation is a volumetric representation method; the representation is obtained by voxelizing a 3D mesh or a 3D point cloud. In the context of 3D data, we can start by representing the 3D data with a 3D bounding box, then define a voxel to be a rectangular parallelepiped (whose dimensions are much smaller than the original bounding box). A 3D grid of such voxels is used to match the dimensions of the bounding box. Each voxel has an occupancy value which tells us whether the voxel is within the original 3D data or not. Based on the resolution of our voxels we can have a fine or a coarse representation of our original data.

Figure 2.2: Voxelization results on a point cloud representation of the Stanford Bunny: (a) voxel grid of size 0.01, (b) point cloud downsampled using voxels of size 0.01, (c) voxel grid of size 0.05, (d) point cloud downsampled using voxels of size 0.05

This representation is a very well defined data structure and can be used directly with 3D CNN architectures, but the voxelization process requires high memory usage and suffers from information loss, which makes it inefficient.

Mesh Representation
A mesh is a surface-based representation method and one of the most commonly used ones in the 3D community. It consists of sets of vertices, edges and faces which all together define the shape and volume of a 3D object/scene. The faces in a mesh can be triangular (triangle mesh), quadrilateral (quad mesh) or even a convex polygon (n-gonal mesh). The advantage of the mesh representation is that it does not suffer from information loss like a voxel grid or a point cloud, but it is also an unnatural data structure as an input to a neural network.

Figure 2.3: Mesh representation of the Stanford Bunny

Depth Images
Depth images are mainly available as RGBD images, i.e. colour images with depth information obtained in the form of distance from the camera origin. This depth information is available for each pixel in the image. However, it is relatively difficult to create a model which can work accurately with only RGBD images as an input. Moreover, the data acquisition with respect to the depth information is sensitive to lighting conditions as well as the material of the objects present in the scene.

Figure 2.4: Corresponding (a) color and (b) depth images captured using Intel's Realsense L515 Camera
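To make the voxel-grid and depth-image representations above concrete, the following is a small sketch using the Open3D library (the same library used later in the test data collection pipeline). It is only an illustration: the file names, the voxel size and the PrimeSense default camera intrinsic are assumptions, not values used in this thesis.

```python
import open3d as o3d

# Voxel grid representation: voxelize a point cloud and build a voxel-downsampled
# version, as in Figure 2.2. The voxel size controls how coarse the grid becomes.
pcd = o3d.io.read_point_cloud("bunny.ply")                      # hypothetical input file
voxel_grid = o3d.geometry.VoxelGrid.create_from_point_cloud(pcd, voxel_size=0.05)
pcd_down = pcd.voxel_down_sample(voxel_size=0.05)

# Depth images: fuse a corresponding colour/depth pair (cf. Figure 2.4) into a point cloud.
color = o3d.io.read_image("color.png")                          # hypothetical input files
depth = o3d.io.read_image("depth.png")
rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(color, depth)
intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)
pcd_from_rgbd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
```

A larger voxel size gives stronger downsampling and a coarser occupancy grid, which is exactly the fine-versus-coarse trade-off described above.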
Signed Distance Function
A signed distance function (SDF) is an implicit representation method, i.e. it uses some function f to represent a 3D object. Mathematically, the SDF f of a given set ω determines the distance (magnitude) of a point β from the boundary of the set, along with a sign indicating whether the point is within the set or not. Hence, the function f encodes the surface (boundary) of a shape. While the SDF is a very interesting representation method, there has been little work done with it with respect to our problem defined in section 1.1. However, it can be an interesting representation to look into for reconstruction of the 3D object using the results of point cloud segmentation.

Figure 2.5: SDF learned via DeepSDF [20] applied on the Stanford Bunny: (a) depiction of the decision boundary of the underlying implicit surface, where SDF < 0 and SDF > 0 mean inside and outside of the surface respectively, (b) 2D cross-section of the signed distance function, (c) rendered 3D surface recovered from SDF = 0

Point Cloud
A point cloud is a set of 3D points distributed in 3D space. Each of these 3D points has a deterministic position denoted by a certain (x, y, z) coordinate and can have more information attributes like RGB colour values. The point cloud representation has clear advantages over multi-view representations, depth images and voxel grids because it preserves more high-quality geometric information without any discretization. Moreover, point clouds can be sampled directly from a mesh and can also be processed into a voxel grid if at all necessary. However, this representation loses the information about local connections between points which a mesh has. While point clouds may not be the most natural data structure to work on, they are relatively better than most other standard representations. In the next subsection we shall have a deeper look at point clouds.

Figure 2.6: Point cloud representation of the Stanford Bunny

2.1.1 Point Cloud Theory
A point cloud is a very simple data type; it is basically a set of points in a space. We define a point cloud P in 3D Euclidean space as follows:

P = \{ x_i \in \mathbb{R}^3 \}_{i \le N}    (2.1)

where N is the number of points in a given point cloud. Typically, a point cloud consists of only spatial coordinates (x, y, z), but it can include more information, like colour information in terms of RGB values for each point, or a normal for each point expressed as (n_x, n_y, n_z). In that case, we can define P as follows:

P = \{ (x_i, f_i) \mid x_i \in \mathbb{R}^3, \; f_i \in \mathbb{R}^D \}_{i \le N}    (2.2)

where f_i represents the D-dimensional additional point feature vector. Since point clouds are just collections of unordered points, they have no information about the connectivity of the points.
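As a small illustration of equations 2.1 and 2.2, a point cloud can be stored as a plain N × 3 array, optionally extended with a D-dimensional feature vector per point; the values below are arbitrary.

```python
import numpy as np

N, D = 5, 3
xyz = np.random.rand(N, 3)            # spatial coordinates (x, y, z)
rgb = np.random.rand(N, D)            # optional per-point features f_i, here RGB values
P_plain = xyz                         # P = {x_i in R^3}, shape (N, 3)
P_feat = np.hstack([xyz, rgb])        # P = {(x_i, f_i)}, shape (N, 3 + D)
print(P_feat.shape)                   # (5, 6)
```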
2.1.2 Challenges with Point Clouds
When working with point clouds, one has to deal with new challenges which have not been addressed before when working with neural networks. Below, we list some of the challenges encountered when working with point cloud data and their possible solutions.

Varying number of points:
When working with 2D CNNs, the input to such models is well defined, structured data. For example, in tasks like image classification, images of varying size are preprocessed before being fed as an input to the CNN model. The preprocessing step can involve resizing all the images to the same size; hence the same sized pixel arrays will represent every image. In the case of point clouds, we may not have the same number of points representing an object. The simplest solution would be to sample a fixed number of points from the point cloud. However, the source of the point cloud as well as the sampling strategy used plays an important role in which points are sampled, and will eventually affect our model. The most commonly used sampling methods are uniform sampling and farthest point sampling (FPS). The objective of FPS is to sample a fixed number of points Q from a set of N points, such that all the sampled points are farthest from each other. FPS has a certain randomness and non-uniformity associated with it, whereas uniform sampling works by creating a 3D voxel grid from the point cloud data, after which the points closest to the voxel centers are selected.

Irregularity (unorderedness):
Point cloud data is an unordered set, which means that the system does not know which point has a higher preference, or whether all the points have the same level of importance. Here, we define the concept of a permutation π, a bijective function of a point cloud mapping the index set onto itself. Let the original point cloud be defined as follows:

P = \{ x_1, x_2, x_3, \dots, x_N \}    (2.3)

By using the permutation

\pi : \{1, 2, 3, \dots, N\} \longrightarrow \{1, 2, 3, \dots, N\}    (2.4)

we obtain the reordered point cloud

P_\pi = \{ x_{\pi(1)}, x_{\pi(2)}, x_{\pi(3)}, \dots, x_{\pi(N)} \}    (2.5)

If we have a function f such that for all permutations π we have

f(x_1, x_2, x_3, \dots, x_N) = f(x_{\pi(1)}, x_{\pi(2)}, x_{\pi(3)}, \dots, x_{\pi(N)})    (2.6)

then such a function f is called a symmetric function. Examples of such functions are max, min, sum, average and product. Here, we give a simple example of using max (column-wise maximum) applied on a point cloud P and P_π. P consists of 5 points, represented by (x, y, z) coordinates and no additional features like color or normal information, and P_π represents a reordered version of P. For ease of viewing, the maximum values in each column are shown in bold.

P = \begin{pmatrix} 0.1 & 0.2 & \mathbf{0.5} \\ 0.5 & 0.1 & 0.2 \\ 0.1 & 0.2 & 0.1 \\ 0.2 & \mathbf{0.5} & 0.4 \\ \mathbf{0.7} & 0.3 & 0.3 \end{pmatrix}
\qquad
P_\pi = \begin{pmatrix} 0.2 & \mathbf{0.5} & 0.4 \\ 0.5 & 0.1 & 0.2 \\ \mathbf{0.7} & 0.3 & 0.3 \\ 0.1 & 0.2 & \mathbf{0.5} \\ 0.1 & 0.2 & 0.1 \end{pmatrix}    (2.7a)

\max(P) = \max(P_\pi) = (0.7, 0.5, 0.5)    (2.7b)

The conclusion to draw here is that one must either have some transformation process so that raw point cloud data can be transformed into an input suitable for neural networks, or one must design the neural networks so that they are "permutation invariant". To achieve permutation invariance, the use of symmetric functions can be one suitable strategy.
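The column-wise maximum used in equation 2.7 can be checked numerically: a symmetric function gives the same result for any reordering of the points. A minimal NumPy sketch:

```python
import numpy as np

# The point cloud of equation 2.7a and a randomly permuted copy of it.
P = np.array([[0.1, 0.2, 0.5],
              [0.5, 0.1, 0.2],
              [0.1, 0.2, 0.1],
              [0.2, 0.5, 0.4],
              [0.7, 0.3, 0.3]])
rng = np.random.default_rng(0)
P_pi = P[rng.permutation(len(P))]                      # reordered point cloud P_pi

print(P.max(axis=0))                                   # [0.7 0.5 0.5]
print(np.allclose(P.max(axis=0), P_pi.max(axis=0)))    # True for any permutation
```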
Noise and outliers in point clouds:
Point clouds can be obtained directly from sensors like LiDAR (Light Detection and Ranging), derived from images, or estimated from RGBD maps. Each method of point cloud acquisition results in a point cloud with a different point density and different additional feature information. A LiDAR can have a point density of less than 10 points/m² or even more than 100 points/m², depending on the sensor. Point clouds obtained from RGBD images usually have a point density between 10 and 100 points/m², whereas for point clouds derived from images it depends on the spatial resolution of the cameras used. When point clouds are estimated, the resolution of their depth estimation depends on the underlying model used for point cloud construction. While the LiDAR sensor gives a direct point cloud, its accuracy is dependent on the lighting conditions and the material of the surfaces present in its field of view. Depending on how the point cloud is acquired, there may be missing points, outliers and even irregular distortions.

2.2 Deep Learning Primer
Deep learning is a machine learning technique which relies on enough training data being available. Deep learning architectures are based on Artificial Neural Networks (ANN). The concept of an artificial neuron develops from the neurons in the human brain, which work by receiving inputs from other neurons and giving an output. Mathematically, we model artificial neurons (also known as perceptrons) as taking an input and passing it through an activation function which decides the final output. Figure 2.7 represents a perceptron which takes inputs x := x_1, x_2, ..., x_N and produces an output y. Each input value x_i of x is associated with a weight value w_i in the weight vector w, which indicates the significance of the input value. For the perceptron to produce an output, it needs to be "fired up" just like a real neuron; the condition for this firing is that the weighted sum of the inputs must be more than the set threshold of the perceptron. Alternatively, one can define the negation of the threshold as the bias b of the perceptron and then incorporate it into the input vector itself by setting x_0 = b = -threshold and w_0 = 1. The pre-activation result is denoted by z and passed as an input to the activation function σ for the final output. Hence, we come to the following formulation:

y = \sigma(z)    (2.8)

where

z = b + \sum_{i=1}^{N} w_i x_i    (2.8a)
z = \sum_{i=0}^{N} w_i x_i    (2.8b)
z = w^T x    (2.8c)

Figure 2.7: A perceptron taking an input vector x, multiplying it with a weight vector w and producing an output y after processing it via its activation function σ.

2.2.1 Activation Functions
The activation function of a perceptron defines its output. We use the Rectified Linear Unit (ReLU) and one of its variants, leaky ReLU, in our models. Mathematically, ReLU is a max function with linear behaviour for positive values and 0 for negative values. Leaky ReLU is a modification of ReLU which allows negative values of the input to exist by scaling them down by a factor α. The functions are defined as follows:

\sigma(z) = \begin{cases} z, & z \ge 0 \\ \alpha z, & z < 0 \end{cases}
\qquad \text{ReLU: } \alpha = 0, \quad \text{Leaky ReLU: } \alpha \in (0, 1)    (2.9)

Figure 2.8 provides a simple visualization of the respective activation functions.

Figure 2.8: Visualization of the activation functions: (a) ReLU, (b) Leaky ReLU
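A minimal NumPy sketch of the perceptron of equation 2.8 combined with the activation functions of equation 2.9; the input, weight and bias values are arbitrary examples.

```python
import numpy as np

def leaky_relu(z, alpha=0.0):
    # alpha = 0 gives ReLU, alpha in (0, 1) gives leaky ReLU (equation 2.9)
    return np.where(z >= 0, z, alpha * z)

x = np.array([0.5, -1.2, 3.0])        # inputs x_1..x_N
w = np.array([0.2, 0.4, -0.1])        # weights w_1..w_N
b = 0.05                              # bias (negated threshold)

z = w @ x + b                         # pre-activation, z = w^T x + b (equation 2.8a)
y_relu = leaky_relu(z)                # ReLU output
y_lrelu = leaky_relu(z, alpha=0.01)   # leaky ReLU output
```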
2.2.2 Multi-layer perceptron
The perceptron presented by equation 2.8 represents a single-layer perceptron. Now we build on this idea to develop a network with an arbitrary number of layers containing multiple perceptrons. This type of network is known as a multilayer perceptron (MLP); it is also known as a "vanilla" neural network or a feedforward neural network. It consists of an input layer to receive the inputs, one output layer that makes a prediction, and an arbitrary number of hidden layers in between them. Since every perceptron layer has an input and an output, we count the total number of layers in an MLP by counting only the hidden layers. Figure 2.9 is an example network graph of a 3-layer perceptron with 4 units in the input layer, represented by red boxes, and 3 units in the output layer, represented by green boxes. The 1st, 2nd and 3rd hidden layers consist of 4, 3 and 2 perceptron units respectively, each represented as a blue circle. For convenience of visualization, the weights associated with each input are not displayed in the figure.

Figure 2.9: Network graph for a 3-layer perceptron

2.2.3 Loss Function and Optimization
Part segmentation is essentially a per-point multiclass classification problem, i.e. the model must identify the label of every point in an input point cloud. An object point cloud can have m > 0 parts, and for each part a new label is assigned. The loss function used for training all the models is the cross-entropy loss. Cross entropy indicates the distance between the predicted output distribution of a model and the original distribution. Equation 2.10 represents the calculation of the loss for an input sample s, where y_{s,c} is the prediction made by the neural network for a class c for the input sample s and gt_{s,c} represents the corresponding true value for the same class and input sample:

\text{loss} = - \sum_{c=1}^{m} gt_{s,c} \log(y_{s,c})    (2.10)

Now that we have our loss function, we can come to the optimization aspect of our network. For the network to learn, we must tell it to update its weight matrix such that the loss function is minimized. To do so we use the stochastic gradient descent (SGD) method, as it reaches a minimum in a shorter amount of time than algorithms like full-batch gradient descent and Newton's method.

2.2.4 Convolutional Neural Networks
Convolutional neural networks (CNN) are a different type of neural network which have a special structure and are typically used for data which have a grid-like topology. This is because they were designed to find patterns in the data using the convolution operation. There are two main layers in a CNN: the first is the convolution layer and the second is the pooling layer. The convolution layer, as the name suggests, performs convolution using a convolutional kernel or filter which is slid across the input data. The output of a convolution is called a feature map or an activation map. Figure 2.10 shows the output of convolving a 4 × 3 input with a 2 × 2 kernel.

Figure 2.10: Convolution operation

A pooling layer, also known as a subsampling layer, reduces the sample size of a feature map, making the processing faster. The output of a pooling layer is called a pooled feature map. The function used to perform pooling is usually either max or avg. Figure 2.11 shows the result of the max pooling operation on an example feature map.

Figure 2.11: Demonstration of max pooling in CNN

A CNN is constructed by stacking multiple convolution and pooling layers. By doing so, different layers capture different features present in the input data. Finally, a fully connected layer, i.e. an MLP, is used to generate a prediction. Figure 2.12 shows a simple CNN architecture consisting of an input layer, a combination of convolution layers and pooling layers acting one after the other, and the result of the final pooling layer being sent to the fully connected layers, which give us the output/prediction.

Figure 2.12: High level view of a simple CNN architecture
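Before moving on to 3D-specific networks, the following PyTorch sketch ties Sections 2.2.2 and 2.2.3 together: a small MLP, a per-point cross-entropy loss as in equation 2.10, and one SGD update. The tiny model and the tensor shapes are illustrative assumptions only, not one of the architectures used in this thesis.

```python
import torch
import torch.nn as nn

B, N, m = 4, 1024, 3                       # batch size, points per cloud, part labels
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, m))

points = torch.rand(B, N, 3)               # input point clouds
labels = torch.randint(0, m, (B, N))       # true per-point part labels

logits = model(points)                     # (B, N, m) per-point class scores
# Cross-entropy summed over classes for every point (equation 2.10), averaged over points.
loss = nn.CrossEntropyLoss()(logits.reshape(B * N, m), labels.reshape(B * N))

opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt.zero_grad()
loss.backward()                            # backpropagate the loss
opt.step()                                 # one stochastic gradient descent update
```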
2.3 Introduction to 3D Deep Learning

2.3.1 Projection Networks
The idea of a projection network is, as the name suggests, to project the 3D point cloud data to a more suitable data format. The inspiration for these networks started from one of the earliest and simplest methods, which was to use multiple 2D images generated from 3D point cloud projections and then process these 2D images with state-of-the-art 2D CNN models, as done in [29]. Combining actual multi-view depth maps, 2D images and 3D-to-2D projections comes at a computational cost. Similar work on projecting the 3D data to tangent planes and then performing tangent convolutions, like [30], also exists and shows great ability to process large-scale point clouds, but such methods depend on the ability of the model to estimate the tangents. Moreover, projection based methods are sensitive to viewpoints, cannot exploit the underlying geometry and suffer from structural information loss.

A different type of projection network, sometimes referred to as a discretization-based network, which projects the 3D point cloud to sparse lattice structures or volumetric representations, has been introduced in [28] and [39] respectively. While the models made for these networks provided good results, volumetric representations require high memory usage for fine resolutions and inherently suffer from information loss. Moreover, since the data is naturally sparse, using dense convolutions is inefficient in the first place. However, projection of 3D point clouds to lattice structures has more advantages. Models like the Sparse Lattice Network (SPLATNet) can project both multi-view 2D images and 3D point clouds to a high-dimensional permutohedral lattice [14] and operate on them with Bilinear Convolutional Layers [28], delivering very good results. However, they are still limited by the grid structure of the lattice, which eventually puts a constraint on the convolution operation. This thesis does not implement any type of projection or discretization based network.

2.3.2 Point-wise Networks
Point-wise networks are a new type of neural network designed specifically to work with point cloud data. The focus of these networks is to learn features directly from the 3D points as an input. Moreover, these networks do not have a convolution element to them, which makes them completely different from the regular deep learning approaches that are based on convolution. There are two significant models in this area, PointNet [22] and its successor PointNet++ [23], which are explained in sections 3.2 and 3.3 respectively.

2.3.3 Graph-based Convolution Networks
The use of graphs to add structure to unstructured data has been gaining attention in recent years due to the rise of convolution operators defined specifically for graph-based neural networks. There are two types of convolution operations possible on a graph: spectral and spatial. Spectral-based convolution approaches like [36] carry out multiplication of filters with the spectral representation of the graph in the Fourier domain, whereas spatial-based convolution methods focus on aggregating features from the edges of each node. The graph-based approach explored in this thesis is called Dynamic Graph CNN (DGCNN) [34] and is explained in section 3.4.

2.3.4 Point Convolution Networks
The convolution operation can be seen as a template-matching operation when dealing with images. One of the main reasons why CNNs have been so successful in dealing with images is the structure of images, i.e. images are dense and discrete, and they can be easily represented by arrays while preserving the meaning of the pixel intensities.
Point clouds, on the other hand, are sparse and continuous, and hence we need a new way to define convolution on point clouds. Point convolution network formulations are dominated either by the use of MLPs to formulate the kernel or by the use of explicitly defined geometric kernels. The point convolution network explored in this thesis is based on a new convolution operator called Kernel Point Convolution (KPConv) [31] and is explained in section 3.5.

3 Methodology

3.1 Introduction
Selecting the appropriate models to implement from a long list of models was a tough choice. One motivating factor was to see the use of these models in some real world applications like those in [7, 21, 2, 12, 27]. Finally, PointNet, PointNet++, Dynamic Graph CNN and Kernel Point Convolutions for point clouds were chosen for implementation. The following sections go in depth about each of the networks.

3.2 PointNet
The idea of PointNet is to compute features for every point in the point cloud directly. The input to this model is an N × F point cloud, where N is the number of points and F is the number of information channels. F is expressed as 3 + D, where 3 represents the 3-dimensional Cartesian coordinates and D represents the additional features. For example, with D = 0 (F = 3) we have only the spatial coordinates (x, y, z) and an input point cloud of size N × 3, while with D = 3 (F = 6) we could add the three RGB values per point; the per-point input vector would then look like (x, y, z, r, g, b) and the input point cloud would be of size N × 6. The architecture of PointNet helps it to achieve tasks like object classification, part segmentation and semantic segmentation. For segmentation, the output is a prediction matrix of size N × m, i.e. for each of the N points there is a prediction made for each of the m labels. PointNet uses symmetric functions (see equation 2.6) to bypass the problem of irregularity in point clouds.

X = \{ x_1, x_2, \dots, x_N \} \subset \mathbb{R}^F, \quad N \in \mathbb{N}, \quad F = 3 + D    (3.1)

Considering the simplest case, where the input point cloud consists of only N 3D Cartesian coordinates, i.e. D = 0 and F = 3, we have

X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}, \quad x_i = (x, y, z), \quad i \in \{1, 2, \dots, N\}    (3.2)

We can then define the PointNet function f_{pn} as follows:

f_{pn}(X) = \gamma(g(h(X))) : \mathbb{R}^{N \times F} \longrightarrow \mathbb{R}^k    (3.3)

where

h : \mathbb{R}^{N \times F} \longrightarrow \mathbb{R}^{N \times M}, \qquad
g : \mathbb{R}^{N \times M} \longrightarrow \mathbb{R}^{1 \times M}, \qquad
\gamma : \mathbb{R}^{1 \times M} \longrightarrow \mathbb{R}^{k}, \qquad M \in \mathbb{N}

Here the function h indicates a list of MLPs, γ is implemented as an MLP, and g represents a symmetric function. The proposed PointNet model in [22] uses M = 1024, which means that h consists of 1024 MLPs, where each MLP takes a point x_i as input and outputs a scalar value. So the function h can be modelled as follows:

h(x_i) = (h_1(x_i), h_2(x_i), \dots, h_{1024}(x_i)), \quad h_j : \mathbb{R}^3 \longrightarrow \mathbb{R}, \quad j \in \{1, 2, \dots, M\}    (3.4)

The function h is applied to all N points, resulting in an N × 1024 matrix. These extracted (per-point) features are then aggregated using the symmetric function g = max. This results in a global feature vector of size 1 × 1024. Finally, γ represents a new MLP which processes the global feature vector to output a point-set feature of size k.
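A stripped-down PyTorch sketch of equations 3.3-3.4 may help: a shared per-point MLP h, the symmetric aggregation g = max, and a final MLP γ. The layer sizes loosely follow the M = 1024 global feature of [22], but the spatial transformer networks and the segmentation head described below are omitted, so this is an illustration of the idea rather than the full PointNet.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, k=16, M=1024):
        super().__init__()
        # h: shared MLP applied independently to every point, R^3 -> R^M
        self.h = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                               nn.Linear(64, 128), nn.ReLU(),
                               nn.Linear(128, M), nn.ReLU())
        # gamma: MLP on the global feature, R^M -> R^k
        self.gamma = nn.Sequential(nn.Linear(M, 256), nn.ReLU(), nn.Linear(256, k))

    def forward(self, x):                              # x: (B, N, 3)
        per_point = self.h(x)                          # (B, N, M) per-point features
        global_feat = per_point.max(dim=1).values      # g = max over the N points, (B, M)
        return self.gamma(global_feat)                 # (B, k) point-set feature

out = TinyPointNet()(torch.rand(8, 2048, 3))           # -> shape (8, 16)
```

Because the max is taken over the point dimension, permuting the input points leaves the global feature, and hence the output, unchanged.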
To address the problem of invariance to rigid transformations, PointNet introduces a spatial transformer network (STN) which aligns an arbitrary point cloud to a canonical space. This network is sometimes also called the Joint Alignment Network and allows spatial manipulation of the data within the network itself. The idea of this network is to canonicalize the input point cloud. The STN is used twice in the PointNet model: once as an input transformer, where the output is a 3 × 3 matrix, and a second time as a feature transformer, where the output is a 64 × 64 matrix.

The PointNet model architecture [22] consists of two networks. The first network is called the Classification Network, whose core is the PointNet function of equation 3.3. The second network is called the Segmentation Network, whose task is to output per-point scores for m semantic subcategories. The input to this network is the global feature vector of size k as well as the per-point feature vector which is the result of the function h. The STN acts as an input transformer as well as a feature transformer while the input is being processed in the Classification Network. An overview of the PointNet architecture can be seen in Figure 3.1. The complete network architecture details can be found in [22].

Figure 3.1: Network Architecture of PointNet

3.3 PointNet++
PointNet++ is the successor of the PointNet architecture. The reason to build this model was to overcome the main drawback of PointNet, namely that PointNet did not really capture interactions between points. To capture this point interaction, PointNet++ introduced a new layer called the Sampling and Grouping Layer. The result of this layer is then fed into a PointNet layer for feature extraction. The combined process of a Sampling and Grouping Layer followed by a PointNet layer is called a Set Abstraction Layer (SAL). PointNet++ is based on multiple SALs, which helps it to achieve local feature aggregation for every point. We now introduce the Sampling and Grouping Layers.

Sampling Layer
1. This layer is used to generate centroids from the given input point cloud. The centroids are sampled using the Farthest Point Sampling (FPS) algorithm.
2. The objective of this algorithm is to sample a fixed number of points Q from an input point cloud P and output a sampled point cloud P_q such that all the points in P_q are farthest from every other point in P_q for the same value of Q. This method provides a good coverage of the entire point cloud. For an algorithmic implementation please refer to A.1.

P = \{ x_1, x_2, \dots, x_N \}    (3.5a)
P_q = \{ x_1, x_2, \dots, x_Q \}, \quad Q < N    (3.5b)

Grouping Layer
1. The objective of this layer is to find a fixed number of nearest neighbours for all of the points in P_q. There are two possible approaches to do this: the ball query search or the k-Nearest Neighbours (kNN) algorithm. kNN is a simple algorithm which finds a fixed number of neighbours of every sampled point based on a distance metric, whereas the ball query search forms a sphere/ball around every point in P_q with a predefined radius and then returns all the points within the sphere. For an algorithmic implementation please refer to A.2; a small sketch of both the sampling and the grouping operation is also given at the end of this section.
2. The input to this layer is the original input point cloud P and the sampled point cloud P_q; the output of this layer is a neighbourhood point set, in Euclidean space, for every sampled point q in P_q.

An overview of this architecture can be seen in Figure 3.2. The complete network architecture details can be found in [23].

Figure 3.2: Network Architecture of PointNet++
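The following is a minimal, single-cloud NumPy sketch of the two operations used in the Sampling and Grouping Layers: farthest point sampling (Appendix A.1) and ball query search (Appendix A.2). The real implementations are batched and run on the GPU; the radius and sample counts below are arbitrary.

```python
import numpy as np

def farthest_point_sampling(points, Q):
    # points: (N, 3); returns indices of Q points that are mutually far apart
    N = points.shape[0]
    chosen = [np.random.randint(N)]                 # start from a random point
    dist = np.full(N, np.inf)                       # distance to the chosen set
    for _ in range(Q - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))           # pick the point farthest from the set
    return np.array(chosen)

def ball_query(points, centroids, radius, num_sample):
    # for every centroid, return up to num_sample neighbour indices within the radius,
    # padding with the first neighbour found (mirroring the appendix pseudocode)
    groups = []
    for c in centroids:
        idx = np.where(np.linalg.norm(points - c, axis=1) <= radius)[0][:num_sample]
        pad = np.full(num_sample, idx[0]) if len(idx) > 0 else np.zeros(num_sample, int)
        pad[:len(idx)] = idx
        groups.append(pad)
    return np.stack(groups)

pts = np.random.rand(2048, 3)
centres = pts[farthest_point_sampling(pts, 128)]              # sampling layer
neigh = ball_query(pts, centres, radius=0.1, num_sample=32)   # grouping layer, (128, 32)
```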
3.4 Dynamic Graph CNN
Dynamic Graph CNN is a graph-based approach to solving the classification and segmentation problems for point cloud data. It creates a graph from the point cloud data by finding a fixed number of nearest neighbours of the input data points and then proceeds to compute features and apply convolution. The original paper [34] introduces the "EdgeConv" operator, which essentially replaces the per-point MLP used in PointNet. The model takes points, and can take additional features, as input; it then computes a neighbourhood point set N_i for every point, calculates edge values using the EdgeConv operator for each point in every N_i, and then aggregates the results for every N_i. It performs very well for segmentation, but the computational complexity of the model increases very fast, as it is a graph-based method in which the graph is updated after every layer. Figure 3.4 shows the architecture of the model for classification and segmentation. Following is the explanation of the EdgeConv and layer update operations.

Edge Convolutions (EdgeConv)
1. We start from the point cloud definition of equation 3.1:

X = \{ x_1, x_2, \dots, x_N \} \subset \mathbb{R}^F, \quad F = 3 + D

2. Compute a directed graph G with vertices V and edges E. The vertices of this graph are the points of X themselves, and the edges, which represent the connectivity between different vertices, are computed using the k-nearest neighbours algorithm. The graph also includes a "self-loop", which means that every vertex is also connected to itself. The graph G = (V, E) now represents a local point cloud structure, where

V = \{1, 2, \dots, N\}, \qquad E \subseteq V \times V    (3.6)

3. Next we compute the edge features as follows:

e_{ij} = h_\Phi(x_i, x_j), \qquad h_\Phi : \mathbb{R}^F \times \mathbb{R}^F \to \mathbb{R}^{F_n}, \qquad i, j \in V    (3.7)

Here h_Φ is a nonlinear function with learnable parameters Φ.

4. Convolution and feature aggregation. To convolve a point x_i we make use of the neighbourhood information of the point x_i given by j : (i, j) ∈ E, the edge features e_{ij}, and a symmetric aggregation function (see equation 2.6) denoted by □. The output x'_i is then given by

x'_i = \square_{j:(i,j) \in E} \; e_{ij}, \qquad e_{ij} = h_\Phi(x_i, x_j)    (3.8)

Figure 3.3 gives a visual representation of the EdgeConv operator.

Figure 3.3: Visual representation of the EdgeConv Operation of DGCNN

Selecting the right □ and h_Φ(x_i, x_j) functions is important here, as they have a crucial influence on the properties of EdgeConv. For □ it is advisable to use the max function. h_Φ(x_i, x_j) is implemented as an MLP, but the input to the MLP can vary. For example, if we consider PointNet, then h_Φ(x_i, x_j) = h_Φ(x_i); hence the MLP extracts features based only on the point x_i. The original paper [34] suggests 3-4 options for selecting □ and h_Φ.

Update Layer
1. For every layer l the graph G^l is recomputed using the k-nearest neighbours approach.
2. The nearest neighbours are computed in the feature space using the L2 distance.
3. By using a symmetric function like max pooling, the permutation problem is bypassed.

Figure 3.4: Network Architecture of Dynamic Graph CNN
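A compact PyTorch sketch of one EdgeConv step (equations 3.6-3.8): build a kNN graph including the self-loop, form edge features from (x_i, x_j − x_i) as suggested in [34], apply a shared MLP h_Φ and aggregate with max. The value of k and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def edge_conv(x, mlp, k=20):
    # x: (B, N, F) point features; returns (B, N, F_out) aggregated edge features
    dist = torch.cdist(x, x)                              # (B, N, N) pairwise distances
    knn = dist.topk(k, largest=False).indices             # (B, N, k); nearest point is x_i itself (self-loop)
    B, N, F = x.shape
    neighbours = torch.gather(
        x.unsqueeze(1).expand(B, N, N, F), 2,
        knn.unsqueeze(-1).expand(B, N, k, F))             # x_j for every neighbour j
    centre = x.unsqueeze(2).expand(B, N, k, F)            # x_i repeated for its k neighbours
    edge_input = torch.cat([centre, neighbours - centre], dim=-1)   # (x_i, x_j - x_i)
    e = mlp(edge_input)                                   # e_ij = h_Phi(x_i, x_j)
    return e.max(dim=2).values                            # symmetric max aggregation over j

h_phi = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 64))
out = edge_conv(torch.rand(2, 1024, 3), h_phi)            # -> (2, 1024, 64)
```

Stacking several such calls, with the kNN graph recomputed on the output features each time, gives the dynamic graph update described above.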
3.5 Kernel Point Convolutions
Kernel Point Convolutions (KPConv) for point clouds [31] proposed a new type of point convolution method which was meant to overcome the limitations of previous architectures. To achieve this, the authors proposed a kernel function to compute point-wise filters. In convolutions on 2D images, the kernel weights are associated with pixel values, and the filters we use are discrete and fixed in size. In KPConv, we have kernel points instead of pixels, and they are used to determine the kernel weights. We then use this kernel function to iterate over small neighborhoods of the point clouds; this is analogous to using 2D kernel filters and traversing them across an image.

To understand the formulation of convolutions in KPConv, we begin by redefining the original point cloud equation (2.2) for the scope of this section. We split P into two parts, in the following way:

P_{KP} \in \mathbb{R}^{N \times 3}    (3.9a)
F_{KP} \in \mathbb{R}^{N \times D}    (3.9b)

Applying the kernel function g_{KP} to the input features F_{KP} at a point x is equal to taking the weighted sum of the neighbouring point features, with the neighbourhood N_x defined as a ball in 3D space with radius r ∈ R. The convolution operation is defined as follows:

(F_{KP} * g_{KP})(x) = \sum_{x_i \in \mathcal{N}_x} g_{KP}(x_i - x) \, f_i, \qquad \mathcal{N}_x = \{ x_i \in P_{KP} \mid \| x_i - x \| \le r \}    (3.10)

The idea of the kernel function g_{KP} is that it must be able to apply different weights to different areas inside a neighborhood. Drawing a comparison to 2D CNNs, where kernels are usually defined as a square matrix of size smaller than the input data matrix, KPConv defines the kernel g_{KP} using K support points which are known as kernel points. Figure 3.5 provides a comparison between a kernel of a 2D CNN and KPConv.

Figure 3.5: Visualization Comparison: Left: Regular Convolution on 2D Image; Right: Kernel Point Convolutions on Un-ordered 2D Points

Figure 3.6: Visualization of kernel points, center and points of a point cloud in a neighborhood

The input to this function is the set of neighboring points centered on x, defined as \hat{x}_i = x_i - x, so the domain of the function becomes \mathcal{B}^3_r = \{ \hat{x}_i \in \mathbb{R}^3 \mid \| \hat{x}_i \| \le r \}. The kernel points are defined by \{ \tilde{x}_k \mid k < K \} \subset \mathcal{B}^3_r and have learnable weight matrices \{ W_k \mid k < K \} \subset \mathbb{R}^{D_{in} \times D_{out}} associated with them. Here D_{in} and D_{out} represent the number of input dimensions coming from the previous layer and the number of output dimensions. For a visual representation of this setting see Figure 3.6. The kernel function g_{KP} for a point \hat{x}_i \in \mathcal{B}^3_r is then defined as

g_{KP}(\hat{x}_i) = \sum_{k < K} h(\hat{x}_i, \tilde{x}_k) \, W_k

where h is a correlation function that weights the contribution of each kernel point according to how close \tilde{x}_k is to \hat{x}_i.

A.2 Ball Query Search

3 group_idx[sqrdists > r2] = N /* Mark points whose squared distance to the query centre exceeds r2 with the sentinel index N */
/* Sort each row in ascending order and select the first num_sample points */
4 group_idx = sort(group_idx)[:, :, :num_sample]
/* There may be entries among the first num_sample points that are assigned N (i.e. fewer than num_sample points lie inside the spherical area); these entries need to be discarded, and the first point can be used instead. */
/* group_first: [B, S, num_sample], copies the value of the first point in group_idx to the dimensions [B, S, num_sample], which is convenient for the subsequent replacement. */
5 group_first = group_idx[:, :, 0].view(B, S, 1).repeat([1, 1, num_sample])
/* Find the entries where group_idx is equal to N */
6 mask = group_idx == N
/* Replace the value of these entries with the value of the first point */
7 group_idx[mask] = group_first[mask]
8 return group_idx

B Appendix 2: Test Data Collection Pipeline

B.1 RANSAC-based Plane segmentation
To remove the top of the box on which the object is placed, we make use of RANSAC to find a plane which fits the largest number of points and eventually represents the top of the box. To achieve this we make use of Open3D's segment_plane function, as mentioned in [11]; a short usage sketch is given at the end of this section. It has the following three parameters:

• A distance threshold which defines the maximum distance a point can have to an estimated plane for it to be considered an inlier, i.e. a point within the estimated plane.
• The number of points that are randomly sampled to estimate a plane.
• The number of iterations, which states how often a random plane is to be sampled and verified.

The function then returns the coefficients of the plane as (a, b, c, d), such that for each point (x, y, z) on the plane we have ax + by + cz + d = 0, and a list of indices of the points which lie within the plane. Making use of this list of inlier indices, we can remove the points which belong to the top of the box. The plane removal takes place for all the point clouds, but the parameters of the function are set only once. To find the optimum parameters effectively, a visualization of the segmented point cloud is shown using default parameter values, and if the results are not satisfactory, new parameters can be entered by the user, followed by a new visualization. This process takes place in an iterative fashion until the user decides to stop the program, hence setting the argument values of the function to be used for all the remaining point clouds. If needed, the plane segmentation function can also be called only once, or be used in the above mentioned way to process a set of point clouds.
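A minimal usage sketch of Open3D's segment_plane for the plane removal described above; the file name and parameter values are illustrative assumptions, not the exact values used for the thesis data (an Open3D ≥ 0.10 style API is assumed).

```python
import open3d as o3d

# Load one captured point cloud (hypothetical file name) and remove the dominant plane,
# i.e. the top of the box the object stands on.
pcd = o3d.io.read_point_cloud("scan_0.ply")
plane_model, inliers = pcd.segment_plane(distance_threshold=0.01,  # max point-to-plane distance for inliers
                                         ransac_n=3,               # points sampled to estimate a plane
                                         num_iterations=1000)      # how often a random plane is sampled
a, b, c, d = plane_model                                           # plane: ax + by + cz + d = 0
object_only = pcd.select_by_index(inliers, invert=True)            # drop the plane inliers
```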
B.2 Point cloud fusion

B.2.1 ICP registration
The main algorithm used to fuse multiple point clouds here is called multiway registration, which essentially uses a point cloud registration algorithm like Iterative Closest Point (ICP) as the base algorithm. ICP takes two point clouds (source and target) as an input, together with an initial transformation that might roughly align the source point cloud to the target point cloud. The output of ICP is a transformation that aligns the two point clouds. There are many variations of ICP available, as mentioned in [24]; the one used here is called point-to-plane ICP [35] and is implemented via Open3D [11]. The point-to-plane ICP algorithm iterates over two steps:

1. Find the correspondence set Cs = {(p_i, p_j)} from the target point cloud P_j and the source point cloud P_i, which is transformed with the transformation matrix T_{i,j}. Here p_i and p_j are points from the source and the target point clouds respectively.
2. Update the transformation matrix T_{i,j} by minimizing an objective function E(T) defined over the correspondence set Cs. The