Self-Supervised Stereo Depth Estimation Depth estimation in multiple environments through an adaptive CNN and IR light Master’s thesis in System, Control and Mechatronics JONATAN NORDH MARCUS VIKÉN Department of Mechanics and Maritime Sciences CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2021 www.chalmers.se www.chalmers.se Master’s thesis 2021:27 Self-Supervised Stereo Depth Estimation JONATAN NORDH MARCUS VIKÉN Department of Mechanics and Maritime Sciences Division of Vehicle Engineering and Autonomous Systems Adaptive Systems Research Group Chalmers University of Technology Gothenburg, Sweden 2021 Self-Supervised Stereo Depth Estimation Depth estimation in multiple environments through an adaptive CNN and IR light JONATAN NORDH MARCUS VIKÉN © JONATAN NORDH, MARCUS VIKÉN, 2021. Supervisor: Peter Forsberg, CPAC Systems AB Examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences Master’s Thesis 2021:27 Department of Mechanics and Maritime Sciences Division of Vehicle Engineering and Autonomous Systems Adaptive Systems Research Group Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: 3D visualization of a stereo image pair taken in Gothenburg with the pre- sented camera-rig. Depth prediction for each pixel is produced with the proposed CNN called SH-Net. Typeset in LATEX, template by Magnus Gustaver Printed by Chalmers Reproservice Gothenburg, Sweden 2021 iv Self-Supervised Stereo Depth Estimation Depth estimation in multiple environments through an adaptive CNN and IR light JONATAN NORDH MARCUS VIKÉN Department of Mechanics and Maritime Sciences Chalmers University of Technology Abstract We have developed a complete depth sensor unit with a self-supervised neural net- work and stereo camera. The sensor is both adaptive during usage and can work in dark and low light environments with aid from IR spotlights. Disparity estima- tion via stereo cameras has shown great performance in combination with neural networks during recent years. The reason is because deep learning reduces the com- putational effort considerably compared to previous methods. However, the existing deep learning methods do not evaluate the depth measurements but rather the dis- parity estimation accuracy on available benchmark datasets. In difference to earlier work, this system has been evaluated with respect to depth measurement accuracy and suitable evaluation metrics have been developed. If the stereo camera is to be used as a reliable depth sensor the depth estimation quality needs to be ensured. From the thesis contributions a high-functional depth sensor unit can be developed with potential to surpass other sensors with respect to the amount of data obtained per second. Keywords: Artificial Neural Network (ANN), Convolutional Neural Network (CNN),deep learning, self-supervised, machine learning, stereo vision, disparity estimation, night- vision, Oriented FAST and Rotated BRIEF (ORB). v Acknowledgements This master’s thesis completes our studies within the M.Sc. programme in Systems, Control and Mechatronics at Chalmers University of Technology. The thesis project was carried out during the spring of 2021 at CPAC Systems AB. We would like to send our gratitude to our supervisor Peter Forsberg who has guided us through this project with great commitment. Then we would also like to thank CPAC that provided all the necessary resources required to make this project possible. Although a pandemic has made everything more complicated this year, Peter and all others at CPAC have made the best out of the situation, for this we are very impressed and grateful. Jonatan Nordh and Marcus Vikén, Gothenburg, May 2021 Thesis advisor: Peter Forsberg, CPAC Systems AB Thesis examiner: Peter Forsberg, Department of Mechanics and Maritime Sci- ences vii Abbreviations AD .....................Autonomous Driving ADAS ................Advanced Driver Assistance Systems BRIEF ...............Binary Robust Independent Elementary Features CNN ..................Convolutional Neural Network FAST .................Features from Accelerated Segment Test FIR ....................Far Infrared FOV ...................Field Of View FPS ....................Frames Per Second GBG ..................Gothenburg GPU ..................Graphics Processing Unit IR ......................Infrared LiDAR ...............Light Detection And Ranging LWA-Net ............Light-Weight Adaptive Network NIR ....................Near Infrared ORB ...................Oriented FAST and Rotated BRIEF SH-Net ...............Stacked Hourglass Network SIFT ..................Scale-Invariant Feature Transform SSIM ..................Structural Similarity Index Measure ReLU .................Rectified Linear Unit XCNN ................Cross Convolutional Neural Network ix Contents List of Figures xiii List of Tables xvii 1 Introduction 1 1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Theory 5 2.1 Binocular disparity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 The stereo camera . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Stereo calibration . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.3 Disparity and depth calculations . . . . . . . . . . . . . . . . . 7 2.2 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Network architecture . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Training the network . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.3 Loss-function . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.3.1 Supervised loss . . . . . . . . . . . . . . . . . . . . . 9 2.2.3.2 Self-supervised loss . . . . . . . . . . . . . . . . . . . 10 2.2.3.3 Linear interpolation . . . . . . . . . . . . . . . . . . 10 2.2.3.4 Reconstruction mask . . . . . . . . . . . . . . . . . . 11 2.2.3.5 Structural similarity between images . . . . . . . . . 11 2.2.3.6 Regularization loss . . . . . . . . . . . . . . . . . . . 12 2.2.4 IR night vision . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.5 Oriented FAST and Rotated BRIEF (ORB) . . . . . . . . . . 12 3 Methods 13 3.1 Hardware setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Dataset Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.1 KITTI Stereo 2015 . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.2 Collecting new data . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.3 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.3.1 Offline stereo calibration . . . . . . . . . . . . . . . . 15 3.2.3.2 Online stereo calibration . . . . . . . . . . . . . . . . 16 3.3 Network architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 xi Contents 3.3.1 SH-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3.2 XCNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.3 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.4 FOV and optimized disparity zone . . . . . . . . . . . . . . . 20 3.3.4.1 Optimal baseline . . . . . . . . . . . . . . . . . . . . 20 3.3.5 Occlusion mask . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Evaluate depth accuracy . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4.1 Depth evaluation using LiDAR . . . . . . . . . . . . . . . . . 23 3.4.2 Depth evaluation using ORB . . . . . . . . . . . . . . . . . . . 25 3.4.3 Depth evaluation in low-light conditions using IR light . . . . 26 3.5 Online self improving ability . . . . . . . . . . . . . . . . . . . . . . . 26 4 Result and discussion 29 4.1 Optimal baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2.1 KITTI benchmark . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2.2 Depth sensor evaluation . . . . . . . . . . . . . . . . . . . . . 31 4.2.2.1 Depth evaluation using ORB . . . . . . . . . . . . . 31 4.2.2.2 Laser point evaluation . . . . . . . . . . . . . . . . . 31 4.2.2.3 Performance in low light conditions using laser . . . 33 4.2.2.4 Performance in low light conditions using ORB . . . 34 4.3 Adaptive performance . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.4 Visual results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.4.1 GBG traffic dataset . . . . . . . . . . . . . . . . . . . . . . . . 36 4.4.2 IR dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5.1 Optimal baseline . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5.2 Quality of data . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.5.3 Network performance . . . . . . . . . . . . . . . . . . . . . . . 42 4.5.4 Adaptive performance . . . . . . . . . . . . . . . . . . . . . . 43 4.5.5 IR depth accuracy . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.6 Visual performance in daylight . . . . . . . . . . . . . . . . . 43 4.5.7 Visual performance with IR light . . . . . . . . . . . . . . . . 44 4.5.8 Evaluation using LiDAR and ORB . . . . . . . . . . . . . . . 45 4.5.9 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5 Conclusion 47 Bibliography 49 A Appendix 1 I B Appendix 2 III xii List of Figures 2.1 The pinhole camera. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Illustration of the camera setup and disparity to depth relation in a three-dimensional space. . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Image warp example. To the left is the original right image and to the right is the warped image filled with the real right image according to the mask values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Camera rig used in the project. From left to right are a camera, a 50 W IR LED, a camera, a laser rangefinder, a camera, another 50 W IR LED and a fourth camera. At the back, a Jetson TX2 is mounted and connected to the units. . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 (a) Left input image of stereo pair. (b) Right input image of stereo pair. (c) Ground truth disparity map. . . . . . . . . . . . . . . . . . . 14 3.3 To the left is an uncalibrated image pair where red color channel represent the right image and the other channels represent the left image. To the right is the same image but calibrated. . . . . . . . . . 16 3.4 Online self-adapting horizontal alignment using ORB. The red chan- nel is from the right image while the green and blue channels come from the left image. To the left is the image pair before calibration, right image is shifted 15 pixels up. In the right image calibration has been performed and the image pair are horizontally aligned. . . . . . 17 3.5 Siamese network architecture of the SH-Net with inspiration from GA-Net. Red layers are convolutional, blue are transposed convolu- tional and the yellow are adding layers connecting the encoder and de- coder parts of the network. The two identical networks share weights through cross-connections (gray arrows) and have skip connections (black arrows) between the encoder and decoder parts. . . . . . . . . 18 3.6 XCNN network architecture with skipping and cross connections vi- sualized. The red layers are convolutional building up the encoder while the blue blocks represent the transposed convolutional layers that defines the decoder. . . . . . . . . . . . . . . . . . . . . . . . . . 19 xiii List of Figures 3.7 Field of view visualisation of visible and occluded areas for a stereo par. The dark zones at the edges as well as regions occluded by objects are marked with colors. Blue for areas not visible to camera 2 and the orange areas are not visible to camera 1. In this example there is a person visible to both cameras and a car only visible to camera 2 as the house occlude the car. . . . . . . . . . . . . . . . . . 21 3.8 Disparity to depth relationship to the left and depth error caused by 3-pixel positive disparity error to the right. The lines show three different baselines that are possible with the given camera rig. . . . . 22 3.9 Depth estimation from stereo camera with baseline of 14 cm in an of- fice corridor. The measurement value in red is the predicted distance, the laser measured 5.92 meter giving an error of -2.48 meters for this image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.11 Left stereo camera image that depth was predicted from. On top of a garage roof with a distance of 20.84 m to the small entrance building according to the laser measurement. . . . . . . . . . . . . . . . . . . . 24 3.12 Error distribution of the ORB estimations compared with KITTI ground truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.14 Image captured with use of IR light in a dark warehouse. . . . . . . . 27 4.1 Depth errors calculated for the two baselines of 14 and 42 cm. The measured errors are plotted together with the mean value for both baselines at distances between 2.5 and 22.5 meter. . . . . . . . . . . . 29 4.2 Distribution of the measured errors for the two baselines of 14 and 42 cm. The error measured with baseline equal 42 cm is closer to zero and less spread. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4 Error depth calculated between predicted disparity estimation and laser measurement. Images of a garage entrance building at distances between 3 and 73 meters were input to the SH-Net and XCNN net- work to estimate the depth. . . . . . . . . . . . . . . . . . . . . . . . 32 4.5 Depth errors from 65 image pairs collected in the dark with IR light as only source of light. Distances between 2.5 and 22.5 meter were measured and the error for the two networks, XCNN and SH-Net, were calculated as the difference of predicted depth and measured depth using laser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.6 IR depth estimation errors for distance 0-35 meter for IR dataset. . . 34 4.7 Example of SH-Net adapting to data through training on the GBG traffic dataset. The network start with untrained weights and improve during 3000 training steps. . . . . . . . . . . . . . . . . . . . . . . . . 35 4.12 Disparity to depth relation for two baselines with the distances 42 cm and 252 cm. The corresponding disparity for 112 meter measurements are plotted together with a 3-pixel positive offset. The offset causes an error in depth measurement with more effect on the shorter baseline. 41 4.13 Color distribution for red, green and blue channels taken from two example images. One image taken in daylight on a road and another image in a dark warehouse with IR as light source. . . . . . . . . . . 44 xiv List of Figures 4.14 Prediction comparison between XCNN and SH-Net for input image without distinctive objects. . . . . . . . . . . . . . . . . . . . . . . . . 44 4.15 Prediction comparison between XCNN and SH-Net for IR input image. 45 xv List of Figures xvi List of Tables 4.1 Evaluation results on Kitti 2015 benchmark for different self-supervised network architectures. Unavailable data are noted as -. . . . . . . . . 31 4.2 Depth and disparity evaluation result from the two network predic- tions compared with ground truth from the ORB algorithm on GBG dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3 Depth error of network predictions from images captured in daylight on top of a garage roof. The error is defined from the predicted depth compared with laser measurements for distances between 3 to 73 meters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.4 Depth error from SH-Net and XCNN network prediction compared with laser measurements for distances between 2.5 to 22.5 m. Images were captured in a dark warehouse with IR lights as the only source of light. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.5 Depth and disparity evaluation result from the two network predic- tions compared with ground truth from the ORB algorithm on IR dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 A.1 SH-Net network architecture in detail. . . . . . . . . . . . . . . . . . I B.1 XCNN network architecture in detail. . . . . . . . . . . . . . . . . . . III xvii List of Tables xviii 1 Introduction The demand for autonomous and remotely controlled vehicles is growing at a rapid pace. Autonomous systems can both increase safety and reduce cost. If a human driver can be replaced in a hazardous and inhospitable environment, that would decrease both labor cost and injury risk of the driver. Areas where autonomous systems can be applied are many such as: boat docking [1], garbage collection [2], mining [3] and self-driving cars [4]. To be able to interpret the surroundings and make suitable decisions, the system of an autonomous vehicle needs accurate and dependable sensors. Malfunctioning or failure of such systems can lead to irreversible damage which consequently makes them safety critical [5]. One of the most universal signal types that is used to interpret the surrounding is the 3D depth measure whereby, producing a 3D grid map nearby objects can be located and measured. One type of sensor that can produce 3D depth estimation is the stereo camera. Here, by utilizing disparity be- tween objects in two images, depth can be calculated. Disparity is defined as the pixel coordinate difference between an object’s position in the left and right image produced by a stereo camera. Stereo cameras can produce depth measurements with higher resolution compared to LiDARs and provide color images. In addition, the cameras are inexpensive which consequently makes them competitive as depth sensors. However, current stereo cameras have a few drawbacks due to a variety of real-world problems such as occlusions, large textureless areas, reflective surfaces and insufficient light. To make accurate estimations the stereo cameras also need initial and frequent calibration due to imperfect assembly or geometric deformation from for example thermal changes. If these drawbacks were compensated for the stereo camera could be a competitive sensor within the AD/ADAS industry where a lot of data can be obtained from images in a variety of environments. The stereo camera could also be combined with infrared (IR) light and other sensors which can boost predictions during night and low light conditions. It has been shown that by using four stereo cameras with different baselines, the accuracy can be increased for a long range, but at the cost of high computational demand [6]. Further, using deep learning the computational effort can be decreased considerably [7]. Currently, state of art stereo methods on the KITTI stereo bench- mark [8, 9] leaderboard is based on deep learning. However, most of these stereo methods are using supervised learning which require ground truth data. The cum- bersome labor of collecting ground truth data has here been solved by using self- supervised learning. This is a solution that also enables network adaption to new environments based on the images collected by the stereo camera during usage. In- stead of developing new competitive network architectures the focus of this thesis 1 1. Introduction work is to evaluate real-life performance of networks inspired by already existing state-of-art network architectures. The performance must be evaluated in a trust- worthy manner before it can be used as a depth sensor in safety-critical applications. Evaluation on data gathered with other camera configurations in different environ- ments will not provide a reliable evaluation. Therefore, the purpose of this thesis is to present smart ways to evaluate real-world depth estimation accuracy on self- collected data and to discuss areas where the camera sensor potentially could be implemented. 1.1 Purpose The purpose of this master’s thesis is to implement and evaluate existing state- of-the-art CNN strategies for instantaneous disparity calculation using a binocular stereo camera. With a focus on real life online performance an evaluation will be done that evaluate to what extent the stereo camera can complement or replace cur- rent depth sensors. Hence, the precision and robustness of depth estimations will be examined in different environments and conditions. The aim is to address the following objectives: i) Implementation of two or more state-of-art inspired CNN architectures, ii) Combine CNN and stereo camera as a complete depth sensor unit, iii) Evaluation of full-HD depth estimation, iiii) Evaluate night vision performance, iiiii) Improve real life robustness and adaptability. This will be achieved by im- plementing existing theory and research and then by examining the possibilities of refining the estimation. 1.2 Objectives The main objective of this thesis is to evaluate the stereo camera as a depth mea- surement sensor. This includes implementation of state-of-the-art algorithms, collect and evaluate depth data and test possible setups for usage on moving vehicles. Fur- thermore, the objective is to evaluate the systems to find an effective and usable system configuration. The thesis will aim to answer the following questions: • How can existing theory and research be implemented to design a depth esti- mation sensor for real-life usage? • How can depth accuracy be evaluated on collected data? • What is the optimal baseline distance for the proposed depth estimation sen- sor? • What is the precision of estimated dense full-HD depth maps? • How can self-supervised learning be used to enable an adaptive behavior? • How can IR light be implemented to increase performance in low-light condi- tions? 2 1. Introduction 1.3 Scope The goal of this project is not to achieve good benchmark scores for any available benchmark dataset. The project will focus its resources on evaluating the areas related to depth measurements and practical implementations on board vehicles. The thesis aims at showing the potential of stereo cameras to be used as sensors and investigate the possible benefits and drawbacks of such a sensor. Therefore, efforts will not be spent trying to reach the best publicly available benchmark scores. Neither will time allow to try all network architectures such that state-of-the-art predictions are obtained. Instead, the components of chosen neural networks will be evaluated and enhanced with respect to obtaining as good dense depth estimations as possible. Faster computational speed and more advanced algorithms will most likely be presented in the future. However, this work will provide information of how to make a stereo camera useful as a depth sensor. 1.4 Related work In recent years significant improvements have been achieved within the area of stereo vision due to the use of deep learning. The reason for this is that deep learning re- duces the computational effort considerably compared to previous methods. In 2017, Kendall et al. [7] proposed the Geometry and Context network (GC-Net) which is an end-to-end disparity regression learning architecture. The network architecture is based on a Siamese network which learns deep unary features through a number of 2D convolutions. The deep unary features are then used to compute a stereo matching cost by forming a 4D cost volume using 3D convolutions. The GC-net is a state-of-art network and since it was released in 2017 many well performing networks have been designed based on this architecture. PSMNet [10] is a more recent network that further increases accuracy by introducing a stacked hourglass 3D convolution architecture. The number of 3D convolutions can be increased considerably without affecting computational cost due to frequent down and up-sampling. The PSMNet later inspired the GA-Net [11] which is replacing the computationally costly and memory-consuming 3D convolutions by introducing two new neural network layers. Mayer et al. [12] proposed a network architecture called DispNet that applies the optical flow estimation concept to disparity estimation using convolutional neural networks. The performance of these methods are state-of-art with impressive results on pop- ular benchmark suites like KITTI Stereo 2012, 2015 [8, 9]. However, they are all using supervised learning which require ground truth depth data. As is well known, ground truth data is often expensive and time consuming to obtain, and thus self- supervised learning is preferred. Inspired by DispNet, Godard et al. [13] proposed a method to perform monocular depth estimation as an image reconstruction problem. By implementing an im- age reconstruction loss with a left-right consistency check the network can learn to perform single image depth estimation, despite the absence of ground truth data. They show that this method even outperforms supervised methods. Stereo match- 3 1. Introduction ing is closely related to monocular depth estimation and Zhong et al. [14] intro- duced a self-supervised learning method for stereo matching. The network predicts dense disparity maps directly from the stereo input which enables the network to be self-improving and adaptive to new unseen imageries and different camera settings. Reconstruction loss is the most common way to remove the dependency of depth ground truth data but one drawback when reconstructing right input image from left is that the network cannot handle occluded regions. Peng et al. [15] introduced an occlusion aware self-supervised stereo method. By making use of geometry fea- tures of the disparity maps in an iterative way occluded pixels can be detected and added to an occlusion mask. The resulting occlusion mask is then used as a guid- ance in either training or post processing. What is common for these state-of-art depth estimation algorithms is that they often ignore limitations of GPU memory space and power consumption. Gröndahl et al. [16] introduced the XCNN network and shows that the speed, GPU memory and power consumption can be decreased considerably by applying weight pruning but at the cost of network adaptability. Gan et al. [17] propose a light-weight network for real-time adaptive stereo depth estimation which is suitable for an embedded device such as NVIDIA Jetson TX2 [18]. 4 2 Theory In this chapter the theory specific for the thesis will be presented. Initially a brief introduction to the stereo camera and the correlation between disparity and depth is given. Thereafter theory related to the neural networks used in the project is presented. Finally, IR light for night vision and the feature matching algorithm ORB are introduced. 2.1 Binocular disparity Humans perceive the world in three-dimensional coordinates although the human eye can only extract information in two dimensions which is made possible using binocular disparity. An object point in space appears at distinct positions for the left and right eye. The difference is called a disparity and it is proportional to the distance to the object. Objects located nearby give large disparities while objects further away result in smaller disparities. With the use of two horizontally aligned cameras a computer can estimate depths like humans do [19]. 2.1.1 The stereo camera A stereo camera consists of two or more pinhole cameras with different field of perception. In Figure 2.1 there is a representation of the pinhole camera. The light reflected from objects pass through a small aperture and projects an upside-down image on the opposite side of the box, called the image plane. Figure 2.1: The pinhole camera. 5 2. Theory The distance between the aperture and respective image plane is called a focal length, denoted as f . The focal lengths in the x and y directions are different, fx and fy, since the shape of individual pixels in a camera often are non-square. Furthermore, the possible offset between the optical axis, also called principal point, and the center of an image are expressed with the two variables cx and cy. Where cx is the horizontal offset and cy is the vertical offset to the true image middle point. From the real world-point P = [X, Y, Z]T the camera coordinates, x and y, can be calculated with the following equations: x = fx X Z + cx (2.1) y = fy Y Z + cy (2.2) These relations can be expressed as a 3x3 matrix that maps between real-world coordinates and camera coordinates and is called the intrinsic matrix, denoted as M . The transformation is written asxy z  = M XY Z  = fx 0 cx 0 fy cy 0 0 1  XY Z  (2.3) In many applications it is useful to transform camera coordinates to a real world coordinate system which can be achieved with linear transformation using the ex- trinsic parameters. They consist of a rotational 3x3 matrix R and a 3x1 translation vector t. The mapping between world and camera coordinates is P camera = R(P world − t) (2.4) Image distortion is a common effect for all cameras with a lens and there exist two common types of distortions. Firstly, radial distortion, which is an effect of having a convex lens that bend light more at the edges than in the center, resulting in a "Fish-Eye" effect. A phenomenon that can be compensated for with the following equation: xcorrected = x(1 + k1r 2 + k2r 4 + k3r 6) (2.5) ycorrected = y(1 + k1r 2 + k2r 4 + k3r 6) (2.6) The second most common lens distortion is tangential distortion as an effect of the lens not being parallel to the imaging plane. This can also be compensated for with use of these equations: xcorrected = x+ 2p1xy + p2(r2 + 2x2) (2.7) ycorrected = y + p1(r2 + 2y2) + 2p2xy (2.8) Here, r = √ x2 + y2 is the pixel coordinate distance to the origin. The radial dis- tortion coefficients k1, k2 and k3 as well as the tangential distortion coefficients p1 6 2. Theory and p2 are also considered intrinsic parameters. Even though these two distortions have the largest impact there exist other types of distortions with less impact that usually can be neglected. 2.1.2 Stereo calibration Stereo calibration is essential for disparity calculations. A horizontal miss-alignment will contradict the assumption that matching points from the left and right camera will appear on the same horizontal line. Factors for a miss-alignment are camera and lens displacement. Either as a translation shift internal in the camera or external between cameras, such a shift is most probably quite small. However, rotational shift has a larger impact and is more difficult to notice. A small pitch-angle offset can give a large image offset in pixel-distance. It has been shown that the errors corrupt the depth estimation but there are ways to compensate for these imperfections [20]. One way to simplify the depth estimation is by calibrating the cameras. A single camera calibration makes use of the intrinsic parameters to compensate for the possible distortions. However, for a stereo pair to be identical in all aspects except the horizontal shift a stereo calibration is necessary. With the use of a known point in space visible to both cameras a relationship between the cameras in space can be calculated. The calibration is done by calculating the rotational matrix and translation vector between two cameras, calculated as: R = RrRl T (2.9) t = tr −Rtl (2.10) where R and t denotes the rotation and translation to move the right camera co- ordinate system into the left one. To transform a point from the left to the right camera Equation 2.4 can be used with P L instead of P camera. 2.1.3 Disparity and depth calculations Once a binocular camera is stereo calibrated there is a known relation between the left and right view. A world point P should be located at the same horizontal line in both views such that they have the same y-coordinate. In opposite, the x-coordinate should have different values for the two views where the disparity is linearly related to the distance of the point P . As illustrated in Figure 2.2 the world point P results in the image points pL = [xL, yL] and pR = [xR, yR] in the left and right image, respectively. As an effect of the horizontal displacement, also known as the baseline B, the image coordinates are not identical. If perfectly stereo calibrated the vertical coordinates are the same, yL = yR, while the horizontal coordinates are not, xL 6= xR, except for points extremely far away from the camera. The disparity d is the horizontal coordinate difference of where the point P appear in respective camera, d(pL) = xL− xR. The geometric relation of a pinhole camera gives a linear relationship between disparity and depth as Z = Bf d(pL) (2.11) 7 2. Theory Figure 2.2: Illustration of the camera setup and disparity to depth relation in a three-dimensional space. where the baseline B [millimeters], focal length f [pixels] and d(pL) [pixels] give the distance Z to point P in millimeters. There exist multiple algorithms for stereo depth estimation with the common blocks: • Matching cost computations • Cost aggregation • Disparity computation and optimization The traditional matching cost algorithms are computationally heavy and/or depen- dent on exact stereo calibration. Since two points must be matched the size of the search area will decide the computational cost. Two of these algorithms are semi- global matching and mutual information series (SGM and SGBM) [21]. It has been proven that with the use of neural networks dense disparity-maps can be estimated faster and less dependent on perfect calibration [7]. 2.2 Artificial neural networks Artificial neural networks have shown impressive performance for the disparity es- timation task. The theory and specific functions used will be presented in this section. 2.2.1 Network architecture Different tasks need different network structures and sizes. In general, a larg- er/deeper network can learn more complex patterns but at the cost of being more computationally heavy. A popular network architecture is the U-net with convolu- tional layers down-sampling the input to a latent space and then use up-sampling 8 2. Theory with transposed convolutional layers to the original input size [22]. The down- sampling part is called an encoder while the up-sampling part is known as a decoder. A way to not lose information while going deep into the network is by introducing residual connections [23]. These connections feed forward information from layers in the encoder to layers in the decoder adding a contribution in addition to the previous layer in the decoder. For a layer LD N in the decoder the contribution can be written as LD N = LD N−1 + wLE N where w is a weight constant and LE N is a layer output from the encoder. For these connections to work the layer output sizes of the decoder and added encoder layer need matching height, width and number of channels. Lastly, to extract information with respect to the difference of two input images cross-connections have proven useful. Consider two identical networks with different input data. In each layer information from the other identical network is added making it possible for the network to do comparisons between both inputs. The addition makes every layer in the left and right lane dependent and in that way forced to share trainable weights. A structure commonly known as the Siamese network architecture [24]. 2.2.2 Training the network There are two main approaches of training a neural network: supervised and self- supervised training. Supervised training use annotated data containing the ground truth of what the network tries to learn. For disparity estimation the ground truth data is the true disparity/depth map obtained from for example a LiDAR. The goal is to create a network capable of estimating disparity from a stereo pair it has never seen before. While supervised training needs a lot of manual prepossessing the self-supervised fashion is less demanding. A calibrated stereo pair of images as input is all the network need in order to learn. Instead, the complexity lies in the loss-function. Rather than minimizing a simple mean square error (MSE) between the ground truth and network output a more sophisticated loss must be defined. 2.2.3 Loss-function All neural networks have an objective function which they try to minimize or max- imize. It is the loss function that control how well the network is performing and give information about what changes that are most effective to get closer to the objective. This is done by backpropagation [25] that computes the gradient of the loss function with respect to the weights and provide a direction of change most optimal for the current state. How the loss-function is defined will therefore have significant impact on the learning and outcome of neural network training. 2.2.3.1 Supervised loss The supervised loss can be expressed as the MSE of the predicted and ground truth disparity. Ls = 1 NM M∑ i N∑ j (dL(i, j)− d̂L(i, j))2 (2.12) 9 2. Theory for images with a shape of [N×M×3], predicted left disparity dL and ground truth d̂L. With the objective to minimize a MSE loss-function the network will learn how to create the ground truth from the provided input. Hence, the quality of ground truth data become important as bad data will result in a poor training. The network will never perform better than the quality of the ground truth data. 2.2.3.2 Self-supervised loss The self-supervised loss is not dependent on ground truth data to indicate perfor- mance and direction for a network. Instead, the input data can be processed in a way such that deficient performance increases the loss and superior performance result in lower loss. For disparity estimation the advantage of having a stereo image pair is utilized. The network processes the left and right input images and predict a disparity map that equals the pixel distances between the two images used to move pixels from one image to the other. For a pixel coordinate [u, v] the left image has a pixel intensity value IR(u, v) = [RL, GL, BL] for a three-dimensional image. The network estimates a disparity at the same position as dL(u, v) = d̃ after having processed the stereo pair. Then the pixel value from the left image is copied to the position [u + d̃, v] in a new image. The pixel value is compared with the same position of the right image to see if the pixel value IR was moved correctly, called a warping-loss or reconstruction error. The warping-loss is defined as LW = |IR(u, v)− IL(u+ d̃, v)| (2.13) Ltot = LW + LSSIM + LReg (2.14) describing the image intensity difference between the matched points of the left and right image. As the disparity is not perfect some pixels will be moved too little or too much causing both empty gaps and positions with two contributions from the left image. 2.2.3.3 Linear interpolation As the predicted disparity values d(x, y) are float values and the positions must be integers the values must be rounded. To simply round the value to the nearest location can lead to loss of information. Instead of using the intensity value of the closest pixel an interpolation method can be applied which interpolates the intensity of pixels around the predicted float value. These values are weighted by the difference of the disparity float value and the integer real positions where respective intensity is taken from. For example, a position equal [u + dx, y] = [10 + 15.5, 20] would calculate the pixel intensity as IL(25, 20) ∗ (25.5− 25) + IL(26, 20) ∗ (26− 25.5) for a linear interpolation. It can also be done for two variables such as the two coordinates (x,y) and is then called bilinear interpolation [26] which can be useful when both coordinates have float values. 10 2. Theory Figure 2.3: Image warp example. To the left is the original right image and to the right is the warped image filled with the real right image according to the mask values. 2.2.3.4 Reconstruction mask Other sources of warping-loss error are large texture-less regions and occluded re- gions. Texture-less regions are areas where the pixel intensities are so similar that it is difficult for a network to find matching features at the correct location. Occluded regions are areas only visible to one of the two cameras. As the network must see two pixels to perform a match this usually causes an error around the areas of edges to objects. A way to tackle the problem is to apply a mask that only give loss contri- butions from areas visible in both images and with enough texture to match pixels with. In that way the network is never punished for errors out of its control. By warping the left image and removing the pixels according to a calculated mask there will be blank spaces. If filled with the correct right image pixels the reconstruction task is eased of these tricky points. An example of how the combination of warped left image filled with true right image pixels can be seen in Figure 2.3. 2.2.3.5 Structural similarity between images Image similarity loss is not always best compared with a MSE loss where a loss calculated with the SSIM values can be more accurate. The structural similarity index measure (SSIM) is a good complement to the MSE as it compares the similarity for a larger area and with a different method. It calculates the similarity of the luminance, contrast and structure between two images. The SSIM equation is as follows: SSIM(x, y) = (2µxµy + C1)(2σxy + C2) (µ2 x + µ2 y + C1)(σ2 x + σ2 y + C2) (2.15) where σ is the mean pixel intensity, µ is the variation and C1, C2 are constants to ensure computational stability. The mean and variation are calculated for each position a kernel is shifted over. The size of the kernel is one important parameter for the similarity measurement which decides how large area that is to be compared. As shifting all pixels one step could yield a large "warping-loss" from the MSE function the SSIM-loss will still be quite small since it focuses more on patterns and the overall kernel similarity [27]. 11 2. Theory 2.2.3.6 Regularization loss A way to increase network segmentation of the disparity map is to include a regu- larization loss in the loss-function. The regularization loss is defined as: LReg = 1 N ∑ i∈N (|∆2 xdi|e−|∆ 2 xIi| + |∆2 ydi|e−|∆ 2 yIi|) (2.16) where N is the number of pixels, di is the disparity value at position i and Ii is the image pixel intensity at position i. The disparity gradients ∆2 xdi are weighted by the image gradients ∆2 xIi such that sharp edges in both the image and disparity map do not result in a high loss. However, intermediate areas with low image gradients will give higher weight and hence large loss if the disparity gradient would be high in the same location. The effect is a smaller variation of disparities for connected areas such that an object normally is given less variation of disparity values. 2.2.4 IR night vision Night vision is the ability to see in low-light conditions. One way to enable vision in dark environments is to use IR LEDs that emits electromagnetic radiation. IR light is defined as rays with wavelengths in the spectrum 700 nm to 1 mm [28] which are not visible to the human eye. Various kinds of IR LEDs produce light with different wavelengths and effects. By using a camera that can create images from reflected IR radiation night vision can be enabled. Two common IR night vision technologies that are used on the market today are far-infrared (FIR) and near-infrared (NIR) systems [29]. FIR cameras are used to detect thermal heat with the wavelength of around 8-12 µm. Warm objects will emit more radiation and will thus be more visible in the image. NIR cameras use near-infrared LEDs that emit radiation with a wavelength of 800 nm that the NIR camera can detect. The main advantage of NIR is the lower cost while FIR offer superior range. 2.2.5 Oriented FAST and Rotated BRIEF (ORB) Oriented FAST and Rotated BRIEF (ORB) is an algorithm which describe and detect local features in images that was published in 2011 by Rublee et al. [30]. The algorithm is an efficient alternative to SIFT [31] and SURF [32] which is often used in computer vision applications like object recognition, image stitching and video tracking. ORB builds on the well-known FAST [33] key-point detector and the BRIEF [34] descriptor. 12 3 Methods In this chapter the methods and performed experiments are presented. The hardware and software setups that were used during the project are first presented and elab- orated on. Thereafter the data collection and network architectures are explained as well as the loss function. Lastly the methods used to evaluate performance and obtain results from are presented. 3.1 Hardware setup In the project a single stereo camera rig was used to gather data and test the online performance with. The TX2 is built around a GPU NVIDIA Pascal™ with 256 NVIDIA CUDA® cores and a total RAM of 8 GB. The camera rig consists of four cameras mounted on a distance 14 cm apart. The cameras are from Leopard Imaging Inc [35]. They can capture images in full-HD at 60 fps, have a focal length of 5 mm and a pixel size of 3.75 µm. With the use of a cut-off filter the cameras can capture images in daylight with the filter and in total darkness without the filter if infrared (IR) light is used. Two 50 W IR LEDs are mounted to enable camera vision in the dark. The rig is equipped with one point laser rangefinder, LiDAR, that measure distances between 0 and 100 meters [36]. A photograph of the camera rig can be seen in Figure 3.1 Figure 3.1: Camera rig used in the project. From left to right are a camera, a 50 W IR LED, a camera, a laser rangefinder, a camera, another 50 W IR LED and a fourth camera. At the back, a Jetson TX2 is mounted and connected to the units. 13 3. Methods In addition to the TX2 a stationary computer with a GeForce RTX3090 GPU [37] was used for training the neural networks. The GPU have 24 GB of G6X-memory and 10,496 CUDA cores. 3.2 Dataset Collection Training self-supervised neural networks require plenty of qualitative image data. The data need to consist of stereo image pairs from cameras with known baseline and focal length. In this project most training data were gathered with the camera rig in different environments. Furthermore, the famous KITTI stereo 2015 dataset was used. 3.2.1 KITTI Stereo 2015 KITTI vision benchmark suite is a project that was introduced by Karlsruhe Insti- tute of Technology and Toyota Technological Institute in Chicago [38]. The purpose of the project is to offer challenging real-world computer vision benchmarks for dif- ferent tasks like stereo disparity estimation, optical flow, visual odometry, 3D object detection and 3D tracking. The stereo dataset consists of rectified stereo image pairs with a resolution of approximately 1242x375 pixels. The dataset contains 8400 image pairs where only 200 of them have ground truth disparities. The ground truth data have been collected with LiDAR. The KITTI benchmark was used to verify that the proposed evaluation methods are accurate enough and to compare the network performance to other published networks. An example stereo image pair together with the ground truth disparity map can be seen in Figure 3.2. (a) Left input image. (b) Right input image. (c) Ground truth disparity map. Figure 3.2: (a) Left input image of stereo pair. (b) Right input image of stereo pair. (c) Ground truth disparity map. 14 3. Methods 3.2.2 Collecting new data The stereo camera presented in Section 3.1 was mounted on moving vehicles and in static positions for different environments. Stereo calibrated images with a resolution of 1080x1920x3 were saved with a frequency of 1 FPS. The calibration was performed using Matlab’s Stereo Camera Calibration App [39]. When a new dataset was collected the following tasks were performed: 1. Synchronized images were captured with the cameras 2. Images were stereo rectified with the calibration parameters 3. Images were labeled and saved on the disk To test performance in different environments data were gathered in three ways. One dataset was gathered indoor to test the performance of an indoor light scenario. Another dataset was collected from a car with the camera mounted on top of the hover while driving around the city. To test the performance in low-light conditions one dataset was collected in a dark warehouse of 600 m2. In this setup the cameras have a static position capturing scenes of moving objects and persons. Night vision was enabled during the gathering of data with use of the IR LEDs and cameras without cut-off filter. 3.2.3 Calibration Calibration is necessary because of distortions and misalignment of the stereo cam- era. Distortions are created by the camera design since small angles and convexity of lenses affect the captured images. Small angular shifts between the stereo camera pair can also appear due to imperfect mounting. Offline stereo calibration compen- sates for most errors that are built into the camera rig. However, during usage the cameras are exposed to changes in temperature, pressure and outer forces which can cause physical changes that need to be compensated for. Then an online cal- ibration strategy can become handy since it can be used to calibrate the cameras continuously during usage. 3.2.3.1 Offline stereo calibration The cameras need to be calibrated to compensate for distortions as well as for an- gular and translational differences between the camera pairs. Offline calibration is performed by first calibrating the individual cameras separately and then stereo calibrate them together. During single camera calibration the intrinsic and extrinsic parameters are estimated individually for each camera. During stereo calibration the transformation (rotation and translation) between the two camera planes is es- timated. The calibration was performed with the Matlab’s Stereo Camera Calibrate App [39]. Multiple images of a chessboard with known dimensions were captured with all four cameras. The images were used as input to the calibration tool app whereby the intrinsic and extrinsic parameters were calculated. An example of an uncalibrated and calibrated image pair can be seen in Figure 3.3. The red channel represents the right image and the two other channels belong to the left image. 15 3. Methods Figure 3.3: To the left is an uncalibrated image pair where red color channel represent the right image and the other channels represent the left image. To the right is the same image but calibrated. 3.2.3.2 Online stereo calibration A way to avoid frequent offline calibration is by using automatic online calibration. By using the ORB algorithm corresponding key-points in the left and right images were found and the y-coordinates of the key-points were extracted. The difference between the y-coordinates of the two images were then used to determine how the images should be vertically shifted to match better. The following equation was used to decide the number of pixels to shift the images. Voffset = 1 N ∑ i∈N yL(i)− yR(i) (3.1) The left image was moved Voffset pixels to align better with the right image. This technique works well to horizontally align the images but unfortunately the dispari- ties make it impossible to use the same technique to align the cameras with respect to the x-coordinate. An example of an online calibrated image can be seen in Fig- ure 3.4. In addition, more advanced methods have proven capable of continuously estimating the stereo camera calibrations despite large initial errors and varying extrinsic parameters [40]. This is considered out of scope and will not be evaluated in the thesis. 3.3 Network architectures Two different network architectures were investigated and implemented. The SH-Net inspired by GA-Net [11] and the XCNN [16] architecture that is a lightweight and simple architecture with promising performance. The networks were implemented in Tensorflow [41] and trained on benchmark datasets as well as on the data collected with the camera rig. A single loss function was developed and implemented for both networks such that the only difference was the network architecture. 16 3. Methods Figure 3.4: Online self-adapting horizontal alignment using ORB. The red channel is from the right image while the green and blue channels come from the left image. To the left is the image pair before calibration, right image is shifted 15 pixels up. In the right image calibration has been performed and the image pair are horizontally aligned. 3.3.1 SH-Net The Stacked Hourglass Network (SH-Net) was designed with inspiration from the feature extracting part of the GA-Net [11] that use a stacked hourglass architecture. The hourglass architecture consists of two down-sample parts with convolutional layers and two up-sample parts with transposed convolutional layers as can be seen in Figure 3.5. The down-sampling parts, called encoders, and the up-sampling parts, called decoders, compress the input to a latent space and then scale up to the input dimension again. To maintain information in the network residual connections, also known as skip-connections, are introduced between the encoders and decoders. These are represented in the figure as the yellow blocks and black arrows. This is a Siamese architecture which enables shared weights between the two identical networks and has proven effective for the stereo disparity estimation task [24]. There is also a rectified linear unit (ReLU) activation function in between every layer to normalize the values and ensure speed and stability in training. The last layer has a hyperbolic tangent activation function that set the output values between -1 and 1. The two identical networks are fed with the left and right camera images with a width and height evenly dividable by 32 to keep consistent dimensions. The shared weights give a connection between the information of the left and right image such that, much like humans, the network can compare differences and estimate the disparity. The network can output one left and one right 2D disparity map with the same height and width as the input images. This network contains 4, 336, 658 trainable parameters. A detailed table with all layers of the SH-Net can be found in Appendix A. 17 3. Methods Figure 3.5: Siamese network architecture of the SH-Net with inspiration from GA- Net. Red layers are convolutional, blue are transposed convolutional and the yellow are adding layers connecting the encoder and decoder parts of the network. The two identical networks share weights through cross-connections (gray arrows) and have skip connections (black arrows) between the encoder and decoder parts. 3.3.2 XCNN In earlier work another cross-connected network architecture was developed called XCNN [16]. Instead of the stacked hourglass architecture this network only has one encoder and one decoder but with similar structure and input/output dimensions. This network contains an encoder with 15 convolutional layers followed by a decoder with 11 transposed convolutional layers. The XCNN architecture can be seen in Figure 3.6. In difference to SH-Net this network has layers specific for the left and right network such that the weights are not shared for some of the layers. The outputs of those layers are instead combined such that contributions from respective side are added in a cross-connected way (black arrows in the figure). However, there are still shared weights for some of the layers and residual connections exist as well. With only one hourglass part and fewer channels for each layer this architecture became more light-weight then the SH-Net. The number of trainable parameters is less, 544, 913, such that the amount of graphic memory is decreased. A detailed table with all layers of the XCNN can be found in Appendix B. 3.3.3 Loss function To train the networks with the stereo images an unsupervised loss function was constructed. By minimizing this loss, the networks improved their ability to make disparity estimations. The predicted disparity map is used to warp the left image pixels horizontally to reconstruct the right image. If the disparity map is perfect 18 3. Methods Figure 3.6: XCNN network architecture with skipping and cross connections vi- sualized. The red layers are convolutional building up the encoder while the blue blocks represent the transposed convolutional layers that defines the decoder. the warped left image equals the right image. However, an imperfect disparity map can be used for defining a loss. The imperfections of the disparity map cause a warped image with inaccurate pixel values. Such a warped image has gaps where no pixels were moved to and positions where multiple pixel values have been moved to. By using a reconstruction mask and bilinear sampling these imperfections could be handled. The gaps were filled with the nearest neighbor values and positions with multiple contributions were removed from the loss. The warped image and the right image were compared with respect to the pixel intensity difference. Both the raw difference from MSE and the more contextual SSIM difference. The warping loss was defined as: Lwarp = W1LMSE +W2LSSIM (3.2) W1 and W2 are weights that can be tuned to make the training more effective. A regularization loss, introduced in Equation 2.16, was added to the loss to make the disparity map more segmented. The regularization loss makes the disparity map smoother where the image gradient is small which results in a more segmented disparity map. The total loss was defined as a detail loss, Lwarp, and a regularization loss, Lreg. Experiments show that a too high regularization loss can make the disparity map consistent for example, only filled with zeros [16]. On the other hand, without a regularization loss the image contrasts caused by details at the same distance from the camera give wrong disparity values. This causes an error in depth estimation and a balance in loss-weights must be found. The total loss function is expressed as: L = W1LMSE +W2LSSIM +W3Lreg (3.3) 19 3. Methods The parameters that can be tuned are the three weights, W1, W2, W3 and other specific loss parameters such as the kernel size of the SSIM filter. Moreover, the loss function is necessary while training the networks but can be removed once the networks are fully trained. The prediction time during usage can then be lowered by removing the loss function of the trained networks. 3.3.4 FOV and optimized disparity zone The horizontal field of view (FOV) is the open observable area that the cameras can see, and it can be calculated with: FOVHorizontal = 2arctan ( width 2f ) (3.4) The cameras had a width of 1920 pixels and focal length of f = 1333.3 [pixels] which gave a horizontal FOV equal 71.51o. Given a baseline B this result in a closest common visible distance to the camera equal Z = tan(54.25o)∗B/2. B is the baseline between the cameras and the angle is given by the geometry. For a baseline of 42 cm the closest common visible point is 29.3 cm from the cameras. Depending on the distance of an object there will be differently large dark zones at the edges of the disparity map. These zones are the areas only visible to one of the two cameras and hence difficult to obtain good disparities from. To increase computational speed and the percentage of accurate disparity estimations the edges can be cut off from both sides of the images. With the smallest depth measure of 3 meter the maximum disparity allowed was set to 189 [pixels]. This create dark zones of the same size at each side of the images which were cut off before input to the network. Furthermore, as interesting objects are usually located in the middle of the images the stereo image pair were cut with 200 pixels from the top and bottom, removing a lot of sky and road in the GBG traffic dataset. In Figure 3.7 the FOV for respective camera and possible occluded areas are presented. The idea was to cut of the dark zones that do not contribute to any qualitative information. In this way the need for memory and computational speed decreases. After trimming the images, they contain a size of 1542×680×3, to fit the network dimensions had to be evenly dividable by 32. With this in mind, the true input size of the images was 1536 × 672 × 3. By decreasing the image size in this way, the amount of input values was decreased from 6220800 pixels to 3096576 which is 50.2% less input values. Trimming the images was one way to increase the speed but performance was evaluated on full-HD images. 3.3.4.1 Optimal baseline Theoretically, wider baselines yield better depth estimations for all visible ranges compared to shorter baselines. For example, if the network predicts a disparity map with one pixel offset it will yield larger distance errors for shorter baselines because of higher percentile errors compared to the same situation with larger baselines. Drawbacks of using a large baseline is that it yields larger occluded areas than short baselines and the minimum distance that can be estimated increases. In Figure 3.8 the disparity to depth relationship as well as the depth error caused by a pixel offset 20 3. Methods Figure 3.7: Field of view visualisation of visible and occluded areas for a stereo par. The dark zones at the edges as well as regions occluded by objects are marked with colors. Blue for areas not visible to camera 2 and the orange areas are not visible to camera 1. In this example there is a person visible to both cameras and a car only visible to camera 2 as the house occlude the car. can be seen. The baselines are 14, 28 and 42 cm which are the possible baselines with the camera rig. The presented theoretical depth error is caused by a 3-pixel positive offset such that the error was calculated with the following equation, E(dL) = depth(dL)− depth(dL + 3) (3.5) where dL is the disparity value and depth is a function converting disparity to depth measurement similar to Equation 2.11. The positive shift resulted in an error relative to the baseline such that larger baseline yield less error for the same disparity offset of 3 pixels. To strengthen the theory an experiment was performed comparing the depth error from predicted disparities with two baselines of 14 cm and 42 cm. The camera rig was placed in a corridor gathering images with a person that walked in front of the camera back and forth in the corridor. More than 200 images were captured with depth information from the one-point laser. The stereo images were stereo rectified for the two baselines with the Matlab stereo calibration toolbox. Then the SH-Net was fine-tuned on the images with pre-trained weights that had more than 100 hours of training on the KITTI dataset with a baseline of 54 cm. After 10 epochs of fine tuning the depth error between the network and laser measurements were evaluated. The same procedure was applied for both baselines. In Figure 3.9 the left image and predicted disparity for the shorter baseline of 14 cm is presented. The predicted distance is 3.44 meter, the laser measured distance was 5.92 meter resulting in an error of -2.48 meter. 21 3. Methods Figure 3.8: Disparity to depth relationship to the left and depth error caused by 3-pixel positive disparity error to the right. The lines show three different baselines that are possible with the given camera rig. Figure 3.9: Depth estimation from stereo camera with baseline of 14 cm in an office corridor. The measurement value in red is the predicted distance, the laser measured 5.92 meter giving an error of -2.48 meters for this image. 22 3. Methods 3.3.5 Occlusion mask One of the drawbacks of the stereo camera as a depth sensor is occluded regions. The offset between the cameras lead to areas that are visible in the left camera but not in the right and vice versa. It is difficult or even impossible for the network to reconstruct pixels from one image to another if it is not visible in both images. Consequently, calculating reconstruction loss on occluded pixels is noisy and will have a negative impact on network performance. Previous work show that an oc- clusion mask applied in the training result in less outliers for the predicted disparity map [16]. A solution that locates occluded pixel regions and exclude them from the calculated reconstruction loss. The occluded pixels were detected in the predicted disparity map through an iterative process. A pixel was classified as an occluded pixel if there existed another pixel in the left image that had been warped into the same coordinates in the right image. The network should be able to correctly warp all non-occluded pixels from the left image into the right. Furthermore, the network has problems predicting disparities in large textureless areas since no distinctive features can be extracted from these areas. False predictions in textureless areas will also cause the network to warp multiple pixels to the same pixel coordinates which will be highlighted by the occlusion mask. In Figure 3.10, an example situation can be seen with an occlusion. In the top left image, the blue car is almost completely visible and in the top right image the rear of the blue car is not visible. The network fails to correctly reconstruct the left image by warping the right input image. The rear of the blue car cannot be reconstructed since the network was not able to match pixels in this area. Moreover, the dark zones, only visible to the left camera, were removed according to the occlusion mask as can be seen in the same figure (d)) where for example the left side was removed. The occlusion mask can also be used post training to refine predictions in occluded areas. 3.4 Evaluate depth accuracy The depth measurement accuracy is one crucial evaluation parameter that was used to compare performance and robustness of proposed methods. The depth accuracy was evaluated in two ways. One method was to compare the measurements from the stereo camera with the on-board one-point laser range finder value, another method was to use the key-point matching algorithm ORB. 3.4.1 Depth evaluation using LiDAR With the one-point laser range finder an accurate depth measurement was gathered with each stereo image pair. The on-board laser measurements have an accuracy of 0.1 m for a 70% reflective target at 200C. The laser was mounted in the middle of the camera rig pointing in the same direction as the four cameras. On top of a garage roof the camera was directed towards a small entrance where it captured images and laser measurements from different distances between 3 and 80 m. A left image, 20.84 meters from the stereo cameras, can be seen in Figure 3.11. 23 3. Methods (a) Left input image. (b) Right input image. (c) Warped image from right to left. (d) Occlusion mask. Figure 3.10: The image has been warped using bilinear sampling based on disparity map predicted from left and right input images. The occlusion mask has highlighted occluded regions which can be seen as the black regions in the mask. Figure 3.11: Left stereo camera image that depth was predicted from. On top of a garage roof with a distance of 20.84 m to the small entrance building according to the laser measurement. 24 3. Methods The networks evaluated were the SH-Net and XCNN that were trained on the com- plete KITTI dataset and the custom created GBG traffic dataset. The networks were also fine-tuned on the evaluation images such that the specific environment on top of the garage was learnt. This training was done until no further improve- ments could be noted. Evaluation was performed such that the distance given from a laser measurement was compared with the network predicted distance for a point in the center of the garage entrance-building. The predicted distance measurement was the mean value of 200 pixels belonging to the object. The absolute difference of the predicted mean value and the laser point measurement was defined as the measurement error. 3.4.2 Depth evaluation using ORB Since no ground truth data exists on the datasets collected it was difficult to evaluate network accuracy. One way to create ground truth data was to use ORB [30]. An algorithm was implemented in Python that took a stereo image pair as input and then returned pixel coordinates of matching features and their pixel disparities. The ORB algorithm was utilized to find pixel coordinates of matching key-points and descriptors. The ground truth disparity map was then computed by finding the horizontal pixel difference between key-points in the left and right image. Often, the algorithm produced a huge number of matching features but only a few matching key-points were accurate enough to compute ground truth disparities from. The algorithm sorted the matches in a list based on the certainty in ascending order with higher certainty in the front. The designed algorithm had an input parameter for the percentage of how many matching key-points that should be returned such that only good matches were obtained. A trade-off was made between accuracy and number of key-points when deciding the percentage of key-points to include as ground truth disparities. Different percentages were tested and it was decided that 2% of the found key-points were accurate enough to be used as ground truth pixels. The algorithm was evaluated to ensure that it could produce accurate ground truth values. This evaluation was performed on the KITTI benchmark dataset where the disparities from the ORB algorithm were compared with the corresponding LiDAR ground truth values. Approximately 100 matching key-points were found in each KITTI stereo image pair that could be used to compute disparities. However, the KITTI ground truth disparity maps are sparse and many of the computed disparities did not match a ground truth disparity value at that pixel coordinate. Due to the sparsity approximately 22 computed disparities per image could be evaluated against the ground truth data. In Figure 3.13a, 3.13b and 3.13c example image pairs from different datasets can be seen together with the key-points ORB has located in both images. The computed disparities that are evaluated against the ground truth are located randomly in the image which gave a good indication of the ORB algorithm performance. In Figure 3.12 the results can be seen from the evaluation, indicating that the mean absolute error was 0.77 px, variance 0.67 px and the standard deviation 0.82 px. Even though ORB can estimate accurate disparities from stereo image pairs it is not suitable in a depth sensor application. The algorithm is too slow and the amount of 25 3. Methods data that are retrieved per second are too sparse compared to the speed and amount of data retrieved from CNNs. Figure 3.12: Error distribution of the ORB estimations compared with KITTI ground truth. 3.4.3 Depth evaluation in low-light conditions using IR light For evaluation of the distance measurement with the cameras in low-light conditions a dark warehouse was visited. In the dark environment the camera rig with an IR LED of 50 W was aimed at an object that was moved between 2.5 and 22.5 meters from the cameras while capturing images. The distance to the object was also measured with the one-point laser such that ground truth values were saved. In Figure 3.14 an example image taken with one of the cameras in the dark can be seen. The networks used for depth evaluation were trained for more than 100 hours on the KITTI and GBG traffic dataset. The networks were also fine-tuned for 20 epochs on 550 image pairs captured in the same way as Figure 3.14 with moving objects in front of the camera. 3.5 Online self improving ability With the self-supervised loss function that was used during training the networks can be self-improving and adapt themselves to new unseen environments. The stereo images taken by the camera can be used to predict a disparity map and simulta- neously be used to calculate a reconstruction loss. The input images will serve as pseudo ground truth which will enable the network to fine-tune its weight parame- ters continuously during usage. An image from the GBG dataset was chosen to be the model of the adaptive improvement. The untrained SH-Net predicted a disparity map for the image, then it trained on 50 other images from the GBG dataset after which another prediction was made on the same image. The adaptive behavior is presented in Figure 4.7. The same procedure can be applied for a trained network with lower learning rate enabling a fine-tuning variant of adaption. While running the network in a self-improving mode the runtime is increased considerably since many operations are calculated in the loss function. In contrast, for a trained net- work without a loss function, only the operations of the network must be calculated which lower the computational cost during predictions. 26 3. Methods (a) (b) (c) Figure 3.13: (a) Image pair from Kitti benchmark dataset together with ORB key-points. (b) Image pair from GBG dataset together with ORB key-points. (c) Image pair from IR dataset together with ORB key-points. Figure 3.14: Image captured with use of IR light in a dark warehouse. 27 3. Methods 28 4 Result and discussion The results obtained from experiments and measurements described in the previous chapter will be presented and discussed here. 4.1 Optimal baseline From the experiment presented in Section 3.3.4.1 the depth estimation accuracy of a stereo camera with baseline 14 cm and 42 cm can be compared. The errors from the depth predictions with SH-Net compared with ground truth laser measurements for the two baselines are presented in Figure 4.1. The mean error is closer to zero for the larger baseline and for short distances the maximum error is larger for a baseline of 42 cm compared with 14 cm. On the other hand, for long distances the opposite applies such that the shorter baseline gives larger error outliers. In Figure 4.2, the distribution of the errors calculated from the two baselines are presented. Increasing the baseline from 14 cm to 42 cm seem to decrease both the mean error and the error variation. With 14 cm between the cameras the mean error was -3.20 meter and the standard deviation was 1.84 meter. Increasing the baseline to 42 cm results in a mean error of -0.47 meter and a standard deviation of 1.74 meter. These results indicate that for distances between 3 and 22.5 meter a baseline of 42 cm have better performance than a baseline of 14 cm. Figure 4.1: Depth errors calculated for the two baselines of 14 and 42 cm. The measured errors are plotted together with the mean value for both baselines at distances between 2.5 and 22.5 meter. 29 4. Result and discussion Figure 4.2: Distribution of the measured errors for the two baselines of 14 and 42 cm. The error measured with baseline equal 42 cm is closer to zero and less spread. 4.2 Numerical results In this section the numerical results from the evaluation of the stereo camera as a depth sensor are presented. The evaluation has been performed using the software Keras [42] with a TensorFlow backend [41]. Two networks, SH-Net and XCNN, have been implemented and evaluated separately. The performance of the networks was first evaluated on KITTI 2015 benchmark to compare performance with state- of-art network architectures. Thereafter the real-life depth estimation performance of the networks combined with the binocular camera were evaluated. The real-life performance was evaluated by training and evaluating the networks on self-collected data captured with the binocular camera presented in Section 3.1. Ground truth data were created by measuring distances to objects with a one-point laser range finder and by using the ORB algorithm presented in Section 3.4.2. 4.2.1 KITTI benchmark The two networks implemented were evaluated on the KITTI 2015 benchmark dataset. The dataset contains 200 evaluation images with ground truth data that has not been seen during training. The performance was evaluated by calculat- ing "D1-all" which represents the percentage of outliers averaged over all the ground truth pixels of the 200 test images. A pixel was classified as an outlier if the disparity was falsely predicted with more than 3 pixels. The results from the evaluation can be seen in Table 4.1 and are compared to LWA-Net which is one of the most recent state-of-art self-supervised network architectures with an impressive runtime. The runtimes have been calculated for predictions on the Jetson TX2 with 256 CUDA cores. What can be observed from the results is that LWA-Net is the best perform- ing network on this benchmark. XCNN has better accuracy than SH-Net but is the slowest one. 30 4. Result and discussion Table 4.1: Evaluation results on Kitti 2015 benchmark for different self-supervised network architectures. Unavailable data are noted as -. Network Parameters (million) D1-all (%) Average Runtime (s) RMSE Input size (pxl) SH-Net 4.336 18.7 0.45 7.33 320x1216x3 XCNN 0.545 7.711 0.50 3.6 320x1216x3 LWA-Net 0.098 4.94 0.20 - 320x1216x3 4.2.2 Depth sensor evaluation The networks were trained on a dataset collected with the outermost cameras with baseline 42 cm on the stereo camera rig described in Section 3.1. The evaluation was performed both in daylight and in low-light conditions where the datasets were collected as described in Section 3.2. 4.2.2.1 Depth evaluation using ORB The depth sensor was evaluated using ground truth depth computed with ORB. The evaluation dataset contains 40 images captured during daylight that has not been seen during training. Approximately 130 ground truth depth pixels were computed per image in the evaluation dataset. To make a thorough evaluation of the networks and understand their strengths and weaknesses the evaluation has been performed on different distance intervals. During evaluation it was noted that sometimes the networks made false predictions with several hundred meters for distances longer than 80 meters. Therefore, the decision was made to not include predictions longer than 80 meter in the result. The evaluation result for XCNN and SH-Net can be seen in Table 4.2. The average runtimes are calculated on a GeForce RTX3090 with 10,496 CUDA cores and on the Jetson TX2 with 256 CUDA cores. In Figure 4.3a and 4.3b the error measurements are plotted for both networks. The depth absolute mean error and standard deviation of SH-Net were 10.25 and 13.94 meter respectively. The depth absolute mean error and standard deviation of XCNN were 11.37 and 17.9 meter. What can be noted from the plots and in the table is that the prediction error increases with the distance which is not the case for the disparity error that seems to decrease with the distance. Both networks have the best accuracy for distances from 3 to 20 meter where SH-Net is slightly better. 4.2.2.2 Laser point evaluation Table 4.3 presents the real and relative distance errors between predicted values and measured distances with a laser rangefinder. The overall spread of errors for this result is presented in Figure 4.4. For short distances the prediction consistently overshot and for the distances longer than 20 meter the predicted distances mostly undershot. The SH-Net has less outliers while XCNN tend to predict too long distances and spread out the predictions more, especially for the longer distances. 31 4. Result and discussion (a) (b) Figure 4.3: (a) Error distribution of predictions from SH-Net on GBG dataset. (b) Error distribution of predictions from XCNN on GBG dataset. Figure 4.4: Error depth calculated between predicted disparity estimation and laser measurement. Images of a garage entrance building at distances between 3 and 73 meters were input to the SH-Net and XCNN network to estimate the depth. 32 4. Result and discussion Table 4.2: Depth and disparity evaluation result from the two network predictions compared with ground truth from the ORB algorithm on GBG dataset. Network SH-Net XCNN Input size [pxl] 1920x1056x3 1920x1056x3 Runtime on GeForce RTX3090 [s] 0.048 0.055 Runtime on Jetson TX2 [s] 1.78 2.24 Baseline [cm] 42 42 Disparity absolute mean error all [pxl] 6.73 7.61 Disparity absolute error standard deviation all [pxl] 9.55 13.5 Disparity absolute error mean 3 -> 20 meter [pxl] 9.73 9.31 Disparity absolute error mean 20 -> 40 meter [pxl] 6 5.53 Disparity absolute error mean 40 -> 60 meter [pxl] 4.87 5.37 Disparity absolute error mean 60 -> 80 meter [pxl] 4.12 4.12 Depth absolute mean error all [m] 10.25 11.37 Depth absolute error standard deviation all [m] 13.94 17.94 Depth absolute error mean 3 -> 20 meter [m] 3.26 4.09 Depth absolute error mean 20 -> 40 meter [m] 8.45 9.32 Depth absolute error mean 40 -> 60 meter [m] 13.39 15.57 Depth absolute error mean 60 -> 80 meter [m] 18.5 21.98 Network Error mean Error std Relative error mean Relative error std SH-Net 8.79 3.51 62.80% 92.51% XCNN 13.56 4.24 150.15% 152.36% Table 4.3: Depth error of network predictions from images captured in daylight on top of a garage roof. The error is defined from the predicted depth compared with laser measurements for distances between 3 to 73 meters. 4.2.2.3 Performance in low light conditions using laser The performance of the networks in low light conditions were evaluated in the same way as the laser point measurement evaluation but in a dark warehouse and with use of IR light. The resulting error distribution of distances between 2.5 and 22.5 meter can be seen in Table 4.4 and Figure 4.5. The network tends to estimate longer distances than what the actual distances were. The SH-Net has a more consistent error with uncertainty that grow with the distance. The mean error was 0.974 meter in average, the standard deviation was 0.820 meter and the largest error was 4.31 meter. The XCNN was more inconsistent and had some predictions with larger errors. The mean error was 3.649 meter and standard deviation was 2.152 meter where 6 of the 65 predictions have the largest contribution. The largest error was 36.12 meter but except for the 6 outliers most predictions were as accurate for the XCNN as they were for SH-Net. Apart from the outliers the XCNN depth errors are more spread at longer distances but the relative error decreases as the distance grow. 33 4. Result and discussion Network Distances Error mean Error std Relative error mean Relative error std SHNET 2 < x <22.5 0.974 0.820 10.03% 25.82% XCNN 2 < x <22.5 3.649 2.152 25.63% 51.63% Table 4.4: Depth error from SH-Net and XCNN network prediction compared with laser measurements for distances between 2.5 to 22.5 m. Images were captured in a dark warehouse with IR lights as the only source of light. Figure 4.5: Depth errors from 65 image pairs collected in the dark with IR light as only source of light. Distances between 2.5 and 22.5 meter were measured and the error for the two networks, XCNN and SH-Net, were calculated as the difference of predicted depth and measured depth using laser. 4.2.2.4 Performance in low light conditions using ORB The performance of the networks was further evaluated in the dark warehouse using ORB. Approximately 55 ground truth depth pixels were computed per image for this evaluation dataset. The result from the evaluation of XCNN and SH-Net can be seen in Table 4.5. In Figure 4.6 the error measurements are plotted for both networks. The depth absolute mean error and standard deviation of SH-Net were 2.49 meter and 5.43 meter. The depth absolute mean error and standard deviation of XCNN were 4.56 meter and 10.87 meter. What can be interpreted from the table and plots is the same behavior as in daylight. The depth error increases with distance while the disparity error decreases. It can also be noted that the depth prediction accuracy seems to be better for images captured in IR light compared to daylight. Figure 4.6: IR depth estimation errors for distance 0-35 meter for IR dataset. 34 4. Result and discussion Table 4.5: Depth and disparity evaluation result from the two network predictions compared with ground truth from the ORB algorithm on IR dataset. Network SH-Net XCNN Input size [px] 1920x1056x3 1920x1056x3 Baseline [cm] 42 42 Disparity absolute mean error all [px] 5.46 9.35 Disparity absolute error standard deviation all [px] 8.03 17.65 Disparity absolute error mean 3 -> 20 meter [px] 5.85 10.71 Disparity absolute error mean 20 -> 40 meter [px] 4.65 6.13 Depth absolute mean error all[m] 2.49 4.56 Depth absolute error standard deviation all [m] 5.43 10.87 Depth absolute error mean 3 -> 20 meter [m] 0.6 1.57 Depth absolute error mean 20 -> 40 meter [m] 6.48 11.61 Figure 4.7: Example of SH-Net adapting to data through training on the GBG traffic dataset. The network start with untrained weights and improve during 3000 training steps. 4.3 Adaptive performance The networks can adapt to different environments by continuously updating the weight parameters during usage as depth sensor. An improvement of the untrained SH-Net while training on the GBG traffic dataset is presented in Figure 4.7. The improvements are distinct for the first 450 training steps and then learning is slower. For some steps there is no improvement or even a deterioration for the prediction result. In this setup the learning rate was lr = 0.0005 and the training weights introduced in Equation 3.3 were set to W1 = 0.5, W2 = 0.8 and W3 = 0.3. Further- more, fine-tuning already trained weights was also done during the project and with a balanced learning rate the network can adapt both fast and correct to new scenes. 4.4 Visual results A visual evaluation of predicted disparity maps was performed to better understand the strengths and weaknesses of the networks. 35 4. Result and discussion 4.4.1 GBG traffic dataset In Figure 4.8b the best SH-Net prediction result from the GBG evaluation dataset can be seen. The result of this prediction had a depth relative mean error of 10.3% and depth error mean of 1.32 m. When observing the predicted disparity map the prediction looks accurate where the car, trees, buildings and other relevant objects have been given reasonable disparity values. In Figure 4.8d the worst SH-Net pre- diction result from the GBG dataset can be seen. The result from this prediction had a depth mean relative error of 137% and depth error mean of 28.65 m. When observing the predicted disparity map, it seems like the network has captured the car correctly but had problems with the fence and the houses in the background. The horizontal subway wires also caused too high disparities in the sky leading to large distance errors. In Figure 4.9b the best prediction result from XCNN can be seen from the GBG traffic dataset. The result of this prediction had a depth relative mean error of 12.9% and depth error mean of 1.85 m. When observing the predicted disparity map it seems like the network captured the car in the middle correctly but had some problems with the left and right car as well as the road. In Figure 4.9d the worst prediction result can be seen from XCNN from the GBG dataset. The result from this prediction had a depth relative mean error of 111% and depth error mean of 36.6 meter. When observing the predicted disparity map, it seems like the image did not contain any distinct object and the network had problems predicting the road and the tress correctly. 4.4.2 IR dataset In Figure 4.10b the best prediction result from SH-Net can be seen from the IR evaluation dataset. The result of this prediction had a depth relative mean error 4.7% and depth error mean is 0.37 m. When observing the predicted disparity map the prediction looks accurate, both persons and the trash bins were captured by the network. In Figure 4.10d the worst prediction result can be seen from the IR dataset. The result from this prediction had a mean relative error of 22% and depth error mean of 6.75 m. When observing the predicted disparity map it seems like the prediction is accurate, but the network had some problems with the area in the lower left corner. In Figure 4.11b the best prediction result from XCNN can be seen from the IR evaluation dataset. The result of this prediction had a depth relative mean error of 5.9% and depth error mean of 0.32 m. When observing the predicted disparity map the network has captured the persons and objects correctly. The contours of the objects are sharper compared with the prediction from SH-Net. In Figure 4.11d the worst XCNN prediction result for the IR dataset can be seen. The result from this prediction had a depth mean relative error of 53% and depth mean error of 12.79 meter. When observing the predicted disparity map, it seems like the network have captured the objects correctly but had some problems with the background. 36 4. Result and discussion (a) Left input image of best prediction. (b) Best predicted disparity map. (c) Left input image of worst prediction. (d) Worst predicted disparity map. Figure 4.8: (a) Left input image that produced best prediction result from SH-Net. (b) Best predicted disparity map from SH-Net, relative error mean 10.3%. (c) Left input image that produced worst prediction result from SH-Net. (d) Worst predicted disparity map from SH-Net, relative error mean 137%. 37 4. Result and discussion (a) Left input image of best prediction. (b) Best predicted disparity map. (c) Left input image of worst prediction. (d) Worst predicted disparity map. Figure 4.9: (a) Left input image that produced best prediction result from XCNN. (b) Best predicted disparity map from XCNN, relative error mean 12.9%. (c) Left input image that produced worst prediction result from XCNN. (d) Worst predicted disparity map from XCNN, relative mean error 111%. 38 4. Result and discussion (a) Left input image. (b) Best predicted disparity map. (c) Left input image. (d) Worst predicted disparity map. Figure 4.10: (a) Left input image that produced best prediction result from SH-Net. (b) Best predicted disparity map from SH-Net, relative error mean 4.7%. (c) Left input image that produced worst prediction result from SH-Net. (d) Worst predicted disparity map from SH-Net, relative error mean 22%. 39 4. Result and discussion (a) Left input image. (b) Predicted disparity map. (c) Left input image. (d) Predicted disparity map. Figure 4.11: (a) Left input image that produced best prediction result from XCNN. (b) Best predicted disparity map from XCNN, relative error mean 5.9 %. (c) Left input image that produced worst prediction result from SH-Net. (d) Worst predicted disparity map from SH-Net, relative error mean 53 %. 40 4. Result and discussion Figure 4.12: Disparity to depth relation for two baselines with the distances 42 cm and 252 cm. The corresponding disparity for 112 meter measurements are plotted together with a 3-pixel positive offset. The offset causes an error in depth measure- ment with more effect on the shorter baseline. 4.5 Discussion In this section a discussion of the presented result will take place. With a focus on why the result look like it does this part will connect the result and conclusion. 4.5.1 Optimal baseline Previous work with stereo camera concludes that a short baseline is better at pre- dicting depth at short distances while a larger baseline increase performance at longer distances [43]. However, the theory and experiments performed in this work indicate the opposite. With respect to depth measurement accuracy a wider baseline is always more accurate and robust compared to a smaller baseline setup. The only drawback of using wider baselines is the increased number of occluded regions and that the shortest measurable distance increases. Instead of evaluating baselines less than 42 cm it would be interesting to learn the accuracy and robustness of base- lines up to several meters. The relation between disparity and depth, for baselines with distances 42 cm and 252 cm, is presented in Figure 4.12. A positive 3-pixel offset is plotted for the baselines as well as the resulting depth for such an error. At 112 meters the predicted depth is 31.8 meters more wrong for the 42 cm baseline compared with the 252 cm baseline. A distance of 112 meters is therefore estimated more robustly with the wider baseline. The result from this thesis indicates that distances from 40 meter and more are difficult to predict reliable disparity maps for with the current maximum baseline of 42 cm. However, as the performance between 41 4. Result and discussion 3 to 20 meter was much better, increasing the baseline should make the disparity prediction more accurate and robust. For example, if a baseline of 252 cm would be used this should give the same accuracy for distances from 18 to 120 meters as the baseline of 42 cm gives for 3 to 20 meters. It should be possible to setup three cameras such that one long and one short baseline are obtained and thus both short and long distances can be measured more accurate and robust. 4.5.2 Quality of data The quality of the training data is very important. If the data is ill calibrated, have too little variance or other disturbances it will be more difficult for the networks to learn disparity estimations. The quality of the KITTI dataset is probably better than the data collected with the camera rig since it was a larger project with the only focus of creating qualitative data. When gathering the GBG dataset small errors can have corrupted the data. In that case it was most probably due to disturbances during the capturing of stereo pairs or due to an imperfect camera calibration and image rectification. If the images were not captured in perfect synchronization objects might move during the time difference of capturing the right and left image. Then the disparity map is no longer valid for the objects that moved. Likewise, if any calibration parameter is wrong or the rectification is not perfect the disparity input to the networks will be corrupted leading to bad predictions. It will also affect the disparity to depth relationship since the focal length was calculated in the calibration. 4.5.3 Network performance The networks used in this project do not perform well on the KITTI benchmark dataset compared with other networks. Even though the benchmark dataset perfor- mance give an indication of how qualitative the disparity estimations are it does not entail that the general depth measurements are equally good. Presented experiments indicate that the quality of network depth predictions do not always align with the performance on benchmark datasets. The XCNN network was much better than SH-Net on the KITTI benchmark but for the depth measurements on data collected with the camera the SH-Net had better performance. There are many possible rea- sons why this is the case but since the loss-function, training parameters and data are the same it is probably the network architecture that have the largest impact. The two networks are very different and with respect to trainable parameters the SH-Net is more than seven times larger. It is possible that SH-Net can learn more complex features and patterns since there are more parameters to change. However, the network could learn a too complex pattern such that unimportant details in images have more effect on the disparity map. A less segmented and more noisy disparity map are two of the possible negative consequences of this. XCNN have layers that do not share weights between the parallel lanes, and this could affect what intermediate features that are obtained. For example, an occluded region only visible to one of the cameras could be processed in two different ways in the layers that do not share weights in XCNN. This would be more complicated for SH-Net 42 4. Result and discussion since all weights are shared and the left and right input image will therefore be processed in more similar ways. The relation between disparity and depth has a large impact on the performance and the distance to objects in an image will therefore impact the accuracy that depth is measured with. Another parameter affecting the quality of measurements are the cameras. Imperfect lenses, sensors and calibration can impact how the images are captured such that assumptions regarding focal length, pixel size, baseline and hor- izontal alignment become invalid. With use of ground truth values obtained with laser measurements and the ORB algorithm the accuracy of depth measurements can be evaluated. By evaluating the depth rather than the disparity the true per- formance of a stereo camera as a sensor can be evaluated. Most published projects only measure the disparity accuracy for a given benchmark dataset. In order to obtain information about the performance of a stereo camera as depth sensor we propose that also the depth accuracy is evaluated. With use of LiDAR or matching algorithms the performance can be evaluated numerically. 4.5.4 Adaptive performance An adapting behavior can be beneficial for the network if the general performance can be preserved. We show that it is possible to implement an adaptive depth sensor, through self-supervised learning, but not how well such a system would perform in real situations. In order to be useful, the time to adapt and quality obtained after such an adaption must be evaluated further. The obtained result gives an indication that adaptive training is both effective and useful for new unseen environments. 4.5.5 IR depth accuracy Surprisingly the network performance with IR light in darkness was better than outdoor in daylight. Exactly why this is the case is difficult to find out but one major difference between the input data are the color channel intensities. In daylight images the color channels intensities are more similar. For the IR images the green color channel only contain low intensities while the blue and red channels are more centered. An example of the color spectra for one daylight and one IR light image can be seen in Figure 4.13. Another mayor difference is that for IR images the camera rig was placed in a static location such that it could fit the network to the environment during training. The daylight images from GBG traffic dataset are all taken from different positions such that the background change on all images while training. These two differences are the two most probable reasons for the result obtained. 4.5.6 Visual performance in daylight When comparing the visual performance of SH-Net and XCNN the SH-Net seems to give better estimations of the background compared to XCNN. In Figure 4.14b the worst prediction from XCNN can be seen compared to the prediction from SH-Net in Figure 4.14c. What can be observed is that XCNN have trouble predicting the road and the forest on the right side of the road. This is not the case for SH-Net which 43 4. Result and discussion Figure 4.13: Color distribution for red, green and blue channels taken from two example images. One image taken in daylight on a road and another image in a dark warehouse with IR as light source. (a) Left input image. (b) XCNN prediction. (c) SH-Net prediction. Figure 4.14: Prediction comparison between XCNN a