Semantic Scene Change Detection Evaluation through Classical & Machine Learning Algorithms

Master's thesis in Computer Science and Engineering

Jithinraj Sreekumar
Shreya Desai

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2021

© Jithinraj Sreekumar and Shreya Desai, 2021.

Supervisor: Peter Damaschke, Department of Computer Science and Engineering
Advisor: Per Nilsson Lundberg, CEVT AB
Examiner: Carl-Johan Seger, Department of Computer Science and Engineering

Master's Thesis 2021
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2021

Abstract

Scene change detection helps to detect changes in a pair of multitemporal images of the same scene. We apply the concept of scene change detection to detect misplaced objects in a passenger vehicle. Deep learning neural networks have been extensively used in scene change detection. We study scene change detection using the classical Watershed algorithm and machine learning algorithms. In machine learning, we exploit the feature extraction capability of ResNet and Spatial Pyramid Pooling to predict the scene change.
The performance of the classical and machine learning algorithms is also compared. The models are trained on a custom dataset and evaluated using the metrics dice coefficient, mean intersection over union (mIoU) and pixel accuracy. We infer that the machine learning model significantly outperforms the classical model in terms of mIoU score.

Keywords: scene change detection, machine learning, semantic segmentation, convolutional neural network, residual neural network, siamese network, spatial pyramid pooling

Acknowledgements

We would like to express our sincere gratitude to our academic supervisor, Peter Damaschke, our advisor at CEVT AB, Per Nilsson Lundberg, and our examiner, Carl-Johan Seger, for their insightful input, guidance and support throughout the thesis. We would also like to extend our gratitude to Anders Werner for giving us the opportunity to carry out our thesis at CEVT AB. We are also grateful to our family and friends for their relentless encouragement.

Jithinraj Sreekumar and Shreya Desai, Gothenburg, February 2021

Contents

List of Figures
1 Introduction
2 Technical Background
  2.1 Watershed Algorithm
    2.1.1 Morphological Transformation
  2.2 Neural Networks
    2.2.1 Initialization
    2.2.2 Convolutional Neural Networks
    2.2.3 Residual Neural Network (ResNet)
    2.2.4 Global Average Pooling (GAP)
    2.2.5 Conv 1x1
    2.2.6 Rectified Linear Unit (ReLU)
    2.2.7 Loss Function
    2.2.8 Optimizers
    2.2.9 Learning Rate Scheduler
    2.2.10 Metrics
    2.2.11 Transfer Learning
    2.2.12 ResNet50-Siamese
      2.2.12.1 Architecture of ResNet50
      2.2.12.2 Reconstruction Network
      2.2.12.3 Siamese Network
    2.2.13 Pyramid Pooling Module
3 Literature Review
  3.1 Scene Change Detection
  3.2 Semantic Segmentation
  3.3 Residual Neural Network
  3.4 Global Average Pooling (GAP)
  3.5 Optimizers
  3.6 Batch Normalization
  3.7 Spatial Pyramid Pooling
  3.8 Activation Function
  3.9 Loss Function
4 Methods
  4.1 Dataset and Dataloader
  4.2 Watershed Algorithm
  4.3 ResNet50-Siamese
  4.4 Spatial Pyramid Pooling
5 Results
  5.1 Watershed Algorithm
  5.2 ResNet50-Siamese
  5.3 Spatial Pyramid Pooling
  5.4 Classical vs. Machine Learning
6 Discussion and Conclusion
  6.1 Discussion
    6.1.1 Watershed Algorithm
    6.1.2 Machine Learning Algorithm
      6.1.2.1 Choice of Residual Neural Network
      6.1.2.2 Choice of Transfer Learning
      6.1.2.3 Choice of Image Spatial Dimension
      6.1.2.4 Choice of Loss Function
      6.1.2.5 Choice of Optimizer and Learning Rate Scheduler
  6.2 Conclusion
A Appendix 1
  A.1 Kernel and Output Shape
  A.2 Training and Validation Loss - ResNet50-Siamese
B Appendix 2

List of Figures

1.1 Scene change detection
1.2 Semantic segmentation
2.1 RGB image pair
2.2 Absolute difference of RGB images (grayscale)
2.3 Neuron
2.4 Convolutional block and feature maps
2.5 Residual block
2.6 Global average pooling
2.7 Conv 1x1
2.9 Dice coefficient
2.10 Intersection over Union (IoU)
2.11 ResNet50 architecture
2.12 Siamese_Res50_Fuse_Net
2.13 Siamese_Res50_Diff_Net
2.14 Example of SPP
3.1 Spatial Pyramid Pooling presented in the paper
3.2 Example of Spatial Pyramid Pooling
4.1 RGB input image 1
4.2 RGB input image 2
4.3 Difference image
4.4 Morphological transformation
4.5 Watershed final output
5.1 (a) Image 1 (b) Image 2
5.2 IoU: 76.67 (a) Difference image (b) OTSU threshold (c) Watershed (d) Final output
5.3 ResNet50-Siamese: Inference results on two thousand training dataset
5.4 ResNet50-Siamese: Inference results on sixteen thousand training dataset
5.5 (a) Image 1 and (b) Image 2
5.6 From (a) to (d) Left pane: Ground-truth labels; Right pane: Predicted results, (a) Fuse, (b) Diff, (c) Fuse+GAP, (d) Diff+GAP
5.7 ResNet50-Siamese: Inference results for the given image pair
5.8 SPP: Inference result on two thousand images dataset
5.9 SPP: Inference result on sixteen thousand images dataset
5.10 SPP: Prediction accuracy graph (sixteen thousand images dataset)
5.11 Comparison of classical and machine learning models
A.1 Kernel and Output Shape
A.2 Training and Validation loss (a) Fuse (b) Diff (c) Fuse + GAP (d) Diff + GAP
B.1 SPP Losses
1 Introduction

Scene change detection is an appealing subject in computer vision. It is a process that detects changes in two multitemporal images of the same scene, as shown in Figure 1.1.

Figure 1.1: Scene change detection

The basic idea of scene change detection is to detect the change in an image. The following scenario of a passenger boarding a vehicle illustrates the process. First, two different images are captured as the passenger enters and exits the vehicle. The images are captured using multiple cameras mounted at an angle above the back seat of the vehicle. The two images, captured at different timestamps, are used as input to an algorithm for detecting objects in the interior scene of the vehicle. The objects may be a bag, a mobile phone, keys, an umbrella, etc. The aim is to identify objects that were not previously present in the vehicle, so that the passenger can be notified about the misplaced objects. The algorithm should also ensure that no items belonging to the driver are detected if they have been placed in the back seat along with the passenger's items. This concept can be applied to taxis or other passenger vehicles.

Scene change detection using classical computer vision algorithms has been in practice for a long time [1, 2]. The ability to learn solutions from observational data makes machine learning an interesting technique to use for scene change detection. Machine learning provides powerful tools that can build complex applications, learn semantics, and extract useful features from images and videos. It addresses the limitations of classical techniques by taking on tasks associated with the human brain, which is capable of recognising objects and patterns, making visual classifications, and so on [3, 4, 5, 6].
Various scene change detection methods are studied and applied in remote sensing, street monitoring, etc. To detect the changes in a scene, there must be a technique to distinguish the objects in images. This can be achieved by image segmentation. Image segmentation is the process of extracting meaningful information from an image by segregating the pixels. The pixels can be characterized based on texture, intensity level, shape, etc. Image segmentation can be further divided into semantic segmentation and instance segmentation. Semantic segmentation is a process that associates each pixel in an image with a class label, while instance segmentation is a process that assigns a unique ID to each object. Semantic segmentation is an important component of scene parsing. Figure 1.2 illustrates the process of semantic segmentation. Here, the object pixels (foreground area) in the image are associated with a class label.

Figure 1.2: Semantic segmentation

A convolutional neural network is considered the backbone for most image segmentation tasks. It forms a powerful architecture for feature extraction and classification. It is good at processing the data to create a large feature space from an input image, which is encoded in the architecture. It allows the network to learn the appropriate features for a given task by itself, which makes it suitable for image-focused tasks [7]. This ability of the convolutional neural network has led researchers to explore its use for image segmentation. There are several architectures in the field of convolutional neural networks, for example, GoogleNet [8], AlexNet [9], VGGNet [10] and ResNet [11]. A convolutional neural network with a deeper architecture can provide more accurate prediction results. However, with deeper networks, it is more difficult to train the model and to draw a conclusion about its accuracy. The Residual Neural Network (ResNet) attempts to mitigate this problem.
ResNet is a popular convolutional neural network architecture. There are different versions of ResNet, which contain 18, 34, 50, 101 or 152 layers. The significance of the layers and their selection will be discussed in the following chapters.

The main objectives of this study are:
1. Implement an existing classical algorithm and evaluate its performance on the custom (user-defined) dataset.
2. Implement new machine learning algorithms inspired by the existing studies and evaluate their performance.

The work is divided into six chapters. Chapter 1 briefly describes the basic concepts of scene change detection. Chapter 2 discusses the technical background of classical and machine learning algorithms. Chapter 3 reviews the existing studies dealing with scene change detection and semantic segmentation. Furthermore, Chapter 4 discusses the methodologies for implementing the classical and machine learning algorithms. Chapter 5 presents the results and compares the performance of both algorithms. Finally, the decisions made for each algorithm implementation are discussed and the results are summarized in Chapter 6.

2 Technical Background

In this chapter, we walk through the theoretical aspects of classical and machine learning algorithms for scene change detection. Section 2.1 briefly explains the theoretical concepts of the Watershed algorithm. Section 2.2 briefly discusses the concepts of a neural network, weight initialization, convolutional neural networks, residual neural networks, global average pooling and 1x1 convolution. This section also explains the activation function, loss function, optimization algorithm for training the model, and the metrics for evaluating the performance of the model. Subsection 2.2.12 explains the technical background for implementing the ResNet50-based model.
The feature extraction architecture of ResNet50 and the architecture for reconstructing a segmented scene change map are explained, and the ResNet50-based models developed for this study are presented in Subsection 2.2.12.3. Subsection 2.2.13 describes the technical background for the Spatial Pyramid Pooling model. Sections 2.1 and 2.2 and Subsections 2.2.2-2.2.11 can be skipped if the reader is familiar with the classical Watershed algorithm and machine learning concepts.

2.1 Watershed Algorithm

The notion of the Watershed algorithm is that it treats an image as a topographic surface that is divided into multiple catchment basins or watershed basins [12]. It transforms an image such that the catchment basins are the objects that we want to identify. The Watershed algorithm was introduced in 1978 and further developed in 1982 by Serge Beucher [13]. It has been successfully used in image processing, especially in image segmentation [14, 15].

The Watershed algorithm works on a grayscale image, applying segmentation to the gradient of an image. It is a region-based algorithm that looks for similarities between pixels and regions. Each region in the image is characterized by its gray levels, and small variations in the gray levels result in small gradient values.

Assume the available image datasets are based on the RGB color model. An RGB image has three channels: Red (R), Green (G), Blue (B). Here, the red/green/blue channel is also referred to as a feature map. In image processing, a channel is a grayscale image of the same size as the color image, made of just one primary color (Red (R), Green (G) or Blue (B)). It carries the intensity information of the image corresponding to each pixel value. A grayscale image has a single channel, carrying only intensity information.
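The channel and grayscale notions above, and the absolute-difference preprocessing used to build the Watershed input (Figure 2.2), can be sketched in pure Python. This is a minimal illustration with images as nested lists of (R, G, B) tuples; the helper names are ours, and a real pipeline would use NumPy/OpenCV instead.

```python
# Minimal sketch: pixel-wise absolute difference of two RGB images,
# followed by conversion to a single grayscale channel.

def abs_diff(img1, img2):
    """Pixel-wise absolute difference of two same-sized RGB images."""
    return [[tuple(abs(a - b) for a, b in zip(p1, p2))
             for p1, p2 in zip(row1, row2)]
            for row1, row2 in zip(img1, img2)]

def to_grayscale(img):
    """Standard luminance conversion: L = 0.299 R + 0.587 G + 0.114 B."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b)
             for (r, g, b) in row]
            for row in img]

# A 1x2 image pair where only the second pixel changes between frames.
frame1 = [[(10, 10, 10), (200, 200, 200)]]
frame2 = [[(10, 10, 10), (50, 50, 50)]]
gray = to_grayscale(abs_diff(frame1, frame2))
print(gray)  # [[0, 150]] -- unchanged pixel -> 0, changed pixel -> 150
```

In the resulting grayscale difference image, unchanged regions are near zero (black) and changed regions are bright, which is exactly the property the thresholding and Watershed steps below rely on.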
Figure 2.1 shows two RGB-based input images used for scene change detection (RGB values (R, G, B) are highlighted in yellow in the bottom left of the two images).

Figure 2.1: RGB image pair

Since the Watershed algorithm operates on a single grayscale image, the scene change in two different images is detected by first taking the absolute difference of the two RGB input images. The absolute difference image is then converted to a grayscale image, as shown in Figure 2.2 (the grayscale value (L) is highlighted in yellow in the bottom left of the image).

Figure 2.2: Absolute difference of RGB images (grayscale)

Image segmentation can also be done based on the shape of objects, using the distance transformation function. The distance transformation function calculates the distance between an object pixel and the nearest background pixel, such that high-intensity pixels are turned into catchment basins. It works with a binary image in which all object pixels are set to the maximum intensity '255' (white pixels) and the background pixels are set to the lowest intensity '0' (black pixels). The binary image is obtained by applying morphological transformation operators (discussed in Section 2.1.1) and then thresholding. The Watershed algorithm works best with an image on which a distance transformation has already been applied. Image segmentation using the Watershed transform works better if the foreground objects can be marked well or separated from the background area. This helps in extracting the desired objects from the image.

2.1.1 Morphological Transformation

Morphological transformation is a powerful preprocessing step. It is essential to improve the quality of an image, highlight the required features and remove noise or distortion. It includes operators that are mainly used to analyze binary images to enhance the image, remove noise, detect edges, etc. It uses a kernel or structuring element to determine the type of transformation.
The following are the main operators of morphological transformation. 1. Erosion: The basic idea is to gradually chip away the boundaries of fore- ground objects. A pixel in the binary image (value 0’s or 1’s) is considered 1 only if all pixel values under the kernel are 1. Otherwise, the pixel will be con- sidered 0 and eroded. This operation always tries to keep the sure foreground area (object pixel) in white and rest of the area in black. 2. Dilation: It is the reverse of erosion. In this case, if at least one-pixel value is 1, the pixel value is considered to be 1. 3. Opening: There are two operations performed by this operator usually in the following order: first is erosion, and then a dilation operation. 4. Closing: It is the reverse of opening. 2.2 Neural Networks Neural networks are inspired by the capability of the biological neural system to process experimental data in the brain. The basic computational unit is a neuron. In the biological model, neurons receive and process information from their dendrites. The processed information is then transported along the axon to the terminal unit, the synapse. The output of this neuron forms the input for other neurons. The learning ability of the brain occurs through a series of activations of the neurons [16]. 7 2. Technical Background Figure 2.3, shows the computational model of a neuron. Figure 2.3: Neuron Each neuron performs the following calculation: performs dot product with the input (xi) and its weights (wti), then add the bias and apply a nonlinear function (fn). Neural networks are a group of neurons. They are interconnected in the form of an acyclic graph. It receives input and goes through a series of hidden layers. The hidden layer consists of neurons and is independent of each other. The neurons in every layer are interconnected. In other words, the output of a neuron is a linear transformation of the previous layer combined with a nonlinear activation function. 
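The neuron calculation just described (dot product, plus bias, through a nonlinearity) can be sketched in a few lines of pure Python. The numbers are illustrative, and ReLU is used here as the example nonlinearity.

```python
# A single neuron: fn(sum_i x_i * wt_i + bias), with ReLU as fn.

def neuron(x, wt, bias, fn=lambda z: max(0.0, z)):
    """Compute fn(dot(x, wt) + bias) for one neuron."""
    z = sum(xi * wi for xi, wi in zip(x, wt)) + bias
    return fn(z)

print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))  # pre-activation 0.1 -> 0.1
print(neuron([1.0, 2.0], [-1.0, 0.0], 0.1))   # negative pre-activation -> 0.0
```

A layer is simply many such neurons applied to the same input vector, each with its own weights and bias.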
2.2.1 Initialization

The purpose of weight initialization is to prevent the following problems:

1. Vanishing gradients: During training, the weights are updated based on the errors that are backpropagated. As more layers are added to the network, the error signal during backpropagation may fail to reach the initial layers: the amount of gradient information decreases with depth and eventually does not reach the initial layers. This is called the vanishing gradient problem, and it can occur if the weight initialization values are very small.
2. Exploding gradients: The weight updates in each layer become large when large gradient values accumulate across the layers. This is called the exploding gradient problem, and it can occur if the weight initialization values are very large.

Kaiming He et al. (2015) [17] define the weight initialization using the following formula:

W ∼ N(0, 2/n_l)    (2.1)

where W is the initial weight matrix of size d x n (d is the total number of filters, and each row of W holds the weights of one filter), N is the normal distribution with mean zero and variance 2/n_l, and n_l is the number of connections in layer l. The standard deviation in the above equation is sqrt(2/n_l).

2.2.2 Convolutional Neural Networks

Convolutional neural networks are similar to neural networks in that they consist of layers; they may consist of many hidden layers and millions of parameters. Compared to the fully connected layers of the neural networks in Section 2.2, the output of a convolutional layer is the result of applying convolutions to a subset of the neurons in the previous layer.

Figure 2.4 shows the basic convolutional block, where a kernel/filter (represented as a matrix) slides over the image (the input image matrix) to create a feature map for the next layer.
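The sliding operation of Figure 2.4 can be sketched in pure Python: at each position, the kernel overlaps a patch of the image, the overlapping entries are multiplied element-wise and summed, and the result is one value of the feature map. The image and kernel values below are illustrative, and no padding or stride is used (a "valid" sweep); note that deep learning frameworks implement this un-flipped "cross-correlation" variant of convolution.

```python
# Slide a kernel over an image to produce a feature map (valid positions only).

def feature_map(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the kernel with the patch under it, then sum.
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # adds each pixel to its lower-right neighbour
print(feature_map(image, kernel))  # [[6, 8], [12, 14]]
```

A 3x3 image with a 2x2 kernel yields a 2x2 feature map, illustrating how the spatial size shrinks without padding.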
A kernel/filter is used to detect the essential features in an image. The feature map is the result of element-wise multiplication of the input image matrix and the kernel matrix. The kernel maps a subset of neurons from the previous layer to a single neuron in the next layer to create a feature map.

Figure 2.4: Convolutional block and feature maps

The mathematical concept of convolution [18] is explained by first defining convolution in one dimension. A convolution is an operation on two functions f(x) and g(x) that results in (f(x) * g(x)) at point x, where f(x) and g(x) are one-dimensional functions. It blends one function over the other. In mathematical terms, convolution is an integral that evaluates the overlap of a function g(x) shifted over another function f(x). It is evaluated for all shifts and thus yields a convolution function, h(x). A one-dimensional convolution of two discrete functions f(x) and g(x) is given by the following formula:

h(x) = f(x) * g(x) = Σ_{a=-∞}^{+∞} f(a) · g(x − a)    (2.2)

where a is the shift.

A convolution in two dimensions is used when the input is an image. Let f(x, y) be the input image and g(x, y) be the kernel function. Then the convolution of the input image and kernel function is given by the following formula:

h(x, y) = f(x, y) * g(x, y) = Σ_{a=-∞}^{+∞} Σ_{b=-∞}^{+∞} f(a, b) · g(x − a, y − b)    (2.3)

where a and b are the shifts.

A convolutional neural network [19] is independent of the spatial dimension and takes an image with two or three dimensions as input. It also consists of learnable weights and biases. In a classification task, a convolutional neural network transforms the input image pixels, layer by layer, into a final class probability score. Convolutional neural networks form the backbone of computer vision applications such as pattern recognition, image classification and object recognition.

The convolutional neural network architecture can be divided into three main layers:

1.
The first layer is the convolutional layer. It forms the basic block of convolu- tional neural networks. It takes an image as input to create a feature space by scanning each pixel. The convolutional layer is followed by the activation function, which is a nonlinear transformation operation. 2. The second layer is pooling, where the dimensionality of the feature space is reduced to extract more finer features. This is also known as downsampling. The pooling layer does not contain any parameters. 3. The final convolutional layer consists of vectors of feature maps. These vectors are passed to the fully connected layers, which form the third layer. Softmax regression often follows, such that the output of the network is a probability distribution with respect to the predicted classes. The fully connected layers help the network generalize its prediction by compiling the features extracted from the previous layers. The parameters in the convolutional and fully connected layers are trainable. A training should ensure that the class likelihood score matches the ground truth image in the training dataset. Gradient descent is a common training method. 10 2. Technical Background 2.2.3 Residual Neural Network (ResNet) In a simple convolutional neural network where the convolutional layers are stacked on top of each other. A convolutional neural network with more layers can provide more accurate results. There are two scenarios to consider when more layers are added to the network: the network learns new weights, or the network does not learn any new weights. When the network does not learn new weights, it could result in a state where the weights are not updated effectively during each training phase. Here, adding more layers would only increase the computational overhead, and no improvement in terms of accuracy. ResNet [11], a kind of convolutional neural network architecture, is used to mitigate this problem. It uses a residual block. 
The residual block consists of a residual function F(x) and an identity mapping function, as illustrated in Figure 2.5. The residual function F(x) is given by the following formula:

F(x) = H(x) − x    (2.4)

where x is the input to the residual block and H(x) is the final output of the residual block. The residual function is the difference between the input and the output of the residual block. This allows the network to learn F(x) instead of H(x). Along with F(x), the residual block also gives the network the ability to learn the identity mapping of the input to the output. The above equation can be rearranged as follows:

H(x) = F(x) + x    (2.5)

H(x) is passed to the following layers in the network.

The advantage of using the residual function F(x) is that the network learns new weights, say wt1 and wt2, from the convolutional layers (Layer1 and Layer2 depicted in Figure 2.5) in the residual block. In the worst case, when the network does not learn any new weights, i.e., when the weights in the residual block tend to zero, the identity mapping of the input layer is used. The identity mapping ensures that the previously learned layers are passed on to the subsequent layers. This approach helps to maintain the accuracy of the network.

Figure 2.5 represents the residual block with added weights and an activation function, ReLU. The layers in the residual block are connected in such a way that a few layers are skipped between them. This is often referred to as a skip connection or shortcut connection.

Figure 2.5: Residual block

The blue line in Figure 2.5 represents the skip connection where Layer1 and Layer2 are skipped. The skip connection is basically an identity mapping function: the input from the previous layers is added to the output of another layer.
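The residual block H(x) = F(x) + x can be sketched in pure Python. Here F is two small layers with a ReLU in between; for simplicity the "layers" are element-wise scalings rather than convolutions, and all weights are illustrative.

```python
# Residual block sketch: H(x) = ReLU(F(x) + x), with F(x) = Layer2(ReLU(Layer1(x))).

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, wt):
    # Element-wise "layer" standing in for a convolution, for illustration only.
    return [x * w for x, w in zip(v, wt)]

def residual_block(x, wt1, wt2):
    fx = linear(relu(linear(x, wt1)), wt2)          # residual function F(x)
    return relu([f + xi for f, xi in zip(fx, x)])   # skip connection adds x back

x = [1.0, 2.0]
print(residual_block(x, [0.5, 0.5], [2.0, 2.0]))  # F(x) = [1.0, 2.0] -> H(x) = [2.0, 4.0]
# Worst case: the block weights collapse to zero, F(x) = 0, and the skip
# connection passes the input through unchanged (identity mapping).
print(residual_block(x, [0.0, 0.0], [0.0, 0.0]))  # [1.0, 2.0]
```

The second call illustrates the point made above: even when the learned weights tend to zero, the block degenerates to the identity mapping instead of destroying the previously learned representation.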
ResNet uses the identity mapping function [20] to overcome the vanishing gradient problem and the accuracy degradation problem when more layers are added. Applying identity mapping to the input yields an output that is identical to the input: let X be the input and I the identity mapping, then their product is XI = X. In Figure 2.5, Gradient 1 (green dotted line) represents the gradient traversing back through the residual function during backpropagation, and Gradient 2 (orange dotted line) represents the gradient traversing back through the identity function during backpropagation. The new gradients are computed by updating the weights wt1 and wt2. The gradients become smaller and may eventually vanish; in that case, the gradients can still reach the initial layers by skipping the residual block. This ensures that the network updates and learns the correct weights.

2.2.4 Global Average Pooling (GAP)

A pooling operation is used to reduce the dimension of an input by encapsulating the significant features of the feature map. Global Average Pooling (GAP) [21] is a pooling operation that computes the average of each feature map. It reduces the spatial dimension of an input by averaging the individual feature maps, producing robust spatial information about the input. In a convolutional neural network, the GAP layer is commonly used after the last stage of convolution, where there are a large number of feature maps. The GAP layer does not add any new parameters to the network; therefore, it speeds up the training process and reduces overfitting.

Figure 2.6: Global average pooling

Given a three-dimensional input (width, height, depth), GAP reduces the dimension to (1, 1, depth), as shown in Figure 2.6. For example, let the output dimension of a convolutional layer be [w x h x d] = [15 x 20 x 2048].
The GAP layer takes the average over the [w x h] values, so it transforms [15 x 20 x 2048] into the dimension [1 x 1 x 2048]. Here the depth (d) is the number of feature maps (channels).

2.2.5 Conv 1x1

Conv 1x1 (1x1 convolution) is a convolution process using a filter of size 1x1. It helps in reducing the dimension depth-wise, so that n channels are embedded into a single channel. Assume an image of dimension [480 x 640 x 3] and a filter of size [1 x 1 x 3], as shown in Figure 2.7. We use a single filter (filters = 1) to perform the 1x1 convolution. After performing the 1x1 convolution, the number of channels of the input image, 3 in this case, is reduced such that the output is of dimension [480 x 640 x 1].

Figure 2.7: Conv 1x1

Conv 1x1 is used in ResNet's bottleneck design (explained in Subsection 2.2.12). A Conv 1x1 layer is usually added before an expensive convolutional layer, such as 3x3 or 5x5. Conv 1x1 [21] is often followed by a nonlinear activation function, such as ReLU, allowing the model to learn more deeply and adjust its weights efficiently during backpropagation.

2.2.6 Rectified Linear Unit (ReLU)

The ReLU function [22] is defined by the following formula:

S(x) = max(0, x)    (2.6)

It gives an output x if x is positive, and 0 otherwise.

Figure 2.8: ReLU Function [23]

It is important to understand the problem with activation functions like the sigmoid and tanh functions. A general issue is that they saturate: for large input values, both tanh and sigmoid tend to 1, while for strongly negative values tanh tends to -1 and sigmoid tends to 0. These properties lead to challenges, especially when adapting the weights to improve the performance of an algorithm.
A solution to this problem is an activation function that helps an algorithm learn the complex (nonlinear) relations in the data, while behaving enough like a linear function that SGD can be used to train neural networks, and that does not saturate as easily as the other activation functions. The ReLU function satisfies these requirements and gives better performance. Kaiming initialization is often used with the ReLU activation function. A few advantages of ReLU are: it is computationally inexpensive compared to most other activation functions, such as sigmoid; and it enables better training of deep neural networks by mitigating the vanishing gradient problem.

2.2.7 Loss Function

A loss function measures the degree of agreement between the prediction and the ground truth label. The total loss is taken as an average over a set of data and is calculated using the following formula: L = (1/N) * sum_{i=1}^{N} L_i (2.7) where N is the number of training or validation samples; i indexes the ith sample in a dataset; L_i is the loss calculated for the ith sample. We will now define L_i for the binary cross entropy loss.

Binary Cross Entropy (BCE): Binary cross entropy loss is used in binary classification or segmentation problems, where the output predicted by the model is binary: 0 or 1. It is defined using the following formula: L_i = -[gt_i * log(pred_i) + (1 - gt_i) * log(1 - pred_i)] (2.8) where gt_i is the ground truth class value for the ith sample, 0 or 1; pred_i is the predicted probability for the ith sample, a value between 0 and 1. For binary classification problems, the output layer uses a sigmoid function (the sigmoid/logistic function converts real numbers into probabilities in the range [0,1]) followed by the BCE loss. The sigmoid function is applied to the predicted output so that the resulting values lie between 0 and 1. Given an input x, the sigmoid function is
given by S(x) = 1/(1 + e^(-x)) (2.9) As x approaches infinity, S(x) approaches 1, and as x approaches negative infinity, S(x) approaches 0.

Dice Loss: The Sørensen–Dice coefficient [24], from which the dice loss [25] is derived, is the harmonic mean of recall and precision. Recall is the fraction of actual positives that are retrieved and is defined using the following formula: Recall = TP/(TP + FN) (2.10) where TP (True Positive) is the number of cases in which the model predicts a class and the class is also present in the ground truth; FN (False Negative) is the number of cases where a class is present in the ground truth but the model does not predict it. Precision is the fraction of retrieved features that are actual positives and is defined using the following formula: Precision = TP/(TP + FP) (2.11) where FP (False Positive) is the number of cases in which the model predicts a class that is not present in the ground truth. In terms of recall and precision, the Dice coefficient is defined using the following formula: Dice coefficient = (2 * Recall * Precision)/(Recall + Precision) (2.12) The Dice coefficient attaches equal importance to false positives (FP) and false negatives (FN), as shown in Figure 2.9. Therefore, it is highly robust to class-imbalanced datasets. As illustrated in Figure 2.9, the union of the prediction mask and the ground truth label is the number of pixels present in the prediction mask, in the ground truth, or in both. The intersection is the number of pixels present in both the prediction mask and the ground truth label.

Figure 2.9: Dice coefficient

Dice coefficient = (2 * Intersection)/(Union + Intersection) = (2 * TP)/((2 * TP) + FN + FP) (2.13) Dice loss = 1 - Dice coefficient (2.14)

2.2.8 Optimizers

Adam (Adaptive moment estimation): Adam [26] is a first-order stochastic gradient optimization method.
It is an extension of the classical stochastic gradient descent algorithm. It additionally performs step-size annealing and computes the learning rates adaptively from estimates of the first and second moments of the gradients. In the PyTorch library, the running means of the gradients and of their squares are computed from the coefficients β1 and β2. The parameter updates depend on the value of the momentum. The momentum update ensures that the parameter updates follow the direction of the consistent gradient of the loss function, which leads to better convergence of the network.

2.2.9 Learning Rate Scheduler

Reduce on Plateau (RoP): Learning rate schedulers are used to anneal the learning rate over time. RoP is a learning rate scheduler that dynamically reduces the learning rate based on the validation loss. Machine learning models often benefit from reducing the learning rate by a certain factor, for example every 5 epochs. In the PyTorch library, this is controlled by the ‘patience’ parameter. Once the patience value is set, the scheduler monitors a metric, such as the validation loss, for improvements. If there is no improvement for ‘patience’ epochs, the learning rate is reduced, by a factor of 0.1 by default.

2.2.10 Metrics

Several metrics are commonly used to evaluate the accuracy of semantic segmentation, such as pixel accuracy, mean intersection over union (mIoU) and dice score [27, 28].

Pixel Accuracy: Pixel accuracy is the number of correctly predicted pixels divided by the total number of pixels. It compares each pixel of the predicted mask with the ground-truth label.

Intersection over Union (IoU): IoU, or Jaccard’s index, is the area of overlap (intersection) between the predicted segmentation and the ground truth, divided by the area of their union, as shown in Figure 2.10.
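These metrics reduce to simple pixel counts; a minimal sketch, assuming the binary masks are flattened into 0/1 lists (the function name is ours):

```python
def mask_metrics(pred, gt):
    """Pixel accuracy, IoU and dice from two flat binary masks."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # false positives
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # false negatives
    correct = sum(p == g for p, g in zip(pred, gt))        # any correct pixel
    iou = tp / (tp + fn + fp)
    dice = 2 * tp / (2 * tp + fn + fp)
    return correct / len(gt), iou, dice

# Toy 6-pixel masks: one false positive and one false negative.
acc, iou, dice = mask_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(acc, iou, dice)  # → 0.666..., 0.5, 0.666...
```

Note that dice weighs TP twice, so it is always at least as large as IoU for the same masks.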
IoU measures the percentage of overlap. Therefore, IoU is a value between 0 and 1, where 0 represents no overlap and 1 represents that the prediction and ground truth overlap completely. A higher IoU corresponds to a more accurately predicted segmentation.

Figure 2.10: Intersection over Union (IoU)

IoU is defined using the following formula: IoU = Intersection/Union = TP/(TP + FN + FP) (2.15)

Dice score: The dice score, or dice coefficient, was covered in Section 2.2.7 when discussing the dice loss.

2.2.11 Transfer Learning

In machine learning, transfer learning is an approach that aims at gaining knowledge by solving one problem and using it to solve another, related problem. There are two types of transfer learning [29] in convolutional neural networks: feature extraction and fine-tuning. In feature extraction, a pre-trained convolutional neural network, such as ResNet50 or VGG16, is chosen and the weights of the pre-trained model are frozen. A few layers can be added on top of the frozen layers and then trained. In fine-tuning, a pre-trained model is re-trained in its entirety by updating all weights through backpropagation. The pre-trained ResNet50 is trained on 1.2 million images with 1000 classes. The last fully connected layer is usually removed and the remaining layers are used for feature extraction. The transfer learning process includes initializing a pre-trained model, choosing between fine-tuning and feature extraction, and defining the optimization algorithm to update the correct weights. Finally, the model is trained, validated to check that it learns, and then tested to evaluate what it has learned. Training a convolutional neural network on a huge dataset like ImageNet is computationally intensive and time-consuming. The PyTorch library model_zoo contains a list of pre-trained convolutional neural network architectures.
As mentioned earlier, the model can be trained from scratch, from the lower layers, or only in the last layers. It is important to select the pre-trained model carefully, based on the problem, to achieve the desired prediction.

2.2.12 ResNet50-Siamese

2.2.12.1 Architecture of ResNet50

ResNet50 takes an input image of spatial dimension [height, width, channels], where the height and width are multiples of 32 and the number of channels is 3. Each residual block consists of three convolutional layers. ResNet50 has four main convolutional stages; in each stage the input size is halved and the channel width is doubled. The architecture of ResNet50 is depicted in Figure 2.11 [11]. There is an initial convolutional module with a 7x7 kernel size and 64 kernels, followed by max-pooling with a 3x3 kernel size; a stride of 2 is used. For an input dimension [480,640,3], the output dimension of the initial convolution is [240,320,64] and that of the max-pooling layer is [120,160,64].

Figure 2.11: ResNet50 architecture

The first stage has [1x1, 64], [3x3, 64] and [1x1, 256] kernels, stacked together as the residual block. There are three residual blocks, which results in 9 layers. Here, [1x1, 64] denotes a 1x1 convolution with 64 filters/kernels. The second stage has [1x1, 128], [3x3, 128] and [1x1, 512] kernels, stacked together and repeated four times, resulting in 12 layers. The third stage has [1x1, 256], [3x3, 256] and [1x1, 1024] kernels, repeated six times to form 18 layers. The fourth stage has [1x1, 512], [3x3, 512] and [1x1, 2048] kernels, stacked together and repeated three times to form 9 layers. Each residual block thus stacks three layers: a 1x1, a 3x3 and another 1x1 convolution. These three layers form the bottleneck block used in ResNet architectures with 50 or more layers.
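The stage sizes above account for the “50” in ResNet50; a quick arithmetic check using the block counts just listed:

```python
# (residual blocks per stage, conv layers per bottleneck block)
stages = [(3, 3), (4, 3), (6, 3), (3, 3)]  # stages 1-4 as described above
stage_layers = [blocks * convs for blocks, convs in stages]
print(stage_layers)                # → [9, 12, 18, 9]

# Initial 7x7 conv + four stages + final fully connected layer.
total = 1 + sum(stage_layers) + 1
print(total)                       # → 50
```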
The purpose of the bottleneck design is to reduce the computational cost in networks with many layers. As more layers are added and the network gets deeper, the 3x3 convolution becomes an expensive operation. Therefore, in a bottleneck design the channel dimension is first reduced by a 1x1 convolution before the 3x3 convolution is performed, and finally restored by another 1x1 convolution. The residual block in ResNets with fewer layers (18 and 34 layers) consists of only two 3x3 convolutional layers stacked on top of each other. Each convolutional layer is followed by batch normalization and the ReLU activation function. Batch normalization [30] makes the network more stable and faster to train by mitigating the problem of internal covariate shift². ResNet50 ends with a global average pooling layer, followed by a densely connected layer (with 1000 neurons corresponding to the ImageNet class output) and a softmax activation, which together count as a single layer.

2.2.12.2 Reconstruction Network

The Reconstruction Network is used to create an image from the feature maps obtained from ResNet50. It consists of a transposed convolutional part (TransposeConv) and a bilinear upsampling part (Bilinear). The transposed convolution helps reconstruct the spatial dimension of the input. The transposed convolutional part consists of a convolutional layer followed by transposed convolutional layers with 16, 32 and 64 input channels, a kernel size of 1 and ReLU as the activation function. Transposed convolution is used in semantic segmentation to upsample the input feature maps from the convolutional stages of ResNet50 to a high-resolution feature map. It learns its own parameters by updating the weights through backpropagation. Only the low-level features are passed as input to the transposed convolutional layers. Bilinear upsampling uses the bilinear interpolation technique: a linear interpolation performed in two different directions.
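A minimal sketch of that two-direction interpolation, for a single point between four known neighbours (names are ours):

```python
def bilerp(c00, c10, c01, c11, x, y):
    """Bilinear interpolation at fractional offsets x, y in [0, 1].

    cXY is the corner value at horizontal offset X and vertical offset Y.
    A linear interpolation along x is followed by one along y.
    """
    top = c00 * (1 - x) + c10 * x      # interpolate along x at y = 0
    bottom = c01 * (1 - x) + c11 * x   # interpolate along x at y = 1
    return top * (1 - y) + bottom * y  # interpolate along y

# Upsampling inserts values between known pixels, e.g. halfway between
# four neighbours with values 0, 10, 20, 30:
print(bilerp(0, 10, 20, 30, 0.5, 0.5))  # → 15.0
```

Bilinear upsampling applies this rule at every inserted pixel position; unlike transposed convolution, it has no learnable parameters.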
The feature maps from each convolutional stage and from the global average pooling layer are upsampled using bilinear upsampling, followed by a convolution to preserve the spatial dimension of the image. (² The phenomenon of shifting input-data distributions across the layers of a neural network; for more details refer to Section 3.6.) The deep feature maps with 256, 512, 1024 and 2048 channels are directly upsampled using bilinear interpolation. Batch normalization is used to speed up the training of the deep network. The kernel and output shapes for each layer of ResNet50 and the Reconstruction Network can be found in Appendix A.1.

2.2.12.3 Siamese Network

A Siamese network [31, 32] consists of parallel branches that share the same architecture. The branches share the same set of weights and learn to differentiate the inputs rather than classify them; the network therefore learns distinctive image similarities. The Siamese network forms the basis for scene change detection tasks: it is fed a pair of images as input, for which the changes are to be detected. In this study, ResNet50 and the Reconstruction Network (represented as “TransposeConv_Bilinear_Net” in Figure 2.12 and Figure 2.13) are integrated to form the Siamese network. The network model presented here is inspired by, and extends, the study of Varghese et al. (2018) [33]. It maps the feature space to the desired change map corresponding to the dimension of the input image. This is achieved by either merging the multilayer features from the parallel branches or taking the absolute difference between them. The weights of the ResNet50 are shared, while the weights of the Reconstruction Network are independent.

Figure 2.12: Siamese_Res50_Fuse_Net

Figure 2.13: Siamese_Res50_Diff_Net

The model is called Siamese_Res50_Fuse_Net (Figure 2.12) when the outputs of the Siamese branches are merged/fused.
When the difference operation is applied to the outputs of the Siamese branches instead, the model is called Siamese_Res50_Diff_Net (Figure 2.13).

2.2.13 Pyramid Pooling Module

CNNs are typically followed by fully connected layers that accept input of a fixed size, which prevents the CNN from handling inputs of varying sizes. Therefore, images are usually converted to a specific dimension before being fed into the CNN. This leads to the problems of image warping (where the image may be distorted in some way) and reduced resolution. Spatial Pyramid Pooling (SPP) helps to solve this problem. It manages the information in local spatial bins, whose number and sizes are fixed; the bins hold the pooled responses of each filter. The input feature map has 256 filters, as shown in Figure 2.14, and is of arbitrary size (depending on the input size). As seen in Figure 2.14, there are three pooling layers: the first is similar to a global pooling operation, whose output is 256-d; the second has 4 bins, giving an output of 4*256; and the third, with 16 bins, gives an output of 16*256. The outputs are flattened and concatenated so that the final dimension remains the same regardless of the input size.

Figure 2.14: Example of SPP

3 Literature Review

This chapter walks through the existing studies on machine learning algorithms for scene change detection. It also discusses studies on semantic segmentation, residual neural networks, pyramid pooling, global average pooling, optimization algorithms and batch normalization.

3.1 Scene Change Detection

Jong et al. (2019) [34] study change detection on satellite images using unsupervised learning. The authors also address the need for future research on improving the accuracy and noise resistance of such models. Sakurada et al.
(2015) [31] detect changes for a given geographical area using city landscapes. The paper proposes a method that uses a pair of vehicular, omnidirectional (360°) images for detecting changes in a scene. The images are taken at different times and exhibit temporal differences in illumination and photographing conditions. The authors use a fully convolutional Siamese network to overcome visual differences between image pairs. Zhao et al. (2019) [32] also use a Siamese encoder-decoder network for street-view change detection. Varghese et al. (2018) [33] use a Siamese network to form a parallel network. ResNet50 forms the backbone of the network, and the last three stages of convolutional layers are used to extract features. The convolutional layers are followed by bilinear upsampling layers that upsample the features from the three convolutional stages. Finally, the feature maps are merged and fed to a softmax layer to generate the changes in the image.

3.2 Semantic Segmentation

A semantic segmentation network is similar to an encoder-decoder network. The encoder is similar to a convolutional neural network model such as ResNet, GoogleNet, etc., and is used to extract low- and high-resolution features. The decoder has a different mechanism that helps to recover the spatial information and produce the segmented prediction.

Ronneberger et al. (2015) [35] (U-Net) proposed a U-shaped architecture with a contracting (encoder) and an expansive (decoder) path to perform semantic segmentation. The contracting path is like the convolutional part of a CNN: it extracts features from the image. The expansive path is similar to a transposed convolution (deconvolution/upsampling) network: it takes the feature set from the contracting path and recovers the spatial information lost in the contracting path.
Additionally, every step in the expanding path is concatenated with the corresponding high-resolution feature set from the contracting path. The combination of high-resolution features and spatial information produces better segmentation results. The authors show that the network can be trained with very few images.

Zhao et al. (2017) [27] (PSPNet) propose a scene parsing network. PSPNet uses a CNN model to extract the features of an image and then exploits global contextual information by feeding the feature space into a pyramid pooling module. The output of the pyramid pooling module is upsampled and concatenated with the initial feature set from the convolutional layers for pixel-wise prediction. It thus captures both local and global contextual information.

Chen et al. (2018) [36] (Atrous Spatial Pyramid Pooling, ASPP) propose a powerful encoder module in the encoder-decoder model by applying several parallel atrous convolutions to capture higher-level semantic information with faster computation. The atrous (dilation) rate is a parameter used to increase the receptive field of each layer. A larger kernel could serve the same purpose, but the number of parameters grows with kernel size. Atrous convolution was introduced in DeepLab as a tool to adjust/control the effective field of view of the convolution. The model aggregates features from the image at different scales. These models have shown success on several segmentation benchmarks.

3.3 Residual Neural Network

Network convergence can be hampered at an early stage by vanishing/exploding gradients [37]. Studies [38] show that deeper networks can exhibit a higher training error rate: the network performs well at the beginning, but gradually the accuracy stagnates and then degrades swiftly. He et al. (2016) [11] propose a residual network to mitigate the vanishing gradient problem, reduce the training error rate and increase the performance of deeper networks.
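The effect of the identity shortcut on gradients can be illustrated numerically: even when the residual branch's derivative is near zero, the whole block's derivative stays near one. A toy sketch (not He et al.'s implementation; the residual branch is a deliberately tiny linear function):

```python
def residual_branch(x):
    # Stand-in for F(x) with an almost-vanished derivative of 1e-4.
    return 1e-4 * x

def residual_block(x):
    # y = F(x) + x : the skip connection adds the identity path.
    return residual_branch(x) + x

def numeric_derivative(f, x, eps=1e-6):
    # Central finite difference.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# The residual branch alone would pass back almost no gradient ...
print(numeric_derivative(residual_branch, 2.0))  # ~0.0001
# ... but the identity path contributes a gradient of 1 on top.
print(numeric_derivative(residual_block, 2.0))   # ~1.0001
```

Because the identity path always contributes a derivative of 1, gradients can reach the early layers even when the learned branches contribute almost nothing.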
The “Highway Networks” [39] are a technique similar to ResNet that also uses skip/shortcut connections. However, the amount of information passed through the skip connections is controlled by a parametric gate, and since the gates can be closed, the layers can represent non-residual functions. The identity shortcut connections in ResNets, by contrast, are never closed.

3.4 Global Average Pooling (GAP)

GAP, first proposed in [21], is placed as the last layer and takes as input the feature maps of the final convolutional layer. The GAP layer outputs a single-entry vector for each class in the classification task. The studies [40, 41] use global average pooling (GAP) to add global context information to their model frameworks. Zhou et al. [42] added a GAP layer to convolutional neural networks for the purpose of object localization. The network is then trained for image classification. Hence, the object in the image can be detected using the convolutional neural network, and the added GAP layer reveals where in the image the object is located.

3.5 Optimizers

Gradient descent optimization algorithms [43, 44] are used in deep learning to optimize neural networks. Gradient descent [45] is an optimization algorithm that obtains the best set of weights in a network by finding a local minimum of a function. The stochastic gradient descent algorithm [46, 47, 48, 49] is an iterative variant of gradient descent that iterates through the training set. An adaptive learning rate (a parameter of an optimization algorithm that determines the step size when searching for the minimum), in conjunction with a dataset shuffled during each training epoch, can help the algorithm converge better. Adam [26] is an extension of stochastic gradient descent.

3.6 Batch Normalization

The input data and the parameters of each layer of a neural network can influence the training process.
The parameters of the current layer change the input distribution of the succeeding layer. This phenomenon of shifting input distributions across the layers is described as internal covariate shift. Batch normalization [50, 30] mitigates the problem of internal covariate shift and makes the neural network stable and faster to train.

In a nutshell, the Siamese network is a suitable architecture in scenarios where two different images are to be processed simultaneously. By the nature of its architecture, it can reduce the number of parameters and the memory footprint during training. Semantic segmentation models such as ASPP and PSPNet are trained on deeper networks: they use ResNet with 101 or 152 layers for feature extraction, and the entire model is trained using multiple GPUs. Therefore, we try to implement a model that is computationally efficient and still yields accurate predictions. Inspired by the study of Varghese et al. (2018) [33], we exploit the feature extraction capability of a 50-layer ResNet and the Siamese network architecture to process two different images and produce the desired semantic change map.

3.7 Spatial Pyramid Pooling

Zhang et al. (2015) [51] illustrate the effectiveness of Spatial Pyramid Pooling (SPP) techniques in deep learning and visual recognition. Standard deep CNN models need images of a fixed size to process and predict objects and scenes, so input images must be either cropped or resized. When cropping, some of the objects in the image may disappear; when resizing, the dimensions and resolution change, which reduces the clarity of the image. If such an image is given as input, it becomes difficult for the system to predict the exact scenes or objects. The spatial pyramid pooling technique was introduced to mitigate these issues: images of any size and specification can be given as input.
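How pyramid pooling produces a fixed-length output regardless of input size can be sketched in plain Python for a single channel (max pooling; helper names are ours, not from the paper):

```python
def spp_features(fmap, levels=(1, 2, 4)):
    """Pool an h x w map into sum(n*n for n in levels) fixed bins."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:                 # one n x n bin grid per pyramid level
        for by in range(n):
            for bx in range(n):
                # Bin boundaries; each bin covers at least one pixel.
                ys = range(by * h // n, max((by + 1) * h // n, by * h // n + 1))
                xs = range(bx * w // n, max((bx + 1) * w // n, bx * w // n + 1))
                out.append(max(fmap[y][x] for y in ys for x in xs))
    return out

small = [[1, 2], [3, 4]]
large = [[(y * 8 + x) % 7 for x in range(8)] for y in range(6)]
# Different input sizes, identical output length: 1 + 4 + 16 = 21 bins.
print(len(spp_features(small)), len(spp_features(large)))  # → 21 21
```

With 256 channels and the same three levels, this yields the fixed (1 + 4 + 16) * 256 vector described in Section 2.2.13, whatever the input resolution.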
The SPP layer is placed on top of the last convolutional layer, as seen below.

Figure 3.1: Spatial Pyramid Pooling presented in the paper

SPP is very significant in object detection. When the same model is tested on two other datasets, Pascal VOC 2007 and Caltech101, it yields the best results and the highest accuracy scores compared to the other models. On the Pascal VOC 2007 dataset, the SPP model yields an accuracy score of 82.44 percent, higher than the previous best of 81.58 percent, and on Caltech101 it yields 93.42 percent, far above the previous best (88.54 percent). SPP has turned out to be one of the milestones in deep learning techniques in recent times. Overall, SPP is a better solution for handling images of different scales and sizes to yield accurate predictions.

In the study [27], the Pyramid Scene Parsing Network (PSPNet) method is introduced to predict the scenes and objects in the given images (datasets) with better accuracy. It is a deep neural network that primarily builds on a Convolutional Neural Network (CNN). The paper starts from the Fully Convolutional Network (FCN) but introduces PSPNet to overcome some of FCN's drawbacks. FCN could not recognize some objects and scenes reliably, whereas the authors claim that PSPNet analyzes the same scene with high accuracy to find the exact object. FCN sometimes misidentified objects (a car instead of a boat, a skyscraper instead of buildings, etc.). To avoid such drawbacks, the spatial pooling system and pyramid scene parsing techniques were introduced. Global average pooling is used as a baseline model; adding spatial pooling increased the accuracy scores and turned out to be more efficient. The ADE20K
dataset and a few other datasets have been tested with this method, resulting in higher accuracy scores compared to other methods. PSPNet is a pixel-prediction framework that the authors claim is ideal for applications like autonomous driving, robot sensing, etc. In PSPNet each pixel in the image is assigned a category label; the network understands the scene and predicts the objects based on it. The input is first sent to a CNN, whose result is sent to the pooling module; the obtained output is then sent to the different levels of the pooling system.

Figure 3.2: Example of Spatial Pyramid Pooling

In Figure 3.2, each level of the SPP network performs its assigned job, and all the outputs are finally concatenated to produce the final prediction. For any model, the number of pyramid levels and the size of each level depend on the type of dataset used. On the ADE20K dataset, the PSPNet proposal is a remarkable achievement and solves the common problems faced by the FCN model. Furthermore, the PSPNet model is applied to the Cityscapes dataset to check its accuracy in predicting change, and it secures the best accuracy scores compared to other techniques. It also finds applications in military intelligence, where predicting the correct objects and scenes is crucial, which is successfully executed using the PSPNet method. When PSPNet models are pre-trained and then used, the accuracy scores increase further. In some cases, FCN recognizes two different objects as the same object (mostly when both have the same color), but PSPNet correctly identifies them as different objects. The computational costs of the PSPNet and FCN networks are almost similar (the cost of using this model is not too high). PSPNet stands out in its ability to capture diverse scenes and an unrestricted vocabulary.
3.8 Activation Function

In recent times, computer vision and natural language processing have become popular and widely used. Modern models are powerful and effective and can be applied to large-scale datasets containing millions of samples. The authors in [52] primarily deal with the activation of neurons in deep neural networks. The activation and loss functions play a vital role in a deep learning model: their choice strongly influences the model's efficiency and accuracy. Each model consists of many artificial neurons, modeled on biological neurons and arranged in an orderly manner. Each neuron is activated by signals sent by the previous neurons; if these signals are large enough to stimulate the neuron, it goes into an excited state, otherwise it remains inactive. The activation can be carried out with techniques like the sigmoid function and the hyperbolic tangent function. With the sigmoid function, the activation saturates; the process is time-consuming and computationally expensive, so sigmoid activation is used only in some deep learning settings. The hyperbolic tangent function, computed as the ratio of the hyperbolic sine and cosine, behaves similarly, but the loss values obtained with it are comparatively smaller than with the sigmoid function; it consumes somewhat less time and cost and is therefore widely used. The authors in [52] suggest an activation function called the Rectified Linear Unit (ReLU) to activate the neurons. This is an unsaturated function that requires little cost and time, and the authors claim that it can activate the neurons simultaneously.
The computational cost is much lower than for the sigmoid and hyperbolic tangent functions. ReLU activation is much faster and makes the network easier to train, and it can be applied without any additional unsupervised or supervised pre-training. ReLU comes in several variants, such as LReLU (Leaky ReLU), PReLU (Parameterised ReLU) and RReLU (Randomised ReLU); these techniques differ only in some aspects while the baseline model remains the same. The baseline model in the paper consists of five convolutional layers followed by two pooling layers and one fully connected layer. The dataset used consists of 60000 samples and 10000 training samples. When this dataset is fed to deep convolutional neural networks and the techniques are used separately to activate the neurons, the following results are obtained: with sigmoid activation, the obtained error percentage (deviation of the predicted value from the original) is 1.15 percent; for the hyperbolic tangent it is around 1.1 percent; whereas for the ReLU function it is just 0.8 percent, making it the best among all the techniques. Hence, for activating neurons, ReLU is the technique followed in almost every recent deep learning method.

3.9 Loss Function

Nie et al. (2018) [53] in their research primarily focus on loss functions in deep learning and machine learning. Loss functions are one of the important factors that influence the overall efficiency and accuracy of a model. Hence, loss functions must be given high priority, and suitable methods must be chosen according to the type of model. In deep learning there are two different model types, regression and classification: in a regression model the predicted values are continuous, whereas in a classification model they are discrete.
Regression is about predicting a quantity, whereas classification is about predicting a label. A suitable loss function must be chosen for any model to get the best results. The paper distinguishes two types of loss functions: bilateral and unilateral. The deviation between the predicted value and the originally obtained value is referred to as the hyperplane value: the lower this value, the more suitable the method; the higher, the less suitable. For bilateral loss functions, the loss is calculated for both the regression and the classification model. The loss value obtained for the regression model is less than 1, whereas for the classification model it is greater than 1; so the classification model's values need to be penalized (being greater than 1), while the regression model's do not (being less than 1). The authors conclude that bilateral loss functions are most suitable for regression models and less suitable for classification models. For unilateral loss functions, the loss values are again calculated for both models. The value obtained for the regression model is greater than 1, whereas for the classification model it is less than 1. Hence, for the classification model the values need not be penalized, while for the regression model the hyperplane values must be. The paper therefore concludes that unilateral loss is suitable for the classification model and not for the regression model. Choosing a suitable loss function is thus necessary for any algorithm.

4 Methods

Based on the literature reviewed in Chapter 3, our technical contributions in the field of scene change detection are discussed in this chapter.
Section 4.2 explains the methodology for the Watershed algorithm, while Sections 4.3 and 4.4 elaborate on the methodologies for implementing the ResNet50-based model and the Spatial Pyramid Pooling model, respectively.

4.1 Dataset and Dataloader

The dataset consists of pairs of images captured by multiple cameras mounted inside the car. The images, originally at a resolution of 960x1280, are downsized to 480x640. The dataset comprises around six thousand raw images, which are randomly transformed into about twenty thousand image pairs. For the machine learning part, the dataset is split in an 80:10:10 ratio into training, validation and test sets. The models are trained on two different training datasets of about two thousand (smaller training dataset) and sixteen thousand (full training dataset) samples, respectively. The validation and test datasets are each 10 percent of the full dataset: about two hundred samples for the smaller training dataset and about two thousand samples for the full training dataset. A fixed random seed ensures that the models are tested on the same samples of the dataset, and the data loader provides functions to select subsets of the dataset if necessary. The dataset is loaded with a PyTorch data loader, which allows iteration over the dataset during training and delivers batched samples of a configurable batch size; a batch size of 4 is used during training, and the training and validation datasets are always shuffled. Training is accelerated on a CUDA-enabled GPU (Tesla K80), and the memory is pinned when loading data to enable faster transfer from the host to the GPU.

4.2 Watershed Algorithm

The Watershed algorithm works on grayscale images, so several preprocessing steps are applied before the difference image is fed to it.
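The loading pipeline described above can be sketched with PyTorch as follows; the tensor shapes and dataset size are tiny stand-ins for the real 480x640 image pairs, and the seed value 42 is illustrative:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Hypothetical stand-in for the image-pair dataset: each sample is a
# (t0, t1) image pair with a binary change mask as the label
pairs = torch.randn(100, 2, 3, 8, 8)
labels = torch.randint(0, 2, (100, 1, 8, 8)).float()
dataset = TensorDataset(pairs, labels)

# 80:10:10 split with a fixed seed so every model sees the same test set
g = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(dataset, [80, 10, 10], generator=g)

# Batch size 4, shuffled training data, pinned memory for faster
# host-to-GPU transfers when a CUDA device is available
train_loader = DataLoader(train_set, batch_size=4, shuffle=True,
                          pin_memory=torch.cuda.is_available())

for x, y in train_loader:
    break  # one batch: x has shape [4, 2, 3, 8, 8]
```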
The two colour-coded RGB (Red, Green, Blue) input images are shown in Figure 4.1 and Figure 4.2, respectively.

Figure 4.1: RGB input image 1
Figure 4.2: RGB input image 2

The difference of the two RGB images is generated by taking the absolute difference between their pixel values and converting the result to a grayscale image, as shown in Figure 4.3. This extracts only the pixels that have changed.

Figure 4.3: Difference image

Thresholding is a key operation in image processing. Otsu's method [54] is chosen to threshold the difference image, which makes the objects appear more prominent. Otsu's method is a binary thresholding operation: it returns the optimal threshold value together with the thresholded image, in which all thresholded pixels are assigned the value 255. The output is thus a binary image.

Figure 4.4: Morphological transformation

After thresholding, the remaining noise must be removed, so morphological transformations, namely dilation and opening, are applied to the binary image. First, the dilation operator expands the regions of object pixels (the sure foreground region, i.e., the region with maximum intensity 255). An opening operator then erodes the boundary of the dilated object pixels, also removing the noise around the boundary. A well-defined foreground region, i.e., the object pixels, is now identified; the output of the morphological transformation is shown in Figure 4.4. Next, a distance transformation is applied to the noise-free binary input. It operates on the binary image, in which all object pixels have the maximum intensity value 255 and all background pixels the lowest intensity value 0. At this stage, the sure foreground and background are known, and a marker is created. The marker-based approach allows us to label the regions of interest: the marker is an array of the same size as the image, used to label all the regions.
The marker labels the sure foreground region, giving the object pixels the value 255 (white pixels in the image); all other regions are marked with the value 0 (black pixels in the image).

Figure 4.5: Watershed final output

Finally, the watershed is applied to the image together with the marker, resulting in the segmented difference image shown in Figure 4.5. We thus use a combination of morphological transformation and marker-controlled watershed to segment the objects in an image.

4.3 ResNet50-Siamese

We use two different techniques in the ResNet50-Siamese based architecture for training: the fusing technique (Siamese_Res50_Fuse_Net) and the difference technique (Siamese_Res50_Diff_Net). The fusing technique combines the multi-layer features from the Siamese network, while the difference technique takes the difference of the multi-layer features. The architectures of the two techniques are explained in Section 2.2.12.3 and illustrated in Figure 2.12 and Figure 2.13. Based on the study [1], the base model uses only three stages of ResNet50 convolution, followed by a reconstruction network. The fusing and difference techniques are then applied and trained separately on the smaller two-thousand-sample training dataset. The fusing technique (Siamese_Res50_Fuse_Net) uses all four stages of ResNet50 convolution, followed by the reconstruction network (TransposeConv_Bilinear_Net), and is trained on both the approximately two thousand (smaller) and sixteen thousand (full) training datasets. The difference technique (Siamese_Res50_Diff_Net) also uses the four stages of ResNet50 convolution followed by the reconstruction network, and is likewise trained on both training datasets. Finally, the Siamese_Res50_Fuse_Net and Siamese_Res50_Diff_Net models are also trained with a global average pooling (GAP) layer.
GAP is applied to the last layer (2048 feature maps) of ResNet50, and these models are also trained on both training datasets. Since training on the full dataset takes significantly more time, the models to train on it are selected based on the results obtained by training the different models on the smaller training dataset. In all these trainings, the fully connected dense layer of ResNet50 is omitted. Each model is trained for 20 epochs on the smaller training dataset and for 30 epochs on the full training dataset; the technique with the best performance on the full training dataset is trained for an additional 20 epochs. Adam is used as the optimizer with the default hyperparameter settings β1 = 0.9 and β2 = 0.999. Reduce on Plateau (RoP) is the learning rate scheduler for both the smaller and the full dataset, with the scheduler's patience set to 3 and 5 for the smaller and full training datasets, respectively. The initial learning rate is set to 0.001. For backpropagation, a combination of binary cross entropy and dice loss is chosen as the loss function.

4.4 Spatial Pyramid Pooling

This algorithm is not presented in the report due to confidentiality issues associated with the company CEVT AB; it is based on CEVT's internal model.

5 Results

In this chapter, the semantic scene change results of both the classical and the machine learning algorithms are presented using different metrics, and the semantic scene change maps produced by both are depicted. The performance of the classical and machine learning algorithms is compared on a fixed set of test data that is unknown to the trained machine learning models. The test dataset is ten percent of the whole dataset, consisting of approximately two hundred samples (smaller test dataset) for the smaller training dataset and approximately two thousand samples (larger test dataset) for the full training dataset.
The metric scores used to evaluate the performance of the models are rounded to two decimal places.

5.1 Watershed Algorithm

Figure 5.1: (a) Image 1 (b) Image 2

Figure 5.1 shows the two input images for the Watershed algorithm.

Figure 5.2: IoU: 76.67 (a) Difference image (b) OTSU threshold (c) Watershed (d) Final output

As seen in Figure 5.2, the Watershed algorithm gives fair accuracy with an IoU (Intersection over Union) score of 76.67 for the given image pair; the algorithm segments the larger objects well. In Figure 4.5, the object pixels are comparatively small relative to the previous image pair, and the algorithm finds it hard to segment the objects accurately, yielding an IoU score of 30.60. The varying level of intensity has left an object undetected, and the marker-controlled watershed controls the over-segmentation, i.e., multiple irrelevant regions, only to an extent. On the samples from the smaller test dataset, the Watershed algorithm achieves a mIoU (mean Intersection over Union) score of 32.32: it produced an IoU score between 50.00 and 60.00 for approximately ten percent of the test samples and an IoU score below 50.00 for the rest. The mIoU score for the larger test dataset is 33.08, with an IoU score between 50.00 and 60.00 for approximately nine percent of the test samples and below 50.00 for the rest.

5.2 ResNet50-Siamese

The ResNet50-based models are evaluated using the pixel accuracy, mean intersection over union and dice metrics.

Figure 5.3: ResNet50-Siamese: Inference results on the two thousand training dataset

Figure 5.3 shows the inference results obtained by training the models on the custom dataset of approximately two thousand samples.
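The metrics reported throughout this chapter, IoU, dice and pixel accuracy, can be computed from binary masks as follows (an illustrative NumPy sketch; the empty-mask convention of returning 1.0 is an assumption):

```python
import numpy as np

def iou(pred, target):
    # Intersection over Union for binary masks (values 0/1)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

def dice(pred, target):
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total else 1.0

def pixel_accuracy(pred, target):
    # Fraction of pixels classified correctly
    return (pred == target).mean()
```

The mIoU score is then simply the mean of the per-sample IoU values over the test set.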
Figure 5.4: ResNet50-Siamese: Inference results on the sixteen thousand training dataset

Figure 5.4 shows the inference results obtained by training the models on the full custom dataset of around sixteen thousand samples. The following observations are made for the models trained on two thousand and sixteen thousand samples for 20 and 30 epochs, respectively. The models using four stages of convolution clearly outperform the model that used only three stages. The model using the difference technique (Diff) shows better metric scores than the model using the fusing technique (Fuse) with respect to pixel accuracy (PixAcc), mean Intersection over Union (mIoU) and dice. The addition of the global average pooling layer further improves the score for the model trained on the smaller dataset. The model using the difference technique yields mIoU scores of 79.37 and 87.78 on the test samples from the smaller and larger test datasets, respectively, and all test samples yield an IoU score greater than 78.00. Furthermore, the model based on the difference technique is trained for 20 more epochs to compare its performance with and without a global average pooling layer: the difference technique alone yields a mIoU score of 91.11, while the addition of the global average pooling layer yields 90.39. The model has not yet converged, which implies that it could be trained for more epochs. The memory footprint for training the model, based on the ResNet and Siamese architecture, on images of 480x640 resolution is around 6 GB. For the graphs of the training loss (Loss/train) and validation loss (Loss/val) for the best four models, refer to Appendix A.2. Figure 5.5 shows two different images (Image 1 and Image 2) captured at different times.

Figure 5.5: (a) Image 1 and (b) Image 2

Figure 5.6 shows the inference results of the four best models.
The following images are the prediction results after passing the two input images through the four models.

Figure 5.6: From (a) to (d), left pane: ground-truth labels; right pane: predicted results. (a) Fuse, (b) Diff, (c) Fuse+GAP, (d) Diff+GAP

The models and their respective scores for the given image pair are illustrated in Figure 5.7.

Figure 5.7: ResNet50-Siamese: Inference results for the given image pair

All the models yield good results. Even though the fuse technique with the global average pooling layer scored better for this particular image pair, Figure 5.4 shows that the difference technique with the global average pooling layer scored better on the whole test dataset.

5.3 Spatial Pyramid Pooling

The model is evaluated using pixel accuracy, mean intersection over union and accuracy. Figure 5.8 shows the inference results obtained by training the model on the two-thousand-image custom dataset; the inference is run on a test dataset of 205 samples. Figure 5.9 shows the inference results obtained by training the model on the sixteen-thousand-image custom dataset; the inference is run on a test dataset of 2056 samples. The model leaves a memory footprint of 10 GB.

Figure 5.8: SPP: Inference result on the two thousand images dataset
Figure 5.9: SPP: Inference result on the sixteen thousand images dataset

In Figure 5.10, it can be observed that the accuracy almost reaches 100 early in training. The accuracy metric shows how accurately the model predicts change.

Figure 5.10: SPP: Prediction accuracy graph (sixteen thousand images dataset)

The losses, such as the reconstruction loss, the autoencoder loss, and the BCE and KL losses, are worth following as the training progresses and can be found in Appendix 2.

5.4 Classical vs. Machine Learning

Figure 5.11: Comparison of classical and machine learning models

We compare the performance of the classical and machine learning models based on the results shown in the previous sections.
The performance is evaluated using mIoU scores on the test dataset from the sixteen-thousand-sample dataset. As seen in Figure 5.11, the machine learning models vastly outperform the classical one, and among the models based on the ResNet50 architecture, the difference technique yields the better mIoU score.

6 Discussion and Conclusion

This chapter discusses the selection of the various techniques used in the study, briefly outlines future experiments, and presents a summary of the study in Section 6.2.

6.1 Discussion

6.1.1 Watershed Algorithm

A classical algorithm such as the Watershed algorithm can be used for image segmentation, but the images must be preprocessed efficiently. The algorithm can work accurately for a single image, but a diverse image set containing objects of varying size, shape and illumination makes it difficult for the watershed algorithm to segment different images with the same accuracy. We observe that the algorithm is sometimes prone to over-segmentation depending on the image, a problem that is fairly well controlled by using markers. The experiment shows that the classical algorithm is not efficient for image segmentation over a large, diversified image set: a fair segmentation is only possible if the object of interest can be accurately extracted from the image. It is therefore necessary to extract the objects more accurately and adequately before applying the classical algorithm.

6.1.2 Machine Learning Algorithm

6.1.2.1 Choice of Residual Neural Network

The deep residual network ensures that all layers generate optimal feature maps by making the identity mapping optimal: the network is trained to learn the residual function such that it approaches zero. The identity skip connection (shortcut connection) mitigates the vanishing gradient problem by passing gradients to the initial layers, and it also mitigates the problem of accuracy degradation.
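The residual idea just described, learning F(x) so that the output F(x) + x can fall back to the identity, can be sketched as a minimal PyTorch block (an illustrative sketch, not the exact ResNet50 block used in the thesis):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal identity-skip block: output = F(x) + x.

    If the optimal mapping is the identity, the conv layers only
    need to drive the residual F(x) towards zero.
    """
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Shortcut connection: gradients flow directly through `+ x`
        return self.relu(self.body(x) + x)
```

Because the shortcut adds the input unchanged, gradients reach the earlier layers without passing through every weight layer, which is what mitigates the vanishing gradient problem.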
ResNet50 incorporates batch normalization to mitigate the problem of internal covariate shift and improve the stability of the network. The bottleneck residual block [8] is integrated into the ResNet architecture to increase the performance of the deeper layers and reduce the computational cost. ResNet50 is therefore used as the feature extractor, since it is a deep yet computationally feasible architecture, and it has proven ideal for generating extensive feature maps from an image compared to other ResNet variants. A deeper convolutional network could be experimented with in the future. It can be observed that including all four stages of ResNet50 outperforms the model that used only three stages; this may be because the lowest stage of the ResNet50 network helps in detecting low-level features. The addition of the global average pooling layer significantly increased the score for both the fusing technique (Siamese_Res50_Fuse_Net) and the difference technique (Siamese_Res50_Diff_Net). This is because global average pooling increases the context information and also helps mitigate overfitting by reducing the total number of parameters in the network. However, applying global average pooling to more layers did not improve the accuracy score, possibly because a pooling operation is already applied in the pre-trained convolutional network using strides.

6.1.2.2 Choice of Transfer Learning

Training the entire network is not a viable option because it requires more time to train and adds computational overhead. Since ResNet has a deep network architecture, the training dataset must be large enough to avoid overfitting. Feature extraction using a pre-trained ResNet50 is therefore an ideal choice and has shown itself to be one of the ideal architectures for scene change detection.
6.1.2.3 Choice of Image Spatial Dimension

A higher image resolution can improve the accuracy of the model. However, loading the original images at a resolution of 960x1280 hampers the data loading process and the memory requirements, so using this resolution was not practical; it is also important to preserve the aspect ratio. Therefore, the spatial dimensions of the original image are halved, which reduces the memory requirement by a factor of about four.

6.1.2.4 Choice of Loss Function

The ground-truth labels are binary images in which the background pixels outnumber the foreground pixels, which often leads to class imbalance. An appropriate loss function is therefore required so that the weights are updated correctly to reduce the loss at the next validation. A combination of BCE loss and dice loss was chosen, a combination that has proven to be an ideal choice for binary image segmentation tasks. With large input values, the BCE loss calculation can result in arithmetic overflow and is numerically less stable; a BCE loss combined with a sigmoid layer is therefore usually chosen to mitigate the overflow problem. In the future, we could experiment with training the network using this combination instead of the plain BCE loss.

6.1.2.5 Choice of Optimizer and Learning Rate Scheduler

The pre-trained ResNet50 model trained on the ImageNet dataset uses Stochastic Gradient Descent (SGD) as the optimization algorithm. Adam is an improved version of SGD and was therefore chosen. Based on training the network on the smaller dataset, we observe that 0.001 is an ideal initial learning rate for ResNet50: at a higher learning rate the parameters vary erratically and the network cannot settle into a minimum, while at a lower learning rate the network might settle into a false minimum.
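The loss and optimisation choices discussed in these subsections can be sketched with PyTorch. The learning rate 0.001, the β defaults and the plateau patience follow the text; the tiny stand-in model, the decay factor 0.1, the dice smoothing constant and the equal loss weighting are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; the thesis uses the ResNet50-Siamese network
model = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))

# Adam with the default moment coefficients beta1=0.9, beta2=0.999
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# Reduce on Plateau: cut the LR after `patience` epochs without
# improvement in the validation loss (patience 3 or 5 in the thesis)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=3)

# Combined loss: BCE with a fused sigmoid for numerical stability,
# plus a soft dice term to counter foreground/background imbalance
bce = nn.BCEWithLogitsLoss()

def combined_loss(logits, target, eps=1.0):
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return bce(logits, target) + dice

# A stagnating validation loss triggers one LR reduction over 6 epochs
for epoch in range(6):
    scheduler.step(1.0)
lr_now = optimizer.param_groups[0]['lr']
```

Because `BCEWithLogitsLoss` applies the sigmoid internally using the log-sum-exp trick, it avoids the overflow that plain BCE on raw probabilities can suffer with large logits.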
The learning rate 0.001 was chosen because it is the default value for training ResNet50 on the ImageNet dataset, an ideal value that is neither too low nor too high. The learning rate is updated with a patience of 5 when training on the entire dataset, based on monitoring the validation loss. The rate at which the learning rate decays is ideal: it reduces computational waste, aids the training process, and helps the validation reach its best result. The exponential decay of the first and second moment estimates is controlled by the hyperparameters β1 = 0.9 and β2 = 0.999, which are the default values. Due to time constraints it was not possible to train the network with different hyperparameter settings; instead, an attempt was made to choose the best initial values based on the various tests in this study. In the future, the network could be trained further by tuning the hyperparameters of Adam, such as the β coefficients and the learning rate. Adam with decoupled weight decay is another optimizer to experiment with in the future. Moreover, the best model can be trained for more epochs: the trend of the loss functions indicates that the ResNet-based model could score better in predicting the semantic change maps.

6.2 Conclusion

We have evaluated both classical and machine learning algorithms for scene change detection. The machine learning algorithm requires more computational resources than the classical algorithm, depending on the architecture of the training model; however, its ability to infer semantic changes from unknown datasets makes it invaluable. We investigated the feature extraction capability of ResNet50 and built a Siamese network architecture for semantic scene change detection, and we also investigated the object localization capability of a convolutional neural network using a global average pooling layer.
The difference technique yields the best mIoU score among the ResNet50-based models, and the addition of the global average pooling layer significantly increased the scores for all three metrics. Finally, we also investigated the capabilities of SPP, which showed good prediction accuracy. Based on our research and the results inferred from training, we believe that the machine learning models can achieve higher scores if trained for more epochs, i.e., until the network converges. Thus, it can be concluded that the machine learning algorithm significantly outperforms the classical algorithm. In a nutshell, this study suggests that machine learning models are more suitable for the development of scene change detection functions in taxis and passenger vehicles. There is scope for future research in evaluating the architecture for scene change detection using a deeper network, for example a ResNet with more than 50 layers. Multi-class object detection could also be implemented so that the model can classify the detected objects. Finally, a notification system could be built to notify the passenger of any misplaced belongings.

Bibliography

[1] K. Sakurada, T. Okatani, and K. Deguchi, “Detecting changes in 3d structure of a scene from multi-view images captured by a vehicle-mounted camera,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 137–144.
[2] R. C. Daudt, B. Le Saux, and A. Boulch, “Fully convolutional siamese networks for change detection,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 4063–4067.
[3] Z. Kourtzi and N. Kanwisher, “Cortical regions involved in perceiving object shape,” Journal of Neuroscience, vol. 20, no. 9, pp. 3310–3318, 2000.
[4] K. Das, B. Giesbrecht, and M. P. Eckstein, “Predicting variations of perceptual performance across individuals from neural activity using pattern classifiers,” Neuroimage, vol.
51, no. 4, pp. 1425–1437, 2010.
[5] C. Spampinato, S. Palazzo, I. Kavasidis, D. Giordano, N. Souly, and M. Shah, “Deep learning human mind for automated visual classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6809–6817.
[6] J. Laserson, “From neural networks to deep learning: zeroing in on the human brain,” XRDS: Crossroads, The ACM Magazine for Students, vol. 18, no. 1, pp. 29–34, 2011.
[7] J. Wu, “Introduction to convolutional neural networks,” National Key Lab for Novel Software Technology, Nanjing University, China, vol. 5, p. 23, 2017.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[10] L. Wang, S. Guo, W. Huang, and Y. Qiao, “Places205-vggnet models for scene recognition,” arXiv preprint arXiv:1508.01667, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[12] A. Bleau and L. J. Leon, “Watershed-based segmentation and region merging,” Computer Vision and Image Understanding, vol. 77, no. 3, pp. 317–370, 2000.
[13] S. Beucher et al., “The watershed transformation applied to image segmentation,” Scanning Microscopy Supplement, pp. 299–299, 1992.
[14] A. Bieniek and A.
Moga, “An efficient watershed algorithm based on connected components,” Pattern Recognition, vol. 33, no. 6, pp. 907–916, 2000.
[15] H. Ng, S. Ong, K. Foong, P. Goh, and W. Nowinski, “Medical image segmentation using k-means clustering and improved watershed algorithm,” in 2006 IEEE Southwest Symposium on Image Analysis and Interpretation. IEEE, 2006, pp. 61–65.
[16] J. A. Anderson, An Introduction to Neural Networks. MIT Press, 1995.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
[18] G. B. Arfken and H. J. Weber, “Mathematical methods for physicists,” 1999.
[19] S. Albawi, T. A. Mohammed, and S. Al-Zawi, “Understanding of a convolutional neural network,” in 2017 International Conference on Engineering and Technology (ICET). IEEE, 2017, pp. 1–6.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[21] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[22] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1. Citeseer, 2013, p. 3.
[23] Y. Pathak, K. Arya, and S. Tiwari, “Feature selection for image steganalysis using levy flight-based grey wolf optimization,” Multimedia Tools and Applications, vol. 78, no. 2, pp. 1473–1494, 2019.
[24] J. M. Duarte, J. B. d. Santos, and L. C. Melo, “Comparison of similarity coefficients based on rapd markers in the common bean,” Genetics and Molecular Biology, vol. 22, no. 3, pp. 427–432, 1999.
[25] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 240–248.
[26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[27] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
[28] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein, “Brain tumor segmentation and radiomics survival prediction: Contribution to the brats 2017 challenge,” in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 287–297.
[29] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, “Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
[30] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization?” in Advances in Neural Information Processing Systems, 2018, pp. 2483–2493.
[31] K. Sakurada and T. Okatani, “Change detection from a street image pair using cnn features and superpixel segmentation,” in BMVC, 2015, pp. 61–1.
[32] X. Zhao, H. Li, R. Wang, C. Zheng, and S. Shi, “Street-view change detection via siamese encoder-decoder structured convolutional neural networks,” VISIGRAPP, vol. 2, p. 2, 2019.
[33] A. Varghese, J. Gubbi, A. Ramaswamy, and P. Balamuralidhar, “Changenet: A deep learning architecture for visual change detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[34] K. L. de Jong and A. S.
Bosman, “Unsupervised change detection in satellite images using convolutional neural networks,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[36] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[37] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[38] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5353–5360.
[39] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[40] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
[41] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[42] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
[43] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[44] P. Baldi, “Gradient descent learning algorithm overview: A general dynamical systems perspective,” IEEE Transactions on Neural Networks, vol. 6, no. 1, pp. 182–195, 1995.
[45] D. E. Rumelhart, G. E. Hinton, and R. J.
Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
[46] L. Bottou, “Stochastic gradient learning in neural networks,” Proceedings of Neuro-Nîmes, vol. 91, no. 8, p. 12, 1991.
[47] L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[48] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.
[49] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
[50] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[51] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[52] B. Ding, H. Qian, and J. Zhou, “Activation functions and their characteristics in deep neural networks,” in 2018 Chinese Control and Decision Conference (CCDC), 2018, pp. 1836–1841.
[53] F. Nie, H. Zhanxuan, and X. Li, “An investigation for loss functions widely used in machine learning,” Communications in Information and Systems, vol. 18, pp. 37–52, 2018.
[54] J. Yousefi, “Image binarization using otsu thresholding algorithm,” University of Guelph, Ontario, Canada, 2011.

A Appendix 1

A.1 Kernel and Output Shape

Figure A.1 below depicts the kernel shape and output shape across each layer of the ResNet50-Siamese architecture.

Figure A.1: Kernel and Out