DF Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Master's thesis in Complex Adaptive Systems Alexander Radne, Erik Forsberg Department of Electrical Engineering CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2021

Master's thesis 2021 Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Alexander Radne, Erik Forsberg DF Department of Electrical Engineering Division of Computer Vision Chalmers University of Technology Gothenburg, Sweden 2021

Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Alexander Radne, Erik Forsberg © Alexander Radne, Erik Forsberg, 2021. Supervisor and examiner: Fredrik Kahl, Department of Electrical Engineering. Master's Thesis 2021:NN, Department of Electrical Engineering, Division of Computer Vision, Chalmers University of Technology, SE-412 96 Gothenburg, Telephone +46 31 772 1000. Cover: Illustration of different stages of vectorization of a floor plan. Raster image from a scanned or rasterized architectural drawing (left), the neural network's pixel-wise class segmentation map (middle) and the polygonized vector graphics image (right). Typeset in LaTeX, template by David Frisk. Printed by Chalmers Reproservice, Gothenburg, Sweden 2021.

Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Alexander Radne, Erik Forsberg Department of Electrical Engineering Chalmers University of Technology

Abstract

Machine Learning and Computer Vision techniques are rapidly improving computers' ability to comprehend images. In recent years, these techniques have been applied to information parsing on floor plan bitmap images, thus addressing the problem of converting rasterized images to vector graphics. Current state-of-the-art models have shown great results in predicting walls as well as room types and architectural drawing icons. However, these models require a large amount of annotated data, and since the cost of labelling can be quite high, the currently available datasets are limited in terms of diversity of styles and region-specific features. There is therefore an opportunity for algorithms that exploit unlabelled data to further improve these models. Semi-supervised learning is a set of algorithms commonly used to achieve this.

We propose and analyse three approaches utilising semi-supervised learning through self-training, by letting a model trained on labelled data make predictions on unlabelled data. We then use a collection of the best of these predictions as a basis for creating pseudolabels for further training. In the first approach, we use a probability measure on the model output as a proxy for high-quality predictions. Our second approach is to use a post-processing algorithm as a quality enhancement of the predictions on all unannotated images. Finally, we propose and evaluate our own prediction quality measurement, PixMax. This method aims to give a proxy for how confident the network is in its predictions by measuring the inter-consistency between several non-destructive augmentations of any input image. The resulting pseudolabels are then compared to evaluate whether the network is confident enough for them to be included in the continued training.
With PixMax we obtain results comparable with — and for recall better than — the fully supervised state-of-the-art model that we benchmark against. Our evaluations are carried out both on the labelled and unlabelled dataset used to train the models. As expected, the relative performance boost is most prominent on the unlabelled dataset where we reach a 69 % average recall. We show that the PixMax approach can be used for adapting a trained model to a new domain. Keywords: semantic segmentation, object detection, semi-supervised learning, floor plan images, domain adaptation, self-training. v Acknowledgements First we would like to thank our supervisor Fredrik Kahl for his support and guid- ance during the course of this project. He helped us to both on an academic and administrative level to find and access the right resources to develop the project in the desired way. Lars Hammarstrand helped us to get admitted to a compute project which allowed us to access GPU-resources. We would like to thank him as well as the team at C3SE for helping us with this. Also Anders Karlström was of great assistance to the project by taking of his time to read and sign the application for access to one of the datasets we used. During this time of social distancing and isolation, taking time for some coffee and small talk is more important than ever. We would therefore like to send a special thanks to Erica Samuelsson and Sara Eidenvall for sharing the morning coffee break with us every day and for all the interesting discussions that this led to. We would also like to in particular thank Adnan Fazlinovic, Joel Ekelöf, Sofia Malmsten among many others who have been supportive during this process. Finally we would like to thank our families and friends for all the support and patience shown during this time. Alexander Radne Erik Forsberg Gothenburg, January 2021 vii "You have broken new ground for the Architecture and Engineering programme" — Karl-Gunnar Olsson, former head of programme "Really nice stuff!" — Markus Häikiö, CTO, CubiCasa "Sometimes you don’t see the full picture for all the pixels." — Common saying ix x Contents Abstract v List of Figures xiii List of Tables xvii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Method outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Scope and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6.1 Vectorization of architectural floor plans . . . . . . . . . . . . 4 1.6.2 Raster-to-Vector & CubiCasa5k . . . . . . . . . . . . . . . . . 5 2 Theory 7 2.1 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 8 2.1.2 Optimisation and vanishing gradients . . . . . . . . . . . . . . 9 2.1.3 Residual networks . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 Bias-variance tradeoff . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Semi-supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Pseudolabelling . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Loss functions . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . 17 2.4.1 Loss functions and probability transformations . . . . . . . . . 17 2.4.2 Multi objective loss and relative loss weighting . . . . . . . . . 18 2.5 Consistency regulation . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.1 Vicinal Risk Minimisation . . . . . . . . . . . . . . . . . . . . 20 2.5.2 Geometric transformation consistency regularisation . . . . . . 21 3 Method 23 3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.1 Annotated data - The CubiCasa5k dataset . . . . . . . . . . . 23 3.1.2 Unannotated data - The Lifull Home’s dataset . . . . . . . . . 24 3.2 Pseudolabelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 xi Contents 3.2.1 Statistical approach . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.2 Post-processing technique . . . . . . . . . . . . . . . . . . . . 28 3.2.3 PixMax pseudolabelling technique . . . . . . . . . . . . . . . . 30 3.3 PixMax self-training . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 Network model . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.2 Data diversity augmentations . . . . . . . . . . . . . . . . . . 33 3.3.3 Evaluation datasets . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.1 Hardware specifications . . . . . . . . . . . . . . . . . . . . . . 35 3.4.2 Experimental setup for model training . . . . . . . . . . . . . 36 4 Results 39 4.1 Pseudolabelling techniques . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.1 Statistical approach . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.2 Post-processing technique . . . . . . . . . . . . . . . . . . . . 41 4.1.3 PixMax pseudolabelling technique and model training scheme 42 4.2 Results for PixMax model training scheme . . . . . . . . . . . . . . . 44 5 Discussion 49 5.1 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.1.1 Pseudolabelling techniques . . . . . . . . . . . . . . . . . . . . 49 5.1.1.1 Statistical approach . . . . . . . . . . . . . . . . . . 49 5.1.1.2 Post-processing technique . . . . . . . . . . . . . . . 50 5.1.1.3 PixMax pseudolabelling technique . . . . . . . . . . 50 5.1.2 PixMax model performance . . . . . . . . . . . . . . . . . . . 51 5.2 Limiting factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.1 Data sufficiency and utilisation . . . . . . . . . . . . . . . . . 52 5.2.2 Post-processing algorithm . . . . . . . . . . . . . . . . . . . . 53 5.3 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4 Contributions and implications . . . . . . . . . . . . . . . . . . . . . 54 6 Conclusion 55 Bibliography 57 A Results of all tested model hyperparameters I B Class distribution III C Visual comparison of models V xii List of Figures 1.1 A concept representation of the method first introduced by Kalervo. et al [1] where a specific set of interest points are detected to aid the vectorization algorithm that is separate from the main network model. 6 2.1 The concept of a convolutional layer. In this particular example we have data in 2 dimensions and a third kernel dimension. The items in the data tensor gets element-wise multiplied with a kernel tensor and summed to form the consequent layer in the network. . . . . . . 
9 2.2 For nested function classes, using a bigger function class means that we can get closer to the true function G, but this is not necessarily the case for non-nested function classes. . . . . . . . . . . . . . . . . . 12 2.3 The structure of the ResBlock. The function f is split into a residual and an identity function. Only the residual function is propagated through the network to later be added back together with the identity function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 The concepts of underfitting and overfitting a model to the data. The model in the middle has a good balance between capturing the main features of the data but is at the same time stable to noise and therefore better approximates the true function (green). . . . . . . . . 14 2.5 The self-training scheme described in [31]. . . . . . . . . . . . . . . . 16 3.1 Examples of the visual style of the images of the three categories in the CubiCasa5k dataset with their respective labels above. The images are scaled to fit the page format. . . . . . . . . . . . . . . . . . 24 3.2 A visual representation of all the different annotation categories of the CubiCasa5k dataset. Junctions, openings and corners are lists of coordinates while rooms and icon categories are pixel-wise segmenta- tion maps over the image. . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3 Examples of the visual style of the images of in the LIFULL HOME’s dataset. The images are scaled to fit the page format. . . . . . . . . . 25 3.4 The resolution distributions of the different datasets used in the project. A simple random sample of 4000 image instances of each set was used. 25 3.5 An example of what a correlation between the prediction certainty and correctness could look like. The pixels that the network is most sure about is to a high extent also the pixels that are classified cor- rectly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 xiii List of Figures 3.6 Examples of what qcc does with the segmentation maps from the room channel. The examples are randomly sampled from the test set of the CubiCasa5k dataset. The most visually prominent changes is that all segmentation have been translated into simple polygons. . . . . . . . 30 3.7 The concept of how the post-processing algorithm qcc works. Given the predicted junction heatmaps and the room and icon segmenta- tions, qcc can "clean up" the segmentations e.g. by inferring a closed room between 4 suitable L-type corners. . . . . . . . . . . . . . . . . 30 3.8 The PixMax model training scheme. Light blue: The labels for the labelled dataset. Dark blue: The images and predictions for the labelled dataset. Dark green: The images and model predictions for the images in the unlabelled dataset. Light green: The pseudolabels created by the model in the pseudolabelling phase. . . . . . . . . . . 32 3.9 A simplified illustration of the architecture of the model used. Image from [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.10 The overall accuracy on the LIFULL HOME’S test set for models trained with different β-thresholds for selecting what pseudolabels to use. Blue: Models trained using the ignore index setting described above. Red: Models trained without the ignore index setting. : Models evaluated with test-time augmentations. : Models evaluated without test-time augmentations. : The best model that we found in our final experiments. Called ours in the following section. . . . . . 
37 4.1 Left: The proportion of the pixels with US(Gθ) ≥ x for 4 different images. Take note of the logarithmic scale on the x-axis. Right: A zoom-in on the graph of the first image with a higher resolution. . . . 40 4.2 Left: The decrease in Labs as a function of how big proportion of the pixels removed for 100 images. The green, dashed line shows the average over all images. The values are calculated at fixed intervals and interpolated in between. Right: The quotient of the loss of the whole image and the truncated image with respect to the fraction of pixels removed. Note the logarithmic y-axis. All values are weighted to compensate for the fraction of the pixels removed and the size of the image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3 The correlation between LABS before and after qcc for 100 images. . . 42 4.4 The correlation between LCE before and after qcc for 100 images. . . . 42 4.5 A histogram over the distribution of β for 8400 images from the LIFULL HOME’S dataset. The best sample fit for the gamma- distribution has the shape parameter k = 4.22 and the scale pa- rameter θ = 55.69. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.6 Examples of predictions images with different β-values. Column 1: The original image and the room colour legend. Column 2-5: The predictions for each of the augmentations. Column 6: The resulting pseudolabel (most common pixel prediction) and per-pixel βi,j-value maps for the images. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 xiv List of Figures 4.7 Comparison of results from different models. Evaluated on 4 images from the LIFULL HOME’s dataset. . . . . . . . . . . . . . . . . . . . 47 5.1 A conceptual model training training training scheme for inductive conformal prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 B.1 The per-pixel distribution of the room classes in the CubiCasa5k dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III B.2 The per-pixel distribution of the icon classes in the CubiCasa5k dataset. III B.3 The per-pixel distribution of the room classes in the LIFULL HOME’s dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III B.4 The per-pixel distribution of the icon classes in the LIFULL HOME’s dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III C.1 Comparison of results from different models. Evaluated on 4 images from the CubiCasa5k dataset. . . . . . . . . . . . . . . . . . . . . . . VI xv List of Figures xvi List of Tables 1.1 The layers of information that our model extracts from a floor plan image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.1 The structure of the output layers of the network. . . . . . . . . . . . 24 4.1 1st row: The β-thresholds used for model training in the PixMax training scheme. 2nd row: The number of images with a β-value larger than the tested set of thresholds. 3rd row: The percentage of pseudolabelled images used in the model training scheme (4200 labelled examples used for all runs.) . . . . . . . . . . . . . . . . . . 43 4.2 Per class-comparison between CubiCasa’s (CC) model[1], our best reproduced model of CC and our best model achieved using PixMax. All models evaluated on our LIFULL HOME’s test set. Note that classes with a (-) is not present in the test set and can not be evaluated. 45 4.3 Performance comparison of models. 
CubiCasa’s best model vs our best model trained on CubiCasa5k training data and unannotated LIFULL HOME’s data, tested on the CubiCasa5k test set . . . 46 4.4 Performance comparison of models. CubiCasa’s best model vs our best model trained on CubiCasa5k training data and unannotated LIFULL HOME’s data, tested on our annotated LIFULL HOME’s test set . . . 46 4.5 Performance of our best model (with β = 0.97 and TTA) on the LIFULL HOME’s test set. Subscript p stands for the post-processed (polygonized) predictions. . . . 46 A.1 Full table of all evaluated models. . . . II

1 Introduction

In this chapter we give a brief background to the topics investigated in this thesis. We start by describing the possible benefits for the industry of using the techniques that we propose, and we then give a brief outline of our proposed method. In the last section of the chapter we present a chronology of work that has been done on related topics over the last decade, and the papers that this project is based on.

1.1 Background

The building industry has undergone several large technological changes during the last few decades. One of the most prominent examples of this is the way digital tools are now used to aid the complex coordination of big projects that span multiple disciplines and long time scales. Even though the industry is in general open to digital development and progression, the evolution is slow due to the long project time scales and the correspondingly slow turnover of information. Until a few years ago, digitalisation mostly focused on streamlining the design process, for instance by simplifying communication between disciplines with automated clash checks and by moving from 2D drawings to 3D modelling software.

Today effectively all new production design and planning is heavily aided by digital tools such as CAD (Computer Aided Design) and BIM (Building Information Modelling) software. These tools greatly improve the coordination of workflows between different disciplines, and they make it easier and faster to change parts of the design at different stages of the design process compared to the traditional way of designing buildings with pen and paper. Moreover, modern software of this kind is in general both vector-based and supports some form of object-oriented modelling. This gives the user the ability to combine drawings and other kinds of information in an efficient way.

Vector-based drawings have the advantage of being easy to both modify and annotate compared to raster-based drawing formats. In combination with their metadata attachment capability, they are in many ways far superior to traditional raster-based images when it comes to versatility and maintenance, not only in the planning of a project but throughout its whole life cycle. The problem that we want to address is that many of the drawings used to convey information to clients and customers are stripped of this information when they are converted to a raster image format for distribution outside the software where they were originally made.
There is also a large fraction of drawings that were made before the adoption of this software and that hence have only ever existed in a raster-based image format or as physical printed drawings. By converting these drawings (back) into vector format, many new possibilities for how they can be used will emerge.

Floor plans are the type of architectural drawing most often used to carry information about a building or an apartment to the general public, and they are also one of the most common types of drawing to encounter in all sorts of projects. Having these drawings in an annotated, vector-based format would open doors for e.g. property owners, real estate agents and property management firms that want to convey information about floor plan layouts in a more intuitive way. This could be done by e.g. creating a 3D representation of the property on their website, which is a much easier task with a vectorized floor plan as a basis than with a raster image, since the geometrical information is represented explicitly in a vector-based image. Another potential application is to extract information from old drawings to be included in a reference database that architects, planners, engineers and others can use to make informed decisions.

1.2 Proposal

Our proposal is to improve on the current automated pipelines that exist for converting raster-based floor plan images into a vectorized (mathematically represented) format with the use of machine learning. To do this, we want to create a model that is able to distinguish between several of the most common ways that floor plans are represented, by detecting a finite set of features such as walls, room spaces, doors and windows. After identifying these features, the model should be able to create an accurate vector representation of the floor plan, with geometries in the form of drawing "symbols" and metadata attached to these object symbols. The output from the model should be in a format that is easily read and converted to the most common and widely used CAD file formats. Once the result is converted to one vector-based format, it is quite easy to convert it to others, since most modern CAD software has built-in methods for converting files from other common formats.

Recent work in the area by Kalervo et al. and Liu et al. [1], [2] has been shown to give good and reliable results using Artificial Neural Networks (ANN). Despite reaching impressive results, the lack of large, annotated datasets is in these works pointed out as one of the greatest challenges in creating a model with even better generalisation capabilities. This project aims to work towards a solution to this problem by introducing a framework for using unannotated floor plan images to let the model learn to work on images with novel drawing styles. By extension, this idea can also be seen as a step towards being able to create larger, annotated, custom datasets of floor plans that can be used for data analysis or to train ever more intricate models. In other words, a good and reliable model for parsing floor plan images might be used for annotating large datasets to be used for other applications.

1.3 Method outline

To address the lack of large quantities of annotated data of high quality, our proposed method is based on Kalervo et al. [1] and Liu et al.
[2], with the difference that we instead use a semi-supervised approach, allowing us to use a large dataset of unannotated floor plan drawings to train the model further.

Our approach consists of letting the network model perform predictions on unannotated data. These predictions will, after carefully chosen refinements, be used as pseudolabels for the model to be further trained on. In order to increase the performance of the network we test several techniques for improving the quality of the pseudolabels based on the network output. In the first pseudolabelling method, we evaluate the potential correlation between the "confidence" (network output after the softmax layer) and the average pixel accuracy. Second, we measure the potential increase in quality after using the post-processing algorithm from [1]. Finally, we develop a prediction quality measurement — PixMax — based on a batch of non-destructive augmentations of the same image. This measurement is used to select only pseudolabels of high quality for further training.

1.4 Scope and limitations

We limit the scope of the project to architectural floor plans only. In most projects, a variety of different floor plan drawings are used to convey different types of information to the construction workers. These plans can include installations and fixtures, electrical wiring and plumbing, the materials being used and different phases of the building process. We have chosen to only look at architectural drawings since this is the type of drawing that is primarily used after the building is finished to display information about its architectural qualities. To further limit the scope of the project we only use single-level floor plans for our model. Following the approach of [1], we limit the scope to a set of 12 room classes, 11 icon types and 21 types of interest points, as given in Table 3.1.

Table 1.1: The layers of information that our model extracts from a floor plan image. In total 44 output maps: 21 interest point heatmaps (wall corners: 13, opening endpoints: 4, icon corners: 4) and 23 segmentation maps (room classes: 12, icon classes: 11).

Two datasets will be used in this project: one annotated dataset that is used to train the initial model, and one unannotated dataset that will be used for creating pseudolabels and for evaluating the model's performance after the extended training.

1.5 Research questions

• With what accuracy are we able to recover information from a raster image of a floor plan to a vectorized representation using our proposed method PixMax?
• How does our algorithm compare to state-of-the-art results in the field?
• Could a semi-supervised algorithm be used to improve the results based on only a fraction of annotated data in the dataset?

1.6 Related work

The following is a brief summary of the field and the two papers that have inspired this thesis the most.

1.6.1 Vectorization of architectural floor plans

The problem of converting floor plan raster images to a vector-based format has been explored extensively over the last few years [1]–[5]. The techniques used for the task have shifted from conventional algorithms such as patch-based segmentation [4] to the use of neural networks, as bigger datasets have been released and the cost of computation has become cheaper, making the data-hungry networks a viable option [6]. Due to the intricacy of the task and the variety in the data, neural networks have been shown to yield good results compared to other algorithms.
This can be attributed to their ability to find complex correlations in the data and to generalise [1], [2].

A commonly used method for image parsing in recent years is semantic segmentation, where an image is split up into a set of per-class, pixel-wise segmentation maps, each corresponding to one of N predefined classes in the specific dataset [7]. Several datasets have been released and the results have been steadily improving as better techniques have evolved [8]. Although great results have been shown on a variety of different datasets, the main focus of the research has been on natural image segmentation, since this has been an important problem to solve for several major industries, such as the automobile industry, where the goal of building self-driving cars is a strong driving force.

Although not the main focus of the research in machine learning, a significant amount of work has been done on semantic segmentation for human-created images such as drawings. The idea of using segmentation maps as a way to automatically vectorize architectural floor plan drawings in some cases predates the use of neural network based methods. Heras et al. used a statistical, grid-based method to segment walls, windows and doors from floor plans [9]. To train their model they used the very popular CVC-FP dataset [10], at the time one of the biggest and most popular publicly available floor plan datasets with annotated images. After this work, segmentation methods making use of neural networks seem to have become increasingly popular. Dodge et al. [3] proposed in 2017 a method where Optical Character Recognition (OCR) from the Google Vision API was combined with a fully convolutional network, the Faster R-CNN framework, to obtain a model that can both segment walls and interpret semantic information in the input, such as measurements and room types written in the drawing. In their work they also introduced a new public floor plan dataset known as the R-FP dataset, containing 500 high-resolution real-estate floor plan images.

In 2018 Yang et al. managed to segment walls and doors simultaneously [5] using U-Net+DCL, an alteration of the U-Net where the deconvolutional layers were replaced with a simplified version of pixel deconvolution layers. They managed to achieve a validation pixel accuracy of 97.5% and 99.5% for walls and doors, respectively, establishing that impressive results can be reached using convolutional neural networks on real-estate floor plans.

1.6.2 Raster-to-Vector & CubiCasa5k

The two works that this project is most heavily inspired by are Kalervo et al. [1] and Liu et al. [2]. Liu et al. proposed in their 2017 paper a learning-based method with multiple objectives. By transforming a raster image with the model into both heatmaps of low-level geometric and semantic information (a set of corner and end points) and a semantic segmentation map of different room and icon types, they managed to extract multiple layers of information from a single image. In contrast to Dodge et al. [3], where the architecture consisted of multiple networks that learned separate tasks, a single model was used for all learning goals. This was done by implementing a single fully convolutional network (FCN) with a multi-objective loss for the different output maps, and then combining the individual losses, with one weight per prediction category, into a total loss that is used to backpropagate the network.
The advantage of this approach is that the geometric feature maps (corners, wall end points etc.) can be used in an intelligent post-processing scheme that aims to refine the rather coarse network output in terms of the room and wall segmentation maps. A conceptual illustration of this can be seen in Figure 1.1. For example, by knowing with high precision the four corner points of a rectangular room, the segmentation map of the pixels within that room can in some cases be improved, since all pixels within that room most likely are of the same class. This has been investigated and evaluated in Section 3.2.2 to see if it can be used to further improve our model. Their algorithm ultimately yielded around 90% precision and recall for wall junctions, walls, drawing icons and rooms on the LIFULL HOME’S dataset [11], significantly outperforming most other methods trying to extract the same amount of information from a single drawing.

In 2019, Kalervo et al. [1] continued this work by using the same baseline network architecture, ResNet-152 [12] pretrained on ImageNet [13], but they extended the dataset used substantially by making use of their (CubiCasa) manual floor plan annotation pipeline to collect 5 000 high-quality, human-annotated data points from a set of 15 000. They also more than doubled the number of target room and icon classes to get more reliable and exact predictions. Making use of this data, they managed to outperform [2] in both recall and accuracy for all classes but one, all while making use of a single model for predicting all the different feature maps with varying learning criteria.

This project intends to continue building on the framework that was created and subsequently refined by these two works.

Figure 1.1: A concept representation of the method first introduced by Kalervo et al. [1], where a specific set of interest points is detected to aid the vectorization algorithm that is separate from the main network model.

2 Theory

This chapter will give a brief background to some of the techniques and key concepts used in machine learning in general and in this project in particular. We will cover the theory that is relevant to the project; it will however be assumed that the reader has a solid understanding of the field a priori. We will go into themes related to semi-supervised learning and computer vision through convolutional neural networks more thoroughly, since these are the primary concepts for this work and it is vital to understand the techniques used properly. We will also touch upon concepts such as residual neural networks and multi-objective loss functions.

2.1 Deep learning

Deep learning is a subgenre of machine learning that deals with algorithms based on Artificial Neural Networks (ANN) and representation learning. Deep learning is substantiated by the notion that there exists some non-linear, often complex function G that can map points x from a high-dimensional input domain H to target points y in a defined target domain I:

G : H → I. (2.1)

The assumption we usually make is that for certain kinds of problems we only need to know a small fraction of all point mappings from input to target space to be able to predict a much bigger fraction with good precision. Based on this assumption alone we cannot induce any bound on the complexity of G. To make this model framework useful we need to be able to approximate it with a reasonably good parameterised approximation Gθ.¹

¹ θ typically represents the weights and biases in an artificial neural network.
So we want Gθ to mimic the mapping of G for x in the domain we are concerned with. For a parametrization θ of dimension m, the objective can be stated as follows:

find θ ∈ R^m s.t. ∀x ∈ H : Gθ(x) ≈ G(x). (2.2)

Since we in general do not know what G does for the majority of all x ∈ H, we cannot explicitly use equation 2.2 to find θ. But if we have a random sample of known mappings of size B, we can view {x, y} = {{x1, . . . , xB}, {y1, . . . , yB}} as a random variable and use this to approximate the real probability distribution of x, P(x), by an empirical approximation Pemp(x). We can now try to find the set of parameters θ̂ that, when applied to a fixed G, minimises the expected loss over this simplified probability distribution Pemp(x):

Gθ := Gθ̂ with θ̂ := argmin_θ ∫ L(Gθ(x), y) dPemp(x). (2.3)

Here L is a loss function that measures the difference between the prediction and the target. The exact composition of this function will be further discussed in Section 2.4.1. For practical purposes we would also like θ not to be too big, since we want to be able to conduct calculations with it in reasonable timescales.

Generally speaking, it is not always the case that such a function exists, but it has been shown empirically that for certain kinds of problems it often seems to be. Fortunately, these problems often coincide with problems that have applications in many areas, and that is the reason why deep learning, and in particular deep artificial neural networks, has become so popular in recent years. It gives us a framework for finding well-behaved parametrizations of seemingly arbitrary mappings.

2.1.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a type of artificial neural network that is commonly used in computer vision applications. Their high performance in image analysis partly comes from their shared-weights architecture and translation-invariant characteristics. A convolutional network is defined as any artificial neural network that uses one or more convolutional layers in its architecture.

A convolution can be understood as a filter that is slid over portions of the previous layer to calculate the next. This gives the model a way of perceiving neighbourhoods in the input vector, and it is therefore useful when there are thought to be large structures in the data that are linked to the closeness of its building blocks. One of the most classical examples of this is shapes and objects in an image. The filter is called a kernel, and it can be distributed in one or more dimensions. For analysis of colour images, 3-dimensional kernels are most commonly used, since this corresponds to the two spatial dimensions of the image plus the "channel dimension" where the red, green and blue values are stored separately.

A kernel is a small matrix of weights. The placement of the individual weights in the kernel can be arranged such that it is tuned to detect a certain kind of low-level feature, such as lines, edges or dots, in the data tensor. This is done by element-wise multiplication between the kernel and the values in a region of the preceding layer, as can be seen in Figure 2.1.
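To make the multiply-and-sum operation in Figure 2.1 concrete, the following is a minimal NumPy sketch of a single 2-D kernel slid over a 2-D input without padding. It is an illustration written for this text, not code from the project; in practice a framework routine such as torch.nn.functional.conv2d would be used.

# A minimal sketch (not the thesis implementation) of the multiply-and-sum
# operation in Figure 2.1: one 2-D kernel slid over a 2-D input, no padding.
import numpy as np

def conv2d_single_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiplication of the kernel with the region it
            # currently covers, summed to one value in the consequent layer.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

if __name__ == "__main__":
    img = np.random.rand(8, 8)
    edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # crude vertical-edge detector
    print(conv2d_single_kernel(img, edge_kernel).shape)  # (6, 6)

Stacking several such kernels along an extra dimension gives the kernel tensor discussed next.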
By combining multiple kernels with different feature detecting abilities into a kernel tensor, the model can assimilate the distribution of such features in the data, which can hence be analysed with a maintained geometrical interpretation, as opposed to fully connected layers where all geometrical integrity is lost.

Fully convolutional networks are networks that do not contain any fully connected layers but rely completely on convolutions throughout the propagation. Since a fully connected layer is equivalent to using a kernel of size 1 × 1, or 1 pixel in the image processing setting, the geometric interpretation of an FCN with larger kernels is that it exclusively considers regions in its propagation and never the values of unique neurons.

Figure 2.1: The concept of a convolutional layer. In this particular example we have data in 2 dimensions and a third kernel dimension. The items in the data tensor get element-wise multiplied with a kernel tensor and summed to form the consequent layer in the network.

2.1.2 Optimisation and vanishing gradients

In Section 2.2 we will describe how we want to improve our candidate Gθ over time with a backpropagation algorithm B that depends on the current parameter state and the chosen loss function. There are however many ways we can define B to do this. The most straightforward approach would be to look for the direction in which to tweak the parameters of Gθ to get the biggest local decrease in the loss function. We can then take a step of size η, known as the learning rate, in that direction. This is what is called the gradient descent method, and one update step can be written as

θt+1 ← θt − η ∇θ L(θt | x, y) (2.4)

for some loss function L and our input–target dataset (x, y) of size N. However, there are a few problems with this method. Since the gradient is calculated for every single data point in each step, it can be very slow if N is large, especially if θ is large as well. Another unwanted feature of this method is that it is greedy, in the sense that it will always choose the direction that is locally thought to be the most efficient step at any time. The problem with this is that the algorithm can get stuck in a local optimum without having any chance of getting out of it to find the global optimum.

A popular way to solve these problems is to use a stochastic gradient descent method, first described in a paper by Robbins et al. [14]. In our setting, one step can be described as

θt+1 ← θt − η ∇θ (1/K) Σ_{i=1}^{K} L(θt | xi, yi), (2.5)

where {x1, . . . , xk} ⊂ x is a random subset of x of size k, with corresponding targets {y1, . . . , yk} ⊂ y. In the original paper k was set to 1, but in the general case the batch size can be set to any number 1 ≤ k < N to reduce the probability of an unrepresentative sample while maintaining a big computational advantage compared to deterministic gradient descent. Since {x1, . . . , xk} is a random variable, it introduces the possibility of occasionally moving in locally non-optimal directions that can be globally beneficial, which makes the method less prone to getting stuck in local optima.

Both of these algorithms have a fixed learning rate η that does not change throughout the training. There have been many approaches to making the optimisation more effective with dynamic learning rates. Some of these methods include AdaGrad, which works well with sparse gradients [15], and RMSProp, with good performance in on-line, non-stationary settings [16].
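Before turning to adaptive learning rates in more detail, a minimal PyTorch sketch of the minibatch update in equation 2.5. The model, loss and data below are placeholders chosen for illustration, and in practice the built-in torch.optim.SGD would be used instead of the explicit parameter update.

# A minimal sketch of minibatch stochastic gradient descent (equation 2.5),
# with the update written out by hand to mirror the formula.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)                    # stand-in for G_theta
loss_fn = nn.CrossEntropyLoss()             # stand-in for L
data = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=16, shuffle=True)  # random subsets of size k
eta = 0.1                                   # learning rate

for x_batch, y_batch in loader:
    loss = loss_fn(model(x_batch), y_batch)  # (1/K) sum_i L(theta | x_i, y_i)
    model.zero_grad()
    loss.backward()                          # gradients with respect to theta
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad                # theta <- theta - eta * gradient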
In 2015, Diederik Kingma and Jimmy Ba proposed an algorithm called ADAM, short for adaptive moment estimation, that combines the benefits of AdaGrad and RMSProp [17]. It does this by introducing momentum into the training schedule through decaying moving averages of the gradient and the squared gradient. One step of the ADAM algorithm can be described as executing the following steps. We first update the decaying moving averages of the gradient m and the squared gradient v:

m ← β1 m + (1 − β1) ∇θ L(Gθ | x, y),
v ← β2 v + (1 − β2) ∇θ L(Gθ | x, y)². (2.6)

These become estimates of the 1st and 2nd moment of the gradient of the objective function L. Since they are initialised to 0, they are negatively biased. To counteract this we calculate bias-corrected versions of these variables in the following way:

t ← t + 1, m̂ = m / (1 − β1^t), v̂ = v / (1 − β2^t). (2.7)

We also update our time parameter, since we have the time-dependent terms β1^t and β2^t that we want to become progressively smaller as the effect of the initialisation wears off. Finally, we update the parameters of the network using m̂ and v̂ with the following equation:

θt+1 ← θt − η m̂ / (√v̂ + ε). (2.8)

In ADAM we introduce three new hyperparameters: β1, β2 and ε. The β-terms correspond to the exponential decay rates of the 1st and 2nd moment estimates of the gradient respectively, and ε is just a small number that prevents us from getting a zero term in the denominator of equation 2.8.

It was for some time hypothesised that it was possible to create more powerful convolutional networks just by stacking more layers, because of the recent breakthroughs in image classification [18] and object detection [19]. However, it was also recognised that deeper neural networks are often more difficult to train, and it was shown in 2016 that if you just keep adding more layers to a network, it eventually gets worse, not better [12]. This is in part because adding more parameters will make a network more prone to overfitting if the dataset is small [19], but another big contributor is what has been called the fundamental deep learning problem: the vanishing gradient problem. It was first formally identified by S. Hochreiter in 1991, and ten years later an additional paper in English by Hochreiter et al. was published that elaborates further on the topic with more extensive surveys [20], [21].

The vanishing gradient problem stems from the way deep neural networks are traditionally trained. Through backpropagation, the weights of each layer are updated based on the gradient of the previous layer's activation function [22]. The core of the problem comes from the fact that the activation function is chosen to squeeze any input into a much narrower range, e.g. (0, 1) for the commonly used sigmoid function. As the derivatives propagate through the network, they become a chain of derivatives that each depend on the previous one. For the weights of the first hidden layer, the update formula becomes

∂L/∂W1 = (∂L/∂Vn)(∂Vn/∂Vn−1) · · · (∂V1/∂W1), (2.9)

where Vk and Wk are the outputs and weights of the kth layer. Now, since each layer uses an activation function, we are going to get the derivative of the activation function as the outer derivative for each layer. In the kth layer we get

∂Vk/∂Vk−1 = (∂φ(zk)/∂zk) Wk (2.10)

for some activation function φ, where zk = Vk−1 × Wk [23]. If we choose φ(x) = Sigmoid(x), the terms containing ∂φ always have an amplitude in the interval (0, 1/4].
The standard approach to weight initialisation in a typical neural network is W ∼ N(0, 1). Hence, the weights in a neural network will also usually be between −1 and 1. As we multiply more and more of these terms together, it is easy to see that the gradients quickly grow small and hence are barely affected by the backpropagation. This also explains why the problem is especially prominent in networks with many hidden layers. On the other hand, if an activation function with a large derivative in the relevant interval is used, these terms can also accumulate and instead cause exploding gradients. Exploding gradients result in exponentially large updates to the network weights, which is likely to cause a very unstable network.

2.1.3 Residual networks

There are a few ways to deal with the problem of vanishing gradients. The trivial solution is to just make the networks shallower. However, this solution has some drawbacks, since it has been shown that the depth of the network is often of great importance to its performance, as stated earlier [18], [19], [24]. Moreover, the activations at different depths of a deep network have been shown to sometimes have a useful interpretation by encoding a hierarchy of different feature sizes. The early layers can be thought of as representing low-level features such as lines and dots, while layers closer to the output are capable of capturing high-level features such as shapes or objects [25].

A better solution to the problem was proposed by He et al. with the introduction of ResNet in their 2016 paper [12]. What they suggested was to add skip connections to the network to avoid the problem. A skip connection is a connection that jumps over a certain number of layers, a so-called ResBlock, and then connects back to the network.

The argument for using deeper networks is that since a deep network Gn defines a more powerful function class than its shallow counterpart Gk, it should in some sense have a better potential to mimic the true mapping G that we want to find. However, this might not be the case, because it assumes that Gk is nested within Gn such that Gn can do everything that Gk can do and more [26]. This concept is illustrated in Figure 2.2.

Figure 2.2: For nested function classes, using a bigger function class means that we can get closer to the true function G, but this is not necessarily the case for non-nested function classes.

The reasoning behind the proposal by He et al. is that in theory the function class of a deeper model should completely enclose that of a shallower one, since it can just mimic the shallower model by using identity mappings, but that in practice this is not always the case, because the identity function is not a trivial function to learn. Hence the deeper model can struggle to make as good predictions as the shallow one, simply because it needs a lot of training data just to learn which layers should have an identity mapping.

By realising this, it was deduced that we can help the model by explicitly reformulating the layers as residual functions with reference to the layer inputs. This builds on the observation that we can split any function f(x) into a sum of the identity function I(x) := x and a residual function r(x) := f(x) − x. We can then propagate the residual function through a number of layers in the network and then add the identity back to it, as can be seen in Figure 2.3.
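A minimal PyTorch sketch of such a residual block, in the spirit of Figure 2.3. It is simplified compared to the blocks actually used in ResNet [12], which also contain e.g. batch normalisation; the layer sizes are illustrative.

# A minimal residual block: the block learns only the residual r(x) and the
# identity is added back at the end, f(x) = r(x) + x.
import torch
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pushing the residual weights towards zero leaves the identity mapping.
        return torch.relu(self.residual(x) + x)

if __name__ == "__main__":
    block = ResBlock(channels=8)
    print(block(torch.randn(1, 8, 32, 32)).shape)  # torch.Size([1, 8, 32, 32])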
This makes it easy for the model to "skip" a layer by just pushing all the weights to zero, which has empirically been shown to be much easier than finding the identity mapping. We are basically giving the network a shortcut that makes it possible to combine the power of a deeper network with the agility of a shallower network.

Figure 2.3: The structure of the ResBlock. The function f is split into a residual and an identity function. Only the residual function is propagated through the network, to later be added back together with the identity function.

2.2 Supervised learning

The framework that deep learning presents to us for finding a suitable candidate for Gθ is the training of an artificial neural network. The principle is to initiate a model with the architecture of a layered network with many free parameters that can be tuned to make it imitate G. This is usually done through a process known as forward and back propagation, where a set of data points with known mappings, {x, y} = {{x1, y1}, . . . , {xB, yB}} ∈ H × I s.t. ∀i ≤ B : xi ↦ yi under G, are presented to the model and its parameters are updated to reduce the prediction error in each time step:

θt+1 ← θt + B(θt, L(x, y)). (2.11)

Here B is a backpropagation algorithm that updates the model in a way that is likely to reduce its prediction error, e.g. by using stochastic gradient descent. As established earlier, some kind of metric is required for how good the model's current prediction is. This distance function, represented by L in equation 2.11, is in deep learning known as the energy function or the loss function.

2.2.1 Bias-variance tradeoff

One of the biggest dilemmas in supervised learning is what is known as the bias-variance tradeoff. The issue comes from the fact that we only use a small subset of all possible examples to fit a model that we want to generalise well to all data in the distribution [27]. This leads to an inevitable tradeoff between two different sources of error:

• The model bias measures the average difference between the model prediction and the target. A model with high bias cares little about the training data it is presented with and tries to oversimplify the problem. Therefore, models with high bias are often described as being underfitted.
• The model variance measures how much the model predictions move around their mean on average. A model with high variance pays a lot of attention to the specific training data it is presented with but does not generalise well outside this specific sample. A model with high variance is often described as overfitted to the training data.

Figure 2.4: The concepts of underfitting and overfitting a model to the data. The model in the middle has a good balance between capturing the main features of the data while at the same time being stable to noise, and therefore better approximates the true function (green).

The tradeoff is not only a conceptual construct to more easily describe model behaviour; it can be shown that the expected test loss of any model can be described in terms of its variance and bias errors in the following way [28]:

E_{x∈H}[(Gθ(x) − y)²] = Bias_{x∈H}[Gθ(x)]² + Var_{x∈H}[Gθ(x)] + ε². (2.12)

Here E_{x∈H}[(Gθ(x) − y)²] is the expected test Mean Squared Error (MSE). This refers to the value we would approach if we estimated Gθ based on a large number of training sets from the distribution H and averaged the squared distances from the model predictions to the targets of iid samples x, also in H.
Since the variance term is always non-negative and the bias term is squared, it is easy to see that E_{x∈H}[(Gθ(x) − y)²] ≥ ε², where ε is the irreducible error in the data. This is the so-called unexplained variance, also called the noise. Equation 2.12 also implies that it is impossible to escape this tradeoff: a model with zero variance will inevitably have unbounded bias and vice versa [29].

When trying to find the parameters of a model that minimise some objective function, we only use a small subset of the possible samples that could be in the distribution. If the model is trained until convergence, we therefore run a big risk of lowering the bias term too much at the expense of the variance. This effect is especially prominent for highly non-linear models with a large number of parameters, such as ANNs [30]. To find a reasonable balance between the two sources of error, we use a separate partition of the dataset — the test set — independent of the training set, to determine when the model is starting to become overfitted and to terminate the training at that point.

2.3 Semi-supervised learning

Semi-Supervised Learning (SSL) is a framework for machine learning where we use both labelled and unlabelled data to train a model. The primary assumption in SSL, used to justify the technique, is that a small amount of labelled data together with a bigger amount of unlabelled data can be used to create a stronger model than either of the two datasets on its own. This has also empirically been shown to be the case for many important problems [31]. For instance, it has been shown that such classification models perform better than models trained only on labelled data, and that joint training — where both labelled and unlabelled data are used simultaneously — is one of the most successful iterative approaches to semi-supervised learning [32].

Thanks to their high performance-to-cost ratio, semi-supervised learning models have risen in popularity over the last years, and many frameworks that use a combination of labelled and unlabelled data have been developed [31]. But how can we know for which problems we can hope for semi-supervised models to work? Or more precisely: if we compare an algorithm that only uses labelled data to one that has access to both labelled and unlabelled data, when is it reasonable to think that the combined model can make a more accurate prediction? In general, one could say that there are gains to be made if the knowledge of P(x̂) that one gains through the unlabelled data x̂ is useful in the inference of the conditional label distribution P(y | x). For this to be the case, some assumptions on the correlation between the labelled and unlabelled data distributions need to be fulfilled. All semi-supervised learning models make use of at least one of the following statements [33]:

• The semi-supervised smoothness assumption
If two points x1 ∈ x̂, x2 ∈ x in a high-density region are close, then so should their labels y1, y2 be. This is to say that the true mapping G is at least as smooth in areas where we have many observations as in regions where we have few or none. This implies that if a path of high density links two points, their outputs are likely to be close, but if a low-density region separates them, then their outputs can very well be quite different.

• The cluster assumption
The data tends to come in discrete clusters, and data within one cluster is likely to have similar labels.
If this is the case, then the unlabelled data points might help us to find the cluster boundaries more accurately. In the idealised case we just need one labelled point to tell us the flavour of the cluster, and we can then map out its outline by introducing more unlabelled points. Note that this assumption does not say that points from multiple clusters cannot have similar labels.

• The manifold assumption
The data points x ∈ Rn, x̂ ∈ Rn lie roughly on a manifold M of dimension k, where k << n. This is useful because of what is known as the curse of dimensionality: the volume grows exponentially with the number of dimensions of our data, and thus exponentially more data is required to reach the same sample density in a higher-dimensional space. However, if we can find a manifold of lower dimension that accurately portrays the structure of the data, we can operate in this subspace and partly avoid the problem.

For it to be reasonable to make any of these assumptions, we need to know that the probability distribution of our unlabelled data, P(x̂), is the marginal distribution of that of our labelled data, P(x). This means that x̂ and x must come from the same underlying distribution. This is not always possible to guarantee in practice, but even if it does not hold there are things that can be done if P(x) and P(x̂) share some similarities. For instance, we can use unsupervised domain adaptation, where a model is trained on labelled data from a different distribution than the one it will later be applied on [34].

2.3.1 Pseudolabelling

One of the most obvious, and therefore also earliest, ways of implementing semi-supervised learning is through so-called pseudolabelling or self-training. One of the most basic implementations of a pseudolabelling framework is a wrapping of the supervised learning algorithm. First we train our model on labelled data only, but for each epoch of the training we label a fraction of the unlabelled data points with the current model state and use these as training examples from that point on. When all unlabelled points have been given a label, we continue to train the model until convergence to reach our final model state [33]. The high-level idea of the framework was described already in the 60's [35], [36] but has been much refined and repackaged since then. Another way to perform self-training is to first train the model until convergence on the labelled data and then use this model to label all unlabelled examples at once [37]. The training is then continued with a certain fraction of pseudolabelled data until convergence is once again reached.

Figure 2.5: The self-training scheme described in [31].

2.4 Loss functions

In deep learning the loss function has two main purposes:

• To give a measure of how well a model is currently performing.
• To give a prediction of which direction to nudge its parameters in to most likely increase its performance.

Since we want the model to learn something from our labelled data, the loss function is in general a distance function that measures how close the model's prediction is to the ground truth, e.g. the data label. Depending on the type of information the network is trying to learn, different types of loss functions might be more or less suitable.
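Before going into specific loss functions, the self-training scheme of Section 2.3.1 (cf. Figure 2.5) can be sketched in code, since it is exactly here that labelled and pseudolabelled examples come to share one loss function. The model, confidence threshold and batch handling below are illustrative assumptions rather than the implementation used in this thesis, and the example is written for image-level classification for brevity; per-pixel segmentation follows the same pattern.

# A minimal sketch of the second self-training variant above: pseudolabel all
# unlabelled examples at once with a converged model, keep only confident ones,
# then continue training jointly on labelled and pseudolabelled batches.
import torch
from torch import nn

def pseudolabel(model: nn.Module, unlabelled_images: torch.Tensor, threshold: float = 0.9):
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabelled_images), dim=1)  # per-class probabilities
        confidence, labels = probs.max(dim=1)
        keep = confidence > threshold                           # simple quality proxy
    return unlabelled_images[keep], labels[keep]

def joint_update(model, opt, labelled_batch, pseudo_batch, loss_fn=nn.CrossEntropyLoss()):
    model.train()
    (x_l, y_l), (x_u, y_u) = labelled_batch, pseudo_batch
    # The pseudolabels are treated as ordinary targets in the same loss function.
    loss = loss_fn(model(x_l), y_l) + loss_fn(model(x_u), y_u)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()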
2.4 Loss functions

In deep learning the loss function has two main purposes:
• To give a measure of how well a model is currently performing.
• To give an indication of in which direction to nudge the model parameters to most likely increase its performance.

Since we want the model to learn something from our labelled data, the loss function is in general a distance function that measures how close the model's prediction is to the ground truth, i.e. the data label. Depending on the type of information the network is trying to learn, different types of loss functions may be more or less suitable.

2.4.1 Loss functions and probability transformations

For semantic segmentation, a common choice of loss function is the cross entropy loss, which gives a measure of the difference between the model's probability prediction for each class and the true class, summed over all pixels. Cross entropy loss uses a probabilistic scheme to determine the distance between the true and predicted label for a data point. Since z := G_θ(x) is not necessarily a per-pixel probability distribution over all classes, we may have to transform the output into something that can be interpreted as such, e.g. by using the Softmax function,

\mathrm{Softmax}(z_i) := \frac{e^{z_i}}{\sum_{j \in \mathcal{C}} e^{z_j}}, \qquad q(x) := \big(\mathrm{Softmax}(z_1), \ldots, \mathrm{Softmax}(z_C)\big)^\top,    (2.13)

where C := {1, . . . , C} is the sequence between 1 and the number of classes, denoted by C. Using this notion of q, the cross entropy loss can be written as

L_{CE}(x, y) = -\sum_{p \in P} \sum_{i \in \mathcal{C}} y_{i,p} \log q(x)_{i,p},    (2.14)

where P is the set of pixels in an image, y_{·,p} is the one-hot representation of the true class of pixel p (a C-vector with all zeros except for a single 1 in the position of the true class) and q(x)_{i,p} is the model's predicted probability that x_p belongs to class C_i. As mentioned, a convenient property of the Softmax transformation is that the class probabilities sum to 1 for each pixel in the image and can hence be interpreted as a probability distribution. The transformation can however suffer from numerical issues and is therefore advantageously combined with a logarithmic transformation, as in the case of the cross entropy loss. In cases where no logarithmic transformation is performed, the sigmoid function may be a more suitable choice of transformation. The sigmoid function has a similar interpretation as the Softmax, with the main difference being that it treats the probability of each class as independent of the other classes [38],

\mathrm{Sigmoid}(z_i) = \frac{1}{1 + e^{-z_i}},    (2.15)

still with z = G_θ(x) defined as the raw model output.

When we consider the task of object detection, a common loss measure is the intersection over union,

L_{IoU} = \frac{|A \cap B|}{|A \cup B|},    (2.16)

for a given prediction box A and the true bounding box B [39]. This base formula can be extended in many clever ways to account for multiple predictions and classes. However, for it to be stable it requires the objects to be detected to be large enough that it is reasonable to treat the measure as a continuous function. If we instead want to find points of interest in an image, a regression loss function such as the mean squared error,

L_{MSE}(x, y) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i \in \mathcal{C}} (z_{i,n} - y_{i,n})^2,    (2.17)

is usually a better choice. Here z_{i,n} is the location of the model's prediction of the nth occurrence of a point of the ith class. A more in-depth discussion of how different loss functions can be combined follows in the next section.
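To make the relation between Equations 2.13 and 2.14 concrete, the following minimal PyTorch sketch computes the pixel-wise cross entropy loss both from an explicit Softmax and with the numerically stable built-in that fuses the logarithm and the Softmax. The tensor shapes are illustrative assumptions, not the ones used in the project.

import torch
import torch.nn.functional as F

# Raw network output z = G_theta(x): class scores per pixel, shape (B, C, H, W).
B, C, H, W = 2, 12, 64, 64
z = torch.randn(B, C, H, W)
target = torch.randint(0, C, (B, H, W))     # true class index for every pixel

# Softmax turns the scores into per-pixel class probabilities (Equation 2.13).
q = F.softmax(z, dim=1)                     # sums to 1 over the class dimension

# Cross entropy loss (Equation 2.14): explicit log of the Softmax versus the
# numerically stable fused version that PyTorch provides.
loss_manual = F.nll_loss(torch.log(q), target)
loss_stable = F.cross_entropy(z, target)
assert torch.allclose(loss_manual, loss_stable, atol=1e-5)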
2.4.2 Multi-objective loss and relative loss weighting

In some settings we want a model that can make different kinds of predictions on a single data point. The conventional way to do this is to train multiple separate models that each perform part of the full task, but it can also be done by utilising a so-called multi-task model. Baxter et al. showed in the early 2000s that this approach can increase efficiency and learning accuracy for each task [40]. Simply speaking, the way this is thought to work is that inductive knowledge transfer between complementary tasks can improve the generalisation capabilities of a model and therefore result in more stable and reliable results. However, this approach comes with a cost. It requires the model's total loss to be treated as a sum of multiple individual losses corresponding to the different learning objectives. This raises the question of how the different terms of the loss function should be weighted against each other. The performance of each task is in one sense arbitrary, since the objectives of the model can have different scales and units, but it is often desirable to have a model that at least prioritises improving on all tasks equally. Tuning such a hyperparameter manually can be a tedious and time-consuming task, and it has to be redone for each model that we want to train. Kendall et al. [41] showed that the relative weighting of the losses can be considered an implicit learning goal of the model and can therefore be learned automatically through the training.

The proposed method is based on looking at the homoscedastic uncertainty of each task of the model. Homoscedastic uncertainty can be defined as the intrinsic uncertainty of the model, i.e. the part that does not depend on how well-trained the model is, but rather on the insufficiency of the data that the model has been presented with. For regression, if we assume identical observation noise for each data point x in the batch, we can write

y \sim \mathcal{N}\big(G_\theta(x), \sigma^2 I\big),    (2.18)

where y is the batch model output, I the identity matrix and σ the noise scalar of the model. From this we can see that we use the assumption that the model predictions have the same variance and no covariance. For classification we instead have, under the same assumption, that

y \sim \mathrm{Softmax}\Big(\frac{1}{\sigma^2} G_\theta(x)\Big).    (2.19)

Using this we can calculate the joint probability distribution of multiple outputs as

p(y_1, \ldots, y_n \mid G_\theta(x)) = p(y_1 \mid G_\theta(x)) \cdots p(y_n \mid G_\theta(x)).    (2.20)

This means that we can use maximum likelihood inference for the terms on the right-hand side. For the regression and Softmax outputs respectively, what we finally arrive at is the joint objective

\min_{\theta, \sigma_1, \sigma_2} \mathcal{L} = \frac{1}{2\sigma_1^2} \mathcal{L}_1 + \frac{1}{\sigma_2^2} \mathcal{L}_2 + \log \sigma_1 + \log \sigma_2.    (2.21)

The first term of this equation,

\mathcal{L}_1 = \lVert y_1 - G_\theta(x) \rVert^2,    (2.22)

is for the regression labels y_1, and the second term,

\mathcal{L}_2 = -y_2 \log \mathrm{Softmax}\big(G_\theta(x)\big),    (2.23)

is the cross entropy loss of the classification outputs with classification labels y_2. We can now optimise L with respect to all the model parameters θ, σ_1 and σ_2. This can be seen as the combined loss function giving the model a way of learning the relative weights of the losses for each output. If the value of e.g. σ_2 is small, it will increase the contribution of L_2, whereas a large value will decrease its contribution. The objective is regularised by the last two terms, which penalise large values of σ. More details and the entire derivation of Equation 2.21 can be found in the paper by Kendall et al. [41].
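A minimal PyTorch sketch of how the weighting in Equation 2.21 can be realised in practice is given below. The module learns log σ² for each task, a common reparameterisation for numerical stability; the class and variable names are our own and are not taken from [41] or from the project code. The two extra parameters are simply added to the optimiser together with the network parameters θ.

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # Combine a regression loss L1 and a classification loss L2 with learned
    # homoscedastic uncertainties, following the form of Equation 2.21.

    def __init__(self):
        super().__init__()
        # Learn s_i = log(sigma_i^2) instead of sigma_i directly.
        self.log_var_reg = nn.Parameter(torch.zeros(()))
        self.log_var_cls = nn.Parameter(torch.zeros(()))

    def forward(self, loss_reg, loss_cls):
        weighted = (0.5 * torch.exp(-self.log_var_reg) * loss_reg
                    + torch.exp(-self.log_var_cls) * loss_cls)
        # log(sigma_1) + log(sigma_2) = 0.5 * (s_1 + s_2) penalises large uncertainties.
        regulariser = 0.5 * (self.log_var_reg + self.log_var_cls)
        return weighted + regulariser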
2.5 Consistency regularisation

2.5.1 Vicinal Risk Minimisation

The task of labelling large datasets is very labour-intensive, especially when it comes to rich information extraction such as bounding-box annotation of objects or pixel-wise segmentation. Introducing augmentations to increase both the amount and the diversity of the data has been shown to be efficient [42], [43] and is today seen as a standard procedure in all of machine learning. Data augmentation has traditionally been viewed in theoretical statistics as a method of Vicinal Risk Minimisation (VRM) [44]. The reasoning can be understood by first defining the learning problem as the search for a θ that minimises the expected loss. We can write this as a risk function

R(G_\theta) = \int L(G_\theta(x), y)\, dP(x, y),    (2.24)

where our objective is to find θ̂ := argmin_θ R(G_θ) and P(x, y) is the probability density function over all possible source-target pairs. The problem here is that we cannot know what the true distribution P(x, y) is, since we do not have all of that data in our dataset. But given a dataset {x, y} = {{x_1, . . . , x_B}, {y_1, . . . , y_B}} we can still estimate an empirical risk function,

R_{\mathrm{emp}}(G_\theta) = \frac{1}{n} \sum_{i=1}^{n} L(G_\theta(x_i), y_i) \propto \int L(G_\theta(x), y)\, \delta_{x_i}(x)\, dx,    (2.25)

where the delta function is

\delta_{x_i}(x) = \begin{cases} 1 & \text{if } x_i \in \mathbf{x} \\ 0 & \text{otherwise.} \end{cases}    (2.26)

(Note that only the delta function is written as explicitly dependent on x. This is because {x_i, y_i} is ordered in source-target pairs, so adding the y-dependent part would not change the value of Equation 2.27.)

The VRM framework is built on the assumption that we can perform some random modifications to the data and still retain the overall structure and its semantic information to a high degree. This means that we can include modified samples in our training dataset with similar (or even identical) labels as the samples that they are derived from, to get a better approximation of P(x, y). This is done by exchanging δ_{x_i} for some estimate of the density in the vicinity of x_i, say P_{x_i}(x), to get the vicinal risk function

R_{\mathrm{vic}}(G_\theta) = \int L(G_\theta(x), y)\, P_{x_i}(x)\, dx.    (2.27)

By using data augmentation the model's performance is expected to increase, since it becomes less sensitive to overfitting [45].

2.5.2 Geometric transformation consistency regularisation

The VRM framework is in its simplicity very effective for e.g. image classification, since the class of an image is unchanged after an augmentation and its original label can therefore be used. However, for the segmentation task the unchanged-label assumption cannot be made with the same confidence. Mustafa et al. presented in their 2020 paper a consistency regularisation scheme that partly solves this problem by applying reversible transformations to both sources and targets of the training data [46]. By doing this, we can get more than one training example with a perfect label for each of the data points in the training set. The loss function for this training scheme can be written as a sum of a supervised and an unsupervised term,

\mathcal{L} = \mathcal{L}_s(x, y) + \lambda\big(\mathcal{L}_{us}(\hat{x}) + \mathcal{L}_{us}(x)\big),    (2.28)

where the two loss terms are defined as

\mathcal{L}_s(x, y) = \frac{1}{B} \sum_{i=1}^{B} \lVert G_\theta(x_i) - y_i \rVert_2^2, \qquad \mathcal{L}_{us}(\hat{x}) = \frac{1}{rB} \sum_{i=1}^{rB} \Bigg( \frac{1}{M} \sum_{m=1}^{M} \lVert T_m(G_\theta(\hat{x}_i)) - G_\theta(T_m(\hat{x}_i)) \rVert_2^2 \Bigg),    (2.29)

with B the supervised batch size, r the ratio of unsupervised to supervised samples in a training batch, {T_1, . . . , T_M} the set of transformations applied to each data point in the unsupervised training set, and λ the supervised-to-unsupervised weighting parameter. As can be seen in Equation 2.29, the unsupervised part of the loss function penalises the model for not making consistent predictions for transformed variations of the same image. Mustafa et al. [46] showed that this technique can indeed be used to improve model performance.
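As an illustration, the sketch below implements the unsupervised term of Equation 2.29 for the special case where the transformations T_m are 90° rotations, up to normalisation constants. Here model is a placeholder for G_θ and is assumed to return an output with the same spatial dimensions as its input.

import torch

def consistency_loss(model, images):
    # Unsupervised term of Equation 2.29 with T_m = rotation by m * 90 degrees.
    # Penalises inconsistent predictions on transformed copies of the same image.
    base = model(images)                                   # G_theta(x_hat), (B, C, H, W)
    rotations = (1, 2, 3)                                  # number of 90-degree turns
    loss = 0.0
    for m in rotations:
        rotated_pred = torch.rot90(base, k=m, dims=(2, 3))                 # T_m(G_theta(x_hat))
        pred_of_rotated = model(torch.rot90(images, k=m, dims=(2, 3)))     # G_theta(T_m(x_hat))
        loss = loss + ((rotated_pred - pred_of_rotated) ** 2).mean()
    return loss / len(rotations)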
3 Method

This section describes the preliminary experiments that led to our chosen training scheme, which we call PixMax. We also describe the data sources and the data splits that were used in the training, as well as the model, our proposed model training scheme and the metrics we have used to evaluate our models.

3.1 Data

This project investigates how a large amount of unannotated data can be used together with a smaller amount of annotated data, through the semi-supervised training framework, to create a better model than what could be achieved with the annotated data alone. Previous work has investigated how the performance increase depends on the fraction of unlabelled examples introduced when training the model. It has been shown that significantly better results can be reached by adding a large fraction of unlabelled examples, especially in the sparse setting [47], [48].

Since no really large dataset with both annotated and unannotated floor plan images currently exists, we have chosen to compose our data from two different sources. Both datasets contain floor plans, but they have their similarities and their differences, and the assumption that all data comes from the same distribution cannot necessarily be made. The following sections explain where the data comes from and how the discrepancies in the data are dealt with.

3.1.1 Annotated data - The CubiCasa5k dataset

The main source of annotated data in this project is a novel dataset called CubiCasa5k. It was first introduced by Kalervo et al. in their 2019 paper [1]. The dataset contains 5.000 labelled raster images fetched from image scans, divided into three categories based on their visual style, as can be seen in Figure 3.1. The annotation of the dataset is rich in terms of precision and the amount of information contained in each label. Altogether, there are around 80 different object types, and room labels are represented by polygons, in contrast to earlier datasets where rectangles are often used for simplicity [49].

The labels of the data have an intricate structure with three primary types of labels: junctions, rooms and icons. The junctions are pixel-accurate locations for interest points of the types shown in Figure 3.2. The rooms and icons are each represented by a pixel-wise segmentation map of room and icon classes.

Figure 3.1: Examples of the visual style of the images of the three categories in the CubiCasa5k dataset with their respective labels above. The images are scaled to fit the page format.

Figure 3.2: A visual representation of all the different annotation categories of the CubiCasa5k dataset. Junctions, openings and corners are lists of coordinates while rooms and icon categories are pixel-wise segmentation maps over the image.

Table 3.1: The structure of the output layers of the network.
Output maps (44 in total)
  Interest points (heatmaps): 21
    Wall corners: 13
    Opening endpoints: 4
    Icon corners: 4
  Segmentation maps: 23
    Room classes: 12
    Icon classes: 11

3.1.2 Unannotated data - The LIFULL HOME'S dataset

The LIFULL HOME'S dataset [11] is the biggest collection of floor plan image data available for research today. The dataset consists of about 5.31 million images in .jpg format. The images have a high variance in size, colour and quality, since the data has been collected over a period of time from multiple sources all over Japan. The architectural qualities of the floor plans, as well as many of the graphical representations used to convey information, are quite different from the western style used in the CubiCasa5k dataset.

Figure 3.3: Examples of the visual style of the images in the LIFULL HOME'S dataset. The images are scaled to fit the page format.
Figure 3.4: The resolution distributions of the different datasets used in the project. A simple random sample of 4000 image instances from each set was used.

The CubiCasa5k dataset is split into a training, a validation and a test set. The training set consists of 4200 data points, while the validation and test sets hold 400 images each. Each of the slices is assigned equal proportions of images from each of the visual style categories in order not to introduce any unwanted bias, see Figure 3.1. For convenience we chose to use this predefined partitioning. The LIFULL HOME'S dataset we had to slice ourselves, and we also had to manually annotate a small portion of it for testing purposes; how this was done is described in Section 3.3.3.

For the validation slice, used to determine when to terminate training and to provide the optimisation procedure with a time series of the model performance evolution during training, we concluded that we had three options:
• Using pseudolabels for the validation set.
• Manually labelling a big enough portion of the LIFULL HOME'S dataset to be used as a validation set.
• Using the validation slice of the CubiCasa5k dataset for all the experiments.

We judged that using pseudolabels as annotation in the validation set would be too unreliable, and that labelling a big enough portion of the LIFULL HOME'S dataset would be too time-consuming and outside the scope of this project. We therefore decided to use the validation slice of the CubiCasa5k dataset for validation in all our preliminary and final experiments.

3.2 Pseudolabelling

Our proposed method utilises semi-supervised learning through self-training by letting a model trained on labelled data make predictions on unlabelled data, and then using these predictions as a basis for creating pseudolabels for further training.

When run on an image, the model outputs its predictions as sets of pixel-wise feature maps that each correspond to either rooms, icons or interest points. In other words, we get a measure of the model prediction on each pixel for every class, but we want the labels to be in the same format as our pre-labelled examples. This leads us to the non-trivial task of picking a way of creating pseudolabels from the model output. Moreover, if we cannot make any of the assumptions stated in Section 2.3, we at least need to perform some kind of enhancement of the information in the model output when we create the pseudolabel, for there to be any reason to believe that the model will perform better after the continued training than after just being trained on the labelled dataset. If x, y is our labelled dataset and x̂ is our unlabelled data with corresponding pseudolabels ŷ, we can express this as

\min_{\hat\theta} L\big(G_{\hat\theta}(x), y \mid x \in \mathbf{x},\ y \in \mathbf{y}\big) \overset{?}{=} \min_{\hat\theta} L\big(G_{\hat\theta}(x), y \mid x \in \mathbf{x} \cup \hat{\mathbf{x}},\ y \in \mathbf{y} \cup \hat{\mathbf{y}}\big),    (3.1)

where θ are the parameters of the model after it has been trained on labelled data only. This is simply to say that if our unlabelled data does not tell us something about the true distribution of x, nothing new can be learnt by sole extrapolation of what is already known. The following sections explain the approaches we tested for finding a suitable way of creating enhanced pseudolabels that are better than the raw network output.

3.2.1 Statistical approach

The most obvious way of picking pseudolabels would be to use some probability measure on the model output to decide what information to keep.
The idea would be that predictions with a high likelihood of being correct are kept, while those with a high prediction uncertainty are discarded. This could, for instance, be done in the following way:
• For the heatmap classes (the corner classes): from all prediction instances, keep only those with a higher probability of being correct than a certain threshold.
• For the one-hot encoded classes (the room and icon classes, where each pixel must belong to one and only one class): for each pixel, pick the class that is most likely to be the true class.

The use of this approach can be motivated by the fact that if we select what information to include in our pseudolabels based on the probability of the information being correct, we can expect the information in our pseudolabels to be correct with that same probability on average. We can express this as

\mathbb{E}_{x \sim P_T(x)}[T(x)] = \int T(x)\, dP_T(x),    (3.2)

where P_T(x) is the probability density function of any finite statistic T(x) of the data x. In our case, T(x) would be the correctness of the model prediction with respect to the ground-truth label. Here, T(x) can be interpreted as what is known as a conformal predictor [50]. This means that by utilising this metric we could control the quality of our pseudolabels by requiring a higher or lower probability for the included information to be correct. For instance, we could require a confidence level α ∈ [0, 1) for all information we choose to include and get

\mathbb{E}\big[T(x) \mid P_T(x) > \alpha\big]\,\mathbb{E}\big[\llbracket P_T(x) > \alpha \rrbracket\big] \geq \frac{1}{\int_\alpha^1 dP_T(x)}\,\mathbb{E}[T(x)],    (3.3)

where ⟦·⟧ is the generalised Kronecker delta function:

\llbracket P \rrbracket = \begin{cases} 1 & \text{if } P \text{ is true} \\ 0 & \text{otherwise.} \end{cases}    (3.4)

However, to use this approach we need a good measure of how confident we can be that a prediction is correct. Since the model output for the one-hot classes is in the form of scores for all classes at each pixel, it is a reasonable assumption that there could be a correlation between the score of a class and the probability that the prediction of that class is correct. We can also transform the model output into something with the same form as a discrete probability distribution by passing it through the Softmax function, described in Section 2.4.1.

Figure 3.5: An example of what a correlation between the prediction certainty and correctness could look like. The pixels that the network is most sure about are to a high extent also the pixels that are classified correctly.

To confirm that the Softmax of the model output is indeed a good proxy for P_T(x), we had to run a few experiments. For the room and icon classification, this would mean that pixels with a high prediction certainty are classified correctly more often than those with a low prediction certainty. To see if this was the case, we used the absolute loss function

L_{ABS}(x, y) = \frac{1}{|P|} \sum_{p \in P} |y_p - x_p|_1    (3.5)

as our statistic T. This is the normalised pixel-wise distance function that returns the Manhattan distance between the network prediction x and the correct label y. Under the assumption of the described scenario, we would expect to see

\frac{\partial}{\partial \alpha} \mathbb{E}_{P_T(x) > \alpha}\big[L_{ABS}(x, y)\big] \leq 0 \quad \forall \alpha \in (0, 1).    (3.6)

This is to say that we expect the loss to be lower, and hence the quality of the information to be higher, when we are more conservative with what information to include with respect to the certainty of the prediction. An easy way to check whether this assumption holds is to use what is known as a calibration plot [50], [51], where the correctness is plotted as a function of the model certainty.
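As an illustration of how such a calibration check can be computed for the one-hot (room and icon) channels, the sketch below bins the pixels by their Softmax confidence and reports the accuracy within each bin. In our experiments the statistic was L_ABS, but the principle is the same; tensor shapes and names are illustrative assumptions.

import torch

def calibration_curve(probs, targets, n_bins=10):
    # probs: (B, C, H, W) per-pixel class probabilities, targets: (B, H, W) class ids.
    confidence, prediction = probs.max(dim=1)          # model certainty and predicted class
    correct = (prediction == targets).float().flatten()
    confidence = confidence.flatten()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    accuracies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence >= lo) & (confidence < hi)
        accuracies.append(correct[mask].mean().item() if mask.any() else float("nan"))
    # Plotting accuracies against the bin centres gives the calibration plot.
    return edges, accuracies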
The results of these experiments can be found in Section 4.1.1. They showed that the correlation was too weak to be useful for our purpose.

3.2.2 Post-processing technique

Another possible approach to the problem of creating a pseudolabel from the network's raw output is to run it through some kind of function that can enhance its quality. If we call this function q, this would mean that we could expect to see

L(G_\theta(x), y) \geq L\big(q(G_\theta(x)), y\big)    (3.7)

for some arbitrary loss function L that in some sense measures the quality of the prediction. Although the existence of such a function q does not seem too unreasonable, we need to address the question of why this transformation is not implicitly learned by the network through a different parameterisation θ̂, giving G_θ̂(x) = q(G_θ(x)), if it reduces the loss of the model. One possible explanation for why a function with these properties could exist is that it has access to different information than the model itself, which can be used to enhance its performance.

In the model that is used for this project, we can do exactly this. The model output is naturally divided into three distinct categories of predictions: pixel-wise room and icon segmentation maps, and heatmaps of junctions. This means that we can potentially use the junction heatmaps to infer better room and icon segmentations. If we for convenience define the network output z := G_θ(x) so that z_h represents the heatmap channels and z_{r,i} the room and icon channels, we can reformulate our criterion on a sufficient function q in Equation 3.7 to be explicitly dependent on z_h,

L_{r,i}(z_{r,i}) \geq L_{r,i}\big(q(z_{r,i} \mid z_h)\big),    (3.8)

where L_{r,i} is some loss function that only considers the room and icon channels. With this reasoning we can conjecture that a function with the properties of q may exist, but that is in no way a guarantee of its existence, and we have no general algorithm for finding it. In our setting, where we are concerned with the parsing of floor plans, there are however a few empirical observations that can be used to guide our search for a sufficient q. Some of these are the following:
• Rooms are made up of simple polygons with orthogonal corners.
• Icons are rectangles.
• Every region completely surrounded by walls and apertures contains only one room type.
• The region outside the outermost closed wall-loop is the background.

The hope is that we can use these observations, in combination with the output from the heatmaps, to create a prediction that is better than the raw network output in the sense that it gives a substantially lower loss for some loss function L. Ideally this should also be the case for the cross entropy loss L_CE, since this is what we use to train the network.

Kalervo et al. [1] proposed in their work a novel post-processing algorithm with the structure of q, which we will call q_cc. It is a procedural algorithm that aims to extract all elements of interest in the floor plan, including walls, rooms, openings and icons. It can be understood as four distinct steps executed in sequence:
• Inferring the wall skeleton. The algorithm starts by connecting pairs of junctions based on their position, type and orientation. This means that if two junctions are close to being vertically or horizontally aligned and have a joining direction facing each other, they are connected by a line.
The junctions are also batched together, so that multiple neighbouring junction points of the same type are mapped to a single point to avoid crowding. The result of this step is a "skeleton" of possible wall centre lines.
• Inferring walls. The wall skeleton is used together with the wall segmentation map to construct the final wall prediction. First the skeleton is pruned by removing lines that are not consistent with the wall segmentation, and the wall thickness is then decided based on the intensity profile of the wall segmentation map.
• Inferring rooms. Next, the processed room segmentations are calculated based on the outcome of the previous two steps. The algorithm searches for all junction triplets that span a rectangle without any junctions inside it, to create a grid over the interior of the floor plan. For each of the grid cells, a voting mechanism samples from the pixel predictions of the room segmentation maps to decide which room type to assign to the cell. Adjacent cells of the same class are merged if and only if there are no fully separating walls between them. The same mechanism is used to find the icons, but there the icon heatmaps and segmentations are used.
• Inferring apertures. The last step is to find the doors and windows by utilising the corresponding endpoint heatmaps. First, all points that do not coincide with wall segments in the processed wall segmentation map are discarded. The remaining points are then matched into window and door segments, and the width of each aperture is chosen to be the same as that of the host wall where it is located.

Figure 3.6: Examples of what q_cc does with the segmentation maps from the room channel. The examples are randomly sampled from the test set of the CubiCasa5k dataset. The most visually prominent change is that all segmentations have been translated into simple polygons.

Figure 3.7: The concept of how the post-processing algorithm q_cc works. Given the predicted junction heatmaps and the room and icon segmentations, q_cc can "clean up" the segmentations, e.g. by inferring a closed room between four suitable L-type corners.

As previously mentioned, we want the choice of q to give us better performance than the raw network output. To confirm that this can be expected with q_cc, we measure the loss before and after applying q_cc to the network prediction. The results from these experiments can be found in Section 4.1.2.

3.2.3 PixMax pseudolabelling technique

Similar to the method proposed in Section 3.2.2, we can choose an operator q that maps an input image to a batch of several different non-destructive rotations and flips of that image. The batch b := G_θ(q(x̂)) of B predictions can then be oriented back and combined to form our pseudolabel ŷ = q̂(b) for the given image in the unlabelled dataset. We use all unique orientations obtainable through 90° rotations and flips, which gives B = 8 combinations. Here, q̂(b) is set to the mode function, pixel-wise selecting the most common value in the batch b of predictions.

Also, instead of generating pseudolabels for all images in the unannotated dataset, we can try to find a scalar measurement of how accurate a prediction is. As a proxy for accuracy, we implement a function that calculates a scalar β expressing how confident the network is in its prediction on an image. We can then further improve the quality of the pseudolabels by only using the labels that satisfy β(x) > τ for a threshold hyperparameter τ.
Equations 3.9 and 3.10 describe this (denoted CONF(·) in Algorithm 1). Here c_{mc,j,i} is the value of the most common class at pixel (i, j) of the batch b, and c_{b,j,i} the value of the bth prediction; ⟦·⟧ is the generalised Kronecker delta function (Equation 3.4):

\beta_{j,i} = \frac{1}{B} \sum_{b \in B} \llbracket c_{b,j,i} = c_{mc,j,i} \rrbracket,    (3.9)

\bar{\beta} = \frac{1}{HW} \sum_{j,i} \beta_{j,i}.    (3.10)

The initial results in Section 4.1.3 were promising, and this method was chosen for the PixMax self-training.

3.3 PixMax self-training

Once we have found a way of generating pseudolabels of good quality, we can use it in a self-training scheme to further improve a supervised model. As can be seen in Figure 3.8, the PixMax model training scheme consists of three phases: a supervised learning phase, a pseudolabelling phase and a semi-supervised self-training phase. In the supervised training phase, a model is trained until convergence in a purely supervised manner. We then use this model to create pseudolabels for an unlabelled dataset with features similar to the original dataset. We do this by combining predictions for multiple light augmentations of each image into a single label, as described in Section 3.2.3. If the label passes the β-threshold check, it is accepted and will be used as a pseudolabel in the next phase of the model training. In the last phase, the trained model's weights and biases are copied but new hyperparameters are initialised. The model is once again trained until convergence, now with a combined dataset consisting of both the original labelled dataset and the accepted portion of the pseudolabelled dataset.

The following sections describe how we implement the network model, what augmentations we apply to diversify the data, and the details of how we evaluate the performance of the trained models.

Figure 3.8: The PixMax model training scheme. Light blue: the labels for the labelled dataset. Dark blue: the images and predictions for the labelled dataset. Dark green: the images and model predictions for the images in the unlabelled dataset. Light green: the pseudolabels created by the model in the pseudolabelling phase.

3.3.1 Network model

The model used in [2] has achieved state-of-the-art results, and we have therefore chosen it as our model. The model converts the floor plan image through two intermediate representation layers. The first step is the network inference step, which outputs 44 maps of interest points and pixel-wise semantics. Second, these are converted through integer programming (IP) to form a set of geometric primitives. Note that the sole purpose of the interest points is to provide points from which the geometric primitives are constructed. Finally, a post-processing step is applied to form the vectorized output of geometries with class labels. The architecture of the model is borrowed from the ResNet-152 [12] model with an altered output layer, as shown in Table 3.1. Figure 3.9 shows the high-level structure of the model.

Algorithm 1: PixMax pseudolabelling scheme
Input: Set of unlabelled images x̂ = {x̂_i : i ∈ {1, . . . , M}}. Pre-trained model G_θ with parameters θ.
Parameters: Threshold τ for the β metric.
for x̂_i ∈ x̂ do
    b ← G_θ(q(x̂_i))           (batch of predictions)
    ŷ_i ← q̂(b)                (inverse augmentation)
    ŷ_mcp ← MODE(b)            (most common prediction)
    β_i ← CONF(ŷ_mcp, ŷ_i)     (calculate confidence)
end for
Output: Images with pseudolabels {(x̂_i, ŷ_i) : i ≤ M, β_i ≥ τ}.

Figure 3.9: A simplified illustration of the architecture of the model used. Image from [1].
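The sketch below is a simplified Python/PyTorch rendering of Algorithm 1 for a single image and a single segmentation head; the real model has separate room, icon and heatmap channels, which are handled analogously. Function names and the default threshold value are illustrative and not taken from the project code.

import torch

def orientations(image):
    # All 8 unique orientations: 4 rotations times an optional horizontal flip.
    for flip in (False, True):
        base = torch.flip(image, dims=(-1,)) if flip else image
        for k in range(4):
            yield k, flip, torch.rot90(base, k=k, dims=(-2, -1))

def invert(prediction, k, flip):
    # Map a prediction made on a transformed image back to the original frame.
    out = torch.rot90(prediction, k=-k, dims=(-2, -1))
    return torch.flip(out, dims=(-1,)) if flip else out

def pixmax_pseudolabel(model, image, tau=0.97):
    # Predict on all orientations, take the pixel-wise mode as the pseudolabel and
    # accept it only if the mean agreement (Equation 3.10) exceeds the threshold tau.
    model.eval()
    with torch.no_grad():
        preds = [invert(model(aug.unsqueeze(0)).argmax(dim=1)[0], k, flip)
                 for k, flip, aug in orientations(image)]
    batch = torch.stack(preds)                             # (B=8, H, W) class predictions
    pseudolabel, _ = batch.mode(dim=0)                     # pixel-wise most common class
    agreement = (batch == pseudolabel).float().mean(dim=0) # beta_{j,i}, Equation 3.9
    beta = agreement.mean().item()                         # beta-bar, Equation 3.10
    return (pseudolabel, beta) if beta >= tau else (None, beta)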
3.3.2 Data diversity augmentations

Computer vision tasks benefit greatly from having augmentations applied to the data, since this generates a more diverse dataset from which the model can better generalise. Hence, we have used a few common augmentations to vary the data during training; a minimal sketch of the image-side transformations is given after the list below. All images are either cropped or resized to a target size. This speeds up the training of the network and also allows the network to learn features from different scales of floor plan drawings.
• Resize with padding. Resizing the image to the target size, keeping the aspect ratio intact by padding the needed pixels with zeros (black pixels and no label).
• Crop to size. Cropping out a part of the image and label to get a smaller piece for the training.
• Rotations. Random 90° rotations of the image and label.
• Colour adjustments. Weakly adjusting the brightness, contrast and saturation of the image.
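A minimal sketch of how these transformations can be implemented with a recent torchvision is shown below. The target size and jitter strengths are illustrative assumptions and not the values used in the project, and in the real pipeline every geometric transformation is applied jointly to the image and its label maps.

import torch
import torchvision.transforms.functional as TF

def resize_with_padding(image, target=512):
    # Resize keeping the aspect ratio, then zero-pad the right/bottom to a square.
    _, h, w = image.shape
    scale = target / max(h, w)
    image = TF.resize(image, [int(round(h * scale)), int(round(w * scale))])
    pad_w, pad_h = target - image.shape[-1], target - image.shape[-2]
    return TF.pad(image, [0, 0, pad_w, pad_h])    # [left, top, right, bottom]

def random_augment(image, label):
    # Joint image/label augmentation: random 90-degree rotation plus weak colour jitter.
    k = int(torch.randint(0, 4, (1,)))
    image = torch.rot90(image, k=k, dims=(-2, -1))
    label = torch.rot90(label, k=k, dims=(-2, -1))
    image = TF.adjust_brightness(image, 0.9 + 0.2 * torch.rand(1).item())
    image = TF.adjust_contrast(image, 0.9 + 0.2 * torch.rand(1).item())
    return image, label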
3.3.3 Evaluation datasets

A model that has been trained on the CubiCasa5k [1] dataset in an exclusively supervised fashion is used as the primary benchmark for our semi-supervised model. It is reasonable to expect that this model will have a positive bias in its performance on the CubiCasa5k dataset relative to our semi-supervised model. To get a reliable measure and a fair comparison of the performance of our model, we therefore needed to be able to evaluate models on both the labelled and the unlabelled datasets.

For our unlabelled dataset, we consequently needed to create enough labels to be able to evaluate the models' performance ourselves. For this task we used an online image-annotation tool called Labelbox (https://labelbox.com/) that supports polygon segmentation of images. We selected a random subset of 50 images from the LIFULL HOME'S dataset and manually annotated them with the same set of room and icon classes as the CubiCasa5k dataset. To not introduce unnecessary bias in the results, we tried to follow the same annotation style and conventions as far as possible. All our annotations are publicly available at github.com/xRadne/LH_annotations.

3.3.4 Evaluation metrics

We have chosen a set of performance metrics to evaluate our model in accordance with common practice for semantic segmentation [52]; a computational sketch of the metrics is given at the end of this section. Although our model predicts both room and icon segmentation maps as well as heatmaps of interest points, we have chosen not to evaluate the model's performance on the heatmaps, for several reasons. Annotating interest points is a very time-consuming task, and using a very small test set would result in wide confidence intervals that do not provide much information about the model's actual quality. Also, this information is captured indirectly by the segmentation evaluations of the polygonised (post-processed) predictions, since these are constructed using the interest points (Section 3.2.2). Moreover, to convert the heatmaps to discrete point locations a threshold value has to be chosen, which makes the recall and accuracy values somewhat arbitrary. (For the evaluation to be fair, we would need a metric based on the distance between each predicted point and the closest ground-truth point of the same category, which would require converting the heatmaps to a boolean map for each layer through the use of a threshold function.)

With C classes and n_{i,j} the number of pixels of class i predicted to be of class j, our chosen metrics can be described in the following way:

• Overall accuracy - the ratio of correctly classified pixels.

\mathrm{Overall\ Acc} = \frac{\sum_i n_{i,i}}{\sum_i \sum_j n_{i,j}}    (3.11)

• Frequency weighted average accuracy - the class-wise intersection over union, weighted by the occurrence frequency of each class.

\mathrm{FreqW\ Acc} = \frac{1}{\sum_i \sum_j n_{i,j}} \sum_i \frac{n_{i,i} \sum_j n_{i,j}}{\sum_j n_{i,j} + \sum_j n_{j,i} - n_{i,i}}    (3.12)

• Mean intersection over union - the overlap of the predicted and true pixels, averaged over all classes.

\mathrm{Mean\ IoU} = \frac{1}{C} \sum_i \frac{n_{i,i}}{\sum_j n_{i,j} + \sum_j n_{j,i} - n_{i,i}}    (3.13)

These metrics are picked to give a fair picture of the model performance. Since some of the room classes are heavily under-represented compared to other classes in both of the datasets used, it is a legitimate hypothesis that the model will not be able to recall these classes to the same extent as the others. For this reason, the frequency weighted average accuracy is in a sense the metric that gives the most nuanced picture of the actual performance of the model.

All models are evaluated both with and without Test Time Augmentation (TTA) for a complete result. TTA is a set of light augmentations applied to each image in the test set during model evaluation. The final model prediction is then defined as the pixel-wise most common class prediction. This technique reduces statistical fluctuations and therefore gives a better prediction on average.

In addition to this, we also evaluated the models' mean precision and mean recall for rooms and icons. These metrics can be described as follows:
• Mean precision - the fraction of relevant instances among the retrieved instances.

\mathrm{Mean\ Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}    (3.14)

• Mean recall - the fraction of relevant instances that were retrieved.

\mathrm{Mean\ Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}    (3.15)
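A minimal sketch of how these segmentation metrics can be computed from a confusion matrix is given below, with Equations 3.14 and 3.15 treated class-wise on the pixel level. It follows the standard formulation and is not necessarily identical to the evaluation code used in the project.

import numpy as np

def segmentation_scores(conf):
    # conf[i, j] = number of pixels of true class i predicted as class j.
    conf = conf.astype(float)
    n_ii = np.diag(conf)
    true_per_class = conf.sum(axis=1)     # pixels belonging to each class
    pred_per_class = conf.sum(axis=0)     # pixels predicted as each class
    total = conf.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = n_ii / (true_per_class + pred_per_class - n_ii)
        precision = n_ii / pred_per_class
        recall = n_ii / true_per_class
    overall_acc = n_ii.sum() / total                     # Equation 3.11
    freqw_acc = np.nansum(true_per_class * iou) / total  # Equation 3.12
    mean_iou = np.nanmean(iou)                           # Equation 3.13
    return overall_acc, freqw_acc, mean_iou, np.nanmean(precision), np.nanmean(recall)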
3.4 Implementation details

3.4.1 Hardware specifications

The code for this project was written in Python with PyTorch v1.6.0 for the deep learning implementations. This framework was chosen since the earlier work that the project is based on uses it, and it is also one of the currently most popular and widely supported frameworks for building deep learning models. The code for this project can be found at .

We had access to GPU clusters for running our code throughout the second part of the project. For model evaluations and small-scale experiments we used a cluster with an Nvidia Tesla T4 GPU with 16 GB RAM, and for pseudolabel creation and model training we used GPU clusters with the following specifications:
• Nvidia Tesla V100 SXM2 GPU with 32 GB RAM
• 2 x 8-core Intel Xeon Gold 6244 CPU @ 3.60 GHz (16 cores in total)
• 387 GB SSD scratch disk

The training time naturally varied between runs. The maximum training time of about 9 hours was for a run of 100 epochs with a combined dataset of about 14.000 data points, consisting of both annotated and pseudo-annotated images.

3.4.2 Experimental setup for model training

Here we list the settings and hyperparameters that have been kept consistent across all runs, as well as the search ranges for the parameters we have tried to optimise. Since optimising over all parameters would result in a search space too large for the scope of this project, we have limited our search for the best model candidate to the subset of parameters that we think are the most rewarding and interesting for our purpose.

For all experiments, the pretrained model from [1] has been used as the starting point, but with new hyperparameters. The model is a ResNet-152 [12] with a modified output layer that has been pretrained on ImageNet [13] and the MPII Human Pose dataset [53]. The ADAM optimiser, as described in Section 2.1.2, was used with an initial learning rate of 1e-4 and a scheduled learning rate drop governed by the factor 0.5^(k_curr/k_max), where k_curr and k_max are the indices of the current and last epoch respectively. The initial learning rate was picked according to what we found to be conventional, and the learning rate drop schedule was picked based on empirical results from our preliminary experiments. We discovered early that the model's performance rarely changed significantly after more than 80 epochs of continued training from the pretrained model state. For this reason we chose to train all models for no more than 100 epochs from the pretrained state. For all runs, a subset of the CubiCasa5k [1] dataset previously unseen by the model was used as the validation set. The reason for this was simply the lack of a big enough manually labelled set of LIFULL HOME'S [11] images to be able to determine a reliable stopping criterion for the training.

The model we label Ours in Section 4.2 is the best-performing model from our final set of runs. Since we hypothesised that the β-threshold was an important hyperparameter, we trained models with threshold values ranging from 0.95 to 0.99 in intervals of 0.01, with everything else kept identical. From our preliminary experiments we saw that both the reference model and the models that we tested were heavily biased towards predicting false positives for the undefined room class. To take this into account, we tried training all our models with a setting that ignores the contribution of the undefined room class to the loss and then reweights the loss to compensate for the proportion of the image removed. Figure 3.10 shows how the overall accuracy varies between all our final models, and similar plots for the other performance statistics can be found in Table 4.3 and Appendix A.

Figure 3.10: The overall accuracy on the LIFULL HOME'S test set for models trained with different β-thresholds for selecting which pseudolabels to use. Blue: models trained using the ignore-index setting described