DF Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Master's thesis in Complex Adaptive Systems Alexander Radne, Erik Forsberg Department of Electrical Engineering CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2021

Master's thesis 2021 Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Alexander Radne, Erik Forsberg DF Department of Electrical Engineering Division of Computer Vision Chalmers University of Technology Gothenburg, Sweden 2021

Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Alexander Radne, Erik Forsberg © Alexander Radne, Erik Forsberg, 2021. Supervisor and examiner: Fredrik Kahl, Department of Electrical Engineering. Master's Thesis 2021:NN, Department of Electrical Engineering, Division of Computer Vision, Chalmers University of Technology, SE-412 96 Gothenburg, Telephone +46 31 772 1000. Cover: Illustration of different stages of vectorization of a floor plan. Raster image from a scanned or rasterized architectural drawing (left), the neural network's pixel-wise class segmentation map (middle) and the polygonized vector graphics image (right). Typeset in LaTeX, template by David Frisk. Printed by Chalmers Reproservice, Gothenburg, Sweden 2021.

Vectorization of architectural floor plans PixMax – a semi-supervised approach to domain adaptation through pseudolabelling Alexander Radne, Erik Forsberg Department of Electrical Engineering Chalmers University of Technology

Abstract

Machine Learning and Computer Vision techniques are rapidly improving computers' ability to comprehend images. In recent years, these techniques have been applied to information parsing on floor plan bitmap images, thus addressing the problem of converting rasterized images to vector graphics. Current state-of-the-art models have shown great results in predicting walls as well as room types and architectural drawing icons. However, these models require a large amount of annotated data, and since the cost of labelling can be quite high, the currently available datasets are limited in terms of diversity of styles and region-specific features. There is therefore an opportunity for algorithms that exploit unlabelled data to further improve these models. Semi-supervised learning is a set of algorithms commonly used to achieve this.

We propose and analyse three approaches utilising semi-supervised learning through self-training, by letting a model trained on labelled data make predictions on unlabelled data. We then use a collection of the best of these predictions as a basis for creating pseudolabels for further training. In the first approach, we use a probability measure on the model output as a proxy for high-quality predictions. Our second approach is to use a post-processing algorithm as a quality enhancement of the predictions on all unannotated images. Finally, we propose and evaluate our own prediction quality measurement, PixMax. This method aims to give a proxy for how confident the network is in its predictions by measuring the inter-consistency between several non-destructive augmentations of any input image. The resulting pseudolabels are then compared to evaluate whether the network is confident enough for them to be included in the continued training.
With PixMax we obtain results comparable with — and for recall better than — the fully supervised state-of-the-art model that we benchmark against. Our evaluations are carried out both on the labelled and unlabelled dataset used to train the models. As expected, the relative performance boost is most prominent on the unlabelled dataset where we reach a 69 % average recall. We show that the PixMax approach can be used for adapting a trained model to a new domain. Keywords: semantic segmentation, object detection, semi-supervised learning, floor plan images, domain adaptation, self-training. v Acknowledgements First we would like to thank our supervisor Fredrik Kahl for his support and guid- ance during the course of this project. He helped us to both on an academic and administrative level to find and access the right resources to develop the project in the desired way. Lars Hammarstrand helped us to get admitted to a compute project which allowed us to access GPU-resources. We would like to thank him as well as the team at C3SE for helping us with this. Also Anders Karlström was of great assistance to the project by taking of his time to read and sign the application for access to one of the datasets we used. During this time of social distancing and isolation, taking time for some coffee and small talk is more important than ever. We would therefore like to send a special thanks to Erica Samuelsson and Sara Eidenvall for sharing the morning coffee break with us every day and for all the interesting discussions that this led to. We would also like to in particular thank Adnan Fazlinovic, Joel Ekelöf, Sofia Malmsten among many others who have been supportive during this process. Finally we would like to thank our families and friends for all the support and patience shown during this time. Alexander Radne Erik Forsberg Gothenburg, January 2021 vii "You have broken new ground for the Architecture and Engineering programme" — Karl-Gunnar Olsson, former head of programme "Really nice stuff!" — Markus Häikiö, CTO, CubiCasa "Sometimes you don’t see the full picture for all the pixels." — Common saying ix x Contents Abstract v List of Figures xiii List of Tables xvii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Method outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Scope and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6.1 Vectorization of architectural floor plans . . . . . . . . . . . . 4 1.6.2 Raster-to-Vector & CubiCasa5k . . . . . . . . . . . . . . . . . 5 2 Theory 7 2.1 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 8 2.1.2 Optimisation and vanishing gradients . . . . . . . . . . . . . . 9 2.1.3 Residual networks . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 Bias-variance tradeoff . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Semi-supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Pseudolabelling . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Loss functions . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . 17 2.4.1 Loss functions and probability transformations . . . . . . . . . 17 2.4.2 Multi objective loss and relative loss weighting . . . . . . . . . 18 2.5 Consistency regulation . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.1 Vicinal Risk Minimisation . . . . . . . . . . . . . . . . . . . . 20 2.5.2 Geometric transformation consistency regularisation . . . . . . 21 3 Method 23 3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.1 Annotated data - The CubiCasa5k dataset . . . . . . . . . . . 23 3.1.2 Unannotated data - The Lifull Home’s dataset . . . . . . . . . 24 3.2 Pseudolabelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 xi Contents 3.2.1 Statistical approach . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.2 Post-processing technique . . . . . . . . . . . . . . . . . . . . 28 3.2.3 PixMax pseudolabelling technique . . . . . . . . . . . . . . . . 30 3.3 PixMax self-training . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 Network model . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.2 Data diversity augmentations . . . . . . . . . . . . . . . . . . 33 3.3.3 Evaluation datasets . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.1 Hardware specifications . . . . . . . . . . . . . . . . . . . . . . 35 3.4.2 Experimental setup for model training . . . . . . . . . . . . . 36 4 Results 39 4.1 Pseudolabelling techniques . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.1 Statistical approach . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.2 Post-processing technique . . . . . . . . . . . . . . . . . . . . 41 4.1.3 PixMax pseudolabelling technique and model training scheme 42 4.2 Results for PixMax model training scheme . . . . . . . . . . . . . . . 44 5 Discussion 49 5.1 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.1.1 Pseudolabelling techniques . . . . . . . . . . . . . . . . . . . . 49 5.1.1.1 Statistical approach . . . . . . . . . . . . . . . . . . 49 5.1.1.2 Post-processing technique . . . . . . . . . . . . . . . 50 5.1.1.3 PixMax pseudolabelling technique . . . . . . . . . . 50 5.1.2 PixMax model performance . . . . . . . . . . . . . . . . . . . 51 5.2 Limiting factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.1 Data sufficiency and utilisation . . . . . . . . . . . . . . . . . 52 5.2.2 Post-processing algorithm . . . . . . . . . . . . . . . . . . . . 53 5.3 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4 Contributions and implications . . . . . . . . . . . . . . . . . . . . . 54 6 Conclusion 55 Bibliography 57 A Results of all tested model hyperparameters I B Class distribution III C Visual comparison of models V xii List of Figures 1.1 A concept representation of the method first introduced by Kalervo. et al [1] where a specific set of interest points are detected to aid the vectorization algorithm that is separate from the main network model. 6 2.1 The concept of a convolutional layer. In this particular example we have data in 2 dimensions and a third kernel dimension. The items in the data tensor gets element-wise multiplied with a kernel tensor and summed to form the consequent layer in the network. . . . . . . 
9 2.2 For nested function classes, using a bigger function class means that we can get closer to the true function G, but this is not necessarily the case for non-nested function classes. . . . . . . . . . . . . . . . . . 12 2.3 The structure of the ResBlock. The function f is split into a residual and an identity function. Only the residual function is propagated through the network to later be added back together with the identity function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 The concepts of underfitting and overfitting a model to the data. The model in the middle has a good balance between capturing the main features of the data but is at the same time stable to noise and therefore better approximates the true function (green). . . . . . . . . 14 2.5 The self-training scheme described in [31]. . . . . . . . . . . . . . . . 16 3.1 Examples of the visual style of the images of the three categories in the CubiCasa5k dataset with their respective labels above. The images are scaled to fit the page format. . . . . . . . . . . . . . . . . . 24 3.2 A visual representation of all the different annotation categories of the CubiCasa5k dataset. Junctions, openings and corners are lists of coordinates while rooms and icon categories are pixel-wise segmenta- tion maps over the image. . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3 Examples of the visual style of the images of in the LIFULL HOME’s dataset. The images are scaled to fit the page format. . . . . . . . . . 25 3.4 The resolution distributions of the different datasets used in the project. A simple random sample of 4000 image instances of each set was used. 25 3.5 An example of what a correlation between the prediction certainty and correctness could look like. The pixels that the network is most sure about is to a high extent also the pixels that are classified cor- rectly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 xiii List of Figures 3.6 Examples of what qcc does with the segmentation maps from the room channel. The examples are randomly sampled from the test set of the CubiCasa5k dataset. The most visually prominent changes is that all segmentation have been translated into simple polygons. . . . . . . . 30 3.7 The concept of how the post-processing algorithm qcc works. Given the predicted junction heatmaps and the room and icon segmenta- tions, qcc can "clean up" the segmentations e.g. by inferring a closed room between 4 suitable L-type corners. . . . . . . . . . . . . . . . . 30 3.8 The PixMax model training scheme. Light blue: The labels for the labelled dataset. Dark blue: The images and predictions for the labelled dataset. Dark green: The images and model predictions for the images in the unlabelled dataset. Light green: The pseudolabels created by the model in the pseudolabelling phase. . . . . . . . . . . 32 3.9 A simplified illustration of the architecture of the model used. Image from [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.10 The overall accuracy on the LIFULL HOME’S test set for models trained with different β-thresholds for selecting what pseudolabels to use. Blue: Models trained using the ignore index setting described above. Red: Models trained without the ignore index setting. : Models evaluated with test-time augmentations. : Models evaluated without test-time augmentations. : The best model that we found in our final experiments. Called ours in the following section. . . . . . 
37 4.1 Left: The proportion of the pixels with US(Gθ) ≥ x for 4 different images. Take note of the logarithmic scale on the x-axis. Right: A zoom-in on the graph of the first image with a higher resolution. . . . 40 4.2 Left: The decrease in Labs as a function of how big proportion of the pixels removed for 100 images. The green, dashed line shows the average over all images. The values are calculated at fixed intervals and interpolated in between. Right: The quotient of the loss of the whole image and the truncated image with respect to the fraction of pixels removed. Note the logarithmic y-axis. All values are weighted to compensate for the fraction of the pixels removed and the size of the image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3 The correlation between LABS before and after qcc for 100 images. . . 42 4.4 The correlation between LCE before and after qcc for 100 images. . . . 42 4.5 A histogram over the distribution of β for 8400 images from the LIFULL HOME’S dataset. The best sample fit for the gamma- distribution has the shape parameter k = 4.22 and the scale pa- rameter θ = 55.69. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.6 Examples of predictions images with different β-values. Column 1: The original image and the room colour legend. Column 2-5: The predictions for each of the augmentations. Column 6: The resulting pseudolabel (most common pixel prediction) and per-pixel βi,j-value maps for the images. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 xiv List of Figures 4.7 Comparison of results from different models. Evaluated on 4 images from the LIFULL HOME’s dataset. . . . . . . . . . . . . . . . . . . . 47 5.1 A conceptual model training training training scheme for inductive conformal prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 B.1 The per-pixel distribution of the room classes in the CubiCasa5k dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III B.2 The per-pixel distribution of the icon classes in the CubiCasa5k dataset. III B.3 The per-pixel distribution of the room classes in the LIFULL HOME’s dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III B.4 The per-pixel distribution of the icon classes in the LIFULL HOME’s dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III C.1 Comparison of results from different models. Evaluated on 4 images from the CubiCasa5k dataset. . . . . . . . . . . . . . . . . . . . . . . VI xv List of Figures xvi List of Tables 1.1 The layers of information that our model extracts from a floor plan image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.1 The structure of the output layers of the network. . . . . . . . . . . . 24 4.1 1st row: The β-thresholds used for model training in the PixMax training scheme. 2nd row: The number of images with a β-value larger than the tested set of thresholds. 3rd row: The percentage of pseudolabelled images used in the model training scheme (4200 labelled examples used for all runs.) . . . . . . . . . . . . . . . . . . 43 4.2 Per class-comparison between CubiCasa’s (CC) model[1], our best reproduced model of CC and our best model achieved using PixMax. All models evaluated on our LIFULL HOME’s test set. Note that classes with a (-) is not present in the test set and can not be evaluated. 45 4.3 Performance comparison of models. 
CubiCasa’s best model vs our best model trained on CubiCasa5k training data and unannotated LIFULL HOME’s data, tested on the CubiCasa5k test set . . . 46 4.4 Performance comparison of models. CubiCasa’s best model vs our best model trained on CubiCasa5k training data and unannotated LIFULL HOME’s data, tested on our annotated LIFULL HOME’s test set . . . 46 4.5 Performance of our best model (with β = 0.97 and TTA) on the LIFULL HOME’s test set. Subscript p stands for the post-processed (polygonized) predictions. . . . 46 A.1 Full table of all evaluated models. . . . II

1 Introduction

In this chapter we give a brief background to the topics investigated in this thesis. We start by describing the possible benefits for the industry of using the techniques that we propose, and we then give a brief outline of our proposed method. In the last section of the chapter we present a chronology of work that has been done on related topics over the last decade, and the papers that this project is based on.

1.1 Background

The building industry has undergone several large technological changes during the last few decades. One of the most prominent examples of this is the way digital tools are now used to aid the complex coordination of big projects that span multiple disciplines and long time scales. Even though the industry is in general open to digital development and progression, the evolution is slow due to the long project time scales and the correspondingly slow turnover of information. Until a few years ago, digitalisation mostly focused on streamlining the design process, for instance by simplifying communication between disciplines with automated clash checks and by moving from 2D drawings to 3D modelling software.

Today effectively all new production design and planning is heavily aided by digital tools such as CAD (Computer Aided Design) and BIM (Building Information Modelling) software. These tools greatly improve the coordination of workflows between different disciplines, and they make it easier and faster to change parts of the design at different stages of the design process compared to the traditional way of designing buildings with pen and paper. Moreover, modern software of this kind is in general both vector-based and supports some form of object-oriented modelling. This gives the user the ability to combine drawings and other kinds of information in an efficient way.

Vector-based drawings have the advantage of being easy to both modify and annotate compared to raster-based drawing formats. In combination with their metadata attachment capability, they are in many ways far superior to traditional raster-based images when it comes to versatility and maintenance, not only in the planning of a project but throughout its whole life cycle. The problem that we want to address is that many of the drawings used to convey information to clients and customers are stripped of this information when they are converted to a raster image format for distribution outside the software where they were originally made.
There is also a large fraction of drawings that were made before the adoption of this software and that hence have only ever existed in a raster-based image format or as physical printed drawings. By converting these drawings (back) into vector format, many new possibilities for how they can be used will emerge.

Floor plans are the type of architectural drawing most often used to carry information about a building or an apartment to the general public, and they are also one of the most common types of drawing to encounter in all sorts of projects. Having these drawings in an annotated, vector-based format would open doors for e.g. property owners, real estate agents and property management firms that want to convey information about floor plan layouts in a more intuitive way. This could be done by e.g. creating a 3D representation of the property on their website, which is a much easier task with a vectorized floor plan as a basis than with a raster image, since the geometrical information is represented explicitly in a vector-based image. Another potential application is to extract information from old drawings to be included in a reference database that architects, planners, engineers and others can use to make informed decisions.

1.2 Proposal

Our proposal is to improve on the current automated pipelines that exist for converting raster-based floor plan images into a vectorized (mathematically represented) format with the use of machine learning. To do this, we want to create a model that is able to distinguish between several of the most common ways that floor plans are represented, by detecting a finite set of features such as walls, room spaces, doors and windows. After identifying these features, the model should be able to create an accurate vector representation of the floor plan, with geometries in the form of drawing "symbols" and metadata attached to these object symbols. The output from the model should be in a format that is easily read and converted to the most common and widely used CAD file formats. Once the result is converted to one vector-based format, it is quite easy to convert it to others, since most modern CAD software has built-in methods for converting files from other common formats.

Recent work in the area by Kalervo et al. and Liu et al. [1], [2] has been shown to give good and reliable results using Artificial Neural Networks (ANN). Despite reaching impressive results, the lack of large, annotated datasets is in these works pointed out as one of the greatest challenges in creating a model with even better generalisation capabilities. This project aims to work towards a solution to this problem by introducing a framework for using unannotated floor plan images to let the model learn to work on images with novel drawing styles. By extension, this idea can also be seen as a step towards being able to create larger, annotated, custom datasets of floor plans that can be used for data analysis or to train ever more intricate models. In other words, a good and reliable model for parsing floor plan images might be used for annotating large datasets to be used for other applications.

1.3 Method outline

To address the lack of large quantities of annotated data of high quality, our proposed method is based on Kalervo et al. [1] and Liu et al.
[2], with the difference that we instead use a semi-supervised approach, allowing us to use a large dataset of unannotated floor plan drawings to train the model further.

Our approach consists of letting the network model perform predictions on unannotated data. These predictions will, after carefully chosen refinements, be used as pseudolabels for the model to be further trained on. In order to increase the performance of the network we test several techniques for improving the quality of the pseudolabels based on the network output. In the first pseudolabelling method, we evaluate the potential correlation between the "confidence" (network output after the softmax layer) and the average pixel accuracy. Second, we measure the potential increase in quality after using the post-processing algorithm from [1]. Finally, we develop a prediction quality measurement — PixMax — based on a batch of non-destructive augmentations of the same image. This measurement is used to select only pseudolabels of high quality for further training.

1.4 Scope and limitations

We limit the scope of the project to architectural floor plans only. In most projects, a variety of different floor plan drawings are used to convey different types of information to the construction workers. These plans can include installations and fixtures, electrical wiring and plumbing, the materials being used and different phases of the building process. We have chosen to only look at architectural drawings since this is the type of drawing that is primarily used after the building is finished to display information about its architectural qualities. To further limit the scope of the project we only use single-level floor plans for our model. Following the approach of [1], we limit the scope to a set of 12 room classes, 11 icon types and 21 types of interest points, as given in Table 3.1.

Table 1.1: The layers of information that our model extracts from a floor plan image. In total 44 output maps: 21 interest point heatmaps (wall corners: 13, opening endpoints: 4, icon corners: 4) and 23 segmentation maps (room classes: 12, icon classes: 11).

Two datasets will be used in this project: one annotated dataset that is used to train the initial model, and one unannotated dataset that will be used for creating pseudolabels and for evaluating the model's performance after the extended training.

1.5 Research questions

• With what accuracy are we able to recover information from a raster image of a floor plan to a vectorized representation using our proposed method PixMax?
• How does our algorithm compare to state-of-the-art results in the field?
• Could a semi-supervised algorithm be used to improve the results based on only a fraction of annotated data in the dataset?

1.6 Related work

The following is a brief summary of the field and the two papers that have inspired this thesis the most.

1.6.1 Vectorization of architectural floor plans

The problem of converting floor plan raster images to a vector-based format has been explored extensively over the last few years [1]–[5]. The techniques used for the task have shifted from conventional algorithms such as patch-based segmentation [4] to the use of neural networks, as bigger datasets have been released and the cost of computation has become cheaper, making the data-hungry networks a viable option [6]. Due to the intricacy of the task and the variety in the data, neural networks have been shown to yield good results compared to other algorithms.
This can be attributed to their ability to find complex correlations in the data and to generalise [1], [2].

A commonly used method for image parsing in recent years is semantic segmentation, where an image is split up into a set of per-class, pixel-wise segmentation maps, each corresponding to one of N predefined classes in the specific dataset [7]. Several datasets have been released and the results have been steadily improving as better techniques have evolved [8]. Although great results have been shown on a variety of different datasets, the main focus of the research has been on natural image segmentation, since this has been an important problem to solve for several major industries, such as the automobile industry, where the goal of building self-driving cars is a strong driving force.

Although not the main focus of the research in machine learning, a significant amount of work has been done on semantic segmentation for human-created images such as drawings. The idea of using segmentation maps as a way to automatically vectorize architectural floor plan drawings in some cases predates the use of neural network based methods. Heras et al. used a statistical, grid-based method to segment walls, windows and doors from floor plans [9]. To train their model they used the very popular CVC-FP dataset [10], at the time one of the biggest and most popular publicly available floor plan datasets with annotated images. After this work, segmentation methods making use of neural networks seem to have become increasingly popular. Dodge et al. [3] proposed in 2017 a method where Optical Character Recognition (OCR) from the Google Vision API was combined with a fully convolutional network, the Faster R-CNN framework, to obtain a model that can both segment walls and interpret semantic information in the input, such as measurements and room types written in the drawing. In their work they also introduced a new public floor plan dataset known as the R-FP dataset, containing 500 high-resolution real-estate floor plan images.

In 2018 Yang et al. managed to segment walls and doors simultaneously [5] using U-Net+DCL, an alteration of the U-Net where the deconvolutional layers were replaced with a simplified version of pixel deconvolution layers. They managed to achieve a validation pixel accuracy of 97.5% and 99.5% for walls and doors, respectively, establishing that impressive results can be reached using convolutional neural networks on real-estate floor plans.

1.6.2 Raster-to-Vector & CubiCasa5k

The two works that this project is most heavily inspired by are Kalervo et al. [1] and Liu et al. [2]. Liu et al. proposed in their 2017 paper a learning-based method with multiple objectives. By transforming a raster image with the model into both heatmaps of low-level geometric and semantic information (a set of corner and end points) and a semantic segmentation map of different room and icon types, they managed to extract multiple layers of information from a single image. In contrast to Dodge et al. [3], where the architecture consisted of multiple networks that learned separate tasks, a single model was used for all learning goals. This was done by implementing a single fully convolutional network (FCN) with a multi-objective loss for the different output maps, and then combining the individual losses, with one weight per prediction category, into a total loss that is used to backpropagate the network.
The advantage of this approach is that the geometric feature maps (corners, wall end points etc.) can be used in an intelligent post-processing scheme that aims to refine the rather coarse network output in terms of the room and wall segmentation maps. A conceptual illustration of this can be seen in Figure 1.1. For example, by knowing with high precision the four corner points of a rectangular room, the segmentation map of the pixels within that room can in some cases be improved, since all pixels within that room most likely are of the same class. This has been investigated and evaluated in Section 3.2.2 to see if it can be used to further improve our model. Their algorithm ultimately yielded around 90% precision and recall for wall junctions, walls, drawing icons and rooms on the LIFULL HOME’S dataset [11], significantly outperforming most other methods trying to extract the same amount of information from a single drawing.

In 2019, Kalervo et al. [1] continued this work by using the same baseline network architecture, ResNet-152 [12] pretrained on ImageNet [13], but they extended the dataset used substantially by making use of their (CubiCasa) manual floor plan annotation pipeline to collect 5 000 high-quality, human-annotated data points from a set of 15 000. They also more than doubled the number of target room and icon classes to get more reliable and exact predictions. Making use of this data, they managed to outperform [2] in both recall and accuracy for all classes but one, all while making use of a single model for predicting all the different feature maps with varying learning criteria.

This project intends to continue building on the framework that was created and subsequently refined by these two works.

Figure 1.1: A concept representation of the method first introduced by Kalervo et al. [1], where a specific set of interest points is detected to aid the vectorization algorithm that is separate from the main network model.

2 Theory

This chapter will give a brief background to some of the techniques and key concepts used in machine learning in general and in this project in particular. We will cover the theory that is relevant to the project; it will however be assumed that the reader has a solid understanding of the field a priori. We will go into themes related to semi-supervised learning and computer vision through convolutional neural networks more thoroughly, since these are the primary concepts for this work and it is vital to understand the techniques used properly. We will also touch upon concepts such as residual neural networks and multi-objective loss functions.

2.1 Deep learning

Deep learning is a subgenre of machine learning that deals with algorithms based on Artificial Neural Networks (ANN) and representation learning. Deep learning is substantiated by the notion that there exists some non-linear, often complex function G that can map points x from a high-dimensional input domain H to target points y in a defined target domain I:

G : H → I. (2.1)

The assumption we usually make is that for certain kinds of problems we only need to know a small fraction of all point mappings from input to target space to be able to predict a much bigger fraction with good precision. Based on this assumption alone we cannot induce any bound on the complexity of G. To make this model framework useful we need to be able to approximate it with a reasonably good parameterised approximation Gθ.¹

¹ θ typically represents the weights and biases in an artificial neural network.
So we want Gθ to mimic the mapping of G for x in the domain we are concerned with. For a parametrization θ of dimension m, the objective can be stated as follows:

find θ ∈ R^m s.t. ∀x ∈ H : Gθ(x) ≈ G(x). (2.2)

Since we in general do not know what G does for the majority of all x ∈ H, we cannot explicitly use equation 2.2 to find θ. But if we have a random sample of known mappings of size B, we can view {x, y} = {{x1, . . . , xB}, {y1, . . . , yB}} as a random variable and use this to approximate the real probability distribution of x, P(x), by an empirical approximation Pemp(x). We can now try to find the set of parameters θ̂ that, when applied to a fixed G, minimises the expected loss over this simplified probability distribution Pemp(x):

Gθ := Gθ̂ with θ̂ := argmin_θ ∫ L(Gθ(x), y) dPemp(x). (2.3)

Here L is a loss function that measures the difference between the prediction and the target. The exact composition of this function will be further discussed in Section 2.4.1. For practical purposes we would also like θ not to be too big, since we want to be able to conduct calculations with it in reasonable timescales.

Generally speaking, it is not always the case that such a function exists, but it has been shown empirically that for certain kinds of problems it often seems to be. Fortunately, these problems often coincide with problems that have applications in many areas, and that is the reason why deep learning, and in particular deep artificial neural networks, has become so popular in recent years. It gives us a framework for finding well-behaved parametrizations of seemingly arbitrary mappings.

2.1.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a type of artificial neural network that is commonly used in computer vision applications. Their high performance in image analysis partly comes from their shared-weights architecture and translation-invariant characteristics. A convolutional network is defined as any artificial neural network that uses one or more convolutional layers in its architecture.

A convolution can be understood as a filter that is slid over portions of the previous layer to calculate the next. This gives the model a way of perceiving neighbourhoods in the input vector, and it is therefore useful when there are thought to be large structures in the data that are linked to the closeness of its building blocks. One of the most classical examples of this is shapes and objects in an image. The filter is called a kernel, and it can be distributed in one or more dimensions. For analysis of colour images, 3-dimensional kernels are most commonly used, since this corresponds to the two spatial dimensions of the image plus the "channel dimension" where the red, green and blue values are stored separately.

A kernel is a small matrix of weights. The placement of the individual weights in the kernel can be arranged such that it is tuned to detect a certain kind of low-level feature, such as lines, edges or dots, in the data tensor. This is done by element-wise multiplication between the kernel and the values in a region of the preceding layer, as can be seen in Figure 2.1.
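To make the multiply-and-sum operation in Figure 2.1 concrete, the following is a minimal NumPy sketch of a single 2-D kernel slid over a 2-D input without padding. It is an illustration written for this text, not code from the project; in practice a framework routine such as torch.nn.functional.conv2d would be used.

# A minimal sketch (not the thesis implementation) of the multiply-and-sum
# operation in Figure 2.1: one 2-D kernel slid over a 2-D input, no padding.
import numpy as np

def conv2d_single_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiplication of the kernel with the region it
            # currently covers, summed to one value in the consequent layer.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

if __name__ == "__main__":
    img = np.random.rand(8, 8)
    edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # crude vertical-edge detector
    print(conv2d_single_kernel(img, edge_kernel).shape)  # (6, 6)

Stacking several such kernels along an extra dimension gives the kernel tensor discussed next.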
By combining multiple kernels with different feature detecting abilities into a kernel tensor, the model can assimilate the distribution of such features in the data, which can hence be analysed with a maintained geometrical interpretation, as opposed to fully connected layers where all geometrical integrity is lost.

Fully convolutional networks are networks that do not contain any fully connected layers but rely completely on convolutions throughout the propagation. Since a fully connected layer is equivalent to using a kernel of size 1 × 1, or 1 pixel in the image processing setting, the geometric interpretation of an FCN with larger kernels is that it exclusively considers regions in its propagation and never the values of unique neurons.

Figure 2.1: The concept of a convolutional layer. In this particular example we have data in 2 dimensions and a third kernel dimension. The items in the data tensor get element-wise multiplied with a kernel tensor and summed to form the consequent layer in the network.

2.1.2 Optimisation and vanishing gradients

In Section 2.2 we will describe how we want to improve our candidate Gθ over time with a backpropagation algorithm B that depends on the current parameter state and the chosen loss function. There are however many ways we can define B to do this. The most straightforward approach would be to look for the direction in which to tweak the parameters of Gθ to get the biggest local decrease in the loss function. We can then take a step of size η, known as the learning rate, in that direction. This is what is called the gradient descent method, and one update step can be written as

θt+1 ← θt − η ∇θ L(θt | x, y) (2.4)

for some loss function L and our input–target dataset (x, y) of size N. However, there are a few problems with this method. Since the gradient is calculated for every single data point in each step, it can be very slow if N is large, especially if θ is large as well. Another unwanted feature of this method is that it is greedy, in the sense that it will always choose the direction that is locally thought to be the most efficient step at any time. The problem with this is that the algorithm can get stuck in a local optimum without having any chance of getting out of it to find the global optimum.

A popular way to solve these problems is to use a stochastic gradient descent method, first described in a paper by Robbins et al. [14]. In our setting, one step can be described as

θt+1 ← θt − η ∇θ (1/K) Σ_{i=1}^{K} L(θt | xi, yi), (2.5)

where {x1, . . . , xk} ⊂ x is a random subset of x of size k, with corresponding targets {y1, . . . , yk} ⊂ y. In the original paper k was set to 1, but in the general case the batch size can be set to any number 1 ≤ k < N to reduce the probability of an unrepresentative sample while maintaining a big computational advantage compared to deterministic gradient descent. Since {x1, . . . , xk} is a random variable, it introduces the possibility of occasionally moving in locally non-optimal directions that can be globally beneficial, which makes the method less prone to getting stuck in local optima.

Both of these algorithms have a fixed learning rate η that does not change throughout the training. There have been many approaches to making the optimisation more effective with dynamic learning rates. Some of these methods include AdaGrad, which works well with sparse gradients [15], and RMSProp, with good performance in on-line, non-stationary settings [16].
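Before turning to adaptive learning rates in more detail, a minimal PyTorch sketch of the minibatch update in equation 2.5. The model, loss and data below are placeholders chosen for illustration, and in practice the built-in torch.optim.SGD would be used instead of the explicit parameter update.

# A minimal sketch of minibatch stochastic gradient descent (equation 2.5),
# with the update written out by hand to mirror the formula.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)                    # stand-in for G_theta
loss_fn = nn.CrossEntropyLoss()             # stand-in for L
data = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=16, shuffle=True)  # random subsets of size k
eta = 0.1                                   # learning rate

for x_batch, y_batch in loader:
    loss = loss_fn(model(x_batch), y_batch)  # (1/K) sum_i L(theta | x_i, y_i)
    model.zero_grad()
    loss.backward()                          # gradients with respect to theta
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad                # theta <- theta - eta * gradient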
In 2015, Diederik Kingma and Jimmy Ba proposed an algorithm called ADAM, short for adaptive moment estimation, that combines the benefits of AdaGrad and RMSProp [17]. It does this by introducing momentum into the training schedule through decaying moving averages of the gradient and the squared gradient. One step of the ADAM algorithm can be described as executing the following steps. We first update the decaying moving averages of the gradient m and the squared gradient v:

m ← β1 m + (1 − β1) ∇θ L(Gθ | x, y),
v ← β2 v + (1 − β2) ∇θ L(Gθ | x, y)². (2.6)

These become estimates of the 1st and 2nd moment of the gradient of the objective function L. Since they are initialised to 0, they are negatively biased. To counteract this we calculate bias-corrected versions of these variables in the following way:

t ← t + 1, m̂ = m / (1 − β1^t), v̂ = v / (1 − β2^t). (2.7)

We also update our time parameter, since we have the time-dependent terms β1^t and β2^t that we want to become progressively smaller as the effect of the initialisation wears off. Finally, we update the parameters of the network using m̂ and v̂ with the following equation:

θt+1 ← θt − η m̂ / (√v̂ + ε). (2.8)

In ADAM we introduce three new hyperparameters: β1, β2 and ε. The β-terms correspond to the exponential decay rates of the 1st and 2nd moment estimates of the gradient respectively, and ε is just a small number that prevents us from getting a zero term in the denominator of equation 2.8.

It was for some time hypothesised that it was possible to create more powerful convolutional networks just by stacking more layers, because of the recent breakthroughs in image classification [18] and object detection [19]. However, it was also recognised that deeper neural networks are often more difficult to train, and it was shown in 2016 that if you just keep adding more layers to a network, it eventually gets worse, not better [12]. This is in part because adding more parameters will make a network more prone to overfitting if the dataset is small [19], but another big contributor is what has been called the fundamental deep learning problem: the vanishing gradient problem. It was first formally identified by S. Hochreiter in 1991, and ten years later an additional paper in English by Hochreiter et al. was published that elaborates further on the topic with more extensive surveys [20], [21].

The vanishing gradient problem stems from the way deep neural networks are traditionally trained. Through backpropagation, the weights of each layer are updated based on the gradient of the previous layer's activation function [22]. The core of the problem comes from the fact that the activation function is chosen to squeeze any input into a much narrower range, e.g. (0, 1) for the commonly used sigmoid function. As the derivatives propagate through the network, they become a chain of derivatives that each depend on the previous one. For the weights of the first hidden layer, the update formula becomes

∂L/∂W1 = (∂L/∂Vn)(∂Vn/∂Vn−1) · · · (∂V1/∂W1), (2.9)

where Vk and Wk are the outputs and weights of the kth layer. Now, since each layer uses an activation function, we are going to get the derivative of the activation function as the outer derivative for each layer. In the kth layer we get

∂Vk/∂Vk−1 = (∂φ(zk)/∂zk) Wk (2.10)

for some activation function φ, where zk = Vk−1 × Wk [23]. If we choose φ(x) = Sigmoid(x), the terms containing ∂φ always have an amplitude in the interval (0, 1/4].
The standard approach to weight initialisation in a typical neural network is W ∼ N(0, 1). Hence, the weights in a neural network will also usually be between −1 and 1. As we multiply more and more of these terms together, it is easy to see that the gradients quickly grow small and hence are barely affected by the backpropagation. This also explains why the problem is especially prominent in networks with many hidden layers. On the other hand, if an activation function with a large derivative in the relevant interval is used, these terms can also accumulate and instead cause exploding gradients. Exploding gradients result in exponentially large updates to the network weights, which is likely to cause a very unstable network.

2.1.3 Residual networks

There are a few ways to deal with the problem of vanishing gradients. The trivial solution is to just make the networks shallower. However, this solution has some drawbacks, since it has been shown that the depth of the network is often of great importance to its performance, as stated earlier [18], [19], [24]. Moreover, the activations at different depths of a deep network have been shown to sometimes have a useful interpretation by encoding a hierarchy of different feature sizes. The early layers can be thought of as representing low-level features such as lines and dots, while layers closer to the output are capable of capturing high-level features such as shapes or objects [25].

A better solution to the problem was proposed by He et al. with the introduction of ResNet in their 2016 paper [12]. What they suggested was to add skip connections to the network to avoid the problem. A skip connection is a connection that jumps over a certain number of layers, a so-called ResBlock, and then connects back to the network.

The argument for using deeper networks is that since a deep network Gn defines a more powerful function class than its shallow counterpart Gk, it should in some sense have a better potential to mimic the true mapping G that we want to find. However, this might not be the case, because it assumes that Gk is nested within Gn such that Gn can do everything that Gk can do and more [26]. This concept is illustrated in Figure 2.2.

Figure 2.2: For nested function classes, using a bigger function class means that we can get closer to the true function G, but this is not necessarily the case for non-nested function classes.

The reasoning behind the proposal by He et al. is that in theory the function class of a deeper model should completely enclose that of a shallower one, since it can just mimic the shallower model by using identity mappings, but that in practice this is not always the case, because the identity function is not a trivial function to learn. Hence the deeper model can struggle to make as good predictions as the shallow one, simply because it needs a lot of training data just to learn which layers should have an identity mapping.

By realising this, it was deduced that we can help the model by explicitly reformulating the layers as residual functions with reference to the layer inputs. This builds on the observation that we can split any function f(x) into a sum of the identity function I(x) := x and a residual function r(x) := f(x) − x. We can then propagate the residual function through a number of layers in the network and then add the identity back to it, as can be seen in Figure 2.3.
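A minimal PyTorch sketch of such a residual block, in the spirit of Figure 2.3. It is simplified compared to the blocks actually used in ResNet [12], which also contain e.g. batch normalisation; the layer sizes are illustrative.

# A minimal residual block: the block learns only the residual r(x) and the
# identity is added back at the end, f(x) = r(x) + x.
import torch
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pushing the residual weights towards zero leaves the identity mapping.
        return torch.relu(self.residual(x) + x)

if __name__ == "__main__":
    block = ResBlock(channels=8)
    print(block(torch.randn(1, 8, 32, 32)).shape)  # torch.Size([1, 8, 32, 32])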
This makes it easy for the model to "skip" a layer by just pushing all the weights to zero, which has empirically been shown to be much easier than finding the identity mapping. We are basically giving the network a shortcut that makes it possible to combine the power of a deeper network with the agility of a shallower network.

Figure 2.3: The structure of the ResBlock. The function f is split into a residual and an identity function. Only the residual function is propagated through the network, to later be added back together with the identity function.

2.2 Supervised learning

The framework that deep learning presents to us for finding a suitable candidate for Gθ is the training of an artificial neural network. The principle is to initiate a model with the architecture of a layered network with many free parameters that can be tuned to make it imitate G. This is usually done through a process known as forward and back propagation, where a set of data points with known mappings, {x, y} = {{x1, y1}, . . . , {xB, yB}} ∈ H × I s.t. ∀i ≤ B : xi ↦ yi under G, are presented to the model and its parameters are updated to reduce the prediction error in each time step:

θt+1 ← θt + B(θt, L(x, y)). (2.11)

Here B is a backpropagation algorithm that updates the model in a way that is likely to reduce its prediction error, e.g. by using stochastic gradient descent. As established earlier, some kind of metric is required for how good the model's current prediction is. This distance function, represented by L in equation 2.11, is in deep learning known as the energy function or the loss function.

2.2.1 Bias-variance tradeoff

One of the biggest dilemmas in supervised learning is what is known as the bias-variance tradeoff. The issue comes from the fact that we only use a small subset of all possible examples to fit a model that we want to generalise well to all data in the distribution [27]. This leads to an inevitable tradeoff between two different sources of error:

• The model bias measures the average difference between the model prediction and the target. A model with high bias cares little about the training data it is presented with and tries to oversimplify the problem. Therefore, models with high bias are often described as being underfitted.
• The model variance measures how much the model predictions move around their mean on average. A model with high variance pays a lot of attention to the specific training data it is presented with but does not generalise well outside this specific sample. A model with high variance is often described as overfitted to the training data.

Figure 2.4: The concepts of underfitting and overfitting a model to the data. The model in the middle has a good balance between capturing the main features of the data while at the same time being stable to noise, and therefore better approximates the true function (green).

The tradeoff is not only a conceptual construct to more easily describe model behaviour; it can be shown that the expected test loss of any model can be described in terms of its variance and bias errors in the following way [28]:

E_{x∈H}[(Gθ(x) − y)²] = Bias_{x∈H}[Gθ(x)]² + Var_{x∈H}[Gθ(x)] + ε². (2.12)

Here E_{x∈H}[(Gθ(x) − y)²] is the expected test Mean Squared Error (MSE). This refers to the value we would approach if we estimated Gθ based on a large number of training sets from the distribution H and averaged the squared distances from the model predictions to the targets of iid samples x, also in H.
Since the variance term is always non-negative and the bias term is squared, it is easy to see that E_{x∈H}[(Gθ(x) − y)²] ≥ ε², where ε is the irreducible error in the data. This is the so-called unexplained variance, also called the noise. Equation 2.12 also implies that it is impossible to escape this tradeoff: a model with zero variance will inevitably have unbounded bias and vice versa [29].

When trying to find the parameters of a model that minimise some objective function, we only use a small subset of the possible samples that could be in the distribution. If the model is trained until convergence, we therefore run a big risk of lowering the bias term too much at the expense of the variance. This effect is especially prominent for highly non-linear models with a large number of parameters, such as ANNs [30]. To find a reasonable balance between the two sources of error, we use a separate partition of the dataset — the test set — independent of the training set, to determine when the model is starting to become overfitted and to terminate the training at that point.

2.3 Semi-supervised learning

Semi-Supervised Learning (SSL) is a framework for machine learning where we use both labelled and unlabelled data to train a model. The primary assumption in SSL, used to justify the technique, is that a small amount of labelled data together with a bigger amount of unlabelled data can be used to create a stronger model than either of the two datasets on its own. This has also empirically been shown to be the case for many important problems [31]. For instance, it has been shown that such classification models perform better than models trained only on labelled data, and that joint training — where both labelled and unlabelled data are used simultaneously — is one of the most successful iterative approaches to semi-supervised learning [32].

Thanks to their high performance-to-cost ratio, semi-supervised learning models have risen in popularity over the last years, and many frameworks that use a combination of labelled and unlabelled data have been developed [31]. But how can we know for which problems we can hope for semi-supervised models to work? Or more precisely: if we compare an algorithm that only uses labelled data to one that has access to both labelled and unlabelled data, when is it reasonable to think that the combined model can make a more accurate prediction? In general, one could say that there are gains to be made if the knowledge of P(x̂) that one gains through the unlabelled data x̂ is useful in the inference of the conditional label distribution P(y | x). For this to be the case, some assumptions on the correlation between the labelled and unlabelled data distributions need to be fulfilled. All semi-supervised learning models make use of at least one of the following statements [33]:

• The semi-supervised smoothness assumption
If two points x1 ∈ x̂, x2 ∈ x in a high-density region are close, then so should their labels y1, y2 be. This is to say that the true mapping G is at least as smooth in areas where we have many observations as in regions where we have few or none. This implies that if a path of high density links two points, their outputs are likely to be close, but if a low-density region separates them, then their outputs can very well be quite different.

• The cluster assumption
The data tends to come in discrete clusters, and data within one cluster is likely to have similar labels.
If this is the case, then the unlabelled data points might help us to find the cluster boundaries more accurately. In the idealised case we just need one labelled point to tell us the flavour of the cluster, and we can then map out its outline by introducing more unlabelled points. Note that this assumption does not say that points from multiple clusters cannot have similar labels.

• The manifold assumption
The data points x ∈ Rn, x̂ ∈ Rn lie roughly on a manifold M of dimension k, where k << n. This is useful because of what is known as the curse of dimensionality: the volume grows exponentially with the number of dimensions of our data, and thus exponentially more data is required to reach the same sample density in a higher-dimensional space. However, if we can find a manifold of lower dimension that accurately portrays the structure of the data, we can operate in this subspace and partly avoid the problem.

For it to be reasonable to make any of these assumptions, we need to know that the probability distribution of our unlabelled data, P(x̂), is the marginal distribution of that of our labelled data, P(x). This means that x̂ and x must come from the same underlying distribution. This is not always possible to guarantee in practice, but even if it does not hold there are things that can be done if P(x) and P(x̂) share some similarities. For instance, we can use unsupervised domain adaptation, where a model is trained on labelled data from a different distribution than the one it will later be applied on [34].

2.3.1 Pseudolabelling

One of the most obvious, and therefore also earliest, ways of implementing semi-supervised learning is through so-called pseudolabelling or self-training. One of the most basic implementations of a pseudolabelling framework is a wrapping of the supervised learning algorithm. First we train our model on labelled data only, but for each epoch of the training we label a fraction of the unlabelled data points with the current model state and use these as training examples from that point on. When all unlabelled points have been given a label, we continue to train the model until convergence to reach our final model state [33]. The high-level idea of the framework was described already in the 60's [35], [36] but has been much refined and repackaged since then. Another way to perform self-training is to first train the model until convergence on the labelled data and then use this model to label all unlabelled examples at once [37]. The training is then continued with a certain fraction of pseudolabelled data until convergence is once again reached.

Figure 2.5: The self-training scheme described in [31].

2.4 Loss functions

In deep learning the loss function has two main purposes:

• To give a measure of how well a model is currently performing.
• To give a prediction of which direction to nudge its parameters in to most likely increase its performance.

Since we want the model to learn something from our labelled data, the loss function is in general a distance function that measures how close the model's prediction is to the ground truth, e.g. the data label. Depending on the type of information the network is trying to learn, different types of loss functions might be more or less suitable.
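Before going into specific loss functions, the self-training scheme of Section 2.3.1 (cf. Figure 2.5) can be sketched in code, since it is exactly here that labelled and pseudolabelled examples come to share one loss function. The model, confidence threshold and batch handling below are illustrative assumptions rather than the implementation used in this thesis, and the example is written for image-level classification for brevity; per-pixel segmentation follows the same pattern.

# A minimal sketch of the second self-training variant above: pseudolabel all
# unlabelled examples at once with a converged model, keep only confident ones,
# then continue training jointly on labelled and pseudolabelled batches.
import torch
from torch import nn

def pseudolabel(model: nn.Module, unlabelled_images: torch.Tensor, threshold: float = 0.9):
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabelled_images), dim=1)  # per-class probabilities
        confidence, labels = probs.max(dim=1)
        keep = confidence > threshold                           # simple quality proxy
    return unlabelled_images[keep], labels[keep]

def joint_update(model, opt, labelled_batch, pseudo_batch, loss_fn=nn.CrossEntropyLoss()):
    model.train()
    (x_l, y_l), (x_u, y_u) = labelled_batch, pseudo_batch
    # The pseudolabels are treated as ordinary targets in the same loss function.
    loss = loss_fn(model(x_l), y_l) + loss_fn(model(x_u), y_u)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()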
2.4 Loss functions

In deep learning the loss function has two main purposes:
• To give a measure of how well a model is currently performing.
• To give an indication of in which direction to nudge the model parameters to most likely increase its performance.

Since we want the model to learn something from our labelled data, the loss function is in general a distance function that measures how close the model's prediction is to the ground truth, i.e. the data label. Depending on the type of information the network is trying to learn, different types of loss functions may be more or less suitable.

2.4.1 Loss functions and probability transformations

For semantic segmentation, a common choice of loss function is the cross entropy loss, which gives a measure of the difference between the model's probability prediction for each class and the true class, summed over all pixels. Cross entropy loss uses a probabilistic scheme to determine the distance between the true and predicted label for a data point. Since z := G_θ(x) is not necessarily a per-pixel probability distribution over all classes, we may have to transform the output into something that can be interpreted as such, e.g. by using the Softmax function,

\mathrm{Softmax}(z_i) := \frac{e^{z_i}}{\sum_{j \in \mathcal{C}} e^{z_j}}, \qquad q(x) := \big(\mathrm{Softmax}(z_1), \ldots, \mathrm{Softmax}(z_C)\big)^\top,    (2.13)

where C := {1, . . . , C} is the sequence between 1 and the number of classes, denoted by C. Using this notion of q, the cross entropy loss can be written as

L_{CE}(x, y) = -\sum_{p \in P} \sum_{i \in \mathcal{C}} y_{i,p} \log q(x)_{i,p},    (2.14)

where P is the set of pixels in an image, y_{·,p} is the one-hot representation of the true class of pixel p (a C-vector with all zeros except for a single 1 in the position of the true class) and q(x)_{i,p} is the model's predicted probability that x_p belongs to class C_i. As mentioned, a convenient property of the Softmax transformation is that the class probabilities sum to 1 for each pixel in the image and can hence be interpreted as a probability distribution. The transformation can however suffer from numerical issues and is therefore advantageously combined with a logarithmic transformation, as in the case of the cross entropy loss. In cases where no logarithmic transformation is performed, the sigmoid function may be a more suitable choice of transformation. The sigmoid function has a similar interpretation as the Softmax, with the main difference being that it treats the probability of each class as independent of the other classes [38],

\mathrm{Sigmoid}(z_i) = \frac{1}{1 + e^{-z_i}},    (2.15)

still with z = G_θ(x) defined as the raw model output.

When we consider the task of object detection, a common loss measure is the intersection over union,

L_{IoU} = \frac{|A \cap B|}{|A \cup B|},    (2.16)

for a given prediction box A and the true bounding box B [39]. This base formula can be extended in many clever ways to account for multiple predictions and classes. However, for it to be stable it requires the objects to be detected to be large enough that it is reasonable to treat the measure as a continuous function. If we instead want to find points of interest in an image, a regression loss function such as the mean squared error,

L_{MSE}(x, y) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i \in \mathcal{C}} (z_{i,n} - y_{i,n})^2,    (2.17)

is usually a better choice. Here z_{i,n} is the location of the model's prediction of the nth occurrence of a point of the ith class. A more in-depth discussion of how different loss functions can be combined follows in the next section.
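To make the relation between Equations 2.13 and 2.14 concrete, the following minimal PyTorch sketch computes the pixel-wise cross entropy loss both from an explicit Softmax and with the numerically stable built-in that fuses the logarithm and the Softmax. The tensor shapes are illustrative assumptions, not the ones used in the project.

import torch
import torch.nn.functional as F

# Raw network output z = G_theta(x): class scores per pixel, shape (B, C, H, W).
B, C, H, W = 2, 12, 64, 64
z = torch.randn(B, C, H, W)
target = torch.randint(0, C, (B, H, W))     # true class index for every pixel

# Softmax turns the scores into per-pixel class probabilities (Equation 2.13).
q = F.softmax(z, dim=1)                     # sums to 1 over the class dimension

# Cross entropy loss (Equation 2.14): explicit log of the Softmax versus the
# numerically stable fused version that PyTorch provides.
loss_manual = F.nll_loss(torch.log(q), target)
loss_stable = F.cross_entropy(z, target)
assert torch.allclose(loss_manual, loss_stable, atol=1e-5)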
2.4.2 Multi-objective loss and relative loss weighting

In some settings we want a model that can make different kinds of predictions on a single data point. The conventional way to do this is to train multiple separate models that each perform part of the full task, but it can also be done by utilising a so-called multi-task model. Baxter et al. showed in the early 2000s that this approach can increase efficiency and learning accuracy for each task [40]. Simply speaking, the way this is thought to work is that inductive knowledge transfer between complementary tasks can improve the generalisation capabilities of a model and therefore result in more stable and reliable results. However, this approach comes with a cost. It requires the model's total loss to be treated as a sum of multiple individual losses corresponding to the different learning objectives. This raises the question of how the different terms of the loss function should be weighted against each other. The performance of each task is in one sense arbitrary, since the objectives of the model can have different scales and units, but it is often desirable to have a model that at least prioritises improving on all tasks equally. Tuning such a hyperparameter manually can be a tedious and time-consuming task, and it has to be redone for each model that we want to train. Kendall et al. [41] showed that the relative weighting of the losses can be considered an implicit learning goal of the model and can therefore be learned automatically through the training.

The proposed method is based on looking at the homoscedastic uncertainty of each task of the model. Homoscedastic uncertainty can be defined as the intrinsic uncertainty of the model, i.e. the part that does not depend on how well-trained the model is, but rather on the insufficiency of the data that the model has been presented with. For regression, if we assume identical observation noise for each data point x in the batch, we can write

y \sim \mathcal{N}\big(G_\theta(x), \sigma^2 I\big),    (2.18)

where y is the batch model output, I the identity matrix and σ the noise scalar of the model. From this we can see that we use the assumption that the model predictions have the same variance and no covariance. For classification we instead have, under the same assumption, that

y \sim \mathrm{Softmax}\Big(\frac{1}{\sigma^2} G_\theta(x)\Big).    (2.19)

Using this we can calculate the joint probability distribution of multiple outputs as

p(y_1, \ldots, y_n \mid G_\theta(x)) = p(y_1 \mid G_\theta(x)) \cdots p(y_n \mid G_\theta(x)).    (2.20)

This means that we can use maximum likelihood inference for the terms on the right-hand side. For the regression and Softmax outputs respectively, what we finally arrive at is the joint objective

\min_{\theta, \sigma_1, \sigma_2} \mathcal{L} = \frac{1}{2\sigma_1^2} \mathcal{L}_1 + \frac{1}{\sigma_2^2} \mathcal{L}_2 + \log \sigma_1 + \log \sigma_2.    (2.21)

The first term of this equation,

\mathcal{L}_1 = \lVert y_1 - G_\theta(x) \rVert^2,    (2.22)

is for the regression labels y_1, and the second term,

\mathcal{L}_2 = -y_2 \log \mathrm{Softmax}\big(G_\theta(x)\big),    (2.23)

is the cross entropy loss of the classification outputs with classification labels y_2. We can now optimise L with respect to all the model parameters θ, σ_1 and σ_2. This can be seen as the combined loss function giving the model a way of learning the relative weights of the losses for each output. If the value of e.g. σ_2 is small, it will increase the contribution of L_2, whereas a large value will decrease its contribution. The objective is regularised by the last two terms, which penalise large values of σ. More details and the entire derivation of Equation 2.21 can be found in the paper by Kendall et al. [41].
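A minimal PyTorch sketch of how the weighting in Equation 2.21 can be realised in practice is given below. The module learns log σ² for each task, a common reparameterisation for numerical stability; the class and variable names are our own and are not taken from [41] or from the project code. The two extra parameters are simply added to the optimiser together with the network parameters θ.

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # Combine a regression loss L1 and a classification loss L2 with learned
    # homoscedastic uncertainties, following the form of Equation 2.21.

    def __init__(self):
        super().__init__()
        # Learn s_i = log(sigma_i^2) instead of sigma_i directly.
        self.log_var_reg = nn.Parameter(torch.zeros(()))
        self.log_var_cls = nn.Parameter(torch.zeros(()))

    def forward(self, loss_reg, loss_cls):
        weighted = (0.5 * torch.exp(-self.log_var_reg) * loss_reg
                    + torch.exp(-self.log_var_cls) * loss_cls)
        # log(sigma_1) + log(sigma_2) = 0.5 * (s_1 + s_2) penalises large uncertainties.
        regulariser = 0.5 * (self.log_var_reg + self.log_var_cls)
        return weighted + regulariser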
2.5 Consistency regularisation

2.5.1 Vicinal Risk Minimisation

The task of labelling large datasets is very labour-intensive, especially when it comes to rich information extraction such as bounding-box annotation of objects or pixel-wise segmentation. Introducing augmentations to increase both the amount and the diversity of the data has been shown to be efficient [42], [43] and is today seen as a standard procedure in all of machine learning. Data augmentation has traditionally been viewed in theoretical statistics as a method of Vicinal Risk Minimisation (VRM) [44]. The reasoning can be understood by first defining the learning problem as the search for a θ that minimises the expected loss. We can write this as a risk function

R(G_\theta) = \int L(G_\theta(x), y)\, dP(x, y),    (2.24)

where our objective is to find θ̂ := argmin_θ R(G_θ) and P(x, y) is the probability density function over all possible source-target pairs. The problem here is that we cannot know what the true distribution P(x, y) is, since we do not have all of that data in our dataset. But given a dataset {x, y} = {{x_1, . . . , x_B}, {y_1, . . . , y_B}} we can still estimate an empirical risk function,

R_{\mathrm{emp}}(G_\theta) = \frac{1}{n} \sum_{i=1}^{n} L(G_\theta(x_i), y_i) \propto \int L(G_\theta(x), y)\, \delta_{x_i}(x)\, dx,    (2.25)

where the delta function is

\delta_{x_i}(x) = \begin{cases} 1 & \text{if } x_i \in \mathbf{x} \\ 0 & \text{otherwise.} \end{cases}    (2.26)

(Note that only the delta function is written as explicitly dependent on x. This is because {x_i, y_i} is ordered in source-target pairs, so adding the y-dependent part would not change the value of Equation 2.27.)

The VRM framework is built on the assumption that we can perform some random modifications to the data and still retain the overall structure and its semantic information to a high degree. This means that we can include modified samples in our training dataset with similar (or even identical) labels as the samples that they are derived from, to get a better approximation of P(x, y). This is done by exchanging δ_{x_i} for some estimate of the density in the vicinity of x_i, say P_{x_i}(x), to get the vicinal risk function

R_{\mathrm{vic}}(G_\theta) = \int L(G_\theta(x), y)\, P_{x_i}(x)\, dx.    (2.27)

By using data augmentation the model's performance is expected to increase, since it becomes less sensitive to overfitting [45].

2.5.2 Geometric transformation consistency regularisation

The VRM framework is in its simplicity very effective for e.g. image classification, since the class of an image is unchanged after an augmentation and its original label can therefore be used. However, for the segmentation task the unchanged-label assumption cannot be made with the same confidence. Mustafa et al. presented in their 2020 paper a consistency regularisation scheme that partly solves this problem by applying reversible transformations to both sources and targets of the training data [46]. By doing this, we can get more than one training example with a perfect label for each of the data points in the training set. The loss function for this training scheme can be written as a sum of a supervised and an unsupervised term,

\mathcal{L} = \mathcal{L}_s(x, y) + \lambda\big(\mathcal{L}_{us}(\hat{x}) + \mathcal{L}_{us}(x)\big),    (2.28)

where the two loss terms are defined as

\mathcal{L}_s(x, y) = \frac{1}{B} \sum_{i=1}^{B} \lVert G_\theta(x_i) - y_i \rVert_2^2, \qquad \mathcal{L}_{us}(\hat{x}) = \frac{1}{rB} \sum_{i=1}^{rB} \Bigg( \frac{1}{M} \sum_{m=1}^{M} \lVert T_m(G_\theta(\hat{x}_i)) - G_\theta(T_m(\hat{x}_i)) \rVert_2^2 \Bigg),    (2.29)

with B the supervised batch size, r the ratio of unsupervised to supervised samples in a training batch, {T_1, . . . , T_M} the set of transformations applied to each data point in the unsupervised training set, and λ the supervised-to-unsupervised weighting parameter. As can be seen in Equation 2.29, the unsupervised part of the loss function penalises the model for not making consistent predictions for transformed variations of the same image. Mustafa et al. [46] showed that this technique can indeed be used to improve model performance.
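As an illustration, the sketch below implements the unsupervised term of Equation 2.29 for the special case where the transformations T_m are 90° rotations, up to normalisation constants. Here model is a placeholder for G_θ and is assumed to return an output with the same spatial dimensions as its input.

import torch

def consistency_loss(model, images):
    # Unsupervised term of Equation 2.29 with T_m = rotation by m * 90 degrees.
    # Penalises inconsistent predictions on transformed copies of the same image.
    base = model(images)                                   # G_theta(x_hat), (B, C, H, W)
    rotations = (1, 2, 3)                                  # number of 90-degree turns
    loss = 0.0
    for m in rotations:
        rotated_pred = torch.rot90(base, k=m, dims=(2, 3))                 # T_m(G_theta(x_hat))
        pred_of_rotated = model(torch.rot90(images, k=m, dims=(2, 3)))     # G_theta(T_m(x_hat))
        loss = loss + ((rotated_pred - pred_of_rotated) ** 2).mean()
    return loss / len(rotations)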
3 Method

This section describes the preliminary experiments that led to our chosen training scheme, which we call PixMax. We also describe the data sources and the data splits that were used in the training, as well as the model, our proposed model training scheme and the metrics we have used to evaluate our models.

3.1 Data

This project investigates how a large amount of unannotated data can be used together with a smaller amount of annotated data, through the semi-supervised training framework, to create a better model than what could be achieved with the annotated data alone. Previous work has investigated how the performance increase depends on the fraction of unlabelled examples introduced when training the model. It has been shown that significantly better results can be reached by adding a large fraction of unlabelled examples, especially in the sparse setting [47], [48].

Since no really large dataset with both annotated and unannotated floor plan images currently exists, we have chosen to compose our data from two different sources. Both datasets contain floor plans, but they have their similarities and their differences, and the assumption that all data comes from the same distribution cannot necessarily be made. The following sections explain where the data comes from and how the discrepancies in the data are dealt with.

3.1.1 Annotated data - The CubiCasa5k dataset

The main source of annotated data in this project is a novel dataset called CubiCasa5k. It was first introduced by Kalervo et al. in their 2019 paper [1]. The dataset contains 5.000 labelled raster images fetched from image scans, divided into three categories based on their visual style, as can be seen in Figure 3.1. The annotation of the dataset is rich in terms of precision and the amount of information contained in each label. Altogether, there are around 80 different object types, and room labels are represented by polygons, in contrast to earlier datasets where rectangles are often used for simplicity [49].

The labels of the data have an intricate structure with three primary types of labels: junctions, rooms and icons. The junctions are pixel-accurate locations for interest points of the types shown in Figure 3.2. The rooms and icons are each represented by a pixel-wise segmentation map of room and icon classes.

Figure 3.1: Examples of the visual style of the images of the three categories in the CubiCasa5k dataset with their respective labels above. The images are scaled to fit the page format.

Figure 3.2: A visual representation of all the different annotation categories of the CubiCasa5k dataset. Junctions, openings and corners are lists of coordinates while rooms and icon categories are pixel-wise segmentation maps over the image.

Table 3.1: The structure of the output layers of the network.
Output maps (44 in total)
  Interest points (heatmaps): 21
    Wall corners: 13
    Opening endpoints: 4
    Icon corners: 4
  Segmentation maps: 23
    Room classes: 12
    Icon classes: 11

3.1.2 Unannotated data - The LIFULL HOME'S dataset

The LIFULL HOME'S dataset [11] is the biggest collection of floor plan image data available for research today. The dataset consists of about 5.31 million images in .jpg format. The images have a high variance in size, colour and quality, since the data has been collected over a period of time from multiple sources all over Japan. The architectural qualities of the floor plans, as well as many of the graphical representations used to convey information, are quite different from the western style used in the CubiCasa5k dataset.

Figure 3.3: Examples of the visual style of the images in the LIFULL HOME'S dataset. The images are scaled to fit the page format.
Figure 3.4: The resolution distributions of the different datasets used in the project. A simple random sample of 4000 image instances from each set was used.

The CubiCasa5k dataset is split into a training, a validation and a test set. The training set consists of 4200 data points, while the validation and test sets hold 400 images each. Each of the slices is assigned equal proportions of images from each of the visual style categories in order not to introduce any unwanted bias, see Figure 3.1. For convenience we chose to use this predefined partitioning. The LIFULL HOME'S dataset we had to slice ourselves, and we also had to manually annotate a small portion of it for testing purposes; how this was done is described in Section 3.3.3.

For the validation slice, used to determine when to terminate training and to provide the optimisation procedure with a time series of the model performance evolution during training, we concluded that we had three options:
• Using pseudolabels for the validation set.
• Manually labelling a big enough portion of the LIFULL HOME'S dataset to be used as a validation set.
• Using the validation slice of the CubiCasa5k dataset for all the experiments.

We judged that using pseudolabels as annotation in the validation set would be too unreliable, and that labelling a big enough portion of the LIFULL HOME'S dataset would be too time-consuming and outside the scope of this project. We therefore decided to use the validation slice of the CubiCasa5k dataset for validation in all our preliminary and final experiments.

3.2 Pseudolabelling

Our proposed method utilises semi-supervised learning through self-training by letting a model trained on labelled data make predictions on unlabelled data, and then using these predictions as a basis for creating pseudolabels for further training.

When run on an image, the model outputs its predictions as sets of pixel-wise feature maps that each correspond to either rooms, icons or interest points. In other words, we get a measure of the model prediction on each pixel for every class, but we want the labels to be in the same format as our pre-labelled examples. This leads us to the non-trivial task of picking a way of creating pseudolabels from the model output. Moreover, if we cannot make any of the assumptions stated in Section 2.3, we at least need to perform some kind of enhancement of the information in the model output when we create the pseudolabel, for there to be any reason to believe that the model will perform better after the continued training than after just being trained on the labelled dataset. If x, y is our labelled dataset and x̂ is our unlabelled data with corresponding pseudolabels ŷ, we can express this as

\min_{\hat\theta} L\big(G_{\hat\theta}(x), y \mid x \in \mathbf{x},\ y \in \mathbf{y}\big) \overset{?}{=} \min_{\hat\theta} L\big(G_{\hat\theta}(x), y \mid x \in \mathbf{x} \cup \hat{\mathbf{x}},\ y \in \mathbf{y} \cup \hat{\mathbf{y}}\big),    (3.1)

where θ are the parameters of the model after it has been trained on labelled data only. This is simply to say that if our unlabelled data does not tell us something about the true distribution of x, nothing new can be learnt by sole extrapolation of what is already known. The following sections explain the approaches we tested for finding a suitable way of creating enhanced pseudolabels that are better than the raw network output.

3.2.1 Statistical approach

The most obvious way of picking pseudolabels would be to use some probability measure on the model output to decide what information to keep.
The idea would be that predictions with a high likelihood of being correct are kept, while those with a high prediction uncertainty are discarded. This could, for instance, be done in the following way:
• For the heatmap classes (the corner classes): from all prediction instances, keep only those with a higher probability of being correct than a certain threshold.
• For the one-hot encoded classes (the room and icon classes, where each pixel must belong to one and only one class): for each pixel, pick the class that is most likely to be the true class.

The use of this approach can be motivated by the fact that if we select what information to include in our pseudolabels based on the probability of the information being correct, we can expect the information in our pseudolabels to be correct with that same probability on average. We can express this as

\mathbb{E}_{x \sim P_T(x)}[T(x)] = \int T(x)\, dP_T(x),    (3.2)

where P_T(x) is the probability density function of any finite statistic T(x) of the data x. In our case, T(x) would be the correctness of the model prediction with respect to the ground-truth label. Here, T(x) can be interpreted as what is known as a conformal predictor [50]. This means that by utilising this metric we could control the quality of our pseudolabels by requiring a higher or lower probability for the included information to be correct. For instance, we could require a confidence level α ∈ [0, 1) for all information we choose to include and get

\mathbb{E}\big[T(x) \mid P_T(x) > \alpha\big]\,\mathbb{E}\big[\llbracket P_T(x) > \alpha \rrbracket\big] \geq \frac{1}{\int_\alpha^1 dP_T(x)}\,\mathbb{E}[T(x)],    (3.3)

where ⟦·⟧ is the generalised Kronecker delta function:

\llbracket P \rrbracket = \begin{cases} 1 & \text{if } P \text{ is true} \\ 0 & \text{otherwise.} \end{cases}    (3.4)

However, to use this approach we need a good measure of how confident we can be that a prediction is correct. Since the model output for the one-hot classes is in the form of scores for all classes at each pixel, it is a reasonable assumption that there could be a correlation between the score of a class and the probability that the prediction of that class is correct. We can also transform the model output into something with the same form as a discrete probability distribution by passing it through the Softmax function, described in Section 2.4.1.

Figure 3.5: An example of what a correlation between the prediction certainty and correctness could look like. The pixels that the network is most sure about are to a high extent also the pixels that are classified correctly.

To confirm that the Softmax of the model output is indeed a good proxy for P_T(x), we had to run a few experiments. For the room and icon classification, this would mean that pixels with a high prediction certainty are classified correctly more often than those with a low prediction certainty. To see if this was the case, we used the absolute loss function

L_{ABS}(x, y) = \frac{1}{|P|} \sum_{p \in P} |y_p - x_p|_1    (3.5)

as our statistic T. This is the normalised pixel-wise distance function that returns the Manhattan distance between the network prediction x and the correct label y. Under the assumption of the described scenario, we would expect to see

\frac{\partial}{\partial \alpha} \mathbb{E}_{P_T(x) > \alpha}\big[L_{ABS}(x, y)\big] \leq 0 \quad \forall \alpha \in (0, 1).    (3.6)

This is to say that we expect the loss to be lower, and hence the quality of the information to be higher, when we are more conservative with what information to include with respect to the certainty of the prediction. An easy way to check whether this assumption holds is to use what is known as a calibration plot [50], [51], where the correctness is plotted as a function of the model certainty.
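As an illustration of how such a calibration check can be computed for the one-hot (room and icon) channels, the sketch below bins the pixels by their Softmax confidence and reports the accuracy within each bin. In our experiments the statistic was L_ABS, but the principle is the same; tensor shapes and names are illustrative assumptions.

import torch

def calibration_curve(probs, targets, n_bins=10):
    # probs: (B, C, H, W) per-pixel class probabilities, targets: (B, H, W) class ids.
    confidence, prediction = probs.max(dim=1)          # model certainty and predicted class
    correct = (prediction == targets).float().flatten()
    confidence = confidence.flatten()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    accuracies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence >= lo) & (confidence < hi)
        accuracies.append(correct[mask].mean().item() if mask.any() else float("nan"))
    # Plotting accuracies against the bin centres gives the calibration plot.
    return edges, accuracies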
The results of these experiments can be found in Section 4.1.1. They showed that the correlation was too weak to be useful for our purpose.

3.2.2 Post-processing technique

Another possible approach to the problem of creating a pseudolabel from the network's raw output is to run it through some kind of function that can enhance its quality. If we call this function q, this would mean that we could expect to see

L(G_\theta(x), y) \geq L\big(q(G_\theta(x)), y\big)    (3.7)

for some arbitrary loss function L that in some sense measures the quality of the prediction. Although the existence of such a function q does not seem too unreasonable, we need to address the question of why this transformation is not implicitly learned by the network through a different parameterisation θ̂, giving G_θ̂(x) = q(G_θ(x)), if it reduces the loss of the model. One possible explanation for why a function with these properties could exist is that it has access to different information than the model itself, which can be used to enhance its performance.

In the model that is used for this project, we can do exactly this. The model output is naturally divided into three distinct categories of predictions: pixel-wise room and icon segmentation maps, and heatmaps of junctions. This means that we can potentially use the junction heatmaps to infer better room and icon segmentations. If we for convenience define the network output z := G_θ(x) so that z_h represents the heatmap channels and z_{r,i} the room and icon channels, we can reformulate our criterion on a sufficient function q in Equation 3.7 to be explicitly dependent on z_h,

L_{r,i}(z_{r,i}) \geq L_{r,i}\big(q(z_{r,i} \mid z_h)\big),    (3.8)

where L_{r,i} is some loss function that only considers the room and icon channels. With this reasoning we can conjecture that a function with the properties of q may exist, but that is in no way a guarantee of its existence, and we have no general algorithm for finding it. In our setting, where we are concerned with the parsing of floor plans, there are however a few empirical observations that can be used to guide our search for a sufficient q. Some of these are the following:
• Rooms are made up of simple polygons with orthogonal corners.
• Icons are rectangles.
• Every region completely surrounded by walls and apertures contains only one room type.
• The region outside the outermost closed wall-loop is the background.

The hope is that we can use these observations, in combination with the output from the heatmaps, to create a prediction that is better than the raw network output in the sense that it gives a substantially lower loss for some loss function L. Ideally this should also be the case for the cross entropy loss L_CE, since this is what we use to train the network.

Kalervo et al. [1] proposed in their work a novel post-processing algorithm with the structure of q, which we will call q_cc. It is a procedural algorithm that aims to extract all elements of interest in the floor plan, including walls, rooms, openings and icons. It can be understood as four distinct steps executed in sequence:
• Inferring the wall skeleton. The algorithm starts by connecting pairs of junctions based on their position, type and orientation. This means that if two junctions are close to being vertically or horizontally aligned and have a joining direction facing each other, they are connected by a line.
The junctions are also batched together, so that multiple neighbouring junction points of the same type are mapped to a single point to avoid crowding. The result of this step is a "skeleton" of possible wall centre lines.
• Inferring walls. The wall skeleton is used together with the wall segmentation map to construct the final wall prediction. First the skeleton is pruned by removing lines that are not consistent with the wall segmentation, and the wall thickness is then decided based on the intensity profile of the wall segmentation map.
• Inferring rooms. Next, the processed room segmentations are calculated based on the outcome of the previous two steps. The algorithm searches for all junction triplets that span a rectangle without any junctions inside it, to create a grid over the interior of the floor plan. For each of the grid cells, a voting mechanism samples from the pixel predictions of the room segmentation maps to decide which room type to assign to the cell. Adjacent cells of the same class are merged if and only if there are no fully separating walls between them. The same mechanism is used to find the icons, but there the icon heatmaps and segmentations are used.
• Inferring apertures. The last step is to find the doors and windows by utilising the corresponding endpoint heatmaps. First, all points that do not coincide with wall segments in the processed wall segmentation map are discarded. The remaining points are then matched into window and door segments, and the width of each aperture is chosen to be the same as that of the host wall where it is located.

Figure 3.6: Examples of what q_cc does with the segmentation maps from the room channel. The examples are randomly sampled from the test set of the CubiCasa5k dataset. The most visually prominent change is that all segmentations have been translated into simple polygons.

Figure 3.7: The concept of how the post-processing algorithm q_cc works. Given the predicted junction heatmaps and the room and icon segmentations, q_cc can "clean up" the segmentations, e.g. by inferring a closed room between four suitable L-type corners.

As previously mentioned, we want the choice of q to give us better performance than the raw network output. To confirm that this can be expected with q_cc, we measure the loss before and after applying q_cc to the network prediction. The results from these experiments can be found in Section 4.1.2.

3.2.3 PixMax pseudolabelling technique

Similar to the method proposed in Section 3.2.2, we can choose an operator q that maps an input image to a batch of several different non-destructive rotations and flips of that image. The batch b := G_θ(q(x̂)) of B predictions can then be oriented back and combined to form our pseudolabel ŷ = q̂(b) for the given image in the unlabelled dataset. We use all unique orientations obtainable through 90° rotations and flips, which gives B = 8 combinations. Here, q̂(b) is set to the mode function, pixel-wise selecting the most common value in the batch b of predictions.

Also, instead of generating pseudolabels for all images in the unannotated dataset, we can try to find a scalar measurement of how accurate a prediction is. As a proxy for accuracy, we implement a function that calculates a scalar β expressing how confident the network is in its prediction on an image. We can then further improve the quality of the pseudolabels by only using the labels that satisfy β(x) > τ for a threshold hyperparameter τ.
Equations 3.9 and 3.10 describe this (denoted CONF(·) in Algorithm 1). Here c_{mc,j,i} is the value of the most common class at pixel (i, j) of the batch b, and c_{b,j,i} the value of the bth prediction; ⟦·⟧ is the generalised Kronecker delta function (Equation 3.4):

\beta_{j,i} = \frac{1}{B} \sum_{b \in B} \llbracket c_{b,j,i} = c_{mc,j,i} \rrbracket,    (3.9)

\bar{\beta} = \frac{1}{HW} \sum_{j,i} \beta_{j,i}.    (3.10)

The initial results in Section 4.1.3 were promising, and this method was chosen for the PixMax self-training.

3.3 PixMax self-training

Once we have found a way of generating pseudolabels of good quality, we can use it in a self-training scheme to further improve a supervised model. As can be seen in Figure 3.8, the PixMax model training scheme consists of three phases: a supervised learning phase, a pseudolabelling phase and a semi-supervised self-training phase. In the supervised training phase, a model is trained until convergence in a purely supervised manner. We then use this model to create pseudolabels for an unlabelled dataset with features similar to the original dataset. We do this by combining predictions for multiple light augmentations of each image into a single label, as described in Section 3.2.3. If the label passes the β-threshold check, it is accepted and will be used as a pseudolabel in the next phase of the model training. In the last phase, the trained model's weights and biases are copied but new hyperparameters are initialised. The model is once again trained until convergence, now with a combined dataset consisting of both the original labelled dataset and the accepted portion of the pseudolabelled dataset.

The following sections describe how we implement the network model, what augmentations we apply to diversify the data, and the details of how we evaluate the performance of the trained models.

Figure 3.8: The PixMax model training scheme. Light blue: the labels for the labelled dataset. Dark blue: the images and predictions for the labelled dataset. Dark green: the images and model predictions for the images in the unlabelled dataset. Light green: the pseudolabels created by the model in the pseudolabelling phase.

3.3.1 Network model

The model used in [2] has achieved state-of-the-art results, and we have therefore chosen it as our model. The model converts the floor plan image through two intermediate representation layers. The first step is the network inference step, which outputs 44 maps of interest points and pixel-wise semantics. Second, these are converted through integer programming (IP) to form a set of geometric primitives. Note that the sole purpose of the interest points is to provide points from which the geometric primitives are constructed. Finally, a post-processing step is applied to form the vectorized output of geometries with class labels. The architecture of the model is borrowed from the ResNet-152 [12] model with an altered output layer, as shown in Table 3.1. Figure 3.9 shows the high-level structure of the model.

Algorithm 1: PixMax pseudolabelling scheme
Input: Set of unlabelled images x̂ = {x̂_i : i ∈ {1, . . . , M}}. Pre-trained model G_θ with parameters θ.
Parameters: Threshold τ for the β metric.
for x̂_i ∈ x̂ do
    b ← G_θ(q(x̂_i))           (batch of predictions)
    ŷ_i ← q̂(b)                (inverse augmentation)
    ŷ_mcp ← MODE(b)            (most common prediction)
    β_i ← CONF(ŷ_mcp, ŷ_i)     (calculate confidence)
end for
Output: Images with pseudolabels {(x̂_i, ŷ_i) : i ≤ M, β_i ≥ τ}.

Figure 3.9: A simplified illustration of the architecture of the model used. Image from [1].
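The sketch below is a simplified Python/PyTorch rendering of Algorithm 1 for a single image and a single segmentation head; the real model has separate room, icon and heatmap channels, which are handled analogously. Function names and the default threshold value are illustrative and not taken from the project code.

import torch

def orientations(image):
    # All 8 unique orientations: 4 rotations times an optional horizontal flip.
    for flip in (False, True):
        base = torch.flip(image, dims=(-1,)) if flip else image
        for k in range(4):
            yield k, flip, torch.rot90(base, k=k, dims=(-2, -1))

def invert(prediction, k, flip):
    # Map a prediction made on a transformed image back to the original frame.
    out = torch.rot90(prediction, k=-k, dims=(-2, -1))
    return torch.flip(out, dims=(-1,)) if flip else out

def pixmax_pseudolabel(model, image, tau=0.97):
    # Predict on all orientations, take the pixel-wise mode as the pseudolabel and
    # accept it only if the mean agreement (Equation 3.10) exceeds the threshold tau.
    model.eval()
    with torch.no_grad():
        preds = [invert(model(aug.unsqueeze(0)).argmax(dim=1)[0], k, flip)
                 for k, flip, aug in orientations(image)]
    batch = torch.stack(preds)                             # (B=8, H, W) class predictions
    pseudolabel, _ = batch.mode(dim=0)                     # pixel-wise most common class
    agreement = (batch == pseudolabel).float().mean(dim=0) # beta_{j,i}, Equation 3.9
    beta = agreement.mean().item()                         # beta-bar, Equation 3.10
    return (pseudolabel, beta) if beta >= tau else (None, beta)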
3.3.2 Data diversity augmentations

Computer vision tasks benefit greatly from having augmentations applied to the data, since this generates a more diverse dataset from which the model can better generalise. Hence, we have used a few common augmentations to vary the data during training; a minimal sketch of the image-side transformations is given after the list below. All images are either cropped or resized to a target size. This speeds up the training of the network and also allows the network to learn features from different scales of floor plan drawings.
• Resize with padding. Resizing the image to the target size, keeping the aspect ratio intact by padding the needed pixels with zeros (black pixels and no label).
• Crop to size. Cropping out a part of the image and label to get a smaller piece for the training.
• Rotations. Random 90° rotations of the image and label.
• Colour adjustments. Weakly adjusting the brightness, contrast and saturation of the image.
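A minimal sketch of how these transformations can be implemented with a recent torchvision is shown below. The target size and jitter strengths are illustrative assumptions and not the values used in the project, and in the real pipeline every geometric transformation is applied jointly to the image and its label maps.

import torch
import torchvision.transforms.functional as TF

def resize_with_padding(image, target=512):
    # Resize keeping the aspect ratio, then zero-pad the right/bottom to a square.
    _, h, w = image.shape
    scale = target / max(h, w)
    image = TF.resize(image, [int(round(h * scale)), int(round(w * scale))])
    pad_w, pad_h = target - image.shape[-1], target - image.shape[-2]
    return TF.pad(image, [0, 0, pad_w, pad_h])    # [left, top, right, bottom]

def random_augment(image, label):
    # Joint image/label augmentation: random 90-degree rotation plus weak colour jitter.
    k = int(torch.randint(0, 4, (1,)))
    image = torch.rot90(image, k=k, dims=(-2, -1))
    label = torch.rot90(label, k=k, dims=(-2, -1))
    image = TF.adjust_brightness(image, 0.9 + 0.2 * torch.rand(1).item())
    image = TF.adjust_contrast(image, 0.9 + 0.2 * torch.rand(1).item())
    return image, label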
3.3.3 Evaluation datasets

A model that has been trained on the CubiCasa5k [1] dataset in an exclusively supervised fashion is used as the primary benchmark for our semi-supervised model. It is reasonable to expect that this model will have a positive bias in its performance on the CubiCasa5k dataset relative to our semi-supervised model. To get a reliable measure and a fair comparison of the performance of our model, we therefore needed to be able to evaluate models on both the labelled and the unlabelled datasets.

For our unlabelled dataset, we consequently needed to create enough labels to be able to evaluate the models' performance ourselves. For this task we used an online image-annotation tool called Labelbox (https://labelbox.com/) that supports polygon segmentation of images. We selected a random subset of 50 images from the LIFULL HOME'S dataset and manually annotated them with the same set of room and icon classes as the CubiCasa5k dataset. To not introduce unnecessary bias in the results, we tried to follow the same annotation style and conventions as far as possible. All our annotations are publicly available at github.com/xRadne/LH_annotations.

3.3.4 Evaluation metrics

We have chosen a set of performance metrics to evaluate our model in accordance with common practice for semantic segmentation [52]; a computational sketch of the metrics is given at the end of this section. Although our model predicts both room and icon segmentation maps as well as heatmaps of interest points, we have chosen not to evaluate the model's performance on the heatmaps, for several reasons. Annotating interest points is a very time-consuming task, and using a very small test set would result in wide confidence intervals that do not provide much information about the model's actual quality. Also, this information is captured indirectly by the segmentation evaluations of the polygonised (post-processed) predictions, since these are constructed using the interest points (Section 3.2.2). Moreover, to convert the heatmaps to discrete point locations a threshold value has to be chosen, which makes the recall and accuracy values somewhat arbitrary. (For the evaluation to be fair, we would need a metric based on the distance between each predicted point and the closest ground-truth point of the same category, which would require converting the heatmaps to a boolean map for each layer through the use of a threshold function.)

With C classes and n_{i,j} the number of pixels of class i predicted to be of class j, our chosen metrics can be described in the following way:

• Overall accuracy - the ratio of correctly classified pixels.

\mathrm{Overall\ Acc} = \frac{\sum_i n_{i,i}}{\sum_i \sum_j n_{i,j}}    (3.11)

• Frequency weighted average accuracy - the class-wise intersection over union, weighted by the occurrence frequency of each class.

\mathrm{FreqW\ Acc} = \frac{1}{\sum_i \sum_j n_{i,j}} \sum_i \frac{n_{i,i} \sum_j n_{i,j}}{\sum_j n_{i,j} + \sum_j n_{j,i} - n_{i,i}}    (3.12)

• Mean intersection over union - the overlap of the predicted and true pixels, averaged over all classes.

\mathrm{Mean\ IoU} = \frac{1}{C} \sum_i \frac{n_{i,i}}{\sum_j n_{i,j} + \sum_j n_{j,i} - n_{i,i}}    (3.13)

These metrics are picked to give a fair picture of the model performance. Since some of the room classes are heavily under-represented compared to other classes in both of the datasets used, it is a legitimate hypothesis that the model will not be able to recall these classes to the same extent as the others. For this reason, the frequency weighted average accuracy is in a sense the metric that gives the most nuanced picture of the actual performance of the model.

All models are evaluated both with and without Test Time Augmentation (TTA) for a complete result. TTA is a set of light augmentations applied to each image in the test set during model evaluation. The final model prediction is then defined as the pixel-wise most common class prediction. This technique reduces statistical fluctuations and therefore gives a better prediction on average.

In addition to this, we also evaluated the models' mean precision and mean recall for rooms and icons. These metrics can be described as follows:
• Mean precision - the fraction of relevant instances among the retrieved instances.

\mathrm{Mean\ Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}    (3.14)

• Mean recall - the fraction of relevant instances that were retrieved.

\mathrm{Mean\ Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}    (3.15)
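A minimal sketch of how these segmentation metrics can be computed from a confusion matrix is given below, with Equations 3.14 and 3.15 treated class-wise on the pixel level. It follows the standard formulation and is not necessarily identical to the evaluation code used in the project.

import numpy as np

def segmentation_scores(conf):
    # conf[i, j] = number of pixels of true class i predicted as class j.
    conf = conf.astype(float)
    n_ii = np.diag(conf)
    true_per_class = conf.sum(axis=1)     # pixels belonging to each class
    pred_per_class = conf.sum(axis=0)     # pixels predicted as each class
    total = conf.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = n_ii / (true_per_class + pred_per_class - n_ii)
        precision = n_ii / pred_per_class
        recall = n_ii / true_per_class
    overall_acc = n_ii.sum() / total                     # Equation 3.11
    freqw_acc = np.nansum(true_per_class * iou) / total  # Equation 3.12
    mean_iou = np.nanmean(iou)                           # Equation 3.13
    return overall_acc, freqw_acc, mean_iou, np.nanmean(precision), np.nanmean(recall)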
3.4 Implementation details

3.4.1 Hardware specifications

The code for this project was written in Python with PyTorch v1.6.0 for the deep learning implementations. This framework was chosen since the earlier work that the project is based on uses it, and it is also one of the currently most popular and widely supported frameworks for building deep learning models. The code for this project can be found at .

We had access to GPU clusters for running our code throughout the second part of the project. For model evaluations and small-scale experiments we used a cluster with an Nvidia Tesla T4 GPU with 16 GB RAM, and for pseudolabel creation and model training we used GPU clusters with the following specifications:
• Nvidia Tesla V100 SXM2 GPU with 32 GB RAM
• 2 x 8-core Intel Xeon Gold 6244 CPU @ 3.60 GHz (16 cores in total)
• 387 GB SSD scratch disk

The training time naturally varied between runs. The maximum training time of about 9 hours was for a run of 100 epochs with a combined dataset of about 14.000 data points, consisting of both annotated and pseudo-annotated images.

3.4.2 Experimental setup for model training

Here we list the settings and hyperparameters that have been kept consistent across all runs, as well as the search ranges for the parameters we have tried to optimise. Since optimising over all parameters would result in a search space too large for the scope of this project, we have limited our search for the best model candidate to the subset of parameters that we think are the most rewarding and interesting for our purpose.

For all experiments, the pretrained model from [1] has been used as the starting point, but with new hyperparameters. The model is a ResNet-152 [12] with a modified output layer that has been pretrained on ImageNet [13] and the MPII Human Pose dataset [53]. The ADAM optimiser, as described in Section 2.1.2, was used with an initial learning rate of 1e-4 and a scheduled learning rate drop governed by the factor 0.5^(k_curr/k_max), where k_curr and k_max are the indices of the current and last epoch respectively. The initial learning rate was picked according to what we found to be conventional, and the learning rate drop schedule was picked based on empirical results from our preliminary experiments. We discovered early that the model's performance rarely changed significantly after more than 80 epochs of continued training from the pretrained model state. For this reason we chose to train all models for no more than 100 epochs from the pretrained state. For all runs, a subset of the CubiCasa5k [1] dataset previously unseen by the model was used as the validation set. The reason for this was simply the lack of a big enough manually labelled set of LIFULL HOME'S [11] images to be able to determine a reliable stopping criterion for the training.

The model we label Ours in Section 4.2 is the best-performing model from our final set of runs. Since we hypothesised that the β-threshold was an important hyperparameter, we trained models with threshold values ranging from 0.95 to 0.99 in intervals of 0.01, with everything else kept identical. From our preliminary experiments we saw that both the reference model and the models that we tested were heavily biased towards predicting false positives for the undefined room class. To take this into account, we tried training all our models with a setting that ignores the contribution of the undefined room class to the loss and then reweights the loss to compensate for the proportion of the image removed. Figure 3.10 shows how the overall accuracy varies between all our final models, and similar plots for the other performance statistics can be found in Table 4.3 and Appendix A.

Figure 3.10: The overall accuracy on the LIFULL HOME'S test set for models trained with different β-thresholds for selecting which pseudolabels to use. Blue: models trained using the ignore-index setting described