Incorporating Interior Property Images
for Predicting Housing Values

Master’s thesis in Data Science and AI

Adrian Gortzak

Nedim Can Ulusoy

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2024



© Adrian Gortzak, Nedim Can Ulusoy, 2024.

Supervisor: Milad Malekipirbazari, Computer Science and Engineering
Advisor: David Magnusson, Valueguard Index Sweden AB
Examiner: Aila Särkkä, Mathematical Sciences

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Visual features as part of the comparative market analysis tool.


Abstract
The property valuation process for the real estate market is essential for predicting
a fair market value. This process is traditionally carried out by brokers, including
inspecting and assessing the subject property to find comparable sales for compara-
tive market analysis (CMA). Meanwhile, an automated valuation model (AVM) can
help achieve an autonomous version of this process, which speeds up the process but
lacks some of the inputs that a manual assessment provides. AVMs have difficulty
considering more subjective architectural qualities, such as beauty, stability, and
utility, because these aspects are difficult to quantify objectively. Recent advancements
in Vision Transformers (ViT), self-supervised learning, and Contrastive
Language-Image Pre-training (CLIP) have shown favourable improvements in the
field of computer vision. Therefore, this study explores whether these techniques
can improve the extraction of visual features from interior images and thereby
enhance AVMs. By applying ViTs as binary classifiers, for clustering, and for
matching images against textual descriptions, we aim to enrich the feature extraction
process for a property valuation model in the region of Uppsala County, Sweden.
Our findings show modest enhancements in the AVM’s performance, which align with prior
studies, but also highlight that these new technologies can extract more detailed fea-
tures compared to previous methods. Furthermore, they demonstrate the potential
for these technologies to capture more comprehensible architectural qualities from
images, which could significantly assist brokers in the valuation process.

Keywords: Computer Vision, Transformers, Feature Extraction, Machine Learning,
Deep Learning, Real Estate, Automated Valuation Models, Architectural Qualities.



Acknowledgements
Firstly, we want to extend our sincere gratitude to Milad Malekipirbazari, our aca-
demic supervisor, for quickly providing suggestions and answers to our inquiries.
Additionally, he suggested alternatives and supplied a solid foundation in the field
of AI and ML while still being patient with us.

Secondly, we would also like to show appreciation and gratitude to Valueguard
Index Sweden AB for providing an exciting research area, hardware access, and an
educative process. Moreover, we would like to thank David Magnusson, our company
supervisor, for his engagement, fast support and industry expertise in guiding the thesis
forward and resolving issues along the way.

Finally, we want to express our heartfelt gratitude to all the individuals who have
contributed feedback and input throughout the thesis.

Adrian Gortzak & Nedim Can Ulusoy, Gothenburg, 2024-06-17

I want to express my heartfelt appreciation to my partner, Sandra, for her invaluable
help and emotional support throughout the thesis. I am deeply grateful for her
encouragement and support.

Adrian Gortzak, Gothenburg, 2024-06-17

I would like to express my heartfelt gratitude to my family for their unwavering
support and encouragement throughout the entirety of my academic journey.

Nedim Can Ulusoy, Gothenburg, 2024-06-17



Contents

List of Figures

List of Tables

1 Introduction
1.1 Background
1.2 Problem
1.3 Outcomes
1.4 Structure of the Thesis

2 Theory
2.1 Property Valuation
2.1.1 Market Value in Real Estate
2.1.2 Comparative Market Analysis
2.1.3 Automated Valuation Models
2.2 Features Impacting Property Price
2.2.1 Architectural Quality
2.2.2 Factors Connected to Market Value in the Location
2.2.3 Factors Connected to Market Value in the Property
2.2.4 Property Type Specific Factors
2.2.5 Other Factors Connected to Price
2.3 Images in the Sales Process
2.4 Interior Visible Features
2.5 Limitations and Challenges
2.6 Review of Similar Studies
2.6.1 Methods
2.6.2 Previous Attempts
2.7 Computer Vision
2.7.1 Convolutional Neural Networks
2.7.2 Transformers
2.8 Gaps in the Research
2.9 Future Directions

3 Methods
3.1 Research Plan
3.2 Limitations and Scope
3.3 Technologies
3.4 Data
3.5 Valueguard Index Sweden AB
3.5.1 Ethical Considerations
3.5.2 Metadata
3.5.2.1 Location
3.5.2.2 Base Features and Target
3.6 Room Types
3.7 Labelling
3.8 Data Pre-Processing
3.9 Deep Neural Networks
3.10 Visual Target Features
3.10.1 Binary Classification
3.10.2 Clustering
3.10.3 Contrastive Language-Image Pre-Training
3.11 Utilising Visual Features in the Model
3.12 Automated Valuation Model
3.13 Scores

4 Results
4.1 Classification of Images
4.2 Self-Supervised Models
4.3 Feature Extraction
4.3.1 Binary Classifier
4.3.2 Clustering Features Found
4.3.3 CLIP Features Explored
4.4 Sales
4.5 Automated Valuation Model
4.5.1 Apartments
4.5.2 Houses
4.5.3 Feature Importance
4.5.3.1 Apartment
4.5.3.2 House
4.5.4 Visual Features Importance
4.5.4.1 Apartment
4.5.4.2 House

5 Conclusion
5.1 Summary of Results
5.2 Discussion
5.3 Contributions
5.4 Limitations of the Study
5.5 Practical Implications
5.5.1 Recommendations for Future Research
5.5.2 Conclusive Summary

Bibliography

A CLIP features

B Neural Network model architectures


List of Figures

1.1 Blueprint for the system’s process flow

2.1 Highlighting comp selection problem
2.2 Example comp (Desired)
2.3 Example comp (Undesired)

3.1 Uppsala County on OpenStreetMap [60]
3.2 Geographical areas from Statistics Sweden on OpenStreetMap
3.3 H3 Index with different resolutions on OpenStreetMap
3.4 Room types explored
3.5 Stage 1
3.6 Stage 2
3.7 An instance of labeling process
3.8 Pipeline for the AVM model
3.9 Neural Network with two-dimensional input
3.10 Neural Network with one-dimensional input
3.11 Histogram of the percentage error
3.12 Histogram of the standard point
3.13 Use of the CLIP model to generate spaciousness score [106]

4.1 Confusion matrix of interior and exterior classifications
4.2 Number of images in each room type used in the study
4.3 Pre-trained ViT model attention on a kitchen example
4.4 Our Self-supervised ViT model attention on a kitchen example
4.5 Sales of property type before and after filtering
4.6 Apartment sales in Uppsala County
4.7 House sales in Uppsala County
4.8 Top 30 features - apartment AVM [Ridge] - With base features
4.9 Top 30 features - apartment AVM [XGBoost] - With base features
4.10 Top 30 features - house AVM [Ridge] - with base features
4.11 Top 30 features - house AVM [XGBoost] - with base features
4.12 Top features - apartment AVM [Ridge] - with only cluster features
4.13 Top 30 features - apartment AVM [Ridge] - with only CLIP features
4.14 Top 30 features - apartment AVM [XGBoost] - with CLIP features
4.15 Top 30 features - apartment AVM [XGBoost] - only CLIP features
4.16 Top features - house AVM [Ridge] - only cluster features
4.17 Top 30 features - house AVM [Ridge] - only CLIP features
4.18 Top 30 features - house AVM [XGBoost] - with CLIP features
4.19 Top 30 features - house AVM [XGBoost] - only CLIP features

B.1 Neural Network model structure for apartment AVM
B.2 Neural Network model structure for house AVM
B.3 Head of the Neural Network classification model for percentage Error and Standard Points
B.4 Head of the Neural Network classification model - interior and exterior
B.5 Head of the Neural Network Classification Model - room type


List of Tables

3.1 Housing metadata
3.2 Apartment metadata

4.1 Best thresholds and F1 scores for different rooms
4.2 Area under the curve score for the room types on percentage error
4.3 Area under the curve score for each room type on standard point
4.4 Visual clustering features selected with beauty (B) and utility (U)
4.5 Examples of the CLIP features selected
4.6 Performance metrics on Ridge for apartment AVM
4.7 Paired sample t-Test results for apartment AVM with Ridge
4.8 Performance metrics on XGBoost for apartment AVM
4.9 Paired sample t-Test results for XGBoost
4.10 Performance metrics on Neural Network for apartment AVM
4.11 Paired sample t-Test results on Neural Network for apartment
4.12 Performance metrics on Ridge for house AVM
4.13 Paired sample t-Test Results for house Ridge
4.14 Performance metrics on XGBoost for house AVM
4.15 Paired sample t-Test results for house XGBoost
4.16 Performance metrics on Neural Network for house AVM
4.17 Paired sample t-Test results for house Neural Network MAPE

A.1 CLIP features - bedroom
A.2 CLIP features - bathroom
A.3 CLIP features - kitchen
A.4 CLIP features - living room
A.5 CLIP features - dining room


1
Introduction

This section introduces our research topic of visual feature extraction in real es-
tate, outlines the research question, and provides a brief background on automated
valuation models and visual feature extraction with deep neural networks.

1.1 Background
Property value assessment is an essential part of the real estate field, ensuring that
both buyer and seller get a fair price for what is typically one of the most significant
investments in their lifetime. The aim of property value assessment is to predict
market value, which is the expected value of a property under normal conditions [1].
A broker conventionally does this with the help of a manual Comparative Market
Analysis (CMA), which uses comparable sales to derive the market value [1].

While thorough, the traditional method can be time-consuming, especially when
comparing architectural qualities that require an assessment of utility, stability, and
beauty. Quantifying these aspects automatically poses a challenge, often requiring a
manual visual comparison of the images.

To address this challenge, the real estate field has seen significant advancements
in Automated Valuation Models (AVMs), which estimate a property’s market value
efficiently and accurately [2]. While AVMs primarily rely on easily quantifiable
data, such as the living area and the number of bedrooms, improvements in
computer vision and pattern recognition have made it feasible to incorporate
unstructured data, such as images, as part of the AVMs’ input.

Previous studies have tried incorporating visual features from images [3], [4]. They
have focused on exterior and interior images, using techniques such as Convolutional
Neural Networks (CNNs) [5] for pattern recognition related to market value. While
these studies have shown feasibility, they have also shown only modest improvements
[3].

Building on this foundation, recent advances in computer vision, notably the
development of Vision Transformers (ViT) [7], have shown promising results in
quantifying aesthetics and have outperformed earlier state-of-the-art CNN models [6].

Furthermore, self-supervised methods, such as Self-Distillation with No Labels (DINO)
[8] and A Simple Framework for Contrastive Learning of Visual Representations
(SimCLR) [9], have demonstrated the ability to learn robust features from images
without relying on labelled data. They achieve this by using parts or transformed
versions of an image during training, with the model learning to recognise views
that originate from the same source image. These methods reduce the need for a
large labelled image dataset to create robust models for feature extraction.

Additionally, the arrival of Contrastive Language-Image Pre-Training (CLIP) [10],
which jointly trains an image encoder, such as a ViT, and a text encoder so that
matching images and texts receive similar embeddings, makes scoring images against
textual input feasible without additional training or provided examples.
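
To illustrate how such text-based scoring works in principle, the following minimal sketch compares one image embedding with the embeddings of two text prompts using cosine similarity followed by a softmax. The embeddings are made-up toy vectors rather than outputs of a real CLIP model, and the function name `prompt_scores` and the temperature value are illustrative assumptions, not details taken from this thesis.

```python
import numpy as np


def prompt_scores(image_emb, text_embs, temperature=0.07):
    """Return a softmax over cosine similarities between one image and several prompts.

    The temperature is an assumed value used only for illustration.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_embs = np.asarray(text_embs, dtype=float)
    # Normalise to unit length so that the dot product equals the cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()  # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy embeddings: the image vector points roughly in the direction of the first prompt.
image = [0.9, 0.1, 0.2]
prompts = [
    [1.0, 0.0, 0.1],  # hypothetical embedding of "a spacious kitchen"
    [0.0, 1.0, 0.3],  # hypothetical embedding of "a cramped kitchen"
]
scores = prompt_scores(image, prompts)
```

In this toy setting, the first prompt receives the higher score because its embedding is closer in direction to the image embedding; with a real model, both embeddings would come from the trained image and text encoders.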

1.2 Problem
This thesis aims to solve the problem of extracting and incorporating visual features
related to architectural qualities by utilising advancements in computer vision,
particularly ViTs, to reduce the uncertainty in AVMs. Through methods such as
classification, clustering, self-supervised learning, and CLIP, this research aims to
extract more comprehensive information from interior images, surpassing the
limitations of previous techniques. These improvements could potentially revolutionise
both automated and manual valuation processes.

Another problem with similar studies is the impact of cultural differences and
preferences when focusing on different countries. Preferences for particular
architectural styles, access to materials, and the functionality of rooms may cause
significant shifts in data distribution. For instance, due to differing market dynamics
and European trends, the patterns found in the United States study [11] may not be
directly applicable or effective in the Swedish housing market.

Therefore, this thesis is an exploratory study that quantifies visual features from listing
images using Deep Neural Networks (DNNs), such as CNNs and ViTs, to improve
AVMs in Sweden. The findings are intended to answer the research question:

RQ: Which visual features predict market value most significantly in Sweden?

1.3 Outcomes
This project aims to create an extraction system similar to the flowchart in Figure
1.1. It first takes images as input and classifies the type of room in each
image. The final step is to extract features that are relevant to
the architectural qualities, using models tailored to the specific room type identified
and utilising deep learning techniques, such as CNNs and ViTs, in the extraction
process. The extracted features are intended as inputs for the AVMs, allowing for
comparison with AVMs lacking visual features. In addition to assessing changes
in accuracy, the importance of the features will be extracted from the model and
compared to show the impact of the extracted features.


Figure 1.1: Blueprint for the system’s process flow

1.4 Structure of the Thesis
The thesis begins with a theoretical summary of the field of housing
valuation and previous attempts to use images to enhance the performance of AVMs.
It also explores the visual features connected to market value in Sweden,
their importance, and the visual features used in similar studies in other regions.
The aim is to highlight the most promising techniques and visual features to
focus on, based on their proven impact on housing valuation, to answer our research
question.

Given this deeper understanding of the Swedish housing market and previous
research on visual extraction, the methodology section describes the approach used
in this thesis and the pre-processing necessary to focus on different room types
individually. Additionally, it describes the different automated valuation
models and the visual feature extraction techniques applied.

Following the description of the methodology, the results
section presents the sales and image data used. Additionally, it discusses the
features identified from the different models in this study and their correlation with
market value. The thesis ends with a conclusion section, where the results from
the experiments are discussed and directions for further exploration are
recommended.



2
Theory

This section introduces the field of housing valuation, current manual and automated
methods to estimate market value, and the potential use of visual data to enhance
these models. Additionally, previous research utilising images and computer vision
within the valuation process will be highlighted.

2.1 Property Valuation
The property valuation process holds significant importance across numerous fields.
In the private sector, property purchases are among the most expensive decisions
an individual makes in their lifetime [12]. Ensuring a fair price is essential for both
the seller and buyer during negotiations and transactions [13].

Unlike other major purchases, such as a car, a property serves not only as a utility
but also as an investment with a potential resale value that historically tends to
increase over time [14], [15]. This dual nature of property as both a necessity and
an investment underscores its significance as future capital appreciation becomes a
key consideration [16]. Furthermore, property valuation plays a crucial role in risk
assessments [17], and is utilised by banks during loan assessment [18].

2.1.1 Market Value in Real Estate
In the real estate field, market value refers to the most probable selling price in a fair
and open market, without coercion or personal relations between the seller and buyer,
and with enough marketing time [1]. This implies equal access to information and
opportunities for all parties involved in the selling process. It also implies that the
broker has enough preparation time and that the seller is patient, without urgency
in the selling process. Additionally, the lack of a personal
relationship between the broker, the buyer and the seller ensures that the price does not
depend on such relationships, thereby preventing potentially biased pricing.

Therefore, the sales can be seen as samples from a distribution whose mean is the
market value, which cannot be observed directly. The sales price is thereby used as the
target, with the aim of estimating the market value, assuming that all the sales
follow the market value conditions.

Consequently, a slight difference between the estimated market value and the selling
price is expected. However, a significant difference between the two may indicate either
a poor estimate or a sale that did not follow the market value conditions [1].

2.1.2 Comparative Market Analysis
In Sweden, the broker usually performs a manual appraisal of houses and apartments
using a comparative market analysis (CMA) [1]. This process determines the
estimated market value of the subject property by finding comparable property
sales, referred to as comps, in an area that also shares similar characteristics.

The sale date of the comps must also be close to the valuation date to be comparable
due to market trends and price changes over time [1]. Alternatively, if it is necessary
to use an older sale, a housing index, such as the one derived from the Hedonic
Regression Model [19], can be used to track changes over larger areas while controlling
for the features of the properties to isolate changes over time. Thereafter, the
index can be used to adjust the expected market value to the valuation date, which may
be either the current or a past date, depending on the specific valuation requirements.
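
One simple way to apply such an index is to scale the old sale price by the ratio of the index values at the two dates. The sketch below assumes this ratio-based adjustment; the function name and the index values are hypothetical, and the exact adjustment procedure used in practice is not specified in the source.

```python
def adjust_price_with_index(sale_price, index_at_sale, index_at_valuation):
    """Scale a past sale price to the valuation date using a price index.

    Assumes a simple ratio adjustment, which is an illustrative choice.
    """
    if index_at_sale <= 0:
        raise ValueError("index_at_sale must be positive")
    return sale_price * index_at_valuation / index_at_sale


# Hypothetical example: the index rose from 200 to 210 between sale and valuation.
adjusted = adjust_price_with_index(3_000_000, 200.0, 210.0)
```

With these made-up numbers, the index rose by 5 percent, so the adjusted price is 5 percent above the original sale price.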

To derive the estimated market value, one can use either the average sale price per
square meter of the comps or the Purchase Price Coefficient (K/T) value, which is
the sale price divided by the taxation value for the comps. Depending on the chosen
sub-method, this is then multiplied by the subject property’s corresponding living
area or taxation value [1].
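
The two sub-methods can be sketched as follows. This is a minimal illustration with made-up comps; the function names and dictionary keys are hypothetical, and averaging the per-comp ratios (rather than, for example, dividing the summed prices by the summed areas) is an assumption.

```python
from statistics import mean


def estimate_by_price_per_sqm(comps, subject_area):
    """Average price per square meter of the comps times the subject's living area."""
    return mean(c["price"] / c["area"] for c in comps) * subject_area


def estimate_by_kt(comps, subject_tax_value):
    """Average K/T (price divided by taxation value) times the subject's taxation value."""
    return mean(c["price"] / c["tax_value"] for c in comps) * subject_tax_value


# Hypothetical comps (prices and taxation values in SEK, living area in square meters).
comps = [
    {"price": 3_000_000, "area": 100, "tax_value": 2_000_000},
    {"price": 2_000_000, "area": 80, "tax_value": 1_000_000},
]
value_sqm = estimate_by_price_per_sqm(comps, subject_area=90)
value_kt = estimate_by_kt(comps, subject_tax_value=1_600_000)
```

The two sub-methods generally give different estimates for the same comps, since one scales with living area and the other with taxation value.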

Suppose there is a significant difference between the comps and the subject property,
for example, a worse condition or an addition that makes them different. In such a
situation, the broker can make additional adjustments to align the price with the
predicted market value [1].

Predicting the market value accurately is generally difficult [20], considering the real
estate market’s very competitive nature and frequent price fluctuations [16].

2.1.3 Automated Valuation Models
The AVM is an automated way of estimating the market value without doing a
deeper and more time-consuming analysis manually, especially when large numbers
of properties need to be assessed. In addition, it provides an automated early indi-
cator for potential sellers before a broker does a deeper analysis. Multiple types of
models can be used to do this process autonomously.

Firstly, a straightforward AVM approach imitates the traditional manual evaluation
process using a comparable approach. This approach entails a k-nearest neighbour-
like search for comps that can then be used to estimate the subject property auto-
matically [21].
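
A comparable-based AVM of this kind can be sketched as a nearest-neighbour search over standardised features, followed by averaging the prices of the selected comps. The features, prices, and the choice of Euclidean distance and a plain average are illustrative assumptions, not the method of [21] or of this thesis.

```python
import numpy as np


def knn_comp_estimate(subject, comp_features, comp_prices, k=3):
    """Estimate a price as the mean price of the k most similar comps."""
    comp_features = np.asarray(comp_features, dtype=float)
    comp_prices = np.asarray(comp_prices, dtype=float)
    subject = np.asarray(subject, dtype=float)
    # Standardise each feature so that no single unit dominates the distance.
    mu = comp_features.mean(axis=0)
    sigma = comp_features.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant features
    z_comps = (comp_features - mu) / sigma
    z_subject = (subject - mu) / sigma
    distances = np.linalg.norm(z_comps - z_subject, axis=1)
    nearest = np.argsort(distances)[:k]
    return comp_prices[nearest].mean(), nearest


# Hypothetical comps described by (living area in square meters, number of rooms).
features = [[50, 2], [55, 2], [120, 5], [52, 2], [110, 4]]
prices = [2_000_000, 2_200_000, 5_000_000, 2_100_000, 4_600_000]
estimate, nearest = knn_comp_estimate([53, 2], features, prices, k=3)
```

Here the three small two-room comps are selected and the two large properties are ignored, which mirrors how a broker would discard dissimilar sales.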

Secondly, linear regression methods are also widely used, especially as an improve-
ment baseline [22]. These methods find the linear relationships between the features
and the target, assuming linearity in relations.
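
As a minimal sketch of such a linear baseline, the following fits an ordinary least-squares model with NumPy on synthetic data. The data are generated from a known linear rule purely for illustration, so the fit recovers the coefficients exactly; real sales data would contain noise, and the feature choice here is an assumption.

```python
import numpy as np

# Synthetic data: price = 20000 * area + 100000 * rooms + 500000 (made-up rule).
area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
rooms = np.array([2.0, 2.0, 3.0, 4.0, 4.0])
price = 20_000 * area + 100_000 * rooms + 500_000

# Design matrix with a column of ones for the intercept.
X = np.column_stack([area, rooms, np.ones_like(area)])
coef, residuals, rank, singular_values = np.linalg.lstsq(X, price, rcond=None)

# Predict the price of a hypothetical 90 square meter, three-room property.
predicted = np.array([90.0, 3.0, 1.0]) @ coef
```

Each fitted coefficient can be read directly as the price change per unit of the corresponding feature, which is one reason linear models are popular as baselines.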


Thirdly, tree-based methods and gradient-boosted methods, such as eXtreme
Gradient Boosting (XGBoost) [23], work well when the relationships are non-linear
and have shown promising results on the AVM task in previous studies [2].

In recent years, the use of Neural Networks has also shown promising results, espe-
cially in combining different sources of data into one model. Notably, market leader
Zillow transitioned from a multi-model approach to a larger Deep Neural Network
(DNN) model, resulting in improved performance and reduced maintenance [24].

2.2 Features Impacting Property Price
Finding similar comps is crucial for a reliable price
estimation. These comps should closely mirror the factors of the subject property. In
the Swedish market value process [1], Fredrik Brunes categorises these factors into
two main groups: those related to the property and those related to its location.

Apartments and houses have additional separate features that connect to the market
value, such as housing cooperatives for the apartment and land area for the house.
Therefore, they are handled separately in the models and the theory [1].

2.2.1 Architectural Quality
The features connected to the property and location are scored on architectural
quality, drawing from the foundational principles in the Ten Books on Architecture
by Marcus Vitruvius Pollio [25]. These criteria are divided into three main parts
when scoring the architecture: stability, utility, and beauty, which are commonly
regarded as standard in Sweden. However, due to the unclear meaning and different
interpretations of standard, this thesis will use a specific definition of architectural
quality [1].

2.2.2 Factors Connected to Market Value in the Location
Starting with location, which is assessed similarly for both houses and apartments,
the utility of the location can be evaluated by considering the proximity to positive
factors. These can be the distance to marketplaces, commuting possibilities, and
workplaces. Additionally, it entails factors such as a sense of safety and relaxation
and the availability of places for socialisation with friends and family, such as parks
or forest areas within walking distance of the property [1].

Secondly, the area’s beauty can be assessed based on the quality of the surrounding
houses, streets, and parking spaces in terms of materials and details. This
assessment includes factors such as the amount of daylight the area receives, whether
views are long or blocked by high buildings, and the accessibility of the area, including
multiple routes to access the property [1].

Finally, the stability of the location is assessed based on the materials used in the
nearby houses and common areas, such as parks and streets. This includes evaluating
whether these areas are well maintained and noting any damage and its severity [1].


The location is assessed in multiple layers, which include the micro-location, the
surrounding area, and the neighbourhood’s reputation. The micro-location is the
region with a direct connection to the property, while the surrounding area
is the region within walking distance. Lastly, the neighbourhood’s reputation concerns
a wider region, for which the general opinion is assessed. Notably, the reputation is
not assessed based on architectural quality but rather as an independent value [1].

2.2.3 Factors Connected to Market Value in the Property
Unlike the nearby area, which is shared between multiple residences, a property
offers a private space that the owner can customise. This gives the owner more
control over the home environment and architectural qualities.

Although there are distinctions between houses and apartments, there are also sev-
eral commonalities in terms of the layout, utility, and aesthetics of the rooms. This
section explains the overlapping characteristics of the apartment and the house,
while the differences will be addressed in the subsequent section.

Stability within the property is connected to the materials and building techniques
employed during its construction. This relates to the predicted maintenance
requirements in the form of expected repairs and the associated cost. Typically, the
foundation has longer intervals between repairs than the flooring and walls, but it
comes with a comparatively high cost at the time of repair [1], [26].

The property’s utility relates to how it can be used, such as the ability
to spend time with friends, cook food, maintain hygiene, and recover through sleep.
This could be in the form of a larger room, making it possible to spend time together,
sound insulation that keeps noise out, or a bathroom or washing machine within the
residence [1], [26].

Within the property, beauty is related to finer details in pleasing materials and
openness in combination with the balance of natural light. It also includes a balance
between open and closed areas and the generality of the home, as well as the ability
to use rooms for multiple purposes [1], [26]. While aesthetic preferences may vary,
certain features are generally considered appealing, while others, such as damaged
walls or broken details, are not.

2.2.4 Property Type Specific Factors
Given the explanation of the common characteristics, the focus now shifts to the
differences. Swedish apartments are usually part of a housing cooperative, to which
a monthly fee determined by its financial status and planned maintenance is paid.
High fees add to the buyer’s expenses, and a cooperative with large loans may need
to increase fees further when interest rates are high. Understanding the cooperative’s
financial situation is therefore crucial, as it enables buyers to assess potential future
costs [1].

On the other hand, the house also comes with extensions, such as land, the possibility
of extra buildings, and a foundation that is part of the residence. Unlike apartments,
where utilities are shared responsibilities within the cooperative, houses typically
place these responsibilities on the owner. This increases the potential repair cost
and underscores the importance of assessing the comp’s current state during the
selection process [1].

2.2.5 Other Factors Connected to Price
Other factors also affect the subject property’s price, such as supply and demand,
regulation changes, mortgage rates, and disposable income levels. Since the comps
are supposed to be sold close in time to the subject property, or adjusted by indices
to a common date, these factors can be assumed to be shared between the comps and
the subject property.

2.3 Images in the Sales Process
When it comes to property sales, images play a crucial role as part of the listing
material, providing potential buyers with a first impression of the property. They
give buyers a general idea of the property and help them assess whether they are
interested in bidding or attending a viewing [13].

Capturing people’s interest is essential in persuading potential buyers to pursue
further steps, increasing the number of potential buyers and, thereby, the demand for
the property. This part of the selling process has given rise to the niche business
of home staging, which takes advantage of the importance of aesthetics and aims
to make the home look and feel better to attract more buyers. Consequently, it
potentially increases the price, making the service a potential investment [4].

In a study using eye tracking, it was determined that the subjects spent 60% of the
time viewing the images in the property advertisement rather than the description
and comments from the broker [27]. This highlights the importance of the images
in the sale process. In addition, images have the advantage of universally communi-
cating the property’s condition without language barriers, conveying its appearance
in a way that words alone may struggle to achieve [16].

2.4 Interior Visible Features
A broker will visually inspect the subject property during the valuation process to
identify the previously mentioned features related to the property’s beauty, utility,
and stability, which are necessary to find suitable comps [1]. The property’s rooms
show material choices related to stability. They can also indicate the feasibility of
spending time with friends and family or whether this area is too cramped. Simul-
taneously, the bathroom and kitchen conditions can indicate one’s ability to cook
food and maintain proper hygiene [1], [26]. Damages and aesthetics are also visible
components on the surface layer of the exterior and interior. These damages can be
moisture and humidity on the ceiling or walls, as well as cracks and stains.

Within the home, both in the exterior and the interior, there are time-typical features
that are normal for the era [28], [29]. These features can be the type of wallpaper,
the doors and the windows of the property or some additional details. While these
features can show a desirable style, they also hint at potential underlying issues; the
construction industry has tried different techniques and materials throughout the
years, only exhibiting the usual problems long after the construction [28], [30].

2.5 Limitations and Challenges

A limitation in the manual assessment process is that while the subject property
can be visually inspected and assessed thoroughly, the comps are usually not readily
available for inspection, making it hard to adjust the assessment based on these
features [1]. This is especially true when the interior parts are involved, while the
exterior and surroundings can be viewed with satellite or street view images.

This can lead to a situation such as that in Figure 2.1, where two similar-looking
sales differ in price, making it difficult to assess the features that set them apart
and to adjust accordingly.

Figure 2.1: Highlighting comp selection problem

A broker with area knowledge and previous sales experience within that area might
have a good understanding of the differences, including the general condition and
how to adjust the price accordingly. Conversely, a new broker might be more
restricted [1]. The visual aspect, if available, can then aid the broker, leading to a
better understanding of the differences. Figures 2.2 and 2.3 show an exaggerated
example of the potential differences in stability, utility, and beauty.

Figure 2.2: Example comp (Desired)

Figure 2.3: Example comp (Undesired)

One of the main limitations of data-based valuation systems such as the AVM is
their difficulty in using data that is important to the market value but is either
unstructured or hard to access. Examples include natural light, sound levels from
the surrounding area, and views from the property [16]. While it is easy to add
quantitative data, using unstructured data such as ad descriptions, satellite images,
and exterior and interior images of a property requires more advanced feature
extraction techniques. However, features extracted from the unstructured data
could provide essential information for comparing comps with the subject property,
whether for AVMs or the broker.

These architectural qualities of the home have been referred to as the property’s
unmeasurable values [26], and the literature on what is considered high quality
within these topics is limited in Sweden [31]. This has triggered new research within
the field to find objective guidelines [32], highlighting both the difficulty and the
importance of the topic.

2.6 Review of Similar Studies
Along with advancements in Machine Learning (ML) and Artificial Intelligence (AI)
and the increasing ability to utilise unstructured data, methods based on DNNs have
begun to be used to obtain more objective and data-driven real estate valuations.
These methods have also shown promising development in predicting these hard-to-
quantify features related to architectural qualities.

The literature contains various studies on real estate valuation and the use of visual
features from images, mainly aiming to reduce uncertainty in the assessment. The
studies have focused on different property characteristics with overlapping themes,
such as attractiveness [11], [33], aesthetics [34], material usage [35], luxury levels [4],
the impact of damages [36], and the effects of furnished and unfurnished images [37].

Most of these studies have been conducted on exterior images, such as satellite
images [38], [39] and street views [20], [40], possibly due to the ease of accessing this
data afterwards through services such as Google Street View [41] and Google Maps
Static [42]. However, some studies have also focused on interior images, with the
room types assessed separately [43]. Closely related, one study examined the number
and location of photos taken from Facebook and how they indicated something
beautiful or photo-worthy within the area [44].

These studies have been conducted in many countries, such as China [40], Italy [38],
England [39], the United States [3], and South Korea [20], highlighting region-specific
insights. For instance, research in Beijing, China, showed a negative correlation
between water bodies and market value due to pollution [40].

2.6.1 Methods
These studies use different methods to measure visual features and their importance.
One method is to gather ratings on aesthetics and damages to capture subjective
opinions [37], [43]. In these studies, participants grade the visual features
comparatively, with multiple options presented. A rating is set on the targeted
features, such as damages or aesthetics, and the average rating serves as the ground
truth during the training of the models. Beauty has also been quantified with the
help of natural language processing (NLP), where comments on images are gathered
to extract assertions in the form of undesired and desired comments regarding image
aesthetics [6].

Another method is to use the error of the original estimate as an indicator of the
visual features’ effect on the price [3]. A negative difference for a property can
indicate that its visual aspects compare unfavourably with what is typical for the
region. Using multiple examples, the model can find these commonly negative and
positive patterns as features that can be added to the assessment. This method relies
heavily on the assumption that the estimate’s error is caused by the visual aspects.

2.6.2 Previous Attempts
Among the studies that utilised ratings, one evaluated the effect of features in
property images on real estate, with a group of experts using structured methodologies
to evaluate the functionality and aesthetics of furniture [37]. The results showed
that this approach was effective in aligning furniture design with consumer
preferences and quality standards.

In the study "Image-Based Appraisal for Real Estate Using Mask R-CNN" [36],
each image was labelled with multiple annotations related to the room conditions,
including damages and their severity. The study addressed the limited weight given
in real estate valuation to the property’s current condition as seen in its images.
It used the Mask R-CNN [45] approach published by Facebook AI Research to detect
defects and damages by segmenting them in interior and exterior images of real
estate. The primary purpose was to understand the effect of the defects and damages
in the interior and exterior images on the price.

The price error was used as an indicator for undesired visual features in the study
"House Price Estimation from Visual and Textual Features" [3]. Specifically, a binary
classifier for the curb appeal of houses was developed, based on the error of the
previous prediction in combination with Principal Component Analysis (PCA) of
pre-trained ResNet features. The choice of a binary classifier for good and bad curb
appeal is a simplification over a regression task where the actual difference is the
target. While the approach led to an improvement, the authors stated that it was
only modest.

In the context of AVMs, these studies have mainly used models such as Ordinary
Least Squares (OLS) [46] and XGBoost as baselines, comparing the results with and
without visual features extracted from images. Another study used recurrent neural
networks (RNNs) [47] to process data from random walks based on the location of
properties, embedding locality to improve property pricing [16]. Performance was
generally evaluated using Mean Square Error (MSE), Mean Absolute Percentage
Error (MAPE), and R². The results of these studies showed moderate improvements,
suggesting that visual features reduce uncertainty and emphasising the need for
further research [3], [20].
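
As a reminder of how the three evaluation metrics mentioned above are defined, the
following minimal Python sketch computes them for a few made-up sale prices. The
function names and the example values are illustrative assumptions and are not taken
from the reviewed studies.

```python
def mse(actual, predicted):
    """Mean Square Error: average squared difference."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, expressed in percent."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def r2(actual, predicted):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices and model estimates (SEK).
actual = [2_000_000, 3_000_000, 4_000_000]
predicted = [2_200_000, 2_700_000, 4_000_000]

print(f"MSE:  {mse(actual, predicted):.3e}")
print(f"MAPE: {mape(actual, predicted):.2f}%")
print(f"R2:   {r2(actual, predicted):.3f}")
```

Because MAPE is relative to the sale price, it is comparable across cheap and expensive
properties, whereas MSE is dominated by errors on the most expensive sales.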

2.7 Computer Vision
The field of computer vision, which retrieves information from images and video,
has been active for a long time. An early breakthrough of convolutional neural
networks (CNNs) was their ability to capture patterns, making it possible to classify
text and scenery from visual media [5].

2.7.1 Convolutional Neural Networks
Yann LeCun and his collaborators made an early contribution to this field. In the
1998 paper "Gradient-Based Learning Applied to Document Recognition" [5], they
applied CNNs to document recognition with the LeNet architecture. The paper set
out the CNN architecture and training methods and showed their effectiveness for
two-dimensional shapes, such as handwritten characters.

After LeNet, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton introduced AlexNet
in 2012 in the paper "ImageNet Classification with Deep Convolutional Neural
Networks" [48]. AlexNet was deeper than LeNet, used ReLU activation functions to
improve the model’s performance, and was trained on GPUs. It achieved a top-five
error rate of 15.3% on the ImageNet challenge, a significant improvement over the
existing models.

Following this, Karen Simonyan and Andrew Zisserman introduced the VGG networks
in 2014 in the paper "Very Deep Convolutional Networks for Large-Scale Image
Recognition" [49]. VGG’s architecture was deeper than AlexNet’s, which was made
possible by the use of very small 3x3 convolutional filters. With this architecture,
the VGG model achieved better performance on the ImageNet challenge than earlier
models such as AlexNet. This result showed that deeper networks with smaller
convolutional filters can increase model performance and provide better accuracy in
image-oriented tasks.

Subsequently, the ResNet model was introduced by Kaiming He, Xiangyu Zhang,
Shaoqing Ren, and Jian Sun in 2016 in the paper "Deep Residual Learning for
Image Recognition" [50]. This paper presented a new approach to the problem of
vanishing gradients during the training of deeper networks. By incorporating skip
connections between layers, ResNet effectively mitigated this issue and achieved
superior results across various image-oriented tasks, surpassing the performance of
previous architectures such as VGG.
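
The skip connection can be illustrated with a minimal sketch. It uses small fully
connected layers in plain Python instead of the convolutional layers of the actual
ResNet, and the function names and weights are illustrative assumptions. The block
computes F(x) + x, so when the learned part F contributes nothing, the input passes
through unchanged.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, value) for value in vector]


def linear(weights, vector):
    """Matrix-vector product; stands in for a convolutional layer here."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, weights_1, weights_2):
    """Compute relu(F(x) + x), where F is two layers with a ReLU in between."""
    fx = linear(weights_2, relu(linear(weights_1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


zero = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]

# With zero weights the block reduces to the identity for non-negative inputs.
print(residual_block([1.0, 2.0], zero, zero))
# With identity weights, F(x) = x and the block returns x + x.
print(residual_block([1.0, 2.0], eye, eye))
```

Since the addition gives the gradient a direct path around the layers in F, stacking
many such blocks is less prone to the vanishing gradient problem described above.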

While these larger models are trained with millions of images [51] to find robust
features, they can start from a pre-trained stage, where the model has already
learned base patterns that can be fine-tuned to a set objective. These larger models
must be trained on data that are connected to the target domain at hand. For
instance, a large model that is trained on a dataset linked to animals might not
have found the patterns that would be useful in real estate. The growth of openly
available datasets, from MIT Indoor 67 [52], with 15,620 images, to Places 365 [53],
with approximately 8 million training images, makes it possible to train these larger
DNN models, which would otherwise have been impractical due to resource, time
and image constraints.

The AlexNet [48], VGG [49], and ResNet [50] models have shown promising
performance in papers on classifying scenery and distinguishing between different
types of settings, such as cafeterias and classrooms. The models still leave room for
improvement, especially in top-1 accuracy, that is, how often the class with the
highest predicted probability is the correct one, which is currently between 50% and
60%. This is still impressive given the 365 options, and it demonstrates both the
ability of the models and the slight improvement between model generations [53].

2.7.2 Transformers
The recent success of transformers in NLP, starting from the paper "Attention is
All You Need" [54], started an AI leap with new tools such as the BERT [55] and
GPT [56] models. The significant improvement with transformers is the ability to
handle sequential data without having to process it sequentially, a limitation of
previous models such as the RNN [54], [55]. This is achieved with the attention, or
self-attention, mechanism, which assigns an importance or weight to each part of
the input, thereby keeping attention on the high-weighted parts [54]. This has shown
improvements over previous methods [6].
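
The weighting idea can be made concrete with a minimal sketch of scaled dot-product
attention, the core operation described in [54]. It is written in plain Python for
readability and leaves out the learned projections and multiple heads of a full
transformer. The function names and the toy vectors are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        # Similarity between this query and every key.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(weight * value[i] for weight, value in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(result)
```

Each output is computed from all inputs at once, which is why the sequence does not
have to be processed step by step as in an RNN.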

The use of transformers has also entered the computer vision field in the form of
the Vision Transformer (ViT) [57], a transformer model designed to push transformers
beyond their primary field, NLP, and into computer vision and image analysis. The
main idea in ViT is to represent each input image as a sequence of patches and let
the model learn image structures on its own through the transformer’s attention
mechanism [57].
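
How an image becomes a sequence of patches can be shown with a small sketch. It
covers only the splitting step: the linear projection of each patch and the position
embeddings used in ViT are left out. The function name, the tiny 4x4 single-channel
image and the patch size are illustrative assumptions.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image into non-overlapping, flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches


# A 4x4 "image" with pixel values 0..15, split into four 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), patches[0], patches[-1])
```

The resulting list plays the same role as a sequence of word tokens in NLP, which is
what allows the transformer architecture to be reused for images.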

The main advantage of ViT over CNNs is that ViT uses these attention mechanisms
that are not constrained by the spatial structures on which CNNs are based. This
allows the model to focus on the most relevant parts of the input image [57]. This can
lead to more efficient processing, especially for tasks that benefit from understanding
the global context of the image [57].

2.8 Gaps in the Research
Given the rapid advancements in AI and the vast amount of unstructured data within
the real estate field in the form of text, images, and videos, this research area holds
significant potential [58]. Specifically, it can leverage this unstructured data to
improve the market value assessment.

In addition, the difficulty of estimating market value makes property valuation a
good test for new developments that include extracted, hard-to-quantify features to
reduce uncertainty. Given the current difficulty in quantifying the architectural
quality features of the property and using them in the comp selection, we believe
that future real estate research on the valuation process will continue to find ways
to incorporate more complex components. These components can then be correlated
with the market valuation to highlight their importance. Additionally, they can be
used to make the selection of comps easier for real estate brokers.

Many features related to the architectural qualities could be extracted, such as sound
levels and natural light levels. However, they are hard to quantify accurately and
clearly. Furthermore, acquiring relevant data for these features presents a challenge
[21]. As AI models continue to advance and more open data becomes available, the
field of property assessment is likely to be explored anew, with a focus on the
uncertainty and on understanding the correlated features, including the dimensions
of architectural quality.

2.9 Future Directions
While the use of CNNs and other DNNs in the valuation process has been studied,
mainly in other regions and with both exterior and interior images, the use of ViT
remains limited. Regional differences might also introduce regional biases in the
studies, potentially limiting how well the findings transfer to other regions, such as
Sweden.

Another area for improvement with these findings lies in the ability to interpret
the result. While these methods have shown slight improvements in automated
valuation, they are usually hard to use outside of AVMs. A more targeted extraction,
with a visual understanding, could also aid the manual assessment process, helping
the broker in the comp selection process and speeding up their workflow.

Therefore, there is a need for future studies of visual features in regions such as
Sweden that continue to explore ways of incorporating additional features into the
valuation process. This could enhance its accuracy and efficiency, as well as the
understanding of the market value and its relation to characteristics such as
architectural qualities.


3
Methods

The method chapter begins with a summary of the research plan, followed by the
data and pre-processing required for this thesis. Thereafter, the theory and best
practices for training DNNs are explained. The chapter concludes with the visual
extraction methods, the AVM models and the scoring methodology used in this
research.

3.1 Research Plan
The primary objective of this thesis is to leverage visual aspects from interior im-
ages with the help of state-of-the-art computer vision models to reduce uncertainty
in Swedish property valuation. The hypothesis is that interior images reflect the
architectural qualities that advanced computer vision models can extract and use in
the predictions, thereby improving the AVMs’ accuracy.

We test this hypothesis by conducting an empirical research study on private housing
within the Uppsala county region in collaboration with Valueguard Index Sweden AB
(Valueguard) [59].

The collaboration with Valueguard enables us to access listing images taken at the
time of sale, along with the metadata associated with the property, selling date, and
selling price. It also supplies us with housing indices that track price changes over
time, which enables us to adjust these sales to the same date, resulting in a more
comparable dataset. Lastly, their extensive expertise in the real estate field provided
invaluable guidance throughout the project.
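
The index adjustment can be sketched as follows. The exact formula is not spelled
out here, so the ratio-based adjustment, the function name and the index values are
assumptions chosen for illustration.

```python
def adjust_price(sale_price, index_at_sale, index_at_target):
    """Move a sale price to the target date by the ratio of the index values."""
    return sale_price * index_at_target / index_at_sale


# Hypothetical sale: 3.0 MSEK when the index stood at 200, target date index 220.
adjusted = adjust_price(3_000_000, index_at_sale=200.0, index_at_target=220.0)
print(adjusted)
```

After this step, price differences between sales reflect the properties themselves
rather than general market movements between the sale dates.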

This thesis focuses on the interior images and excludes exterior and surrounding
images. These interior images are categorised separately according to room types
for a more tailored comparison. One initial limitation in this study is that these
labels are not provided, thus requiring extensive manual pre-processing through
image classification to obtain the required dataset.

In this thesis, the accuracy of the AVM is used to measure the impact of these
extracted features, with the reduced uncertainty measured as the MAPE of the AVM
prediction. Thereafter, 10-fold cross-validation is combined with paired t-tests to
test the statistical significance of the added visual features. Additionally, when a
statistically significant reduction in MAPE is found, the feature weight in the model
is examined to understand the importance of the newly added features. Together,
these steps are intended to provide a solid foundation for the research.
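
The significance test can be sketched as below. The per-fold MAPE values are made up
for illustration, the function name is an assumption, and the critical value 2.262 is
the two-sided 5% threshold of the t distribution with 9 degrees of freedom, matching
10 folds.

```python
import math
import statistics


def paired_t_statistic(baseline_scores, candidate_scores):
    """t statistic of the mean per-fold difference (baseline minus candidate)."""
    differences = [b - c for b, c in zip(baseline_scores, candidate_scores)]
    mean_difference = statistics.mean(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_difference / standard_error


# Hypothetical MAPE (%) per fold, without and with visual features.
baseline_mape = [8.1, 7.9, 8.4, 8.0, 8.3, 7.8, 8.2, 8.1, 8.0, 8.2]
visual_mape = [7.9, 7.8, 8.1, 7.9, 8.0, 7.8, 8.0, 7.9, 7.9, 8.0]

T_CRITICAL_DF9 = 2.262  # two-sided, alpha = 0.05, 9 degrees of freedom
t_value = paired_t_statistic(baseline_mape, visual_mape)
print(f"t = {t_value:.2f}, significant: {abs(t_value) > T_CRITICAL_DF9}")
```

The test is paired because both models are evaluated on the same folds, so only the
fold-wise differences enter the statistic.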

3.2 Limitations and Scope

The first limitation in scope is the types of properties explored. This thesis focuses
exclusively on apartments and smaller houses, essentially year-round private housing.
This approach defers the inclusion of interior images from commercial establishments
and summer cottages to future studies. This decision ensures a targeted approach,
considering the properties’ distinct customer groups and usages. There is also
limited availability of relevant data for commercial housing for this thesis.

Secondly, this thesis only explores the interior images and excludes images of the
property’s exterior and those depicting the surrounding area. This, in combination
with the focus on the room types separately, creates a high reliance on representative
data connected to all the room types for each sale.

Thirdly, this thesis does not collect votes or labels from the broker to use as a target
for architectural quality. Instead, it attempts to find these structures in the data.
This scope is set to explore the advancement of new AI tools for finding patterns,
primarily due to the scarcity of available experts in the field to assist in this process.

Finally, the region’s size is limited to control the number of sales and images
processed within the study. A single region is chosen, rather than a sample from the
country as a whole, to capture comparable sales within the region. Therefore, only
sales within Uppsala County are included, as shown in Figure 3.1. The region is
chosen based on the available data, with a preference for regions the authors are
familiar with. Furthermore, the region is considered sufficiently large to generate a
dataset big enough to make the use of DNN models meaningful.

Figure 3.1: Uppsala County on OpenStreetMap [60]

3.3 Technologies

Multiple programming languages can be utilised for visual feature extraction.
However, due to previous knowledge and experience, the choice is made to work in
Python [61] and use PyTorch [62] as the main library for working with images.
Additionally, we use GeoPandas [63] to mark up areas and calculate distances, and
pandas [64] to load and process the metadata. We run these tools and models mainly
in Jupyter notebooks [65], which run inside a Docker container [66] with graphics
processing unit (GPU) access.

A key tool in the project is Label Studio [67], which makes labelling fast and
efficient and ensures that data policies are upheld by working locally. Additionally,
it supports the multi-label and single-class classification that we use in the thesis.
The tool is essential for labelling a large quantity of images securely and within a
reasonable time, with an easy import function from a JSON format to generate the
tasks and hotkeys to speed up the labelling process.
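
A minimal sketch of generating such an import file is shown below. It assumes the
common Label Studio task layout, a JSON list in which each task holds its fields under
a data key, and an image field named image that would have to match the variable used
in the labelling configuration. The function name and the file names are hypothetical.

```python
import json


def build_tasks(image_paths):
    """Create one labelling task per image in the assumed import layout."""
    return [{"data": {"image": path}} for path in image_paths]


# Hypothetical image locations; real paths depend on the local storage setup.
tasks = build_tasks(["images/0001.jpg", "images/0002.jpg"])
payload = json.dumps(tasks, indent=2)
print(payload)
```

Generating the file from a script makes it straightforward to create tasks in batches,
for example one batch per room type to be labelled.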

MLflow [68] is another valuable tool for this thesis. It saves experiment results
together with the connected parameters, scores, models, and graphs in a concise and
easily manageable way. This reduces the difficulties associated with a more extensive
set of models with different training parameters. It also makes it possible to run
larger experiments sequentially over multiple days, trying out a wider range of
parameters that can be assessed afterwards.

Computing power is an essential part of running these models. Given that DNNs
run considerably faster on GPUs than on central processing units (CPUs) [69], GPU
resources are used to increase the number of feasible experiments within the limited
time. Also, due to the sensitive nature of some of this data, an additional requirement
is that it has to stay on the company’s hardware. This requirement removes the
alternative of utilising cloud computing, which can scale more freely. As a result, a
Linux server with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 32GB DDR4 RAM
and an Nvidia GeForce RTX 3090 with 24GB VRAM is provided and used throughout
the project. For the dataset and models explored, the hardware is deemed sufficient
to run and train the models within a reasonable time for multiple attempts.

3.4 Data

Data is the most critical part of our project, especially when using the DNN
approach due to its data-hungry nature [70]. This data consists of information
related to the property, such as the number of rooms and size. Valueguard provides
most of this data through metadata and images connected to the sales. However,
some additional data sources are used to enrich the dataset. This additional data is
mainly related to the locality in the form of additional regions marked up with the
provided coordinates.

3.5 Valueguard Index Sweden AB
Throughout the thesis, there is a collaboration with the company Valueguard. This
company has a long history of creating hedonic housing indices and aiding brokers in
the valuation process by providing CMA tools that suggest comps [71]. Additionally,
they offer AVM services in the form of an API [72]. Furthermore, Valueguard gathers
its data from various sources, such as realtors, direct data transfers from multiple
real estate agencies, and large providers such as Svensk Mäklarstatistik [73].

3.5.1 Ethical Considerations
Working with images from people’s residences can be ethically intrusive. Although
these images have been used for marketing and public viewing, the thesis commits
to maintaining confidentiality throughout the project. This is done by implementing
strict safety measures to anonymise all images, ensuring that individual privacy is
protected and no personal data is compromised. The measures include keeping the
images secure on the server, generating universally unique identifiers (UUIDs) for
the images that can only be tied to sales with credentials during the run, and
removing any images with individuals from the dataset.
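
The identifier step can be sketched as follows. The function name, the file names and
the choice of random (version 4) UUIDs are assumptions for illustration, since the
UUID variant used is not stated. The lookup table is the sensitive part and would be
stored separately under access control.

```python
import uuid


def assign_image_ids(image_names):
    """Map each original image name to a random UUID (version 4)."""
    return {name: str(uuid.uuid4()) for name in image_names}


# Hypothetical original file names that reveal which sale an image belongs to.
id_by_name = assign_image_ids(["sale_17_kitchen.jpg", "sale_17_bathroom.jpg"])

# Only the identifiers are exposed to the training pipeline.
public_ids = sorted(id_by_name.values())
print(public_ids)
```

Because the identifiers are random, they carry no information about the sale, and the
link back to the property exists only in the protected lookup table.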

These security measures also led to the decision that the visual examples in the
thesis are taken from licence-free image providers such as Unsplash [74] rather than
from the provided images.

3.5.2 Metadata
The metadata here refers to the data connected to the sale and includes numerical
and categorical variables tied to the property and the connected location. The
provided metadata is used as the baseline features in the AVM model comparisons.

3.5.2.1 Location

As previously mentioned, the locality is an essential component of the valuation
process. Publicly available regions are used to extend the given base data, giving the
models more opportunities to learn characteristics tied to the region.

In this thesis, two external location-oriented sources are used. The first is Statistics
Sweden [75], which exposes multiple definitions of Swedish regions to capture and
compare localities. The regions used from Statistics Sweden in our study include
Demographic Statistical Areas (DeSO) [76], a geographical division of Sweden for
more detailed demographic and statistical data analysis. Additionally, Regional
Statistical Areas (RegSO) [77] are used, where Sweden is divided into 3,363 statistical
areas based on municipal and county boundaries. Urban areas [78], defined as
contiguous built-up areas with at least 200 inhabitants, are also considered. Finally,
the study incorporates Municipalities [79], which divide Sweden into 290 larger
regions.

The different regions within the Uppsala County boundaries can be observed in
Figure 3.2, which shows the different sizes of the regions and their overlap, allowing
the model to tie smaller regions to larger ones. These regions are added to ensure
the models can capture the micro-location and regional features discussed in the
theory chapter.

Figure 3.2: Geographical areas from Statistics Sweden on OpenStreetMap

Besides the pre-defined regions provided by Statistics Sweden, geographical tiling,
similar to a grid system, is used to capture regional characteristics on multiple
levels, or resolutions as the index refers to them. This is achieved by marking the
sales with multiple layers of Uber’s H3 index (H3) [80] at different resolutions. This
approach is inspired by a blog post from Zillow about the design of their DNN AVM
[24]. Figure 3.3 displays a visual example of resolutions 7, 9 and 11 in the centre of
Uppsala.

Figure 3.3: H3 Index with different resolutions on OpenStreetMap

One final location feature is added to each sale: the distance to the centre of the
urban area. It aims to capture how far the sale is from desirable locations in the
region centre, such as stores and other necessities. The centre is estimated using the
90th percentile of the recalculated sale price within the region, thereby selecting only
the most expensive sales. The centre point is then generated by taking the median
of the x and y RT90 coordinates separately. This method operates under the
assumption that property market values generally increase the closer they are to the
city centre. The straight-line distance from each sale to the centre point of its region
is then calculated and added to the base features.
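
The centre estimation can be sketched in a few lines. The sketch uses a simple
nearest-rank percentile, which may differ from the quantile method actually used, and
the function names, dictionary keys and toy coordinates are illustrative assumptions.

```python
import math
import statistics


def estimate_centre(sales, percent=90):
    """Median x and y of the sales priced at or above the given percentile."""
    prices = sorted(sale["price"] for sale in sales)
    rank = -(-percent * len(prices) // 100)  # ceiling division, nearest rank
    threshold = prices[max(rank, 1) - 1]
    top_sales = [sale for sale in sales if sale["price"] >= threshold]
    centre_x = statistics.median(sale["x"] for sale in top_sales)
    centre_y = statistics.median(sale["y"] for sale in top_sales)
    return centre_x, centre_y


def distance_to_centre(sale, centre):
    """Straight-line (Euclidean) distance in the units of the coordinates."""
    return math.hypot(sale["x"] - centre[0], sale["y"] - centre[1])


# Ten hypothetical sales within one urban area; coordinates in metres.
sales = [{"price": i, "x": 100 * i, "y": 200 * i} for i in range(1, 11)]
centre = estimate_centre(sales)
print(centre, distance_to_centre({"x": 953, "y": 1904}, centre))
```

Taking the median of each axis separately keeps the centre point robust against a
single expensive sale far from the rest.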

This study exclusively uses the distance to the generated centre of the urban areas.
However, other distances, such as those to the municipality centre, the airport, and
public transport, could enrich the models with more information regarding the
location. They are excluded to limit the scope and workload.

3.5.2.2 Base Features and Target

This section describes the base features and the target of the AVM. Due to the
skewed distribution of some of the features, a logarithmic transformation, or log
scaling, is applied to bring these distributions closer to a normal distribution,
potentially making it easier for the model to work with these features [81].
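
The effect of the transformation can be illustrated with a small sketch that compares
the skewness of a feature before and after taking the natural logarithm. The living
areas are made-up values, the function name is an assumption, and whether the natural
logarithm or another base is used in the thesis is not stated here.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    count = len(values)
    mean = sum(values) / count
    variance = sum((value - mean) ** 2 for value in values) / count
    third_moment = sum((value - mean) ** 3 for value in values) / count
    return third_moment / variance ** 1.5


# Hypothetical living areas in square metres with one very large property.
living_areas = [35, 42, 58, 64, 71, 95, 120, 310]
log_living_areas = [math.log(area) for area in living_areas]

print(f"skewness before: {skewness(living_areas):.2f}")
print(f"skewness after:  {skewness(log_living_areas):.2f}")
```

The transformation compresses the long right tail, so a single unusually large
property has less influence on the feature's scale.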

The base features in the form of metadata can be seen in Table 3.1 for houses and
Table 3.2 for apartments. The type of variable and transformation are indicated
with the tag notation of C for a categorical feature, L for a feature with logarithmic
transformation, and N for a numerical feature.

Table 3.1: Housing metadata

Feature                          Tag
Home type                        C
Standard points                  N
Plot area                        N
Living area                      N, L
Rooms                            C
Ancillary area                   N
Construction year                N
RT90x                            N
RT90y                            N
DeSO                             C
RegSO                            C
Municipality                     C
H3 Index res 3-9                 C
LKF                              C
Distance to urban area center    N

Table 3.2: Apartment metadata

Feature                          Tag
Elevator                         C
Monthly fee                      N
Living area                      N, L
Rooms                            C
Construction year                N
Floor                            N
Floors                           N
Housing cooperative              C
RT90x                            N
RT90y                            N
DeSO                             C
RegSO                            C
Municipality                     C
H3 Index res 3-9                 C
LKF                              C
Distance to urban area center    N

Regarding the features shared between the property types, living area indicates the
size of the livable area. Rooms is the number of rooms within the property; it is
treated as categorical because of the observed differences in price per square metre
between properties with different numbers of rooms [82]. Construction year refers to
the year in which the property was built.

Furthermore, among the regional features, RT90x denotes the horizontal distance to
the east from the central meridian of the RT90, while RT90y indicates the vertical
distance to the north from the central meridian of the RT90. The DeSO, RegSO,
municipality, and H3 index at different resolutions are the regions with which the
data is marked up, while the LKF represents a region connected to a parish region
[83]. Lastly, distance to urban area center is the straight-line distance to the
generated urban area center.

Among the house-specific variables, home type indicates the type of house, such as
chain house, semi-detached house or terraced house. Standard points reflect the
condition scoring of different features in the house, such as kitchen setup and
renovations [84]. Plot area is the size of the land belonging to the property, while
ancillary area covers secondary areas that are not part of the main living area of
the property, such as the garage or basement.

Among the apartment-specific features, the elevator feature indicates whether the
apartment building has an elevator. The monthly fee is the fee paid each month to
the housing cooperative. Floor is the floor on which the apartment is located, while
floors is the total number of floors in the apartment building. The housing
cooperative is the legal entity that owns the real estate.

The target variable for the AVM is the logarithmically transformed recounted sale
price, which is used as a proxy for the market value. In the recount process, the
price at the time of the sale is adjusted with the help of a regional index to
represent the market value on January 15, 2024. This date is selected because it is
the latest published index value at the start of this thesis. This approach allows
us to exclude the time variable and simplify the task. However, it should be noted
that the further the original sale lies from the recount date, the more uncertain
the recounted value becomes. In our study, the sales and the corresponding images
cover the period from 2019 to 2024.
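A minimal sketch of the recount and the log target is given below. It assumes that
the regional index is applied multiplicatively, as a ratio of index levels, which the
text does not specify; the function names and all numbers are hypothetical.

```python
import math

def recount_price(sale_price, index_at_sale, index_at_target):
    """Scale a historical price by the ratio of index levels (assumed multiplicative)."""
    return sale_price * index_at_target / index_at_sale

def to_target(recounted_price):
    """Log-transform the recounted price for use as the regression target."""
    return math.log(recounted_price)

def from_target(log_price):
    """Invert the log transform to get a price in SEK back."""
    return math.exp(log_price)

# Hypothetical numbers: a 3,000,000 SEK sale made when the index stood at 200,
# recounted to a date on which the index stands at 220.
recounted = recount_price(3_000_000, index_at_sale=200.0, index_at_target=220.0)
target = to_target(recounted)
```

Predictions made on the log scale have to be passed through `from_target` before
error metrics expressed in SEK are computed.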

The images play a crucial role in our study. Each sale comes with approximately 30
images of the property on average. The room-type images are pooled across the
property types. This pooling is based on the assumption that the characteristics of
these rooms are comparable and overlap between apartments and houses.

3.6 Room Types
In this study, the focus is primarily on the interior images, with each room type
handled separately. This makes comparisons within the model more meaningful:
kitchens are not compared with bathrooms; instead, bathrooms are compared with other
bathrooms to find similarities and differences in architectural quality that are
relevant to the room type.

Furthermore, the room types used in this study are the bathroom, bedroom, kitchen,
living room, and dining room. They are chosen partly because of their connection to
architectural quality in the theory through their function: the bedroom for sleep,
the kitchen for preparing food, and the bathroom for hygiene. However, some of the
room types, such as the dining room, are chosen because they were used in similar
studies [4]. The goal of the pre-processing is to obtain images that depict only the
chosen room types, as seen in Figure 3.4.

Figure 3.4: Room types explored

3.7 Labelling
Labelling is a time-consuming but essential part of this study. It generates the
core dataset by assigning each image to the corresponding room type. A hasty
execution of this step can create issues for the rest of the study through a
low-quality dataset or a lack of data. Therefore, a longer period is set aside for
labelling to reduce the impact of time pressure, and a portion of the dataset is
labelled each day. The data quality is also reviewed throughout the thesis by
marking and excluding low-quality and undesirable images when they are seen.

Due to the previous success in distinguishing scenes with high accuracy [4], [53], the
labelling task is split into two stages, a so-called hierarchical approach with two
sub-tasks. The first stage aims to exclude a large proportion of the images that are
irrelevant to this study, leaving mostly study-related images for a more thorough
review in the second stage. The classes of the first stage are interior, exterior,
and others, where others is a class for floor plans and 3D renderings of the
property. These three classes are selected due to their apparent visual differences
and the assumption, based on previous successes, that they can be separated with
high accuracy.

The second stage, which involves classifying the room types in the images labelled
as interior, presents a more complex challenge. This is partly due to the presence
of multiple room types in one image. This issue is handled with a multi-label
classifier, where each room type receives its own probability in the output vector
for each image. This adds the option to select multiple rooms in the second
labelling stage.

In the study, 10,000 images are labelled in the first stage and 5,000 in the second
stage. The higher number in the first stage is due to the ease of labelling these
images. In addition, an accurate first-stage model passes more interior images on to
the second stage, which increases the number of usable images and reduces the need
for additional filtering of non-interior images.

Furthermore, the first stage is treated as a single-label classification task where
the highest-scoring class is chosen. In contrast, the second stage is treated as a
multi-label task where each class is assessed separately against a threshold. This
threshold is determined by testing a range of thresholds and selecting the one with
the highest F1 score for each room type, to ensure a balance between the model’s
precision and recall. A visual example of the output of the two stages can be seen
in Figure 3.7.
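The per-class threshold search can be illustrated with the short sketch below. The
threshold grid (0.05 to 0.95 in steps of 0.05), the helper names, and the toy
probabilities are assumptions made for the example; the thesis does not state which
range of thresholds was tested.

```python
import numpy as np

def f1(y_true, y_pred):
    """F1 score for one binary label column; 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_thresholds(y_true, probs, grid=np.linspace(0.05, 0.95, 19)):
    """For each class (column), return the grid threshold with the highest F1."""
    best = []
    for c in range(y_true.shape[1]):
        scores = [f1(y_true[:, c], (probs[:, c] >= t).astype(int)) for t in grid]
        best.append(float(grid[int(np.argmax(scores))]))  # first maximum wins ties
    return best

# Hypothetical validation labels and predicted probabilities for two room types.
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
probs = np.array([[0.90, 0.28], [0.62, 0.82], [0.38, 0.58], [0.10, 0.72]])
thresholds = best_thresholds(y_true, probs)
```

Choosing the threshold on validation data rather than test data keeps the reported
test performance out-of-sample.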

Figure 3.5: Stage 1

Figure 3.6: Stage 2

Figure 3.7: An instance of the labelling process

Another technique used to improve the classification model is semi-supervised
learning. This approach uses pseudo-labels, that is, high-certainty predictions of
the initial model on unseen data, as additional training data in a second training
stage. In our study, predictions with a certainty above 90% are deemed reliable
enough to be used as training data in a second training run.
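A minimal sketch of the pseudo-label selection is shown below for the single-label
first stage. It assumes that the 90% figure refers to the highest predicted class
probability; the function name and the probabilities are hypothetical.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.90):
    """Return (row indices, predicted class) for predictions above the threshold."""
    probs = np.asarray(probs, dtype=float)
    confident = probs.max(axis=1) > threshold
    idx = np.flatnonzero(confident)
    return idx, probs[idx].argmax(axis=1)

# Hypothetical class probabilities for three unlabelled images
# (columns: interior, exterior, others).
probs = np.array([[0.97, 0.02, 0.01],
                  [0.50, 0.30, 0.20],
                  [0.03, 0.05, 0.92]])
idx, labels = select_pseudo_labels(probs)
```

Only the first and third image pass the threshold here; the second stays unlabelled
and is left out of the second training run.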

Besides the five main room type labels in the second stage, there are three
additional auxiliary labels: Miscellaneous, Needs work, and Uncertain. The Needs
work label is for images that need cropping or additional processing, for example
when an image is a composite of multiple images. The Uncertain label tags images
that do not depict a room clearly or for which the labeller is unsure. The
Miscellaneous label marks rooms that are not analysed in this study and are
therefore excluded; these rooms could, for example, be saunas or gyms. These
additional tags make it possible to indicate uncertainty and required further work
while labelling the dataset. This is especially helpful for unfurnished rooms, which
can serve multiple functions and are hard to label.

3.8 Data Pre-Processing

The metadata and the images are pre-processed, partly to create the required format
for the models and partly to aid their convergence.

A frequently employed approach in similar studies [13] is to filter out sales for
which the price deviates significantly from the other sales in the same region or
from an initial prediction. There can be multiple reasons for such a price
difference, such as data quality issues or a sale that does not meet the market
value conditions. Therefore, prices deviating from the initial estimate by more than
a certain margin are considered unreliable and undesirable for the model to learn.
As a result, this thesis excludes sales with prices that deviate from the initial
estimate by more than 80% in either direction.
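The filter can be sketched as below. The text does not state which value the
deviation is measured relative to, so dividing by the initial estimate is an
assumption, as are the function name and the toy prices.

```python
import numpy as np

def keep_mask(price, estimate, max_dev=0.80):
    """True where |price - estimate| / estimate is at most max_dev."""
    price = np.asarray(price, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return np.abs(price - estimate) / estimate <= max_dev

# Hypothetical sales: the second and third deviate by more than 80%.
price = np.array([2.0e6, 0.3e6, 5.5e6])
estimate = np.array([2.2e6, 2.0e6, 3.0e6])
mask = keep_mask(price, estimate)
```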

Normalisation is another useful method that can make it easier for AI models to
converge and learn representations, primarily because the features end up on
comparable ranges [81]. Therefore, the numerical values used in the valuation model
are standardised to a mean of zero and a standard deviation of one. The
normalisation statistics are computed on the training data and then applied
unchanged to the test data, so that the evaluation remains out-of-sample.

Due to the high-dimensional feature space in the valuation model, especially from
the categorical features connected to the location, feature reduction techniques are
explored to keep the model from over-fitting to noisy variables. In this thesis,
this is done by comparing a model trained on all provided variables with one that is
trained only on the variables with higher feature importance, selected with the
SelectFromModel [85] function. The latter uses a model fitted on all the variables
and keeps only the variables whose absolute feature importance is above the mean for
that model. This technique focuses on the more essential variables with the aim of
obtaining a model that generalises better using only the more robust features. The
model with the highest performance on the validation data is then chosen.

To streamline these pre-processing steps, two scikit-learn pipelines [86] are
created, with the first step handling the numerical and categorical features
separately. The numerical features are standardised as described earlier with the
provided StandardScaler [87], while the categorical variables are one-hot encoded
with the OneHotEncoder [88]. In the second pipeline, the additional feature
selection step is added, and only the features with above-mean importance are kept,
as previously described. Figure 3.8 visually represents the pre-processing pipeline,
where X is the features and y_pred is the predicted market value.
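The two pipelines can be sketched with scikit-learn as follows. This is an
illustration under assumptions rather than the thesis code: the column names, the
Ridge estimator used both for selection and as the final model, the `alpha` value,
and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["living_area", "distance_to_centre"]   # hypothetical column names
CATEGORICAL = ["rooms", "municipality"]

def make_preprocess():
    """Standardise numerical columns and one-hot encode categorical ones."""
    return ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])

def build_pipeline(with_selection):
    steps = [("preprocess", make_preprocess())]
    if with_selection:
        # Keep features whose absolute Ridge coefficient is at or above the mean.
        steps.append(("select", SelectFromModel(Ridge(alpha=1.0), threshold="mean")))
    steps.append(("model", Ridge(alpha=1.0)))
    return Pipeline(steps)

# Synthetic stand-in data; y is a log price.
rng = np.random.default_rng(0)
n = 40
X = pd.DataFrame({
    "living_area": rng.uniform(30, 200, n),
    "distance_to_centre": rng.uniform(0, 20_000, n),
    "rooms": rng.choice(["1", "2", "3", "4"], n),
    "municipality": rng.choice(["A", "B"], n),
})
y = np.log(1e6 + 20_000 * X["living_area"] - 20 * X["distance_to_centre"])

full = build_pipeline(with_selection=False).fit(X, y)
reduced = build_pipeline(with_selection=True).fit(X, y)
kept = reduced.named_steps["select"].get_support()  # boolean mask of kept features
y_pred = reduced.predict(X)
```

Because the scaler, encoder, and selector live inside the pipeline, their statistics
are learned only from the data passed to `fit`, which keeps validation and test data
out of the pre-processing.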


Figure 3.8: Pipeline for the AVM model

For the DNNs used with the images, the pre-trained models usually had a specific
normalisation applied to the images during their training [89]–[91]. For the models
to work as expected, the same normalisation must be applied to new data to give
comparable and reasonable results. In PyTorch, these pre-processing steps are
usually provided by the Transforms library [92] and applied during training. This
is also the case for this thesis.

3.9 Deep Neural Networks
DNNs refer to neural networks with multiple hidden layers between the input and
the output, making them deep. These deep models are primarily used during this
thesis, and the following section outlines the methods used to train the DNN and
highlights the best practices used.

These DNNs come in different forms to handle different kinds of data, such as
sequential data, images, sounds, or inputs suited for standard feed-forward
networks. However, the central concept is that these networks take the input signals
and propagate them through the network, resulting in an output whose format is
designed to align with the task.

This thesis mainly works with two-dimensional and one-dimensional data. The
two-dimensional inputs are the images, whose RGB channels hold the red, green, and
blue colour values of each pixel. These are used as input to find spatial patterns
related to the task. A visual example of this can be seen in Figure 3.9.

Figure 3.9: Neural Network with two-dimensional input

Meanwhile, the one-dimensional input relates to the numerical and categorical vari-
ables for the AVMs. Figure 3.10 shows a visual example of a model with a one-
dimensional input.


Figure 3.10: Neural Network with one-dimensional input

In the domain of AI models, and especially with DNNs, the complexity of the models
can be chosen quite freely. This flexibility makes DNNs a powerful tool that can be
applied to various tasks. However, the trade-off between variance and bias must be
carefully assessed so that the chosen model is complex enough to learn the patterns
required in the task without overfitting to the noise within the data.

In addition to the model complexity selection, regularisation methods can reduce
the model’s tendency to over-fit on the training data. In this thesis, several
regularisation methods are used. These include early stopping, which stops the
training phase when no further improvements are made on the validation data [93].
Additionally, weight decay is used to penalise large weights during training, which
reduces the tendency to fit the noise. Furthermore, batch normalisation, which
normalises the data between the layers, is incorporated into some of the models [94].
Dropout layers are also utilised to set a percentage of the neurons to zero during
training to lower the reliance on certain neurons.
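Early stopping can be sketched in a few lines of plain Python. The class below is a
hypothetical helper, not code from the thesis; the patience of 4 matches the value
reported later for the AVM networks, while the loss values are made up.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=4, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.65, 0.68, 0.64]
stopper = EarlyStopping(patience=4)
stopped_at = next((epoch for epoch, loss in enumerate(losses) if stopper.step(loss)), None)
```

In this run, training halts after four epochs without improvement, so the later
value of 0.64 is never reached. A larger patience trades longer training for a lower
risk of stopping on a temporary plateau.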

Another method, used to leverage the patterns learned by earlier models, is transfer
learning, where a model trained on one task has learned robust features to
differentiate the data [95]. An example is a room classifier trained on the
Places365 dataset. Such a model can then be used in another task by replacing the
end of the model with a part specific to the new task. This new part is usually a
sub-model placed on top, referred to as the head of the model.

When training these pre-trained models, the best practice is to freeze, or lock, the
base to retain the robust patterns learned in the previous task and only train the
head [95]. This is done in a main training stage with a larger learning rate,
followed by a fine-tuning stage with a lower learning rate. Finally, the base, or
backbone, is also included in the training with a minimal learning rate as a final
tuning of the whole model to the task. In each stage, the model is trained until a
set maximum number of epochs is reached or until early stopping halts the training.
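The staged schedule described above can be sketched in PyTorch as follows. The tiny
backbone stands in for a pre-trained network, and the learning rates, the AdamW
optimiser, and the two iterations per stage are illustrative assumptions rather than
the settings used in the thesis.

```python
import torch
from torch import nn

# Stand-ins (assumptions): a tiny "backbone" instead of a pre-trained network,
# and a new task-specific head placed on top of it.
backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
head = nn.Linear(8, 3)
model = nn.Sequential(backbone, head)

def make_optimizer(stage):
    """Freeze or unfreeze the backbone and build the optimiser for one stage."""
    if stage == "main":        # frozen backbone, larger learning rate for the head
        backbone.requires_grad_(False)
        return torch.optim.AdamW(head.parameters(), lr=1e-3)
    if stage == "fine_tune":   # frozen backbone, lower learning rate for the head
        backbone.requires_grad_(False)
        return torch.optim.AdamW(head.parameters(), lr=1e-4)
    if stage == "full":        # whole model trainable with a minimal learning rate
        backbone.requires_grad_(True)
        return torch.optim.AdamW(model.parameters(), lr=1e-6)
    raise ValueError(f"unknown stage: {stage}")

x, y = torch.randn(32, 16), torch.randint(0, 3, (32,))
loss_fn = nn.CrossEntropyLoss()
for stage in ("main", "fine_tune", "full"):
    optimizer = make_optimizer(stage)
    for _ in range(2):  # in practice: a maximum epoch count or early stopping
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```

Keeping the backbone frozen while the randomly initialised head is trained prevents
large early gradients from overwriting the pre-trained features.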

Recent developments in training techniques and models within projects such as
SimCLR [96], [97] and DINO [98] have shown the strength of self-supervised models.
These models can learn robust features without the need for labelled data by
augmenting an image and aiming to maximise the similarity between the original and
the augmented image. The idea is that if the model has difficulty distinguishing
the two, they are presumably similar.

In this thesis, these advances are a good match due to the lack of provided labels
and the focus on the differences between the images. The study uses them both in the
form of the provided pre-trained models [99] and by training our own self-supervised
base models for each room type with the DINO V1 [98] approach. All DINO V1
self-supervised models are trained using the default parameters provided, following
the recommendation of 100 epochs for initial convergence. This limit is set due to
the runtime of training these models, approximately two days per model, and the
uncertainty about whether the available amount of data is enough to produce an
adequate model.

3.10 Visual Target Features
This section focuses on visual feature extraction. This thesis explores three
primary methods for feature extraction. The first is a binary classifier, where the
decision between a desired and an undesired attribute is assumed to be related to
the percentage error or the standard point. The second is an unsupervised approach,
where clusters obtained with different models are assessed visually in relation to
the architectural quality. The last approach uses the CLIP model, which can compare
the similarity between images and a positive and a negative description of an
architectural quality in a zero-shot fashion. Zero-shot learning refers to a
scenario where the model is applied to a task it was not trained on, without being
given any examples of the task [100].

3.10.1 Binary Classification
The binary classifier can be seen as a simplified regression task in which the
magnitude of the available target is not directly the desired quantity but is
assumed to be related to it. The provided magnitude is therefore ignored, and the
target is converted into a simple desired or undesired category corresponding to its
positive or negative side. The model can then create its own magnitude, by learning
the common patterns on the positive and negative sides, in the form of the predicted
probability of the desired or undesired class.

This thesis uses two targets to retrieve desirable and undesirable features. The
first target, also used in earlier studies [3], is the percentage error between the
actual price and the prediction made without the visual features. In line with this
earlier use, it is assumed that the difference between the initial assessment and
the market value depends on attributes related to the architectural quality that are
missing from the base features but can be observed in the images. The more images in
the undesired class that share similar patterns, the stronger the predictor of an
undesired feature.

The second target is the standard point, which adds a Swedish-specific approach.
This score only exists for houses. It is a condition measurement used as part of the
housing declaration, and it assesses multiple factors, such as the property’s
exterior, energy management, kitchen condition, sanitation, and other interior
features related to condition and function [84].

Although these scores are not based solely on visible interior features, this thesis
assumes that a home with a high standard point score or a high percentage error has
visual features in the interior of the property that indicate higher architectural
quality.

Figures 3.11 and 3.12 show the distributions of the percentage error and the
standard points used during training, both of which appear to follow a normal
distribution. Zero is the divider in the percentage error case, while the empirical
mean is the divider between lower and higher in the standard point case. To focus on
the more distinct differences, only the sales with an absolute percentage error
above ten percent are used in the percentage error case.
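The construction of the two binary targets can be sketched as follows. The sign
convention (actual minus predicted, divided by the actual price) and the function
names are assumptions, since the text only states that zero and the empirical mean
act as dividers.

```python
import numpy as np

def percentage_error_labels(actual, predicted, min_abs_error=0.10):
    """Return (keep mask, binary labels) from the signed percentage error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    pct_error = (actual - predicted) / actual    # assumed sign and denominator
    keep = np.abs(pct_error) > min_abs_error     # drop sales close to the prediction
    labels = (pct_error > 0).astype(int)         # 1 = sold above the base prediction
    return keep, labels

def standard_point_labels(points):
    """1 for houses above the empirical mean standard point, else 0."""
    points = np.asarray(points, dtype=float)
    return (points > points.mean()).astype(int)

# Hypothetical sales: the second one is within ten percent and is dropped.
keep, labels = percentage_error_labels([2.0e6, 3.0e6, 1.0e6], [1.6e6, 2.9e6, 1.3e6])
sp_labels = standard_point_labels([20, 30, 40])
```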

Figure 3.11: Histogram of the percent-
age error

Figure 3.12: Histogram of the stan-
dard point

To evaluate and score the binary classifier, an area under the curve (AUC) score is
generated for each model to compare their ability to quantify the desired and unde-
sired features. The AUC score is obtained from the receiver operating characteristic
(ROC) curve, which shows the true positive rate (TPR) compared against the false
positive rate (FPR). The equations for FPR and TPR can be seen below.

\[
\mathrm{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
\tag{3.1}
\]

\[
\mathrm{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}
\tag{3.2}
\]

A common method to calculate the AUC is the trapezoidal rule. This rule approximates
the area under the ROC curve by dividing it into multiple trapezoids, bounded by
vertical lines at the FPR values and by the TPR values at those points. The area is
then calculated by summing the areas of these trapezoids [101]. In this project, the
roc_auc_score [102] function from scikit-learn is used to obtain the AUC score.
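To make the trapezoidal rule concrete, the sketch below builds the ROC points by hand
and sums the trapezoids with NumPy. It is an illustration only; the thesis relies on
roc_auc_score, and the helper names and toy scores here are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """FPR and TPR at every distinct score threshold, starting from (0, 0)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score first
    y_sorted, s_sorted = y_true[order], scores[order]
    # Keep only the last index of each run of tied scores.
    last = np.r_[np.flatnonzero(np.diff(s_sorted)), len(s_sorted) - 1]
    tp = np.cumsum(y_sorted == 1)[last]
    fp = np.cumsum(y_sorted == 0)[last]
    tpr = np.r_[0.0, tp / tp[-1]]
    fpr = np.r_[0.0, fp / fp[-1]]
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Sum of trapezoid areas: width in FPR times the mean of the two TPR heights."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
```

For these four samples the ROC curve passes through (0, 0.5), (0.5, 0.5), and
(0.5, 1), which gives an area of 0.75.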

Because these improvements were made late in the project, they are not assessed
through the AVM models but on their own, on the validation data. AUC is a way to
score how well the model distinguishes the classes. Generally, a score of 1 is
considered perfect class separation, and a score of 0.8 is regarded as a good
separation. A score of 0.5, however, shows no ability to separate the groups [101].

3.10.2 Clustering
The second method used for visual feature extraction is clustering. In this method,
a pre-trained model generates a high-dimensional vector that is then used to find
clusters of images based on similar visual traits. The aim here is to find visually
interpretable clusters that relate to architectural qualities.

Multiple models are used to generate these high-dimensional vectors. Firstly, the
ViT Base model with a patch size of 14 provided by the DINO v2 project [103], [104],
which has been trained on ImageNet [51] using a self-supervised approach, is used.
Secondly, the self-supervised models trained within this thesis on each room type
are tried. Finally, two CNN models, VGG and ResNet50, that are pre-trained on the
Places365 dataset are utilised.

These high-dimensional vectors are then normalised and clustered with the K-Means
algorithm based on the Euclidean distance seen below:

\[
d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
\tag{3.3}
\]

In this equation, A and B are the vectors in n-dimensional space, and A_i and B_i
indicate components of the A and B vectors, respectively.

This process is repeated with a range of different numbers of clusters (k-values)
between 2 and 10. This range is chosen to look for more significant clusters and
assess more models within the thesis.

What counts as a good indication of well-divided clusters may vary depending on the
project framework, domain, and data. In this project, the Davies-Bouldin score is
used as the primary indicator to determine the best number of clusters to focus on.
It measures how compact and well separated the clusters are, with lower values
indicating more compact and better-separated clusters [105]. It is used as an
indicator of which clusters to explore further.
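The search over cluster counts can be sketched with scikit-learn as below. The
random blobs stand in for image embeddings, and interpreting "normalised" as L2
normalisation of each vector is an assumption; the K-Means settings are likewise
illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Hypothetical stand-in for image embeddings: three blobs in 16 dimensions.
centres = rng.normal(size=(3, 16)) * 5
embeddings = np.vstack([c + rng.normal(scale=0.3, size=(50, 16)) for c in centres])
embeddings = normalize(embeddings)  # L2-normalise each vector (assumed meaning)

scores = {}
for k in range(2, 11):  # k-values between 2 and 10, as in the text
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    scores[k] = davies_bouldin_score(embeddings, labels)

best_k = min(scores, key=scores.get)  # lower Davies-Bouldin is better
```

The score only ranks the candidate values of k; whether the resulting clusters are
meaningful still has to be judged by inspecting the images.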

To select the cluster features to be extracted and included in the AVM, a visual
inspection is performed with the aim of understanding the clusters visually and
relating them to our interpretation of the architectural qualities. This involves
randomly sampling five images from each cluster, a process repeated three times to
obtain a diverse representation of the perceived quality found in the images.
Repeating the sampling lowers the chance that the observed characteristics are found
simply by chance.


3.10.3 Contrastive Language-Image Pre-Training
CLIP is a model developed by OpenAI that has been trained to tie the ViT encoding of
images to the text encoding of image descriptions with the help of cosine
similarity, which has earlier been used within the field of Natural Language
Processing (NLP) to match document types [10]. The cosine similarity returns a score
between minus one and one, representing the similarity between vectors A and B, as
shown below. In the equation, ||X|| denotes the Euclidean norm of a vector X.

\[
\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
\tag{3.4}
\]

In this study, the CLIP model is used to score the architectural qualities by
providing a textual description and matching it with the images in a zero-shot
fashion. A positive and a negative version of the targeted feature are used to
provide a range of results. Figure 3.13 shows an example of extracting a score for
how well a room allows moving around, by matching descriptions of spaciousness in
text format with the room type image. This involves feeding the two sentences into
the text encoder and saving the positive score minus the negative score as the score
for the feature.
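The scoring step can be sketched with NumPy once the embeddings are available. The
two-dimensional vectors below are hypothetical placeholders for the outputs of the
CLIP image and text encoders, which are not called here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def quality_score(image_emb, positive_emb, negative_emb):
    """Similarity to the positive description minus similarity to the negative one."""
    return (cosine_similarity(image_emb, positive_emb)
            - cosine_similarity(image_emb, negative_emb))

# Hypothetical embeddings standing in for encoder outputs.
image = np.array([1.0, 0.0])      # room image
spacious = np.array([1.0, 1.0])   # positive description, e.g. a spacious room
cramped = np.array([-1.0, 1.0])   # negative description, e.g. a cramped room
score = quality_score(image, spacious, cramped)
```

Since each cosine similarity lies between minus one and one, the difference lies
between minus two and two, with positive values meaning the image is closer to the
positive description.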

Figure 3.13: Use of the CLIP model to generate spaciousness score [106]

These text versions of the positive and negative architectural qualities are primarily
based on examples found in the literature, with some additions from our understand-
ing of what is considered desirable and undesirable within the room types.

3.11 Utilising Visual Features in the Model
After the corresponding model has extracted the visual features, these features are
labelled with a name that ties them to the model and the targeted feature. They are
then added to the base features, giving an enriched version with the model-specific
features. When multiple images of the same room type are present, the scores of
numerical visual features are averaged. This is a typical approach when using
multiple images connected to the same feature [13], [16]. However, in the case of a
categorical variable, its presence in one image is enough for it to be valid for the
entire sale. For example, one bathroom image with a bathtub is enough for the
"bathroom has bathtub" feature.

When images of a room type are missing, which results in missing features, the
average of each feature is used to fill in the missing values. This method is also
used in a similar study and is a common way of handling missing values without
completely excluding the rows [4]. It is only applied when a row has at least one
visual feature; if a sale has only non-visual features, it is excluded.
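The aggregation and imputation rules can be sketched as follows. The function names
and the toy values are hypothetical, and in practice the column means would be taken
from the training data only.

```python
import numpy as np

def aggregate_sale(image_scores, image_flags):
    """Mean of a numerical score and 'any' of a binary flag over one sale's images."""
    score = float(np.mean(image_scores)) if len(image_scores) else float("nan")
    flag = bool(any(image_flags))
    return score, flag

def impute_and_filter(rows):
    """Drop sales without any visual feature, then fill gaps with the column mean.

    rows: 2-D array, one row per sale, NaN where a visual feature is missing.
    """
    rows = np.asarray(rows, dtype=float)
    has_visual = ~np.all(np.isnan(rows), axis=1)
    kept = rows[has_visual]
    col_mean = np.nanmean(kept, axis=0)
    return np.where(np.isnan(kept), col_mean, kept), has_visual

# One sale with two bathroom images, one of which shows a bathtub.
score, has_bathtub = aggregate_sale([0.2, 0.6], [False, True])

# Three sales and two visual features; the last sale has no visual features.
filled, has_visual = impute_and_filter([[0.2, np.nan], [0.4, 0.6], [np.nan, np.nan]])
```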

3.12 Automated Valuation Model
A crucial part of our study is the AVM, which highlights the importance of our visual
features related to the price. Within this thesis, three base models are chosen due
to their previous use as baseline models and the ease of extracting the importance
of the feature input.

Linear regression models and XGBoost [23] are typically used as baselines. An
advantage is that they also provide a feature importance that shows the weight or
importance of each feature [34].

Firstly, Ridge regression is a linear model that uses regularisation to improve
generalisation. A penalty term proportional to the sum of the squared coefficients
is added to the loss function. This controls the model’s complexity and reduces its
tendency to overfit the data. It is chosen over ordinary linear regression in our
case because it lowers the risk of overfitting on data with high variance. However,
the two share the same fundamental basis, the linearity assumption, and the
coefficients can be used to show the weight of each feature.

Secondly, XGBoost is an implementation of gradient-boosted decision trees. It aims
to improve prediction accuracy by using an ensemble of trees [107], in which each
new tree iteratively corrects the errors of the previous trees. A key difference to
the linear methods is that it can capture more complex patterns in the datasets.

Lastly, neural networks, or DNNs, which can have one or multiple neuron layers,
process data through these layers to recognise patterns in the dataset. They can
learn non-linear relationships, making them capable of finding more complex
relations and features. In the context of an AVM, they provide the benefit of being
able to process complex inputs such as images as part of the same model, capturing
patterns that might affect the valuation of the properties.

Both of the neural network AVMs used are trained in five stages with the mean
squared error loss (MSELoss) [108] shown below.


\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (t_i - p_i)^2
\tag{3.5}
\]

where N represents the batch size, t_i is the actual or true value and p_i is the
predicted value. The networks are trained with an initial learning rate of 0.01 that
is divided by ten after each run until the final run at 0.00001, with a batch size
of 512, early stopping with a patience of 4, and a weight decay of 0.00046. The
dropout rate in the model is halved between each run, starting at 10%.

In the selection of hyper-parameters for the AVMs, a grid search is performed to
find the highest-scoring combination of parameters. It is carried out by running
cross-validation on the training data with pre-decided ranges for the different
parameters, and the combinations are ranked by the lowest MAPE score achieved.

For the Ridge model, the hyper-parameter explored in this thesis is the alpha (α)
value, the constant that controls the regularisation strength by being multiplied
with the L2 term.
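A grid search over alpha scored by MAPE can be sketched with scikit-learn as below.
The alpha grid, the five folds, and the synthetic data are assumptions for
illustration; the values used in the thesis are reported with its results.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 10.0 + X @ np.array([1.0, 0.5, 0.0, 0.0, -0.8]) + rng.normal(scale=0.1, size=120)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},  # hypothetical range
    scoring="neg_mean_absolute_percentage_error",         # higher is better, so negated
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mape = -search.best_score_  # flip the sign back to a plain MAPE
```

scikit-learn maximises scores, which is why the MAPE scorer is negated and the sign
is flipped back when reading off the result.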

For the neural network, the training regularisation hyperparameter of weight decay,
the learning rate during the training stages, and patience for early stopping are
explored.

For the XGBoost model, the following hyperparameters are explored to prevent
overfitting: colsample_bytree, which indicates the fraction of features sampled per
tree; learning_rate, which controls the training step size; max_depth, which sets
the maximum tree depth; min_child_weight, which sets the minimum sum of instance
weight needed in a child; n_estimators, which sets the number of trees; subsample,
which uses a fraction of the data for each tree to generalise better; gamma, which
sets the minimum loss reduction required for a split and thereby makes the model
more conservative; and alpha, which applies L1 regularisation to the weights. The
resulting hyper-parameters used are described with the model in the result.

3.13 Scores
The final analysis of this thesis focuses on the AVM error rates with and without the
newly integrated features to capture the importance of the visual features regarding
market value prediction.

One important rule when validating models is to use an out-of-sample prediction
approach, where all models are scored on data that was unseen during the training
stage, with the aim of capturing how well they would perform in a real-case
scenario. Because various models are used and diverse models contribute to the
final visual feature pool, the dataset is split once in the pre-processing stage.
This single split prevents differing splits between models from causing in-sample
bias when extracting visual features.

One of the metrics used is the Mean Absolute Error (MAE) [109], which is the average
absolute difference between the prediction and the actual value. In our case, it
shows the amount in Swedish Krona (SEK) by which the predictions differ from the
actual selling price on average. The formulation of MAE can be seen below, where
p_i is the i:th prediction and t_i is the i:th actual value.

\[
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |t_i - p_i|
\tag{3.6}
\]

Another, more easily interpretable error metric is the Mean Absolute Percentage
Error (MAPE) score [110], which shows how far off an estimate is on average as a
percentage of the actual value. This makes it easier to get comparable results
between regions with different price levels. For example, a 200,000 SEK error on a
200,000 SEK property is an error of 100%, whereas the same error on a 2,000,000 SEK
property is an error of 10%. The formulation of MAPE can be seen below.

\[
\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{t_i - p_i}{t_i} \right|
\tag{3.7}
\]

Furthermore, another metric is chosen to highlight the proportionally worst
predictions: the MAPE on the 10% worst predictions, which shows how far off the
model is when it performs worst. Additionally, R² is the coefficient of
determination of a regression model. Its value shows the proportion of the variance
in the dependent variable, which is the selling price in our case, that is explained
by the model [111]. Finally, the Median Error Rate [112] is used to highlight the
centre of the errors. As the name suggests, it shows the median error, which makes
it robust against outliers because extreme errors do not shift the median the way
they shift an average.

This results in five scores that highlight different aspects of the error, giving a
broader picture of the differences before and after adding visual features and of
where improvements are made. However, the main focus of this thesis is the MAPE
score, because it is easy to interpret across the different property types.
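
The five scores can be sketched in a few lines of Python. This is a minimal
illustration using only the standard library, not the evaluation code of this
thesis. Two details are assumptions: the Median Error Rate is taken to be the
median of the absolute percentage errors, and the "10% worst predictions" are taken
to be those with the largest absolute percentage errors. The prices in the example
are invented.

```python
from statistics import mean, median


def percentage_errors(actual, predicted):
    """Absolute error of each prediction as a fraction of the actual value."""
    return [abs((t - p) / t) for t, p in zip(actual, predicted)]


def mae(actual, predicted):
    """Mean Absolute Error, Equation 3.6."""
    return mean(abs(t - p) for t, p in zip(actual, predicted))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, Equation 3.7 (as a fraction, not in %)."""
    return mean(percentage_errors(actual, predicted))


def worst_share_mape(actual, predicted, share=0.1):
    """MAPE restricted to the `share` of predictions with the largest errors."""
    errors = sorted(percentage_errors(actual, predicted), reverse=True)
    k = max(1, round(len(errors) * share))  # always keep at least one prediction
    return mean(errors[:k])


def median_error(actual, predicted):
    """Median of the absolute percentage errors (assumed definition)."""
    return median(percentage_errors(actual, predicted))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual variation / total variation."""
    t_mean = mean(actual)
    ss_res = sum((t - p) ** 2 for t, p in zip(actual, predicted))
    ss_tot = sum((t - t_mean) ** 2 for t in actual)
    return 1 - ss_res / ss_tot


# Invented selling prices and predictions in SEK.
actual = [2_000_000, 4_000_000, 1_000_000, 5_000_000]
predicted = [1_800_000, 4_400_000, 1_000_000, 4_000_000]

scores = {
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
    "MAPE worst 10%": worst_share_mape(actual, predicted),
    "Median error": median_error(actual, predicted),
    "R2": r_squared(actual, predicted),
}
```

In this example the single 1,000,000 SEK miss dominates the MAE and the worst-share
score, while the median error stays at the level of the typical prediction, which
is the contrast the different scores are meant to expose.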

Lastly, feature importance is assessed. In the case of Ridge Regression, features
are ordered by the absolute value of their coefficients, to focus on magnitude
rather than only on features with a positive effect. For XGBoost, two common
measures of feature importance are gain and weight. Gain indicates the contribution
made by the feature, and weight indicates how frequently it is used [113]. This
thesis focuses on improvements in terms of gain rather than usage, to align with the
goal of reducing the uncertainty.
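
The difference between these orderings can be illustrated with a small sketch. The
feature names, coefficients, and split gains below are invented for illustration,
and the helper functions are hypothetical. Gain is computed here as the average
gain over the splits that use a feature, and weight as the number of such splits;
in XGBoost's Python package, values of this kind are exposed through
`Booster.get_score(importance_type=...)`.

```python
from collections import defaultdict


def rank_by_abs_coefficient(coefficients):
    """Order features by coefficient magnitude, ignoring the sign."""
    return sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)


def gain_and_weight(splits):
    """Summarise (feature, gain) pairs, one pair per split in the trees."""
    total_gain = defaultdict(float)
    count = defaultdict(int)
    for feature, gain in splits:
        total_gain[feature] += gain
        count[feature] += 1
    weight = dict(count)                                   # how often a feature is used
    gain = {f: total_gain[f] / count[f] for f in count}    # average gain per use
    return gain, weight


# Invented Ridge coefficients: the negative one has the largest magnitude.
ridge_coefficients = {
    "living_area": 0.42,
    "distance_to_centre": -0.57,
    "kitchen_visual_score": 0.08,
}
ranking = rank_by_abs_coefficient(ridge_coefficients)

# Invented tree splits: one feature is used often, the other rarely but effectively.
splits = [
    ("living_area", 10.0),
    ("living_area", 2.0),
    ("living_area", 3.0),
    ("kitchen_visual_score", 12.0),
]
gain, weight = gain_and_weight(splits)
```

Here `living_area` has the higher weight (three splits against one), while
`kitchen_visual_score` has the higher gain, so the two measures can rank the same
features differently.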



4
Results

The following chapter shows the study’s results. First, the outcomes of the pre-
processing and the performance of the room classifier will be highlighted. Next, the
results of the self-supervised attention will be compared visually. Then, the results
of the visual feature extraction process will be displayed. Finally, the AVM score
and the importance of the features of the models will be exhibited.

4.1 Classification of Images
The hierarchical classification approach to labelling the rooms began with
separating interior images from exterior images. Different models were compared,
leading to the selection of a ViTS14 with the pre-trained weights from DINO V2
[103], [104]. The model was trained using a cross-entropy (CE) loss function [114],
whose binary and multi-class forms can be seen below.

\[ \mathrm{CE}_{\text{binary}} = -\bigl( y \log(p) + (1 - y) \log(1 - p) \bigr) \tag{4.1} \]

In the binary version, y denotes the actual label, which is either 0 for false or 1
for true, and p represents the predicted probability that the label is true.

\[ \mathrm{CE}_{\text{multi-class}} = -\sum_{i=1}^{N} y_i \log(p_i) \tag{4.2} \]

In the multi-class version, N represents the number of classes, and y_i is a binary
indicator that is 1 if class i is the correct class and 0 otherwise. Also, p_i
denotes the predicted probability for class i.
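
Equations 4.1 and 4.2 can be written out for a single sample as follows. This is a
minimal sketch in plain Python, not the training code of this thesis. The clipping
constant `EPS`, which keeps the logarithm finite when a predicted probability is
exactly 0 or 1, is an added assumption.

```python
import math

EPS = 1e-12  # assumed clipping constant to avoid log(0)


def binary_cross_entropy(y: int, p: float) -> float:
    """Equation 4.1: y is the true label (0 or 1), p the predicted probability of 1."""
    p = min(max(p, EPS), 1 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multiclass_cross_entropy(y: list, p: list) -> float:
    """Equation 4.2: y is a one-hot label vector, p the predicted class probabilities."""
    return -sum(y_i * math.log(max(p_i, EPS)) for y_i, p_i in zip(y, p))


# A confident correct prediction gives a small loss, a confident wrong one a large loss.
loss_confident_correct = binary_cross_entropy(1, 0.9)
loss_confident_wrong = binary_cross_entropy(1, 0.1)

# With a one-hot label only the probability of the correct class contributes.
loss_three_classes = multiclass_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```

With two classes and probabilities (1 − p, p), the multi-class form reduces to the
binary form, which is why the same loss can be used in both classification stages.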

The model was trained with an initial learning rate of 0.001, which was lowered to
0.0001 for fine-tuning. The head of this model is shown in Figure B.4 in Appendix B.

After pseudo-labelling, the model was re-trained using the same loss function and
learning rates. Figure 4.1 shows the final confusion matrix on the test dataset. It
highlights an excellent ability to distinguish the classes, with a few instances
where the model was confused.

Figure 4.1: Confusion matrix of interior and exterior classifications

For the second labelling stage, the same ViT base model was used with a similar
head, but ending with a Sigmoid function and one neuron per room type, as shown
in Figure B.5 in Appendix B. It was trained with an initial learning rate of 0.001,
which changed to 0.00001 during fine-tuning, and an early stop patience of 4 while
the maximum number of epochs was set to 100.
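
The stopping rule described above can be sketched as follows. The sketch assumes
that the monitored quantity is the validation loss, which is not stated in the
text, and it replaces the actual training loop with a list of per-epoch validation
losses. The function name `early_stopping_epochs` and the loss values are
hypothetical.

```python
def early_stopping_epochs(validation_losses, patience=4, max_epochs=100):
    """Return (best_epoch, last_epoch) for a run with early stopping.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs, or after `max_epochs` epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted: stop training here
    return best_epoch, last_epoch


# Invented losses: the best value occurs at epoch 2 and is followed by four
# epochs without improvement, so the later drop to 0.5 is never reached.
losses = [0.9, 0.7, 0.6, 0.65, 0.64, 0.63, 0.62, 0.5]
best_epoch, last_epoch = early_stopping_epochs(losses)
```

In practice the weights from `best_epoch` would typically be the ones kept, so a
larger patience trades longer training for a smaller risk of stopping on a
temporary plateau.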

Thereafter, a threshold was determined for each room type, above which the model
output indicates that an image corresponds to that room type. Table 4.1 shows the
thresholds, each selected as the one giving the highest F1 score on the validation
data for that room type. Consequently, these thresholds indicate a high F1 score and
ability to differentiate the room types