Embedding-Enhanced Real Estate Valuation in Non-Metropolitan Sweden
A Hybrid Modeling Approach

Master's thesis in Complex Adaptive Systems
Leonard Smedenman & Teddy Sallén
Department of Physics
Chalmers University of Technology
Gothenburg, Sweden 2025
www.chalmers.se

© Leonard Smedenman & Teddy Sallén, 2025.
Supervisor and Examiner: Mats Granath, Director, M.Sc. Complex Adaptive Systems
Master's Thesis 2025
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Symbolic image of AI in housing. Source: Primary.
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice, Gothenburg, Sweden 2025

Abstract
Automated valuation of residential properties in sparsely populated regions poses unique challenges due to thin transaction volumes, diverse housing stock, and limited comparables. This thesis presents a hybrid modeling approach combining an embedding-based artificial neural network (ANN) with a LightGBM gradient boosting machine to predict sale prices in six Swedish municipalities, focusing specifically on houses in non-metropolitan areas. The ANN learns dense representations of categorical and geographic features that capture latent spatial and socioeconomic patterns, while the GBM leverages both raw features and ANN embeddings to refine residual errors.
Model interpretability is achieved via SHAP values and case studies of embedding dimensions, revealing that distance to regional centers, living area, property condition, and proximity to points of interest are key value drivers, even where market data are scarce. The hybrid model demonstrates competitive accuracy, particularly for mid-priced homes, and offers transparent explanations for each valuation. However, large errors persist for rare, high-end properties and extremely remote dwellings, reflecting fundamental data limitations. The results highlight how AI-driven valuation tools can complement traditional appraisal methods by providing rapid, interpretable estimates for routine cases and flagging high-uncertainty transactions for expert review.

Keywords: Automated Valuation Model, real-estate appraisal, neural embeddings, gradient boosting, SHAP interpretability, non-metropolitan housing.

Acknowledgements
We would like to express our gratitude to everyone who has contributed to the completion of this project. Firstly, we would like to thank our contact persons at Värderingsdata, Magnus Persson, Jon Larborn and Niklas Stenwreth. Without their assistance, guidance and knowledge, this project would not have been as successful. We would also like to extend our appreciation to our supervisor and examiner Mats Granath for accepting the role and providing input. Thank you for all your contributions.
Sincerely,
Leonard Smedenman & Teddy Sallén, Gothenburg, June 2025

List of Acronyms
Below is the list of acronyms used throughout this thesis, in alphabetical order:

AI      Artificial Intelligence
ANN     Artificial Neural Network
AVM     Automated Valuation Model
DeSO    Demographic Statistical Areas
GBDT    Gradient Boosted Decision Trees
GBM     Gradient Boosting Machine
GRP     Gross Regional Product
HPM     Hedonic Pricing Model
KNN     k-Nearest Neighbors
KTH     Kungliga Tekniska Högskolan
LGBM    Light Gradient Boosting Machine
MAE     Mean Absolute Error
MAPE    Mean Absolute Percentage Error
ML      Machine Learning
MSE     Mean Squared Error
NN      Nearest Neighbor
P10     Percentage of predictions within ±10% of sale price
P20     Percentage of predictions within ±20% of sale price
R2      Coefficient of Determination
RMSE    Root Mean Squared Error
SHAP    SHapley Additive exPlanations
t-SNE   t-Distributed Stochastic Neighbor Embedding

Nomenclature
Below is the nomenclature of indices, hyper-parameters and constants, parameters, variables, and metrics used throughout this thesis.
Indices
i   Index for property / transaction in the dataset
j   Index for input feature Xj in the hedonic model
m   Index of boosting iteration / tree (hm, Fm)
c   Index of price-quantile class in the auxiliary classifier (pi,c)

Hyper-parameters and constants
α   Weight of the P10 term in the composite loss (annealed from αstart to αend)
γ   Focusing parameter of the focal classification loss
δ   Huber-loss threshold that separates MAE/MSE regimes
k   Number of neighbours in the kNN component
m (margin)   Margin in the triplet-loss constraint
ν   Shrinkage (learning-rate) parameter in gradient boosting
B   Mini-batch size used during stochastic optimisation
wc, wt   Fixed weights of classification and triplet losses in Ltotal

Variables
x   Raw feature vector of a property
x′   Standardised feature: (x − µ)/σ
µ, σ   Empirical mean and standard deviation of a feature
y   True log-transformed sale price (target)
ŷ   Predicted log-price produced by the model
V, V̂   Price on the original SEK scale (V̂ = e^(ŷσ+µ))
zj, aj   Pre-activation and activation of neuron j in the ANN
W(ℓ), b(ℓ)   Weight matrix and bias vector of layer ℓ
e   128-dimensional learned embedding of a property
pi,c   Probability that property i belongs to class c (softmax output)
ri,m   Residual of sample i at boosting stage m
hm(x)   Weak learner (regression tree) at stage m
Fm(x)   Ensemble prediction after m trees

Losses
Lreg   P10-aware regression loss (Huber + soft-P10)
Lcls   Focal classification loss
Ltriplet   Triplet embedding loss
Ltotal   Composite training objective Lreg + (1 − α)wcLcls + wtLtriplet

Evaluation metrics
n   Number of observations in a sample or split
MAPE, MAE, RMSE   Standard error statistics defined in Section 2.7
R2   Coefficient of determination
P10, P20   Share of predictions within ±10% and ±20% of the true price, respectively

Contents
List of Acronyms
Nomenclature
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Traditional Valuation Methods in Sweden
  1.4 Rationale for ML-Based Valuation
  1.5 Dataset
  1.6 Objectives / Research Questions
    1.6.1 Objectives
    1.6.2 Research Questions
  1.7 Scope and Delimitations
2 Theory
  2.1 Price Prediction and Regression Models
    2.1.1 Overview of house price prediction as a regression task
    2.1.2 Hedonic regression models
    2.1.3 K-Nearest Neighbors
  2.2 Feature Engineering
    2.2.1 Logarithmic Transformation of Skewed Variables
    2.2.2 Label Encoding
    2.2.3 Feature Standardization
  2.3 Neural Networks for Regression
  2.4 Loss Functions
    2.4.1 Huber Loss
    2.4.2 Combining Loss Functions in Regression Models
  2.5 Gradient Boosting and LightGBM
    2.5.1 The Gradient Boosting Process
    2.5.2 Gradient Boosting in Real Estate Valuation
  2.6 Overfitting
  2.7 Model Evaluation Metrics
    2.7.1 Mean Absolute Percentage Error (MAPE)
    2.7.2 Mean Absolute Error (MAE)
    2.7.3 Root Mean Squared Error (RMSE)
    2.7.4 Coefficient of Determination (R2)
    2.7.5 P10 and P20
    2.7.6 SHAP values – feature importance and interpretability
    2.7.7 t-SNE visualizing a high-dimensional representation
3 Methodology
  3.1 Data Preprocessing
    3.1.1 Cleaning and Imputation
    3.1.2 Categorical Encoding and Vocabulary Extraction
    3.1.3 Proportion Clipping and Cyclical Date Features
    3.1.4 Feature Selection and Scaling
  3.2 Model Development
    3.2.1 Training, Validation, and Test Split
    3.2.2 Artificial Neural Network with Embeddings
      3.2.2.1 Input and Embedding Layers
      3.2.2.2 Residual Stack and Embedding Head
      3.2.2.3 Multi-Task Output Heads
      3.2.2.4 Composite Loss Function
        3.2.2.4.1 P10-Aware Regression Loss Lreg
        3.2.2.4.3 Focal Classification Loss Lcls
        3.2.2.4.4 Triplet Embedding Loss Ltriplet
      3.2.2.5 Optimization and Regularization
    3.2.3 LightGBM Ensemble with Raw Features and ANN Embeddings
      3.2.3.1 Stage 1: Raw-Feature GBM
      3.2.3.2 Stage 2: Embedding-Based Residual GBM
      3.2.3.3 Combined Prediction and Performance
  3.3 Benchmark Models
    3.3.1 Hedonic Regression Baseline
    3.3.2 KNN
    3.3.3 Baseline Model Configurations
4 Summary of Findings
  4.1 Comparative Evaluation
    4.1.1 Model Performance by Price Decile
  4.2 Error analysis
    4.2.1 Error Distribution
    4.2.2 Case Studies: Best and Worst Predictions
    4.2.3 Case Studies of Selected Transactions
  4.3 Embedding Analysis
    4.3.1 Embeddings Clustering
    4.3.2 t-SNE Projection of Embeddings
    4.3.3 Embedding-Feature Correlation Analysis
  4.4 Model Interpretability
    4.4.1 SHAP Analysis on Raw Features
      4.4.1.1 Property Attributes and Size/Quality Effects
      4.4.1.2 Location
      4.4.1.3 Categorical Location Effects
    4.4.2 Raw Feature Importance by Gain
    4.4.3 Quantified Embedding Importance
      4.4.3.1 Gain-Based Embedding Importance
      4.4.3.2 SHAP Analysis on Embeddings
      4.4.3.3 Case Studies of Three Different Embedding Dimensions
  4.5 Demographic statistical areas analysis
  4.6 Model Proficiency
5 Conclusion
  5.1 Key Factors Influencing Property Values
  5.2 Model Performance and Limitations
  5.3 Implications for Low-Density Housing Markets
  5.4 Future Work
Bibliography
A Appendix 1
References

List of Figures
2.1 Overview of house price prediction as a regression task. Input features are mapped through a regression model to produce a continuous output (price). Source: Primary.
2.2 A multi-layer feed-forward Artificial Neural Network with an input layer, one hidden layer, and an output layer. Each connection has a weight, and each neuron (circle) computes a function of the weighted inputs. Source: Primary.
2.3 Comparison of Huber loss (green) with standard squared error loss (blue) as a function of the prediction residual. Source: Qwertyus, https://en.wikipedia.org/wiki/Huber_loss#/media/File:Huber_loss.svg
3.1 A simple diagram visualizing the steps of the hybrid model. Source: Primary.
3.2 An illustration of how the composite loss function penalizes wrong predictions as α increases. Source: Primary.
3.3 An illustration of how the composite loss penalizes wrong predictions as α increases. Source: Primary.
3.4 Illustration of focal loss: as γ increases, well-classified examples (high p) are down-weighted, i.e., their loss goes to zero faster, which helps focus the training on more difficult examples (low p). The difference might look small but is quite tangible in practice. Source: Primary.
3.5 Triplet embedding loss. For δ ≤ −margin, the negative sample is at least "margin" farther away than the positive, giving zero loss. For δ > −margin, the loss grows linearly with δ + margin. Source: Primary.
3.6 Illustration of KNN regression. A new house (vertical dashed line at 180 m²) is valued by averaging the prices of its 5 nearest neighbors (orange points) among a sample of training homes. Source: Primary.
4.1 Absolute error distribution (left) and relative error distribution (right) for the hybrid model.
4.2 Scatter plot of true (x-axis) versus predicted (y-axis) prices; perfectly correct predictions would align with the red dotted line.
4.3 Boxplots of sale-price distributions for five clusters obtained by applying K-means to the 128-dimensional neural network embeddings.
4.4 Two-dimensional t-SNE projection of the 128-dimensional ANN embeddings for each property, colored by log sale price. Points that cluster together share similar learned representations.
4.5 SHAP summary plot for the model, showing each feature's contribution to the predicted price (x-axis) and the distribution of feature values (color) across observations.
4.6 Feature importance plot illustrating the top 20 influential raw features in the LightGBM model, ranked by gain (total reduction in the loss function). The horizontal bars represent the relative contribution of each feature to the predictive performance, highlighting LogDistMediumCity, LogUtilityArea, and LogLivingArea as the most impactful features for predicting real estate prices.
4.7 Embedding gains on the residuals of the LightGBM trained on raw features; the gain refers to the reduction of the loss function.
4.8 SHAP summary plot for 20 embedding dimensions, showing each embedding's impact on predicted price and its value distribution.

List of Tables
3.1 Overview of baseline models and their configurations.
4.1 Test Set Performance Comparison of All Models.
4.2 Hybrid Model Performance by True Price Decile.
4.3 The best and worst predictions made by the model in absolute terms. The sale prices are index-adjusted to 2020-06, hence the seemingly odd prices.
4.4 Comparison of the model's 7th worst prediction and its nearest neighbor (NN1) (see Table 4.3) between two nearly identical property records, one from the training set (true sale price 12,500,000 SEK) and one from the test set (true sale price 4,313,017 SEK), showing adjusted true vs. predicted sale prices and key features (more on key features in Section 4.4), highlighting a likely duplicate entry.
4.5 Comparison of the model's 14th worst prediction (see Table 4.3) between the transaction termed the Anomaly and its five nearest neighbors (NN1–NN5), showing true sale prices and key features.
4.6 Performance Metrics by DesoClass.
A.1 Table of all included counties and municipalities in the dataset.

1 Introduction
Advancements in artificial intelligence (AI) offer new opportunities for real estate valuation, especially in data-scarce markets. This study explores the use of Machine Learning (ML) models, specifically Artificial Neural Networks (ANNs) and gradient boosting, to improve valuation accuracy in sparsely populated regions of Sweden where traditional methods face significant limitations.
1.1 Background
Real estate valuation plays a central role in the functioning of the property market and financial system. Accurate property values are needed for a range of purposes, including sales and purchases, taxation, investment analysis, and securing mortgage loans [1]. In Sweden, official assessments of property value ("taxeringsvärde") are determined periodically by the national tax authority (Skatteverket) and are intended to reflect approximately 75% of market value for taxation purposes [2]. These assessments rely on recent sale prices of comparable properties within defined value areas (värdeområden) where properties are assumed to have similar conditions. However, in parts of Sweden that are generally more non-urban, such as the provinces of Östergötland, Småland, Gotland and Blekinge (see A.1 for the full list of included counties and municipalities), property transactions occur less frequently, particularly for houses, leading to thin markets with scarce comparable sales data [3]. In these areas, traditional indicators of market value become less reliable or even nonexistent, as noted historically in legal preparatory works that questioned the applicability of a market value concept in locales with virtually no sales activity. This poses challenges for property owners, buyers, and lenders, as valuation uncertainty increases outside urban centers.

Traditional real estate appraisal in Sweden has long been based on professional judgment supported by standard methods. These conventional approaches, while grounded in decades of experience, often struggle to capture market dynamics in real time, especially when data on actual transactions are limited.

1.2 Problem Description
Valuing properties in sparsely populated regions like the ones this thesis focuses on presents significant challenges due to the limited number of transactions and the diverse nature of the properties. The standard sales comparison approach, which
relies on identifying recently sold comparable properties in a given area, becomes less reliable when few or no truly similar sales exist. In these areas, appraisers may be forced to base valuations on a very small sample of transactions, increasing the risk of error. Moreover, non-urban properties often possess unique features, such as old building years or large plots of land, which make direct comparisons difficult. These factors contribute to considerable uncertainty in valuation and highlight the need for more flexible, data-driven approaches in sparsely populated markets.

Due to the limited availability of market data, valuers may be forced to rely on alternative methods or general assumptions. For example, cost-based or income-based valuations may be used in place of direct market comparisons. These methods, however, may not reflect what a buyer would actually pay, especially if there are intangible values associated with location and amenities that are not captured by cost or income alone. As a result, valuations in these areas carry a higher degree of uncertainty and risk. This is problematic not only for private stakeholders but also for banks and public agencies. Lenders face difficulties in mortgage risk assessment when valuations are uncertain, and municipalities or tax authorities struggle to ensure fairness and accuracy in taxation when comparable sales are lacking [3]. Recent market fluctuations have highlighted this issue: during periods of market downturn or upheaval, transaction volumes can drop sharply. In 2022, for example, the transaction volume in Sweden fell by over 40% year-on-year, creating an extremely thin market [3] and making it even harder to gauge true property values in affected regions.

1.3 Traditional Valuation Methods in Sweden
Real estate valuers traditionally employ a few fundamental methods to estimate market value, each with its own assumptions and data requirements.
One such method is the sales comparison approach, mentioned in the previous section, wherein the appraiser identifies recent sales of similar properties and adjusts for differences to estimate the subject property's value. In Sweden, hedonic pricing models (HPMs) based on multiple regression are used to support both private appraisals and mass appraisal for tax assessment. These models generalize the relationship between property characteristics and market prices within a given region [1]. HPMs represent property value as a function of its attributes, such as size, location, and quality, and have been foundational in valuation theory since 1974 [4]. They are relatively transparent and grounded in economic theory, but they typically assume a linear (or log-linear) relationship and may struggle with complex, non-linear interactions between features.

All these traditional methods require substantial expertise and judgment. Appraisers must carefully select comparables or estimate depreciation. In thin markets, the lack of data forces greater reliance on professional judgment, potentially introducing bias or error. Moreover, manual valuation processes are time-consuming and not easily scalable. As the demand grows for rapid valuations, traditional methods show their limitations in terms of speed and consistency [5]. These limitations motivate the search for more automated and data-driven valuation methods that can complement or enhance the traditional techniques.

1.4 Rationale for ML-Based Valuation
Advancements in ML offer promising opportunities to address the challenges of non-urban property valuation. Automated Valuation Models (AVMs) are increasingly being used in real estate markets worldwide to produce instant value estimates by analyzing large datasets of property features and past transactions [6].
An AVM is a computer-driven algorithm that takes property data as input and outputs a value estimate, often very rapidly, making it attractive for both lenders and investors who need quick assessments [5]. The key advantage of ML-based models is their ability to detect complex, non-linear patterns in data that traditional linear models fail to capture. This is particularly relevant for diverse non-urban properties, where interactions between attributes, like land size, building condition, and locational factors, may influence value in complicated, often non-obvious ways.

Among ML techniques, one that stands out is the Artificial Neural Network. ANNs are computational models inspired by the human brain, capable of fitting extremely flexible functional forms to data. They have shown promise in house price prediction tasks; for instance, studies have found that neural networks can outperform multiple regression models and other techniques in terms of valuation accuracy [7]. By learning from a broad set of input examples, an ANN can capture subtle relationships. The downside is that ANNs are often criticized as "black boxes", offering little transparency into how they arrive at a given estimate [8]. This lack of interpretability can undermine trust in valuations, since a valuation without a clear basis makes for a weak argument to present to a stakeholder, which is why recent research emphasizes explainable AI methods in real estate applications.

Embracing AI for property valuation in Sweden's non-urban context is not just a theoretical exercise; the industry has already begun moving in this direction. Banks and valuation firms in Sweden are experimenting with AI-driven models to complement traditional appraisals, especially for residential properties.
According to Värderingsdata, a leading provider of property data in Sweden, AI-based valuation models are already used in practice and can drastically speed up the valuation process, allowing human experts to focus on more complex analysis [5]. In non-urban areas, an ML model might, for example, learn from transactions in a wider region or over a longer time horizon to compensate for the lack of recent local sales.

The rationale for this study is thus clear: by applying ML techniques to the problem of non-urban property valuation in Östergötland, Småland, Gotland, and Blekinge, the thesis aims to assess whether these methods can improve accuracy and consistency over traditional approaches. The practical considerations of using such models will also be examined, including data requirements and the interpretability of results. This thesis aims to bring meaningful insights into what features, or combinations of features, are deemed most important in the chosen focus group and how they differ from, for instance, apartments in urban areas. The ultimate goal is to develop an Automated Valuation Model tailored to non-urban Swedish conditions, or at least to evaluate its feasibility.

The quality of an AI-based valuation is heavily dependent on the quality and quantity of the input data; poor or biased data can lead to misleading estimates [5]. Additionally, stakeholders must be able to trust the output of a model, which yet again highlights the importance of transparency and validation [8]. This study is undertaken with these considerations in mind. By focusing on a geographically specific and data-challenged context, the research will highlight not only the potential accuracy gains from ML, but also the limitations and requirements for deploying such technology in real-world valuation practice.
1.5 Dataset
The dataset used in this study was provided by Värderingsdata and comprises roughly 90,000 residential properties with transactions ranging from 2015 to 2022. Each object represents an individual sale of a property (the same property can thus appear more than once if it was sold multiple times during the timeframe). The data includes approximately 170 features representing various physical, geographic, socioeconomic, and temporal characteristics. A column with sale prices index-adjusted to 2020-06 allows for a fair comparison across all transaction years.

The dataset is organized into several feature domains. Object-level features describe each property's individual characteristics, including variables such as living area, construction year, energy class, and water and sewage access. Neighborhood-level characteristics capture sociodemographic and economic indicators from the surrounding area, including population age distribution, household types, education levels, income distribution, and local real estate market statistics. Macroeconomic indicators such as interest rates, gross regional product (GRP), and inflation measures are also included, contextualizing each transaction within broader market conditions. Geospatial attributes incorporate detailed locational data, including distances to various urban centers, natural features (like lakes and coast), infrastructure (like roads, rail, and airports), and points of interest such as golf courses, schools, and ski resorts. Additionally, temporal variables encode the time dimension of each transaction, with fields like sale year, month, and day of the week.

The diversity and detail of the data offer a rich foundation for statistical learning. While some variables contain missing values, the overall completeness is high.
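To illustrate the index adjustment of sale prices mentioned above, the sketch below rescales nominal prices to the 2020-06 price level; the index values and function names are hypothetical (the thesis uses a precomputed column supplied by Värderingsdata), but the mechanics are the same:

```python
# Hypothetical house-price index by month; 2020-06 is the reference level.
# These numbers are illustrative only, not actual Swedish index values.
PRICE_INDEX = {
    "2015-06": 78.0,
    "2018-06": 92.0,
    "2020-06": 100.0,
    "2022-06": 118.0,
}

def adjust_price(nominal_price: float, sale_month: str,
                 ref_month: str = "2020-06") -> float:
    """Rescale a nominal sale price to the reference month's price level."""
    return nominal_price * PRICE_INDEX[ref_month] / PRICE_INDEX[sale_month]

# A house sold for 1,560,000 SEK in 2015-06 corresponds to
# 1,560,000 * 100 / 78 = 2,000,000 SEK at the 2020-06 price level.
```

Adjusting all transactions to one reference month is what makes prices from 2015 and 2022 directly comparable as a single regression target.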
Ideally, the dataset would include even more detailed object-specific features, such as the number of rooms, construction material, window type, roof condition, heating system type, floor material, ceiling height, and the presence of amenities such as a balcony, fireplace, or integrated household appliances, but these types of data are not easily obtained; most of the columns instead pertain to more regional data. More specific data on the condition of the houses and their appliances could enable more accurate valuations, but the variety of data available is sufficient to make meaningful distinctions.

1.6 Objectives / Research Questions
The objectives of the master's thesis are summarized in the research questions defined below.

1.6.1 Objectives
1. To develop an ML-based model for property appraisal.
2. To assess the accuracy and reliability of the model in comparison to alternative methods.
3. To identify the most influential factors in property valuation as determined by the model, and what features contribute to the valuation.
4. To uncover what non-obvious features, and interactions between features, might be specifically important in non-urban housing.

1.6.2 Research Questions
1. How does an AI-driven model compare to benchmark models in terms of accuracy and overall performance?
2. What are the key factors influencing non-urban property valuation in the Småland, Östergötland, Gotland and Blekinge regions, as identified by the model?
3. What challenges and limitations arise when applying machine learning techniques to real estate valuation, and how can they be mitigated?
4. What insights can be gained from this study to inform future advancements in property valuation processes?
1.7 Scope and Delimitations
To maintain a clear and manageable scope, the following delimitations were applied:
• Property Types: The analysis is restricted to residential properties, such as family homes and small non-urban dwellings. However, assessments of very cheap smaller houses are discarded, since their sale prices and features rarely coincide and thus only provide noise.
• Temporal Scope: Transaction data used for training and evaluation are limited to a defined historical period, 2015-2022.
• Model Focus: The study concentrates on a hybrid model consisting of an ANN and gradient boosting.
• Model Inputs: Only the structured dataset provided by Värderingsdata is used in this study. No additional data collection from external sources has been conducted, although lagged features derived from the original data have been created.
• Comparison Baseline: The performance of the model is compared to other traditional computational models. Manual expert appraisals are referenced for context but not replicated in this study.
• Outcome Metrics: Model performance is evaluated primarily using statistical measures of predictive accuracy (e.g., RMSE, MAE, MAPE, P10/P20). Broader impacts such as user acceptance and regulatory considerations are not tested.
• Implementation Context: The study is exploratory and does not include real-time deployment or integration of the developed models into production environments used by Värderingsdata.

2 Theory
This chapter outlines the theoretical foundations of property valuation and machine learning, providing the conceptual framework for the methods used in the study.

2.1 Price Prediction and Regression Models
Predicting housing prices is a key challenge in real estate economics and data science. This section explores how regression models are applied to estimate property values based on various input features.
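As a minimal, self-contained sketch of this regression framing (features in, price out), the example below fits a linear price model by ordinary least squares on synthetic data; all features, coefficients, and numbers are invented for illustration, not taken from the thesis dataset:

```python
import numpy as np

# Synthetic properties: living area (m^2), distance to town (km),
# condition (0-2). The "true" coefficients are chosen arbitrarily.
rng = np.random.default_rng(0)
X = rng.uniform([40, 0, 0], [250, 60, 2], size=(500, 3))
true_beta = np.array([12_000.0, -8_000.0, 150_000.0])
y = 500_000 + X @ true_beta + rng.normal(0, 50_000, size=500)  # price in SEK

# Ordinary least squares with an intercept column: solve min ||A b - y||^2.
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# beta_hat[0] approximates the intercept; beta_hat[1:] approximate the
# per-feature price effects used by the hedonic models of Section 2.1.2.
```

The recovered coefficients are directly interpretable (SEK per m², SEK per km, and so on), which is exactly the transparency property that hedonic models trade against flexibility.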
2.1.1 Overview of house price prediction as a regression task

House price prediction is the task of estimating a property’s market value from its attributes. It is framed as a regression problem because the target output (price) is a continuous variable. In a regression model, the house’s features serve as input variables and the output is a predicted price. The goal is to learn a mapping f that relates these features to the sale price by training on historical sales data. House price prediction is therefore a classic example of supervised regression analysis in real estate economics and machine learning [9]. An illustration with three arbitrary features is shown in Fig. 2.1.

Figure 2.1: Overview of house price prediction as a regression task. Input features (e.g., property size in sq.m., location coordinates, property condition) are mapped through a regression model f(X, β) to produce a continuous output (price). Source: Primary.

2.1.2 Hedonic regression models

A cornerstone of traditional house valuation is hedonic regression. This method models a property’s value

V = f(X1, X2, . . . , Xn) (2.1)

as a function of its characteristics. In practice, f is often assumed linear:

V = β0 + β1X1 + · · · + βnXn + ϵ (2.2)

where each Xj is a property feature (size, location, etc.) and βj its estimated effect on price. Each coefficient thus represents the contribution of that feature, making the model easy to interpret. Hedonic regression has been frequently used for decades in market analysis and mass appraisal [10] because of its simplicity and transparency. However, the linear additive assumptions of hedonic models can be limiting. A simple hedonic model may fail to capture complex or non-linear relationships (for example, varying impacts of property age on market value) or interactions between factors.
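For illustration, the linear hedonic form of Eq. (2.2) can be fitted by ordinary least squares. The following is a minimal sketch; the feature columns (living area, plot size, renovation flag) and all numbers are made up for illustration and do not come from the thesis data:

```python
import numpy as np

# Hypothetical toy data: living area (sq.m.), plot size (sq.m.), renovated flag.
X = np.array([
    [120.0,  800.0, 1.0],
    [ 90.0,  600.0, 0.0],
    [150.0, 1200.0, 1.0],
    [ 70.0,  500.0, 0.0],
    [110.0,  900.0, 1.0],
])
y = np.array([2.4e6, 1.6e6, 3.1e6, 1.2e6, 2.2e6])  # sale prices in SEK

# Add an intercept column and solve V = beta_0 + beta_1*X_1 + ... by least squares.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

predicted = X1 @ beta  # fitted hedonic prices
```

In this formulation the fitted coefficients β are read off directly as the per-unit price contributions of each feature, which is exactly the interpretability the hedonic approach is valued for.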
Moreover, hedonic regression requires high-quality data containing the key variables; omitted variables or sparse data can lead to biased, unreliable estimates [11]. Furthermore, hedonic models are sensitive to multicollinearity: adding many features without careful consideration can therefore lead to lopsided or misleading results, which calls for an informed user in order to obtain accurate results.

2.1.3 K-Nearest Neighbors

An alternative non-parametric approach to hedonic regression is the K-Nearest Neighbors algorithm (KNN), originally proposed by Fix and Hodges and later formalized by Cover and Hart [12]. Instead of specifying a functional form for the relationship between property characteristics and price, KNN assumes that similar properties have similar market values. For an object with feature vector X = (X1, X2, . . . , Xn), the set of its k nearest neighbors in the training data is defined as:

Nk(X) = { (X(i), V(i)) : X(i) is among the k closest points to X }.

The predicted value V̂ is then computed as the average of the neighbor prices:

V̂(X) = (1/k) Σ_{i ∈ Nk(X)} V(i). (2.3)

By using a distance metric, the model captures non-linear relationships and interactions without explicit model assumptions. The method is intuitive and straightforward to implement, but can become computationally expensive for large datasets and suffer from the “curse of dimensionality” as the feature space grows [13]. It is also sensitive to noise and unevenly distributed data. Nonetheless, KNN remains a popular baseline in real estate valuation studies and in graph-based extensions where local similarity is leveraged.

2.2 Feature Engineering

Effective feature engineering is essential for extracting maximal predictive power from structured data. In real estate valuation, raw inputs can include highly skewed numeric variables, high-cardinality categorical variables and proportional features due to the inherently diverse nature of housing.
This section reviews the theory behind each transformation applied in the code.

2.2.1 Logarithmic Transformation of Skewed Variables

Many real estate attributes exhibit a long right tail, where a small fraction of high-end properties inflate the mean and violate Gaussian assumptions. Applying the natural logarithm compresses large values more than small ones, stabilizing variance and often improving both linear and non-linear model performance [14]. In economic contexts, log-errors correspond to relative errors, making them more interpretable when predicting quantities that span multiple orders of magnitude.

2.2.2 Label Encoding

Simple categorical fields with low cardinality can advantageously be converted to integer labels via a label encoder, which preserves uniqueness but imposes an arbitrary order. While tree-based models are unaffected by ordinal label codes, neural networks can learn embeddings on these integer indices.

2.2.3 Feature Standardization

Features with heterogeneous scales, for example living area in square meters versus a log-adjusted price, can dominate optimization and distance metrics. Z-score standardization,

x′ = (x − µ) / σ,

centers each feature to zero mean and unit variance, facilitating stable gradient descent in neural networks and balanced Euclidean distances in k-nearest neighbors [15].

2.3 Neural Networks for Regression

Artificial Neural Networks are a class of models inspired by the human brain, composed of interconnected units called neurons organized in layers. An ANN typically consists of an input layer, which takes in the features, one or more hidden layers that transform the inputs through weighted connections, and an output layer that produces the prediction. Each connection between neurons has a weight that amplifies or reduces the signal, and each neuron applies a non-linear activation function to the weighted sum of its inputs.
Through a learning process, these weights are adjusted so that the network outputs accurate predictions on the training data [16]. A simple illustration of a neural network is shown in Fig. 2.2.

Figure 2.2: A multi-layer feed-forward Artificial Neural Network with an input layer, one hidden layer, and an output layer. Each connection has a weight, and each neuron (circle) computes a function of the weighted inputs. Source: Primary.

Mathematically, a simple ANN with one hidden layer can be described as follows. Suppose there are d input features x1, . . . , xd. Each hidden neuron j computes a linear combination

zj = Σ_{i=1}^{d} w^{(1)}_{ij} xi + b^{(1)}_j (2.4)

and then applies a non-linear activation aj = f(zj), where f could be a ReLU or sigmoid function. The output layer then takes these hidden activations and computes the final output:

ŷ = Σ_j w^{(2)}_j aj + b^{(2)} (2.5)

(for a regression network, a linear activation is often used at the output so that ŷ is a continuous number). In vectorized form, the network function is:

ŷ = W^{(2)} f(W^{(1)} x + b^{(1)}) + b^{(2)} (2.6)

The key point is that by composing two (or more) linear transformations with non-linear activations, the network can approximate very complex functions. In fact, the Universal Approximation Theorem states that a sufficiently large neural network can approximate any continuous function on compact domains to arbitrary accuracy, given enough neurons in the hidden layer [17]. This explains why neural networks are so useful for predicting house prices: they can learn complex relationships between features.

ANNs learn the weights from data through an iterative optimization process called backpropagation combined with gradient-based optimizers. The network starts with random weights, and in each training epoch the predictions are compared to the true prices using a loss function (discussed in more depth in the Methodology chapter 3).
The gradient of the loss with respect to each weight is computed through the backpropagation algorithm, and the weights are adjusted in the direction that reduces the error. Over many iterations, the network ideally converges to a set of weights that make accurate predictions on the data in question. One appeal of ANNs in real estate is their ability to automatically learn latent features. For instance, the hidden neurons learn to represent combinations of inputs; a neuron might, for example, activate for "lakeside rural cottage" properties if such a pattern is present. ANNs are flexible and can theoretically handle interactions and non-linearities better than a predefined regression formula.

However, there are challenges and considerations with ANNs. First, they generally require a large amount of data to train effectively, especially compared to many simpler models. In a data-sparse rural context, a complex neural network could overfit, learning quirks of the training data that do not generalize, if not carefully regularized. Simpler network architectures or additional data might be necessary for better results. Second, ANNs are often criticized as “black boxes”, as mentioned in 1.4, because the relationship between inputs and outputs is encoded in many weights in a non-transparent way. It is not obvious why a particular prediction was made, which can be a disadvantage in valuation, where explainability is important. Later in this chapter, interpretability methods that can mitigate this problem are discussed. Finally, hyperparameter tuning (choosing the number of layers, neurons, learning rate, etc.) is important for good performance and can be time-consuming. Despite these issues, ANNs remain a promising and highly feasible method for capturing complex value drivers in properties.

2.4 Loss Functions

In training and evaluating regression models, the choice of loss function/error metric is critical.
The loss function is the quantitative measure of error that the model tries to minimize during training. Different losses have different properties and can lead to different model behavior. This is especially important in valuation, where one might care about relative error more than absolute error, or want to avoid over-penalizing outliers. Below, the common and specialized loss functions used in this thesis are outlined.

2.4.1 Huber Loss

The Huber loss is a robust loss function that behaves like mean squared error (MSE) for small errors and like mean absolute error (MAE) for large errors [18, 19]. Mathematically, it is defined piecewise, being quadratic when the absolute residual is below a certain threshold δ and linear beyond that point. This hybrid nature gives Huber loss the advantages of both MSE and MAE. Huber loss is commonly used in robust regression and machine learning settings where the user expects noisy data, providing a balance between sensitivity to small errors and insensitivity to very large deviations. An illustration of how Huber loss penalizes wrong predictions is shown in Fig. 2.3.

Figure 2.3: Comparison of Huber loss (green) with standard squared error loss (blue) as a function of the prediction residual. Source: Qwertyus, https://en.wikipedia.org/wiki/Huber_loss#/media/File:Huber_loss.svg

2.4.2 Combining Loss Functions in Regression Models

In practice, a single regression loss may not capture all modeling objectives. Combining multiple loss terms allows the model to balance these priorities. In general, one forms a composite loss as a weighted sum of components, so that each term contributes to guiding the training process. This strategy can improve generalization: prior work has shown that multi-objective loss functions often yield better performance on heterogeneous data and allow practitioners to tune trade-off hyperparameters between different goals [20].
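A weighted combination of the kind described above can be sketched in a few lines of numpy. The Huber threshold δ = 1 and the auxiliary weight are illustrative choices, not the values used in the thesis, and the auxiliary term is a simple stand-in:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def composite_loss(y_pred, y_true, w_aux=0.3):
    """Weighted sum of a robust regression term and an auxiliary objective.

    The auxiliary term here is a placeholder; in a multi-task model it could
    be a classification or contrastive loss instead."""
    reg = huber(y_pred - y_true).mean()
    aux = np.abs(y_pred - y_true).mean()
    return reg + w_aux * aux

res = huber(np.array([0.5, 2.0]))  # -> [0.125, 1.5]
```

The transition at δ is visible in the example: the residual 0.5 is penalized quadratically (0.125), while the residual 2.0 is already on the linear branch (1.5 rather than 2.0).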
2.4.3 Multi-Task Loss Combinations

Optimizing a single regression loss can be particularly limiting when the model must also learn a structured or generalizable internal representation. In multi-task learning settings, it is common to combine several loss terms, each with a different purpose. For example, alongside the primary regression loss which predicts the sale price, additional losses such as classification or contrastive objectives can guide the model toward learning embeddings that reflect meaningful relationships in the data, for example market segment. This allows each loss term to influence training in proportion to its assigned weight. While the model still predicts a single scalar target, the additional losses support generalization by enforcing structure in the learned representation. Prior work shows that such multi-objective training can improve both convergence and out-of-distribution robustness [20].

2.5 Gradient Boosting and LightGBM

Gradient Boosting is an ensemble method that builds a strong predictor by sequentially combining weak learners, typically shallow regression trees. Originally developed for classification, it was extended to regression by Friedman (2001) as Gradient Boosted Decision Trees (GBDT) [21].

2.5.1 The Gradient Boosting Process

Instead of training one complex model, gradient boosting builds a sequence of simple models (h1, h2, . . . , hM), where each new model tries to correct the errors made by the ones before it. The process starts with a basic guess F0(x), often just the average sale price, and gradually improves this prediction in steps [22].

1. Compute residuals for each training example. For Mean Squared Error loss, the residual at stage m is r_{i,m} = y_i − F_{m−1}(x_i).
2.
Train a new decision tree hm(x) on these residuals, learning how the current model errs.
3. Update the model: Fm(x) = F_{m−1}(x) + ν · hm(x), where ν is a shrinkage parameter (learning rate).
4. Repeat until M trees have been added or validation error ceases to improve.

Each tree greedily reduces the remaining error by moving in the negative gradient direction of the loss function, hence the term gradient boosting. The final model is a weighted sum of M decision trees. Although individual trees are generally shallow, the ensemble collectively achieves accuracy and robustness.

2.5.2 Gradient Boosting in Real Estate Valuation

In real estate valuation, gradient boosting models provide distinct advantages. Decision trees naturally handle numerical and categorical features, effectively capturing non-linear relationships among features. For example, trees can specifically model scenarios like rural properties with long commutes or waterfront properties, accumulating adjustments from multiple trees for nuanced predictions.

Despite its strengths, gradient boosting, like all models, has drawbacks. Optimal performance requires careful hyperparameter tuning of, for example, the number of trees, their depth, and the learning rate, using techniques like cross-validation [23] to balance overfitting and underfitting. Furthermore, large ensembles can slow predictions on very large datasets, though this is typically manageable in real estate contexts. Despite these minor drawbacks, gradient boosting remains a powerful, flexible methodology well suited to modeling complex and sparse data.

2.6 Overfitting

Overfitting is a phenomenon in which a model becomes too closely aligned with the training data, capturing noise or unusual patterns that do not generalize well to unseen inputs [24]. This often results in a steadily decreasing training error while the validation error, after an initial improvement, begins to rise. This pattern indicates that the model is not learning the underlying data distribution but is instead memorizing specific examples from the training set. As a result, the model performs well on the data it has seen but poorly on new or unseen data. Poor fitting occurs at both extremes: a model that is too small to capture the data underfits, while a model whose capacity is similar in scale to the training data, i.e., large enough to memorize patterns without generalizing, runs the highest risk of overfitting. Interestingly, work has shown that very large or overparameterized networks often generalize better than moderately sized ones [25]. When the training data is small or contains noise, the risk increases further. For instance, a neural network with a number of parameters similar to the number of training samples can easily memorize the data rather than learn generalizable features. In practice, overfitting is often diagnosed by monitoring the difference between training and validation errors: a growing gap between them signals that the model’s generalization ability is deteriorating. To mitigate overfitting, techniques such as regularization, early stopping during training, and the use of a separate validation set are commonly applied.

2.7 Model Evaluation Metrics

In order to make a thorough comparison of the different models, standardized and clear evaluation metrics are needed. The ones used for this thesis are listed in the subsequent subsections, where y is the true price, ŷ the predicted price, and n is the total number of properties.

2.7.1 Mean Absolute Percentage Error (MAPE)

MAPE is essentially the average percentage error and is calculated as in equation (2.7):

MAPE = (100%/n) Σ_{i=1}^{n} |(yi − ŷi) / yi| (2.7)

For instance, a MAPE of 10% means predictions are off by 10% on average. MAPE is scale-independent, which is useful in real estate portfolios with a wide range of prices. It is intuitive and a common metric in appraisal literature.
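The metric of Eq. (2.7) is straightforward to compute; a small sketch with made-up prices:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, Eq. (2.7), in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Two predictions, off by 10% and 20% respectively:
err = mape([1_000_000, 2_000_000], [1_100_000, 1_600_000])  # -> 15.0
```

Because each residual is divided by the true price, an error of 100 000 SEK on a cheap cottage weighs more than the same error on an expensive villa, which is exactly the scale independence discussed above.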
2.7.2 Mean Absolute Error (MAE)

In contrast to MAPE, MAE measures the average absolute difference between predicted and true values. It is defined as:

MAE = (1/n) Σ_{i=1}^{n} |yi − ŷi| (2.8)

MAE is in the same units as the target, i.e., SEK, making it directly interpretable for stakeholders.

2.7.3 Root Mean Squared Error (RMSE)

RMSE penalizes larger errors more heavily by squaring the residuals before averaging and then taking the square root:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (yi − ŷi)² ) (2.9)

Because of the squaring, RMSE is more sensitive to outliers than MAE. A lower RMSE indicates fewer large deviations, which is critical when extreme misvaluations carry high risk.

2.7.4 Coefficient of Determination (R²)

The R² metric quantifies the proportion of variance in the true values explained by the model:

R² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)², where ȳ = (1/n) Σ_{i=1}^{n} yi (2.10)

An R² of 0.80 means 80% of the variance in sale prices is captured by the model, indicating strong explanatory power. Unlike the error metrics, a higher R² is better, with a maximum of 1.0 for a perfect fit.

2.7.5 P10 and P20

In real estate mass appraisal, P10 and P20 are accuracy metrics indicating the share of model predictions that fall within a certain margin of the true property value. In other words, P10 is the percentage of predicted prices within ±10% of the actual sale price, and P20 is the percentage within ±20%. Formally:

P10 = 100% · #{i : |ŷi − yi| ≤ 0.10 · yi} / n,
P20 = 100% · #{i : |ŷi − yi| ≤ 0.20 · yi} / n.

2.7.6 SHAP values – feature importance and interpretability

SHAP (SHapley Additive exPlanations) is a method rooted in cooperative game theory for interpreting machine learning predictions by assigning each feature a SHAP value. These values represent how much each feature increases or decreases a prediction relative to a baseline (e.g., the average prediction) [26].
SHAP values extend Shapley values from cooperative game theory to machine learning. They attribute the model’s prediction to input features by averaging each feature’s contribution across all possible subsets of features. SHAP values show the effect of each feature on a specific prediction. For example, a rural house’s valuation might decrease due to being farther from a city, but increase with a larger lot size. Summing the baseline and these SHAP contributions explains the final prediction clearly, similar to how an appraiser would justify a property valuation.

In this thesis, SHAP values clarify the gradient boosting model’s predictions, identifying which features influence house valuations and ensuring the model captures logical patterns (e.g., a larger living area positively affecting price). SHAP analyses can also detect potential spurious correlations and visually demonstrate feature importance and non-linear effects through SHAP summary and dependence plots. SHAP values can also be attributed to neural embeddings if these are used in a gradient boosting model, allowing for interpretation of which embeddings, or sets of combined features, contribute to the valuation.

2.7.7 t-SNE – visualizing a high-dimensional representation

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for visualizing high-dimensional data by embedding it into a low-dimensional space, typically 2D, while preserving local relationships [27]. Unlike linear methods like Principal Component Analysis (PCA) [28] that preserve global variance, t-SNE emphasizes local structure: points that are close together in the high-dimensional space are mapped close together in 2D, while dissimilar points are placed farther apart. The algorithm proceeds in two main steps:

1. High-dimensional similarities: Computes probabilities pij that reflect how similar data points are using a Gaussian distribution.
2.
Low-dimensional mapping: Computes probabilities qij in 2D using a Student t-distribution.

3 Methodology

3.1 Data Preprocessing

Accurate and robust data preprocessing is critical to ensure that the models learn meaningful patterns rather than artifacts of noise or absence of data. The following steps were applied to transform the raw transaction records into a fully numeric dataset with consistent scaling and minimal missing values.

3.1.1 Cleaning and Imputation

After loading the Parquet transaction file, columns with more than 30% missing entries were discarded to avoid distorting model training. For municipality-level attributes such as population, population change rates, and migration fractions, gaps were forward- and back-filled within each Municipality code group (see A.1), since within each respective municipality, municipality-level features should be identical. Highly skewed numeric features, identified by a maximum-to-minimum ratio above 10 and strictly positive values, were log-transformed and the originals were dropped.

3.1.2 Categorical Encoding and Vocabulary Extraction

Object-dtype columns were first converted to UTF-8 text and nulls replaced with the literal category “Unknown.” Each was then converted to a pandas.Categorical type [29], allowing LightGBM to treat them natively as categorical features. Simultaneously, integer codes for each category level were extracted into new coded columns, which serve as inputs to the neural network’s embedding layers. The code also records each category’s vocabulary size, ensuring that each embedding matrix is sized precisely to its feature’s cardinality.

3.1.3 Proportion Clipping and Cyclical Date Features

A set of fraction-type variables was clipped to the [0, 1] interval to enforce valid bounds. To allow models to learn smooth seasonal effects, the month of sale was encoded as two cyclical features:

SaleMonthSin = sin(2π · SaleMonthOfYear / 12),
SaleMonthCos = cos(2π · SaleMonthOfYear / 12).
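The two cyclical features can be computed directly with numpy; the following sketch also checks that the encoding behaves as intended:

```python
import numpy as np

month = np.arange(1, 13)  # SaleMonthOfYear: 1..12
sale_month_sin = np.sin(2 * np.pi * month / 12)
sale_month_cos = np.cos(2 * np.pi * month / 12)

# In (sin, cos) space, December is much closer to January than to June:
points = np.column_stack([sale_month_sin, sale_month_cos])
d_dec_jan = np.linalg.norm(points[11] - points[0])  # December vs January
d_dec_jun = np.linalg.norm(points[11] - points[5])  # December vs June
```

With a raw month number, December (12) and January (1) would be 11 units apart; in the cyclical encoding `d_dec_jan` is small while `d_dec_jun` is the diameter of the circle.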
This representation ensures December and January are adjacent in feature space, instead of being interpreted as far apart (months 1 and 12).

3.1.4 Feature Selection and Scaling

After dropping raw identifiers and geometry columns, the remaining numeric features (original, log-transformed, and cyclical) were split into FEATURES for continuous inputs and CAT_CODE_COLS for the integer codes of each categorical feature. The continuous features were standardized to zero mean and unit variance using Scikit-Learn’s StandardScaler [30] fitted on the training split, then applied unchanged to the development and test splits, in order to avoid any leakage of test data into training. Categorical columns retained their category dtype for LightGBM, while the corresponding coded columns were fed into the neural network’s embedding layers. After these preprocessing steps, the dataset consists exclusively of:

• Log-transformed and z-score standardized continuous features (FEATURES)
• Integer-coded categorical features (CAT_CODE_COLS) with known vocabulary sizes
• Validated proportion and cyclical date features.

This scaled representation is favorable for the ANN’s embedding layers and LightGBM’s native categorical handling.

3.2 Model Development

In Fig. 3.1 a simple flowchart of the hybrid model is displayed. The following subsections detail the architecture and implementation of each model, as well as the techniques used to improve their efficiency and effectiveness.

Figure 3.1: A simple diagram visualizing the steps of the hybrid model. Source: Primary.

3.2.1 Training, Validation, and Test Split

The data is partitioned in three stages to ensure a strict temporal hold-out for final evaluation and a separate development set for model selection and early stopping.
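The temporal hold-out can be sketched with pandas; the column name `SaleYear` and the toy rows below are hypothetical, but the year boundaries follow the split used in this thesis:

```python
import pandas as pd

# Toy transactions; the SaleYear column name is hypothetical.
df = pd.DataFrame({
    "SaleYear": [2015, 2018, 2020, 2021, 2021, 2022, 2022],
    "Price":    [1.1e6, 1.5e6, 2.0e6, 2.2e6, 1.8e6, 2.5e6, 2.1e6],
})

train = df[df["SaleYear"] <= 2020]   # 2015-2020: training
val   = df[df["SaleYear"] == 2021]   # 2021: model selection / early stopping
test  = df[df["SaleYear"] == 2022]   # 2022: untouched until final evaluation
```

Filtering on the sale year rather than sampling at random is what prevents future transactions from leaking into the training set.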
The dataset is split chronologically, simulating a real-life scenario: the training set consists of transactions from 2015–2020, the validation set consists of transactions from 2021, and the test set consists of purchases that occurred during 2022. This ensures that no data from 2022 are used in any training or validation step and that the final test set remains completely unseen until the very end.

3.2.2 Artificial Neural Network with Embeddings

One of the core parts of the hybrid model is a multi-task ANN that incorporates learned entity embeddings for categorical features. By mapping each category into a trainable dense vector, the model captures intrinsic similarities among categorical values, avoiding sparse one-hot encodings. The network processes numeric and embedded categorical inputs jointly, feeding them through a deep feed-forward architecture with residual connections. This design allows the ANN to learn a rich embedding of each data point, which is used both for a continuous value prediction and for a class prediction, where the classification head teaches the model broader market segments (see 3.2.2.3). The multi-task setup aims to enrich the shared representation by learning from both regression and classification targets simultaneously, improving generalization.

3.2.2.1 Input and Embedding Layers

The input features x consist of standardized continuous variables and categorical variables. Continuous features are input directly, while each categorical feature is handled via a dedicated embedding layer. Specifically, for each categorical field c with Vc unique values, an embedding matrix Ec ∈ R^{Vc×d} is included, with a small dimension d (e.g., d = 8) tuned for the task. These are implemented as a ModuleDict of embedding layers in PyTorch [31].
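A minimal sketch of such a ModuleDict of embedding layers; the field names, vocabulary sizes, batch size and number of numeric features are illustrative, not taken from the thesis configuration:

```python
import torch
import torch.nn as nn

class EmbeddingInput(nn.Module):
    """One nn.Embedding per categorical field; the embedding outputs are
    concatenated with the standardized numeric features."""
    def __init__(self, vocab_sizes, d=8):
        super().__init__()
        self.embeddings = nn.ModuleDict(
            {name: nn.Embedding(v, d) for name, v in vocab_sizes.items()}
        )

    def forward(self, x_num, x_cat):
        # x_cat maps each field name to a tensor of integer category codes
        embs = [self.embeddings[name](codes) for name, codes in x_cat.items()]
        return torch.cat([x_num] + embs, dim=1)

layer = EmbeddingInput({"Municipality": 6, "HouseType": 4})
out = layer(
    torch.randn(5, 3),  # 3 numeric features for a batch of 5 properties
    {"Municipality": torch.tensor([0, 1, 2, 3, 4]),
     "HouseType": torch.tensor([0, 1, 2, 3, 0])},
)
# out.shape == (5, 3 + 8 + 8)
```

Each embedding matrix is sized to its feature's recorded vocabulary, mirroring the vocabulary extraction step of Section 3.1.2.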
During a forward pass, each categorical input is transformed into its d-dimensional embedding vector, and all embeddings are concatenated with the numeric features to form the combined input. By learning embeddings, the model can place similar category values close together in the vector space, reflecting their inherent similarities, such as being in the same region or a similar area. This approach not only reduces dimensionality compared to one-hot encoding, but can also reveal meaningful relationships between categories. The resulting input vector (continuous features + all embedding outputs) is then passed into the first hidden layer of the network.

3.2.2.2 Residual Stack and Embedding Head

After the input layer, the network feeds forward through a stack of fully connected layers with residual connections inspired by ResNet architectures [32]. The first layer expands the concatenated input to a high-dimensional hidden state of 512 neurons with batch normalization and ReLU activation. Then, several residual blocks follow: each block is a two-layer MLP that learns an increment ∆h and adds it to the block’s input via a skip connection. Formally, if a block’s input is h_in, it produces

h_out = ReLU(W2(Dropout(ReLU(W1 h_in)))) + h_in,

with a linear projection on h_in if dimensions differ. The main reasons for using a residual network rather than a purely feed-forward network were:

1. Stabilizing gradient flow: In a deep network, gradients can "vanish" or "explode" during backpropagation. By introducing a skip connection that adds each block’s input directly to its output, the network effectively learns only the residual increment ∆h instead of a full transformation. This identity-mapping shortcut allows gradients to propagate freely from deeper layers all the way to the input.

2. Non-linear interactions: House values depend on highly non-linear interactions among features. A deeper architecture can, in principle, capture hierarchical feature interactions.
Residual blocks allow the user to train a deeper stack (512 → 256 → 128 neurons over two blocks) without suffering from “degradation”, where additional layers actually hurt performance. Each block only needs to learn an additive correction on top of its input, so the network can gradually improve representations instead of forcing a large transformation in one go.

3. Faster convergence and reduced overfitting: ResNet-style blocks generally converge faster than plain MLPs of the same depth. This meant requiring fewer epochs and less aggressive regularization. A 20% dropout was used inside each residual block, and batch normalization was applied after each ReLU to keep embedding magnitudes consistent. This combination reduced overfitting on the training set, leading to a more robust embedding (128-D) for downstream stacking.

After the final residual block, an embedding head was applied, a linear layer that compresses the last hidden activations into a 128-dimensional embedding vector e. This e is a compact representation of the input property, integrating signals from all numeric and categorical features. It serves as the input to the subsequent output prediction heads, and also as a learned feature for the hybrid model. The use of a lower-dimensional embedding bottleneck (128-D) encourages the network to distill the informative features of the data point, which was used later in the hybrid stacking approach.

3.2.2.3 Multi-Task Output Heads

From the shared embedding vector e, the ANN branches into three output heads: one for regression, one for classification and one for structuring the embedding space. The regression head predicts the log-adjusted price as a single scalar output. It consists of a small fully connected sub-network: a dropout layer followed by a dense layer (128 → 32 with ReLU) and a final linear layer to output ŷ (log-adjusted sale price) as a single continuous value.
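The residual block of Section 3.2.2.2 and the regression head just described can be sketched in PyTorch as follows. Layer sizes and the 20% dropout follow the text; for brevity the sketch omits the batch normalization mentioned above, so it is a simplified version, not the exact thesis architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h_out = ReLU(W2(Dropout(ReLU(W1 h_in)))) + h_in,
    with a linear projection on the skip path if dimensions differ."""
    def __init__(self, d_in, d_out, p_drop=0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(d_out, d_out),
        )
        self.proj = nn.Linear(d_in, d_out) if d_in != d_out else nn.Identity()

    def forward(self, h):
        return torch.relu(self.body(h)) + self.proj(h)

# 512 -> 256 -> 128 over two blocks, then the regression head (128 -> 32 -> 1).
stack = nn.Sequential(ResidualBlock(512, 256), ResidualBlock(256, 128))
reg_head = nn.Sequential(nn.Dropout(0.2), nn.Linear(128, 32), nn.ReLU(),
                         nn.Linear(32, 1))

e = stack(torch.randn(4, 512))  # 128-D embedding for a batch of 4
y_hat = reg_head(e)             # predicted log-adjusted price
```

Note how the intermediate tensor `e` is exposed explicitly: it is this 128-D embedding that later serves as input to the other heads and to the LightGBM stage.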
In parallel, the classification head predicts a discrete price category indicating the relative price level of the property. Five ordinal buckets are defined by partitioning the training-set prices into quintiles. The classification head is similarly a dropout plus dense layers ending in a 5-logit output p̂ = (p̂1, . . . , p̂5); these correspond to the model's confidence that the object's price falls into each quintile range. A focal loss was used for this classification task to mitigate class imbalance, focusing the training on under-represented price ranges. The multi-task design provides an auxiliary learning signal: the classification objective of distinguishing price ranges guides the network to learn features that segment properties by value, complementing the exact regression objective. Both heads share the same underlying embedding e, so the gradients from the regression and classification tasks jointly update the preceding layers.

3.2.2.4 Composite Loss Function

The network is trained to minimize a composite loss

Ltotal = Lreg + (1 − α) wc Lcls + wt Ltriplet,

where Lreg is the P10-aware regression loss, Lcls the focal classification loss, and Ltriplet the embedding triplet loss, and
• α ∈ [0, 1] is linearly annealed from αstart to αend over training epochs.
• wc and wt are fixed weights for the classification and triplet losses, respectively.

3.2.2.4.1 P10-Aware Regression Loss Lreg. Let ŷ and y be the network's prediction and ground truth for the standardized log-price:

y = (log(price) − µt) / σt,

where µt and σt are the mean and standard deviation of log-prices in the training set. To recover original-scale prices, the inverse transformation is applied:

p̂ = exp(ŷ σt + µt), t = exp(y σt + µt),

where p̂ is the predicted sale price and t is the true sale price. The regression loss consists of two components:

• A Huber loss on the standardized log-price:

δHuber(ŷ, y) = ½ (ŷ − y)² if |ŷ − y| ≤ δ, and δ (|ŷ − y| − ½ δ) otherwise.

• A soft P10 loss, which softly penalizes predictions that deviate more than 10% from the true price. It is defined using a sigmoid function:

P10soft = 1 − (1/B) Σᵢ₌₁ᴮ σ( k (0.10 − |p̂i − ti| / ti) ),

where σ is the sigmoid function and k is a steepness constant. The final composite loss is a convex combination of these two objectives:

Lreg = (1 − α) δHuber(ŷ, y) + α P10soft,

where α ∈ [0, 1] controls the trade-off between squared-log error and P10-aware supervision. An illustration of how the composite loss penalizes errors at different α levels is shown in Fig. 3.3.

Figure 3.3: An illustration of how the composite loss penalizes wrong predictions as α increases. Source: Primary.

3.2.2.4.3 Focal Classification Loss Lcls. The classification head outputs logits for nb price-quantile buckets. After softmax, let pi,c be the predicted probability for the true bucket c of sample i. The focal loss with focusing parameter γ is defined by:

Lcls = −(1/B) Σᵢ₌₁ᴮ (1 − pi,c)^γ log(pi,c).

In Fig. 3.4 the focal loss is visualized for different values of the focusing parameter γ.
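A minimal NumPy sketch of the regression and classification loss terms above (batch inputs; δ, k, γ, and α are free parameters with illustrative defaults; function names are ours, not from the thesis code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def huber(y_hat, y, delta=1.0):
    """Huber loss on the standardized log-price, elementwise."""
    r = np.abs(y_hat - y)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def p10_soft(p_hat, t, k=50.0):
    """Soft share of predictions outside +/-10% of the true price (lower is better)."""
    return 1.0 - np.mean(sigmoid(k * (0.10 - np.abs(p_hat - t) / t)))

def l_reg(y_hat, y, p_hat, t, alpha):
    """Convex combination of Huber (log scale) and soft P10 (original scale)."""
    return (1 - alpha) * np.mean(huber(y_hat, y)) + alpha * p10_soft(p_hat, t)

def focal_loss(p_true_class, gamma=2.0):
    """Focal loss given each sample's predicted probability of its true bucket."""
    return -np.mean((1 - p_true_class) ** gamma * np.log(p_true_class))
```

The focal term shows the intended behavior directly: a confidently correct prediction (probability 0.9) contributes far less loss than an uncertain one (probability 0.5), shifting training effort to the hard, under-represented price ranges.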
Figure 3.4: Illustration of focal loss. As γ increases, well-classified examples (high p) are down-weighted, i.e., their loss goes to zero faster, which helps focus the training on harder examples (low p). The difference might look small but is quite tangible in practice. Source: Primary.

3.2.2.4.4 Triplet Embedding Loss Ltriplet. To encourage the network to learn an embedding space in which similarly priced properties lie close together and dissimilar properties are pushed apart, a triplet-based loss was implemented in addition to the regression and classification heads.

• Price Quantile Buckets. Before training, all sale prices in the training set are sorted and partitioned into five equal-sized buckets (quintiles).
• Sampling Anchors, Positives, and Negatives. In each mini-batch, anchors are sampled uniformly at random. For an anchor a with sale price in quantile bucket ba, the model chooses:
– A positive example p from the same bucket ba, i.e. a property whose sale price falls into the same quintile as a.
– A negative example n from a different bucket bn, such that |ba − bn| ≥ 1. In practice, negatives are drawn uniformly from all buckets at least one quantile away, ensuring a clear price separation.

Given embeddings ea, ep, and en, a standard margin-based triplet loss is used:

Ltriplet = (1/T) Σ₍a,p,n₎ max{0, ‖ea − ep‖² − ‖ea − en‖² + m}.

In Fig. 3.5 the triplet embedding loss is visualized with margin m = 0.2, as a function of the distance difference δ = ‖ea − ep‖² − ‖ea − en‖². This makes clear how the margin parameter creates a zero-loss region and then penalizes violations linearly.

Figure 3.5: Triplet embedding loss. For δ ≤ −m, the negative sample is at least the margin farther away than the positive, giving zero loss. For δ > −m, the loss grows linearly with δ + m. Source: Primary.

3.2.2.5 Optimization and Regularization

The ANN model was trained using the AdamW optimizer (Adam with decoupled weight decay) for efficient stochastic gradient descent.
Key training hyperparameters such as the learning rate, weight decay, dropout probability, and the loss weight coefficients (wc, wt, and the α schedule) were tuned using the Optuna hyperparameter optimization framework [33]. In particular, Optuna's TPE sampler explored ranges for the initial learning rate, the L2 weight decay penalty, the embedding dimensionality for categories, and the starting/ending values of α (which define how quickly the P10 term ramps up). Adopting Optuna [33] allowed for an efficient search for a well-performing configuration. The final chosen parameters (e.g. learning rate ≈ 2 × 10−4, weight decay ≈ 1 × 10−2, dropout ≈ 0.20) reflect the best trade-offs found. To train effectively, PyTorch's One-Cycle Learning Rate (OneCycleLR) [34] schedule was also applied, which adjusts the learning rate from a low value up to a peak and back down to a low value within one training run. This method, introduced by Smith [35] for "super-convergence", allows the model to briefly use a relatively high learning rate and often leads to faster convergence and better generalization.

Batch normalization was applied in each layer to stabilize learning, and dropout in the hidden layers and output heads reduced overfitting by randomly deactivating neurons during training. The trade-off parameter α, which controls the balance between Huber loss and soft P10 supervision, was scheduled to increase linearly over training: α started at approximately 0.19 and increased to 0.63 by the final epoch. This gradually shifted emphasis from minimizing squared error on log-price to optimizing the soft P10 metric on original-scale prices. This schedule gave the model time to learn an accurate overall fit before focusing too much on the stricter P10 criterion.
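The linear α schedule can be written out directly (endpoint values taken from the text; the function name is ours):

```python
def alpha_schedule(epoch, n_epochs, alpha_start=0.19, alpha_end=0.63):
    """Linearly anneal the Huber / soft-P10 trade-off over training.

    Returns alpha_start at the first epoch and alpha_end at the last, so the
    regression loss (1 - alpha) * huber + alpha * p10_soft gradually shifts
    emphasis from log-price accuracy to the P10 criterion.
    """
    frac = epoch / max(n_epochs - 1, 1)  # 0 at the first epoch, 1 at the last
    return alpha_start + frac * (alpha_end - alpha_start)
```

Because the classification weight is scaled by (1 − α), the same schedule simultaneously fades the classification task out as training progresses.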
Simultaneously, the classification loss weight wc was effectively scaled by (1 − α), so that as α grew, the classification task was gradually down-weighted towards zero by the end of training. This ensured that in later epochs the model concentrated on P10 and embedding structure, having already benefited from the classification signal early on. Early stopping was applied on validation MAPE as well as on increasing validation error, and the model with the lowest validation MAPE was saved for final evaluation. PyTorch's ReduceLROnPlateau scheduler [36] was also used as a fallback: if progress stagnated, the learning rate would be halved after 5 epochs without MAPE improvement. With OneCycleLR in effect, however, this was rarely needed until the very end of training.

3.2.3 LightGBM Ensemble with Raw Features and ANN Embeddings

In the hybrid valuation framework, two LightGBM regressors are employed in a stacking configuration. LightGBM is known for its efficiency and accuracy, training faster than traditional GBMs while maintaining similar accuracy, which makes it suitable for the large feature set. The two-stage ensemble is outlined in the following subsections.

3.2.3.1 Stage 1: Raw-Feature GBM

In Stage 1, the LightGBM regressor is trained on the same scaled continuous features as the ANN, but the original categorical columns are kept as pandas Categorical dtype [29], so that LightGBM can handle splits on them natively while predicting the log-adjusted sale price. The Stage 1 LightGBM was tuned with Optuna as well, optimizing hyperparameters such as the number of leaves, learning rate, feature fraction, and regularization terms. The objective was standard regression, with MAPE as the evaluation metric. The model was trained with early stopping on a validation set to determine the optimal number of boosting rounds. The Stage 1 model learns a baseline mapping from raw inputs to price.
For instance, it can directly learn effects like "houses in region X are more expensive" or "larger living area increases price" by leveraging decision-tree splits. After training, the Stage 1 predictions are produced; denote by ŷ0(i) the Stage 1 predicted log-price for sample i.

3.2.3.2 Stage 2: Embedding-Based Residual GBM

For Stage 2, a second LightGBM model is trained to predict the residual r(i) = y(i) − ŷ0(i) using the ANN's learned embedding as input. Essentially, Stage 2 learns to predict what Stage 1 missed, but only using the information encoded in the embeddings e. The Stage 2 LightGBM also uses a set of Optuna-tuned parameters. It trains on the pairs (e(i), r(i)), again with a regression objective evaluated on MAPE. Because the range of residuals is smaller than that of the original target, this stage can focus on finer details. For example, the ANN embedding might encode subtle interactions which the model can pick up by splitting on dimensions of e. Given all the complex signals the ANN captured, the Stage 2 model tries to find the remaining price adjustment that needs to be added to the Stage 1 prediction. Typically, Stage 2 required fewer trees than Stage 1, as the residual signal is weaker than the original. After training, the model outputs a residual correction ŷ1(i) for each input embedding. This model effectively boosts the performance of the ensemble by adding back the nonlinear, interaction-driven effects that a single GBM could not easily find from raw features alone.

3.2.3.3 Combined Prediction and Performance

The final prediction for a given property is the sum of the Stage 1 and Stage 2 outputs:

ŷfinal = ŷ0 + ŷ1,

where ŷ0 is the Stage 1 GBM's prediction from raw features, and ŷ1 is the Stage 2 GBM's predicted residual from the ANN embedding. The two terms together give the full predicted log-price, which is then exponentiated to obtain the predicted sale price in SEK.
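The two-stage logic can be sketched with simple least-squares models standing in for the two tuned LightGBM regressors (synthetic data; purely illustrative, not the thesis implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_raw = rng.normal(size=(n, 3))   # raw features (Stage 1 input)
E = rng.normal(size=(n, 4))       # ANN embeddings (Stage 2 input)
# Synthetic log-price carrying signal in both the raw features and embeddings.
y = X_raw @ [0.5, -0.2, 0.1] + E @ [0.3, 0.0, -0.1, 0.2] + 0.01 * rng.normal(size=n)

def fit_linear(X, y):
    """Least-squares stand-in for a boosted-tree regressor."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Stage 1: baseline mapping from raw features to log-price.
w0 = fit_linear(X_raw, y)
y0 = X_raw @ w0

# Stage 2: fit the residual r = y - y0 from the embeddings only.
w1 = fit_linear(E, y - y0)
y1 = E @ w1

# Combined prediction: Stage 1 baseline plus Stage 2 residual correction.
y_final = y0 + y1
```

Because the embeddings carry signal that the raw-feature stage cannot reach, the stacked prediction y_final fits y more tightly than y0 alone, which is exactly the mechanism the two-stage ensemble relies on.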
3.3 Benchmark Models

Two simple baseline models were constructed to benchmark the proposed hybrid ANN approach: a classical hedonic regression and a straightforward kNN regression. These models serve as interpretable, traditional baselines for comparison.

3.3.1 Hedonic Regression Baseline

The implementation follows a standard procedure for training and evaluating a hedonic regression model, using the same log-adjusted sale price as the target variable. The linear regression model is fitted on the training set with the predictors and the log-transformed sale prices. Predictions for the test set are generated in log-price space and subsequently exponentiated to return to the original price scale. Model performance is then assessed on the natural price scale using the same evaluation metrics as for the other models.

3.3.2 KNN

The kNN regression model predicts property prices by averaging the prices of the nearest training examples in feature space. In this implementation, each property's five most similar neighbors are identified using standard Euclidean distance over the identical feature set. Predictions are generated with uniform weighting, meaning each neighbor contributes equally. As for the other models, the kNN regression was applied to the log-transformed price target. Fig. 3.6 illustrates a simple example of this method: the price of a new house of 180 m2 (indicated by the vertical dashed line) is predicted by averaging the prices of its 5 closest neighbors (marked in orange) within the training dataset.

Figure 3.6: Illustration of kNN regression. A new house (vertical dashed line at 180 m2) is valued by averaging prices of its 5 nearest neighbors (orange points) among a sample of training homes. Source: Primary.

3.3.3 Baseline Model Configurations

Table 3.1 summarizes the key settings of each baseline model.
The hedonic regression has no adjustable hyperparameters, while the kNN model's main parameter is K (the number of neighbors).

Table 3.1: Overview of baseline models and their configurations

Model              | Target Variable         | Key Settings
Hedonic Regression | log-adjusted sale price | OLS linear regression on structural & locational features; no hyperparameter tuning.
KNN Regression     | log-adjusted sale price | k-nearest neighbors (k = 5), Euclidean distance, uniform weighting.

The comparative evaluation was then carried out on an identical held-out test dataset for all models. Using the same test set for each model ensures a fair, direct comparison of predictive accuracy; no model gains an advantage from different data splits. The same set of error metrics was applied to each model's predictions. In this way, the hybrid model is benchmarked against both conventional methods.

4 Summary of Findings

4.1 Comparative Evaluation

Table 4.1 summarizes the test-set performance of each modeling approach. The hybrid model emerges as the top performer across every metric. Although its improvements over the raw-only GBM might seem modest, they are consistent and meaningful in a valuation context.

Table 4.1: Test Set Performance Comparison of All Models

Model              | MAPE (%) | MAE (SEK) | RMSE (SEK) | R2    | P10 (%) | P20 (%)
Hybrid Model       | 15.9     | 431 926   | 624 460    | 0.814 | 41.4    | 73.6
Raw features LGBM  | 17.1     | 464 525   | 665 020    | 0.798 | 38.6    | 68.0
Embeddings LGBM    | 19.3     | 481 560   | 739 780    | 0.751 | 37.3    | 63.6
Neural Network     | 18.6     | 481 290   | 737 220    | 0.752 | 38.9    | 65.1
KNN                | 23.6     | 619 170   | 963 270    | 0.577 | 28.6    | 53.0
Hedonic Regression | 22.8     | 552 140   | 799 780    | 0.709 | 30.0    | 56.0

The hybrid model reduces MAPE by over 1 percentage point (pp) relative to the raw-only GBM (15.9% vs. 17.1%), translating into an average error reduction of roughly 33,000 SEK, so more appraisals fall closer to their true value. The RMSE also decreases by roughly 41,000 SEK. The hybrid model scores an R2 of 0.814, meaning it explains 81.4% of the variance in sale price on the test set.
Turning to coverage metrics, the hybrid's P10 of 41.4% signifies that roughly four out of ten valuations lie within ±10% of the sale price, compared to only 38.6% for the raw GBM. Similarly, P20 improves by 5.6 pp. These gains reflect a clear tightening of the error distribution, which can translate to stronger confidence intervals in practice.

The embeddings-only GBM and the standalone neural network both underperform the raw GBM, confirming that while learned embeddings excel at capturing complex, nonlinear feature interactions, they do not substitute for the breadth of information contained in the original variables. Embeddings distill higher-order patterns, but require the raw features to ground those patterns in measurable property attributes.

By contrast, the kNN and hedonic regression baselines underperform substantially on every metric. Hedonic regression, relying on linear relationships and pre-specified interaction terms, struggles to accommodate the irregular, multimodal distributions of property characteristics outside major urban centers without careful and thorough pre-processing. Likewise, kNN depends on finding truly comparable sales in the training set; in thin markets or highly heterogeneous rural regions, suitable comparables may be sparse or distant in feature space, leading to noisy, unstable estimates.

This sharp underperformance of classical methods underscores the difficulty of automated property appraisal in diverse, data-sparse contexts. Real estate markets outside metropolitan areas exhibit wide variability in lot sizes, building styles, renovation levels, and locational premiums that violates the smoothness and homogeneity assumptions of simple regression or nearest-neighbor approaches. In such settings, hybrid models that combine global pattern-learning and local context provide the flexibility and robustness needed to attain practical accuracy.
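The error metrics reported throughout (MAPE, and the P10/P20 coverage shares) can be computed as in the following sketch (function names are ours, not from the thesis code):

```python
import numpy as np

def mape(pred, true):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(pred - true) / true)

def within_pct(pred, true, pct):
    """Share of predictions within +/-pct% of the true price, in percent (P10, P20, ...)."""
    return 100.0 * np.mean(np.abs(pred - true) / true <= pct / 100.0)

# Illustrative data: three sales with relative errors of 5%, 15%, and 2.5%.
true = np.array([1_000_000.0, 2_000_000.0, 4_000_000.0])
pred = np.array([1_050_000.0, 2_300_000.0, 3_900_000.0])

print(mape(pred, true))            # mean of 5%, 15%, 2.5% -> approx. 7.5
print(within_pct(pred, true, 10))  # 2 of 3 predictions within +/-10%
print(within_pct(pred, true, 20))  # all 3 predictions within +/-20%
```

Note that P10 and P20 are nested by construction: every prediction counted by P10 is also counted by P20, which is why P20 always dominates P10 in Table 4.1.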
4.1.1 Model Performance by Price Decile

To understand how valuation accuracy varies across the price spectrum, test-set properties were grouped into ten equally sized deciles by true sale price. For each decile, Table 4.2 reports the average true price, the model's mean prediction, and key error metrics: MAE, MAPE, P10 and P20.

Table 4.2: Hybrid Model Performance by True Price Decile

Decile | True Mean | Pred Mean | MAE     | MAPE   | P10   | P20
0      | 1,326,602 | 1,237,231 | 203,372 | 13.85% | 43.5% | 74.8%
1      | 1,461,995 | 1,428,079 | 274,347 | 18.36% | 30.5% | 58.3%
2      | 1,681,692 | 1,628,669 | 318,380 | 18.87% | 32.4% | 62.0%
3      | 1,903,869 | 1,839,429 | 365,234 | 19.41% | 32.1% | 60.5%
4      | 2,155,763 | 2,093,139 | 407,223 | 19.86% | 34.0% | 61.0%
5      | 2,480,797 | 2,399,840 | 435,851 | 18.32% | 39.3% | 65.5%
6      | 2,853,116 | 2,771,432 | 484,960 | 18.12% | 39.8% | 69.3%
7      | 3,321,370 | 3,249,310 | 468,718 | 14.94% | 47.8% | 76.8%
8      | 3,994,205 | 3,867,537 | 519,312 | 13.14% | 49.5% | 79.9%
9      | 5,597,320 | 5,315,956 | 839,685 | 14.72% | 45.5% | 75.6%

Takeaways:
• Low-to-mid-market challenge (deciles 1–4): MAE and MAPE peak in the second through fifth deciles, and P10 dips to its lowest. These low-to-mid-range properties exhibit the greatest disparities in features, making precise valuation more difficult.
• Improved accuracy at the extremes: Both the lowest decile (0) and the top three deciles (7–9) show stronger P10 and lower MAPE. In the cheapest segment, homes are more homogeneous, while in the mid-to-high-value tiers the model benefits from objects that resemble mid-value properties with incremental differences in object-specific features. For the very most expensive properties (≈ 12M+ SEK, found in decile 9), however, the model struggles to make correct valuations, mostly due to the small number of such properties in the training set.
• P20 stability: The P20 metric remains above 58% across all deciles, peaking at nearly 80% for decile 8.
This indicates that even when ±10% accuracy is challenging, the model still generally stays within ±20% of the sale price.
• Systematic underestimation: Across all deciles, the model underestimates value. The predicted mean in decile 9 (5.32 M SEK) is below the true mean (5.60 M SEK), in line with the MAPE and MAE increases, suggesting a modest bias that could be addressed by targeted calibration in the highest price bracket.

4.2 Error analysis

This section quantifies how well the hybrid valuation model performs and where it fails, starting with aggregate views, histograms, and scatter plots that reveal the overall spread and systematic biases of its residuals, and then drilling down to illustrative case studies that expose the specific transactions driving the largest misestimations.

4.2.1 Error Distribution

Fig. 4.1 shows the distributions of absolute and relative errors for the hybrid model on the test set. The absolute-error histogram is tightly concentrated: over 50% of predictions fall within ±400 000 SEK, and only 5% exceed 1 000 000 SEK. The relative-error plot confirms that more than 40% of predictions lie within ±10% of the actual price (P10), and roughly 75% within ±20% (P20). This error profile indicates that the hybrid model delivers both small typical errors and a compressed tail of large misvaluations, which is critical for reducing risk in automated appraisal.

Figure 4.1: Absolute error distribution (left) and relative error distribution (right) for the hybrid model.

The scatter plot in Figure 4.2 compares true sale prices against model predictions. Most points lie close to the 45° line, demonstrating a relatively accurate fit across price ranges. A heavy under-prediction bias appears as prices increase, mostly due to the lack of expensive objects in the dataset. Overall, the scatter confirms that the hybrid stack generalizes quite well and maintains linearity between predicted and actual values.
Figure 4.2: Scatter plot of true (x-axis) versus predicted (y-axis) prices; perfectly accurate predictions would fall on the red dotted line.

4.2.2 Case Studies: Best and Worst Predictions

Table 4.3 lists the five best and a selection of the worst predictions in terms of absolute error. The best-predicted properties tend to be around mid-market, but as was evident in Fig. 4.2, the model also makes good estimates in higher price segments. Conversely, the worst estimates almost exclusively fall in the most expensive range of properties in the test set. This holds for almost all of the 50 worst predictions as well, with two exceptions: one object, whose adjusted sale price was 4,313,017 SEK, was valued at 9,956,976 SEK by the model, more than double the actual price, and another, priced at just 1,548,869 SEK, was overestimated at 5,222,246 SEK.

Table 4.3: The best and worst predictions made by the model in absolute terms. Sale prices are adjusted to 2020-06, hence the uneven price figures.

Best Estimates by the Hybrid Model (SEK)
#  | Adjusted sale price | Predicted sale price | Absolute error
1  | 1,466,269           | 1,466,336            | 67
2  | 3,248,320           | 3,248,389            | 69
3  | 1,881,336           | 1,881,457            | 121
4  | 2,481,618           | 2,481,743            | 124
5  | 2,616,279           | 2,615,984            | 295

Worst Estimates by the Hybrid Model (SEK)
#  | Adjusted sale price | Predicted sale price | Absolute error
1  | 16,298,297          | 5,117,908            | 11,180,389
2  | 19,281,824          | 11,055,575           | 8,226,249
3  | 16,071,772          | 8,416,230            | 7,655,542
4  | 13,593,189          | 7,154,120            | 6,439,069
…  | …                   | …                    | …
7  | 4,313,017           | 9,956,976            | 5,643,959
14 | 1,548,869           | 5,222,246            | 3,673,377

4.2.3 Case Studies of Selected Transactions

It is expected that many of the worst predictions in terms of absolute error fall into the high-end market segment. As noted earlier, the dataset did not contain a sufficient share of very expensive homes for the model to learn from.
However, the two bottom entries in Table 4.3 are not part of the most expensive price segment, which warrants investigation. Table 4.4 shows a feature comparison between two instances in the dataset, one from the test set (Anomaly) and one from the training set (Similar object).

Table 4.4: Comparison of the model's 7th worst prediction (see Table 4.3) with its nearest neighbor: two nearly identical property records, one from the training set (true sale price 12,500,000 SEK) and one from the test set (true sale price 4,313,017 SEK), showing adjusted sale prices and key features (more on key features in Section 4.4) and highlighting a likely duplicate entry.

Example of a Potential Duplicate Transaction
Feature                   | Anomaly   | Similar object
Adjusted Sale Price (SEK) | 4,313,017 | 12,500,000
BuildingAge (years)       | 113       | 109
UtilityArea (m2)          | 187       | 187
Lot Area (m2)             | 548       | 548
QualityScore              | 27        | 27
Closetobeach (1/0)        | 0         | 0
DistmediumCity (m)        | 681.55    | 681.31
Distcoast (m)             | 363       | 363
strand                    | 4         | 4
Deso Class                | C         | C

When evaluating the test data, it was observed that the property with a sale price of 4,313,017 SEK was valued by the model at 9,883,873 SEK, an absolute error of 5,570,856 SEK. Examining the values for both objects in the table reveals that they are highly similar, in fact almost identical, across key features. Furthermore, an analysis of their respective longitude and latitude coordinates confirmed that the two transactions correspond to the same object. The reason the same object sold for roughly a third of what it had been sold for just four years earlier is not apparent. It could be due to an external action: the property is quite large and could have been subdivided into a two-family building, with the building data simply not updated accordingly. It is unclear how many such "identical" or misleading entries exist within the dataset, and no thorough investigation was made.
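A simple screen for such potential duplicates, grouping records by rounded coordinates and flagging large price disagreements within a group, might look like the following (field names and thresholds are hypothetical; the thesis did not implement this check):

```python
from collections import defaultdict

def flag_possible_duplicates(records, decimals=4, max_ratio=2.0):
    """Group sales by rounded (lat, lon) and flag groups whose prices disagree
    by more than `max_ratio`, i.e. likely the same object resold at an odd price."""
    groups = defaultdict(list)
    for rec in records:
        key = (round(rec["lat"], decimals), round(rec["lon"], decimals))
        groups[key].append(rec)
    flagged = []
    for grp in groups.values():
        if len(grp) > 1:
            prices = [r["price"] for r in grp]
            if max(prices) / min(prices) > max_ratio:
                flagged.append(grp)
    return flagged

# Illustrative records echoing the case above (coordinates invented).
sales = [
    {"lat": 57.70010, "lon": 11.97450, "price": 12_500_000},  # training record
    {"lat": 57.70011, "lon": 11.97452, "price": 4_313_017},   # test-set anomaly
    {"lat": 58.10000, "lon": 12.50000, "price": 2_000_000},
]
print(len(flag_possible_duplicates(sales)))  # the first two records form one flagged group
```

Such flagged groups would be candidates for manual review rather than automatic deletion, since a legitimate resale at a very different price is also possible.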
Another anomaly, i.e. a relatively cheap object with a high absolute prediction error, is shown in Table 4.5 together with some similar objects.

Table 4.5: Comparison of the model's 14th worst prediction (see Table 4.3) between the Anomaly transaction and its five nearest neighbors (NN1–NN5), showing true sale prices and key features.

Summary of Property Transactions
Feature                   | Anomaly   | NN1       | NN2       | NN3       | NN4       | NN5
Adjusted sale price (SEK) | 1,548,868 | 6,829,478 | 6,550,359 | 6,056,122 | 5,725,983 | 4,684,105
BuildingAge (years)       | 26        | 24        | 27        | 6         | 14        | 27
UtilityArea (m2)          | 199       | 168       | 145       | 229       | 220       | 147
Lot Area (m2)             | 1742      | 829       | 816       | 1795      | 2671      | 937
QualityScore              | 36        | 26        | 30        | 32        | 44        | 30
Closetobeach              | 0         | 0         | 0         | 0         | 0         | 0
DistmediumCity (km)       | 5.04      | 4.54      | 2.73      | 18.70     | 15.45     | 5.86
Distcoast (km)            | 6.66      | 0.093     | 126.40    | 0.39      | 5.3       | 39.57
strand                    | 4         | 3         | 4         | 4         | 4         | 4
Deso Class                | C         | C         | C         | C         | B         | C

The left-most data column in Table 4.5 corresponds to the Anomaly, an object priced at 1,548,868 SEK. The model predicts a sale price of ŷ = 5,185,451 SEK, whereas the actual price is only y = 1,548,868 SEK. This yields an absolute error of |ŷ − y| = 3,636,583 SEK, corresponding to a relative error of 235%. A row-wise scan of Table 4.5 reveals that the Anomaly is not easily separated from its five nearest neighbours (NN1–NN5) across the high-importance features: BuildingAge, UtilityArea, LotArea, QualityScore, the socio-economic DesoClass, and categorical indicators like strand. No standout covariate suggests this object should be treated differently by the model; in other words, the Anomaly is fully embedded in the typical feature space.

In contrast to its feature similarity, the Anomaly is dramatically dissimilar in price. While its neighbors all transacted between 4.7 and 6.8 million SEK, the Anomaly closed at just 1.55 million SEK, a discount of 65% to 77% relative to every peer. Thus, price is the only truly anomalous dimension of this transaction.
Additionally, distance-based features such as DistCoast offer little help. The Anomaly lies 6.6 km from the coast, but its five nearest neighbors span a wide range, from just 93 meters to 126 kilometers. The variation within this group weakens the predictive signal in that dimension, making it unlikely that the model can rely on it to adjust for the price outlier.

Feed-forward ANNs are well suited to learning smooth patterns in feature space. Given that the Anomaly's input vector xAnomaly closely resembles those associated with sale prices in the 5–7 million SEK range, the model naturally maps it into that price manifold. From the network's perspective, there is no statistical precedent suggesting that a home with such observable features can transact at 1.5 million SEK. It therefore extrapolates upward in a way that is rational from a data-driven standpoint.

Furthermore, an external valuation benchmark for the Anomaly was found on Booli [37]. Booli is a Swedish real estate platform offering comprehensive housing market data, now owned by SBAB Bank. It provides access to current property listings, historical sale prices, area-level trends, and market statistics across Sweden, serving home buyers, sellers, and investors seeking data-driven insights into the property market. One of Booli's core services is its automated property valuation tool. Booli's automated valuation of the Anomaly was 3,540,000 SEK, substantially higher than its realized sale price of 1,548,868 SEK but still below the price predicted by the model. This reinforces the notion that, even from the perspective of an independent, market-wide algorithm, the sale price of the Anomaly stands out as an extreme outlier [37]. In summary, the Anomaly is not a failure of the model but rather a reflection of data limitations.
While the model is trained on a rich feature set, the analysis presented here focuses on a carefully selected subset of the most important features, those that contribute most to price prediction according to feature-importance metrics. Displaying the entire feature space would obscure the interpretability of the analysis and offer limited additional insight. As shown, the Anomaly is virtually indistinguishable from its nearest neighbours across this high-importance subset, yet its price deviates dramatically. This highlights a fundamental limitation: if two homes appear nearly identical in all observable and influential aspects but sell for vastly different prices, a model, even one trained on a comprehensive feature space, cannot be expected to resolve such discrepancies. Without access to richer data or targeted algorithmic adjustments, large errors in cases like those investigated here are not only unsurprising, they are inevitable.

4.3 Embedding Analysis

In the following subsections, the model's embedding space is explored to uncover key patterns and insights.

4.3.1 Embeddings Clustering

K-means clustering (k = 5) was applied to the 128-dimensional embeddings produced by the neural network's projection layer for all training samples. The boxplots of sale prices by cluster in Fig. 4.3 reveal five distinct tiers:

Figure 4.3: Boxplots of sale-price distributions for five clusters obtained by applying K-means to the 128-dimensional neural network embeddings.

• Cluster 0: Mid-high-market segment (median ≈ 4.0 M SEK) with moderate interquartile range and a few high-price outliers.
• Cluster 1: Low-priced homes (median ≈ 1.2 M SEK) showing a tight distribution and minimal skew.
• Cluster 2: High-end homes (median ≈ 5.8 M SEK) with a pronounced long upper tail.
• Cluster 3: Mid tier (median ≈ 2.8 M SEK) exhibiting the widest overall spread and several extreme values.
• Cluster 4: Low-mid-level units (median ≈ 1.7 M SEK) with a relatively low interquartile range but some upper-end outliers.

The plot shows that the embeddings naturally partition the data into value-driven groups beyond any single raw feature.

4.3.2 t-SNE Projection of Embeddings

Figure 4.4 presents a two-dimensional t-distributed Stochastic Neighbor Embedding (t-SNE) projection of the 128-dimensional neural network embeddings for all test transactions. Each point corresponds to a single transaction and is colored according to its log sale price. A smooth gradient from low to high prices is apparent, with higher-priced properties clustering in a distinct region of the embedding space. This continuous organization demonstrates that the learned embeddings capture price information in a structured manner, mapping gradual increases in sale price onto similarly gradual transitions in embedding coordinates.

Figure 4.4: Two-dimensional t-SNE projection of the 128-dimensional ANN embeddings for each property, colored by log sale price. Points that cluster together share similar learned representations.

4.3.3 Embedding-Feature Correlation Analysis

To interpret what the learned embedding dimensions capture, the absolute Pearson correlation was computed between each embedding coordinate and all original numeric and categorical-code features. Key takeaways:

• Educational and demographic signals: Many of the most important dimensions correlate strongly with educational features, such as the fractions of residents with post-elementary and higher education. These embedding axes clearly encode neighborhood education-level statistics.
• Population and area metrics: Many embedding dimensions show high correlation with municipality population, its deviation from the mean, and its total, annual, and biannual change, indicating that latent dimensions capture market size and growth dynamics.
• Spatial proximity: Several dimensions also correlate with distance to medium-sized cities and distances to points of interest, such as golf courses, confirming that network-learned axes encode locational gradients.
• Heterogeneity across dimensions: While some dimensions focus on socioeconomic factors, others capture built-environment attributes, demonstrating that the embedding space distributes different types of signals across separate axes.
• Object-specific dimensions: Many dimensions focus almost solely on object-specific features, such as living area, lot area, and distance to beach, meaning that the network also encodes the combination of attributes that drives value.

4.4 Model Interpretability

The following subsections explore the model's prediction process by examining feature contributions and importance using various interpretability methods.

4.4.1 SHAP Analysis on Raw Features

The individual feature contributions in SHAP are visualized in Fig. 4.5 for the 20 most important features.

Figure 4.5: SHAP summary plot for the model, showing each feature's contribution to the predicted price (x-axis) and the distribution of feature values (color) across observations.

The topmost feature is LogDistMediumCity: its red points (high values) lie on the left (negative impact), and blue points (low values) on the right (positive impact). This indicates that a larger distance to a medium-sized city lowers the predicted price, whereas being closer (small distance, blue) raises it. This aligns with empirical findings that home values decline with distance from city centers [38]. The second and third features from the top, MunicipalityCodeStr and DesoArea, are categorical location codes. They exhibit nearly symmetric, grey-colored distributions around zero, meaning they adjust baseline prices up or down by municipality or district but show no clear monotonic trend.
In other words, different municipalities or DeSO areas simply shift the model output without a single direction of effect, since they are categorical.

4.4.1.1 Property Attributes and Size/Quality Effects

The model's next most important features are structural attributes of the property. LogBuildingAge has high (red) points on the left and low (blue) on the right: older buildings