Embedding-Enhanced Real Estate Valuation in Non-Metropolitan Sweden
A Hybrid Modeling Approach

Master's thesis in Complex Adaptive Systems
Leonard Smedenman & Teddy Sallén
Department of Physics
Chalmers University of Technology
Gothenburg, Sweden 2025
www.chalmers.se

© Leonard Smedenman & Teddy Sallén, 2025.
Supervisor and Examiner: Mats Granath, Director, M.Sc. Complex Adaptive Systems
Master's Thesis 2025
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Symbolic image of AI in housing. Source: Primary.
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice, Gothenburg, Sweden 2025

Abstract
Automated valuation of residential properties in sparsely populated regions poses unique challenges due to thin transaction volumes, diverse housing stock, and limited comparables. This thesis presents a hybrid modeling approach combining an embedding-based artificial neural network (ANN) with a LightGBM gradient boosting machine to predict sale prices in six Swedish municipalities, focusing specifically on houses in non-metropolitan areas. The ANN learns dense representations of categorical and geographic features that capture latent spatial and socioeconomic patterns, while the GBM leverages both raw features and ANN embeddings to refine residual errors.
Model interpretability is achieved via SHAP values and case studies of embedding dimensions, revealing that distance to regional centers, living area, property condition, and proximity to points of interest are key value drivers, even where market data are scarce. The hybrid model demonstrates competitive accuracy, particularly for mid-priced homes, and offers transparent explanations for each valuation. However, large errors persist for rare, high-end properties and extremely remote dwellings, reflecting fundamental data limitations. The results highlight how AI-driven valuation tools can complement traditional appraisal methods by providing rapid, interpretable estimates for routine cases and flagging high-uncertainty transactions for expert review.

Keywords: Automated Valuation Model, real-estate appraisal, neural embeddings, gradient boosting, SHAP interpretability, non-metropolitan housing.

Acknowledgements
We would like to express our gratitude to everyone who has contributed to the completion of this project. Firstly, we would like to thank our contact persons at Värderingsdata, Magnus Persson, Jon Larborn and Niklas Stenwreth. Without their assistance, guidance and knowledge, this project would not have been as successful. We would also like to extend our appreciation to our supervisor and examiner Mats Granath for accepting the role and providing input. Thank you for all your contributions.
Sincerely,
Leonard Smedenman & Teddy Sallén, Gothenburg, June 2025

List of Acronyms
Below is the list of acronyms used throughout this thesis, in alphabetical order:

AI      Artificial Intelligence
ANN     Artificial Neural Network
AVM     Automated Valuation Model
DeSO    Demographic Statistical Areas
GBDT    Gradient Boosted Decision Trees
GBM     Gradient Boosting Machine
GRP     Gross Regional Product
HPM     Hedonic Pricing Model
KNN     k-Nearest Neighbors
KTH     Kungliga Tekniska Högskolan
LGBM    Light Gradient Boosting Machine
MAE     Mean Absolute Error
MAPE    Mean Absolute Percentage Error
ML      Machine Learning
MSE     Mean Squared Error
NN      Nearest Neighbor
P10     Percentage of predictions within ±10% of sale price
P20     Percentage of predictions within ±20% of sale price
R2      Coefficient of Determination
RMSE    Root Mean Squared Error
SHAP    SHapley Additive exPlanations
t-SNE   t-Distributed Stochastic Neighbor Embedding

Nomenclature
Below is the nomenclature of indices, hyper-parameters and constants, parameters, variables, and metrics used throughout this thesis.
Indices
i   Index for property / transaction in the dataset
j   Index for input feature Xj in the hedonic model
m   Index of boosting iteration / tree (hm, Fm)
c   Index of price-quantile class in the auxiliary classifier (pi,c)

Hyper-parameters and constants
α   Weight of the P10 term in the composite loss (annealed from αstart to αend)
γ   Focusing parameter of the focal classification loss
δ   Huber-loss threshold that separates MAE/MSE regimes
k   Number of neighbours in the kNN component
m (margin)   Margin in the triplet-loss constraint
ν   Shrinkage (learning-rate) parameter in gradient boosting
B   Mini-batch size used during stochastic optimisation
wc, wt   Fixed weights of classification and triplet losses in Ltotal

Variables
x   Raw feature vector of a property
x′   Standardised feature: (x − µ)/σ
µ, σ   Empirical mean and standard deviation of a feature
y   True log-transformed sale price (target)
ŷ   Predicted log-price produced by the model
V, V̂   Price on the original SEK scale (V̂ = e^(ŷσ+µ))
zj, aj   Pre-activation and activation of neuron j in the ANN
W(ℓ), b(ℓ)   Weight matrix and bias vector of layer ℓ
e   128-dimensional learned embedding of a property
pi,c   Probability that property i belongs to class c (softmax output)
ri,m   Residual of sample i at boosting stage m
hm(x)   Weak learner (regression tree) at stage m
Fm(x)   Ensemble prediction after m trees

Losses
Lreg   P10-aware regression loss (Huber + soft-P10)
Lcls   Focal classification loss
Ltriplet   Triplet embedding loss
Ltotal   Composite training objective Lreg + (1 − α)wcLcls + wtLtriplet

Evaluation metrics
n   Number of observations in a sample or split
MAPE, MAE, RMSE   Standard error statistics defined in Section 2.7
R2   Coefficient of determination
P10, P20   Share of predictions within ±10% and ±20% of the true price, respectively

Contents
List of Acronyms
Nomenclature
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Traditional Valuation Methods in Sweden
  1.4 Rationale for ML-Based Valuation
  1.5 Dataset
  1.6 Objectives / Research Questions
    1.6.1 Objectives
    1.6.2 Research Questions
  1.7 Scope and Delimitations
2 Theory
  2.1 Price Prediction and Regression Models
    2.1.1 Overview of house price prediction as a regression task
    2.1.2 Hedonic regression models
    2.1.3 K-Nearest Neighbors
  2.2 Feature Engineering
    2.2.1 Logarithmic Transformation of Skewed Variables
    2.2.2 Label Encoding
    2.2.3 Feature Standardization
  2.3 Neural Networks for Regression
  2.4 Loss Functions
    2.4.1 Huber Loss
    2.4.2 Combining Loss Functions in Regression Models
  2.5 Gradient Boosting and LightGBM
    2.5.1 The Gradient Boosting Process
    2.5.2 Gradient Boosting in Real Estate Valuation
  2.6 Overfitting
  2.7 Model Evaluation Metrics
    2.7.1 Mean Absolute Percentage Error (MAPE)
    2.7.2 Mean Absolute Error (MAE)
    2.7.3 Root Mean Squared Error (RMSE)
    2.7.4 Coefficient of Determination (R2)
    2.7.5 P10 and P20
    2.7.6 SHAP values – feature importance and interpretability
    2.7.7 t-SNE visualizing a high-dimensional representation
3 Methodology
  3.1 Data Preprocessing
    3.1.1 Cleaning and Imputation
    3.1.2 Categorical Encoding and Vocabulary Extraction
    3.1.3 Proportion Clipping and Cyclical Date Features
    3.1.4 Feature Selection and Scaling
  3.2 Model Development
    3.2.1 Training, Validation, and Test Split
    3.2.2 Artificial Neural Network with Embeddings
      3.2.2.1 Input and Embedding Layers
      3.2.2.2 Residual Stack and Embedding Head
      3.2.2.3 Multi-Task Output Heads
      3.2.2.4 Composite Loss Function
        3.2.2.4.1 P10-Aware Regression Loss Lreg
        3.2.2.4.3 Focal Classification Loss Lcls
        3.2.2.4.4 Triplet Embedding Loss Ltriplet
      3.2.2.5 Optimization and Regularization
    3.2.3 LightGBM Ensemble with Raw Features and ANN Embeddings
      3.2.3.1 Stage 1: Raw-Feature GBM
      3.2.3.2 Stage 2: Embedding-Based Residual GBM
      3.2.3.3 Combined Prediction and Performance
  3.3 Benchmark Models
    3.3.1 Hedonic Regression Baseline
    3.3.2 KNN
    3.3.3 Baseline Model Configurations
4 Summary of Findings
  4.1 Comparative Evaluation
    4.1.1 Model Performance by Price Decile
  4.2 Error analysis
    4.2.1 Error Distribution
    4.2.2 Case Studies: Best and Worst Predictions
    4.2.3 Case Studies of Selected Transactions
  4.3 Embedding Analysis
    4.3.1 Embeddings Clustering
    4.3.2 t-SNE Projection of Embeddings
    4.3.3 Embedding-Feature Correlation Analysis
  4.4 Model Interpretability
    4.4.1 SHAP Analysis on Raw Features
      4.4.1.1 Property Attributes and Size/Quality Effects
      4.4.1.2 Location
      4.4.1.3 Categorical Location Effects
    4.4.2 Raw Feature Importance by Gain
    4.4.3 Quantified Embedding Importance
      4.4.3.1 Gain-Based Embedding Importance
      4.4.3.2 SHAP Analysis on Embeddings
      4.4.3.3 Case Studies of Three Different Embedding Dimensions
  4.5 Demographic statistical areas analysis
  4.6 Model Proficiency
5 Conclusion
  5.1 Key Factors Influencing Property Values
  5.2 Model Performance and Limitations
  5.3 Implications for Low-Density Housing Markets
  5.4 Future Work
Bibliography
A Appendix 1
References

List of Figures
2.1 Overview of house price prediction as a regression task. Input features are mapped through a regression model to produce a continuous output (price). Source: Primary.
2.2 A multi-layer feed-forward Artificial Neural Network with an input layer, one hidden layer, and an output layer. Each connection has a weight, and each neuron (circle) computes a function of the weighted inputs. Source: Primary.
2.3 Comparison of Huber loss (green) with standard squared error loss (blue) as a function of the prediction residual. Source: Qwertyus, https://en.wikipedia.org/wiki/Huber_loss#/media/File:Huber_loss.svg
3.1 A simple diagram visualizing the steps of the hybrid model. Source: Primary.
3.2 An illustration of how the composite loss function penalizes wrong predictions as α increases. Source: Primary.
3.3 An illustration of how the composite loss penalizes wrong predictions as α increases. Source: Primary.
3.4 Illustration of focal loss: as γ increases, well-classified examples (high p) are down-weighted, i.e., their loss goes to zero faster, which helps focus the training on more difficult examples (low p). The difference might look small but is quite tangible in practice. Source: Primary.
3.5 Triplet embedding loss. For δ ≤ −margin, the negative sample is at least "margin" farther away than the positive, giving zero loss. For δ > −margin, the loss grows linearly with δ + margin. Source: Primary.
3.6 Illustration of KNN regression. A new house (vertical dashed line at 180 m²) is valued by averaging the prices of its 5 nearest neighbors (orange points) among a sample of training homes. Source: Primary.
4.1 Absolute error distribution (left) and relative error distribution (right) for the hybrid model.
4.2 Scatter plot of true (x-axis) versus predicted (y-axis) prices; perfectly correct predictions would align with the red dotted line.
4.3 Boxplots of sale-price distributions for five clusters obtained by applying K-means to the 128-dimensional neural network embeddings.
4.4 Two-dimensional t-SNE projection of the 128-dimensional ANN embeddings for each property, colored by log sale price. Points that cluster together share similar learned representations.
4.5 SHAP summary plot for the model, showing each feature's contribution to the predicted price (x-axis) and the distribution of feature values (color) across observations.
4.6 Feature importance plot illustrating the top 20 influential raw features in the LightGBM model, ranked by gain (total reduction in the loss function). The horizontal bars represent the relative contribution of each feature to the predictive performance, highlighting LogDistMediumCity, LogUtilityArea, and LogLivingArea as the most impactful features for predicting real estate prices.
4.7 Embedding gains on the residuals of the LightGBM trained on raw features; the gain refers to the reduction of the loss function.
4.8 SHAP summary plot for 20 embedding dimensions, showing each embedding's impact on predicted price and its value distribution.

List of Tables
3.1 Overview of baseline models and their configurations.
4.1 Test Set Performance Comparison of All Models.
4.2 Hybrid Model Performance by True Price Decile.
4.3 The best and worst predictions made by the model in absolute terms. The sale prices are index-adjusted to 2020-06, hence the seemingly odd prices.
4.4 Comparison of the model's 7th worst prediction and its nearest neighbor (NN1) (see Table 4.3) between two nearly identical property records, one from the training set (true sale price 12,500,000 SEK) and one from the test set (true sale price 4,313,017 SEK), showing adjusted true vs. predicted sale prices and key features (more on key features in Section 4.4), highlighting a likely duplicate entry.
4.5 Comparison of the model's 14th worst prediction (see Table 4.3) between the transaction termed the Anomaly and its five nearest neighbors (NN1–NN5), showing true sale prices and key features.
4.6 Performance Metrics by DesoClass.
A.1 Table of all included counties and municipalities in the dataset.

1 Introduction
Advancements in artificial intelligence (AI) offer new opportunities for real estate valuation, especially in data-scarce markets. This study explores the use of Machine Learning (ML) models, specifically Artificial Neural Networks (ANNs) and gradient boosting, to improve valuation accuracy in sparsely populated regions of Sweden where traditional methods face significant limitations.
1.1 Background
Real estate valuation plays a central role in the functioning of the property market and financial system. Accurate property values are needed for a range of purposes, including sales and purchases, taxation, investment analysis, and securing mortgage loans [1]. In Sweden, official assessments of property value ("taxeringsvärde") are determined periodically by the national tax authority (Skatteverket) and are intended to reflect approximately 75% of market value for taxation purposes [2]. These assessments rely on recent sale prices of comparable properties within defined value areas (värdeområden) where properties are assumed to have similar conditions. However, in parts of Sweden that are generally more non-urban, such as the provinces of Östergötland, Småland, Gotland and Blekinge (see A.1 for the full list of included counties and municipalities), property transactions occur less frequently, particularly for houses, leading to thin markets with scarce comparable sales data [3]. In these areas, traditional indicators of market value become less reliable or even nonexistent, as noted historically in legal preparatory works that questioned the applicability of a market value concept in locales with virtually no sales activity. This poses challenges for property owners, buyers, and lenders, as valuation uncertainty increases outside urban centers.

Traditional real estate appraisal in Sweden has long been based on professional judgment supported by standard methods. These conventional approaches, while grounded in decades of experience, often struggle to capture market dynamics in real time, especially when data on actual transactions are limited.

1.2 Problem Description
Valuing properties in sparsely populated regions like the ones this thesis focuses on presents significant challenges due to the limited number of transactions and the diverse nature of the properties. The standard sales comparison approach, which
relies on identifying recently sold comparable properties in a given area, becomes less reliable when few or no truly similar sales exist. In these areas, appraisers may be forced to base valuations on a very small sample of transactions, increasing the risk of error. Moreover, non-urban properties often possess unique features, such as old building years or large plots of land, which make direct comparisons difficult. These factors contribute to considerable uncertainty in valuation and highlight the need for more flexible, data-driven approaches in sparsely populated markets.

Due to the limited availability of market data, valuers may be forced to rely on alternative methods or general assumptions. For example, cost-based or income-based valuations may be used in place of direct market comparisons. These methods, however, may not reflect what a buyer would actually pay, especially if there are intangible values associated with location and amenities that are not captured by cost or income alone. As a result, valuations in these areas carry a higher degree of uncertainty and risk. This is problematic not only for private stakeholders but also for banks and public agencies. Lenders face difficulties in mortgage risk assessment when valuations are uncertain, and municipalities or tax authorities struggle to ensure fairness and accuracy in taxation when comparable sales are lacking [3]. Recent market fluctuations have highlighted this issue: during periods of market downturn or upheaval, transaction volumes can drop sharply. In 2022, for example, the transaction volume in Sweden fell by over 40% year-on-year, creating an extremely thin market [3] and making it even harder to gauge true property values in affected regions.

1.3 Traditional Valuation Methods in Sweden
Real estate valuers traditionally employ a few fundamental methods to estimate market value, each with its own assumptions and data requirements.
One such method is the sales comparison approach, mentioned in the previous section, wherein the appraiser identifies recent sales of similar properties and adjusts for differences to estimate the subject property's value. In Sweden, hedonic pricing models (HPMs) based on multiple regression are used to support both private appraisals and mass appraisal for tax assessment. These models generalize the relationship between property characteristics and market prices within a given region [1]. HPMs represent property value as a function of its attributes, such as size, location, and quality, and have been foundational in valuation theory since 1974 [4]. They are relatively transparent and grounded in economic theory, but they typically assume a linear (or log-linear) relationship and may struggle with complex, non-linear interactions between features.

All these traditional methods require substantial expertise and judgment. Appraisers must carefully select comparables or estimate depreciation. In thin markets, the lack of data forces greater reliance on professional judgment, potentially introducing bias or error. Moreover, manual valuation processes are time-consuming and not easily scalable. As the demand grows for rapid valuations, traditional methods show their limitations in terms of speed and consistency [5]. These limitations motivate the search for more automated and data-driven valuation methods that can complement or enhance the traditional techniques.

1.4 Rationale for ML-Based Valuation
Advancements in ML offer promising opportunities to address the challenges of non-urban property valuation. Automated Valuation Models (AVMs) are increasingly being used in real estate markets worldwide to produce instant value estimates by analyzing large datasets of property features and past transactions [6].
An AVM is a computer-driven algorithm that takes property data as input and outputs a value estimate, often very rapidly, making it attractive for both lenders and investors who need quick assessments [5]. The key advantage of ML-based models is their ability to detect complex, non-linear patterns in data that traditional linear models fail to capture. This is particularly relevant for diverse non-urban properties, where interactions between attributes, like land size, building condition, and locational factors, may influence value in complicated, often non-obvious ways.

Among ML techniques, one that stands out is the Artificial Neural Network. ANNs are computational models inspired by the human brain, capable of fitting extremely flexible functional forms to data. They have shown promise in house price prediction tasks; for instance, studies have found that neural networks can outperform multiple regression models and other techniques in terms of valuation accuracy [7]. By learning from a broad set of input examples, an ANN can capture subtle relationships. The downside is that ANNs are often criticized as "black boxes", offering little transparency into how they arrive at a given estimate [8]. This lack of interpretability can undermine trust in valuations, since a valuation without a clear basis makes for a weak argument to present to a stakeholder, which is why recent research emphasizes explainable AI methods in real estate applications.

Embracing AI for property valuation in Sweden's non-urban context is not just a theoretical exercise; the industry has already begun moving in this direction. Banks and valuation firms in Sweden are experimenting with AI-driven models to complement traditional appraisals, especially for residential properties.
According to Värderingsdata, a leading provider of property data in Sweden, AI-based valuation models are already used in practice and can drastically speed up the valuation process, allowing human experts to focus on more complex analysis [5]. In non-urban areas, an ML model might, for example, learn from transactions in a wider region or over a longer time horizon to compensate for the lack of recent local sales.

The rationale for this study is thus clear: by applying ML techniques to the problem of non-urban property valuation in Östergötland, Småland, Gotland, and Blekinge, the thesis aims to assess whether these methods can improve accuracy and consistency over traditional approaches. The practical considerations of using such models will also be examined, including data requirements and the interpretability of results. This thesis aims to bring meaningful insights into what features, or combinations of features, are deemed most important in the chosen focus group and how they differ from, for instance, apartments in urban areas. The ultimate goal is to develop an Automated Valuation Model tailored to non-urban Swedish conditions, or at least to evaluate its feasibility.

The quality of an AI-based valuation is heavily dependent on the quality and quantity of the input data; poor or biased data can lead to misleading estimates [5]. Additionally, stakeholders must be able to trust the output of a model, which yet again highlights the importance of transparency and validation [8]. This study is undertaken with these considerations in mind. By focusing on a geographically specific and data-challenged context, the research will highlight not only the potential accuracy gains from ML, but also the limitations and requirements for deploying such technology in real-world valuation practice.
1.5 Dataset
The dataset used in this study was provided by Värderingsdata and comprises roughly 90,000 residential properties with transactions ranging from 2015 to 2022. Each object represents an individual sale of a property (the same property can thus appear more than once if it was sold multiple times during the timeframe). The data includes approximately 170 features representing various physical, geographic, socioeconomic, and temporal characteristics. A column with sale prices index-adjusted to 2020-06 allows for a fair comparison across all transaction years.

The dataset is organized into several feature domains. Object-level features describe each property's individual characteristics, including variables such as living area, construction year, energy class, and water and sewage access. Neighborhood-level characteristics capture sociodemographic and economic indicators from the surrounding area, including population age distribution, household types, education levels, income distribution, and local real estate market statistics. Macroeconomic indicators such as interest rates, gross regional product (GRP), and inflation measures are also included, contextualizing each transaction within broader market conditions. Geospatial attributes incorporate detailed locational data, including distances to various urban centers, natural features (like lakes and coast), infrastructure (like roads, rail, and airports), and points of interest such as golf courses, schools, and ski resorts. Additionally, temporal variables encode the time dimension of each transaction, with fields like sale year, month, and day of the week.

The diversity and detail of the data offer a rich foundation for statistical learning. While some variables contain missing values, the overall completeness is high.
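To illustrate the index adjustment of sale prices mentioned above, the sketch below rescales nominal prices to the 2020-06 price level; the index values and function names are hypothetical (the thesis uses a precomputed column supplied by Värderingsdata), but the mechanics are the same:

```python
# Hypothetical house-price index by month; 2020-06 is the reference level.
# These numbers are illustrative only, not actual Swedish index values.
PRICE_INDEX = {
    "2015-06": 78.0,
    "2018-06": 92.0,
    "2020-06": 100.0,
    "2022-06": 118.0,
}

def adjust_price(nominal_price: float, sale_month: str,
                 ref_month: str = "2020-06") -> float:
    """Rescale a nominal sale price to the reference month's price level."""
    return nominal_price * PRICE_INDEX[ref_month] / PRICE_INDEX[sale_month]

# A house sold for 1,560,000 SEK in 2015-06 corresponds to
# 1,560,000 * 100 / 78 = 2,000,000 SEK at the 2020-06 price level.
```

Adjusting all transactions to one reference month is what makes prices from 2015 and 2022 directly comparable as a single regression target.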
Ideally, the dataset would include even more detailed object-specific features, such as the number of rooms, construction material, window type, roof condition, heating system type, floor material, ceiling height, and the presence of amenities such as a balcony, fireplace, or integrated household appliances, but these types of data are not easily obtained; most of the columns instead pertain to more regional data. More specific data on the condition of the houses and their appliances could enable more accurate valuations, but the variety of data available is sufficient to make meaningful distinctions.

1.6 Objectives / Research Questions
The objectives of the master's thesis are summarized in the research questions defined below.

1.6.1 Objectives
1. To develop an ML-based model for property appraisal.
2. To assess the accuracy and reliability of the model in comparison to alternative methods.
3. To identify the most influential factors in property valuation as determined by the model, and what features contribute to the valuation.
4. To uncover what non-obvious features, and interactions between features, might be specifically important in non-urban housing.

1.6.2 Research Questions
1. How does an AI-driven model compare to benchmark models in terms of accuracy and overall performance?
2. What are the key factors influencing non-urban property valuation in the Småland, Östergötland, Gotland and Blekinge regions, as identified by the model?
3. What challenges and limitations arise when applying machine learning techniques to real estate valuation, and how can they be mitigated?
4. What insights can be gained from this study to inform future advancements in property valuation processes?
1.7 Scope and Delimitations
To maintain a clear and manageable scope, the following delimitations were applied:
• Property Types: The analysis is restricted to residential properties, such as family homes and small non-urban dwellings. However, assessments of very cheap smaller houses are discarded, since their sale prices and features rarely coincide and thus only provide noise.
• Temporal Scope: Transaction data used for training and evaluation are limited to a defined historical period, 2015-2022.
• Model Focus: The study concentrates on a hybrid model consisting of an ANN and gradient boosting.
• Model Inputs: Only the structured dataset provided by Värderingsdata is used in this study. No additional data collection from external sources has been conducted, although lagged features derived from the original data have been created.
• Comparison Baseline: The performance of the model is compared to other traditional computational models. Manual expert appraisals are referenced for context but not replicated in this study.
• Outcome Metrics: Model performance is evaluated primarily using statistical measures of predictive accuracy (e.g., RMSE, MAE, MAPE, P10/P20). Broader impacts such as user acceptance and regulatory considerations are not tested.
• Implementation Context: The study is exploratory and does not include real-time deployment or integration of the developed models into production environments used by Värderingsdata.

2 Theory
This chapter outlines the theoretical foundations of property valuation and machine learning, providing the conceptual framework for the methods used in the study.

2.1 Price Prediction and Regression Models
Predicting housing prices is a key challenge in real estate economics and data science. This section explores how regression models are applied to estimate property values based on various input features.
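As a minimal, self-contained sketch of this regression framing (features in, price out), the example below fits a linear price model by ordinary least squares on synthetic data; all features, coefficients, and numbers are invented for illustration, not taken from the thesis dataset:

```python
import numpy as np

# Synthetic properties: living area (m^2), distance to town (km),
# condition (0-2). The "true" coefficients are chosen arbitrarily.
rng = np.random.default_rng(0)
X = rng.uniform([40, 0, 0], [250, 60, 2], size=(500, 3))
true_beta = np.array([12_000.0, -8_000.0, 150_000.0])
y = 500_000 + X @ true_beta + rng.normal(0, 50_000, size=500)  # price in SEK

# Ordinary least squares with an intercept column: solve min ||A b - y||^2.
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# beta_hat[0] approximates the intercept; beta_hat[1:] approximate the
# per-feature price effects used by the hedonic models of Section 2.1.2.
```

The recovered coefficients are directly interpretable (SEK per m², SEK per km, and so on), which is exactly the transparency property that hedonic models trade against flexibility.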
2.1.1 Overview of house price prediction as a regression task

House price prediction is the task of estimating a property’s market value from its attributes. It is framed as a regression problem because the target output (price) is a continuous variable. In a regression model, the house’s features serve as input variables and the output is a predicted price. The goal is to learn a mapping f that relates these features to the sale price by training on historical sales data. House price prediction is therefore a classic example of supervised regression analysis in real estate economics and machine learning [9]. An illustration with three arbitrary features is shown in Fig. 2.1.

Figure 2.1: Overview of house price prediction as a regression task. Input features (e.g., property size in sq.m., location coordinates, property condition) are mapped through a regression model f(X, β) to produce a continuous output (price). Source: Primary.

2.1.2 Hedonic regression models

A cornerstone of traditional house valuation is hedonic regression. This method models a property’s value

V = f(X1, X2, . . . , Xn) (2.1)

as a function of its characteristics. In practice, f is often assumed linear:

V = β0 + β1X1 + · · · + βnXn + ϵ (2.2)

where each Xj is a property feature (size, location, etc.) and βj its estimated effect on price. Each coefficient thus represents the contribution of that feature, making the model easy to interpret. Hedonic regression has been frequently used for decades in market analysis and mass appraisal [10] because of its simplicity and transparency. However, the linear additive assumptions of hedonic models can be limiting. A simple hedonic model may fail to capture complex or non-linear relationships (for example, varying impacts of property age on market value) or interactions between factors.
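For illustration, the linear hedonic form of Eq. (2.2) can be fitted by ordinary least squares. The following is a minimal sketch; the feature columns (living area, plot size, renovation flag) and all numbers are made up for illustration and do not come from the thesis data:

```python
import numpy as np

# Hypothetical toy data: living area (sq.m.), plot size (sq.m.), renovated flag.
X = np.array([
    [120.0,  800.0, 1.0],
    [ 90.0,  600.0, 0.0],
    [150.0, 1200.0, 1.0],
    [ 70.0,  500.0, 0.0],
    [110.0,  900.0, 1.0],
])
y = np.array([2.4e6, 1.6e6, 3.1e6, 1.2e6, 2.2e6])  # sale prices in SEK

# Add an intercept column and solve V = beta_0 + beta_1*X_1 + ... by least squares.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

predicted = X1 @ beta  # fitted hedonic prices
```

In this formulation the fitted coefficients β are read off directly as the per-unit price contributions of each feature, which is exactly the interpretability the hedonic approach is valued for.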
Moreover, hedonic regression requires high-quality data containing the key variables; omitted variables or sparse data can lead to biased, unreliable estimates [11]. Furthermore, hedonic models are sensitive to multicollinearity: adding many features without careful consideration can therefore lead to lopsided or misleading results, which calls for an informed user in order to obtain accurate results.

2.1.3 K-Nearest Neighbors

An alternative non-parametric approach to hedonic regression is the K-Nearest Neighbors algorithm (KNN), originally proposed by Fix and Hodges and later formalized by Cover and Hart [12]. Instead of specifying a functional form for the relationship between property characteristics and price, KNN assumes that similar properties have similar market values. For an object with feature vector X = (X1, X2, . . . , Xn), the set of its k nearest neighbors in the training data is defined as:

Nk(X) = { (X(i), V(i)) : X(i) is among the k closest points to X }.

The predicted value V̂ is then computed as the average of the neighbor prices:

V̂(X) = (1/k) Σ_{i ∈ Nk(X)} V(i). (2.3)

By using a distance metric, the model captures non-linear relationships and interactions without explicit model assumptions. The method is intuitive and straightforward to implement, but can become computationally expensive for large datasets and suffer from the “curse of dimensionality” as the feature space grows [13]. It is also sensitive to noise and unevenly distributed data. Nonetheless, KNN remains a popular baseline in real estate valuation studies and in graph-based extensions where local similarity is leveraged.

2.2 Feature Engineering

Effective feature engineering is essential for extracting maximal predictive power from structured data. In real estate valuation, raw inputs can include highly skewed numeric variables, high-cardinality categorical variables and proportional features due to the inherently diverse nature of housing.
This section reviews the theory behind each transformation applied in the code.

2.2.1 Logarithmic Transformation of Skewed Variables

Many real estate attributes exhibit a long right tail, where a small fraction of high-end properties inflate the mean and violate Gaussian assumptions. Applying the natural logarithm compresses large values more than small ones, stabilizing variance and often improving both linear and non-linear model performance [14]. In economic contexts, log-errors correspond to relative errors, making them more interpretable when predicting quantities that span multiple orders of magnitude.

2.2.2 Label Encoding

Simple categorical fields with low cardinality can advantageously be converted to integer labels via a label encoder, which preserves uniqueness but imposes an arbitrary order. While tree-based models are unaffected by ordinal label codes, neural networks can learn embeddings on these integer indices.

2.2.3 Feature Standardization

Features with heterogeneous scales, for example living area in square meters versus a log-adjusted price, can dominate optimization and distance metrics. Z-score standardization,

x′ = (x − µ) / σ,

centers each feature to zero mean and unit variance, facilitating stable gradient descent in neural networks and balanced Euclidean distances in k-nearest neighbors [15].

2.3 Neural Networks for Regression

Artificial Neural Networks are a class of models inspired by the human brain, composed of interconnected units called neurons organized in layers. An ANN typically consists of an input layer, which takes in the features, one or more hidden layers that transform the inputs through weighted connections, and an output layer that produces the prediction. Each connection between neurons has a weight that amplifies or reduces the signal, and each neuron applies a non-linear activation function to the weighted sum of its inputs.
Through a learning process, these weights are adjusted so that the network outputs accurate predictions on the training data [16]. A simple illustration of a neural network is shown in Fig. 2.2.

Figure 2.2: A multi-layer feed-forward Artificial Neural Network with an input layer, one hidden layer, and an output layer. Each connection has a weight, and each neuron (circle) computes a function of the weighted inputs. Source: Primary.

Mathematically, a simple ANN with one hidden layer can be described as follows. Suppose there are d input features x1, . . . , xd. Each hidden neuron j computes a linear combination

zj = Σ_{i=1}^{d} w^{(1)}_{ij} xi + b^{(1)}_j (2.4)

and then applies a non-linear activation aj = f(zj), where f could be a ReLU or sigmoid function. The output layer then takes these hidden activations and computes the final output:

ŷ = Σ_j w^{(2)}_j aj + b^{(2)} (2.5)

(for a regression network, a linear activation is often used at the output so that ŷ is a continuous number). In vectorized form, the network function is:

ŷ = W^{(2)} f(W^{(1)} x + b^{(1)}) + b^{(2)} (2.6)

The key point is that by composing two (or more) linear transformations with non-linear activations, the network can approximate very complex functions. In fact, the Universal Approximation Theorem states that a sufficiently large neural network can approximate any continuous function on compact domains to arbitrary accuracy, given enough neurons in the hidden layer [17]. This explains why neural networks are so useful for predicting house prices: they can learn complex relationships between features.

ANNs learn the weights from data through an iterative optimization process called backpropagation combined with gradient-based optimizers. The network starts with random weights, and in each training epoch the predictions are compared to the true prices using a loss function (discussed in more depth in the Methodology chapter 3).
The gradient of the loss with respect to each weight is computed through the backpropagation algorithm, and the weights are adjusted in the direction that reduces the error. Over many iterations, the network ideally converges to a set of weights that make accurate predictions on the data in question. One appeal of ANNs in real estate is their ability to automatically learn latent features. For instance, the hidden neurons learn to represent combinations of inputs; a neuron might, for example, activate for "lakeside rural cottage" properties if such a pattern is present. ANNs are flexible and can theoretically handle interactions and non-linearities better than a predefined regression formula.

However, there are challenges and considerations with ANNs. First, they generally require a large amount of data to train effectively, especially compared to many simpler models. In a data-sparse rural context, a complex neural network could overfit, learning quirks of the training data that do not generalize, if not carefully regularized. Simpler network architectures or additional data might be necessary for better results. Second, ANNs are often criticized as “black boxes”, as mentioned in 1.4, because the relationship between inputs and outputs is encoded in many weights in a non-transparent way. It is not obvious why a particular prediction was made, which can be a disadvantage in valuation, where explainability is important. Later in this chapter, interpretability methods that can mitigate this problem are discussed. Finally, hyperparameter tuning (choosing the number of layers, neurons, learning rate, etc.) is important for good performance and can be time-consuming. Despite these issues, ANNs remain a promising and highly feasible method for capturing complex value drivers in properties.

2.4 Loss Functions

In training and evaluating regression models, the choice of loss function/error metric is critical.
The loss function is the quantitative measure of error that the model tries to minimize during training. Different losses have different properties and can lead to different model behavior. This is especially important in valuation, where one might care about relative error more than absolute error, or want to avoid over-penalizing outliers. Below, the common and specialized loss functions used in this thesis are outlined.

2.4.1 Huber Loss

The Huber loss is a robust loss function that behaves like mean squared error (MSE) for small errors and like mean absolute error (MAE) for large errors [18, 19]. Mathematically, it is defined piecewise, being quadratic when the absolute residual is below a certain threshold δ and linear beyond that point. This hybrid nature gives Huber loss the advantages of both MSE and MAE. Huber loss is commonly used in robust regression and machine learning settings where the user expects noisy data, providing a balance between sensitivity to small errors and insensitivity to very large deviations. An illustration of how Huber loss penalizes wrong predictions is shown in Fig. 2.3.

Figure 2.3: Comparison of Huber loss (green) with standard squared error loss (blue) as a function of the prediction residual. Source: Qwertyus, https://en.wikipedia.org/wiki/Huber_loss#/media/File:Huber_loss.svg

2.4.2 Combining Loss Functions in Regression Models

In practice, a single regression loss may not capture all modeling objectives. Combining multiple loss terms allows the model to balance these priorities. In general, one forms a composite loss as a weighted sum of components, so that each term contributes to guiding the training process. This strategy can improve generalization: prior work has shown that multi-objective loss functions often yield better performance on heterogeneous data and allow practitioners to tune trade-off hyperparameters between different goals [20].
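A weighted combination of the kind described above can be sketched in a few lines of numpy. The Huber threshold δ = 1 and the auxiliary weight are illustrative choices, not the values used in the thesis, and the auxiliary term is a simple stand-in:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def composite_loss(y_pred, y_true, w_aux=0.3):
    """Weighted sum of a robust regression term and an auxiliary objective.

    The auxiliary term here is a placeholder; in a multi-task model it could
    be a classification or contrastive loss instead."""
    reg = huber(y_pred - y_true).mean()
    aux = np.abs(y_pred - y_true).mean()
    return reg + w_aux * aux

res = huber(np.array([0.5, 2.0]))  # -> [0.125, 1.5]
```

The transition at δ is visible in the example: the residual 0.5 is penalized quadratically (0.125), while the residual 2.0 is already on the linear branch (1.5 rather than 2.0).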
2.4.3 Multi-Task Loss Combinations

Optimizing a single regression loss can be particularly limiting when the model must also learn a structured or generalizable internal representation. In multi-task learning settings, it is common to combine several loss terms, each with a different purpose. For example, alongside the primary regression loss which predicts the sale price, additional losses such as classification or contrastive objectives can guide the model toward learning embeddings that reflect meaningful relationships in the data, for example market segment. This allows each loss term to influence training in proportion to its assigned weight. While the model still predicts a single scalar target, the additional losses support generalization by enforcing structure in the learned representation. Prior work shows that such multi-objective training can improve both convergence and out-of-distribution robustness [20].

2.5 Gradient Boosting and LightGBM

Gradient Boosting is an ensemble method that builds a strong predictor by sequentially combining weak learners, typically shallow regression trees. Originally developed for classification, it was extended to regression by Friedman (2001) as Gradient Boosted Decision Trees (GBDT) [21].

2.5.1 The Gradient Boosting Process

Instead of training one complex model, gradient boosting builds a sequence of simple models (h1, h2, . . . , hM), where each new model tries to correct the errors made by the ones before it. The process starts with a basic guess F0(x), often just the average sale price, and gradually improves this prediction in steps [22].

1. Compute residuals for each training example. For Mean Squared Error loss, the residual at stage m is r_{i,m} = y_i − F_{m−1}(x_i).
2.
Train a new decision tree hm(x) on these residuals, learning how the current model errs.
3. Update the model: Fm(x) = F_{m−1}(x) + ν · hm(x), where ν is a shrinkage parameter (learning rate).
4. Repeat until M trees have been added or validation error ceases to improve.

Each tree greedily reduces the remaining error by moving in the negative gradient direction of the loss function, hence the term gradient boosting. The final model is a weighted sum of M decision trees. Although individual trees are generally shallow, the ensemble collectively achieves accuracy and robustness.

2.5.2 Gradient Boosting in Real Estate Valuation

In real estate valuation, gradient boosting models provide distinct advantages. Decision trees naturally handle numerical and categorical features, effectively capturing non-linear relationships among features. For example, trees can specifically model scenarios like rural properties with long commutes or waterfront properties, accumulating adjustments from multiple trees for nuanced predictions.

Despite its strengths, gradient boosting, like all models, has drawbacks. Optimal performance requires careful hyperparameter tuning of, for example, the number of trees, their depth, and the learning rate, using techniques like cross-validation [23] to balance overfitting and underfitting. Furthermore, large ensembles can slow predictions on very large datasets, though this is typically manageable in real estate contexts. Despite these minor drawbacks, gradient boosting remains a powerful, flexible methodology well suited to modeling complex and sparse data.

2.6 Overfitting

Overfitting is a phenomenon in which a model becomes too closely aligned with the training data, capturing noise or unusual patterns that do not generalize well to unseen inputs [24]. This often results in a steadily decreasing training error while the validation error, after an initial improvement, begins to rise. This pattern indicates that the model is not learning the underlying data distribution but is instead memorizing specific examples from the training set. As a result, the model performs well on the data it has seen but poorly on new or unseen data. Poor fitting occurs at both extremes: a model that is too small to capture the data underfits, while a model whose capacity is similar in scale to the training data, i.e., large enough to memorize patterns without generalizing, runs the highest risk of overfitting. Interestingly, work has shown that very large or overparameterized networks often generalize better than moderately sized ones [25]. When the training data is small or contains noise, the risk increases further. For instance, a neural network with a number of parameters similar to the number of training samples can easily memorize the data rather than learn generalizable features. In practice, overfitting is often diagnosed by monitoring the difference between training and validation errors: a growing gap between them signals that the model’s generalization ability is deteriorating. To mitigate overfitting, techniques such as regularization, early stopping during training, and the use of a separate validation set are commonly applied.

2.7 Model Evaluation Metrics

In order to make a thorough comparison of the different models, standardized and clear evaluation metrics are needed. The ones used for this thesis are listed in the subsequent subsections, where y is the true price, ŷ the predicted price, and n is the total number of properties.

2.7.1 Mean Absolute Percentage Error (MAPE)

MAPE is essentially the average percentage error and is calculated as in equation (2.7):

MAPE = (100%/n) Σ_{i=1}^{n} |(yi − ŷi) / yi| (2.7)

For instance, a MAPE of 10% means predictions are off by 10% on average. MAPE is scale-independent, which is useful in real estate portfolios with a wide range of prices. It is intuitive and a common metric in appraisal literature.
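The metric of Eq. (2.7) is straightforward to compute; a small sketch with made-up prices:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, Eq. (2.7), in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Two predictions, off by 10% and 20% respectively:
err = mape([1_000_000, 2_000_000], [1_100_000, 1_600_000])  # -> 15.0
```

Because each residual is divided by the true price, an error of 100 000 SEK on a cheap cottage weighs more than the same error on an expensive villa, which is exactly the scale independence discussed above.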
2.7.2 Mean Absolute Error (MAE)

In contrast to MAPE, MAE measures the average absolute difference between predicted and true values. It is defined as:

MAE = (1/n) Σ_{i=1}^{n} |yi − ŷi| (2.8)

MAE is in the same units as the target, i.e., SEK, making it directly interpretable for stakeholders.

2.7.3 Root Mean Squared Error (RMSE)

RMSE penalizes larger errors more heavily by squaring the residuals before averaging and then taking the square root:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (yi − ŷi)² ) (2.9)

Because of the squaring, RMSE is more sensitive to outliers than MAE. A lower RMSE indicates fewer large deviations, which is critical when extreme misvaluations carry high risk.

2.7.4 Coefficient of Determination (R²)

The R² metric quantifies the proportion of variance in the true values explained by the model:

R² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)², where ȳ = (1/n) Σ_{i=1}^{n} yi (2.10)

An R² of 0.80 means 80% of the variance in sale prices is captured by the model, indicating strong explanatory power. Unlike the error metrics, a higher R² is better, with a maximum of 1.0 for a perfect fit.

2.7.5 P10 and P20

In real estate mass appraisal, P10 and P20 are accuracy metrics indicating the share of model predictions that fall within a certain margin of the true property value. In other words, P10 is the percentage of predicted prices within ±10% of the actual sale price, and P20 is the percentage within ±20%. Formally:

P10 = 100% · #{i : |ŷi − yi| ≤ 0.10 · yi} / n,
P20 = 100% · #{i : |ŷi − yi| ≤ 0.20 · yi} / n.

2.7.6 SHAP values – feature importance and interpretability

SHAP (SHapley Additive exPlanations) is a method rooted in cooperative game theory for interpreting machine learning predictions by assigning each feature a SHAP value. These values represent how much each feature increases or decreases a prediction relative to a baseline (e.g., the average prediction) [26].
SHAP values extend Shapley values from cooperative game theory to machine learning. They attribute the model’s prediction to input features by averaging each feature’s contribution across all possible subsets of features. SHAP values show the effect of each feature on a specific prediction. For example, a rural house’s valuation might decrease due to being farther from a city, but increase with a larger lot size. Summing the baseline and these SHAP contributions explains the final prediction clearly, similar to how an appraiser would justify a property valuation.

In this thesis, SHAP values clarify the gradient boosting model’s predictions, identifying which features influence house valuations and ensuring the model captures logical patterns (e.g., a larger living area positively affecting price). SHAP analyses can also detect potential spurious correlations and visually demonstrate feature importance and non-linear effects through SHAP summary and dependence plots. SHAP values can also be attributed to neural embeddings if these are used in a gradient boosting model, allowing for interpretation of which embeddings, or sets of combined features, contribute to the valuation.

2.7.7 t-SNE – visualizing a high-dimensional representation

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for visualizing high-dimensional data by embedding it into a low-dimensional space, typically 2D, while preserving local relationships [27]. Unlike linear methods like Principal Component Analysis (PCA) [28] that preserve global variance, t-SNE emphasizes local structure: points that are close together in the high-dimensional space are mapped close together in 2D, while dissimilar points are placed farther apart. The algorithm proceeds in two main steps:

1. High-dimensional similarities: Computes probabilities pij that reflect how similar data points are using a Gaussian distribution.
2.
Low-dimensional mapping: Computes probabilities qij in 2D using a Student t-distribution.

3 Methodology

3.1 Data Preprocessing

Accurate and robust data preprocessing is critical to ensure that the models learn meaningful patterns rather than artifacts of noise or absence of data. The following steps were applied to transform the raw transaction records into a fully numeric dataset with consistent scaling and minimal missing values.

3.1.1 Cleaning and Imputation

After loading the Parquet transaction file, columns with more than 30% missing entries were discarded to avoid distorting model training. For municipality-level attributes such as population, population change rates, and migration fractions, gaps were forward- and back-filled within each Municipality code group (see A.1), since within each respective municipality, municipality-level features should be identical. Highly skewed numeric features, identified by a maximum-to-minimum ratio above 10 and strictly positive values, were log-transformed and the originals were dropped.

3.1.2 Categorical Encoding and Vocabulary Extraction

Object-dtype columns were first converted to UTF-8 text and nulls replaced with the literal category “Unknown.” Each was then converted to a pandas.Categorical type [29], allowing LightGBM to treat them natively as categorical features. Simultaneously, integer codes for each category level were extracted into new coded columns, which serve as inputs to the neural network’s embedding layers. The code also records each category’s vocabulary size, ensuring that each embedding matrix is sized precisely to its feature’s cardinality.

3.1.3 Proportion Clipping and Cyclical Date Features

A set of fraction-type variables was clipped to the [0, 1] interval to enforce valid bounds. To allow models to learn smooth seasonal effects, the month of sale was encoded as two cyclical features:

SaleMonthSin = sin(2π · SaleMonthOfYear / 12),
SaleMonthCos = cos(2π · SaleMonthOfYear / 12).
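The two cyclical features can be computed directly with numpy; the following sketch also checks that the encoding behaves as intended:

```python
import numpy as np

month = np.arange(1, 13)  # SaleMonthOfYear: 1..12
sale_month_sin = np.sin(2 * np.pi * month / 12)
sale_month_cos = np.cos(2 * np.pi * month / 12)

# In (sin, cos) space, December is much closer to January than to June:
points = np.column_stack([sale_month_sin, sale_month_cos])
d_dec_jan = np.linalg.norm(points[11] - points[0])  # December vs January
d_dec_jun = np.linalg.norm(points[11] - points[5])  # December vs June
```

With a raw month number, December (12) and January (1) would be 11 units apart; in the cyclical encoding `d_dec_jan` is small while `d_dec_jun` is the diameter of the circle.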
This representation ensures December and January are adjacent in feature space, instead of being interpreted as far apart (months 1 and 12).

3.1.4 Feature Selection and Scaling

After dropping raw identifiers and geometry columns, the remaining numeric features (original, log-transformed, and cyclical) were split into FEATURES for continuous inputs and CAT_CODE_COLS for the integer codes of each categorical feature. The continuous features were standardized to zero mean and unit variance using Scikit-Learn’s StandardScaler [30] fitted on the training split, then applied unchanged to the development and test splits, in order to avoid any leakage of test data into training. Categorical columns retained their category dtype for LightGBM, while the corresponding coded columns were fed into the neural network’s embedding layers. After these preprocessing steps, the dataset consists exclusively of:

• Log-transformed and z-score standardized continuous features (FEATURES)
• Integer-coded categorical features (CAT_CODE_COLS) with known vocabulary sizes
• Validated proportion and cyclical date features.

This scaled representation is favorable for the ANN’s embedding layers and LightGBM’s native categorical handling.

3.2 Model Development

In Fig. 3.1 a simple flowchart of the hybrid model is displayed. The following subsections detail the architecture and implementation of each model, as well as the techniques used to improve their efficiency and effectiveness.

Figure 3.1: A simple diagram visualizing the steps of the hybrid model. Source: Primary.

3.2.1 Training, Validation, and Test Split

The data is partitioned in three stages to ensure a strict temporal hold-out for final evaluation and a separate development set for model selection and early stopping.
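The temporal hold-out can be sketched with pandas; the column name `SaleYear` and the toy rows below are hypothetical, but the year boundaries follow the split used in this thesis:

```python
import pandas as pd

# Toy transactions; the SaleYear column name is hypothetical.
df = pd.DataFrame({
    "SaleYear": [2015, 2018, 2020, 2021, 2021, 2022, 2022],
    "Price":    [1.1e6, 1.5e6, 2.0e6, 2.2e6, 1.8e6, 2.5e6, 2.1e6],
})

train = df[df["SaleYear"] <= 2020]   # 2015-2020: training
val   = df[df["SaleYear"] == 2021]   # 2021: model selection / early stopping
test  = df[df["SaleYear"] == 2022]   # 2022: untouched until final evaluation
```

Filtering on the sale year rather than sampling at random is what prevents future transactions from leaking into the training set.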
The dataset is split chronologically, simulating a real-life scenario: the training set consists of transactions from 2015–2020, the validation set consists of transactions from 2021, and the test set consists of purchases that occurred during 2022. This ensures that no data from 2022 are used in any training or validation step and that the final test set remains completely unseen until the very end.

3.2.2 Artificial Neural Network with Embeddings

One of the core parts of the hybrid model is a multi-task ANN that incorporates learned entity embeddings for categorical features. By mapping each category into a trainable dense vector, the model captures intrinsic similarities among categorical values, avoiding sparse one-hot encodings. The network processes numeric and embedded categorical inputs jointly, feeding them through a deep feed-forward architecture with residual connections. This design allows the ANN to learn a rich embedding of each data point, which is used both for a continuous value prediction and for a class prediction, where the classification head teaches the model broader market segments (see 3.2.2.3). The multi-task setup aims to enrich the shared representation by learning from both regression and classification targets simultaneously, improving generalization.

3.2.2.1 Input and Embedding Layers

The input features x consist of standardized continuous variables and categorical variables. Continuous features are input directly, while each categorical feature is handled via a dedicated embedding layer. Specifically, for each categorical field c with Vc unique values, an embedding matrix Ec ∈ R^{Vc×d} is included, with a small dimension d (e.g., d = 8) tuned for the task. These are implemented as a ModuleDict of embedding layers in PyTorch [31].
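A minimal sketch of such a ModuleDict of embedding layers; the field names, vocabulary sizes, batch size and number of numeric features are illustrative, not taken from the thesis configuration:

```python
import torch
import torch.nn as nn

class EmbeddingInput(nn.Module):
    """One nn.Embedding per categorical field; the embedding outputs are
    concatenated with the standardized numeric features."""
    def __init__(self, vocab_sizes, d=8):
        super().__init__()
        self.embeddings = nn.ModuleDict(
            {name: nn.Embedding(v, d) for name, v in vocab_sizes.items()}
        )

    def forward(self, x_num, x_cat):
        # x_cat maps each field name to a tensor of integer category codes
        embs = [self.embeddings[name](codes) for name, codes in x_cat.items()]
        return torch.cat([x_num] + embs, dim=1)

layer = EmbeddingInput({"Municipality": 6, "HouseType": 4})
out = layer(
    torch.randn(5, 3),  # 3 numeric features for a batch of 5 properties
    {"Municipality": torch.tensor([0, 1, 2, 3, 4]),
     "HouseType": torch.tensor([0, 1, 2, 3, 0])},
)
# out.shape == (5, 3 + 8 + 8)
```

Each embedding matrix is sized to its feature's recorded vocabulary, mirroring the vocabulary extraction step of Section 3.1.2.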
During a forward pass, each categorical input is transformed into its d-dimensional embedding vector, and all embeddings are concatenated with the numeric features to form the combined input. By learning embeddings, the model can place similar category values close together in the vector space, reflecting their inherent similarities, such as being in the same region or a similar area. This approach not only reduces dimensionality compared to one-hot encoding, but can also reveal meaningful relationships between categories. The resulting input vector (continuous features + all embedding outputs) is then passed into the first hidden layer of the network.

3.2.2.2 Residual Stack and Embedding Head

After the input layer, the network feeds forward through a stack of fully connected layers with residual connections inspired by ResNet architectures [32]. The first layer expands the concatenated input to a high-dimensional hidden state of 512 neurons with batch normalization and ReLU activation. Then, several residual blocks follow: each block is a two-layer MLP that learns an increment ∆h and adds it to the block’s input via a skip connection. Formally, if a block’s input is h_in, it produces

h_out = ReLU(W2(Dropout(ReLU(W1 h_in)))) + h_in,

with a linear projection on h_in if dimensions differ. The main reasons for using a residual network rather than a purely feed-forward network were:

1. Stabilizing gradient flow: In a deep network, gradients can "vanish" or "explode" during backpropagation. By introducing a skip connection that adds each block’s input directly to its output, the network effectively learns only the residual increment ∆h instead of a full transformation. This identity-mapping shortcut allows gradients to propagate freely from deeper layers all the way to the input.

2. Non-linear interactions: House values depend on highly non-linear interactions among features. A deeper architecture can, in principle, capture hierarchical feature interactions.
Residual blocks allow the user to train a deeper stack (512 → 256 → 128 neurons over two blocks) without suffering from “degradation”, where additional layers actually hurt performance. Each block only needs to learn an additive correction on top of its input, so the network can gradually improve representations instead of forcing a large transformation in one go.

3. Faster convergence and reduced overfitting: ResNet-style blocks generally converge faster than plain MLPs of the same depth. This meant requiring fewer epochs and less aggressive regularization. A 20% dropout was used inside each residual block, and batch normalization was applied after each ReLU to keep embedding magnitudes consistent. This combination reduced overfitting on the training set, leading to a more robust embedding (128-D) for downstream stacking.

After the final residual block, an embedding head was applied, a linear layer that compresses the last hidden activations into a 128-dimensional embedding vector e. This e is a compact representation of the input property, integrating signals from all numeric and categorical features. It serves as the input to the subsequent output prediction heads, and also as a learned feature for the hybrid model. The use of a lower-dimensional embedding bottleneck (128-D) encourages the network to distill the informative features of the data point, which was used later in the hybrid stacking approach.

3.2.2.3 Multi-Task Output Heads

From the shared embedding vector e, the ANN branches into three output heads: one for regression, one for classification and one for structuring the embedding space. The regression head predicts the log-adjusted price as a single scalar output. It consists of a small fully connected sub-network: a dropout layer followed by a dense layer (128 → 32 with ReLU) and a final linear layer to output ŷ (log-adjusted sale price) as a single continuous value.
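The residual block of Section 3.2.2.2 and the regression head just described can be sketched in PyTorch as follows. Layer sizes and the 20% dropout follow the text; for brevity the sketch omits the batch normalization mentioned above, so it is a simplified version, not the exact thesis architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h_out = ReLU(W2(Dropout(ReLU(W1 h_in)))) + h_in,
    with a linear projection on the skip path if dimensions differ."""
    def __init__(self, d_in, d_out, p_drop=0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(d_out, d_out),
        )
        self.proj = nn.Linear(d_in, d_out) if d_in != d_out else nn.Identity()

    def forward(self, h):
        return torch.relu(self.body(h)) + self.proj(h)

# 512 -> 256 -> 128 over two blocks, then the regression head (128 -> 32 -> 1).
stack = nn.Sequential(ResidualBlock(512, 256), ResidualBlock(256, 128))
reg_head = nn.Sequential(nn.Dropout(0.2), nn.Linear(128, 32), nn.ReLU(),
                         nn.Linear(32, 1))

e = stack(torch.randn(4, 512))  # 128-D embedding for a batch of 4
y_hat = reg_head(e)             # predicted log-adjusted price
```

Note how the intermediate tensor `e` is exposed explicitly: it is this 128-D embedding that later serves as input to the other heads and to the LightGBM stage.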
In parallel, the classification head predicts a discrete price category indicating the relative price level of the property. Five ordinal buckets are defined by partitioning the training-set prices into quintiles. The classification head is similarly a dropout plus dense layers ending in a 5-logit output p̂ = (p̂1, . . . , p̂5); these correspond to the model's confidence that the object's price falls into each quintile range. A focal loss was used for this classification task to mitigate class imbalance, focusing the training on under-represented price ranges. The multi-task design provides an auxiliary learning signal: the classification objective of distinguishing price ranges guides the network to learn features that segment properties by value, complementing the exact regression objective. Both heads share the same underlying embedding e, so the gradients from the regression and classification tasks jointly update the preceding layers.

3.2.2.4 Composite Loss Function

The network is trained to minimize a composite loss

Ltotal = Lreg + (1 − α) wc Lcls + wt Ltriplet,

where Lreg is the P10-aware regression loss, Lcls the focal classification loss, and Ltriplet the embedding triplet loss, and
• α ∈ [0, 1] is linearly annealed from αstart to αend over training epochs.
• wc and wt are fixed weights for the classification and triplet losses, respectively.

3.2.2.4.1 P10-Aware Regression Loss Lreg. Let ŷ and y be the network's prediction and ground truth for the standardized log-price:

y = (log(price) − µt) / σt,

where µt and σt are the mean and standard deviation of log-prices in the training set. To recover original-scale prices, the inverse transformation is applied:

p̂ = exp(ŷ σt + µt), t = exp(y σt + µt),

where p̂ is the predicted sale price and t is the true sale price. The regression loss consists of two components:

• A Huber loss on the standardized log-price:

δHuber(ŷ, y) = ½ (ŷ − y)² if |ŷ − y| ≤ δ, and δ (|ŷ − y| − ½ δ) otherwise.

• A soft P10 loss, which softly penalizes predictions that deviate more than 10% from the true price. It is defined using a sigmoid function:

P10soft = 1 − (1/B) Σᵢ₌₁ᴮ σ( k (0.10 − |p̂i − ti| / ti) ),

where σ is the sigmoid function and k is a steepness constant. The final composite loss is a convex combination of these two objectives:

Lreg = (1 − α) δHuber(ŷ, y) + α P10soft,

where α ∈ [0, 1] controls the trade-off between squared-log error and P10-aware supervision. An illustration of how the composite loss penalizes errors at different α levels is shown in Fig. 3.3.

Figure 3.3: An illustration of how the composite loss penalizes wrong predictions as α increases. Source: Primary.

3.2.2.4.3 Focal Classification Loss Lcls. The classification head outputs logits for nb price-quantile buckets. After softmax, let pi,c be the predicted probability for the true bucket c of sample i. The focal loss with focusing parameter γ is defined by:

Lcls = −(1/B) Σᵢ₌₁ᴮ (1 − pi,c)^γ log(pi,c).

In Fig. 3.4 the focal loss is visualized for different values of the focusing parameter γ.
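A minimal NumPy sketch of the regression and classification loss terms above (batch inputs; δ, k, γ, and α are free parameters with illustrative defaults; function names are ours, not from the thesis code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def huber(y_hat, y, delta=1.0):
    """Huber loss on the standardized log-price, elementwise."""
    r = np.abs(y_hat - y)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def p10_soft(p_hat, t, k=50.0):
    """Soft share of predictions outside +/-10% of the true price (lower is better)."""
    return 1.0 - np.mean(sigmoid(k * (0.10 - np.abs(p_hat - t) / t)))

def l_reg(y_hat, y, p_hat, t, alpha):
    """Convex combination of Huber (log scale) and soft P10 (original scale)."""
    return (1 - alpha) * np.mean(huber(y_hat, y)) + alpha * p10_soft(p_hat, t)

def focal_loss(p_true_class, gamma=2.0):
    """Focal loss given each sample's predicted probability of its true bucket."""
    return -np.mean((1 - p_true_class) ** gamma * np.log(p_true_class))
```

The focal term shows the intended behavior directly: a confidently correct prediction (probability 0.9) contributes far less loss than an uncertain one (probability 0.5), shifting training effort to the hard, under-represented price ranges.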
Figure 3.4: Illustration of focal loss. As γ increases, well-classified examples (high p) are down-weighted, i.e., their loss goes to zero faster, which helps focus the training on harder examples (low p). The difference might look small but is quite tangible in practice. Source: Primary.

3.2.2.4.4 Triplet Embedding Loss Ltriplet. To encourage the network to learn an embedding space in which similarly priced properties lie close together and dissimilar properties are pushed apart, a triplet-based loss was implemented in addition to the regression and classification heads.

• Price Quantile Buckets. Before training, all sale prices in the training set are sorted and partitioned into five equal-sized buckets (quintiles).
• Sampling Anchors, Positives, and Negatives. In each mini-batch, anchors are sampled uniformly at random. For an anchor a with sale price in quantile bucket ba, the model chooses:
– A positive example p from the same bucket ba, i.e. a property whose sale price falls into the same quintile as a.
– A negative example n from a different bucket bn, such that |ba − bn| ≥ 1. In practice, negatives are drawn uniformly from all buckets at least one quantile away, ensuring a clear price separation.

Given embeddings ea, ep, and en, a standard margin-based triplet loss is used:

Ltriplet = (1/T) Σ₍a,p,n₎ max{0, ‖ea − ep‖² − ‖ea − en‖² + m}.

In Fig. 3.5 the triplet embedding loss is visualized with margin m = 0.2, as a function of the distance difference δ = ‖ea − ep‖² − ‖ea − en‖². This makes clear how the margin parameter creates a zero-loss region and then penalizes violations linearly.

Figure 3.5: Triplet embedding loss. For δ ≤ −m, the negative sample is at least the margin farther away than the positive, giving zero loss. For δ > −m, the loss grows linearly with δ + m. Source: Primary.

3.2.2.5 Optimization and Regularization

The ANN model was trained using the AdamW optimizer (Adam with decoupled weight decay) for efficient stochastic gradient descent.
Key training hyperparameters such as the learning rate, weight decay, dropout probability, and the loss weight coefficients (wc, wt, and the α schedule) were tuned using the Optuna hyperparameter optimization framework [33]. In particular, Optuna's TPE sampler explored ranges for the initial learning rate, the L2 weight decay penalty, the embedding dimensionality for categories, and the starting/ending values of α (which define how quickly the P10 term ramps up). Adopting Optuna [33] allowed for an efficient search for a well-performing configuration. The final chosen parameters (e.g. learning rate ≈ 2 × 10−4, weight decay ≈ 1 × 10−2, dropout ≈ 0.20) reflect the best trade-offs found. To train effectively, PyTorch's One-Cycle Learning Rate (OneCycleLR) [34] schedule was also applied, which adjusts the learning rate from a low value up to a peak and back down to a low value within one training run. This method, introduced by Smith [35] for "super-convergence", allows the model to briefly use a relatively high learning rate and often leads to faster convergence and better generalization.

Batch normalization was applied in each layer to stabilize learning, and dropout in the hidden layers and output heads reduced overfitting by randomly deactivating neurons during training. The trade-off parameter α, which controls the balance between Huber loss and soft P10 supervision, was scheduled to increase linearly over training: α started at approximately 0.19 and increased to 0.63 by the final epoch. This gradually shifted emphasis from minimizing squared error on log-price to optimizing the soft P10 metric on original-scale prices. This schedule gave the model time to learn an accurate overall fit before focusing too much on the stricter P10 criterion.
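The linear α schedule can be written out directly (endpoint values taken from the text; the function name is ours):

```python
def alpha_schedule(epoch, n_epochs, alpha_start=0.19, alpha_end=0.63):
    """Linearly anneal the Huber / soft-P10 trade-off over training.

    Returns alpha_start at the first epoch and alpha_end at the last, so the
    regression loss (1 - alpha) * huber + alpha * p10_soft gradually shifts
    emphasis from log-price accuracy to the P10 criterion.
    """
    frac = epoch / max(n_epochs - 1, 1)  # 0 at the first epoch, 1 at the last
    return alpha_start + frac * (alpha_end - alpha_start)
```

Because the classification weight is scaled by (1 − α), the same schedule simultaneously fades the classification task out as training progresses.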
Simultaneously, the classification loss weight wc was effectively scaled by (1 − α), so that as α grew, the classification task was gradually down-weighted towards zero by the end of training. This ensured that in later epochs the model concentrated on P10 and embedding structure, having already benefited from the classification signal early on. Early stopping was applied on validation MAPE as well as on increasing validation error, and the model with the lowest validation MAPE was saved for final evaluation. PyTorch's ReduceLROnPlateau scheduler [36] was also used as a fallback: if progress stagnated, the learning rate would be halved after 5 epochs without MAPE improvement. With OneCycleLR in effect, however, this was rarely needed until the very end of training.

3.2.3 LightGBM Ensemble with Raw Features and ANN Embeddings

In the hybrid valuation framework, two LightGBM regressors are employed in a stacking configuration. LightGBM is known for its efficiency and accuracy, training faster than traditional GBMs while maintaining similar accuracy, which makes it suitable for the large feature set. The two-stage ensemble is outlined in the following subsections.

3.2.3.1 Stage 1: Raw-Feature GBM

In Stage 1, the LightGBM regressor is trained on the same scaled continuous features as the ANN, but the original categorical columns are kept as pandas Categorical dtype [29], so that LightGBM can handle splits on them natively while predicting the log-adjusted sale price. The Stage 1 LightGBM was tuned with Optuna as well, optimizing hyperparameters such as the number of leaves, learning rate, feature fraction, and regularization terms. The objective was standard regression, with MAPE as the evaluation metric. The model was trained with early stopping on a validation set to determine the optimal number of boosting rounds. The Stage 1 model learns a baseline mapping from raw inputs to price.
For instance, it can directly learn effects like "houses in region X are more expensive" or "larger living area increases price" by leveraging decision-tree splits. After training, the Stage 1 predictions are produced; denote by ŷ0(i) the Stage 1 predicted log-price for sample i.

3.2.3.2 Stage 2: Embedding-Based Residual GBM

For Stage 2, a second LightGBM model is trained to predict the residual r(i) = y(i) − ŷ0(i) using the ANN's learned embedding as input. Essentially, Stage 2 learns to predict what Stage 1 missed, but only using the information encoded in the embeddings e. The Stage 2 LightGBM also uses a set of Optuna-tuned parameters. It trains on the pairs (e(i), r(i)), again with a regression objective evaluated on MAPE. Because the range of residuals is smaller than that of the original target, this stage can focus on finer details. For example, the ANN embedding might encode subtle interactions which the model can pick up by splitting on dimensions of e. Given all the complex signals the ANN captured, the Stage 2 model tries to find the remaining price adjustment that needs to be added to the Stage 1 prediction. Typically, Stage 2 required fewer trees than Stage 1, as the residual signal is weaker than the original. After training, the model outputs a residual correction ŷ1(i) for each input embedding. This model effectively boosts the performance of the ensemble by adding back the nonlinear, interaction-driven effects that a single GBM could not easily find from raw features alone.

3.2.3.3 Combined Prediction and Performance

The final prediction for a given property is the sum of the Stage 1 and Stage 2 outputs:

ŷfinal = ŷ0 + ŷ1,

where ŷ0 is the Stage 1 GBM's prediction from raw features, and ŷ1 is the Stage 2 GBM's predicted residual from the ANN embedding. The two terms together give the full predicted log-price, which is then exponentiated to obtain the predicted sale price in SEK.
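The two-stage logic can be sketched with simple least-squares models standing in for the two tuned LightGBM regressors (synthetic data; purely illustrative, not the thesis implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_raw = rng.normal(size=(n, 3))   # raw features (Stage 1 input)
E = rng.normal(size=(n, 4))       # ANN embeddings (Stage 2 input)
# Synthetic log-price carrying signal in both the raw features and embeddings.
y = X_raw @ [0.5, -0.2, 0.1] + E @ [0.3, 0.0, -0.1, 0.2] + 0.01 * rng.normal(size=n)

def fit_linear(X, y):
    """Least-squares stand-in for a boosted-tree regressor."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Stage 1: baseline mapping from raw features to log-price.
w0 = fit_linear(X_raw, y)
y0 = X_raw @ w0

# Stage 2: fit the residual r = y - y0 from the embeddings only.
w1 = fit_linear(E, y - y0)
y1 = E @ w1

# Combined prediction: Stage 1 baseline plus Stage 2 residual correction.
y_final = y0 + y1
```

Because the embeddings carry signal that the raw-feature stage cannot reach, the stacked prediction y_final fits y more tightly than y0 alone, which is exactly the mechanism the two-stage ensemble relies on.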
3.3 Benchmark Models

Two simple baseline models were constructed to benchmark the proposed hybrid ANN approach: a classical hedonic regression and a straightforward kNN regression. These models serve as interpretable, traditional baselines for comparison.

3.3.1 Hedonic Regression Baseline

The implementation follows a standard procedure for training and evaluating a hedonic regression model, using the same log-adjusted sale price as the target variable. The linear regression model is fitted on the training set with the predictors and the log-transformed sale prices. Predictions for the test set are generated in log-price space and subsequently exponentiated to return to the original price scale. Model performance is then assessed on the natural price scale using the same evaluation metrics as for the other models.

3.3.2 KNN

The kNN regression model predicts property prices by averaging the prices of the nearest training examples in feature space. In this implementation, each property's five most similar neighbors are identified using standard Euclidean distance over the identical feature set. Predictions are generated with uniform weighting, meaning each neighbor contributes equally. As for the other models, the kNN regression was applied to the log-transformed price target. Fig. 3.6 illustrates a simple example of this method: the price of a new house of 180 m2 (indicated by the vertical dashed line) is predicted by averaging the prices of its 5 closest neighbors (marked in orange) within the training dataset.

Figure 3.6: Illustration of kNN regression. A new house (vertical dashed line at 180 m2) is valued by averaging prices of its 5 nearest neighbors (orange points) among a sample of training homes. Source: Primary.

3.3.3 Baseline Model Configurations

Table 3.1 summarizes the key settings of each baseline model.
The hedonic regression has no adjustable hyperparameters, while the kNN model's main parameter is K (the number of neighbors).

Table 3.1: Overview of baseline models and their configurations

Model              | Target Variable         | Key Settings
Hedonic Regression | log-adjusted sale price | OLS linear regression on structural & locational features; no hyperparameter tuning.
KNN Regression     | log-adjusted sale price | k-nearest neighbors (k = 5), Euclidean distance, uniform weighting.

The comparative evaluation was then carried out on an identical held-out test dataset for all models. Using the same test set for each model ensures a fair, direct comparison of predictive accuracy; no model gains an advantage from different data splits. The same set of error metrics was applied to each model's predictions. In this way, the hybrid model is benchmarked against both conventional methods.

4 Summary of Findings

4.1 Comparative Evaluation

Table 4.1 summarizes the test-set performance of each modeling approach. The hybrid model emerges as the top performer across every metric. Although its improvements over the raw-only GBM might seem modest, they are consistent and meaningful in a valuation context.

Table 4.1: Test Set Performance Comparison of All Models

Model              | MAPE (%) | MAE (SEK) | RMSE (SEK) | R2    | P10 (%) | P20 (%)
Hybrid Model       | 15.9     | 431 926   | 624 460    | 0.814 | 41.4    | 73.6
Raw features LGBM  | 17.1     | 464 525   | 665 020    | 0.798 | 38.6    | 68.0
Embeddings LGBM    | 19.3     | 481 560   | 739 780    | 0.751 | 37.3    | 63.6
Neural Network     | 18.6     | 481 290   | 737 220    | 0.752 | 38.9    | 65.1
KNN                | 23.6     | 619 170   | 963 270    | 0.577 | 28.6    | 53.0
Hedonic Regression | 22.8     | 552 140   | 799 780    | 0.709 | 30.0    | 56.0

The hybrid model reduces MAPE by over 1 percentage point (pp) relative to the raw-only GBM (15.9% vs. 17.1%), translating into an average error reduction of roughly 33,000 SEK, so more appraisals fall closer to their true value. The RMSE also decreases by roughly 41,000 SEK. The hybrid model scores an R2 of 0.814, meaning it explains 81.4% of the variance in sale price on the test set.
Turning to coverage metrics, the hybrid's P10 of 41.4% signifies that roughly four out of ten valuations lie within ±10% of the sale price, compared to only 38.6% for the raw GBM. Similarly, P20 improves by 5.6 pp. These gains reflect a clear tightening of the error distribution, which can translate to stronger confidence intervals in practice.

The embeddings-only GBM and the standalone neural network both underperform the raw GBM, confirming that while learned embeddings excel at capturing complex, nonlinear feature interactions, they do not substitute for the breadth of information contained in the original variables. Embeddings distill higher-order patterns, but require the raw features to ground those patterns in measurable property attributes.

By contrast, the kNN and hedonic regression baselines underperform substantially on every metric. Hedonic regression, relying on linear relationships and pre-specified interaction terms, struggles to accommodate the irregular, multimodal distributions of property characteristics outside major urban centers without careful and thorough pre-processing. Likewise, kNN depends on finding truly comparable sales in the training set; in thin markets or highly heterogeneous rural regions, suitable comparables may be sparse or distant in feature space, leading to noisy, unstable estimates.

This sharp underperformance of classical methods underscores the difficulty of automated property appraisal in diverse, data-sparse contexts. Real estate markets outside metropolitan areas exhibit wide variability in lot sizes, building styles, renovation levels, and locational premiums that violates the smoothness and homogeneity assumptions of simple regression or nearest-neighbor approaches. In such settings, hybrid models that combine global pattern-learning and local context provide the flexibility and robustness needed to attain practical accuracy.
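The error metrics reported throughout (MAPE, and the P10/P20 coverage shares) can be computed as in the following sketch (function names are ours, not from the thesis code):

```python
import numpy as np

def mape(pred, true):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(pred - true) / true)

def within_pct(pred, true, pct):
    """Share of predictions within +/-pct% of the true price, in percent (P10, P20, ...)."""
    return 100.0 * np.mean(np.abs(pred - true) / true <= pct / 100.0)

# Illustrative data: three sales with relative errors of 5%, 15%, and 2.5%.
true = np.array([1_000_000.0, 2_000_000.0, 4_000_000.0])
pred = np.array([1_050_000.0, 2_300_000.0, 3_900_000.0])

print(mape(pred, true))            # mean of 5%, 15%, 2.5% -> approx. 7.5
print(within_pct(pred, true, 10))  # 2 of 3 predictions within +/-10%
print(within_pct(pred, true, 20))  # all 3 predictions within +/-20%
```

Note that P10 and P20 are nested by construction: every prediction counted by P10 is also counted by P20, which is why P20 always dominates P10 in Table 4.1.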
4.1.1 Model Performance by Price Decile

To understand how valuation accuracy varies across the price spectrum, test-set properties were grouped into ten equally sized deciles by true sale price. For each decile, Table 4.2 reports the average true price, the model's mean prediction, and key error metrics: MAE, MAPE, P10 and P20.

Table 4.2: Hybrid Model Performance by True Price Decile

Decile | True Mean | Pred Mean | MAE     | MAPE   | P10   | P20
0      | 1,326,602 | 1,237,231 | 203,372 | 13.85% | 43.5% | 74.8%
1      | 1,461,995 | 1,428,079 | 274,347 | 18.36% | 30.5% | 58.3%
2      | 1,681,692 | 1,628,669 | 318,380 | 18.87% | 32.4% | 62.0%
3      | 1,903,869 | 1,839,429 | 365,234 | 19.41% | 32.1% | 60.5%
4      | 2,155,763 | 2,093,139 | 407,223 | 19.86% | 34.0% | 61.0%
5      | 2,480,797 | 2,399,840 | 435,851 | 18.32% | 39.3% | 65.5%
6      | 2,853,116 | 2,771,432 | 484,960 | 18.12% | 39.8% | 69.3%
7      | 3,321,370 | 3,249,310 | 468,718 | 14.94% | 47.8% | 76.8%
8      | 3,994,205 | 3,867,537 | 519,312 | 13.14% | 49.5% | 79.9%
9      | 5,597,320 | 5,315,956 | 839,685 | 14.72% | 45.5% | 75.6%

Takeaways:
• Low-to-mid-market challenge (deciles 1–4): MAE and MAPE peak in the second through fifth deciles, and P10 dips to its lowest. These low-to-mid-range properties exhibit the greatest disparities in features, making precise valuation more difficult.
• Improved accuracy at the extremes: Both the lowest decile (0) and the top three deciles (7–9) show stronger P10 and lower MAPE. In the cheapest segment, homes are more homogeneous, while in the mid-to-high-value tiers the model benefits from objects that resemble mid-value properties with incremental differences in object-specific features. For the very most expensive properties (≈ 12M+ SEK, found in decile 9), however, the model struggles to make correct valuations, mostly due to the small number of such properties in the training set.
• P20 stability: The P20 metric remains above 58% across all deciles, peaking at nearly 80% for decile 8.
This indicates that even when ±10% accuracy is challenging, the model still generally stays within ±20% of the sale price.
• Systematic underestimation: Across all deciles, the model underestimates value. The predicted mean in decile 9 (5.32 M SEK) is below the true mean (5.60 M SEK), in line with the MAPE and MAE increases, suggesting a modest bias that could be addressed by targeted calibration in the highest price bracket.

4.2 Error analysis

This section quantifies how well the hybrid valuation model performs and where it fails, starting with aggregate views, histograms, and scatter plots that reveal the overall spread and systematic biases of its residuals, and then drilling down to illustrative case studies that expose the specific transactions driving the largest misestimations.

4.2.1 Error Distribution

Fig. 4.1 shows the distributions of absolute and relative errors for the hybrid model on the test set. The absolute-error histogram is tightly concentrated: over 50% of predictions fall within ±400 000 SEK, and only 5% exceed 1 000 000 SEK. The relative-error plot confirms that more than 40% of predictions lie within ±10% of the actual price (P10), and roughly 75% within ±20% (P20). This error profile indicates that the hybrid model delivers both small typical errors and a compressed tail of large misvaluations, which is critical for reducing risk in automated appraisal.

Figure 4.1: Absolute error distribution (left) and relative error distribution (right) for the hybrid model.

The scatter plot in Figure 4.2 compares true sale prices against model predictions. Most points lie close to the 45° line, demonstrating a relatively accurate fit across price ranges. A heavy under-prediction bias appears as prices increase, mostly due to the lack of expensive objects in the dataset. Overall, the scatter confirms that the hybrid stack generalizes quite well and maintains linearity between predicted and actual values.
Figure 4.2: Scatter plot of true (x-axis) versus predicted (y-axis) prices; perfectly accurate predictions would fall on the red dotted line.

4.2.2 Case Studies: Best and Worst Predictions

Table 4.3 lists the five best and a selection of the worst predictions in terms of absolute error. The best-predicted properties tend to be around mid-market, but as was evident in Fig. 4.2, the model also makes good estimates in higher price segments. Conversely, the worst estimates almost exclusively fall in the most expensive range of properties in the test set. This holds for almost all of the 50 worst predictions as well, with two exceptions: one object, whose adjusted sale price was 4,313,017 SEK, was valued at 9,956,976 SEK by the model, more than double the actual price, and another, priced at just 1,548,869 SEK, was overestimated at 5,222,246 SEK.

Table 4.3: The best and worst predictions made by the model in absolute terms. Sale prices are adjusted to 2020-06, hence the uneven price figures.

Best Estimates by the Hybrid Model (SEK)
#  | Adjusted sale price | Predicted sale price | Absolute error
1  | 1,466,269           | 1,466,336            | 67
2  | 3,248,320           | 3,248,389            | 69
3  | 1,881,336           | 1,881,457            | 121
4  | 2,481,618           | 2,481,743            | 124
5  | 2,616,279           | 2,615,984            | 295

Worst Estimates by the Hybrid Model (SEK)
#  | Adjusted sale price | Predicted sale price | Absolute error
1  | 16,298,297          | 5,117,908            | 11,180,389
2  | 19,281,824          | 11,055,575           | 8,226,249
3  | 16,071,772          | 8,416,230            | 7,655,542
4  | 13,593,189          | 7,154,120            | 6,439,069
…  | …                   | …                    | …
7  | 4,313,017           | 9,956,976            | 5,643,959
14 | 1,548,869           | 5,222,246            | 3,673,377

4.2.3 Case Studies of Selected Transactions

It is expected that many of the worst predictions in terms of absolute error fall into the high-end market segment. As noted earlier, the dataset did not contain a sufficient share of very expensive homes for the model to learn from.
However, the two bottom entries in Table 4.3 are not part of the most expensive price segment, which warrants investigation. Table 4.4 shows a feature comparison between two instances in the dataset, one from the test set (Anomaly) and one from the training set (Similar object).

Table 4.4: Comparison of the model's 7th worst prediction (see Table 4.3) with its nearest neighbor: two nearly identical property records, one from the training set (true sale price 12,500,000 SEK) and one from the test set (true sale price 4,313,017 SEK), showing adjusted sale prices and key features (more on key features in Section 4.4) and highlighting a likely duplicate entry.

Example of a Potential Duplicate Transaction
Feature                   | Anomaly   | Similar object
Adjusted Sale Price (SEK) | 4,313,017 | 12,500,000
BuildingAge (years)       | 113       | 109
UtilityArea (m2)          | 187       | 187
Lot Area (m2)             | 548       | 548
QualityScore              | 27        | 27
Closetobeach (1/0)        | 0         | 0
DistmediumCity (m)        | 681.55    | 681.31
Distcoast (m)             | 363       | 363
strand                    | 4         | 4
Deso Class                | C         | C

When evaluating the test data, it was observed that the property with a sale price of 4,313,017 SEK was valued by the model at 9,883,873 SEK, an absolute error of 5,570,856 SEK. Examining the values for both objects in the table reveals that they are highly similar, in fact almost identical, across key features. Furthermore, an analysis of their respective longitude and latitude coordinates confirmed that the two transactions correspond to the same object. The reason the same object sold for roughly a third of what it had been sold for just four years earlier is not apparent. It could be due to an external action: the property is quite large and could have been subdivided into a two-family building, with the building data simply not updated accordingly. It is unclear how many such "identical" or misleading entries exist within the dataset, and no thorough investigation was made.
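A simple screen for such potential duplicates, grouping records by rounded coordinates and flagging large price disagreements within a group, might look like the following (field names and thresholds are hypothetical; the thesis did not implement this check):

```python
from collections import defaultdict

def flag_possible_duplicates(records, decimals=4, max_ratio=2.0):
    """Group sales by rounded (lat, lon) and flag groups whose prices disagree
    by more than `max_ratio`, i.e. likely the same object resold at an odd price."""
    groups = defaultdict(list)
    for rec in records:
        key = (round(rec["lat"], decimals), round(rec["lon"], decimals))
        groups[key].append(rec)
    flagged = []
    for grp in groups.values():
        if len(grp) > 1:
            prices = [r["price"] for r in grp]
            if max(prices) / min(prices) > max_ratio:
                flagged.append(grp)
    return flagged

# Illustrative records echoing the case above (coordinates invented).
sales = [
    {"lat": 57.70010, "lon": 11.97450, "price": 12_500_000},  # training record
    {"lat": 57.70011, "lon": 11.97452, "price": 4_313_017},   # test-set anomaly
    {"lat": 58.10000, "lon": 12.50000, "price": 2_000_000},
]
print(len(flag_possible_duplicates(sales)))  # the first two records form one flagged group
```

Such flagged groups would be candidates for manual review rather than automatic deletion, since a legitimate resale at a very different price is also possible.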
Another anomaly, i.e. a relatively cheap object with a high absolute prediction error, is shown in Table 4.5 together with some similar objects.

Table 4.5: Comparison of the model's 14th worst prediction (see Table 4.3) between the Anomaly transaction and its five nearest neighbors (NN1–NN5), showing true sale prices and key features.

Summary of Property Transactions
Feature                   | Anomaly   | NN1       | NN2       | NN3       | NN4       | NN5
Adjusted sale price (SEK) | 1,548,868 | 6,829,478 | 6,550,359 | 6,056,122 | 5,725,983 | 4,684,105
BuildingAge (years)       | 26        | 24        | 27        | 6         | 14        | 27
UtilityArea (m2)          | 199       | 168       | 145       | 229       | 220       | 147
Lot Area (m2)             | 1742      | 829       | 816       | 1795      | 2671      | 937
QualityScore              | 36        | 26        | 30        | 32        | 44        | 30
Closetobeach              | 0         | 0         | 0         | 0         | 0         | 0
DistmediumCity (km)       | 5.04      | 4.54      | 2.73      | 18.70     | 15.45     | 5.86
Distcoast (km)            | 6.66      | 0.093     | 126.40    | 0.39      | 5.3       | 39.57
strand                    | 4         | 3         | 4         | 4         | 4         | 4
Deso Class                | C         | C         | C         | C         | B         | C

The left-most data column in Table 4.5 corresponds to the Anomaly, an object priced at 1,548,868 SEK. The model predicts a sale price of ŷ = 5,185,451 SEK, whereas the actual price is only y = 1,548,868 SEK. This yields an absolute error of |ŷ − y| = 3,636,583 SEK, corresponding to a relative error of 235%. A row-wise scan of Table 4.5 reveals that the Anomaly is not easily separated from its five nearest neighbours (NN1–NN5) across the high-importance features: BuildingAge, UtilityArea, LotArea, QualityScore, the socio-economic DesoClass, and categorical indicators like strand. No standout covariate suggests this object should be treated differently by the model; in other words, the Anomaly is fully embedded in the typical feature space.

In contrast to its feature similarity, the Anomaly is dramatically dissimilar in price. While its neighbors all transacted between 4.7 and 6.8 million SEK, the Anomaly closed at just 1.55 million SEK, a discount of 65% to 77% relative to every peer. Thus, price is the only truly anomalous dimension of this transaction.
Additionally, distance-based features such as DistCoast offer little help. The Anomaly lies 6.6 km from the coast, but its five nearest neighbors span a wide range, from just 93 meters to 126 kilometers. The variation within this group weakens the predictive signal in that dimension, making it unlikely that the model can rely on it to adjust for the price outlier.

Feed-forward ANNs are well suited to learning smooth patterns in feature space. Given that the Anomaly's input vector xAnomaly closely resembles those associated with sale prices in the 5–7 million SEK range, the model naturally maps it into that price manifold. From the network's perspective, there is no statistical precedent suggesting that a home with such observable features can transact at 1.5 million SEK. It therefore extrapolates upward in a way that is rational from a data-driven standpoint.

Furthermore, an external valuation benchmark for the Anomaly was found on Booli [37]. Booli is a Swedish real estate platform offering comprehensive housing market data, now owned by SBAB Bank. It provides access to current property listings, historical sale prices, area-level trends, and market statistics across Sweden, serving home buyers, sellers, and investors seeking data-driven insights into the property market. One of Booli's core services is its automated property valuation tool. Booli's automated valuation of the Anomaly was 3,540,000 SEK, substantially higher than its realized sale price of 1,548,868 SEK but still below the price predicted by the model. This reinforces the notion that, even from the perspective of an independent, market-wide algorithm, the sale price of the Anomaly stands out as an extreme outlier [37]. In summary, the Anomaly is not a failure of the model but rather a reflection of data limitations.
While the model is trained on a rich feature set, the analysis presented here focuses on a carefully selected subset of the most important features, those that contribute most to price prediction according to feature-importance metrics. Displaying the entire feature space would obscure the interpretability of the analysis and offer limited additional insight. As shown, the Anomaly is virtually indistinguishable from its nearest neighbours across this high-importance subset, yet its price deviates dramatically. This highlights a fundamental limitation: if two homes appear nearly identical in all observable and influential aspects but sell for vastly different prices, a model, even one trained on a comprehensive feature space, cannot be expected to resolve such discrepancies. Without access to richer data or targeted algorithmic adjustments, large errors in cases like those investigated here are not only unsurprising, they are inevitable.

4.3 Embedding Analysis

In the following subsections, the model's embedding space is explored to uncover key patterns and insights.

4.3.1 Embeddings Clustering

K-means clustering (k = 5) was applied to the 128-dimensional embeddings produced by the neural network's projection layer for all training samples. The boxplots of sale prices by cluster in Fig. 4.3 reveal five distinct tiers:

Figure 4.3: Boxplots of sale-price distributions for five clusters obtained by applying K-means to the 128-dimensional neural network embeddings.

• Cluster 0: Mid-high-market segment (median ≈ 4.0 M SEK) with moderate interquartile range and a few high-price outliers.
• Cluster 1: Low-priced homes (median ≈ 1.2 M SEK) showing a tight distribution and minimal skew.
• Cluster 2: High-end homes (median ≈ 5.8 M SEK) with a pronounced long upper tail.
• Cluster 3: Mid tier (median ≈ 2.8 M SEK) exhibiting the widest overall spread and several extreme values.
• Cluster 4: Low-mid-level units (median ≈ 1.7 M SEK) with a relatively low interquartile range but some upper-end outliers.

The plot shows that the embeddings naturally partition the data into value-driven groups beyond any single raw feature.

4.3.2 t-SNE Projection of Embeddings

Figure 4.4 presents a two-dimensional t-distributed Stochastic Neighbor Embedding (t-SNE) projection of the 128-dimensional neural network embeddings for all test transactions. Each point corresponds to a single transaction and is colored according to its log sale price. A smooth gradient from low to high prices is apparent, with higher-priced properties clustering in a distinct region of the embedding space. This continuous organization demonstrates that the learned embeddings capture price information in a structured manner, mapping gradual increases in sale price onto similarly gradual transitions in embedding coordinates.

Figure 4.4: Two-dimensional t-SNE projection of the 128-dimensional ANN embeddings for each property, colored by log sale price. Points that cluster together share similar learned representations.

4.3.3 Embedding-Feature Correlation Analysis

To interpret what the learned embedding dimensions capture, the absolute Pearson correlation was computed between each embedding coordinate and all original numeric and categorical-code features. Key takeaways:

• Educational and demographic signals: Many of the most important dimensions correlate strongly with educational features, such as the fractions of residents with post-elementary and higher education. These embedding axes clearly encode neighborhood education-level statistics.
• Population and area metrics: Many embedding dimensions show high correlation with municipality population, its deviation from the mean, and its total, annual, and biannual change, indicating that latent dimensions capture market size and growth dynamics.
• Spatial proximity: Several dimensions also correlate with distance to medium-sized cities and distances to points of interest, such as golf courses, confirming that network-learned axes encode locational gradients.
• Heterogeneity across dimensions: While some dimensions focus on socioeconomic factors, others capture built-environment attributes, demonstrating that the embedding space distributes different types of signals across separate axes.
• Object-specific dimensions: Many dimensions focus almost solely on object-specific features, such as living area, lot area, and distance to beach, meaning that the network also encodes the combination of attributes that drives value.

4.4 Model Interpretability

The following subsections explore the model's prediction process by examining feature contributions and importance using various interpretability methods.

4.4.1 SHAP Analysis on Raw Features

The individual feature contributions in SHAP are visualized in Fig. 4.5 for the 20 most important features.

Figure 4.5: SHAP summary plot for the model, showing each feature's contribution to the predicted price (x-axis) and the distribution of feature values (color) across observations.

The topmost feature is LogDistMediumCity: its red points (high values) lie on the left (negative impact), and blue points (low values) on the right (positive impact). This indicates that a larger distance to a medium-sized city lowers the predicted price, whereas being closer (small distance, blue) raises it. This aligns with empirical findings that home values decline with distance from city centers [38]. The second and third features from the top, MunicipalityCodeStr and DesoArea, are categorical location codes. They exhibit nearly symmetric, grey-colored distributions around zero, meaning they adjust baseline prices up or down by municipality or district but show no clear monotonic trend.
In other words, different municipalities or DeSO areas simply shift the model output without a single direction of effect, since they are categorical.

4.4.1.1 Property Attributes and Size/Quality Effects

The model's next most important features are structural attributes of the property. LogBuildingAge has high (red) points on the left and low (blue) on the right: older buildings