Incorporating Interior Property Images for Predicting Housing Values

Master’s thesis in Data Science and AI

Adrian Gortzak
Nedim Can Ulusoy

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2024

© Adrian Gortzak, Nedim Can Ulusoy, 2024.

Supervisor: Milad Malekipirbazari, Computer Science and Engineering
Advisor: David Magnusson, Valueguard Index Sweden AB
Examiner: Aila Särkkä, Mathematical Sciences

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Visual features as part of the comparative market analysis tool.

Abstract

The property valuation process for the real estate market is essential for predicting a fair market value. Brokers traditionally carry out this process, which includes inspecting and assessing the subject property and finding comparable sales for a comparative market analysis (CMA). An automated valuation model (AVM) can automate this process, which speeds it up but lacks some of the inputs that a manual assessment provides. AVMs have difficulty considering more subjective architectural qualities, such as beauty, stability, and utility, because these aspects are hard to quantify objectively.
New advancements in Vision Transformers (ViTs), self-supervised learning, and Contrastive Language-Image Pre-training (CLIP) have shown favourable improvements in the field of computer vision. Therefore, this study explores the potential of these new techniques for the visual feature extraction task, with the aim of enhancing AVMs using interior images. By applying ViTs for binary classification, clustering, and matching against textual descriptions, we aim to enrich the feature extraction process for a property valuation model in the region of Uppsala County, Sweden. Our findings show modest enhancements in the AVM’s performance, which align with prior studies, but also highlight that these new technologies can extract more detailed features compared to previous methods. Furthermore, they demonstrate the potential for these technologies to capture more comprehensible architectural qualities from images, which could significantly assist brokers in the valuation process.

Keywords: Computer Vision, Transformers, Feature Extraction, Machine Learning, Deep Learning, Real Estate, Automated Valuation Models, Architectural Qualities.

Acknowledgements

Firstly, we want to extend our sincere gratitude to Milad Malekipirbazari, our academic supervisor, for quickly providing suggestions and answers to our inquiries. Additionally, he suggested alternatives and supplied a solid foundation in the field of AI and ML while still being patient with us.

Secondly, we would also like to show appreciation and gratitude to Valueguard Index Sweden AB for an exciting research area, hardware access, and an educative process. Moreover, we would like to thank David Magnusson, our company supervisor, for his engagement, fast support, and industry expertise in guiding the thesis forward and resolving issues along the way.

Finally, we want to express our heartfelt gratitude to all the individuals who have contributed feedback and input throughout the thesis.
Adrian Gortzak & Nedim Can Ulusoy, Gothenburg, 2024-06-17

I want to express my heartfelt appreciation to my partner, Sandra, for her invaluable help and emotional support throughout the thesis. I am deeply grateful for her encouragement and support.

Adrian Gortzak, Gothenburg, 2024-06-17

I would like to express my heartfelt gratitude to my family for their unwavering support and encouragement throughout the entirety of my academic journey.

Nedim Can Ulusoy, Gothenburg, 2024-06-17

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Outcomes
  1.4 Structure of the Thesis
2 Theory
  2.1 Property Valuation
    2.1.1 Market Value in Real Estate
    2.1.2 Comparative Market Analysis
    2.1.3 Automated Valuation Models
  2.2 Features Impacting Property Price
    2.2.1 Architectural Quality
    2.2.2 Factors Connected to Market Value in the Location
    2.2.3 Factors Connected to Market Value in the Property
    2.2.4 Property Type Specific Factors
    2.2.5 Other Factors Connected to Price
  2.3 Images in the Sales Process
  2.4 Interior Visible Features
  2.5 Limitations and Challenges
  2.6 Review of Similar Studies
    2.6.1 Methods
    2.6.2 Previous Attempts
  2.7 Computer Vision
    2.7.1 Convolutional Neural Networks
    2.7.2 Transformers
  2.8 Gaps in the Research
  2.9 Future Directions
3 Methods
  3.1 Research Plan
  3.2 Limitations and Scope
  3.3 Technologies
  3.4 Data
  3.5 Valueguard Index Sweden AB
    3.5.1 Ethical Considerations
    3.5.2 Metadata
      3.5.2.1 Location
      3.5.2.2 Base Features and Target
  3.6 Room Types
  3.7 Labelling
  3.8 Data Pre-Processing
  3.9 Deep Neural Networks
  3.10 Visual Target Features
    3.10.1 Binary Classification
    3.10.2 Clustering
    3.10.3 Contrastive Language-Image Pre-Training
  3.11 Utilising Visual Features in the Model
  3.12 Automated Valuation Model
  3.13 Scores
4 Results
  4.1 Classification of Images
  4.2 Self-Supervised Models
  4.3 Feature Extraction
    4.3.1 Binary Classifier
    4.3.2 Clustering Features Found
    4.3.3 CLIP Features Explored
  4.4 Sales
  4.5 Automated Valuation Model
    4.5.1 Apartments
    4.5.2 Houses
    4.5.3 Feature Importance
      4.5.3.1 Apartment
      4.5.3.2 House
    4.5.4 Visual Features Importance
      4.5.4.1 Apartment
      4.5.4.2 House
5 Conclusion
  5.1 Summary of Results
  5.2 Discussion
  5.3 Contributions
  5.4 Limitations of the Study
  5.5 Practical Implications
    5.5.1 Recommendations for Future Research
    5.5.2 Conclusive Summary
Bibliography
A CLIP features
B Neural Network model architectures

List of Figures

1.1 Blueprint for the system’s process flow
2.1 Highlighting comp selection problem
2.2 Example comp (Desired)
2.3 Example comp (Undesired)
3.1 Uppsala County on OpenStreetMap [60]
3.2 Geographical areas from Statistics Sweden on OpenStreetMap
3.3 H3 Index with different resolutions on OpenStreetMap
3.4 Room types explored
3.5 Stage 1
3.6 Stage 2
3.7 An instance of labeling process
3.8 Pipeline for the AVM model
3.9 Neural Network with two-dimensional input
3.10 Neural Network with one-dimensional input
3.11 Histogram of the percentage error
3.12 Histogram of the standard point
3.13 Use of the CLIP model to generate spaciousness score [106]
4.1 Confusion matrix of interior and exterior classifications
4.2 Number of images in each room type used in the study
4.3 Pre-trained ViT model attention on a kitchen example
4.4 Our Self-supervised ViT model attention on a kitchen example
4.5 Sales of property type before and after filtering
4.6 Apartment sales in Uppsala County
4.7 House sales in Uppsala County
4.8 Top 30 features - apartment AVM [Ridge] - With base features
4.9 Top 30 features - apartment AVM [XGBoost] - With base features
4.10 Top 30 features - house AVM [Ridge] - with base features
4.11 Top 30 features - house AVM [XGBoost] - with base features
4.12 Top features - apartment AVM [Ridge] - with only cluster features
4.13 Top 30 features - apartment AVM [Ridge] - with only CLIP features
4.14 Top 30 features - apartment AVM [XGBoost] - with CLIP features
4.15 Top 30 features - apartment AVM [XGBoost] - only CLIP features
4.16 Top features - house AVM [Ridge] - only cluster features
4.17 Top 30 features - house AVM [Ridge] - only CLIP features
4.18 Top 30 features - house AVM [XGBoost] - with CLIP features
4.19 Top 30 features - house AVM [XGBoost] - only CLIP features
B.1 Neural Network model structure for apartment AVM
B.2 Neural Network model structure for house AVM
B.3 Head of the Neural Network classification model for percentage Error and Standard Points
B.4 Head of the Neural Network classification model - interior and exterior
B.5 Head of the Neural Network Classification Model - room type

List of Tables

3.1 Housing metadata
3.2 Apartment metadata
4.1 Best thresholds and F1 scores for different rooms
4.2 Area under the curve score for the room types on percentage error
4.3 Area under the curve score for each room type on standard point
4.4 Visual clustering features selected with beauty (B) and utility (U)
4.5 Examples of the CLIP features selected
4.6 Performance metrics on Ridge for apartment AVM
4.7 Paired sample t-Test results for apartment AVM with Ridge
4.8 Performance metrics on XGBoost for apartment AVM
4.9 Paired sample t-Test results for XGBoost
4.10 Performance metrics on Neural Network for apartment AVM
4.11 Paired sample t-Test results on Neural Network for apartment
4.12 Performance metrics on Ridge for house AVM
4.13 Paired sample t-Test Results for house Ridge
4.14 Performance metrics on XGBoost for house AVM
4.15 Paired sample t-Test results for house XGBoost
4.16 Performance metrics on Neural Network for house AVM
4.17 Paired sample t-Test results for house Neural Network MAPE
A.1 CLIP features - bedroom
A.2 CLIP features - bathroom
A.3 CLIP features - kitchen
A.4 CLIP features - living room
A.5 CLIP features - dining room

1 Introduction

This section introduces our research topic of visual feature extraction in real estate, outlines the research question, and provides a brief background on automated valuation models and visual feature extraction with deep neural networks.

1.1 Background

Property value assessment is an essential part of the real estate field, ensuring that both buyer and seller get a fair price for what is typically one of the most significant investments in their lifetime. The aim of property value assessment is to predict market value, which is the expected value of a property under normal conditions [1].
A broker conventionally does this with the help of a manual Comparative Market Analysis (CMA), which uses comparable sales to derive the market value [1]. While thorough, the traditional method can be time-consuming, especially when comparing architectural qualities that require an assessment of utility, stability, and beauty. Quantifying these aspects automatically poses a challenge, often requiring a manual visual comparison of the images.

To address this challenge, the real estate field has seen significant advancements in Automated Valuation Models (AVMs), which estimate a property’s market value efficiently and accurately [2]. While AVMs primarily rely on easily quantifiable data, such as the living area and the number of bedrooms, improvements in computer vision and pattern recognition have made it feasible to include unstructured data, such as images, as part of the AVMs’ input.

Previous studies have tried incorporating visual features from images [3], [4]. They have focused on exterior and interior images, using techniques such as Convolutional Neural Networks (CNNs) [5] for pattern recognition related to market value. While these studies have shown feasibility, they have also shown only modest improvements [3].

Building on this foundation, new computer vision improvements have shown promising results in quantifying aesthetics and outperforming the earlier state-of-the-art CNN models [6] with the recent development of Vision Transformers (ViT) [7]. Furthermore, self-supervised methods, such as Self-Distillation with No Labels (DINO) [8] and A Simple Framework for Contrastive Learning of Visual Representations (SimCLR) [9], have demonstrated the ability to learn robust features from images without relying on labelled data. They achieve this by utilising parts or transformed versions of the image during training of the ViT model, predicting whether the images originate from the same source.
These methods reduce the previous need for a large labelled image dataset to create robust models for feature extraction. Additionally, the arrival of Contrastive Language-Image Pre-Training (CLIP) [10], which is trained to align the outputs of an image encoder, such as a ViT, with those of a text encoder, makes scoring the images by textual input feasible without additional training or provided examples.

1.2 Problem

This thesis addresses the problem of extracting and incorporating visual features related to architectural qualities, utilising advancements in computer vision, particularly ViTs, to reduce the uncertainty in AVMs. Through methods such as classification, clustering, self-supervised learning, and CLIP, this research aims to extract more comprehensive information from interior images, surpassing the limitations of previous techniques. These improvements could potentially revolutionise both automated and manual valuation processes.

Another problem with similar studies is the impact of cultural differences and preferences when focusing on different countries. The preferences for particular architectural styles and access to material or functionalities of rooms may cause significant shifts in data distribution. For instance, due to differing market dynamics and European trends, the pattern found in the United States study [11] may not be directly applicable or effective in the Swedish housing market.

Therefore, this thesis is an exploratory study to quantify visual features from listing images using Deep Neural Networks (DNNs), such as CNNs and ViTs, to improve AVMs in Sweden. The findings aim to answer the research question:

RQ: Which visual features predict market value most significantly in Sweden?

1.3 Outcomes

This project aims to create an extraction system similar to the flowchart in Figure 1.1. It first takes images as input and proceeds to classify the type of room in these images.
The final step is to extract features from these images that are relevant to the architectural qualities, using models tailored to the specific room type identified and utilising deep learning techniques, such as CNNs and ViTs, in the extraction process. These extracted features are intended as inputs for the AVMs, allowing for comparison with AVMs lacking visual features. In addition to assessing changes in accuracy, the importance of the features will be extracted from the model and compared to show the impact of the extracted features.

Figure 1.1: Blueprint for the system’s process flow

1.4 Structure of the Thesis

The thesis begins with a theoretical summary of the field of housing valuation and previous attempts to use images to enhance the performance of AVMs. The study also explores the visual features connected to market value in Sweden, their importance, and the visual features used in similar studies in other regions. This thesis aims to highlight the most promising techniques and visual features to focus on, based on their proven impact on housing valuation, to answer our research question.

Given this deeper understanding of the Swedish housing market and previous research on visual extraction, the methods section describes the methodology used within this thesis and the pre-processing necessary to focus on different room types individually. It also describes the different automated valuation models and the visual feature extraction techniques applied.

Following the description of the methodology, the results section presents the sales and image data used. Additionally, it discusses the features identified from the different models in this study and their correlation with market value.

The thesis ends with a conclusion section, where the results from the experiments are discussed and directions for further research are recommended.

2 Theory

This section introduces the field of housing valuation, current manual and automated methods to estimate market value, and the potential use of visual data to enhance these models. Additionally, previous research utilising images and computer vision within the valuation process will be highlighted.

2.1 Property Valuation

The property valuation process holds significant importance across numerous fields. In the private sector, property purchases are among the most expensive decisions an individual makes in their lifetime [12]. Ensuring a fair price is essential for both the seller and buyer during negotiations and transactions [13]. Unlike other major purchases, such as a car, a property serves not only as a utility but also as an investment with a potential resale value that historically tends to increase over time [14], [15]. This dual nature of property as both a necessity and an investment underscores its significance as future capital appreciation becomes a key consideration [16]. Furthermore, property valuation plays a crucial role in risk assessments [17], and is utilised by banks during loan assessment [18].

2.1.1 Market Value in Real Estate

In the real estate field, market value refers to the most probable selling price in a fair and open market, without coercion or personal relations between the seller and buyer, and with enough marketing time [1]. This implies equal access to information and opportunities for all parties involved in the selling process. It also implies that the broker has enough preparation time and that the seller is patient and under no urgency to sell. Additionally, the lack of a personal relationship between the broker, the buyer, and the seller ensures that the price does not depend on such a relationship, thereby preventing potentially biased pricing. Therefore, sales can be seen as samples from a distribution whose mean is the market value, which cannot be observed directly.
The sales price is thereby used as the target, with the aim of estimating the market value, assuming that all the sales follow the market value conditions. Consequently, a slight difference between the market value estimation and the selling price is expected. However, a significant difference between the two may indicate either a bad estimation or a sale price that does not follow the conditions [1].

2.1.2 Comparative Market Analysis

In Sweden, the broker usually performs a manual appraisal of houses and apartments using a comparative market analysis (CMA) [1]. This process determines the estimated market value of the subject property by finding similar comparable property sales, referred to as comps, in an area that also shares similar characteristics. The sale date of the comps must also be close to the valuation date to be comparable due to market trends and price changes over time [1]. Alternatively, if it is necessary to use an older sale, a housing index, such as the one derived from the Hedonic Regression Model [19], can be used to track changes over larger areas while controlling for the features of the property to isolate changes over time. The index can then be used to adjust the expected market value to the valuation date, which may be the current or a past date, depending on the specific valuation requirements.

To derive the estimated market value, one can either use the average sale price per square metre of the comps or the Purchase Price Coefficient (K/T) value, which is the sale price divided by the taxation value for the comps. Depending on the chosen sub-method, this is then multiplied by the subject property’s corresponding living area or taxation value [1].

Suppose there is a significant difference between the comps and the subject property, for example, a worse condition or an addition that sets them apart. In such a situation, the broker can make additional adjustments to align the price with the predicted market value [1].
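The two CMA sub-methods described above can be illustrated with a minimal sketch. It assumes a simple ratio-based index adjustment (the price is scaled by the index level at the valuation date divided by the level at the sale date) and an unweighted average over the comps; the class, field, and function names are hypothetical and are not taken from [1].

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Comp:
    """A comparable sale. All field names are illustrative assumptions."""
    sale_price: float       # SEK
    living_area: float      # square metres
    taxation_value: float   # SEK
    index_at_sale: float    # housing index level at the sale date


def index_adjust(comp: Comp, index_at_valuation: float) -> float:
    """Move a comp's sale price to the valuation date (assumed ratio adjustment)."""
    return comp.sale_price * index_at_valuation / comp.index_at_sale


def estimate_by_price_per_sqm(comps, subject_living_area, index_at_valuation):
    """Average index-adjusted price per square metre times the subject's living area."""
    per_sqm = mean(index_adjust(c, index_at_valuation) / c.living_area for c in comps)
    return per_sqm * subject_living_area


def estimate_by_kt(comps, subject_taxation_value, index_at_valuation):
    """Average K/T value (price / taxation value) times the subject's taxation value."""
    kt = mean(index_adjust(c, index_at_valuation) / c.taxation_value for c in comps)
    return kt * subject_taxation_value


if __name__ == "__main__":
    # Made-up comps for illustration only.
    comps = [
        Comp(3_000_000, 60, 1_500_000, 100),
        Comp(4_000_000, 80, 2_000_000, 100),
    ]
    print(estimate_by_price_per_sqm(comps, subject_living_area=70, index_at_valuation=110))
    print(estimate_by_kt(comps, subject_taxation_value=1_800_000, index_at_valuation=110))
```

Any manual adjustment for condition or additions would be applied on top of these estimates and is deliberately left out of the sketch.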
Predicting the market value accurately is generally difficult [20], considering the real estate market’s very competitive nature and frequent price fluctuations [16].

2.1.3 Automated Valuation Models

The AVM is an automated way of estimating the market value without doing a deeper and more time-consuming analysis manually, especially when large numbers of properties need to be assessed. In addition, it provides an automated early indicator for potential sellers before a broker does a deeper analysis. Multiple types of models can be used to do this process autonomously.

Firstly, a straightforward AVM approach imitates the traditional manual evaluation process using a comparable approach. This approach entails a k-nearest-neighbour-like search for comps that can then be used to estimate the value of the subject property automatically [21].

Secondly, linear regression methods are also widely used, especially as a baseline for measuring improvement [22]. These methods find the linear relationships between the features and the target, assuming that the relationships are linear.

Thirdly, tree-based methods and gradient-boosted methods, such as eXtreme Gradient Boosting (XGBoost) [23], work well when the relationships are non-linear and have shown promising results on the AVM task in previous studies [2].

In recent years, the use of Neural Networks has also shown promising results, especially in combining different sources of data into one model. Notably, market leader Zillow transitioned from a multi-model approach to a larger Deep Neural Network (DNN) model, resulting in improved performance and reduced maintenance [24].

2.2 Features Impacting Property Price

Finding similar comps is crucial for a reliable price estimation. These comps should closely mirror the factors of the subject property.
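As an illustration of how a comparable-based AVM can search for comps that mirror the subject property, the following is a minimal sketch of a k-nearest-neighbour search. It assumes numeric features that are standardised before Euclidean distances are computed, and it estimates the price from the mean price per square metre of the k closest sales. The data layout and function names are hypothetical and do not describe the model in [21].

```python
import math
from statistics import mean


def fit_scaler(rows):
    """Return a function that standardises a feature row (zero mean, unit variance)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [math.sqrt(mean((x - m) ** 2 for x in col)) or 1.0
            for col, m in zip(columns, means)]

    def scale(row):
        return [(x - m) / s for x, m, s in zip(row, means, stds)]

    return scale


def knn_comp_estimate(sales, subject_features, subject_area, k=3):
    """Estimate a price from the k most similar past sales.

    sales: list of (feature_row, price_per_sqm) pairs (assumed layout).
    """
    scale = fit_scaler([features for features, _ in sales])
    target = scale(subject_features)
    # Rank past sales by Euclidean distance to the subject in scaled feature space.
    ranked = sorted(sales, key=lambda sale: math.dist(scale(sale[0]), target))
    comp_prices = [price_per_sqm for _, price_per_sqm in ranked[:k]]
    return mean(comp_prices) * subject_area


if __name__ == "__main__":
    # Made-up sales: (living area, rooms, x in km, y in km), price per square metre.
    sales = [
        ([60, 2, 0.0, 0.0], 50_000),
        ([62, 2, 0.1, 0.1], 52_000),
        ([58, 2, 0.2, 0.0], 51_000),
        ([150, 6, 10.0, 10.0], 30_000),
        ([140, 5, 9.0, 11.0], 28_000),
    ]
    print(knn_comp_estimate(sales, [61, 2, 0.1, 0.0], subject_area=61, k=3))
```

Standardising the features is a design choice made here so that no single feature, such as living area, dominates the distance; a production model would likely also restrict comps by sale date and property type.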
In the Swedish market value process [1], Fredrik Brunes categorises these factors into two main groups: those related to the property and those related to its location. Apartments and houses have additional separate features that connect to the market value, such as housing cooperatives for the apartment and land area for the house. Therefore, they are handled separately in the models and the theory [1].

2.2.1 Architectural Quality

The features connected to the property and location are scored on architectural quality, drawing from the foundational principles in the Ten Books on Architecture by Marcus Vitruvius Pollio [25]. These criteria are divided into three main parts when scoring the architecture: stability, utility, and beauty, which are commonly regarded as the "standard" in Sweden. However, due to the unclear meaning and different interpretations of the term "standard", this thesis will use a specific definition of architectural quality [1].

2.2.2 Factors Connected to Market Value in the Location

Starting with location, which is assessed similarly for both houses and apartments, the utility of the location can be evaluated by considering the proximity to positive factors. These can be the distance to marketplaces, commuting possibilities, and workplaces. Additionally, it entails factors such as a sense of safety and relaxation and the availability of places for socialisation with friends and family, such as parks or forest areas within walking distance of the property [1].

Secondly, the area’s beauty can be assessed based on the quality of the surrounding houses, streets, and parking spaces in terms of materials and details. This assessment includes factors such as the amount of daylight the area receives, whether views are long or blocked by high buildings, and the accessibility of the area, including multiple routes to access the property [1].
Finally, the stability of the location is assessed based on the materials used in the nearby houses and common areas, such as parks and streets. This includes evaluating if these areas are well maintained and stating any damages and their severity [1].

The location is assessed in multiple layers, which include micro-location, the surrounding area, and the neighbourhood’s reputation. The micro-location is the region with a direct connection to the property. Additionally, the surrounding area is the region within walking distance. Lastly, the neighbourhood’s reputation is a wider region where the general opinion is assessed. In particular, the reputation is not assessed based on architectural quality but rather as an independent value [1].

2.2.3 Factors Connected to Market Value in the Property

Unlike the nearby area, which is shared between multiple residences, a property offers a private space that the owner can customise. This gives the owner more control over the home environment and architectural qualities.

Although there are distinctions between houses and apartments, there are also several commonalities in terms of the layout, utility, and aesthetics of the rooms. This section explains the overlapping characteristics of the apartment and the house, while the differences will be addressed in the subsequent section.

Stability within the property is connected to the material and building techniques employed during its construction. This relates to the predicted maintenance requirements in the form of expected repairs and the associated cost. Typically, the foundation has longer intervals between repairs than the flooring and walls, but it comes with a comparably high repair cost at the time of repair [1], [26].

The property’s utility relates to how it can be used, and this relates to the ability to spend time with friends, cook food, maintain hygiene, and recover through sleep.
This could be in the form of a larger room, making it possible to spend time together, sound insulation that keeps noise out, or a bathroom or laundry machine within the residence [1], [26].

Within the property, beauty is related to finer details in pleasing materials and openness in combination with the balance of natural light. It also includes a balance between open and closed areas and the generality of the home, as well as the ability to use rooms for multiple purposes [1], [26]. While aesthetic preferences may vary, certain features are generally considered appealing, while others, such as damaged walls or broken details, are not.

2.2.4 Property Type Specific Factors

Given the explanation of the common characteristics, the focus now shifts to the differences. Swedish apartments are usually part of a housing cooperative. A monthly fee determined by the financial status and planned maintenance is paid to the cooperative. High fees add to the buyer’s expenses, particularly if the cooperative has high loans and may need to increase fees during times of high interest rates. Thus, understanding the cooperative’s financial situation is crucial, enabling buyers to assess future potential costs [1].

On the other hand, the house also comes with extensions, such as land, the possibility of extra buildings, and a foundation that is part of the residence. Unlike apartments, where utilities are shared responsibilities within the cooperative, houses typically place these responsibilities on the owner. This increases the potential repair cost and underscores the importance of assessing the comp’s current state during the selection process [1].

2.2.5 Other Factors Connected to Price

There are other factors that affect the subject property’s price, such as demand and supply, regulation changes, mortgage rates, and disposable income.
Since the comps are supposed to be sold close in time to the subject property, or adjusted by indices to that point in time, these factors can be assumed to be shared between the comps and the subject property.

2.3 Images in the Sales Process

When it comes to property sales, images play a crucial role as part of the listing material, providing potential buyers with a first impression of the property. They give buyers a general idea of the property and help them assess whether they are interested in bidding or attending a viewing [13]. Capturing people’s interest is essential in persuading potential buyers to pursue further steps, increasing the number of potential buyers and, thereby, the demand for the property.

This part of the selling process has generated a business niche in home staging. It takes advantage of the importance of aesthetics, aiming to make the home feel and look better to attract more buyers. Consequently, it potentially increases the price and thereby makes the service a potential investment [4].

In a study using eye tracking, it was determined that the subjects spent 60% of the time viewing the images in the property advertisement, compared to the description and comments from the broker [27]. This highlights the importance of the images in the sale process. In addition, images have the advantage of universally communicating the property’s condition without language barriers, conveying its appearance in a way that words alone may struggle to achieve [16].

2.4 Interior Visible Features

A broker will visually inspect the subject property during the valuation process to identify the previously mentioned features related to the property’s beauty, utility, and stability, which are necessary to find suitable comps [1].

The property’s rooms show material choices related to stability. They can also indicate the feasibility of spending time with friends and family or whether this area is too cramped.
Simultaneously, the bathroom and kitchen conditions can indicate one’s ability to cook food and maintain proper hygiene [1], [26].

Damages and aesthetics are also visible components on the surface layer of the exterior and interior. These damages can be moisture and humidity on the ceiling or walls, as well as cracks and stains.

Within the home, both in the exterior and the interior, there are time-typical features that are normal for the era [28], [29]. These features can be the type of wallpaper, the doors and the windows of the property or some additional details. While these features can show a desirable style, they also hint at potential underlying issues; the construction industry has tried different techniques and materials throughout the years, with the typical problems often appearing only long after construction [28], [30].

2.5 Limitations and Challenges

A limitation in the manual assessment process is that while the subject property can be visually inspected and assessed thoroughly, the comps are usually not readily available for inspection, making it hard to adjust the assessment based on these features [1]. This is especially true for the interior, while the exterior and surroundings can be viewed with satellite or street view images. This can lead to a situation as seen in Figure 2.1, where two similar-looking sales differ in price, making it problematic to assess the features setting them apart and to adjust accordingly.

Figure 2.1: Highlighting comp selection problem

A broker with area knowledge and previous sales experience within that area might have a good understanding of the differences, including the general condition and how to adjust the price accordingly. Conversely, a new broker might be more restricted [1].
The visual aspect, if available, can then aid the broker, leading to a better understanding of the differences. Figures 2.2 and 2.3 show an extreme example of the potential differences in stability, utility, and beauty.

Figure 2.2: Example comp (Desired)

Figure 2.3: Example comp (Undesired)

One of the main limitations of data-based valuation systems, like the AVM, is their difficulty in handling unstructured data that has proven important to the market value, or data that is hard to access. Examples include natural light, sound levels from the surrounding area, and views from the property [16]. While it is easy to add quantitative data, using unstructured data, such as ad descriptions, satellite images, and exterior and interior images of a property, requires more advanced feature extraction techniques. However, features extracted from the unstructured data could provide essential information for comparing comps with the subject property, whether for AVMs or the broker.

These architectural qualities of the home have been referred to as the property’s unmeasurable values [26], and the literature on what is considered high quality within these topics is limited in Sweden [31]. This has triggered new research within the field to find objective guidelines [32], thereby highlighting the difficulty and importance of the topic.

2.6 Review of Similar Studies

Along with advancements in Machine Learning (ML) and Artificial Intelligence (AI) and the increasing ability to utilise unstructured data, methods based on DNNs have begun to be used to obtain more objective and data-driven real estate valuations. These methods have also shown promising development in predicting the hard-to-quantify features related to architectural qualities.

The literature contains various studies on real estate valuation and the use of visual features from images, mainly aiming to reduce uncertainty in the assessment.
The studies have focused on different properties with overlapping themes, such as attractiveness [11], [33], aesthetics [34], material usage [35], luxury levels [4], the impact of damages [36], and the effects of furnished and unfurnished images [37].

Most of these studies have been conducted on exterior images, such as satellite images [38], [39] and street views [20], [40], possibly due to the ease of accessing this data afterwards with services such as Google Street View [41] and Google Maps Static [42]. However, interior images have also been the focus of some studies, where the room types were assessed separately [43]. Closely related, a study was conducted on the number and location of photos taken from Facebook and how that indicated something beautiful or photo-worthy within the area [44].

These studies have been conducted in many countries, such as China [40], Italy [38], England [39], the United States [3], and South Korea [20], highlighting region-specific insights. For instance, research in Beijing, China, showed a negative correlation between water bodies and market value due to pollution [40].

2.6.1 Methods

These studies use different methods to measure visual features and their importance. One method, gathering ratings on aesthetics and damages, has been used to capture subjective opinions [37], [43]. In these studies, participants grade the visual features in a comparable fashion, and multiple options are presented. A rating is set on the targeted features, such as damages or aesthetics, and an average rating serves as the ground truth during the training of the models. This has also been done to quantify beauty with the help of natural language processing (NLP), where comments on images are gathered to extract assertions in the form of undesired and desired comments regarding image aesthetics [6].

Another method is to use the error of the original estimate as an indicator of the visual features’ effect on the price [3].
A negative difference in an area can indicate that the visual aspects of the property differ negatively from those of the region. Using multiple examples, the model can find these commonly negative and positive patterns as features that can be added to the assessment. This method is heavily based on the assumption that the estimate’s error is caused by the visual aspects.

2.6.2 Previous Attempts

Among the studies that utilised ratings, one evaluates the effect of features in property images on real estate, with a group of experts using structured methodologies to evaluate the functionality and aesthetics of furniture [37]. The results showed that this approach was effective in aligning furniture design with consumer preferences and quality standards.

In the study "Image-Based Appraisal for Real Estate Using Mask R-CNN" [36], each image was labelled with multiple annotations related to the room conditions, including damages and their severity. This study focused on the lack of attention given in real estate valuation to the property’s current condition as shown in its images. The Mask R-CNN [45] approach published by Facebook AI Research was used, and both defect and damage detection were performed by object segmentation on the interior and exterior images of real estate. The primary purpose was to understand the effect of the defects and damages in the interior and exterior images on the price.

The price error was used as an indicator for undesired visual features in the study "House Price Estimation from Visual and Textual Features" [3]. Specifically, a binary classifier for the curb appeal of houses was developed. It was based on the error of the previous prediction in combination with Principal Component Analysis (PCA) of pre-trained ResNet features. The choice of a binary classifier for good and bad curb appeal is a simplification over a regression task where the actual difference is the target.
While their approach led to an improvement, the authors stated that it was only a modest one.

In the context of AVMs, these studies have mainly used models such as Ordinary Least Squares (OLS) [46] and XGBoost as baseline models, comparing the result with and without visual features extracted from images. Another study used recurrent neural networks (RNNs) [47] to process data from random walks based on the location of properties to embed locality to improve property pricing [16]. For performance evaluation, these studies generally used Mean Square Error (MSE), Mean Absolute Percentage Error (MAPE), and R². In addition, the results of these studies showed moderate improvements, suggesting that visual features reduce uncertainty and emphasising the need for further research [3], [20].

2.7 Computer Vision

The field of computer vision, which retrieves information from images and video, has been active for a long time. The early breakthrough of CNNs was their ability to capture patterns, making it possible to classify text and scenery from visual media [5].

2.7.1 Convolutional Neural Networks

In 1998, Yann LeCun and his collaborators introduced the use of CNNs for document recognition in the paper "Gradient-Based Learning Applied to Document Recognition" [5]. With this work, known for the LeNet architecture, they pioneered CNNs’ architecture and training methods and showed their effectiveness for two-dimensional shapes, such as handwritten characters.

After LeNet, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton introduced AlexNet in 2012 in the paper "ImageNet Classification with Deep Convolutional Neural Networks" [48]. AlexNet was deeper than LeNet and used ReLU activation functions to increase the model’s performance and GPUs for computation. It also achieved a top-five error rate of 15.3% on the ImageNet challenge, which was a significant improvement over the existing models.
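To make the top-five error rate concrete, the following minimal sketch computes a top-k error rate from per-class scores: a prediction counts as correct when the true class is among the k highest-scoring classes. The function name `top_k_error` and the toy scores and labels are illustrative assumptions and are not taken from the cited papers.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Hypothetical scores for three images over four classes (illustration only).
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.3, 0.1, 0.1, 0.5],
]
toy_labels = [1, 2, 0]

print(top_k_error(toy_scores, toy_labels, k=1))
print(top_k_error(toy_scores, toy_labels, k=2))
```

With k = 1 the same function gives the top-1 error, so top-1 accuracy is simply one minus that value.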
Following this, Karen Simonyan and Andrew Zisserman introduced the VGG networks in 2014, in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" [49]. VGG’s architecture was deeper than AlexNet’s, which was made feasible by the use of very small 3x3 convolutional filters. The VGG model with this architecture achieved better performance on the ImageNet challenge than earlier models such as AlexNet. This result showed that using deeper networks with smaller convolutional filters can increase the performance of the model and provide better accuracy in image-oriented tasks.

Subsequently, the ResNet model was introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in 2016 in the paper "Deep Residual Learning for Image Recognition" [50]. This paper presented a new approach to the problem of vanishing gradients during the training of deeper networks. By incorporating skip connections between layers, ResNet effectively mitigated this issue and achieved superior results across various image-oriented tasks, surpassing the performance of previous architectures such as VGG.

While these larger models are trained with millions of images [51] to find robust features, they can start from a pre-trained stage, where the model has already learned base patterns that can be fine-tuned to a set objective. These larger models must be trained on data that are connected to the target domain at hand. For instance, a large model that is trained on a dataset linked to animals might not have found the patterns that would be useful in real estate.

The continued growth of openly available datasets, from the MIT Indoor 67 [52], with 15,620 images, to Places 365 [53], with approximately 8 million training images, gives the ability to train these larger DNN models, which would otherwise have been impractical due to resource, time and image constraints.
The AlexNet [48], VGG [49], and ResNet [50] models have shown promising performance in the papers related to classifying scenery and distinguishing between different types of settings, such as cafeterias and classrooms. The models still show room for improvement, especially in the top-1 prediction accuracy, which means the accuracy of the class with the highest probability and is currently between 50% and 60%. This is still impressive given the 365 options, and it demonstrates both the ability of these models and the slight improvement between the generations of models [53].

2.7.2 Transformers

The recent success of transformers in NLP, starting from the paper "Attention is All You Need" [54], started an AI leap with the advancements of new tools such as BERT [55] and GPT [56] models. The significant improvement with transformers is the ability to handle sequential data without having to process them sequentially, a limitation that previous models, such as the RNN, had [54], [55]. It does this with the help of the attention, or self-attention, mechanism, which sets the importance or weight of parts of the input, thereby keeping attention on the highly weighted parts [54]. This has shown improvements over previous methods [6].

The use of transformers has also entered the computer vision field in the form of the Vision Transformer (ViT) [57], which is a transformer model designed to push the limits of transformers outside their primary field, NLP, and perform in the computer vision or image analysis field. In ViT, the main idea is to let the model learn image structures independently by representing all image inputs as sequences of patches using the attention mechanism of the transformers [57].

The main advantage of ViT over CNNs is that ViT uses these attention mechanisms that are not constrained by the spatial structures on which CNNs are based.
This allows the model to focus on the most relevant parts of the input image [57]. This can lead to more efficient processing, especially for tasks that benefit from understanding the global context of the image [57].

2.8 Gaps in the Research

Given the rapid advancements in AI and the vast amount of unstructured data within the real estate field in the form of text, images, and videos, this research area holds significant potential [58]. Specifically, it can leverage these unstructured data to improve the market value assessment. Also, the difficulty of estimating market value makes property valuation a good test case for new developments in including extracted features that are difficult to quantify in order to reduce uncertainty.

Given the current difficulty in quantifying the architectural quality features of the property and its usage in the comp selection, we believe that the future of real estate research around the valuation process will continue to find ways to incorporate more complex components. These components can then be correlated with the market valuation to highlight their importance. Additionally, they can be used to make the selection of comps easier for real estate brokers.

Many features related to the architectural qualities could be extracted, such as sound levels and natural light levels. However, they are also hard to quantify accurately and clearly. Furthermore, acquiring relevant data for these features presents a challenge [21]. As AI models continue to advance and more open data becomes available, the field of property assessment will undergo renewed exploration of the uncertainty and of the correlated features. This includes the dimensions of the architectural quality.

2.9 Future Directions

While the use of CNNs and DNNs in the valuation process has been studied, mainly in other regions and with both exterior and interior images, the use of ViT remains limited.
Regional differences might also include regional biases in the studies, potentially limiting the applicability of the findings to other regions, such as Sweden.

Another area for improvement with these findings lies in the ability to interpret the result. While these methods have shown slight improvements in automated valuation, they are usually hard to use outside of AVMs. A more targeted extraction, with a visual understanding, could also aid the manual assessment process, helping the broker in the comp selection process and speeding up their workflow.

Therefore, there is a need for future studies of visual features in regions such as Sweden that continue to explore ways of incorporating additional features into the valuation process. This could enhance its accuracy, efficiency, and understanding of the market value and its relation to characteristics such as architectural qualities.

3 Methods

The method chapter begins with a summary of the research plan, followed by the data and pre-processing required for this thesis. Thereafter, the theory and best practices for training DNNs are explained. The chapter concludes with the visual extraction methods, the AVM models and the scoring methodology used in this research.

3.1 Research Plan

The primary objective of this thesis is to leverage visual aspects from interior images with the help of state-of-the-art computer vision models to reduce uncertainty in Swedish property valuation. The hypothesis is that interior images reflect the architectural qualities that advanced computer vision models can extract and use in the predictions, thereby improving the AVMs’ accuracy. We test this hypothesis by conducting an empirical research study on private housing within the Uppsala County region in collaboration with Valueguard Index Sweden AB (Valueguard) [59].
The collaboration with Valueguard enables us to access listing images taken at the time of sale, along with the metadata associated with the property, selling date, and selling price. It also supplies us with housing indices that track price changes over time, which enables us to adjust these sales to the same date, resulting in a more comparable dataset. Lastly, their extensive expertise in the real estate field provided invaluable guidance throughout the project.

This thesis focuses on the interior images and excludes exterior and surrounding images. These interior images are categorised separately according to room types for a more tailored comparison. One initial limitation in this study is that these labels are not provided, thus requiring extensive manual pre-processing through image classification to obtain the required dataset.

In this thesis, the accuracy of the AVM model is used to measure the impact of these extracted features, measuring the reduced uncertainty in the form of MAPE in the AVM prediction. Thereafter, a 10-fold cross-validation is combined with paired t-tests to test for statistical significance of the added visual features. Additionally, when a statistically significant improvement in the reduction of MAPE is found, the feature weight in the model is examined to understand the importance of the newly added features. Consequently, this research plan aims to provide a solid foundation for the research.

3.2 Limitations and Scope

The first limitation in scope is the types of properties explored. This thesis focuses exclusively on apartments and smaller houses, essentially year-round private housing. This approach defers the inclusion of interior images from commercial establishments and summer cottages to future studies. This decision ensures a targeted approach, considering the properties’ distinct customer groups and usages. There is also a limited availability of relevant data for commercial housing for this thesis.
Secondly, this thesis only explores the interior images and excludes images of the property’s exterior and those depicting the surrounding area. This, in combination with the focus on the room types separately, creates a high reliance on representative data connected to all the room types for each sale.

Thirdly, this thesis does not collect votes or labels from brokers to use as a target for architectural quality. Instead, it attempts to find these structures in the data. This scope is set to explore the advancement of new AI tools for finding patterns, primarily due to the scarcity of available experts in the field to assist in this process.

Finally, the region’s size is limited to control the number of sales and images processed within the study. A single region, rather than a sample from the country as a whole, is chosen to capture comparable sales within the region. Therefore, it is decided that only sales within the region of Uppsala County are included, as shown in Figure 3.1. The region is chosen based on the available data, with a preference for regions the authors are familiar with. Furthermore, the region is also considered sufficiently large to generate a dataset big enough to make the use of DNN models meaningful.

Figure 3.1: Uppsala County on OpenStreetMap [60]

3.3 Technologies

Multiple programming languages can be utilised for visual feature extraction. However, due to previous knowledge and experience, the choice is made to work in Python [61] and use PyTorch [62] as the main library for working with images. Additionally, we use GeoPandas [63] to mark up areas and calculate distances, while we use pandas [64] to load and process the metadata. Subsequently, we run these tools and models mainly in Jupyter notebooks [65], which run inside a Docker container [66] with graphics processing unit (GPU) access.
A key tool in the project is Label Studio [67], which makes it fast and efficient to work with labelling and ensures that data policies are upheld by working locally. Additionally, it supports the required multi-label and single-class classification that we use in the thesis. The tool is essential for labelling a large quantity of images securely and within a reasonable time, with an easy import function from a JSON format to generate the tasks and hotkeys to speed up the labelling process.

MLflow [68] is another valuable tool for this thesis. It makes saving experiment results, together with the connected parameters, scores, models, and graphs, more concise and manageable. This reduces the difficulties associated with a more extensive set of models with different training parameters. It also makes it possible to do larger experiments sequentially over multiple days, trying out a wider range of parameters that can be assessed afterwards.

Computing power is an essential part of running these models. Given that DNNs run considerably faster on GPUs than on central processing units (CPUs) [69], GPU resources are used to increase the number of feasible experiments within the limited time. Also, due to the sensitive nature of some of this data, an additional requirement is that it has to stay within the company’s hardware. This requirement removes the alternative of utilising cloud computing, which can scale more freely.

As a result, for this project, a Linux server with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 32GB DDR4 RAM and an Nvidia GeForce RTX 3090 with 24GB VRAM is provided and used throughout the project. Given the dataset and models explored, the hardware is deemed sufficient to run and train the models within a reasonable time for multiple attempts throughout the project.

3.4 Data

Data is our project’s most critical and essential part, especially when using the DNN approach due to its data-hungry nature [70].
This data consists of information related to the property, such as the number of rooms and size. Valueguard provides most of this data through metadata and images connected to the sales. However, some additional data sources are used to enrich the dataset. This additional data is mainly related to the locality in the form of additional regions marked up with the provided coordinates.

3.5 Valueguard Index Sweden AB

Throughout the thesis, there is a collaboration with the company Valueguard. This company has a long history of creating hedonic housing indices and aiding brokers in the valuation process by providing CMA tools that suggest comps [71]. Additionally, they offer AVM services in the form of an API [72]. Furthermore, Valueguard gathers its data from various sources, such as realtors, direct data transfers from multiple real estate agencies, and large providers such as Svensk Mäklarstatistik [73].

3.5.1 Ethical Considerations

Working with images from people’s residences may be ethically intrusive. Although these images have been used for marketing and public viewing, within the thesis, there is a commitment to maintaining confidentiality throughout the project. This is done by implementing strict safety measures to anonymise all images, ensuring that individual privacy is protected and no personal data is compromised. This includes keeping the images secure on the server, generating universally unique identifiers (UUIDs) for the images that can only be tied to sales with credentials during the run, and removing any images with individuals from the dataset. These security measures also led to the decision that the visual examples in the thesis come from licence-free image providers such as Unsplash [74] rather than from the provided images.

3.5.2 Metadata

The metadata here refers to the data connected to the sale and includes numerical and categorical variables tied to the property and the connected location.
The provided metadata is used as the baseline features in the AVM model comparisons.

3.5.2.1 Location

As previously mentioned, the locality is an essential component of the valuation process. Using publicly available regions extends the given base data, giving the models more opportunities to learn characteristics tied to the region. In this thesis, two external location-oriented sources are used. The first is Statistics Sweden [75], which exposes multiple definitions of Swedish regions to capture and compare localities.

The regions used from Statistics Sweden in our study include Demographic Statistical Areas (DeSO) [76], a geographical division of Sweden for a more detailed demographic and statistical data analysis. Additionally, Regional Statistical Areas (RegSO) [77] are used, where Sweden is divided into 3,363 statistical areas based on municipal and county boundaries. Urban areas [78], defined as contiguous built-up areas with at least 200 inhabitants, are also considered. Finally, the study incorporates Municipalities [79], which divide Sweden into 290 more extensive regions.

The different regions within the Uppsala County boundaries can be observed in Figure 3.2, showing the different sizes of the regions and their overlap, allowing the model to tie smaller regions to more extensive regions. Additionally, this is added to ensure the models can capture the micro-location and regional features explored in theory.

Figure 3.2: Geographical areas from Statistics Sweden on OpenStreetMap

Besides the pre-defined regions provided by Statistics Sweden, the second source is geographical tiling, similar to a grid system, which is used to capture regional characteristics on multiple levels or, as they are referred to in the index, resolutions. This is achieved by marking the sales with multiple layers of Uber’s H3 index (H3) [80] at multiple resolutions. This approach is inspired by a blog post from Zillow about the design of their DNN AVM [24].
Figure 3.3 displays a visual example of resolutions 7, 9 and 11 in Uppsala centre.

Figure 3.3: H3 Index with different resolutions on OpenStreetMap

One final feature is added to the sale regarding the location, which is the distance to the urban area centre. It aims to capture how far the sales are from desirable locations in the region centre, such as stores and other necessities. The centre is estimated using the 90th quantile of the recounted sale price within the region, thereby only selecting the most expensive sales. The centre point is then generated by taking the median of the x-axis and y-axis of the RT90 coordinates separately. This method operates under the assumption that property market values generally increase the closer they are to the city centre. The straight-line distance is then calculated to the centre point from the sale within the region and added to the base features.

This study exclusively uses the distance to the generated centre of the urban areas. However, other distances, such as the distance to the municipality centre and the travel distance to the airport or public transport, can enrich the models with more information regarding the location. Nonetheless, they are excluded to minimise the scope and workload.

3.5.2.2 Base Features and Target

This section describes the base features and the target of the AVM model. Due to the skewed distribution of some of the features, a logarithmic transformation or log scaling is applied to convert these distributions to a more normally distributed one, potentially making it easier for the model to work with these features [81].

The base features in the form of metadata can be seen in Table 3.1 for houses and Table 3.2 for apartments. The type of variable and transformation are indicated with the tag notation of C for a categorical feature, L for a feature with logarithmic transformation, and N for a numerical feature.
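The distance-to-centre feature from Section 3.5.2.1 can be illustrated with a minimal sketch, assuming sales are given as planar (x, y, price) tuples for one urban area. The function names, the nearest-rank way of picking the 90th quantile, and the toy data are assumptions made for illustration; the thesis does not state the exact quantile method or implementation.

```python
import math
import statistics


def estimate_centre(sales, quantile=0.9):
    """Estimate a centre as the coordinate-wise median of the most expensive sales.

    `sales` is a list of (x, y, price) tuples in a planar coordinate system.
    """
    prices = sorted(price for _, _, price in sales)
    # Nearest-rank style threshold (an assumption; other quantile definitions exist).
    threshold = prices[min(len(prices) - 1, int(quantile * len(prices)))]
    top = [(x, y) for x, y, price in sales if price >= threshold]
    centre_x = statistics.median(x for x, _ in top)
    centre_y = statistics.median(y for _, y in top)
    return centre_x, centre_y


def distance_to_centre(x, y, centre):
    """Straight-line (Euclidean) distance, valid for planar coordinates such as RT90."""
    return math.hypot(x - centre[0], y - centre[1])


# Hypothetical sales for one urban area: (x, y, recounted price).
toy_sales = [(float(i), float(2 * i), 1000.0 * i) for i in range(1, 21)]
toy_centre = estimate_centre(toy_sales)
print(toy_centre, distance_to_centre(16.5, 35.0, toy_centre))
```

In practice the two functions would be applied once per urban area, so that each sale is measured against the centre of its own region.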
Table 3.1: Housing metadata

Feature
Home type [C]
Standard points [N]
Plot area [N]
Living area [N,L]
Rooms [C]
Ancillary area [N]
Construction year [N]
RT90x [N]
RT90y [N]
DeSO [C]
RegSO [C]
Municipality [C]
H3 Index res 3-9 [C]
LKF [C]
Distance to urban area centre [N]

Table 3.2: Apartment metadata

Feature
Elevator [C]
Monthly fee [N]
Living area [N,L]
Rooms [C]
Construction year [N]
Floor [N]
Floors [N]
Housing cooperative [C]
RT90x [N]
RT90y [N]
DeSO [C]
RegSO [C]
Municipality [C]
H3 Index res 3-9 [C]
LKF [C]
Distance to urban area centre [N]

Regarding the shared features between the property types, living area indicates the livable area size. Meanwhile, rooms is the number of rooms within the property and is treated as categorical due to the demonstrated differences in price per square metre between room counts [82]. Additionally, the construction year refers to when the property was built.

Furthermore, among the regional features, RT90x denotes the horizontal distance to the east from the central meridian of the RT90. Meanwhile, RT90y indicates the vertical distance to the north from the central meridian of the RT90. The DeSO, RegSO, municipality, and H3 Index with different resolutions are the regions with which the data is marked up, while the LKF represents a region connected to a parish region [83]. Lastly, distance to urban area centre is the shortest distance to the generated urban area centre.

In the case of house-specific variables, the home type indicates the type of house, such as chain house, semi-detached house or terraced house. Meanwhile, standard points reflect the condition scoring of different features in the house, such as kitchen setup and renovations [84]. Additionally, the plot area and ancillary area indicate secondary areas that are not part of the main living areas of the property, such as the garage or basement.
In the case of apartment-specific features, the elevator feature indicates whether the apartment building has an elevator. The monthly fee is the fee paid each month to the housing cooperative for the apartment. The floor represents the floor on which the apartment is located, while floors represents the total number of floors in the apartment building. Meanwhile, the housing cooperative is the legal institution that owns the real estate.

The target variable for the AVM is the logarithmically transformed recounted sale price, which is used as a proxy for the market value. In the recount process, the price at the time of the sale is adjusted with the help of a regional index to represent the market value on January 15, 2024. This date is selected because it is the latest published index value at the start of this thesis. This approach allows us to exclude the time variable in our thesis and simplify the task. However, it should be noted that the further the original sale lies from the recount date, the more uncertain the recounted value becomes. In our study, the sales and the corresponding images analysed cover the period from 2019 to 2024.

The images play a crucial role in our study. Each sale comes with approximately 30 images of the property on average. The images of each room type are pooled across the property types. This grouping is based on the assumption that the characteristics of these rooms are comparable and overlap between apartments and houses.

3.6 Room Types

In this study, the focus is primarily on the interior images, and the room types are handled separately. This is done to make comparisons within the model more meaningful: kitchens are not compared with bathrooms, but bathrooms with other bathrooms, to find the similarities and differences in architectural qualities that are relevant to the room type. The room types used in this study are the bathroom, bedroom, kitchen, living room, and dining room.
The room types are chosen partly because of their connection to architectural quality in the theory: the functional importance of the bedroom for sleep, of the kitchen for preparing food, and of the bathroom for hygiene. However, some of the room types, such as the dining room, are chosen because they were used in similar studies [4]. The goal of the pre-processing is to obtain images that depict only the chosen room types, as seen in Figure 3.4.

Figure 3.4: Room types explored

3.7 Labelling

Labelling is a time-consuming but essential part of this study. It generates the core dataset by assigning each image to the corresponding room type. A hasty execution of this step can create issues for the rest of the study through a low-quality dataset or a lack of data. Therefore, a longer period is set aside for labelling to reduce the impact of time pressure, and a portion of the dataset is labelled each day. The data quality is also reviewed throughout the thesis: low-quality and undesirable images are marked when seen and then excluded.

Due to the previous success in distinguishing scenes with high accuracy [4], [53], the labelling task is split into two stages, using a so-called hierarchical approach with two sub-tasks. The first stage aims to exclude a large proportion of the images that are irrelevant to this study, thereby maximising the share of study-related images available for a more thorough review in the second stage. The classes of the initial stage are interior, exterior, and others, where others is a class for floor plans and 3D renderings of the property. These three classes are selected due to their apparent visual differences and the assumption, based on previous successes, that they can be separated with high accuracy.

The second stage, which involves classifying the room types from the labelled interior images, presents a more complex challenge.
This is partly due to the presence of multiple room types in one image. This issue is handled with the help of a multi-label classifier, where each room type receives its own probability for each image, which adds the option to select multiple rooms in the second labelling stage. In the study, 10,000 images are labelled in the first stage and 5,000 in the second stage. The higher number in the first stage is due to the ease of labelling these images; in addition, an accurate first-stage model passes more interior images on to the second stage, which increases the number of usable images and reduces the need for further filtering of non-interior images.

Furthermore, the first stage is treated as a single-label classification task where the highest-scoring class is chosen. In contrast, the second stage is treated as a multi-label task where each class is assessed separately against a threshold. This threshold is determined by testing a range of thresholds and selecting, for each room type, the one with the highest F1 score to ensure a balance between the model's precision and recall. The output of the two stages is illustrated in Figure 3.7.

Figure 3.5: Stage 1
Figure 3.6: Stage 2
Figure 3.7: An instance of the labelling process

Another technique used to improve the classification model is semi-supervised learning. This approach uses pseudo-labels, that is, high-certainty predictions of the initial model on unseen data, as additional training data in a second training stage. In our study, predictions with a certainty above 90% are deemed reliable enough to be used as training data in a second training run.

Besides the five main room type labels in the second stage, there are three additional labels for aiding purposes, namely Miscellaneous, Needs work, and Uncertain.
The Needs work label is for images that need cropping or additional processing, for example when an image is a composite of multiple images. The Uncertain label tags images that do not depict a room clearly or for which the labeller is unsure. The Miscellaneous label marks rooms that are not analysed in this study and are therefore excluded; these rooms could, for example, be saunas or gyms. These additional tags make it possible to indicate uncertainty and required further work in parallel with labelling the dataset. This is especially helpful for unfurnished rooms, which can serve multiple functions and are hard to label.

3.8 Data Pre-Processing

The metadata and the images are pre-processed, partly to create the format required by the models and partly to aid their convergence.

A frequently employed approach in similar studies [13] is to filter out sales for which the price deviates significantly from the other sales in the same region or from an initial prediction. There can be multiple reasons for such a price difference, such as data quality issues or a failure to adhere to the market value conditions. Therefore, prices that deviate from the initial estimate by more than a certain margin are considered unreliable and undesirable for the model to learn from. As a result, this thesis excludes sales with prices that deviate from the initial estimate by more than 80% in either direction.

Normalisation is another useful method that can make it easier for AI models to converge and learn representations, primarily because the ranges of the features become easier for the model to handle [81]. Therefore, the numerical values used in the valuation model are standardised to a mean of zero and a standard deviation of one. The normalisation parameters are computed on the training data and then applied unchanged to the test data, so that the evaluation remains out-of-sample.
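The two pre-processing steps above, the 80% deviation filter and the standardisation fitted on the training data only, can be sketched as follows. This is a minimal illustration under stated assumptions: the deviation is measured relative to the initial estimate, the pair layout and function names are hypothetical, and the population standard deviation is used.

```python
import statistics


def filter_by_deviation(sales, max_deviation=0.8):
    """Keep sales whose price is within +/- max_deviation of the estimate.

    Each sale is a (price, initial_estimate) pair (assumed layout); the
    deviation is taken relative to the initial estimate.
    """
    return [(p, e) for p, e in sales if abs(p - e) / e <= max_deviation]


def fit_standardiser(train_values):
    """Compute mean and population standard deviation on training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardise(values, mean, std):
    """Apply previously fitted parameters, e.g. to validation or test data."""
    return [(v - mean) / std for v in values]
```

Fitting the parameters once and reusing them means the test data never influences the scaling, which is the point of the out-of-sample requirement.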
Due to the high-dimensional feature space in the valuation model, especially from the categorical features connected to the location, feature reduction techniques are explored to keep the model from over-fitting to noisy variables. In this thesis, this is done by comparing a model trained on all provided variables with one that is trained only on the variables with higher feature importance, selected using the SelectFromModel [85] function. The latter uses a model pre-trained on all the variables and keeps only the variables whose absolute feature importance lies above the mean for that model. This technique focuses on the more essential variables, with the aim of a model that generalises better using only the more robust features. The model with the highest performance on the validation data is then chosen.

To streamline these pre-processing steps, two scikit-learn pipelines [86] are created, with the first step handling the numerical and categorical features separately. The numerical features are standardised as described earlier with the provided StandardScaler [87], while the categorical variables are one-hot encoded with the OneHotEncoder [88]. In the second pipeline, the additional feature selection is added, and only the features with above-mean importance are kept, as previously described. Figure 3.8 visually represents the pre-processing pipeline, where X is the features and y_pred is the predicted market value.

Figure 3.8: Pipeline for the AVM model

In the case of the DNNs used with the images, the pre-trained models usually had a normalisation applied to the images during their training [89]–[91]. For such a model to work as expected, the same normalisation must be applied to the new data to give comparable and reasonable results. In PyTorch, these pre-processing steps are usually provided by the Transforms library [92] and applied during training. This is also the case for this thesis.
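To make the encoding and selection steps of Section 3.8 concrete, the sketch below shows one-hot encoding of a categorical column and the above-mean importance rule in plain Python. It only mimics the rule described in the text and is not the scikit-learn implementation; the function names and the example feature names are hypothetical.

```python
def one_hot(values):
    """One-hot encode a categorical column.

    Returns the sorted category list and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def select_above_mean(importances):
    """Keep features whose absolute importance is at least the mean.

    `importances` maps a feature name to an importance or coefficient taken
    from a model that was already fitted on all features.
    """
    threshold = sum(abs(v) for v in importances.values()) / len(importances)
    return [name for name, v in importances.items() if abs(v) >= threshold]
```

With many one-hot location columns, most of which carry little importance, a mean threshold of this kind tends to remove a large share of the columns, which is the intended reduction.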
3.9 Deep Neural Networks

DNNs are neural networks with multiple hidden layers between the input and the output, which makes them deep. These deep models are the primary models used in this thesis, and the following section outlines the methods used to train the DNNs and highlights the best practices followed.

DNNs come in different forms to handle different kinds of data, such as sequential data, images, sounds, or inputs suited for standard feed-forward networks. However, the central concept is that these networks take the initial input signals and propagate them through the network, resulting in an output format designed to align with the task.

This thesis mainly works with two-dimensional and one-dimensional data. The two-dimensional inputs are the images, with their RGB channels for the red, green, and blue colours. These are used as input to find spatial patterns related to the task. A visual example of this can be seen in Figure 3.9.

Figure 3.9: Neural Network with two-dimensional input

Meanwhile, the one-dimensional input relates to the numerical and categorical variables for the AVMs. Figure 3.10 shows a visual example of a model with a one-dimensional input.

Figure 3.10: Neural Network with one-dimensional input

In the domain of AI models, and especially with DNNs, the complexity of the models can be chosen quite freely. This flexibility makes DNNs a powerful tool that can be applied to various tasks. However, the trade-off between variance and bias must be carefully assessed to arrive at a model complexity that can learn the patterns required in the task without overfitting to the noise within the data.

In addition to the choice of model complexity, regularisation methods can reduce the model's tendency to over-fit on the training data. In this thesis, several regularisation methods are used.
These include early stopping, which stops the training phase when the performance on the validation data no longer improves [93]. Additionally, weight decay is used to penalise large weights during training, which limits how strongly the weights adapt to noise. Furthermore, batch normalisation, which normalises the data between the layers, is incorporated into some of the models [94]. Dropout layers are also used to set a percentage of the neurons to zero during training to lower the reliance on particular neurons.

Another method, used to leverage the patterns learned by earlier models, is transfer learning, where a model trained on one task has learned robust features for differentiating the data [95]. An example is a room classifier trained on the Places365 dataset. Such a model can then be used in another task by replacing the end of the model with a new task-specific part. This new part is usually a new sub-model on top, referred to as the head of the model.

When training these pre-trained models, the best practice is to freeze, or lock, the base to retain the robust patterns learned in the previous task and to train only the head [95]. This is done in a main training stage with a larger learning rate, followed by a fine-tuning stage with a lower learning rate. Finally, the base, or backbone, is also added to the training with a minimal learning rate as a final tuning of the whole model. In each stage, the model is trained until a set maximum number of epochs is reached or until early stopping halts the training.

Recent developments in training techniques and models within projects such as SimCLR [96], [97] and DINO [98] have shown the strength of self-supervised models. These models can learn robust features without the need for labelled data by augmenting an image and aiming to maximise the similarity between the original and the augmented image.
The idea is that if the model has difficulty distinguishing the two, they are presumably similar.

In this thesis, these improvements are a good match due to the lack of provided labels and the focus on the differences in the images. The study uses them in the form of the provided pre-trained models [99] and by training our own self-supervised base models for each room type with the DINO V1 [98] approach. All DINO V1 self-supervised models are trained using the provided default parameters, with the recommended 100 epochs for initial convergence. This limit is set because of the runtime of training these models, approximately two days per model, and the uncertainty about whether the available amount of data is enough to produce an adequate model.

3.10 Visual Target Features

This section focuses on visual feature extraction. This thesis explores three primary methods for feature extraction. The first is a binary classifier, where the decision between a desired and an undesired attribute is assumed to be related to the percentage error or the standard point. The second is an unsupervised approach, where clusters obtained from different models are assessed visually in relation to architectural quality. The last approach uses the CLIP model, which compares the similarity between an image and a positive and a negative description of an architectural quality in a zero-shot fashion. Zero-shot learning refers to a scenario where the model is applied to a task it was not trained on, and no examples of the task are given to the model [100].

3.10.1 Binary Classification

The binary classifier can be used as a simplified regression task where the magnitude of the available target is not directly tied to the feature of interest but is assumed to be related to it. The provided magnitude is therefore ignored, and the target is converted into a simplified desired or undesired category corresponding to the positive and negative sides.
The model can then create its own magnitude by analysing the common patterns on the positive and negative sides, in the form of the predicted probability of the desired or undesired class.

This thesis uses two distributions to retrieve desirable and undesirable features. The first target, also used in earlier studies [3], is the percentage error between the prediction made without the visual features and the actual price. In line with its usage in the theory, it is assumed that the difference between the initial assessment and the market value stems from missing attributes related to architectural quality that can be observed in the images. The more images in the undesired class that share similar patterns, the stronger the predictor of an undesired feature.

The second target is the standard point, which adds a Swedish-specific approach. This score only exists for houses. It is a condition measurement used as part of the housing declaration. The score assesses multiple factors, such as the property's exterior, energy management, kitchen condition, sanitation, and other interior features related to condition and function [84].

Although these scores are not based solely on visible interior features, this thesis assumes that a home with a high standard point score or a high percentage error has visual features in its interior that indicate higher architectural quality.

Figures 3.11 and 3.12 show the distributions of the percentage error and the standard points used during training, both of which appear to be approximately normally distributed. Zero is the divider in the percentage error case, while the empirical mean is the divider between lower and higher in the standard point case. To focus on the more distinct differences, only the sales with an absolute percentage error above ten percent are used in the percentage error case.
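The construction of the two binary targets can be sketched as below. The sign convention of the percentage error (actual minus predicted, divided by actual, so that a positive value means the property sold above the estimate) is an assumption for this example, as are the function names; the thesis does not spell out the exact formula.

```python
import statistics


def percentage_error_labels(actual, predicted, min_abs_error=0.10):
    """Label sales as desired (1) or undesired (0) from the percentage error.

    Sales whose absolute percentage error is at most `min_abs_error` get
    None, meaning they are left out of the binary classification data.
    """
    labels = []
    for t, p in zip(actual, predicted):
        error = (t - p) / t  # assumed sign convention
        if abs(error) <= min_abs_error:
            labels.append(None)
        else:
            labels.append(1 if error > 0 else 0)
    return labels


def standard_point_labels(points):
    """Label houses as higher (1) or lower (0) than the empirical mean."""
    mean = statistics.fmean(points)
    return [1 if s > mean else 0 for s in points]
```

Dropping the sales close to zero error leaves two groups that differ more clearly, at the cost of fewer training images.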
Figure 3.11: Histogram of the percentage error
Figure 3.12: Histogram of the standard point

To evaluate and score the binary classifier, an area under the curve (AUC) score is generated for each model to compare their ability to quantify the desired and undesired features. The AUC score is obtained from the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). The equations for TPR and FPR can be seen below.

\[ \mathrm{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{3.1} \]

\[ \mathrm{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} \tag{3.2} \]

A common method to calculate the AUC is the trapezoidal rule. This rule approximates the area under the ROC curve by dividing it into multiple trapezoids, bounded by vertical lines at the FPR values, with heights given by the TPR values. The area is then calculated by summing the areas of these trapezoids [101]. During this project, the scikit-learn roc_auc_score [102] implementation is used to obtain the AUC score. Because this method was added late in the project, its results are not assessed through the AVMs but on their own, against the validation data.

AUC is a way to score how well the model distinguishes the classes. Generally, a score of 1 is considered perfect class separation, and a score of 0.8 is regarded as good separation, whereas a score of 0.5 shows no ability to separate the groups [101].

3.10.2 Clustering

The second method used for visual feature extraction is clustering. In this method, a pre-trained model generates a high-dimensional vector that is then used to find clusters of images based on similar visual traits. The aim here is to find visually interpretable clusters that relate to architectural qualities. Multiple models are used to generate these high-dimensional vectors.
Firstly, the ViT Base model with a patch size of 14 provided by the DINO v2 project [103], [104], which has been trained on ImageNet [51] using a self-supervised approach, is used. Secondly, the self-supervised models trained within this thesis on each room type are tried. Finally, two CNN models, VGG and ResNet50, pre-trained on the Places365 dataset are used. The resulting high-dimensional vectors are normalised and clustered with the K-Means algorithm based on the Euclidean distance seen below:

\[ \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2} \tag{3.3} \]

In this equation, A and B are vectors in n-dimensional space, and A_i and B_i indicate the components of the A and B vectors, respectively. This process is repeated with a range of different numbers of clusters (k-values) between 2 and 10. This range is chosen to look for larger, more significant clusters and to leave room for assessing more models within the thesis.

What counts as a good indicator of well-divided clusters may vary depending on the project framework, domain, and data. For this project, the Davies-Bouldin score is used as the primary indicator to determine the best number of clusters to focus on. It measures how compact and well-separated the clusters are [105] and is used as an indicator for which clusterings to explore further.

To select the cluster features to be extracted and included in the AVM, a visual inspection is performed with the aim of understanding the clusters visually and relating them to our interpretation of the architectural qualities. This involves randomly sampling five images from each cluster, a process repeated three times to ensure a diverse representation of the perceived quality found in the images. This repetition lowers the chance that the perceived characteristics are found simply by chance.
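To illustrate how the Davies-Bouldin score rewards compact, well-separated clusters, the following sketch computes it in plain Python from a set of vectors and their cluster labels, using the Euclidean distance of Equation 3.3. It follows the standard definition of the score (the average, over clusters, of the worst ratio between within-cluster scatter and centroid distance), where lower values indicate better clusterings. The function names are hypothetical, and the thesis itself does not state which implementation was used.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equally long tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(points, labels):
    """Davies-Bouldin score for a clustering; lower means better separated."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(tuple(point))
    centres = {k: centroid(v) for k, v in clusters.items()}
    # Scatter: mean Euclidean distance of a cluster's points to its centroid.
    scatter = {
        k: sum(math.dist(p, centres[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

In practice the score would be computed once per k between 2 and 10, and the k-values with the lowest scores would be the ones inspected visually.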
3.10.3 Contrastive Language-Image Pre-Training

CLIP is a model developed by the OpenAI team that has been trained to tie the ViT encodings of images to the text encodings of their descriptions with the help of cosine similarity, which has earlier been used within the field of Natural Language Processing (NLP) to match document types [10]. The cosine similarity returns a score between minus one and one, representing the similarity between vectors A and B, as shown below, where ∥X∥ denotes the Euclidean norm of a vector X.

\[ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \tag{3.4} \]

In this study, the CLIP model is used to score the architectural qualities by providing a textual description and then matching it with the images in a zero-shot fashion. A positive and a negative version of the targeted feature are used to provide a range of results. Figure 3.13 shows an example of extracting a score for how easy it is to move around in a room by matching a textual description of spaciousness with the room type image. This involves feeding the two sentences to the text encoder and saving the positive score minus the negative score as the score for the feature.

Figure 3.13: Use of the CLIP model to generate spaciousness score [106]

These text versions of the positive and negative architectural qualities are primarily based on examples found in the literature, with some additions from our understanding of what is considered desirable and undesirable within the room types.

3.11 Utilising Visual Features in the Model

After the corresponding model has extracted the visual features, these features are labelled with a name that ties them to the model and the targeted feature. They are then added to the base features, together with the other model-specific features, to form the enriched feature set. When multiple images are associated with the same room type, the scores of numerical visual features are averaged.
This is a typical approach when using multiple images connected to the same feature [13], [16]. However, in the case of a categorical variable, the presence of the feature in one image is enough for it to be valid for the entire sale. For example, one bathroom image with a bathtub is enough for the "bathroom has bathtub" feature.

When images of a room type are missing, which results in missing features, the average of each feature is used to fill in the missing values. This method is also used in a similar study, and it is a common way of handling missing values without completely excluding the rows [4]. This is only done when a sale has at least one visual feature; sales with only non-visual features are excluded.

3.12 Automated Valuation Model

A crucial part of our study is the AVM, which is used to measure the importance of our visual features in relation to the price. Within this thesis, three base models are chosen due to their previous use as baseline models and the ease of extracting the importance of the input features.

An advantage of linear regression models and XGBoost [23] is that they are typically used as baselines. This is primarily because they can also provide a feature importance that shows the weight or importance of a feature [34].

Firstly, Ridge regression is a linear model that uses a regularisation technique to improve the model's accuracy. A penalty term proportional to the square of the coefficients is added to the loss function. This controls the model's complexity and reduces its tendency to overfit, so the model generalises better. Ridge regression is chosen over ordinary linear regression in our case because it reduces the risk of overfitting on data with high variance. However, both share the same fundamental basis, the linear assumption, and both can be used to show the weight of each feature.

Secondly, XGBoost is an implementation of gradient-boosted decision trees.
It aims to improve prediction accuracy by using an ensemble of trees [107], in which each new tree iteratively corrects the errors of the previous trees. A key difference from the linear methods is that it can capture more complex patterns in the data.

Lastly, neural networks, or DNNs, which can have one or multiple layers of neurons, process data through these layers to recognise patterns in the dataset. They can learn non-linear relationships, making them capable of finding more complex relations between the features. In the context of AVMs, they offer the benefit of being able to process complex inputs such as images as part of the same model, to pick up patterns that might affect the valuation of the properties.

Both of the neural network AVMs used are trained in five stages with the mean squared error loss (MSELoss) [108] shown below.

\[ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (t_i - p_i)^2 \tag{3.5} \]

where N represents the batch size, t_i is the actual or true value and p_i is the predicted value. The models are trained with an initial learning rate of 0.01 that is divided by ten after each run until the final run at 0.00001, with a batch size of 512, early stopping with a patience of 4, and a weight decay regularisation of 0.00046. The dropout rate in the model is halved between each iteration, starting at 10%.

In the selection of hyper-parameters for the AVMs, a grid search is performed to find the highest-scoring combination of parameters. This is achieved by running cross-validation on the training data with pre-decided ranges for the different parameters, where the parameter combinations are scored on the lowest MAPE achieved.

The hyper-parameter explored in this thesis for the Ridge model is the alpha (α) value, the constant that controls the regularisation strength by being multiplied with the L2 term.
For the neural network, the weight decay used for regularisation, the learning rate during the training stages, and the patience for early stopping are explored.

For the XGBoost model, the following hyper-parameters are explored to prevent overfitting: colsample_bytree indicates the fraction of features used per tree, learning_rate controls the training step size, max_depth sets the maximum tree depth, min_child_weight sets the minimum instance weight required in a child, n_estimators sets the number of trees, subsample uses a fraction of the data for each tree to generalise better, gamma sets the minimum loss reduction required for a split, which makes the model more conservative, and alpha applies L1 regularisation to the weights.

The resulting hyper-parameters used are described together with the models in the results.

3.13 Scores

The final analysis of this thesis focuses on the AVM error rates with and without the newly integrated features to capture the importance of the visual features for market value prediction.

One important rule when validating models is to use an out-of-sample prediction approach, where all models are scored on data that was not seen during the training stage, with the aim of capturing how well they would perform in a real-case scenario. Because several different models contribute to the final pool of visual features, the dataset is split once, in the pre-processing stage. Using this one shared split prevents model-specific splits from causing in-sample bias when extracting visual features.

The first metric is the Mean Absolute Error (MAE) [109], which highlights the average absolute size of the error. In our case, it shows the absolute amount of Swedish Krona (SEK) by which the predictions differ from the actual selling price. The formulation of MAE can be seen below, where p_i is the i-th prediction and t_i is the i-th actual value.
\[ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |t_i - p_i| \tag{3.6} \]

Another, more easily comprehensible error metric is the Mean Absolute Percentage Error (MAPE) [110], which shows how far off an estimate is on average as a percentage of the actual value. This makes it easier to get comparable results between regions with different prices. For example, a 200,000 SEK error on a 200,000 SEK property (100%) differs from a 200,000 SEK error on a 2,000,000 SEK property (10%). The formulation of MAPE can be seen below.

\[ \mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{t_i - p_i}{t_i} \right| \tag{3.7} \]

Furthermore, another metric is chosen to highlight the worst predictions in proportion. This metric is the MAPE on the 10% worst predictions, which shows how far off the model is on its worst cases.

Additionally, R2 is the coefficient of determination of a regression model. Its value shows the proportion of variance in the dependent variable, which is the selling price in our case, that is explained by the model [111].

Finally, the Median Error Rate [112] is used to highlight the centre of the errors. As the name suggests, it shows the median error. It is robust against outliers because extreme errors do not influence the score, unlike in the averaging metrics.

This results in five scores that highlight different aspects and give a broader picture of the differences before and after adding the visual features and of where improvements are made. However, the main focus of this thesis is the MAPE score due to its ease of interpretation across the different property types.

Lastly, the feature importance is assessed. In the case of Ridge regression, the features are ordered by the absolute value of their coefficients to focus on magnitude and not solely on positive features. For XGBoost, two common alternatives for feature importance are gain and weight. Gain indicates the contribution made by the feature, and weight indicates the frequency with which it is used [113].
This thesis focuses on improvements in the form of gain rather than usage to align with the goal of reducing the uncertainty.

4 Results

The following chapter shows the study's results. First, the outcomes of the pre-processing and the performance of the room classifier will be highlighted. Next, the results of the self-supervised attention will be compared visually. Then, the results of the visual feature extraction process will be displayed. Finally, the AVM score and the importance of the features of the models will be exhibited.

4.1 Classification of Images

The hierarchical classification approach to label the rooms began with separating the interior images. Different models were compared, leading to the selection of a ViTS14 with the pre-trained weights from DINO V2 [103], [104]. The model was trained using a cross-entropy (CE) loss [114], whose binary and multi-class forms can be seen below.

\[ \mathrm{CE}_{\text{binary}} = -\left( y \log(p) + (1 - y) \log(1 - p) \right) \tag{4.1} \]

In the binary version, y denotes the actual label, which is either 0 for false or 1 for true, and p represents the predicted probability that the label is true.

\[ \mathrm{CE}_{\text{multi-class}} = -\sum_{i=1}^{N} y_i \log(p_i) \tag{4.2} \]

In the multi-class version, N represents the number of classes, and y_i is a binary indicator that is 1 if class i is the correct class and 0 otherwise. Also, p_i denotes the predicted probability for class i.

The model was trained with an initial learning rate of 0.001 and fine-tuned with a learning rate of 0.0001. The head of this model is shown in Figure B.4 in Appendix B. After pseudo-labelling and re-training using the same loss function and step sizes, Figure 4.1 shows the final confusion matrix on the test dataset. It highlights an excellent ability to distinguish the classes, with a few instances where the model was confused.
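As a small numerical check of Equations 4.1 and 4.2, the sketch below evaluates both forms of the cross-entropy for a single example. It uses the natural logarithm and hypothetical function names, and it is meant only to illustrate the formulas, not the loss implementation used for training.

```python
import math


def binary_cross_entropy(y, p):
    """Equation 4.1 for one example: y is 0 or 1, p is P(label is true)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multiclass_cross_entropy(y_onehot, probs):
    """Equation 4.2 for one example with a one-hot label vector.

    Terms with y_i = 0 contribute nothing, so they are skipped.
    """
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y)
```

For two classes, the multi-class form with probabilities (1 - p, p) gives the same value as the binary form, which is why the two equations are consistent with each other.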
Figure 4.1: Confusion matrix of interior and exterior classifications

For the second labelling stage, the same ViT base model was used with a similar head, but ending with a Sigmoid function and one neuron per room type, as shown in Figure B.5 in Appendix B. It was trained with an initial learning rate of 0.001, which was lowered to 0.00001 during fine-tuning, an early stopping patience of 4, and a maximum of 100 epochs. Thereafter, the thresholds that indicate whether an image corresponds to a particular room type were determined. Table 4.1 shows the threshold selected for each room type, chosen as the one with the highest F1 score on the validation data. Consequently, these thresholds indicate a high F1 score and ability to differentiate the room types