Stocks vs. Bonds A Data-Driven Approach to Asset Allocation Using Machine Learning Master’s thesis in computer science and engineering Emil Hölvold Nermin Skenderovic Department of Mathematical Sciences CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2023 Master’s thesis 2023 Stocks vs. Bonds A Data-Driven Approach to Asset Allocation Using Machine Learning Emil Hölvold Nermin Skenderovic Department of Mathematical Sciences Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2023 Stocks vs. Bonds A Data-Driven Approach to Asset Allocation Using Machine Learning Emil Hölvold Nermin Skenderovic © Emil Hölvold, Nermin Skenderovic, 2023. Supervisor: Sebastianus Cornelis Jacobus Bruinsma, Department of Data Science and AI Advisor: Karl Larsson, Nordea Bank Abp Advisor: Fredrik Lundström, Nordea Bank Abp Examiner: Stefan Lemurell, Department of Mathematical Sciences Master’s Thesis 2023 Department of Mathematical Sciences Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: The cumulative return of the stock index MSCI World and the bond index Bloomberg Global Aggregate between 2001 and 2023. Typeset in LATEX Gothenburg, Sweden 2023 iv Stocks vs. Bonds A Data-Driven Approach to Asset Allocation Using Machine Learning Emil Hölvold Nermin Skenderovic Department of Mathematical Sciences Chalmers University of Technology and University of Gothenburg Abstract This thesis explores the application of supervised machine learning algorithms to asset allocation strategies with the aim of enhancing investment decision-making processes. Collaborating with Nordea, one of the leading financial institutions in the Nordics, the study was conducted at their Asset & Wealth Management department to investigate the potential of developing a machine learning model, with the goal of improving portfolio performance and reducing risk in the context of a dynamic and uncertain financial market environment. The research begins by analysing the current status of the field, examining the theo- retical foundations of asset allocation, and identifying the shortcomings of traditional approaches. Additionally, the thesis raises a nuanced view of quantitative investing, with an in-depth exposition of the most common pitfalls and their consequences. Building on this foundation and previous work, regression and classification algo- rithms are investigated together with premium financial data as potential solutions to overcome these limitations. Specifically, the Random Forest and XGBoost models are used to forecast movements for the upcoming month in a global stock and bond index. The signals generated by the models are then incorporated into a rule-based allocation model. The findings of this research suggest that machine learning techniques can offer valu- able insights and improved performance in asset allocation. The results highlight the potential of these models to identify leading indicators and exploit market inef- ficiencies, resulting in improved risk-adjusted returns. The best-performing model achieved an alpha of 2.05% during the backtest between 2020 and 2023, accompanied by an increase in Sharpe ratio and a decrease in volatility. However, it is important to note that the effectiveness of machine learning algorithms is heavily dependent on the quality and availability of data, as well as the appropriate selection and calibration of model parameters. Financial markets are dynamic and subject to various factors, so ongoing adjustments are necessary to adapt to changing market conditions and mitigate risks. Keywords: asset allocation, machine learning, quantitative investing, regression, classification, algorithms, Random Forest, XGBoost, leading indicators, risk-adjusted returns. v Acknowledgements We would like to express our sincere gratitude and appreciation to the following in- dividuals who have made invaluable contributions to the completion of this master’s thesis: First and foremost, we would like to extend our deepest thanks to our supervisors at Nordea, Karl Larsson and Fredrik Lundström. Their guidance, expertise, and contin- uous support throughout this project have been instrumental in its success. Their insightful feedback, constructive criticism, and relentless pursuit of improvement have profoundly influenced our understanding, guided us through the challenges we encountered, and propelled this project beyond what we could have achieved otherwise. Furthermore, we acknowledge the guidance and assistance provided by Sebastianus Cornelis Jacobus Bruinsma, our supervisor at Chalmers. His tireless dedication to supporting us and willingness to allocate time for discussions when we encountered academic-related questions have been truly invaluable. His insightful feedback and suggestions have significantly enhanced the quality of this report. We also express our gratitude to Antti Saari, the Nordea manager, for providing us with the opportunity to undertake this thesis within the organisation. The resources provided by Nordea have made all the difference between success and failure in this research. We are also thankful for our examiners, thanking Johan Jonasson for believing in this project from the first day and for his work reviewing this report. The constructive feedback and valuable insights have helped refine our research. We are also thankful to Stefan Lemurell who could step in during the final part of the project and for being helpful in taking this project to completion. Finally, we express our heartfelt appreciation to all those who have supported us during the course of this project, including our friends and family, whose unwavering encouragement and belief in our abilities have been a constant source of motivation. Emil Hölvold, Nermin Skenderovic, Gothenburg, 2023-06-19 vii Glossary Alpha: A term used to describe an investment strategy’s ability to beat the market, or the relative performance to a benchmark. Asset: An asset is anything of value that has the potential to generate future eco- nomic benefits, including stocks and bonds. Asset allocation: “Asset allocation is an investment strategy that aims to balance risk and reward by apportioning a portfolio’s assets according to an individ- ual’s goals, risk tolerance, and investment horizon.” [1] Basis point: Equivalent to 0.01%. The smallest measure used in quoting yields and interest rates. Bias (prediction bias): An error that occurs due to erroneous assumptions in the learning algorithm. Bond: A depth instrument issued by a government, municipality, or corporation to raise capital, where the issuer pays interest to the bondholder over a specified period. Boosting: A technique in machine learning that iteratively combines multiple weak classifiers into a single strong classifier by applying weights to misclassified samples. Classification: A supervised learning task where the goal is to assign input data to a specific category or class. It involves learning a mapping function from input features to discrete output labels. Data leakage: when future information is used to predict past events. Feature: A feature is an input variable or attribute that is used to make predictions or model a target variable. It is also sometimes called an independent variable. Fixed-target portfolio: A portfolio that is rebalanced to keep the proportion be- tween assets fixed. Label: The correct answer or result for a given data point. Overfitting: When a model is created that matches the training data too closely, resulting in a model that fails to make correct predictions on new data. Portfolio: Refers to a collection of assets, such as stocks and bonds held by an individual or entity to generate income and/or achieve long-term financial goals. Regression: A supervised learning task focusing on predicting continuous numeri- cal values rather than discrete classes. It involves learning a mapping function from input features to a continuous output variable. Stock: A unit of equity ownership in the capital stock of a corporation Strong learner: A model capable of achieving arbitrarily good accuracy by learn- ing complex patterns and relationships in the data. Supervised learning: A type of training method in which the model is trained on predetermined labels. Target variable: Also known as the dependent variable, is the variable that is being predicted or modelled by the machine learning algorithm. It is the output variable or the response variable. Weak learner: A simple model that gives better results than a random prediction in a classification problem or the mean in a regression problem ix x Contents Contents xi List of Figures xv List of Tables xvii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Asset Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.2 Nordea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.1 Portfolio Allocation Targets . . . . . . . . . . . . . . . . . . . 5 1.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.6.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.6.2 Machine Learning in Finance . . . . . . . . . . . . . . . . . . 7 2 Data 9 2.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Preprocessing & Cleaning . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.1 Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1.1 Value Difference . . . . . . . . . . . . . . . . . . . . 11 2.3.2 Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.3 Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.4 Exponential Moving Average (EMA) . . . . . . . . . . . . . . 11 2.3.5 Sharpe Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.5.1 Sharpe Ratio Difference . . . . . . . . . . . . . . . . 12 2.3.6 Sortino Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.6.1 Sortino Ratio Difference . . . . . . . . . . . . . . . . 12 2.3.7 Maximum Drawdown (MDD) . . . . . . . . . . . . . . . . . . 13 2.3.8 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.9 First Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.10 Second Derivative . . . . . . . . . . . . . . . . . . . . . . . . . 14 xi Contents 2.3.11 Z-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.11.1 Z-Score Difference . . . . . . . . . . . . . . . . . . . 14 2.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Machine Learning Fundamentals 17 3.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.3 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . 20 3.1.4 eXtreme Gradient Boosting (XGBoost) . . . . . . . . . . . . . 22 3.2 Model Evaluation Strategies . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.1 Train-Validation-Test Split . . . . . . . . . . . . . . . . . . . . 22 3.2.2 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3.1 Model-Free Blackbox Tuning Methods . . . . . . . . . . . . . 25 3.4 Backward Feature Elimination . . . . . . . . . . . . . . . . . . . . . . 26 3.5 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5.1 Coefficient of Determination (R2) . . . . . . . . . . . . . . . . 26 3.5.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4 Methodology 29 4.1 Development Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Model Input Processing . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2.1 Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2.2 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2.3 Filling Missing Data . . . . . . . . . . . . . . . . . . . . . . . 30 4.2.4 Shifting independent variables . . . . . . . . . . . . . . . . . . 30 4.2.5 Train-Validation-Test Split . . . . . . . . . . . . . . . . . . . . 31 4.2.6 Numpy Transformation . . . . . . . . . . . . . . . . . . . . . . 32 4.3 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.2 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4 Feature Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5.1 Prediction Models . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.5.2 Allocation Models . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.6 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5 Results 37 5.1 Prediction Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.1.1 Prediction Regressors . . . . . . . . . . . . . . . . . . . . . . . 37 5.1.2 Prediction Classifiers . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 Allocation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.2.1 Allocation Regressors . . . . . . . . . . . . . . . . . . . . . . . 41 5.2.2 Allocation Classifiers . . . . . . . . . . . . . . . . . . . . . . . 41 5.3 Rebalancing day . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.4 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 xii Contents 6 Conclusion 45 6.1 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.1.1 Prediction Models . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.1.1.1 Prediction Regressors . . . . . . . . . . . . . . . . . 45 6.1.1.2 Prediction Classifiers . . . . . . . . . . . . . . . . . . 46 6.1.2 Allocation Models . . . . . . . . . . . . . . . . . . . . . . . . . 46 6.1.2.1 Allocation Regressors . . . . . . . . . . . . . . . . . . 46 6.1.2.2 Allocation Classifiers . . . . . . . . . . . . . . . . . . 47 6.1.3 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . 47 6.1.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6.3 Social and Ethical Aspects . . . . . . . . . . . . . . . . . . . . . . . . 50 Bibliography 51 A Appendix 1 I A.1 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I xiii Contents xiv List of Figures 1.1 The annual returns of the two indices studied in this project, with the four most exceptional years labelled in the graph. 2002 was a year when the economy was still struggling to recover from the reces- sion that had started in 2001. The uncertainty in the economy led investors to seek the relative safety of bonds, driving up bond prices and depressing stock prices. The following year, 2003, the economy was recovering, contributing to the rise of stocks and bonds. In 2008, the world went into a severe worldwide crisis caused by the United States housing bubble bursting. As stocks plunged, investors shifted their money into lower-risk government bonds, increasing their price. In 2021 global bonds slumped into their first bear market in a gen- eration, under pressure from central bankers determined to quash inflation caused by two years of expansionary fiscal policy during the COVID-19 pandemic. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 MSCI World (Year-Over-Year) vs. US ISM PMI index, with a corre- lation of 0.739. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.1 Example of how a regression tree can be used to fit a model to con- tinuous non-linear data. Each leaf of the tree is labelled with a value, which is the output of the model. . . . . . . . . . . . . . . . . . . . . 18 3.2 The Random Forest will build N unique decision trees that will each make a prediction. Random Forest is a strong learner constructed of many smaller decision trees, known as weak learners. . . . . . . . . . 19 3.3 Gradient Boosting learning curve. Figure illustraded by Aratrika Pal [39]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 When developing a model for time series forecasting, it is important to remove future data and gap samples from the training set to avoid data leakage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.5 When developing a model for time series forecasting, it is important to remove future data and gap samples from the training set to avoid data leakage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 xv List of Figures 3.6 Comparison of grid search and random search minimising a function with one important and one unimportant parameter. However, when both parameters have a large impact on the result, grid search usually performs better. This figure is based on the illustration by Bergsta and Bengio [44]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1 Shifting the independent (X) features. . . . . . . . . . . . . . . . . . 30 4.2 Train-Validation-Test Split. The models are trained on the training set, hyperparameter tuning is performed using the validation set, and the models’ generalisation capabilities are tested on the test set. . . . 31 4.3 Feature importance from the initial Random Forest model trained on over 1200 features. However, due to the presence of noise, the number of features used in the final models were significantly reduced by up to 90%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.4 The backtested performance of MSCI Europe between 1995 and 2014 is almost twice as good when only considering the investment universe of 2014, a common error in quantitative investing. Graph produced by Yin Luo [20]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.1 Prediction results for the regressor models on the test set. . . . . . . 38 5.2 Prediction results for the classifier models on the test set. . . . . . . . 39 5.3 Allocation results for the final models. . . . . . . . . . . . . . . . . . 40 5.4 Predicted Risk-Adjusted Returns (PRED_RaR) from the XGBoost regressor and realised Risk-Adjusted Returns (RaR) from the Stock and Bond index. PRED_RaR has been a core part of the regressor version of the rule-based allocation model. . . . . . . . . . . . . . . . 41 5.5 Rebalancing day matters. The total return of the Random Forest and XGBoost allocation models can vary by 8.15 and 11.51 percentage points, respectively, depending on when the portfolio is rebalanced. . 42 5.6 Both the XGBoost and Random Forest Stock classifiers were trained on the same 150 features. The Bond classifiers, however, reached optimal performance on slightly different features, hence Figure 5.6b contains an additional 40 features compared to Figure 5.6a. . . . . . . 43 6.1 Optimal portfolio (with actual instead of predicted returns) for the Risk-Adjusted Returns (RaR) based portfolio strategy. . . . . . . . . 47 xvi List of Tables 3.1 Random Forest Parameters. These hyperparameters were optimised for both the regressor and the classifier. . . . . . . . . . . . . . . . . . 20 3.2 XGBoost Parameters. These hyperparameters were optimised for both the regressor and the classifier. . . . . . . . . . . . . . . . . . . . 22 4.1 Random Forest default vs. best parameters. See Table 3.1 for param- eter definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2 XGBoost default vs. best parameters. See Table 3.2 for parameter definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.1 Prediction performance (R2-Score) of the regressor models for the Stock and Bond index (best values in bold). . . . . . . . . . . . . . . 38 5.2 Prediction performance (Accuracy) of the classifier models for the Stock and Bond index (best values in bold). . . . . . . . . . . . . . . 39 5.3 Allocation performance of the studied models (best values in bold). . 40 5.4 The Stock index’s top ten features, combined from the two classifier models used to forecast the direction of the monthly returns. . . . . . 43 5.5 The Bond index’s top ten features, combined from the two classifier models used to forecast the direction of the monthly returns. . . . . . 44 A.1 This table presents the time series data used, including Stock and Bond indices, risk-premium data, interest rates, sentiment surveys, and indicators on business, geopolitical and financial conditions. The data provides a comprehensive view of the financial spectrum covering various aspects such as market performance, risk assessment, and macroeconomic indicators. . . . . . . . . . . . . . . . . . . . . . . . . I xvii List of Tables xviii 1 Introduction Harry Markowitz published his landmark article Portfolio Selection 70 years ago, which ended up winning him the Nobel Memorial Prize in Economic Sciences in 1990 [2]. This groundbreaking theory empowered generations of academics and practitioners to pursue asset allocation, and it remains one of the most important strategies an investor can employ to increase returns and reduce the portfolio’s overall volatility [3]. Since then, there have been many attempts to extend and enhance the work done by Markowitz, and yet despite all this progress, to this day, few would argue that asset allocation is an easy task. However, with recent success within the field of machine learning and the increased availability of data, some of the key challenges in portfolio theory might be close to a solution. 1.1 Background The biggest challenge for every investor is how to increase returns while mitigating risk continuously. There is consensus within the industry about the importance of asset allocation, but looking at, e.g. the years 2002 and 2021 in Figure 1.1, fixed- target portfolios are not always optimal, as a dynamic portfolio overweight in bonds in 2002 and then overweight in stocks in 2021 would have yielded a higher return. In addition, the portfolio optimisation model (mean-variance analysis) suggested by Markowitz has limited impact in practice due to estimation issues when applied to real data. Today, the financial market is one of the largest big data generators, with multiple terabytes being generated daily. Although this is being exploited to some extent by skilled financial institutions, there are still decisions based on purely classical financial theory, some of which have many drawbacks, argued by financial professor Marcelle Chauvet, among others, in [4], [5]. 1 1. Introduction Figure 1.1: The annual returns of the two indices studied in this project, with the four most exceptional years labelled in the graph. 2002 was a year when the economy was still struggling to recover from the recession that had started in 2001. The uncertainty in the economy led investors to seek the relative safety of bonds, driving up bond prices and depressing stock prices. The following year, 2003, the economy was recovering, contributing to the rise of stocks and bonds. In 2008, the world went into a severe worldwide crisis caused by the United States housing bubble bursting. As stocks plunged, investors shifted their money into lower-risk government bonds, increasing their price. In 2021 global bonds slumped into their first bear market in a generation, under pressure from central bankers determined to quash inflation caused by two years of expansionary fiscal policy during the COVID- 19 pandemic. 2 1. Introduction 1.1.1 Asset Allocation Asset allocation involves dividing your investments among different asset classes and is a critical consideration for constructing successful portfolios and assessing their risk exposure. Risk is defined in financial terms as the probability that an outcome or the actual gains of an investment will differ from an expected outcome, including the possibility of losing some or all of an original investment [6]. Risk is often measured by volatility and calculated according to Equation 1.1. σN = √∑N i=0(xi − µ)2 N · √ T where: N = the size of the population T = number of periods in the time horizon xi = each return from the time horizon µ = average return from the time horizon (1.1) A fundamental idea in finance is the relationship between risk and return. As risk increases, investors seek higher returns to compensate for taking such risks, called risk premium. Financial asset classes are a grouping of investments with similar characteristics, such as equities (e.g. stocks), fixed income (e.g. bonds), real estate, commodities, and currencies. The two most common financial asset classes are stocks and bonds, and they are therefore the focus of this thesis. Historical data shows that stocks are more volatile than bonds and are therefore considered the riskier asset among the two. Bonds are less volatile, known for their hedging characteristics and low correlation with the stock market, and due to these less risky attributes, it is well known that a diversified portfolio should include both [7]. Within these asset classes, there are further subcategories. However, these are beyond the scope of the thesis and are not explained in further detail. When managing risk in a portfolio that includes these two asset classes, the most common approach is to adjust the allocation between stocks and bonds to balance risk exposure. For example, an investor with high risk tolerance might have a fixed- target portfolio with 80/20 stocks and bonds to gain greater exposure to the more volatile stock market, with the chance of higher returns and the risk of greater losses. Meanwhile, an investor with a low risk tolerance might instead have a fixed-target portfolio with 20/80 stocks and bonds to decrease potential risk, with the knowledge of maybe missing out on greater returns. However, as seen in Figure 1.1, bonds have experienced negative annual returns three times since 2001, while stocks have been rising, causing the most vulnerable investors to lose money on the investment that was considered safer. As of early 2023, no better conventional strategy than these fixed-target portfolio allocations has been reported in the literature, and with 2022 being yet another year where bonds have failed to hedge the investors’ savings, this topic is again trending. 3 1. Introduction 1.1.2 Nordea Nordea is the largest bank in the Nordics, with a market capitalisation of more than SEK 400bn and operations mainly in Sweden, Norway, Denmark, and Finland. The bank’s core business areas include personal banking, business banking, large corporates & institutions and asset & wealth management. Asset & wealth manage- ment offers savings and investment products and manages the accumulated wealth of customers with a total asset under management (AuM) of EUR 359bn in Q4 2022 [8]. Within Nordea Asset & Wealth Management, the thesis was carried out in col- laboration with the two teams, House View and Quantitative Solutions & Equity Research, in Investments. Alongside research papers, Investments provides the of- ficial strategic and tactical Nordea market views and model portfolios for retail clients. Currently, the tactical views published by House View are mainly based on economic models and the experienced opinion of investment strategists. However, developing deeper quantitative models could potentially enhance customer returns and the bank’s competitive position. 1.2 Aim The aim is to investigate how the predictive performance of gradient boosting algo- rithms translates into the field of asset allocation, as well as to examine whether an increased amount of premium data increases the predictive performance of machine learning models used in previous research for asset allocation. The hope is also that this report can increase transparency in the industry regarding investment strategies with comprehensive information on methods, models, and data. 1.3 Goals The main goal of this thesis is to investigate how machine learning can be used to dynamically optimise a virtual portfolio of stocks and bonds for the upcoming month. As a proxy for the stock and bond market, two indices are used. The MSCI World Index (hereafter referred to as the Stock index) for stocks and the Bloomberg Global Aggregate Index (hereafter referred to as the Bond index) for bonds. The Stock index covers 85% of adjusted free-float market capitalisation in each included country, while the Bond index tracks investment-grade fixed-rate issuances from high-income corporations and countries spanning the globe. The coverage of these two indices replicates a well-diversified portfolio that spans most major sectors and developed countries in the global economy [9], [10]. The objective is to create a machine learning-based portfolio that will outperform a benchmark portfolio with a fixed-target allocation of 50% stocks and 50% bonds without increasing risk. The choice of a 50/50 fixed-target portfolio is significant as it represents a diversified portfolio, which is a fundamental concept in investment theory, dating back to Harry Markowitz and his Modern Portfolio Theory [2]. Using 4 1. Introduction this benchmark allows for meaningful comparisons and evaluation of the machine learning-based portfolio’s effectiveness [11]. This problem statement will be approached by analysing the state of the economy with the help of financial data, see Section 2, and supervised learning algorithms, see Section 3.1, to find leading indicators for the stock and bond market, respectively. The portfolio allocation is re-evaluated monthly depending on future prospects for the various asset classes, and the proportion of stocks and bonds is set accordingly. The problem is studied from a global perspective since the major financial markets historically have a strong correlation. The goal is divided into two smaller and more manageable sub-problems, according to the list below. 1. Find leading indicators for the Stock and Bond index using machine learning algorithms such as Random Forest and gradient boosting. 2. Predict the allocation of stocks and bonds that will outperform its benchmark with a lower or equal amount of risk for the upcoming month, see all targets in Section 1.3.1. By taking on these challenges, the goal is to improve the process and quality of the tactical level 1 bet, i.e. weightings between stocks and bonds, at Nordea. This approach works as a bridge between classic portfolio theory and state-of-the-art machine learning. Increased precision in stock-and-bond-market forecasting helps investors outperform the market while reducing risk due to the diversification that decreases volatility in the portfolio. 1.3.1 Portfolio Allocation Targets The performance of the models is evaluated monthly and measured in relative return compared to the targets listed in increasing difficulty below. The portfolio allocation is rebalanced monthly to ensure that decisions are based on the most recent available data. This provides a measurement that explicitly indicates how well the models perform relative to the current strategies used by Nordea. 1. The main target is to outperform an equal-weighted fixed-target portfolio on a monthly basis with less or equal risk. 2. The second target is to outperform the current Nordea allocation strategy. Success is already achieved if a model reaches the first target since this proves that machine learning can be used to achieve a higher-return portfolio allocation of stocks and bonds compared to a fixed-target portfolio without increasing the risk. If a model also reaches the second target, it would perform better than Nordea’s current strategy, which would be a great success. The last target contains classified information and is only evaluated internally and therefore is not presented in this thesis. 5 1. Introduction 1.4 Challenges Since the thesis topic touches on financial and computer science questions, there are several challenges in both areas. The modern portfolio theory by Harry Markowitz is still the most adopted strategy for asset allocation. Machine learning models have not yet become the industry standard. Therefore, there are no guarantees of successful results when evaluating the models. The second challenge is from a data perspective. Good data is almost always more important than a perfect model when working with any analysis. Data must be chosen carefully to avoid situations where the choice of training data limits a good model. The approach to this challenge is explained in more detail in Section 2. The third challenge is to avoid backtest overfitting, a common problem in math- ematical finance. This occurs when historical market data is used to develop an investment strategy and numerous variations of the strategy are tested on the same data set. Backtest overfitting is now recognised as a primary reason why quanti- tative investment models and strategies that appear promising on paper (based on backtests) often fail to perform as expected in practice [12]. The fourth challenge is to divide the data used to train the models, as they need to generalise well across all market conditions. There have been seven major financial crises that the world has witnessed in the last 100 years, and they have all come in various forms, such as event-driven, cyclical, or structural. 1.5 Limitations The machine learning models are only evaluated on the two asset classes, stocks and bonds. Therefore, predictions of other asset classes, specific regions, or individual financial instruments are beyond the scope of this thesis. The models are developed to fit within the limitations of the Nordea allocation mandate. Hence, no asset class can allocate more than 60% and no less than 40% of the portfolio. The models can not allocate money in cash, meaning that the proportion of stocks and bonds must sum up to 100%. The models can not use short selling or leverage. Only a frictionless market is considered. This means that trading occurs without transaction costs, taxes, restrictions, or impediments. Simulating a more realistic market with, e.g. transaction fees, can be interesting but is not relevant for this thesis, especially due to the low-frequency trading. Additionally, it is important to note that certain time series, such as BNP numbers, are prone to revisions after their initial release. This introduces the potential for data leakage when using such data to train the models. However, considering the constraints of time and resources, a systematic and sustainable approach to accessing 6 1. Introduction primary data is not pursued, as the anticipated impact on the results is deemed relatively minor due to the few time series that are affected. The final limitation is the lack of access to good, high-quality hardware. Only laptops with 11th Gen Intel i7 processors are available for this thesis. Not even a GPU, which is highly recommended when using more computationally heavy models, such as deep neural networks, due to the longer training times. Therefore, the models selected are partly chosen because they are lightweight and computationally efficient. 1.6 Literature The financial field is highly researched due to its impact on society and the possible returns of successful methods and models. The subfield of asset allocation is no different, with Harry Markowitz and James Tobin winning the Nobel Memorial Prize in Economic Sciences for their work on portfolio theory. However, finance is also a secretive area where most research remains proprietary at the institutes that develop them, as any publication would result in a lost edge to the market. 1.6.1 Previous Work Quantitative approaches built on machine learning models have previously been used to outperform the financial market. The predictive performance of Random Forest models has been consistent in recent years, achieving great results in financial areas such as algorithmic trading and stock analysis [13], [14]. A study conducted at Cornell University found that a Random Forest model could reduce risk while surpassing its benchmark by 3.4% annually [15]. In another study from Lund Univer- sity, the author found a model that outperformed its benchmark by 3.7% annually with a lower risk [16]. These findings highlight the potential of the Random Forest algorithm in improving investment strategies, even if they used unconstrained allo- cation, which the rational long-term investor tends to avoid, and prediction horizons very different from what is studied in this thesis. With gradient boosting outperforming Random Forest models in individual stock predictions in recent studies, the question remains as to whether performance trans- lates into other financial domains, such as asset allocation [17]. 1.6.2 Machine Learning in Finance Over the years, numerous publications have explored machine learning in finance. However, given the failure to translate backtest profits into real-world success, the most intriguing publications instead emphasise the methodology of developing a financial machine learning model. Marcos López de Prado, a renowned expert in quantitative finance and algorithmic trading, is the author of multiple well-cited books and papers on the subject. In the book Advances in Financial Machine Learning, he delves deep into the practical aspects of implementing machine learning models in finance, and he puts a significant 7 1. Introduction focus on the evaluation and validation of machine learning models in finance [18]. In the following book Machine Learning for Asset Managers, he provides a broad overview of machine learning techniques applied to the field of asset management [19]. While both books cover similar ground, Advances in Financial Machine Learning can be seen as a more specialised and advanced continuation of the concepts introduced in Machine Learning for Asset Managers. Another influential publication on the subject is the Deutsche Bank Quant Hand- book, whose second part has garnered widespread attention for its nuanced perspec- tives on quantitative investing [20]. Titled Seven Sins of Quantitative Investing, the paper, authored by Yin Luo and his team, offers a comprehensive and insightful anal- ysis of why machine learning models in finance frequently fail to achieve real-world success. The seven sins are listed below: 1. Survivorship bias: Backtesting or evaluating strategies, excluding compa- nies that have gone bankrupt, been delisted, or acquired. 2. Look-ahead bias: Unintentional incorporation of future information into the training and testing of predictive models 3. Storytelling: Making up a story ex-post to justify some random pattern 4. Data mining and data snooping: Repeatedly testing hypotheses or strate- gies on the same dataset, leading to potential overfitting and misleading results 5. Transaction costs: Ignoring transaction costs can lead to overestimated returns 6. Outliers: Ignoring or basing a strategy on a few extreme events can have an effect on how well the model generalises to future events. 7. Shorting: Taking a short position on cash products requires finding a lender, making it a non-implementable strategy for all stocks. 8 2 Data Studying leading indicators is a long-lived tradition in economic research, dating back to at least 1946 with the book Measuring Business Cycles by Burns and Mitchell [21]. This chapter covers a description of the data, how it has been preprocessed, the features extracted from it, and the initial analyses done. 2.1 Description The data used throughout the thesis were available through Refinitiv, one of the largest providers of premium financial market data and infrastructure. Refinitiv provides instant access to over 16 million economic indicators from 215 countries and regions dating back as far as 120 years [22]. In total, a set of 48 different time series were used and evaluated, see Appendix A.1. The time series covered various categories such as indices, risk-premium data, interest rates, sentiment surveys, and indicators on business, geopolitical, and finan- cial conditions. Some of the time series used were composite indicators built with multiple underlying variables. The time series were chosen at the discretion of the advisors at Nordea, Karl Larsson and Fredrik Lundström. Since most of the time series are provided by paid services, such as Refinitiv, they are rarely used in earlier work. Historical data and the work of Travis Berge have shown how some of these premium time series work as leading indicators of the economy and thus can provide an additional edge to earlier literature that only used stock market data such as price, volume, and volatility [17], [23]. Standard premium leading indicators are surveys measuring business climate, credit in the economy, and market sentiment. Local regions often have their own interpretation of these indicators. However, the American versions are commonly used due to their extended amount of historical data, the size of the surveys, and the proportion of American equity in global market cap. 2.2 Preprocessing & Cleaning Data preprocessing-and-cleaning is crucial in enabling efficient analysis and obtain- ing satisfactory model results. The time series used were measured in different units, 9 2. Data used different granularities, or were issued in different regions as they measured dif- ferent things. For example, some series used percentages (%), while others used basis points. Most series used the US dollar (USD) as currency, but some used Euro (EUR). Once the preprocessed data was downloaded, missing and unwanted values such as NaN values and leading zeros (default value indicating missing value) were removed. In addition, if possible and needed, some data was also cut off to guarantee that the time series started on January 1st to make comparisons between time series easier in a later stage. Another crucial filtering of the data was the cut-off point for the Stock and Bond index. Historical data for the Stock and Bond indices were available dating back to 1970 and 1991, respectively. However, the granularity of this data was limited to monthly intervals. It was not until 2001 for the Stock index and 1999 for the Bond index that the granularity increased to daily intervals. Therefore, to simplify the development process, only data starting from 2001 was used. 2.3 Feature Engineering Feature engineering involves transforming raw data into a set of features that are utilised to train and enhance the performance of machine learning models. This enables extracting crucial information, identifying patterns, and computing domain- specific features. The feature engineering process is crucial to the success of machine learning models, as the quality of the features used to train the models can signifi- cantly impact performance. In this case, the raw data was the time series value at a particular date. For each time series and feature category (described below), rolling windows of one month, one quarter, and one year were calculated. The window lengths were chosen partly on the recommendation of the Nordea advisors, but also because of the periodic nature of the financial market. Financial data is namely most often presented in a daily, monthly, quarterly, and yearly matter. Because the different time series varied in granularity, the window frames were only computed if they were compatible. For example, it does not make sense to compute a monthly feature on a time series with only quarterly data. Another issue with time-series feature engineering was the initial missing data. For example, a feature measured over one year generally needs one year of data before the first data point can be computed. The options were to either not use this initial period at all, backfill with the next occurring future value, or fill with some default value. To avoid losing more data than necessary and data leakage by backfilling, the elected method was to use a growing rolling window that starts at length one and grows until its intended length. Default values were not used as no appropriate value could be found. In total, circa 1200 features were computed. The various feature categories are presented in further detail below. 10 2. Data 2.3.1 Value The value represents the observation recorded at a specific time for each individual time series. The granularity of these values can vary, ranging from daily, monthly, to quarterly, depending on the specific time series. It is important to note that the values are always organised in a temporal order. 2.3.1.1 Value Difference The value difference is computed by subtracting the previous value from the current value over a rolling window, see Equation 2.1. Value Difference = Current Value− Previous Value. (2.1) 2.3.2 Returns Return is the relative change between two values and is calculated according to Equation 2.2. The return feature was central in this thesis since it was the target variable (more on this in Section 4), and it was also used to calculate other features. Return = Current Value− Previous Value |Previous Value| . (2.2) 2.3.3 Volatility Volatility refers to the degree of variation or fluctuation in the price or value of a financial instrument (or market) over time. Volatility is often used as a measure of risk, since a highly volatile asset is considered riskier than a less volatile one. High volatility implies greater potential for both gains and losses, while low volatility implies relative stability with smaller potential gains or losses. Volatility is typically measured using standard deviation, see Equation 1.1, which indicates the extent to which the values in a data set deviate from the mean. 2.3.4 Exponential Moving Average (EMA) Exponential Moving Average (EMA) is a moving average that is often used to help identify trends and potential changes in the direction of an asset [24]. It is similar to the simple moving average (SMA), which is calculated by taking the mean of a specified number of data points for a given period of time. EMA differs by applying a smoothing factor, α, which will weigh recent data more heavily than older data and is calculated according to Equation 2.3. EMA = value · α 1 + window_size + EMA−1 · ( 1− α 1 + window_size ) where: α = 2. (2.3) 11 2. Data 2.3.5 Sharpe Ratio Sharpe ratio is a measure of risk-adjusted return used to evaluate the performance of an investment or portfolio. It was first proposed by Nobel laureate William F. Sharpe in 1966 under the name reward-to-variability ratio and is widely used by investors to compare the risk-adjusted returns of different investments [25]. The Sharpe ratio is calculated by subtracting the risk-free rate of return from the investment’s return and dividing the result by the investment’s standard deviation, see Equation 2.4. Values above 1 are generally considered good [26]. Sharpe Ratio = Rp −Rf σp where: Rp = Expected return of the investment Rf = Risk-free rate of return σp = Standard deviation of the investment’s returns. (2.4) For this thesis, J.P. Morgan Global Aggregate Bond Index was used as a risk-free asset. The reason is that it’s backwards-looking instead of forward-looking, such as the US 10-year treasury bond yield. 2.3.5.1 Sharpe Ratio Difference The Sharpe Ratio Difference is simply computed by subtracting the previous value from the current value. Sharpe Difference = Current Sharpe− Previous Sharpe. (2.5) 2.3.6 Sortino Ratio Sortino ratio is very similar to the Sharpe ratio, but instead focuses only on the downside risk [27]. The only difference in the calculation is to divide by the standard deviation of the downside returns instead of the returns. See Equation 2.6. Sharpe Ratio = Rp −Rf σd where: Rp = Expected return of the investment Rf = Risk-free rate of return σd = Standard deviation of the downside. (2.6) 2.3.6.1 Sortino Ratio Difference The Sortino ratio difference is simply calculated by subtracting the previous value from the current value. Sortino Difference = Current Sortino− Previous Sortino. (2.7) 12 2. Data 2.3.7 Maximum Drawdown (MDD) Maximum drawdown (MDD) is a measure of the largest percentage drop in the value of an investment or portfolio from a previous peak to a subsequent trough. It is used to assess the risk of an investment and to evaluate the historical performance of an investment or portfolio [28]. The maximum drawdown is calculated by taking the difference between the highest value of the investment or portfolio and the subsequent lowest value, divided by the highest value. This provides a measure of the percentage decline in value from a peak to the subsequent lowest point, known as a trough, see Equation 2.8. MDD = Trough Value− Peak Value Peak Value . (2.8) 2.3.8 Correlation Correlation is a statistical technique that is used to measure the strength and direc- tion of the linear relationship between two quantitative variables. In other words, correlation measures the extent to which the values of one variable are associated with the values of another variable. The correlation coefficient takes a value between -1 and +1. • A correlation coefficient of -1 indicates a perfect negative correlation, where the two variables move in opposite directions (i.e. as one variable increases, the other decreases). • A correlation coefficient of +1 indicates a perfect positive correlation, where the two variables move in the same direction (i.e. as one variable increases, the other also increases). • A correlation coefficient of 0 indicates that there is no correlation between the variables. The correlation is computed between the returns of the series and the returns of the target series (Stock or Bond), see Equation 2.9. Corr = ∑(xi − x̄)(yi − ȳ)√∑(xi − x̄)2 ∑(yi − ȳ)2 where: xi = Values of the x-variable in a sample x̄ = Mean of the values of the x-variable yi = Values of the y-variable in a sample x̄ = Mean of the values of the x-variable. (2.9) 2.3.9 First Derivative In order to quantify the trend direction and magnitude, a regression line is fitted to the series returns. From the fitted regression line, the slope coefficient is extracted, 13 2. Data see Equation 2.10. y = kx + m =⇒ y′ = k = First Derivative. (2.10) 2.3.10 Second Derivative In order to get a sense of how fast the rate of change itself is changing, a second derivative is calculated. The second derivative is computed by taking the relative change of the returns, i.e. “the returns of the returns”, see Equation 2.11. Second Derivative = Current Returns− Previous Returns Previous Returns . (2.11) 2.3.11 Z-Score The Z-score, also referred to as the standard score, is a statistical measure that quantifies the number of standard deviations the observed value deviates from the mean [29]. The Z-score is the number of standard deviations by which the value of a raw score is above or below the mean value and is calculated using Equation 2.12. For example, if the Z-score equals 0, it indicates that the data point’s score is identical to the mean, while a Z-score of 1.0 would indicate a value that is one standard deviation above the mean. Z-score = x− µ σ where: x = Observed value µ = Mean of the sample σ = Standard deviation of the sample. (2.12) 2.3.11.1 Z-Score Difference The Z-score difference is simply computed by subtracting the previous value from the current value, see Equation 2.13. Z-score Difference = Current Z-score− Previous Z-score. (2.13) 2.4 Analysis A major part of developing machine learning models is understanding the data. Gaining insight into the fundamental information and relationships within the data can lead to identifying more effective features that can enhance the model’s perfor- mance. Therefore, the initial part of the thesis was spent exploring and visualising the data, using different statistical functions to gain useful insights. 14 2. Data Correlation, mentioned in Section 2.3.8, was an initial metric used to rule out any obvious relationships between the Stock and Bond index to the rest of the time series. Cross-correlation, similar to correlation, is a measure of similarity between two sets of time series as a function of the displacement of one relative to the other. Cross- correlation shows when the best match occurs as correlation is calculated over differ- ent time shifts. Cross-correlation is mainly used in portfolio management to measure the degree of diversification among the assets contained in the portfolio. This thesis used it in an attempt to find time shifts that would make a time series a potential leading indicator for the Stock or Bond index. Windowed cross-correlation was also used, which is a variant of cross-correlation but with fixed time windows of correlation sliding across the time series. Autocorrelation is equivalent to correlation when correlating a time series with itself. It was used in an attempt to find periodical patterns in the data. The Granger causality test was also used, which is a statistical hypothesis test created by Nobel laureate Sir Clive Granger to determine whether a time series is useful for forecasting another. When studying market data, one must consider absolute and relative changes in value, as momentum greatly impacts an asset’s returns. There are many ways to find the current momentum of an asset, for example, Year-Over-Year (YoY) growth. YoY growth allows for gauging an asset’s financial performance over time, whether it is improving, static, or worsening. For the analysis, YoY growth was used to visualise the time series from a new perspective as well as potentially finding new, lesser obvious correlations, see Figure 2.1. Figure 2.1: MSCI World (Year-Over-Year) vs. US ISM PMI index, with a correlation of 0.739. However, despite much time being invested into data exploration and analysis, little 15 2. Data was gained in terms of useful knowledge for the features or models moving forward, with the exception of the correlation feature. 16 3 Machine Learning Fundamentals This chapter covers the fundamental theory and terminology as well as an introduc- tion to the machine learning models and techniques used in the thesis. Furthermore, the performance metrics that are used to evaluate the quality of the models are introduced. 3.1 Models The thesis used supervised learning algorithms, which is a type of machine learning in which an algorithm is trained to learn the relationship between input variables (also known as features) and output variables (also known as labels or targets) by using a set of labelled training data. The models were constructed using regression (reg) and classification (clf) algorithms. Regression involves estimating the relationships between a dependent variable (tar- get variable) and one or more independent variables (feature variables). On the other hand, classification focuses on identifying the category or sub-population to which an observation belongs. The output of the models was predictions on how the indices would move, then used by a rule-based allocation model to create a portfolio allocation of Stocks and Bonds, with the aim of meeting all the targets stated in Section 1.3.1. The machine learning models that were developed and evaluated are as follows (in order of implementation). 1. Random Forest Regressor (RandomForestRegressor from sklearn.ensemble [30]) 2. Extreme Gradient Boosting Regressor (XGBRegressor from dmlc/xgboost [31]) 3. Random Forest Classifier (RandomForestClassifier from sklearn.ensemble [32]) 4. Extreme Gradient Boosting Classifier (XGBClassifier from dmlc/xgboost [31]) More complex and sophisticated models suited for time series data, such as Recurrent Neural Networks, were considered but not pursued due to their lack of interpretabil- ity and the limitations of available hardware (neural networks often require a GPU to complete training in a reasonable time). 17 3. Machine Learning Fundamentals 3.1.1 Decision Trees A decision tree is a decision support tool that uses a tree-like model of decisions with their possible consequences. Decision trees originated from the discipline of decision analysis but have since become a popular tool in machine learning and are integrated into several prominent machine learning algorithms, such as Random Forest (Section 3.1.2) and gradient boosting (Section 3.1.3). One of the main advantages of decision trees is their interpretability, as they are easy to understand and can also be displayed graphically so that non-experts can interpret [33]. A disadvantage of decision trees when working with time series data is that they cannot capture temporal patterns such as trends, periodicities, and sequences. Decision trees are a non-parametric supervised learning algorithm, which is used for both regression (continuous values) and classification (discrete values) tasks [34]. Decision trees where the target variable can take continuous values are also called regression trees. This can be useful when the relationship in the data is found to be non-linear, as seen in Figure 3.1. Decision trees where the target variable can take discrete values are also known as classification trees, or multi-class classification trees if it predicts outcomes for more than two classes. Decision trees are weak learners. Weak learners are models that perform slightly better than random guessing or taking the mean of a sample and are often used together in ensemble models to form strong learners with higher accuracy [35]. Figure 3.1: Example of how a regression tree can be used to fit a model to continuous non-linear data. Each leaf of the tree is labelled with a value, which is the output of the model. 3.1.2 Random Forest Random Forest is an ensemble learning method first proposed in 1995 by Tin Kam Ho [35] and later expanded by Leo Breiman and Adele Cutler in 2001 [36] to the version that was used in this thesis; see Algorithm 1. Random Forest is a meta- estimator that constructs a multitude of decision trees at training time that either returns the average (regressor) or the majority class (classifier) of all individual trees; see Figure 3.2. 18 3. Machine Learning Fundamentals Algorithm 1: Random Forest Algorithm [Tb]B ← The ensamble of trees; for b = 1 to B do 1. Draw a bootstrap sample Z∗ of size n from the training data. 2. Grow a random-forest tree Tb to the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size fraction smin or the maximum number of terminal nodes kmax are reached. (a) Select m variables at random from the p variables. (b) Pick the best varaiable/split-point among the m. (c) Split the node into two child nodes. end Return [Tb]B Figure 3.2: The Random Forest will build N unique decision trees that will each make a prediction. Random Forest is a strong learner constructed of many smaller decision trees, known as weak learners. 19 3. Machine Learning Fundamentals Random Forest models inherit their great interpretability characteristics from deci- sion trees and require fewer computations than neural networks, which is a great benefit. The method has previously shown promising results within the financial field, unlike other decision tree algorithms that tend to overfit and struggle with noisy data [14]. Random Forest models overcome this problem by training multiple decision trees on subsets of available data to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone [37]. The current state of the art, presented by Pinelis and Ruppert 2022 [15], used this method and outperformed a buy and hold strategy by 3.4% while also gaining significant improvements in the Sharpe ratio. The Random Forest model used in this thesis contains a total of 17 tunable hyper- parameters. However, only a subset of these were used, which are defined in Table 3.1. Parameter Definition max_depth The maximum depth of the tree. max_features The number of features to consider when looking for the best split. min_sample_leaf The minimum number of samples required to be at a leaf node. min_sample_split The min number of samples required to split an internal node. n_estimators The number of trees in the forest. Table 3.1: Random Forest Parameters. These hyperparameters were optimised for both the regressor and the classifier. 3.1.3 Gradient Boosting The gradient boosting algorithm was the primary focus of the thesis. Gradient boosting is closely related to the simpler Random Forest algorithm. Both algo- rithms are ensemble learning methods that use weak learners in the form of decision trees to perform regression or classification. The key difference is that Random For- est is a bagging ensemble method, while gradient boosting is a boosting ensemble method. Bagging creates multiple diverse models by training them independently, while boosting creates models sequentially, with each new model learning from the errors made by the previous model, see Figure 3.3. Gradient boosting offers sev- eral advantages, such as great interpretability and fast training times, even on less powerful machines. Additionally, it tends to outperform Random Forest models in terms of accuracy and predictive power [38]. The algorithm, whose pseudocode can be seen in Algorithm 2, was originally devel- oped by Jerome H. Friedman in 2001. The model creates a strong prediction by combining weak learners (denoted fm) over a fixed number of iterations (denoted M). The number of boosting iterations M is chosen to be the one that minimises the 20 3. Machine Learning Fundamentals Figure 3.3: Gradient Boosting learning curve. Figure illustraded by Aratrika Pal [39]. Bayesian Information Criterion (BIC) of the final boosted model. The weak learners are simple binary decision trees that iteratively improve by learning from previous mistakes. At each iteration, a new residual value is predicted. The residual value is the difference between the estimated and true values. The weak learners then try to minimise the residual value until it reaches the fixed number of M iterations. At that point, it exits the outer loop and uses its current model to predict the results. The benefit of Gradient Boosting, which Friedman talks about in his paper, is how the algorithm has a low variance, resulting in good predictions on unseen data [40]. Algorithm 2: Gradient Boosting Algorithm f0(x)← argminγ ∑N i=1 L(yi, γ); M ← Number of iterations; N ← Numbers of datapoints; for m = 1 to M do for i = 1, ..., N do rim ← −[ δL(yi,f(xi)) δf(xi) ]f=fm−1; end Fit regression tree to the targets rim giving terminal regions Rjm, j = 1, 2, ..., Jm. for j = 1, 2, ..., Jm do γjm ← argminγ ∑ xi∈Rjm L(yi, fm−1(xi) + γ; end fm(x)← fm−1(x) + ∑Jm j=1 γjmI(x ∈ Rjm) end Return f̂(x) = fM(x) The gradient boosting algorithm is also exciting from a financial perspective, as em- 21 3. Machine Learning Fundamentals pirical evidence shows that it can classify noisy data, a characteristic very common in finance [40]. Previous work with gradient boosting has also been done in other areas, such as recession and individual stock prediction, with great success [17], [23]. 3.1.4 eXtreme Gradient Boosting (XGBoost) XGBoost is a praised implementation of gradient boosting that specifically focuses on optimising the gradient boosting algorithm [31]. It offers several enhancements over traditional gradient boosting methods. These enhancements include a more regularised model to control overfitting, a customised loss function, and a highly efficient implementation that supports parallel processing and can handle large-scale data sets. XGBoost has a total of 44 tunable hyperparameters. However, only a subset of these were used, which are defined in Table 3.2. Parameter Definition colsample_bytree Subsample ratio of columns when constructing each tree. gamma Minimum loss reduction required to make a further partition on a leaf node of the tree. learning_rate Boosting learning rate. max_depth Maximum tree depth for base learners. n_estimators Number of trees in Random Forest to fit. subsample Subsample ratio of the training instance. Table 3.2: XGBoost Parameters. These hyperparameters were optimised for both the regressor and the classifier. 3.2 Model Evaluation Strategies Model evaluation strategies play a vital role in developing and assessing machine learning models. They offer a systematic framework to determine the likely perfor- mance of a model on unseen data. In this context, two competing concerns arise: parameter estimates exhibit higher variance when there is less training data, and the performance statistic becomes more variable with limited testing data. Conse- quently, there are no universal solutions applicable to all data sets. Instead, the choice of strategy and data split should be tailored to the specific situation at hand. In this section, the two methods train-validation-test split and time-series cross- validation are explained, with their benefits and drawbacks. 3.2.1 Train-Validation-Test Split Train-Validation-Test split is a model evaluation strategy where the available data is split into three unique subsets: one for training, one for validation, and one for testing. Models that do not perform hyperparameter tuning only use two subsets, train and test, as a validation set is unnecessary. 22 3. Machine Learning Fundamentals Normally, the original data set is shuffled before being split into subsets. Still, due to the temporal dependency of the data in financial time series and to prevent data leakage, the data is kept in its original order, as illustrated in Figure 3.4. Figure 3.4: When developing a model for time series forecasting, it is important to remove future data and gap samples from the training set to avoid data leakage. The training set is crucial for building a machine learning model. By analysing train- ing data, the algorithm identifies relationships and patterns to determine optimal variable combinations to generate an effective predictive model [41]. The training set is usually the largest data set, often containing more than 70% of the total data [42]. The validation set is used to validate the model’s generalisation performance during training and to tune the hyperparameters thereafter. It is necessary to have a validation data set in addition to the training set to avoid overfitting. However, the model can become indirectly biased towards the validation set because of the hyperparameter tuning. The test set is a final out-of-sample (OOS) test of the model’s generalisation ability. It is the best indication of the likely future performance of the model on unseen data. This sample is only used once the model development and hyperparameter tuning process is complete in order to detect and avoid backtest overfitting. The size of the validation and test set can vary greatly and is highly dependent on the specific use case, but a good rule of thumb is to make them equal-sized. Evaluating the models on the validation set prevents backtest overfitting by miti- gating selection bias in multiple backtests. Refraining from conducting a backtest on the out-of-sample data until satisfactory results are obtained on the validation set reduces the risk of building a model based on a statistical fluke. Additionally, in the event of a negative outcome on the out-of-sample test, it is crucial to restart the process, as repeating tests on the same data is likely to lead to false discoveries. Typically, around 20 iterations are required to discover a false investment strategy within the standard significance level of 5%, something shown by López de Prado [18]. 3.2.2 Cross-Validation Cross-validation (CV) is a commonly used technique in machine learning to avoid overfitting and to evaluate a model’s generalisation performance on an independent data set as an alternative to the basic train-validation-test split. It is often employed 23 3. Machine Learning Fundamentals in combination with hyperparameter tuning to determine the optimal hyperparam- eter values for a model. K-fold CV is a variant that divides the original data into K subsamples, see Figure 3.5, where each training sample is used to fit the model and tune the hyperparameters. CV is very useful when the dataset is small and when splitting the data into a typical train-validation-test set would significantly affect the model’s accuracy. Figure 3.5: When developing a model for time series forecasting, it is important to remove future data and gap samples from the training set to avoid data leakage. However, using a conventional CV approach may not be effective or valid when forecasting a time series due to the temporal dependency of the data, and access to future data during model training can result in data leakage. Ensuring that the test set has more recent data than the training set is crucial to address this challenge. This method is known as time-series cross-validation (TSCV). Another difference between CV and TSCV, also related to data leakage, is when each train/test set is adjacent to each other in sequence. There is a risk that variables at the end of the training set, known as gap samples, can incorporate some of the information from the test set, leading to look-ahead bias. Therefore, removing the gap samples when splitting the data is important. 3.3 Hyperparameter Tuning Hyperparameter tuning, is an important part of the machine learning model devel- opment process, as the hyperparameter settings can significantly impact the model’s performance. A hyperparameter is a control parameter that influences the learning process of a model, in contrast to other parameters, such as node weights, which are learnt by the model itself [43]. Hyperparameter tuning poses numerous challenges that make it a complex problem in practice. It can be computationally expensive when dealing with a large number of hyperparameters, and the configuration space is often intricate, involving a mix of continuous, categorical, and conditional hyperparameters. Additionally, optimising for generalisation is difficult, as training data sets are typically limited in size. To address the challenge of generalisation performance, hyperparameter tuning is frequently employed in conjunction with model evaluation strategies, as outlined in Section 3.2. Generalisation performance is typically estimated through techniques 24 3. Machine Learning Fundamentals such as cross-validation on the training set or evaluation on a hold-out validation set. 3.3.1 Model-Free Blackbox Tuning Methods There are many methods available to perform hyperparameter tuning, and due to the non-convex nature of the problem, global optimisation algorithms are usually preferred [43]. Model-free blackbox tuning methods are optimisation algorithms used to find the optimal configuration or parameters of a system without relying on an explicit mathematical model of the system. These methods are typically employed when the underlying system is complex, highly non-linear, or lacks a clear mathematical representation. Grid search is the traditional and most basic method used to perform hyperparame- ter tuning. This method specifies a predefined set of hyperparameters that are then exhaustively searched. This ensures that no hyperparameter configuration is missed during the tuning process, in contrast to other methods, which may miss important configurations or spend more time exploring less promising ones. However, a major drawback of this method is that the number of evaluations grows exponentially as the set of hyperparameters grows, and thus the method is not always feasible. A simple alternative to grid search is random search. As the name suggests, random search replaces the exhaustive enumeration of all combinations by randomly select- ing the hyperparameters, where each setting is sampled from a distribution over possible parameter values. This method can outperform grid search, mainly when only a small number of hyperparameters affect the final performance of the machine learning algorithm, as illustrated in Figure 3.6 [44]. Random search can be a valu- able method to initiate the search process, since it covers the entire configuration space, leading to the discovery of settings that often yield satisfactory performance. These settings can serve as a reference point when conducting more comprehensive guided search methods, such as grid search. Figure 3.6: Comparison of grid search and random search minimising a function with one important and one unimportant parameter. However, when both parameters have a large impact on the result, grid search usually performs better. This figure is based on the illustration by Bergsta and Bengio [44]. 25 3. Machine Learning Fundamentals 3.4 Backward Feature Elimination Backward feature elimination, also known as feature selection, is a technique which involves iteratively removing features from a model until an optimally performing subset of features is achieved. The process starts with a model that contains all the available features, and then one feature is removed at a time. The model is then evaluated with the reduced set of features, and the feature with the least impact on the model’s performance is discarded. This process continues until the desired number of features or a predetermined threshold is reached. One drawback with backward feature elimination is that features sometimes per- form better when combined with other previously determined bad features, a result of non-linear relationships between the features. Such relationships can be hard to find when performing backward feature elimination, compromising the model’s performance. 3.5 Performance Metrics Performance metrics are crucial for model development and provide quantitative measurements of the quality of the model during all stages of its lifetime. They are used to evaluate trained models, for model selection and hyperparameter tuning during development and to monitor the performance of a deployed machine learning model in a production environment. Numerous performance metrics are available for evaluating models, each with unique characteristics. Thus, it is essential to understand when and how to utilise them appropriately and thoroughly. The metrics used in this thesis are explained below. 3.5.1 Coefficient of Determination (R2) The coefficient of determination, more commonly known as r-squared (or R2), is a statistical measure that represents the proportion of the variance in the dependent (target) variable that can be explained by the independent variables (features) in a regression model. In other words, it indicates how well the regression model fits the observed data. The R2-score is calculated according to Equation 3.1. R2 = 1− ∑n i=1 (yi − f(xi))2∑n i=1(yi − ȳ)2 where: yi = Observed value ȳ = Mean value of a sample n = Number of observations f(xi) = Predicted value of yi xi = Set of input features (3.1) The r2_score function from sklearn.metrics is used for practical implementation. The best possible score is 1.0, indicating a perfect fit. On the contrary, a score of 0.0 26 3. Machine Learning Fundamentals corresponds to a constant model that always predicts the average of the dependent variable, disregarding the input features. The score can also be negative because the model can be arbitrarily worse [45]. In finance, this metric is often used to determine the percentage of a price movement in a stock that is attributed to the price movement of the corresponding index [46]. 3.5.2 Accuracy Accuracy is a commonly used metric to evaluate the performance of a classification model. Accuracy measures how well the model correctly predicts the class labels of the input data, calculated according to Equation 3.2. The accuracy is the proportion of correct predictions (both true positives and negatives) over the total number of predictions made. Accuracy = TP + TN TP + TN + FP + FN where: TP = True Positive TN = True Negative FP = False Positive FN = False Negative (3.2) 27 3. Machine Learning Fundamentals 28 4 Methodology Ensuring the prevention of the Seven Sins of Quantitative Investing, mentioned in Section 1.6.2, has been a central focus of the thesis, influencing every stage of the development process to uphold the integrity of the results. This chapter provides further elaboration on how the errors have been addressed. In addition, this chapter outlines the methods used during the course of the thesis, as well as the underlying choices behind them. The tools used to develop the models and the system as a whole are presented. The main parts of a machine learning project are also presented and discussed in the context of this specific thesis. 4.1 Development Approach The approach used during this project can be described as a cycle with three steps: data analysis, model development, and evaluation of results. This cycle was used in a systematic way to thoroughly test the concepts and gain an understanding of how the different models interpret the data. The implementation relied on Python and its renowned libraries for machine learn- ing and numerical computation, including NumPy, SciPy, scikit-learn, pandas, and Matplotlib, among others. This decision was made due to the abundant availabil- ity of comprehensive online resources, the prevalence of open-source code, and the authors’ extensive prior experience with these tools. 4.2 Model Input Processing Once the feature engineering was complete, the data was almost ready to be fed into the models, but first, it needed to be processed and transformed in a suitable way for the task at hand. 4.2.1 Merging The first part of the input processing consisted of merging all feature-engineered time series into a single pandas DataFrame table. This was done using the target variable as the basis and left-joining the remaining series onto the target variable, using forward-fill if needed. In this way, the series with more infrequent granularity 29 4. Methodology could still be used as input, and a lagged version of the target variable could also be used as input. Forward-fill was used instead of backfill to avoid data leakage. 4.2.2 Filtering The next step was to filter the data in rows (date) and columns (features). The data was filtered to span the range from 2003 to 2023. As mentioned previously, the lower limit of the year 2001 was initially chosen because that was the year the Stock index started with daily granularity. However, as some time series did not begin until later than 2001, 2003 was chosen as a balanced lower limit between missing out on real data vs. including too much artificial data, as is explained in the next Section. The columns that were Bond-specific were dropped when predicting the Stock index and vice versa. For example, dropping the Bond-correlation feature when predicting the Stock index. 4.2.3 Filling Missing Data A problem with left-merging all time series onto the target index was that most, but crucially not all, time series went as far back as the target index. Therefore the resulting merge would have ended up containing some initial missing values, which, e.g. the Random Forest model does not support. Instead, the initial missing data for each feature was filled with the feature’s average value. 4.2.4 Shifting independent variables Because a subgoal of the project was to predict the returns of the Stock and Bond index one month ahead, the models also had to be trained with this time shift in mind. This was achieved by shifting the input features forward one month, including a shifted variant of the target variable, see Figure 4.1. Figure 4.1: Shifting the independent (X) features. 30 4. Methodology 4.2.5 Train-Validation-Test Split The data remaining after the initial pre-processing was split into three sets in con- sultation with the Nordea advisors; Train (2003-2016), Validation (2016-2020), and Test (2020-2023), see Figure 4.2. The entire data span included a total of 5197 days, with 3371 belonging to the train set, 1043 to the validation set, and 783 to the test set. The train set contained 13 years of data corresponding to 65% of the total dataset. The period includes remarkable events such as the aftermath of the dot-com crash, including the following bull market that lasted until the global financial crisis in 2008. During this period, the correlation between the Stock and Bond index was low, as the Bond index increased when Stocks declined. The validation set contained four years of data corresponding to 20% of the total dataset. The period was overshadowed by geopolitical tensions, concerns about global economic slowdown, and policy uncertainties that contributed to market swings. The test set contained three years of data corresponding to 15% of the total dataset. The period includes remarkable events such as the COVID-19 crash in 2020, the bull market in 2021 as a result of the extreme expansionary fiscal policy and 2022 with historically low returns for a 60/40 portfolio. As the correlation between the stock and bond market was high in 2022, failing to hedge the investors, this test set also provided a measure of how good the models were at predicting relative future performance between the two assets. Figure 4.2: Train-Validation-Test Split. The models are trained on the training set, hyperparameter tuning is performed using the validation set, and the models’ generalisation capabilities are tested on the test set. Cross-validation (see Section 3.2.2) was also explored, but due to the limited amount of data, Time Series Cross-Validation was not able to perform well because of the 31 4. Methodology first few folds, which got little data to train on in correlation to the test data, which in turn brought the whole average down. 4.2.6 Numpy Transformation In order to make the data compatible with the models from the scikit-learn and dmlc libraries, the pandas DataFrames were transformed into NumPy ndarrays as the final step before training the models. 4.3 Hyperparameter Tuning Hyperparameter tuning was performed using a custom implementation of sklearn’s GridSearchCV, as CV (Cross-Validation) was replaced by evaluation on a hold-out validation set instead. GridSearch can be computationally expensive, as mentioned in Section 3.3, but a parallel version was implemented, reducing the search time. Random search (sklearn: RandomizedSearchCV ) was tested as well but was outper- formed by grid search due to the complex relationship between the parameters. 4.3.1 Random Forest For the Random Forest models, there are a total of 12 tunable hyperparameters. However, only the five parameters in Table 4.1 were tuned. These were found to have the greatest impact on performance, and tuning all 12 parameters would be too time-consuming. Parameters Default Best REGR Best CLF max_depth None 20 10 max_features 1.0 10 sqrt min_sample_leaf 1 2 1 min_sample_split 2 2 3 n_estimators 100 1000 200 Table 4.1: Random Forest default vs. best parameters. See Table 3.1 for parameter definitions. 4.3.2 XGBoost For the XGBoost models, there are a total of 44 tunable hyperparameters. However, only the six parameters in Table 4.2 were tuned. These were found to have the greatest impact on the performance, and tuning all 44 parameters would be too time-consuming. 32 4. Methodology Parameters Default Best REGR Best CLF colsample_bytree 1.0 0.1 0.1 gamma 0 0.02 0.02 learning_rate 0.3 0.4 0.2 max_depth 6 3 15 n_estimators 100 25 25 subsample 1.0 0.7 0.6 Table 4.2: XGBoost default vs. best parameters. See Table 3.2 for parameter definitions. 4.4 Feature Elimination All 1200+ features were initially fed into the Random Forest and XGBoost models. However, early findings suggested that this method was not feasible, since the models kept overfitting to the training data, even though parameters like max_features and colsample_bytree were thoroughly used. Therefore, feature elimination was required. The development stage for the models consisted of a two-step process. First, an initial hyperparameter-optimised model using all features was developed. Then, a second and final hyperparameter-optimised model was created, using only the top 150 performing features of the previous model, which consistently gave better results. The number 150 was chosen by trial and error; using less than 150 gave worse results due to underfitting, and using more than 150 gave worse results due to overfitting. See Figure 4.3 for the feature importance ranking of the initial Random Forest model. For the Random Forest model, the top features of only an initial Random Forest model were used. But for the XGBoost model, the top features of both an initial XGBoost model and an initial Random Forest model were tested. It turned out that the XGBoost model, which used the top features of a Random Forest model, was actually the most performant. 4.5 Model Evaluation A backtest from 2020 to 2023 on the Stock and Bond index was performed when evaluating the prediction and allocation models. Using Stock and Bond market indices can be helpful in mitigating survivorship bias when backtesting an allocation strategy. Indices maintain a history of their constituent stocks, and by accessing the historical composition of an index, the per- formance is based on companies that were part of the index at specific points in time, even if they are no longer actively traded. Figure 4.4 shows how only considering the current investment universe can lead to overly optimistic performance on historical backtests. 33 4. Methodology Figure 4.3: Feature importance from the initial Random Forest model trained on over 1200 features. However, due to the presence of noise, the number of features used in the final models were significantly reduced by up to 90%. Figure 4.4: The backtested performance of MSCI Europe between 1995 and 2014 is almost twice as good when only considering the investment universe of 2014, a common error in quantitative investing. Graph produced by Yin Luo [20]. 34 4. Methodology 4.5.1 Prediction Models The Random Forest and XGBoost machine learning models were used to predict the movements of the Stock and Bond index for the upcoming month. The regressors predicted the returns of the respective index for the upcoming month, while the classifiers only predicted the direction of the returns. The models were evaluated on the validation set for model selection and ultimately evaluated on the test set for general predictive performance. The regressor models were evaluated using the R2 score, while the classifier models were evaluated using accuracy. 4.5.2 Allocation Models The portfolio’s performance was assessed by rule-based allocation models through backtesting, using signals from the prediction models. There were two types of allocation models, one for the regressors and one for the classifiers. The regressors used annual risk-adjusted returns as the allocation strategy, which is calculated according to equation 4.1. The asset that exhibited the highest predicted risk-adjusted return was selected for overweighting in the portfolio. Risk-Adjusted Returns = Yearly Returns Yearly Volatility(of daily returns) (4.1) The classifiers used the predicted direction of the Stock index as the allocation strategy. If the Stock index returns were predicted to be positive, the Stock index was overweight and the Bond index was underweight, otherwise, the Bond index was overweight and the stock index was underweight. As mentioned in the limitations in Section 1.5, the allocation models were restricted to a 60/40 overweight/underweight principle. The evaluation of the allocation models encompassed six widely used performance metrics in finance, listed below: • Total Return often aligns with investors’ objectives, allowing for meaningful comparisons between investment options or asset allocation strategies. • Alpha helps assess the investment’s performance relative to a benchmark and can explain whether a strategy has added value beyond what can be explained by market movements. • Volatility provides insights into an investment or portfolio’s risk and potential fluctuations. Volatility is a crucial component of risk management. • Sharpe Ratio provides a measure of risk-adjusted performance, how a strat- egy has historically managed risk, and whether risk levels are acceptable within their risk tolerance. 35 4. Methodology • Sortino Ratio is similar to the Sharpe ratio, but focuses specifically on the downside risk. • Maximum Drawdown (MDD) provides insights into the worst peak-to- trough decline experienced by an investment or portfolio. MDD is particularly relevant for investors who prioritise capital preservation. Because of the uncertain nature of the stock market, the models were backtested for all rebalancing days in the month, and an average was computed as the final result. 4.6 Model Comparison Once all the models were fully developed, they were compared against the target benchmarks defined in Section 1.3.1, as well as against each other, the results of which are presented in the next chapter. 36 5 Results This chapter highlights the results and findings of the project, demonstrating how they align with the project’s goals. The first and second section gives an in-depth analysis of the prediction and allocation model results, focused towards the first goal of the thesis, to build an outperforming portfolio. The last section is focused on the data and how feature importance has been utilised to achieve the thesis’ second goal, to find leading indicators for the Stock and Bond index. 5.1 Prediction Models The prediction models were assessed through a backtest, a historical simulation that gauges the performance of a strategy if executed in the past. While the results of the backtest do not guarantee future performance and only serve as a hypothetical sanity check, it is widely employed in the industry, despite being criticised for its susceptibility to errors [18]. One of the reasons for the popularity of backtesting is because it allows for result comparisons with other studies, providing insights into the performance of developed models in relation to existing literature. 5.1.1 Prediction Regressors The regressor prediction models were the main focus of this thesis, as the hypothesis was that the increased detail that a regressor could provide over a classifier would be beneficial for how the rule-based allocation models would perform. The results of the hyperparameter-optimised regressor models on the test set can be seen in Figure 5.1, together with the performance metrics in the various data sets in Table 5.1. The table includes a separate test set that excludes the year 2020, which marked the beginning of the COVID-19 pandemic. This separation allows for testing the model’s resilience to large outliers, as numerous time series experienced significant deviations during and after the market crash in 2020. When predicting the monthly returns for the Stock index, XGBoost achieved an R2-score of 0.233 on the test set that excluded 2020, with the magnitude of the predictions posing as a primary drawback. However, predicting the Bond index proved to be more challenging, with difficulties in identifying patterns and relationships in the data evident from the low R2-score 37 5. Results REGRESSOR Model Train Validation Test Test (2021-23) RF Stock 0.997 0.159 -0.345 0.188 XGBoost Stock 0.773 0.254 -0.273 0.233 RF Bond 0.540 -0.024 -0.115 -0.302 XGBoost Bond 0.615 0.091 -0.148 -0.186 Table 5.1: Prediction performance (R2-Score) of the regressor models for the Stock and Bond index (best values in bold). on all three data sets. Regardless of the inclusion or exclusion of 2020, the test set did not yield a positive R2-score. (a) RF Stock (b) XGBoost Stock (c) RF Bond (d) XGBoost Bond Figure 5.1: Prediction results for the regressor models on the test set. 5.1.2 Prediction Classifiers Studying classifier models was not considered until the latter parts of the project, when the initial regressor prediction models did not perform as well as initially hoped, but the classifiers quickly showed promising results. The classifier prediction models were used to predict asset trend for the upcoming month, and the result from backtesting on the test set can be seen in Figure 5.2, together with performance metrics on the various data sets in Table 5.2. 38 5. Results CLASSIFIER Model Train Validation Test Test (2021-23) RF Stock 98.84% 73.06% 56.96% 58.93% XGBoost Stock 99.05% 74.69% 57.09% 62.96% RF Bond 96.11% 51.58% 45.59% 47.79% XGBoost Bond 98.04% 62.13% 42.15% 40.50% Table 5.2: Prediction performance (Accuracy) of the classifier models for the Stock and Bond index (best values in bold). The same patterns that occurred for the regressors were seen here as well. The XGBoost model was still the most accurate when predicting stocks, achieving an accuracy of 62.96% on the test set, discarding 2020. Bonds proved difficult to forecast again, with both XGBoost and Random Forest scoring an accuracy below 50%. (a) RF Stock (b) XGBoost Stock (c) RF Bond (d) XGBoost Bond Figure 5.2: Prediction results for the classifier models on the test set. 39 5. Results 5.2 Allocation Models Two types of rule-based allocation models were created. The resulting allocation models can be seen in Figure 5.3, along with the various portfolio metrics in Table 5.3. (a) RF Regressor (b) XGBoost Regressor (c) RF Classifier (d) XGBoost Classifier Figure 5.3: Allocation results for the final models. Model Total Return Alpha Volatility Sharpe Sortino MDD 50/50 0.49% - 0.77% -1.65 -2.51 -32.16% RF (reg) -3.24% -3.72% 0.84% -5.97 -8.91 -30.62% XGBoost (reg) -3.10% -3.59% 0.84% -5.80 -8.66 -30.66% RF (clf) 2.53% 2.05% 0.73% 1.06 1.62 -32.21% XGBoost (clf) 1.84% 1.36% 0.76% 0.11 0.17 -32.08% Table 5.3: Allocation performance of the studied models (best values in bold). 40 5. Results 5.2.1 Allocation Regressors The total return for the 50/50 benchmark model was 0.49%, with a Sharpe and Sortino ratio of -1.65 and -2.51, respectively. But the regressor allocation models only achieved a total return of -3.24% and -3.10% for the Random Forest and XGBoost models, respectively. In addition, the Sharpe and Sortino ratios of the models were significantly lower compared to the benchmark as well. Although the continuous nature of risk-adjusted returns would have made it possible to work with thresholds to decide the portfolio weights between neutral (±0%), single (±5%) or double overweight (±10%), the difference between the Stock and Bond index, as illustrated in Figure 5.4 made it difficult to find a threshold that would generalise well, especially in less volatile periods outside of the test set. Instead, the rule-based allocation regressor model only used double-overweight depending on which asset had the highest predicted risk-adjusted return. Figure 5.4: Predicted Risk-Adjusted Returns (PRED_RaR) from the XGBoost re- gressor and realised Risk-Adjusted Returns (RaR) from the Stock and Bond index. PRED_RaR has been a core part of the regressor version of the rule-based alloca- tion model. 5.2.2 Allocation Classifiers Due to the poor accuracy of the Bond classifier prediction models, the rule-based allocation strategy only utilised the signals from the Stock prediction model. As a consequence of using only one signal, the allocation model was limited to only using double overweights (±10%), as opposed to single (±5%) or neutral overweights (±0%), which is otherwise a possibility within the Nordea allocation mandate. Despite these limitations, the Random Forest classifier allocation model outper- formed the benchmark by more than 2% while simultaneously decreasing risk, re- sulting in a higher risk-adjusted return overall. Although the XGBoost Stock prediction classifier had a better accuracy on the test set compared to the Random Forest Stock prediction classifier (Figure 5.2), the total 41 5. Results return of the XGBoost allocation model was only 1.84% on the backtest, compared to 2.53% for the Random Forest model. Still, the result was an improvement over the 50/50 benchmark portfolio across all performance metrics. 5.3 Rebalancing day The choice of which day in the month to rebalance the portfolio is a crucial factor that significantly impacts the total return and is relevant to asset allocation practices in general, extending beyond the scope of financial machine learning. Figure 5.5 illustrates how the performance of the allocation classifier models is affected by the choice of rebalancing day. The figure reveals that the Random Forest model exhibits more consistent returns, while the XGBoost model shows greater variability with a higher peak (7.67% compared to 6.08%) and a lower bottom (-3.84% compared to -2.07%). (a) Random Forest Classifier (b) XGBoost Classifier Figure 5.5: Rebalancing day matters. The total return of the Random Forest and XGBoost allocation models can vary by 8.15 and 11.51 percentage points, respec- tively, depending on when the portfolio is rebalanced. 5.4 Feature Importance The use of backtests as an industry-standard performance metric in financial ma- chine learning has faced criticism, and the results have been labelled as pseudo- discoveries by prominent figures in quantitative forums, such as Marcos López de Prado [18]. Feature importance provides an alternative evaluation approach to a machine learning model in conjunction with historical backtests. By analysing fea- ture importance, insights can be gained into when and which features influenced the model’s performance, thereby uncovering the inner workings of the often-discussed “black box” in artificial intelligence. This analysis enables the elimination of noisy features and facilitates the assessment of performance for new combinations of time series. Consequently, feature importance was conducted prior to the backtests to inform the subsequent analysis. 42 5. Results The extracted features and their importance can be seen in Figure 5.6. The figure highlights the models’ dependence on multiple features rather than relying heavily on a single one. When performing feature elimination, only 78 features significantly impacted the XGBoost model when forecasting the Bond index, compared to 150 features for the Random Forest model. But when predicting the Stock index, both models used the same 150 features. Furthermore, a comparison of the most impor- tant Stock features, seen in Table 5.4, and the most important Bond features, seen in Table 5.5, reveals that there was no overlap in the top features between the two models. (a) Stock Features (b) Bond Features F