AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms Master’s thesis in Engineering Mathematics and Computational Science MARKUS EDLUND DEPARTMENT OF MICROTECHNOLOGY AND NANOSCIENCE CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2025 www.chalmers.se www.chalmers.se Master’s thesis 2025 AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms MARKUS EDLUND Department of Microtechnology and Nanoscience Chalmers University of Technology Gothenburg, Sweden 2025 AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms MARKUS EDLUND © MARKUS EDLUND, 2025. Supervisor: Zoran Konkoli, Department of Microtechnology and Nanoscience Examiner: Zoran Konkoli, Department of Microtechnology and Nanoscience Master’s Thesis 2025 Department of Microtechnology and Nanoscience Division of Division name Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2025 iv AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms MARKUS EDLUND Department of Microtechnology and Nanoscience Chalmers University of Technology Abstract The battery manufacturer Northvolt aims to reduce inventory risks. In an increas- ingly competitive industrial metals market, protecting inventory value by employing hedging strategies is a key component of lowering costs. Using artificial intelligence for reliable predictions of short term prices is an interesting prospect for improving such strategies. This study uses data from the London Metal Exchange, including cash and three- month futures contracts, along with stock volume, to predict next-day cash prices for six industrial metals. These predictions are compared to a baseline random walk model using the metrics mean squared error (MSE), mean absolute error (MAE), R2, and directional accuracy. Predictions are made using an Echo State Network (ESN), with optimized parame- ters chosen by a Genetic Algorithm (GA). The network is trained offline with ridge regression and online with stochastic gradient descent. The performance of the ESN- GA setup is validated using chaotic systems (Mackey-Glass and Lorenz equations) before being applied to metals futures data. The data is split into four different sets: a warmup set necessary for ESNs, a training set used for offline training, a validation set to measure GA performance, and a test set of unseen data to ensure the network generalizes well. The GA optimization significantly reduced prediction errors in the Mackey-Glass and Lorenz systems, with MSE values improving from 1.2E-8 to 5.3E-12 and from 3.3E-2 to 8.8E-7, respectively, on the validation set. For metals futures data, the GA enhanced ESN performance, outperforming the random walk across all metrics on the validation set for all six metals. However, slight modifications were necessary to achieve superior performance on the unseen test set. Superior results were achieved for four metals across all metrics, while the results for the remaining two metals were mixed. 
The GA effectively optimizes ESN parameters for systems with clear, determin- istic dynamics but encounters challenges when applied to stochastic futures data, particularly due to the risk of overfitting the validation set. While the ESN-GA method does not consistently outperform the random walk on unseen test data, it demonstrates significant potential. Further exploration with alternative configura- tions, additional data types, and more robust validation techniques is warranted to enhance its practical applicability. Keywords: Artificial Intelligence, Echo State Networks, Genetic Algorithms, North- volt, London Metal Exchange, Metal Futures, Financial Time Series, Random Walk v Acknowledgements I would like to express my heartfelt gratitude to several individuals and organizations who have supported me throughout this journey. First and foremost, I would like to thank my supervisor and examiner, Zoran Konkoli, for his patience, guidance, and encouragement in allowing me the free- dom to explore reservoir computing according to my interests and aspirations. His support has been invaluable throughout this process. I am deeply grateful to Denver Stretesky for serving as my opponent and for provid- ing great feedback and insightful suggestions that have greatly improved the quality of this work. A special thanks goes to Frederic Sese and Axel Myrberger at Northvolt for giving me the opportunity to work on such an engaging and meaningful problem. Despite challenging times, their enthusiasm has never wavered. Finally, I extend my thanks to everyone who contributed directly or indirectly to the completion of this project. Markus Edlund, Gothenburg, January 2025 vii List of Acronyms DA Directional Accuracy ESP Echo State Property ESN Echo State Network GA Genetic Algorithm GD Gradient Descent LME London Metal Exchange MAE Mean Absolute Error MSE Mean Squared Error RC Reservoir Computing ix Contents List of Acronyms ix 1 Introduction 1 1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Significance of the Study . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Background 3 2.1 Northvolt and the London Metal Exchange . . . . . . . . . . . . . . . 3 2.2 The Random Walk and Chaos in Finance . . . . . . . . . . . . . . . . 4 2.2.1 Is the LME an Efficient Market? . . . . . . . . . . . . . . . . 5 2.2.2 The Random Walk Model . . . . . . . . . . . . . . . . . . . . 5 2.2.3 Beyond the Random Walk: Chaotic Dynamics . . . . . . . . . 5 2.3 The ESN-GA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 Theory 9 3.1 Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1.1 Scale-Free Network . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1.2 Uniformly Random Networks . . . . . . . . . . . . . . . . . . 11 3.2 Theory of Reservoir Computing Models . . . . . . . . . . . . . . . . . 11 3.2.1 The Echo State Property . . . . . . . . . . . . . . . . . . . . . 12 3.2.2 Calculating Output . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.3 Offline Learning: Ridge-Regression . . . . . . . . . . . . . . . 15 3.2.4 Online Learning: Stochastic Gradient Descent . . . . . . . . . 16 3.3 Theory of Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . 17 3.3.1 Premature Convergence . . . . . . . . . . . . . . . . . . . . . 21 3.4 Overfitting . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 22 4 Methods 25 4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.2 The Echo State Network . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2.1 Initializing Networks . . . . . . . . . . . . . . . . . . . . . . . 27 4.2.2 Warmup and Training . . . . . . . . . . . . . . . . . . . . . . 29 4.2.3 Validation and Testing . . . . . . . . . . . . . . . . . . . . . . 29 4.2.4 Calculating Test Metrics . . . . . . . . . . . . . . . . . . . . . 30 4.3 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 xi Contents 4.3.1 The Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.4 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3.5 Summary of Genetic Algorithm Parameters . . . . . . . . . . 33 4.4 Refinements to the Methods . . . . . . . . . . . . . . . . . . . . . . . 34 4.5 Benchmarking Using Mackey-Glass and Lorenz Attractor . . . . . . . 35 4.6 Software and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.7 Limitations of the Methods . . . . . . . . . . . . . . . . . . . . . . . 36 4.8 Summary of Methodological Approach . . . . . . . . . . . . . . . . . 37 5 Results 39 5.1 Mackey-Glass Equations and Lorenz Attractor . . . . . . . . . . . . . 39 5.2 Data Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.3 Parameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4 Testing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.4.1 Results From Modified Fitness Function . . . . . . . . . . . . 68 5.4.2 Grid Search For Regularization . . . . . . . . . . . . . . . . . 70 5.4.3 Removing the Stock Data . . . . . . . . . . . . . . . . . . . . 72 5.5 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6 Discussion 75 6.1 Effectiveness of the GA in Improving ESN Predictive Performance . . 75 6.2 Sufficiency of the Genome . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3 Effectiveness of the Enhancements . . . . . . . . . . . . . . . . . . . . 75 6.4 Computational Costs and Time Requirements . . . . . . . . . . . . . 76 6.5 Robustness of GA-Optimized ESNs Across Datasets . . . . . . . . . . 76 6.6 Insights for Trading Strategies . . . . . . . . . . . . . . . . . . . . . . 76 6.7 Comparison with Random Walk Model . . . . . . . . . . . . . . . . . 77 6.8 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 7 Conclusions 79 Bibliography 81 A Appendix A I B Appendix B V C Appendix C IX xii 1 Introduction In an increasingly competitive industrial metals market, protecting inventory value is critical for industrial actors like Northvolt, one of Europe’s fastest-growing battery manufacturers [70]. Materials account for approximately 65% of the total production costs of electric vehicle batteries [45], therefore delivery contracts of batteries usually include a moving prices of said materials [80]. This exposes the whole supply chain to significant financial risk if prices are volatile, highlighting the need for effective mitigation strategies. To address this challenge, hedging strategies involving futures contracts traded on the London Metal Exchange (LME), are employed to stabilize costs and protect inventory value [55][46][75]. 
The LME is the world’s primary commodities exchange for industrial metals, with $15 trillion traded in 2023 [53]. Futures contracts allow traders to lock in prices for metals at a future date, enabling them to offset potential losses caused by price fluctuations [55]. For instance, selling futures while holding an equivalent physical inventory can balance the financial impact of price changes. In this context, accurate short-term price predictions, particularly for LME cash contracts, could improve hedging strategies and inventory management. Predicting financial markets, however, is notoriously challenging. The efficient mar- ket hypothesis suggests that market prices reflect all available information, causing price movements to behave like a random walk, that is inherently unpredictable [21]. Yet, skepticism remains. Some studies propose that financial time series exhibit chaotic dynamics rather than pure randomness, meaning there may be underlying patterns that can be identified and modeled [62]. If chaotic behavior exists, mod- els adept at predicting chaotic systems might offer a competitive edge over other models. To address this, this thesis explores the use of artificial intelligence for predicting short-term cash prices of industrial metals. Specifically, the study applies an Echo State Network (ESN) to forecast next-day cash prices of six metals traded on the LME. The ESN’s parameters are optimized using a Genetic Algorithm (GA), a nature-inspired optimization technique [66]. The performance of the ESN-GA model is evaluated against a random walk baseline, using the metrics mean squared error (MSE), mean absolute error (MAE), R2, and directional accuracy. 1.1 Research Questions This thesis aims to answer the following primary research question: • Can Echo State Networks (ESNs), optimized using a Genetic Algo- 1 1. Introduction rithm (GA), outperform a random walk model in predicting next- day metal futures prices? To address this, the following secondary questions are considered: • Is the GA effective at improving the ESN’s predictive performance? • What are the computational costs and time requirements for using a GA to optimize ESNs? • How robust is the GA-optimized ESN model for different metals and across various datasets? • What patterns or insights emerge from the ESN model’s predictions that could inform trading strategies? 1.2 Overview of Methods The study uses historical data from the LME, including cash contracts, three-month futures contracts, and stock volume for six industrial metals: aluminium, copper, lead, nickel, tin, and zinc accessed through Westmetall [86]. The data is split into warmup, training, validation, and test sets. The ESN, a type of recurrent neural network, is constructed to predict next-day cash prices. Its parameters are tuned using a GA, which iteratively evolves the network design to minimize prediction error. The ESN-GA model’s performance is validated for chaotic systems using the Mackey-Glass and Lorenz Equations. Finally the performance of the ESN-GA model is tested on the industrial metals and benchmarked against a random walk model. 1.3 Significance of the Study This study contributes to the fields of machine learning and financial time series forecasting by exploring the effectiveness of ESN-GA models for industrial metal price predictions. It also bridges the gap between theoretical AI models and prac- tical financial applications, offering insights that could enhance hedging strategies for industrial actors like Northvolt. 
By evaluating the ESN-GA model against the random walk, this research provides a rigorous test of the model’s predictive capa- bilities often lacking in many similar studies on using AI for financial time series predictions [73][83][13][58]. 2 2 Background This section established the practical context and briefly introduces the theoretical frameworks and machine learning techniques that will be applied in the thesis. 2.1 Northvolt and the London Metal Exchange Northvolt is a Swedish battery manufacturer and was the fastest-growing producer in Europe 2023, as highlighted in their sustainability and annual report [70]. The company’s value proposition is centered around producing the world’s greenest bat- teries with industry-leading performance, enabling customers to transition to electric solutions. To manufacture these batteries a wide range of metals and materials are required. This exposes Northvolt to fluctuations in metal prices. As part of its strategy to manage this risk Northvolt incorporates recycling into its processes [71]. In 2023, 6% of the materials in their battery cells were recycled, with an ambitious target of 50% by 2030 [70]. Materials account for a significant portion of the costs in battery manufacturing, approximately 65% according to industry estimates for electric vehicle batteries [45]. While this estimate is not specific to Northvolt, material costs are critical for any battery manufacturer, emphasizing the importance of intelligent procurement strategies. While well structured supply chains are probably the most important part of driving costs down, an important complement is the use of financial tools such as futures [75] [46]. The London Metal Exchange (LME) is the world’s primary commodities exchange for industrial metals. In 2023, it handled trading volumes of 3.5 billion tonnes, and a notional trading value of $15 trillion [20]. The LME facilitates the majority of global financial business related to non-ferrous metals, i.e. metals that do not contain iron, and prevalent in batteries. One of the LME’s key functions is enabling the trading of futures contracts [55]. Fu- tures are agreements that obligate buyers and sellers to transact a specified quantity of a commodity at a predetermined price on a future date [31]. Notably, traders do not need to physically own the underlying metal to trade futures. They can offset contracts by entering equivalent contracts with matching terms on the opposite side. Upon contract expiration, the LME clearing house records a net-zero position, and the trader realizes a profit or loss based on the difference between the purchase and sale prices of the contracts and no metals have to be delivered or received [52]. Futures contracts are an effective tool for reducing inventory risk without the need to buy or sell additional metal. For example, buying a certain amount of a metal and 3 2. Background simultaneously selling an equivalent quantity in futures contracts offsets potential inventory losses due to declining metal prices, as these losses are balanced by gains in the futures contracts. Conversely, increases in inventory value are offset by losses in the futures position. Something similar to this approach is widely employed by industrial players, including Northvolt, to mitigate inventory risks [80]. This study uses two types of LME contracts as data: cash contracts and three-month contracts. • Cash contracts are the shortest-duration contracts on the LME [51]. 
Once a cash contract is entered, it cannot be offset with another contract because no equivalent contract exists to take an opposing position. These contracts effectively determine the final cash value for traders who do not wish to settle their positions physically. The next-day cash price is the primary focus of the predictive model in this thesis. • Three-month contracts are the longest contracts that are settled daily on the LME [51]. As the name implies, they mature three months from the purchase date. These contracts offer insight into market expectations of future price movements. This relationship may hold predictive power for next-day cash prices. The third type of data used in this study is stock data. The LME authorizes a global network of warehouses for storing metals traded on the exchange [54]. When a futures contract is settled, sellers issue warrants for the agreed-upon quantity of metal, which buyers can redeem at these warehouses. Since prices are fundamentally driven by supply and demand, information about available metal stock may provide valuable insights for price predictions. 2.2 The Random Walk and Chaos in Finance The dynamics of financial markets are complex. Unlike physical systems governed by constant fundamental laws, a major contributor to financial market outcomes are the anticipations of its participants [65]. Adding to this complexity is bad in- put data, extraordinary events and malevolent actors. This makes price prediction a daunting challenge. If financial market prices were predictable using readily available formulas, these formulas would create a self-correcting effect: market participants would exploit the predictions, and subsequent prices would quickly reflect the ex- pected outcomes. This leads to the notion that prices fully incorporate available information, with price movements occurring as independent random events when new information emerges. This concept is encapsulated in the Efficient Market Hy- pothesis (EMH), which states that: "A market in which prices always fully reflect available information is called efficient" [21]. The EMH is not meant to be taken literally and recognizes varying degrees of market efficiency. Empirical tests of market efficiency are typically categorized as [21]: 1. Weak: Only historical prices are considered available information. 2. Semi-strong: All publicly available information is considered. 3. Strong: All information, including non-public data, is incorporated. While the validity of the Efficient Market Hypothesis remains a subject of debate 4 2. Background [91], AI models often struggle to outperform a random walk in short-term predic- tions, particularly when relying solely on historical price data [58]. 2.2.1 Is the LME an Efficient Market? There is a case to argue that the LME may not fully align with the definition of an efficient market. The sufficient conditions for market efficiency include [21]: 1. No transaction costs in trading securities. 2. All available information is costlessly accessible to all market participants. 3. All participants agree on the implications of current information for prices. The LME has transaction costs [48], and historical data is not freely available [49], posing barriers for some participants, such as the author of this thesis. However, in practice the only actors directly trading on the LME are institutional, making the economical barriers to information and transactional costs negligible [50]. 
Regarding the last condition, it would be incredibly difficult to show that every single participant agrees on anything, and doing so is entirely out of scope for this thesis. So, while the LME does not technically fulfil the sufficient conditions for an efficient market, it is certainly possible that it is one in practice. An older, brief study of the LME copper market concluded that it probably is, although the statistical evidence was inconclusive [30].

2.2.2 The Random Walk Model

The predictive model derived from the weak form of the EMH is the random walk. A time-discrete random walk has two key properties [21]:
1. Changes in price are independent of each other.
2. Changes in price are identically distributed.
The random walk (RW) model is represented as:
$$S_t = S_{t-1} + \epsilon_t$$
where S_t is the price at time t and ε_t is a noise term with E[ε_t] = 0. This implies that the expected value of the next day's rate of change, r_{t+1}, given the available information, is:
$$E[r_{t+1} \mid I_t] = 0$$
where I_t represents the information set containing historical data up to time t. Extensions and variations of the RW assumption are widely used, for example in the Black-Scholes model for option pricing, where log returns are assumed to be independent and normally distributed [6]. The RW model in this thesis only assumes that expected next-day returns are zero; the best prediction of the next-day price is therefore today's price.

2.2.3 Beyond the Random Walk: Chaotic Dynamics

Despite its prevalence, the random walk model has its skeptics. Some research has found that financial time series are better described as chaotic rather than random [62]. Unlike randomness, chaos implies deterministic yet complex patterns that can be modeled. Accurately modeling a chaotic system requires precise measurements of all relevant variables, which may be infeasible for financial time series. Even so, using some of the relevant variables might produce more accurate predictions than a random walk. Chaotic systems such as the Mackey-Glass and Lorenz equations are examples of differential equations describing the dynamics of complex systems [63][56]. These models will be used in this thesis to demonstrate the ability to predict chaotic systems. The reasoning is that if the LME futures market exhibits chaotic behavior, a model designed for chaotic systems should outperform the random walk. Instead of debating whether markets are inherently chaotic or efficient, this thesis adopts a pragmatic approach: constructing a model that excels at chaotic prediction and then comparing its performance with the random walk model. With some of the inherent challenges of financial time series forecasting established, the next section introduces the model employed in this thesis.

2.3 The ESN-GA Model

Echo State Networks (ESNs) are a category of Reservoir Computing (RC) models that have shown excellent performance in predicting chaotic time series [19][41]. One of the ESN's key advantages is its limited training requirements. This is achieved through the construction of the reservoir, a high-dimensional, non-linear space of sparsely connected nodes whose weights and connections remain fixed [38]. Instead of training the entire network, only a linear readout layer is trained to interpret the states of the reservoir nodes. Since this readout layer is linear and non-recurrent, efficient regression techniques can be employed. The primary challenge in constructing ESNs lies in selecting the parameters defining the reservoir [57].
Questions such as “How many nodes should be used?”, “How should they be connected?”, and “How should the input be scaled?” are critical to the network’s performance. A common approach involves relying on expertise or intuition, essentially educated guessing. When such expertise is unavailable, a systematic and automated method becomes necessary. In this thesis, parameters are selected using a Genetic Algorithm (GA). A GA simulates the process of evolution [66]. Central to evolution is the concept of survival of the fittest, which measures how well an individual adapts to its envi- ronment. In the context of GAs, fitness is quantified using a mathematical function that evaluates how well an individual performs a given task. Here, an individual represents a set of parameters defining an ESN model. The GA process begins by randomly initializing a population of individuals, the first generation, within a defined search space. Each individual is evaluated based on its fitness, which, in this context measures the ESN’s predictive performance on LME cash contracts. To improve the next generation, evolutionary operators such as selective breeding and random mutation are applied: • Selective breeding: Individuals with higher fitness are more likely to be selected for reproduction, creating new offspring by combining the traits of 6 2. Background two selected parents. • Random mutation: Introduces novel variations by randomly altering traits. While often disruptive, mutation is crucial for maintaining genetic diversity and discovering new, potentially superior solutions. This process of selection, crossover, and mutation is repeated over multiple genera- tions, allowing the population to converge toward optimal or near-optimal solutions. Compared to random or grid search methods, GAs often identify high-performing solutions more efficiently [85]. The nested optimization loop, where the GA optimizes the parameters of the ESN whose readout layer is then trained, form the ESN-GA model proposed in this the- sis. This combination leverages the predictive power of ESNs and the optimization capabilities of GAs to address the challenges of parameter selection systematically. 2.4 Related Work Having introduced the relevant topics of the thesis it is worth looking into whether anything similar has been done before. The concept of reservoir computing is not new. It was formally introduced by Herbert Jaeger in 2001 [38]. Similar models were known before but Jaeger’s work is the foundation of what RC is today [40]. Since then a multitude of effective applications have been found [12]. There are also multiple RC libraries available for ease of implementation [68] [15] [16]. However, when pairing the RC model with a GA, using existing libraries can be cumbersome and clunky. Instead a low level custom implementation is favored allowing for control and larger understanding of network design decisions. Creating a low level implementation of RC can be challenging but the work by Lukoševičius, specifically A Practical Guide to Applying Echo State Networks, is a vast resource [57]. The guide is in fact cited in most of the existing libraries. Others have used a genetic algorithm to optimize the parameters of an RC model with positive results [29] [11]. There are also implementations of RC networks specifically for financial time series [84] [89] [88], however none known to the author for predicting metal futures. 
Using ESNs, or reservoir computing for that matter, for financial time series predic- tions is quite niche and there are seemingly no comprehensive literary reviews on the topic. Exemplifying this is the omission of RC models from reviews of deep learn- ing for financial time series predictions [13] [83] [73]. It should also be noted that data from non-publicly traded metal futures, such as the LME data, is also never mentioned in any of the reviews. Further, using the random walk as a benchmark is not common. In the respective 72, 85 and 140 studies in these reviews, using the random walk as a comparative baseline only appears twice, even though the efficient market hypothesis is commonly cited. The explanation for this might be found in a study that replicated 15 best in class deep learning models found in other studies [58]. None of the models replicated in the study managed to outperform the naive approach, called random walk model in this thesis, based on mean squared error over multiple time periods. 7 2. Background 8 3 Theory This section handles the theory required to construct an ESN, optimize its param- eters using a GA and produce predictions. 3.1 Network Topologies A network topology refers to the arrangement or organization of nodes and connec- tions within a network. It is particularly relevant to this subject due to its significant impact on network performance [36] [59] [60]. This section outlines the two different network topologies utilized in this study. The algorithms used create simple adjacency matrices. Such matrices describe in binary terms how the nodes in the network are connected. A one represents a connection, and subsequently a zero represents no connection. The matrices are directional and self-connecting, that is, node a can be connected to node b while node b does not necessarily need to be connected to node a, also node a can be connected to itself. Let A be an n× n adjacency matrix where the elements Aij are defined as: Aij = 1 if i is adjacent to j, 0 otherwise, for all i, j ∈ N. After the simple adjacency matrix is generated, each of the non-zero elements are then replaced to fulfill the echo state property, this is handled later. The new matrix defines all the connections and corresponding weights in the reservoir, hence called the weight matrix. Below are descriptions of the two algorithms used for generating the simple adjacency matrices. 3.1.1 Scale-Free Network The reason for generating scale free networks in this context is to mimic the complex structure of networks occurring in nature, such as the neural network in brains [4]. Scale-free networks can be generated using the Albert-Barabasi model [5] and is achieved through so called preferential attachment. The process follows two general steps: first generate a small number of connected nodes, second add subsequent nodes with preferential attachment. That is, add connections with a proportional probability based on the degree of each node over the total degree of the network [5]. This approach leads to an undirected network without self connections so a slightly 9 3. Theory edited approach inspired by [69] and [3] is taken to make the network directed and self connecting. The process follows: The algorithm begins by initializing a clique C of size 2, consisting of two fully connected nodes. This is represented by an adjacency matrix A ∈ {0, 1}n×n, where Aij = 1 for i, j ∈ {0, 1}, and Aij = 0 otherwise. Define the set of all nodes as N = {0, 1, . . . 
, n − 1}, where n is the total number of nodes in the network. The initial clique C ⊆ N contains two nodes, and the set S = N \ C represents the nodes that are not initially part of the clique.

For each node s_k ∈ S, the algorithm sequentially adds s_k to the clique C, so that C ← C ∪ {s_k}. Once s_k is added to C, edges between s_k and the nodes in C are established based on preferential attachment. Specifically, for each node s_i ∈ C, the following steps are performed:

1. Sample two random values r_1, r_2 ∼ U(0, 1), where U(0, 1) denotes the uniform distribution on the interval [0, 1].
2. Compute the degree deg(s_i) of node s_i, which is the sum of the entries in the i-th row of the adjacency matrix A:
$$\deg(s_i) = \sum_{j=0}^{n-1} A_{ij}$$
3. Let the total degree of the nodes in C be:
$$\deg(C) = \sum_{i \in C} \deg(s_i)$$
4. Define the probability of connecting s_k to s_i as:
$$P(s_k \to s_i) = \frac{\deg(s_i)}{\deg(C)}$$
If r_1 < P(s_k → s_i), then set A_{ik} = 1 (i.e., create a directed edge from s_k to s_i).
5. Similarly, define the probability of connecting s_i to s_k as:
$$P(s_i \to s_k) = \frac{\deg(s_i)}{\deg(C)}$$
If r_2 < P(s_i → s_k), then set A_{ki} = 1 (i.e., create a directed edge from s_i to s_k).

If, after iterating through all nodes in C, node s_k has not been connected to any other node, i.e. no entry in row A_{k·} or column A_{·k} is set to one, the process is repeated until s_k is connected to at least one node in C. This process is repeated for each node s_k ∈ S until all nodes are part of the clique C, and A then defines the scale-free simple adjacency matrix.

3.1.2 Uniformly Random Networks

Generating uniformly random networks requires fewer steps and is commonly used for ESNs [36]. For every entry in the adjacency matrix, a uniformly random number between 0 and 1 is sampled. If the random number is below a specified probability P, the entry in the adjacency matrix is set to one, otherwise zero. That is: let A be an n × n adjacency matrix. For every i, j ∈ N = {0, 1, . . . , n − 1}, let p ∼ U(0, 1), where U(0, 1) denotes the uniform distribution on the interval (0, 1). Then the elements A_{ij} are defined as:
$$A_{ij} = \begin{cases} 1 & \text{if } p < P, \\ 0 & \text{otherwise,} \end{cases} \quad \text{for all } i, j \in N.$$
Thus a uniformly random topology requires a connectivity parameter P, and A defines the uniformly random simple adjacency matrix.

3.2 Theory of Reservoir Computing Models

Reservoir computing is a recurrent neural network framework designed for, among other things, processing temporal data and solving complex dynamical problems [26]. It consists of a fixed, randomly initialized set of recurrent nodes, the so-called reservoir, which captures and transforms input data into a high-dimensional space. All nodes in the reservoir record states, which are then interpreted by a readout layer. A key feature is that only the readout layer is trained. This makes the approach computationally efficient, since training reduces to a least squares problem, while the network remains capable of learning complex patterns [79]. The architecture is particularly well suited to tasks like forecasting, pattern recognition, and modelling chaotic systems [26][41], a class to which financial markets could arguably belong [33]. Regardless, there are promising results for using reservoir computing models to make predictions in financial markets [33][84]. Following is an abstract outline of how ESNs operate, as described in [61].
In short, the process involves handling input to produce output and then optimizing the output towards a goal. Form two separate sequences:
$$D = \{x_t, y_t\}_{t=0}^{T-1}$$
where:
• The set D is the training data.
• x_t is the input signal at time t.
• y_t is the output signal at time t.
• There exists some relationship between the two signals.
The goal is to learn the relationship between x and y from D, and then, after the training period (for t ≥ T), be able to give estimates ŷ_t for an unknown y_t given some new input x_t. When performing a time series prediction, the goal is often to predict y_t = x_{t+τ} given {x_i}_{i≤t}, which is called a τ-step-ahead prediction. Now that the purpose of an ESN has been established, following is an abstract illustration of the network structure:

Figure 3.1: Illustration of a Reservoir Computing Model

A general reservoir computing model, as illustrated above, consists of the input and output layers (sometimes called the input and readout layers) connected to the n-dimensional dynamical space called the reservoir. The reservoir can be described as follows:
$$S_t = F(S_{t-1}, x_t)$$
where
• S_t ∈ R^n is the state of the reservoir at time t, consisting of {s_{i,t}}_{i=0}^{n-1}, the individual states of the reservoir nodes.
• F : R^n × R → R^n is the map describing the dynamics of the reservoir.
This means that the state of the reservoir S_t evolves over time according to the dynamics F and the input x_t. The approximation ŷ_t of y_t is calculated from the reservoir as follows:
$$\hat{y}_t := \sum_{i=0}^{n-1} w_i s_{i,t}$$
That is, the states of the nodes in the reservoir are read and scaled by the output weights {w_i}_{i=0}^{n-1}. The output weights are chosen by making the approximation as close to the output signal as possible on the training data D, that is, ŷ_t ≃ y_t for t ∈ [0, T − 1]. Typically, this is done by a least squares method.

3.2.1 The Echo State Property

For the functionality of the ESN it is essential that the state of the network is uniquely determined by any left-infinite sequence of inputs [32]. This means that, regardless of the initial state of the network, the current state is fully determined by the infinite history of past inputs, ensuring that the influence of any arbitrary initial conditions vanishes over time. This can be guaranteed through the echo state property (ESP). A few formal definitions of the ESP have been proposed, differing in restrictiveness, in the exact ESN formulation, and in practicality [32][37][38][39]. Following is the definition offered by [32].

Being uniquely determined by any left-infinite sequence depends on the input sequence. Therefore, the definition of the ESP is stated by constraining the input sequence's range to a compact set, denoted X. We also need to relate X to the set of all possible states of the nodes in the reservoir, denoted S. This leads to the compactness condition, which states: let F be the map defining the dynamics of the reservoir, where
• F is defined on S × X,
• S ⊂ R^n,
• X is a subset of the input space,
• X and S are compact sets,
• F(S_{t−1}, x_t) ∈ S for x_t ∈ X and any t ∈ Z.
Now, S being compact is guaranteed when the activation function f of the reservoir nodes is bounded. A common example of a bounded activation function is the hyperbolic tangent, tanh. Practically speaking, the inputs x_t will also be bounded, making X a compact set.
Now, let X^{−∞} and S^{−∞} denote the sets of left-infinite input and state sequences, respectively, defined as:
• X^{−∞} := {x_{−∞} = {x_i}_{i=−∞}^{0} | x_t ∈ X for any t ≤ 0}
• S^{−∞} := {S_{−∞} = {S_i}_{i=−∞}^{0} | S_t ∈ S for any t ≤ 0}
A state sequence S_{−∞} is compatible with x_{−∞} when:
• S_t = F(S_{t−1}, x_t), ∀t ≤ 0.
Equipped with this, the definition of the ESP follows: a network with dynamics F : S × X → S satisfying the compactness condition has the echo state property if, for any input sequence x_{−∞} ∈ X^{−∞} and any two state sequences S_{−∞}, K_{−∞} ∈ S^{−∞} compatible with x_{−∞}, it holds that S_0 = K_0.

This definition implies that states that are nearby must be the result of similar input histories [8]. Consequently, the current state should depend more on recent inputs and states than on historic ones. In practice, one way of achieving the echo state property is to assume a compact set of input vectors, which almost always holds, and to adjust the reservoir weight matrix accordingly. So, for the standard ESN, construct the reservoir weight matrix W^res in the following manner to guarantee the ESP [32]:
1. Randomly initialize W^res ∈ R^{n×n} such that all entries w^res_{ij} ≥ 0, i.e., they are non-negative.
2. Scale W^res such that the spectral radius is less than 1, i.e., ρ(W^res) < 1.
3. Change the sign of a desired number of weights w^res_{ij} so that negative connections are introduced.
This guarantees the ESP; a few ways to prove it can be found in [32][39][38]. It should be noted, as explicitly stated in [40] and shown in [32], that randomly initializing W^res and then scaling the spectral radius below unity is not necessary for the ESP, and spectral radii above unity might be optimal for a specific application.

3.2.2 Calculating Output

Previously, the calculation of the output, the approximation ŷ, was described as multiplying the states of the reservoir by the weights of the output layer; that is a simplification. This section outlines the full calculation from input to output. To start with, we define all the vectors, matrices and activation functions used to define the dynamics of the reservoir F and the subsequent calculation of the output.

Figure 3.2: Illustration of a Reservoir Computing Model, Complemented with the Vectors, Matrices and Activation Functions used to Define the Dynamics of the Reservoir

First we have the matrices:
• W^in ∈ R^{n×k}, where dim(x_t) = k, the number of dimensions of the input signal.
• W^res ∈ R^{n×n}, where n is the number of nodes in the reservoir.
• W^out ∈ R^{m×n}, where dim(y_t) = dim(ŷ_t) = m, the number of dimensions of the output signal and its approximation.
Second, the bias vectors:
• b^in ∈ R^n
• b^out ∈ R^m
Note that not all nodes in the reservoir are necessarily connected to the input or output layers. For connected nodes, the entries in the corresponding weight matrix and bias vector are non-zero; for non-connected nodes, the entries are zero.
Third, the state vector of the reservoir:
• S_t ∈ R^n, the reservoir state at time t.
• S_{t,i} denotes the state of node i at time t.
And fourth, the three activation functions:
• f^in, the element-wise activation function of the input layer. Usually just the identity.
• f^res, the element-wise activation function of the reservoir. Usually the hyperbolic tangent.
• f^out, the element-wise activation function of the output layer. Usually the identity or a rectifier.
Last, as suggested in [68], [27] and [10], we introduce a leakage rate, α ∈ (0, 1].
The leakage rate helps control how fast the state of the reservoir changes with new input, allowing the network to learn slower dynamical systems and to recognize strongly time-warped patterns [27]. With all terms defined, the first step is to calculate S_t using the dynamics of the reservoir F driven by x_t:
$$S_t = F(S_{t-1}, x_t)$$
where the map is:
$$S_t = (1 - \alpha)S_{t-1} + \alpha f^{res}\left(W^{res}S_{t-1} + f^{in}\left(W^{in}x_t + b^{in}\right)\right)$$
With the updated state S_t, we can calculate the output ŷ_t:
$$\hat{y}_t = f^{out}\left(W^{out}S_t + b^{out}\right)$$
Having calculated the output, it can be used to train the output weights and/or to make predictions, as discussed in the following sections.

3.2.3 Offline Learning: Ridge-Regression

Offline training entails training the model on a fixed historical dataset. The only coefficients in the model that are optimized are W^out and b^out. Because f^out is typically linear, the calculation of the output can be written as:
$$\hat{y}_t = \bar{W}^{out}\bar{S}_t$$
where W̄^out = [W^out  b^out] appends the bias as an extra column and S̄_t = [S_t; 1] appends a constant one to the state, allowing for simpler notation. The purpose of the ESN is to match the input-output pair [X, Y]. This is done by feeding x_t into the reservoir, updating the states S_t and calculating the output ŷ_t as above, where ŷ_t should be as close to y_t as possible. To achieve this, one can gather all the states produced by X and minimize the squared error with respect to the weights and bias. Assuming we have T time steps of training data, we want to solve the following equation for W̄^out:
$$Y = \bar{W}^{out}\bar{S}$$
where Y ∈ R^{m×T}, S̄ ∈ R^{(n+1)×T}, and W̄^out ∈ R^{m×(n+1)}. Since the system is usually overdetermined, T ≫ n + 1, it generally has no exact solution, and a least-squares solution is sought instead. Several methods exist for this; ridge regression is the most universal and stable [57]:
$$\bar{W}^{out} = Y\bar{S}^{\top}\left(\bar{S}\bar{S}^{\top} + \lambda I\right)^{-1}$$
where ⊤ denotes the transpose, I is the identity matrix, (·)^{−1} represents the matrix inverse, and λ is the regularization coefficient. λ is included to penalize large absolute values in W̄^out. If W̄^out has very large absolute values, tiny differences in S̄ may be amplified. This can mean worse generalization for out-of-sample inputs that produce previously unseen states, and potentially massive errors. Picking λ such that this does not happen is important, and it is therefore included as a parameter that needs optimization.

3.2.4 Online Learning: Stochastic Gradient Descent

Online learning enables the model to adapt to new data. This is desirable when the time series X is non-stationary [72][90], which financial time series can be. Online learning through stochastic gradient descent (SGD) achieves this by, at each time step, shifting the weights and bias such that the squared error becomes smaller. SGD is formulated as follows [67]. The objective of SGD is to minimize the average value of a function L:
$$L(\theta) = \mathbb{E}_{q(p)}[L(\theta, p)]$$
where θ are the parameters we want to optimize and p is a random input to L, for example a sample from the training set. At each step of the optimization, we assume that we observe:
$$L_t(\theta) = L(\theta, p_t)$$
with p_t ∼ q. We also assume that an unbiased estimate of the gradient of L can be computed. This holds if the distribution q(p) is independent of θ, in which case we can use:
$$g_t = \nabla_{\theta} L_t(\theta_t)$$
With that, we can follow the negative gradient and iteratively update the parameters:
$$\theta_{t+1} = \theta_t - \eta_t g_t$$
where η is the learning rate, determining how far each update steps in the negative gradient direction. An illustrative sketch of this update applied to the readout layer is given below.
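To make the update concrete for the readout layer, the following is a minimal sketch of a single SGD step, written in Java in the spirit of the thesis's implementation but not taken from it. It assumes a squared-error loss L_t = ||y_t − W̄^out S̄_t||², an identity output activation, and that the bias has been absorbed into the extended state vector S̄_t = [S_t; 1]; the class, method and parameter names are illustrative.

/**
 * Illustrative sketch of one stochastic-gradient-descent step for the linear
 * readout of an ESN. Assumes a squared-error loss L_t = ||y_t - wOut * s_t||^2,
 * where s_t already contains a constant bias entry, so wOut absorbs b_out.
 * Names and signatures are illustrative, not taken from the thesis code.
 */
public final class ReadoutSgd {

    /**
     * Performs wOut <- wOut + 2 * eta * (y - yHat) * s^T in place.
     *
     * @param wOut readout weights, shape m x (n + 1)
     * @param s    extended reservoir state [S_t; 1], length n + 1
     * @param y    target output at time t, length m
     * @param eta  learning rate
     * @return the prediction yHat made before the update
     */
    public static double[] sgdStep(double[][] wOut, double[] s, double[] y, double eta) {
        int m = wOut.length;
        int n1 = s.length;

        // Forward pass: yHat = wOut * s (f_out is the identity).
        double[] yHat = new double[m];
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int j = 0; j < n1; j++) {
                acc += wOut[i][j] * s[j];
            }
            yHat[i] = acc;
        }

        // The gradient of the squared error w.r.t. wOut is -2 * (y - yHat) * s^T,
        // so stepping against the gradient adds 2 * eta * (y - yHat) * s^T.
        for (int i = 0; i < m; i++) {
            double err = y[i] - yHat[i];
            for (int j = 0; j < n1; j++) {
                wOut[i][j] += 2.0 * eta * err * s[j];
            }
        }
        return yHat;
    }
}

The factor of two from the gradient of the squared error is kept explicit here; in practice it can be absorbed into the learning rate.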
Because the method approximates the error space by a hyper plane and traverses it in the steepest instantaneous direction, η needs to be optimized. If picked too large the new weights will overshoot the optimal solution and too small never adapt to a non-stationary time series. The choice of η is handled in the methods section. 16 3. Theory Now, θ represents the readout layer, that is W out and bout. The calculation of the gradient of the loss function with respect to the two is also handled in the methods section. 3.3 Theory of Genetic Algorithms Thus far a few parameters needed to construct and train ESNs have been introduced. This section outlines some theory of genetic algorithms (GAs), its usefulness in parameter selection and how they can be implemented. GAs mimic Darwinistic evolution [66]. This is why most terminology is borrowed directly from the world of biology. A general genetic algorithm proceeds as follows [66] [85]: Figure 3.3: Schematic of a General Genetic Algorithm Where a more detailed explanation follows: 1. Initialize the population by randomly generating N different chromosomes, ci, where i = 0, . . . , N − 1, from the genome (the search space). 2. Evaluate the individuals (a) Decode the chromosomes into their corresponding parameters, pi, where i = 0, . . . , N − 1. 17 3. Theory (b) Evaluate the objective function g using the parameters. Assign a fitness value to each individual Gi = g(pi), where i = 0, . . . , N − 1. 3. Form the next generation (a) Select two parents, i1 and i2, such that individuals with a better fitness value have a higher probability of being selected. (b) Generate two children, j1 and j2, by crossing the two parents’ chromo- somes ci1 and ci2 . (c) Mutate the two children j1 and j2. (d) Repeat until you have a new generation of N children. 4. Return to step 2 unless the termination criteria have been achieved. Where the different terms are briefly explained as: • Population: The set of individuals with size N . • Individuals: The different sets of parameters being evaluated, encoded as chromosomes. • Decode: Translating the numerical values stored in the chromosome into the parameters. • Objective function: The function being optimized. In the case of this paper, it is the mean squared error of the Echo State Network (ESN) on the validation set. • Fitness value: The value of the objective function for an individual. In the case of mean squared error, lower is better. • Generation: The population during a specific iteration of the algorithm. • Parents: The individuals selected in the current generation to create the next. • Children: The product of the parents after crossing and mutation, constitut- ing the next generation. • Termination Criteria: The conditions that halt the algorithm. This could be a specified number of iterations, a certain fitness value achieved, or a max- imum number of iterations without an improvement in the fitness value. The three operators central to genetic algorithms are selection, crossover and mu- tation [66]. Different versions of the three exist and the ones utilized in this thesis are as follows: Tournament Selection 1. Randomly select ntournament individuals from the population. 2. Order the selected individuals from best to worst. 3. Select the best individual with probability ptournament. Continue down the ordered list until an individual has been selected, or there is only one individual left. In that case, select the last individual. 18 3. 
Theory Single Point Crossover Figure 3.4: Schematic of a Single Point Crossover 1. Given two selected parents, cross them with probability pcross. Otherwise, set the children to be exact copies of the parents. 2. Uniformly at random select a crossover point in the chromosome. 3. Set the first child to be the first parent’s genes before the crossover point, and the second parent’s genes after the crossover point. Vice versa for the second child. Modified Bit-Flipping Mutation Typically the chromosomes are encoded as strings of bits, ones and zeros. Then, during mutation every bit in each chromosome is flipped with a probability pmut. This means that for a random 32-bit string representing a number in a specified range, you have the following distribution of absolute change for a mutation of pmut = 5/32: Figure 3.5: Probability Distribution of Absolute Change in a 32-bit String with pmut = 5/32 19 3. Theory The distribution reveals that smaller changes, relative to the allowed range, are significantly more frequent when flipping bits compared to mutations that alter 90-100% of the allowed range. In essence, bit flipping facilitates both small and large changes, but the likelihood of larger changes diminishes progressively as their magnitude increases. Now, in the implementation in this thesis, instead of using bits, floating point num- bers are used. One way of implementing mutation for floating point encoding is to uniformly pick a number in the allowed range. This loses the property described earlier of bit-flipping mutation. So, the method used in [7] is employed instead. The process follows: Given an allowed range R = [cMin, cMax] and a value c that should be mutated, randomly pick a direction up or down for c. Without loss of generality we assume up was picked. This means the range the mutated value can fall into is R = [c, cMax]. Partition R into a equal parts of width w. Now, we want to pick a sub-range Rsub: Rsub = [c, c + wi] i ∈ Z, i ∼ U[1, a] Then pick a number cmut uniformly at random in Rsub. Let cmut replace c in the chromosome. This method of mutation leads to the following distribution of absolute change given a mutation occurs: Figure 3.6: Probability Distribution of Absolute Change in Modified Bit-Flipping The method enables mutations to result in both large and small changes but smaller changes are more likely, which is the property desired. GA-Parameters With the evolutionary operators defined, to run the genetic algorithm requires the following parameters: 1. N , Population size 2. M , Number of iterations 20 3. Theory 3. ntournament, Tournament size 4. pcross, Crossover probability 5. pmut, Mutation probability 6. [cmin i , cmax i ], range of chromosome i 7. ai accuracy of chromosome i 3.3.1 Premature Convergence A common issue when implementing genetic algorithms is fitness values converging to a local optima and being unable to find children outperforming parents. This is called premature convergence [22]. There are several factors that play a role in premature convergence, but the main culprit is a lack of genetic diversity [64]. Various techniques have been developed to address this issue [28]. While choosing an appropriate approach one should strike a balance between exploration of the search space and exploitation of solutions already found. Three measures are implemented in this thesis where the first two are inspired by [1] and the third by [25] [64]. They follow below. 1. 
Linearly Decreasing Crossover Rate
The idea is that, initially, the population is diverse, and exploiting the best solutions to a higher degree will not cause premature convergence. Later on, when the population converges on a few solutions, a lower crossover rate should allow the algorithm to still explore new solutions. The linearly decreasing crossover rate is implemented as follows:
$$p^{cross}_i = p^{cross}_{max} - \frac{p^{cross}_{max} - p^{cross}_{min}}{M - 1} \cdot i$$
where:
• M is the number of iterations.
• p^cross_i is the crossover rate for iteration i = 0, . . . , M − 1.
• p^cross_max is the initial crossover rate.
• p^cross_min is the final crossover rate.

2. Linearly Increasing Mutation Rate
The idea is similar to the previous technique. Initially, the population is diverse, and a high mutation rate would push the algorithm toward a random search. Later on, a higher mutation rate helps escape local optima. The linearly increasing mutation rate is implemented as follows:
$$p^{mut}_i = p^{mut}_{min} + \frac{p^{mut}_{max} - p^{mut}_{min}}{M - 1} \cdot i$$
where:
• M is the number of iterations.
• p^mut_i is the mutation rate for iteration i = 0, . . . , M − 1.
• p^mut_min is the initial mutation rate.
• p^mut_max is the final mutation rate.

3. Stagnation Reset and Elitism
Since it is difficult to determine a satisfactory fitness level, the algorithm is terminated after a certain number of iterations. In some cases, it converges prematurely despite the efforts mentioned above. One could decide to halt the algorithm after a certain number of stagnated iterations, that is, iterations with very marginal or no progress. The approach taken instead is to reintroduce genetic diversity by resetting the population after fitness progression has stagnated. This almost certainly worsens fitness at first, as the fitness distribution of a reset population is equivalent to that of the first generation. To prevent fitness from regressing, the individual with the best fitness so far is copied directly to the next generation. This is known as elitism. Elitism is applied every iteration, regardless of stagnation resets. Elitism and stagnation resets are implemented as follows:
1. After crossing and mutating in each iteration:
• If fitness improved, set the stagnation counter c_stagnation = 0; otherwise set c_stagnation ← c_stagnation + 1.
2. If c_stagnation = N_stagnation, set all children randomly.
3. After all operators (and possibly resetting), apply elitism:
• Set one of the children to be an exact copy of the best parent.
Note that these techniques introduce additional parameters for the genetic algorithm, such as N_stagnation and the ranges for p^cross and p^mut.
It should also be noted that the selection method used affects premature convergence. Using a ranked system, such as tournament selection, instead of fitness-proportional selection, can potentially mitigate premature convergence [85].

3.4 Overfitting

In most machine learning applications, over-fitting is a concern [34][85][77]. In general, the purpose of machine learning is to use some known input-output pairing to train a model such that it can generalize to unseen data, that is, produce predictions based on novel input. Under certain conditions, the model might fit the seen data very well but fail to make accurate predictions for unseen data; this is known as over-fitting. Usually, this occurs when the training methods over-correct for the noise in the data.
Considering the methods used in this paper, where three machine learning algo- rithms are employed, over-fitting might be a concern. Two remedies for over-fitting are using large datasets and validating the results before testing. Fortunately, the data available is quite extensive. However, validating both GA and the training of the ESN requires splitting the dataset into validation and testing sets for both mod- els. Another common approach is to use cross validation techniques, how this should be done for temporal data and reservoir states is unclear. Instead, the splitting is done as follows: 22 3. Theory • Warm-up Set – washes out the initial random states. • Training Set – used to optimize the readout layer. • Validation Set – validates the ESN training and is used for evaluation in the GA. • Test Set – tests the GA-optimized ESN. With this setup, the issue of over-fitting could arise in the genetic algorithm on the validation set. However, the genetic algorithm is somewhat blunt in the sense that it only indirectly interacts with the noise in the data. To check if the setup is over-fitting the validation set, which is used to train the genetic algorithm, the training error is measured for the best performing individual (set of parameters). If the training error starts to consistently increase, that would be an indication of overfitting the validation set. It is important to emphasize that the test set remains completely unseen by all op- timization algorithms, ensuring it serves as an unbiased evaluation of the model’s performance. The fitness on the validation set is exclusively used to guide the se- lection of the best network. If the genetic algorithm begins to overfit the validation set, this would likely manifest as both an increase in training error and poor perfor- mance on the test set. In such cases, alternative validation techniques or more robust data-splitting strategies should be explored to mitigate overfitting and improve the generalizability of the model. 23 3. Theory 24 4 Methods When building an ESN and running a GA, there are plenty of design choices that needs to be made. There are also plenty of ways to process data and evaluate results. This chapter describes the approaches taken. 4.1 Data The full historical data from the London Metal Exchange (LME) comes at a signif- icant cost [48]. However, Westmetall provides access to a subset of historical data on their website [86]. The "cash settlement" data refers to the closing offer from the daily second ring round of cash futures, while the "three month" data corresponds to the closing offer from the second daily ring round of the three-month futures contract. It is important to note that these are not the official prices determined by the LME. Official prices typically fall between the final bid and offer of the second daily ring round, with final offers generally being slightly higher, though rarely ex- ceeding a single trading tick ($0.5 or $0.1 depending on the metal). Additionally, the "stock" data provided by Westmetall represents the official stock levels recorded by LME-approved warehouses. The metals Westmetall provide data for are aluminium, copper, lead, nickel, tin, zinc. All metals but tin are mentioned in the Northvolt 2023 annual report[70]. What ratios or volumes used in their manufacturing process is not public informa- tion. Unfortunately, the data available on the Westmetall website is not provided in a readily downloadable format. 
As a result, it is necessary to scrape the data table directly from the HTML code. The algorithm employed for this purpose utilizes the Jsoup library [35] for parsing the HTML and the OpenCSV library [78] for handling CSV operations. The process operates as follows:

Algorithm 1 Extract Table Data from HTML and Write to CSV
1: Connect to the target URL
2: Select CSS query "table"
3: for each table in document do
4:   for each row in table (table.select("tr")) do
5:     for each cell in row (row.select("td")) do
6:       Write cell content to CSV
7:       Append comma "," after each cell
8:     end for
9:     Write new line in CSV after all cells in the row are processed
10:  end for
11: end for

This produces a CSV file where each line consists of a date, cash settlement, three month, and stock. The CSV files are then read into Java.

The data is then transformed to reflect the daily percentage change. This transformation results in the loss of the very first data point, as it is used as a reference. Various transformations were tested, including using the raw data or multiplicative changes, but the percentage change consistently yielded the best results. It is also an intuitive and easily interpretable format compared to other options, such as log changes. The transformation is applied to all data types (cash settlement, three month, and stock) as follows. Let i = 0, ..., T − 1, then

percentageChange[i] = \frac{data[i + 1] - data[i]}{data[i]} \times 100

The percentage changes of the different data types are then assembled into input and output matrices x and y to form paired sets for one-step predictions. This is done such that

y_{i-1} = x_i, for i = 1, ..., T

where each pair [x_i, y_i] is used for a one-step prediction, in which x_i, along with its history, aims to predict y_i.

The [x, y] data is then split into a warmup, training, validation, and test set. The warmup set should be large enough to wash out the initial states of the reservoir; 1000 data points of warmup and 2000 of training are suggested in [42]. Considering the finite amount of data in this application, such a ratio between warmup and training (1:2) might be too high. One also has to take into account that training the network with ridge regression requires an overdetermined system of equations. How to split the data is a vast topic of discussion. Ultimately, the split decided upon is:

Phase        Percentage
Warmup       10%
Training     85%
Validation   2.5%
Testing      2.5%

Because the number of available data points differs slightly among the metals, a percentage of the available data is used instead of a fixed number of data points. With the data collected, transformed, and split into different sets, it is time to describe the specifics of the ESN, which is the topic of the next section.
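The transformation, pairing, and split above amount to a few lines of array manipulation. The following Java sketch illustrates them for a single data type for brevity; the method names and the hard-coded ratios mirror the text but are otherwise illustrative, not the thesis code.

```java
// Illustrative sketch of the percentage-change transform, one-step pairing, and data split.
public class DataPrepSketch {

    // percentageChange[i] = (data[i+1] - data[i]) / data[i] * 100; the first raw point is lost.
    static double[] percentageChange(double[] data) {
        double[] pct = new double[data.length - 1];
        for (int i = 0; i < pct.length; i++)
            pct[i] = (data[i + 1] - data[i]) / data[i] * 100.0;
        return pct;
    }

    // Pair inputs and targets for one-step prediction: x[t] = pct[t] predicts y[t] = pct[t+1].
    static double[][][] pairForOneStep(double[] pct) {
        int T = pct.length - 1;
        double[][] x = new double[T][1], y = new double[T][1];
        for (int t = 0; t < T; t++) { x[t][0] = pct[t]; y[t][0] = pct[t + 1]; }
        return new double[][][] { x, y };
    }

    // Index boundaries for the 10% / 85% / 2.5% / 2.5% split used in the thesis.
    static int[] splitBoundaries(int T) {
        int warmupEnd = (int) (0.10 * T);
        int trainEnd  = warmupEnd + (int) (0.85 * T);
        int valEnd    = trainEnd + (int) (0.025 * T);
        return new int[] { warmupEnd, trainEnd, valEnd, T };
    }
}
```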
4.2 The Echo State Network

This section describes how the ESN is initialized and how warmup, training, validation, and testing are performed. The activation functions used are: f^{in} – unity, f^{res} – tanh, and f^{out} – unity. This results in the following formulations for updating the states S_t and calculating the output ŷ_t:

S_t = (1 - α) S_{t-1} + α \tanh(W^{res} S_{t-1} + \bar{W}^{in} \bar{x}_t)
ŷ_t = \bar{W}^{out} \bar{S}_t

where the bar indicates that the matrix or vector has been extended to accommodate a bias term, and tanh(·) denotes the element-wise hyperbolic tangent.

4.2.1 Initializing Networks

To initialize the network, the initial state vector S_0 and the matrices W^{in}, W^{out}, and W^{res} have to be defined and populated. There are a few legitimate choices that can be made. The approach taken is the following:

S_i = 0, ∀ i ∈ [0, n − 1]

where n is the reservoir size. Setting S to reasonable values is later handled by the warmup.

Having all of the nodes in the reservoir connected to both the input and output layers is not always optimal. To control this, two sets of connections to the reservoir nodes are needed, for W^{in} and W^{out}:

NodesIn = {s_i ∈ U_d[0, n − 1] | s_i drawn without replacement, for i ∈ [0, inLength − 1]}
NodesOut = {s_i ∈ U_d[0, n − 1] | s_i drawn without replacement, for i ∈ [0, outLength − 1]}

where U_d denotes the discrete uniform distribution, and inLength and outLength are the numbers of non-zero connections to the reservoir.

Since W^{out} is determined during training with a ridge regression, which does not require calculating output because the desired output is known, there is no need to randomly initialize that matrix. NodesOut is used to determine which nodes' states should be collected to form the design matrix. The entries of \bar{W}^{out} are then calculated as described in the theory section.

W^{in}, on the other hand, needs to be randomly initialized. Controlling the distribution of W^{in} in relation to the input types and the reservoir is a desirable feature. Therefore, an input scale for each of the data types and a bias scale are used. W^{in} is initialized as:

W^{in}_{j,i} ∈ U[−inScale_j, inScale_j]  ∀ i ∈ NodesIn, j ∈ [0, k − 1]
W^{in}_{j,n} ∈ U[−inBiasScale, inBiasScale]  ∀ j ∈ [0, k − 1]

where k is the number of data types.

When it comes to W^{res}, the process becomes more involved, considering the echo state property (ESP). The process is as follows. Let A denote the n × n adjacency matrix, initialized either uniformly random or scale-free as described in the theory section. Initialize the weights as

W^{res}_{i,j} ← A_{i,j} U(0, 1]  ∀ i, j ∈ [0, n − 1]

such that the non-zero elements are set uniformly at random on (0, 1]. The next step is to scale the weights such that W^{res} has a specified spectral radius. This is done by calculating the current spectral radius of W^{res} and then multiplying all weights by the ratio of the wanted spectral radius to the current spectral radius. The spectral radius is approximated using power iteration. The process follows [87][9]:

Algorithm 2 Power Iteration for Largest Eigenvalue
1: Randomly initialize b_k ∈ R^n with ∥b_k∥ ≠ 0; b_k is the initial guess for the dominant eigenvector.
2: for iter = 1 to MaxIterations do
3:   b_{k+1} ← W^{res} b_k
4:   ρ ← (b_{k+1}^T b_{k+1}) / ∥b_{k+1}∥
5:   b_{k+1} ← b_{k+1} / ∥b_{k+1}∥   {normalize b_{k+1}}
6:   if ∥b_{k+1} − b_k∥ < tolerance then
7:     break
8:   end if
9:   b_k ← b_{k+1}   {update for the next iteration}
10: end for

Here, b_k is a normalized vector that approximates the dominant eigenvector of W^{res}, which corresponds to the largest eigenvalue. The spectral radius ρ is approximated as the magnitude of the largest eigenvalue of W^{res}, estimated iteratively using the updated b_{k+1} vector. The power iteration is run with MaxIterations = 1000 and tolerance = 1e−9. This gives a close approximation of the spectral radius of W^{res}, and the weights can then be rescaled to the wanted spectral radius:

W^{res} ← \frac{ρ_{wanted}}{ρ} W^{res}

To make W^{res} include both positive and negative values, approximately half of the non-zero values are randomly set to be negative. This concludes the network initialization.
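A compact Java version of Algorithm 2, together with the rescaling step, could look as follows. It operates on plain double[][] matrices rather than the EJML types used in the thesis code, and the class and method names are illustrative. Note that line 4 of Algorithm 2 reduces to the norm of b_{k+1} when b_k is normalized, which is what the sketch computes.

```java
import java.util.Random;

// Illustrative power-iteration estimate of the spectral radius and rescaling of W_res.
public class SpectralRadiusSketch {

    static double spectralRadius(double[][] w, int maxIterations, double tolerance, Random rng) {
        int n = w.length;
        double[] b = new double[n];
        for (int i = 0; i < n; i++) b[i] = rng.nextDouble() + 1e-3;   // non-zero initial guess
        normalize(b);
        double rho = 0.0;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] bNext = multiply(w, b);                          // b_{k+1} = W_res * b_k
            rho = norm(bNext);                                        // eigenvalue magnitude estimate
            if (rho == 0.0) break;                                    // degenerate matrix
            normalize(bNext);
            if (distance(bNext, b) < tolerance) { break; }            // converged
            b = bNext;
        }
        return rho;
    }

    // Rescale W_res in place so that its spectral radius is approximately rhoWanted.
    static void rescale(double[][] w, double rhoWanted, Random rng) {
        double rho = spectralRadius(w, 1000, 1e-9, rng);
        if (rho == 0.0) return;                                       // nothing to rescale
        double factor = rhoWanted / rho;
        for (double[] row : w) for (int j = 0; j < row.length; j++) row[j] *= factor;
    }

    static double[] multiply(double[][] w, double[] b) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < b.length; j++) out[i] += w[i][j] * b[j];
        return out;
    }
    static double norm(double[] v) { double s = 0; for (double x : v) s += x * x; return Math.sqrt(s); }
    static void normalize(double[] v) { double n = norm(v); for (int i = 0; i < v.length; i++) v[i] /= n; }
    static double distance(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```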
4.2.2 Warmup and Training

Warmup simply means passing input to the reservoir and updating the states. This ensures that the initial state, where all states are zero, does not affect the subsequent training process. When the warmup set has been passed through, training begins. This is done by passing the training set through the reservoir and collecting the states of the output nodes, creating the design matrix. The output weights and bias are then calculated using ridge regression, as described in the theory section.

There are a few ways one could optimize λ separately, since the computational expense is quite low: the same design matrix can be reused and the network does not have to be reinitialized. However, employing, for instance, a logarithmic grid search for λ in conjunction with the GA for the other parameters adds up and ultimately slows down the optimization significantly. λ is therefore also included in the genome of the GA.

4.2.3 Validation and Testing

With the output weights and bias set from the offline training, the model can now produce predictions. To validate the model, the validation set is passed through the reservoir and approximations ŷ_t are calculated. Since the model is adaptive through online training, at each time step the error gradient is calculated and the weights are slightly adjusted. As discussed in the theory section, we need the gradient of the error with respect to the weights and bias. Defining the loss function L_t as follows, we have:

L_t = \frac{1}{2} ∥ŷ_t − y_t∥^2
ŷ_t = f^{out}(z_t)
z_t = \bar{W}^{out} \bar{S}_t

That is, L_t is a function of ŷ_t, which is a function of the activation z_t, which is in turn a function of \bar{W}^{out}. Using the chain rule, the gradient of L_t can be expressed as

\frac{∂L}{∂\bar{W}^{out}} = \frac{∂L}{∂ŷ} \cdot \frac{∂ŷ}{∂z} \cdot \frac{∂z}{∂\bar{W}^{out}}

where:

\frac{∂L_t}{∂ŷ_t} = ŷ_t − y_t
\frac{∂ŷ_t}{∂z_t} = 1, since f^{out} is unity
\frac{∂z_t}{∂\bar{W}^{out}} = \bar{S}_t

and \bar{W}^{out} is updated as:

\bar{W}^{out} ← \bar{W}^{out} − η ∇L_t
\bar{W}^{out} ← \bar{W}^{out} − η (ŷ_t − y_t) \bar{S}_t^T

This is done at every step of both validation and testing. The learning rate η has to be set such that the model neither over-corrects for small errors nor fails to adapt to changing dynamics. The optimization of η is handled by the GA.

When the model has produced predictions for the validation and test sets, the results are measured using a few test metrics. These are the topic of the following section.

4.2.4 Calculating Test Metrics

To measure how well the model makes predictions, the mean squared error (MSE) is used. Since the output has multiple dimensions, the mean squared error is the mean of the MSE over the different dimensions. Given T predictions and m dimensions, it is calculated as

MSE = \frac{1}{Tm} \sum_{i=1}^{T} \sum_{j=1}^{m} (y_{j,i} − ŷ_{j,i})^2

This is the metric used as fitness in the GA. The primary concern of the model is to predict the next-day cash price. Therefore, the MSE of the cash dimension of y and ŷ is the metric of ultimate interest. These values are denoted y_{cash} and ŷ_{cash}, respectively. Given T predictions, MSE_{cash} is calculated as

MSE_{cash} = \frac{1}{T} \sum_{i=1}^{T} (y_{cash,i} − ŷ_{cash,i})^2

In addition to MSE, three other evaluation metrics are calculated to assess the efficacy of the model: mean absolute error (MAE), coefficient of determination (R^2), and directional accuracy. Omitting the cash subscript for simplicity, these metrics are calculated as follows:

MAE = \frac{1}{T} \sum_{i=1}^{T} |y_i − ŷ_i|

R^2 = 1 − \frac{\sum_{i=1}^{T} (y_i − ŷ_i)^2}{\sum_{i=1}^{T} (y_i − \bar{y})^2}

where \bar{y} denotes the mean of y.

Directional Accuracy = \frac{1}{T} \sum_{i=0}^{T−1} 1_{y_i, ŷ_i}, where 1_{y_i, ŷ_i} = 1 if sign(y_i) = sign(ŷ_i), and 0 otherwise.
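A small Java sketch of these metrics, applied to a single output dimension such as the cash series of percentage changes, might look as follows. The method names are illustrative; only the formulas above are taken from the text.

```java
// Illustrative implementation of the evaluation metrics defined above,
// applied to a single output dimension such as the cash percentage change.
public class MetricsSketch {

    static double mse(double[] y, double[] yHat) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += (y[i] - yHat[i]) * (y[i] - yHat[i]);
        return s / y.length;
    }

    static double mae(double[] y, double[] yHat) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += Math.abs(y[i] - yHat[i]);
        return s / y.length;
    }

    static double rSquared(double[] y, double[] yHat) {
        double mean = 0;
        for (double v : y) mean += v;
        mean /= y.length;
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < y.length; i++) {
            ssRes += (y[i] - yHat[i]) * (y[i] - yHat[i]);
            ssTot += (y[i] - mean) * (y[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    // Fraction of days on which the predicted sign matches the actual sign.
    static double directionalAccuracy(double[] y, double[] yHat) {
        int hits = 0;
        for (int i = 0; i < y.length; i++)
            if (Math.signum(y[i]) == Math.signum(yHat[i])) hits++;
        return (double) hits / y.length;
    }
}
```

For the random walk comparison described next, yHat would simply be an array of zeros in percentage-change space.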
Because the model works on the transformed version of the underlying data, it may also be of interest to look at the predictions in the original, reverted form. Since the original values of y are already available, the only thing one has to calculate is the reverted ŷ, which is done as

ŷ^{reverted}_i = x^{original}_i \left(1 + \frac{ŷ^{percent}_i}{100}\right)

Note that the reverted and original superscripts are omitted elsewhere for simplicity.

As outlined in the theory section, the RW model predicts that the expected value of the next price is the same as the current price. In terms of percentage change, the model predicts an expected 0% change every day. This makes its predictions very convenient to calculate. The result metrics for the RW model are calculated in exactly the same way as above, just replacing ŷ_{cash} with 0 and ŷ^{reverted}_{cash} with x^{original}_{cash}. The directional accuracy is not very useful, considering the RW model predicts no directional change; therefore, the ratio of upward daily movements is calculated instead, to give context to the results of the ESN's predictions. It is calculated as

Directional Accuracy = \frac{1}{T} \sum_{i=0}^{T−1} 1_{\{y_i ≥ 0\}}

4.3 Genetic Algorithm

This section outlines the approach taken for the genetic algorithm. Having established all the parameters needed for the initialization, offline, and online training of the ESN, as well as a way of measuring how well the model makes predictions, the setup for the GA can be established. First off is the search space for the parameters, the genome, which is the topic of the following section.

4.3.1 The Genome

The genome defines the search space for the parameters. These are the values used:

Parameter        Min Value   Max Value   Precision
Topology         0           1           1
Connectivity     0.01        0.15        5
LearningRate     0           0.01        5
Regularization   -5          5           5
SpectralRadius   0.5         1           7
LeakageRate      0           1           7
ReservoirSize    500         1200        7
inLength         50          1200*       7
outLength        50          1200*       7
inBiasScale      0           100         5
inFactorScales   0           3           5

The precision is the number of intervals used in the modified bit-flipping mutation; a higher precision value therefore makes smaller changes more likely. The topology parameter is the only non-numerical parameter. Since it has precision 1, initialization and mutation are uniformly random on the interval, and the random value is then rounded to the closest integer: 0 indicates a scale-free topology and 1 a uniformly random topology. The regularization is on a base-10 logarithmic scale; if the gene is encoded as −5, the λ used in the model is decoded as 10^{−5}. Since inLength and outLength cannot exceed the reservoir size (marked with * above), this has to be accounted for when decoding. Since the lengths and sizes represent numbers of nodes, these gene values are rounded to the nearest integer.

Choosing the values for the genome is challenging and mostly done through trial and error, though there are some general guidelines. The spectral radius, for instance, should be scaled below unity. The connectivity should make the reservoir sparse, and the leakage rate is defined between 0 and 1. The reservoir size is limited by computing capacity, which in turn limits the in and out lengths since they cannot be larger than the number of nodes in the reservoir. The out length is also limited by the available training data, considering that ridge regression requires an overdetermined system. The approach to defining the genome was to iteratively test using the metal data and inspect whether any parameter approached a boundary; the limits were then adjusted accordingly.
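The decoding rules just described (rounding the topology gene, mapping the regularization gene through a power of ten, rounding node counts, and clamping the in/out lengths to the reservoir size) can be summarized in a short Java sketch. The EsnParams class and field names are illustrative, not the thesis code.

```java
// Illustrative decoding of a genome (an array of raw gene values) into ESN parameters.
public class GenomeDecodingSketch {

    // Only a subset of the parameters from the genome table above is shown.
    static class EsnParams {
        boolean scaleFree;
        double connectivity, learningRate, lambda, spectralRadius, leakageRate;
        int reservoirSize, inLength, outLength;
    }

    // Gene order follows the genome table above.
    static EsnParams decode(double[] genes) {
        EsnParams p = new EsnParams();
        p.scaleFree      = Math.round(genes[0]) == 0;          // 0 = scale-free, 1 = uniformly random
        p.connectivity   = genes[1];
        p.learningRate   = genes[2];
        p.lambda         = Math.pow(10, genes[3]);             // regularization on a log10 scale
        p.spectralRadius = genes[4];
        p.leakageRate    = genes[5];
        p.reservoirSize  = (int) Math.round(genes[6]);         // node counts are rounded to integers
        p.inLength  = Math.min((int) Math.round(genes[7]), p.reservoirSize);  // clamp to reservoir size
        p.outLength = Math.min((int) Math.round(genes[8]), p.reservoirSize);
        return p;
    }
}
```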
Also note that the precision determines how likely mutation is to take small steps within the allowed range. A higher precision therefore places more emphasis on finding a precise optimal value than on exploring the whole range. A precision of 1 means that, given that the parameter is chosen to be mutated, all values in the range are equally likely during mutation.

4.3.2 Initialization

The population is initialized as follows:

c_{i,j} = min_j + (max_j − min_j) r, for every i ∈ [0, P − 1], j ∈ [0, C − 1], where r ∈ U[0, 1]

Here P is the population size, C is the number of genes for each individual, and min_j, max_j are the minimum and maximum values for each gene, determined by the genome.

4.3.3 Evaluation

Given a set of individuals P_g, the population of generation g, each individual c_i defines an ESN when decoded. For each c_i, an ESN is initialized, warmed up, trained, and then validated with online training. The MSE on the validation set, averaged over all dimensions of the output, is then used as the fitness score. In the case of the metals, this means that parameters defining networks that produce a smaller MSE when predicting the next-day cash, three month, and stock are awarded a better fitness score.

While trying out parameters for the GA, the error on the training set would sometimes explode while the validation error was still improving. With the setup ultimately decided upon, this does not seem to happen. Regardless, the MSE on the training set is tracked, although it is not directly involved in the evaluation.

The GA is run for 60 generations, after which the parameters corresponding to the best individual are stored in a CSV file. Also included in the file is the random seed used to construct the ESN. This is necessary for reproducing the ESN's performance seen during the GA, which is sensitive to the random parts of the initialization.

4.3.4 Evolution

Once all c_i have been evaluated, they are ready to be evolved. To do this, the crossover and mutation probabilities for generation g are calculated as described in the theory section. Using tournament selection, with a tournament size of 5, the parents of the next generation are chosen. With the current crossover probability, they are crossed using one-point crossover, and the children of the next generation are formed. Next, each of the genes in the children is mutated with the current mutation probability, using modified bit-flip mutation with the gene-specific precision in the allowed range. To make sure that performance does not regress, elitism is applied: a random child in the next generation is replaced with a direct copy of the best parent. To make sure that the search space is adequately explored, should the performance of the best individual remain unchanged for 10 generations, the population is re-initialized. This is done before elitism is applied, meaning that performance does not regress despite resetting the population.

4.3.5 Summary of Genetic Algorithm Parameters

To summarize, the parameters used for the genetic algorithm in each of the runs, that is, for both the benchmarking systems and all the metals, are:

Parameter               Value
Iterations              60
Population Size         30
Tournament Probability  0.8
Tournament Size         5
Initial Cross P.        0.9
Final Cross P.          0.5
Initial Mutation P.     0.5/numGenes
Final Mutation P.       2.0/numGenes
Reset Duration          10

Table 4.1: Genetic Algorithm Parameters

Here numGenes is the number of genes in the respective genome. This means that, on average, 0.5 genes per chromosome are mutated initially, increasing to 2 genes per chromosome towards the end of the run.
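One evolution step with these parameters (tournament selection of size 5 with tournament probability 0.8, followed by one-point crossover) can be sketched as below. The helper names are illustrative and the mutation step is omitted, since it was illustrated earlier; this is not the thesis code.

```java
import java.util.Random;

// Illustrative tournament selection and one-point crossover for one evolution step.
public class EvolutionStepSketch {
    static final int TOURNAMENT_SIZE = 5;
    static final double TOURNAMENT_P = 0.8;   // probability of taking the fittest contender
    static final Random RNG = new Random(7);

    // Select one parent index: draw 5 contenders, take the best with probability 0.8,
    // otherwise fall back to a random contender (lower fitness value = better).
    static int tournamentSelect(double[] fitness) {
        int best = -1;
        int[] contenders = new int[TOURNAMENT_SIZE];
        for (int k = 0; k < TOURNAMENT_SIZE; k++) {
            contenders[k] = RNG.nextInt(fitness.length);
            if (best < 0 || fitness[contenders[k]] < fitness[best]) best = contenders[k];
        }
        return RNG.nextDouble() < TOURNAMENT_P ? best : contenders[RNG.nextInt(TOURNAMENT_SIZE)];
    }

    // One-point crossover applied with probability pCross; otherwise the parents are copied.
    static double[][] crossover(double[] a, double[] b, double pCross) {
        double[] c1 = a.clone(), c2 = b.clone();
        if (RNG.nextDouble() < pCross) {
            int cut = 1 + RNG.nextInt(a.length - 1);
            for (int g = cut; g < a.length; g++) { c1[g] = b[g]; c2[g] = a[g]; }
        }
        return new double[][] { c1, c2 };
    }
}
```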
This formulation is often more convenient than using a raw number, especially when the number of genes shifts, as is the case for the Mackey-Glass equations (see Section 4.5).

4.4 Refinements to the Methods

During testing of the methods on the metal dataset, it became evident that a uniform approach across all metals was suboptimal. The primary challenge is ensuring the model's ability to generalize to unseen data in the test set. To address this, three refinements were introduced.

First, the fitness function was modified to focus solely on MSE_{cash}. Guiding the GA to prioritize models that optimize performance in the cash dimension allows it not to be constrained by trade-offs with the other dimensions. This metric also aligns with the primary objective of the report, benchmarking the model against the RW. The calculation of MSE_{cash} follows the procedure detailed in Section 4.2.4.

Second, the optimization of the regularization parameter λ was separated from the GA and handled through a grid search. This approach proceeds as follows:

Algorithm 3 Grid Search for Regularization Parameter λ in the Echo State Network (ESN)
1: for each individual in the population do
2:   Initialize the Echo State Network (ESN)
3:   Wash out the initial states using the warmup set
4:   Gather the reservoir states for the first 90% of the training set, denoting the state matrix S̄_90
5:   for each λ_i = 10^i, i = −3, −2, ..., 2, 3 do
6:     Calculate the output weights and bias: W̄^{out} = Y S̄_90^T (S̄_90 S̄_90^T + λ_i I)^{−1}
7:     Use these weights to make predictions for the remaining 10% of the training set
8:     Compute MSE_all
9:     Reset the reservoir states to the final state in S̄_90, i.e., the state at the end of the first 90% of the training set
10:  end for
11:  Choose the λ_i with the lowest MSE_all
12:  Reset the reservoir states to zero
13:  Proceed with the usual process: warmup, training on the entire training set, and computing MSE_all for predictions on the validation set
14: end for

Third, the stock dimension was removed from the dataset, as it was identified as a possible source of issues.

These refinements all maintain the same data splits for the validation and test sets, ensuring that the results remain directly comparable. While alternative methods, such as cross-validation, could potentially improve generalization, they are challenging to implement due to the temporal structure of the data.

4.5 Benchmarking Using Mackey-Glass and Lorenz Attractor

To assess whether the ESN-GA model is effective at predicting chaotic dynamical systems, the two examples Mackey-Glass and Lorenz are used. The two systems are described by the following equations.

Mackey-Glass equation:

\frac{dx(t)}{dt} = β \frac{x(t − τ)}{1 + x(t − τ)^n} − γ x(t)

where x describes the state of the system, and the parameters used are n = 10, β = 0.2, τ = 17, γ = 0.1.

Lorenz attractor equations:

\frac{dx}{dt} = σ(y − x)
\frac{dy}{dt} = x(ρ − z) − y
\frac{dz}{dt} = xy − βz

where x, y, z determine the state of the system over time, and the parameters used are σ = 10, ρ = 28, β = 8/3.

Solving these ordinary differential equations, given starting values for the variables, amounts in short to discretizing t and integrating numerically. There are methods available for this in the Apache Commons library [2]. However, because training the ESN requires several thousand data points, issues arise: the solvers FirstOrderIntegrator, ClassicalRungeKuttaIntegrator, and DormandPrince853Integrator all stop functioning before a satisfactory number of data points have been produced. This might be an issue of implementation.
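For illustration, a plain fixed-step fourth-order Runge-Kutta integration of the Lorenz system, written directly in Java without any solver library, could look like the sketch below. The step size and initial condition are arbitrary choices for the example, not values taken from the thesis.

```java
// Illustrative fixed-step RK4 integration of the Lorenz system (sigma = 10, rho = 28, beta = 8/3).
public class LorenzRk4Sketch {
    static final double SIGMA = 10.0, RHO = 28.0, BETA = 8.0 / 3.0;

    // Right-hand side of the Lorenz equations.
    static double[] f(double[] s) {
        double x = s[0], y = s[1], z = s[2];
        return new double[] { SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z };
    }

    // One classical RK4 step of size h.
    static double[] rk4Step(double[] s, double h) {
        double[] k1 = f(s);
        double[] k2 = f(add(s, scale(k1, h / 2)));
        double[] k3 = f(add(s, scale(k2, h / 2)));
        double[] k4 = f(add(s, scale(k3, h)));
        double[] next = new double[3];
        for (int i = 0; i < 3; i++)
            next[i] = s[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]);
        return next;
    }

    static double[] add(double[] a, double[] b) {
        return new double[] { a[0] + b[0], a[1] + b[1], a[2] + b[2] };
    }
    static double[] scale(double[] a, double c) {
        return new double[] { a[0] * c, a[1] * c, a[2] * c };
    }

    public static void main(String[] args) {
        double[] state = { 1.0, 1.0, 1.0 };   // arbitrary starting point
        double h = 0.01;                      // arbitrary step size
        for (int step = 0; step < 4000; step++) {
            state = rk4Step(state, h);
            System.out.printf("%f,%f,%f%n", state[0], state[1], state[2]);
        }
    }
}
```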
Luckily, there are readily available implementations that can produce several thousand data points. The code used for the Mackey-Glass system is provided by [14] and that for the Lorenz attractor by [82]. Both implementations are run with the desired parameters, and the results are exported to CSV files and then imported into Java for further processing. Note that the Mackey-Glass system has one dimension. This means a smaller genome is required, considering there is only one input scale to optimize. Both the Lorenz attractor and the metal data have three dimensions.

The same data split is used for both benchmarking systems:

Phase       Number of Data Points
Warmup      1000
Training    2000
Validation  500
Testing     500

4.6 Software and Tools

All code, except for generating the chaotic systems data, was written in the IntelliJ IDE [43], utilizing Java 13. The code is run through a Java virtual machine on a MacBook Air 2017 with macOS Monterey 12.7.6. The following libraries were used, all accessed through Maven [24]:
• Apache Commons Math: version 3.6.1 [2]
• Efficient Java Matrix Library (EJML): version 0.43.1 [76]
• JFreeChart: version 1.0.17 [18]
• Jsoup: version 1.17.2 [35]
• OpenCSV: version 5.9 [78]

During the coding process, both OpenAI's ChatGPT 3.5 and 4.0 [74] and JetBrains AI Assistant 2024.2 and 2024.3 [44] were used. The LaTeX template used for this document was provided by David Frisk [17], and ChatGPT was used to assist with formatting plots, tables, and equations. The Mackey-Glass data was generated using MATLAB [81], with code provided by [14], while the Lorenz attractor data was generated in Python [23], with code sourced from [82].

4.7 Limitations of the Methods

The methods employed in this study have several limitations:
1. Parameter Search: The genetic algorithm used for the parameter search is not guaranteed to find a global optimum but rather a satisfactory solution. Additionally, the complexity of patterns in the data might exceed the representational capacity of the network, constrained by both the number of nodes and available hardware resources.
2. Data Complexity and Availability:
• The patterns in the data might lack sufficient predictive power to consistently outperform a random walk.
• The methods are not restricted to the three data types used but could incorporate additional datasets, which are unavailable to the author.
3. Price Discrepancy: The data does not use the official LME closing price but instead relies on the last offer made on the trading ring. While this difference is often minimal, just a single trading tick, it may still influence results and practical application.
4. Computational Constraints: Limited computational capacity restricts the size of the search space and the ability to thoroughly evaluate all possible configurations.
5. Model Construction Choices: The numerous design decisions made during model construction are not rigorously verified for their individual contributions to performance. Comparing alternate paths or evaluating every choice is infeasible within the given time frame, particularly due to the long runtime required for GA evaluations and subsequent testing.
6. Fitness Function: The fitness function, which evaluates parameter combinations, is not guaranteed to be convex. Optimal solutions may lie outside the explored search space, even if parameters are far from a boundary.
7. Inherent Data Randomness: This setup cannot directly determine whether the data is inherently random.
Additional tests would be required to evaluate this aspect.
8. Statistical Significance: The results do not include statistical significance measures, which limits the ability to provide confidence intervals for predictions.
9. Data Integrity: The data originates from a third-party source, so its accuracy cannot be independently verified.
10. Prediction Horizon: The model is limited to one-day-ahead predictions. Extending the prediction horizon would require modifications to the setup.
11. Multi-Day Predictions: The current setup does not support autonomous multi-day predictions. While implementing a feedback model that uses its own predictions as inputs is straightforward, it is incompatible with stochastic gradient descent, which requires true data for iterative weight updates.
12. Simultaneous Time and Space Dependence: The Mackey-Glass equations exhibit time dependence, while the Lorenz attractor demonstrates spatial dependence. Although the setup is capable of handling these phenomena individually, there is no assurance that the model can accurately predict systems where both time and space dependencies occur simultaneously.
This list is by no means exhaustive.

4.8 Summary of Methodological Approach

This section summarizes the methodological approach employed for optimizing and testing the model.

Data Collection
• For Benchmarking:
  – Generate the data.
  – Split the data into warmup, training, validation, and test sets.
• For Metals:
  – Collect the data from Westmetall.
  – Transform the data into percentage changes.
  – Split the data into warmup, training, validation, and test sets.

Genetic Algorithm
• Initialize the population.
• Loop, for each generation:
  – Evaluate Population:
    ∗ Initialize the ESN.
    ∗ Perform warmup.
    ∗ Train: train the weights using ridge regression.
    ∗ Validate: nudge the weights using gradient descent and calculate the mean squared error (MSE).
  – Evolve Population:
    ∗ Perform selection and crossover.
    ∗ Mutation.
    ∗ Stagnation reset if necessary.
    ∗ Elitism.
• Store the optimized parameters and seed.

Testing Optimized Parameters
• Initialize the ESN with the optimized parameters.
• Perform warmup.
• Train using ridge regression.
• Validate using gradient descent and store the validation metrics.
• Test using gradient descent and store the test metrics.

The four different setups MSEall, MSEcash, grid search, and no-stock are applied to the six different metals. The benchmarking of the model on the Mackey-Glass and Lorenz data is only done with the MSEall setup.

5 Results

This section presents the results. First, the ESN-GA model is tested on the known application of predicting Mackey-Glass and Lorenz attractor data. Then an overview of the metal data is presented, leading up to the results pertaining to parameter optimization and, finally, a performance comparison with a random walk model.

5.1 Mackey-Glass Equations and Lorenz Attractor

This section is included to validate the ESN-GA model on well-established applications, ensuring its functionality. Additionally, it examines whether the GA effectively selects parameters that enhance performance and evaluates the ESN's ability to handle chaotic data. We begin by analyzing the performance progression of the GA on the Mackey-Glass dataset. Given that the Mackey-Glass system features a single spatial dimension with temporal dependencies, strong performance would indicate that the ESN-GA model successfully relates past values to predict future ones.
Figure 5.1: Genetic Algorithm Progress for Mackey-Glass Equations

The training and validation MSE go from 1.628E-5 and 1.167E-8 to 2.795E-8 and 5.253E-12, respectively. Note that the plot is on a logarithmic scale and that the algorithm is still improving upon termination, although progress slows down. There might be a case for running the GA longer; however, this clearly indicates that the GA is capable of selecting parameters superior to random ones. The parameters settled upon are:

Parameter           Value
Topology            Scale-Free
Connectivity        0.1387
Spectral Radius     0.8399
Regularization      2.730e-4
Learning Rate       9.999e-3
Leakage Rate        0.998
Reservoir Size      764
Input Length        358
Output Length       358
Input Bias Scale    15.405
Input Factor Scale  2.994

Table 5.1: GA Optimized Parameters for Mackey-Glass

Re-initializing the network, using the same seed as during the GA, results in the following predictions:

Figure 5.2: Mackey-Glass Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

The predicted values align with the actual values so closely that the difference is not distinguishable at the resolution and line thickness used. This points to the model's ability to predict the behavior of the Mackey-Glass system. The test metrics of the predictions follow:

Phase       MSE        MAE        R-Squared  Directional Accuracy
Training    2.795e-08  1.157e-04  1.000      0.993
Validation  5.250e-12  1.786e-06  1.000      0.998
Testing     1.966e-11  3.032e-06  1.000      0.996

Table 5.2: Mackey-Glass Model: Performance Metrics

The results demonstrate that the model clearly captures the behavior of the Mackey-Glass system. There is still an error, although a small one, and possibly room for improvement. Considering that the values are stored as doubles, which have a precision of around 10^{-16}, this should not cause issues. The GA is still improving after 60 iterations, indicating that the model in its current form has more potential. However, more iterations are not necessary, considering that the results clearly illustrate that the model is capable of predicting the Mackey-Glass system.

With that established, we move on to the Lorenz system. That system has three spatial dimensions that interact, making the complexity greater. If the ESN-GA model can perform well, it establishes that the model is capable of detecting interactions between input variables, not only temporal structures in the data. The Lorenz attractor also has a higher degree of chaos, measured by its Lyapunov exponent, compared to Mackey-Glass for the respective parameters, thus making predictions slightly more difficult. Note that the predictions are one-step, so errors do not accumulate exponentially as they would for multi-step predictions. The GA progress follows:

Figure 5.3: Genetic Algorithm Progress for Lorenz Attractor

The training and validation MSE go from 0.073 and 0.033 to 9.756E-8 and 8.782E-7, respectively. Yet again, this is a marked improvement over the initial random ESN parameters. Here it is more clear that the GA could run for longer, considering the steady improvements until termination. Because there are no obvious signs of premature convergence, the population size and enhancements can be deemed satisfactory, at least in this case.
The following are the parameters the GA settled upon:

Parameter           Value
Topology            Scale-Free
Connectivity        0.1352
Spectral Radius     0.6985
Regularization      1.690e-5
Learning Rate       2.361e-3
Leakage Rate        0.1324
Reservoir Size      967
Input Length        967
Output Length       855
Input Bias Scale    13.029
Input Factor Scale  [0.2937, 0.0741, 0.0804]

Table 5.3: GA Optimized Parameters for Lorenz Attractor

Re-initializing the network with the optimized parameters and seed produces the following predictions:

Figure 5.4: X Values of Lorenz Attractor Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

Figure 5.5: Y Values of Lorenz Attractor Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

Figure 5.6: Z Values of Lorenz Attractor Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

Again, the difference between the actual and predicted values is hard to distinguish in the plots. Clearly, the model is capable of making one-step predictions in multiple dimensions. The test metrics for the three dimensions of the Lorenz attractor model follow:

Variable  Phase       MSE        MAE        R-Squared  Directional Accuracy
X         Training    3.107e-08  1.310e-04  1.000      0.999
X         Validation  1.560e-07  1.307e-04  1.000      0.998
X         Testing     2.081e-09  3.206e-05  1.000      0.998
Y         Training    1.651e-07  2.962e-04  1.000      0.999
Y         Validation  1.766e-06  3.521e-04  1.000      0.996
Y         Testing     1.173e-08  7.243e-05  1.000      0.998
Z         Training    9.651e-08  2.302e-04  1.000      1.000
Z         Validation  7.124e-07  2.644e-04  1.000      0.998
Z         Testing     8.112e-09  6.431e-05  1.000      0.998

Table 5.4: Lorenz Attractor Model: Performance Metrics for X, Y, and Z Variables

The situation is similar for the Lorenz attractor. The model is clearly capable of predicting the behavior of the system, with near-perfect metrics for both R^2 and directional accuracy. The comparatively larger values for MSE and MAE could to some degree be attributed to the larger absolute values. This might call for the use of a ratio-based error metric, but the purpose is not to extensively analyze the results from these two benchmark datasets, only to conclude that the model works for known applications. It is clear that the model is capable of improving on a random selection of parameters and that the ESN model initialized with said parameters is capable of predicting chaotic systems with temporal or spatial dependencies. The next step is to apply the model to the metal data. First we take a look at the LME data.

5.2 Data Overview

The following is the data from the LME provided by Westmetall, from the 2nd of January 2008 to the 2nd of September 2024, for the cash settlement, three month, and stock of six metals. Due to trading being halted, there are some gaps in the data. These missing data are summarized in the table below for the different metals.

Metal      Missing Data Points  Usable Data Points
Aluminium  87                   4131
Copper     7                    4211
Lead       85                   4132
Nickel     99                   4118
Tin        87                   4130
Zinc       85                   4132

Table 5.5: Number of Missing and Usable Data Points for the Metals

It should be noted that the missing data are distributed across the various data types. When one or more dimensions are missing for a given trading day, the entire day is discarded. Given the relatively small proportion of missing data, the remaining dataset is assumed to still provide an accurate representation of reality. A potential issue arises when missing data points occur consecutively over extended periods.
For example, the tin three-month data is missing from February 20 to June 16, 2008, and the nickel three-month data is missing from March 8 to March 25, 2022. In such cases, these gaps are simply skipped, resulting in discontinuities in the time series. For instance, in the case of nickel, the entry directly following March 7, 2022, is March 26, 2022. This requires the assumption that such gaps will not significantly impact predictive performance. The effect of missing data is visible as straight-line segments in the following plots of the datasets, which highlight these discontinuities.

Figure 5.7: Cash Settlements of the Metals

Metal      Mean       Median     Variance
Aluminium  2.045E+3   1.997E+3   1.511E+5
Copper     7.107E+3   7.049E+3   2.486E+6
Lead       2.064E+3   2.079E+3   8.698E+4
Nickel     1.650E+4   1.614E+4   2.563E+7
Tin        2.183E+4   2.059E+4   4.008E+7
Zinc       2.369E+3   2.306E+3   3.264E+5

Table 5.6: Statistics for the Cash

As the plots illustrate, the cash settlements for the different metals are quite different in absolute terms. Lead, zinc, and aluminium appear to behave similarly, being quite stable during the period. This contrasts with nickel and tin, which are more volatile: their price levels in early 2022 were almost five times those of early 2009, before having almost completely reverted by 2024. Next, we inspect the three month data for the metals.

Figure 5.8: Three Month of the Metals

Metal      Mean       Median     Variance
Aluminium  2.069E+3   2.018E+3   1.501E+5
Copper     7.115E+3   7.062E+3   2.479E+6
Lead       2.072E+3   2.090E+3   8.470E+4
Nickel     1.657E+4   1.625E+4   2.565E+7
Tin        2.169E+4   2.050E+4   3.829E+7
Zinc       2.374E+3   2.315E+3   3.095E+5

Table 5.7: Statistics for the Three Month

As expected, the series look similar. To get a sense of the possible predictive power of utilizing the three month data, one can plot the difference between the three month and the cash settlement, where a large deviation indicates a strong conviction in the market about upcoming changes.

Figure 5.9: Cash Settlement Deviation from Three Month

The plot illuminates the notion that the different metals behave slightly differently. Zinc barely shows any deviation at all, while tin at times has massive differences between cash and three month. There might also be some indication that the developments of tin during 2021-2022 are exceptional. This makes predictions using machine learning in general, and reservoir computing models in particular, difficult, considering there might not be enough data for the model to learn from. It should be noted that the plots so far depict absolute terms. Later, relative price changes will be considered, hopefully revealing a more stationary behavior, which is better suited for machine learning despite the online learning scheme. Below is the final type of data used in the study, the stock of the different metals.

Figure 5.10: Stock of the Metals

Metal      Mean       Median     Variance
Aluminium  2.653E+6   2.099E+6   3.099E+12
Copper     2.619E+5   2.429E+5   1.802E+10
Lead       1.574E+5   1.488E+5   8.350E+9
Nickel     2.000E+5   1.615E+5   1.577E+10
Tin        8.290E+3   6.283E+3   4.056E+7
Zinc       4.247E+5   3.358E+5   1.018E+11

Table 5.8: Statistics for the Stock

The plot clearly illustrates the need for using relative terms. The absolute difference between the aluminium stock and the tin stock is massive. The graph should be viewed in light of aluminium volumes sold at the LME in 2023 being around 57 million tons and tin volumes being around 1.2 million tons [53], which naturally will be reflected in the absolute stock kept.
There is also an apparent issue with the way Westmetall collects data from the LME. The nickel stock at the beginning of April 2012 is clearly wrong. On March 30, the nickel stock presented is 99882 tons, which drops to 542 tons on April 2 and stays at this level until April 11, when the stock returns to the more reasonable level of 99330 tons. Since there is no information suggesting that something exceptional happened during this period, the data is suspicious. Luckily, there is a snapshot of the LME website, which posts daily data, indicating that the nickel stock on April 3, 2012, should be 100542 tons [47]. It is probably safe to assume that the software used by Westmetall to collect the data from the LME has somehow removed 100 000 from the actual stock kept. Therefore, 100 000 is added back to the suspicious stock values between March 30 and April 11, 2012, in the data used. No similar cases were found throughout the dataset, but other errors cannot be ruled out, possibly affecting the effectiveness of the training and the validity of the results.

Extrapolating from basic economic theory on supply and demand, there should be a relationship between the stock kept and the prices. This might have predictive power in the model. Therefore, it is of interest to see the cash settlement and stock in the same graph. Some examples are presented below.

Figure 5.11: Lead Cash Settlement and Stock

Figure 5.12: Copper Cash Settlement and Stock

No strikingly obvious relationship is observed in either of the examples. Further speculation is pointless, and the usefulness of the stock will hopefully become evident in the subsequent optimization of the ESN. Next, the percentage changes of the data are presented.

Figure 5.13: Percent Change of Cash Settlement of the Metals

Metal      Mean    Median  Variance
Aluminium  0.0107  0.0000  2.0825
Copper     0.0193  0.0025  2.4216
Lead       0.0146  0.0000  3.9414
Nickel     0.0136  0.0000  5.1395
Tin        0.0353  0.0577  3.8550
Zinc       0.0203  0.0210  3.28