AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms Master’s thesis in Engineering Mathematics and Computational Science MARKUS EDLUND DEPARTMENT OF MICROTECHNOLOGY AND NANOSCIENCE CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2025 www.chalmers.se www.chalmers.se Master’s thesis 2025 AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms MARKUS EDLUND Department of Microtechnology and Nanoscience Chalmers University of Technology Gothenburg, Sweden 2025 AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms MARKUS EDLUND © MARKUS EDLUND, 2025. Supervisor: Zoran Konkoli, Department of Microtechnology and Nanoscience Examiner: Zoran Konkoli, Department of Microtechnology and Nanoscience Master’s Thesis 2025 Department of Microtechnology and Nanoscience Division of Division name Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2025 iv AI-Driven Predictions of Industrial Metal Prices on the London Metal Exchange Optimizing Echo State Networks with Genetic Algorithms MARKUS EDLUND Department of Microtechnology and Nanoscience Chalmers University of Technology Abstract The battery manufacturer Northvolt aims to reduce inventory risks. In an increas- ingly competitive industrial metals market, protecting inventory value by employing hedging strategies is a key component of lowering costs. Using artificial intelligence for reliable predictions of short term prices is an interesting prospect for improving such strategies. This study uses data from the London Metal Exchange, including cash and three- month futures contracts, along with stock volume, to predict next-day cash prices for six industrial metals. These predictions are compared to a baseline random walk model using the metrics mean squared error (MSE), mean absolute error (MAE), R2, and directional accuracy. Predictions are made using an Echo State Network (ESN), with optimized parame- ters chosen by a Genetic Algorithm (GA). The network is trained offline with ridge regression and online with stochastic gradient descent. The performance of the ESN- GA setup is validated using chaotic systems (Mackey-Glass and Lorenz equations) before being applied to metals futures data. The data is split into four different sets: a warmup set necessary for ESNs, a training set used for offline training, a validation set to measure GA performance, and a test set of unseen data to ensure the network generalizes well. The GA optimization significantly reduced prediction errors in the Mackey-Glass and Lorenz systems, with MSE values improving from 1.2E-8 to 5.3E-12 and from 3.3E-2 to 8.8E-7, respectively, on the validation set. For metals futures data, the GA enhanced ESN performance, outperforming the random walk across all metrics on the validation set for all six metals. However, slight modifications were necessary to achieve superior performance on the unseen test set. Superior results were achieved for four metals across all metrics, while the results for the remaining two metals were mixed. 
The GA effectively optimizes ESN parameters for systems with clear, determin- istic dynamics but encounters challenges when applied to stochastic futures data, particularly due to the risk of overfitting the validation set. While the ESN-GA method does not consistently outperform the random walk on unseen test data, it demonstrates significant potential. Further exploration with alternative configura- tions, additional data types, and more robust validation techniques is warranted to enhance its practical applicability. Keywords: Artificial Intelligence, Echo State Networks, Genetic Algorithms, North- volt, London Metal Exchange, Metal Futures, Financial Time Series, Random Walk v Acknowledgements I would like to express my heartfelt gratitude to several individuals and organizations who have supported me throughout this journey. First and foremost, I would like to thank my supervisor and examiner, Zoran Konkoli, for his patience, guidance, and encouragement in allowing me the free- dom to explore reservoir computing according to my interests and aspirations. His support has been invaluable throughout this process. I am deeply grateful to Denver Stretesky for serving as my opponent and for provid- ing great feedback and insightful suggestions that have greatly improved the quality of this work. A special thanks goes to Frederic Sese and Axel Myrberger at Northvolt for giving me the opportunity to work on such an engaging and meaningful problem. Despite challenging times, their enthusiasm has never wavered. Finally, I extend my thanks to everyone who contributed directly or indirectly to the completion of this project. Markus Edlund, Gothenburg, January 2025 vii List of Acronyms DA Directional Accuracy ESP Echo State Property ESN Echo State Network GA Genetic Algorithm GD Gradient Descent LME London Metal Exchange MAE Mean Absolute Error MSE Mean Squared Error RC Reservoir Computing ix Contents List of Acronyms ix 1 Introduction 1 1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Significance of the Study . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Background 3 2.1 Northvolt and the London Metal Exchange . . . . . . . . . . . . . . . 3 2.2 The Random Walk and Chaos in Finance . . . . . . . . . . . . . . . . 4 2.2.1 Is the LME an Efficient Market? . . . . . . . . . . . . . . . . 5 2.2.2 The Random Walk Model . . . . . . . . . . . . . . . . . . . . 5 2.2.3 Beyond the Random Walk: Chaotic Dynamics . . . . . . . . . 5 2.3 The ESN-GA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 Theory 9 3.1 Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1.1 Scale-Free Network . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1.2 Uniformly Random Networks . . . . . . . . . . . . . . . . . . 11 3.2 Theory of Reservoir Computing Models . . . . . . . . . . . . . . . . . 11 3.2.1 The Echo State Property . . . . . . . . . . . . . . . . . . . . . 12 3.2.2 Calculating Output . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.3 Offline Learning: Ridge-Regression . . . . . . . . . . . . . . . 15 3.2.4 Online Learning: Stochastic Gradient Descent . . . . . . . . . 16 3.3 Theory of Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . 17 3.3.1 Premature Convergence . . . . . . . . . . . . . . . . . . . . . 21 3.4 Overfitting . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 22 4 Methods 25 4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.2 The Echo State Network . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2.1 Initializing Networks . . . . . . . . . . . . . . . . . . . . . . . 27 4.2.2 Warmup and Training . . . . . . . . . . . . . . . . . . . . . . 29 4.2.3 Validation and Testing . . . . . . . . . . . . . . . . . . . . . . 29 4.2.4 Calculating Test Metrics . . . . . . . . . . . . . . . . . . . . . 30 4.3 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 xi Contents 4.3.1 The Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.4 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3.5 Summary of Genetic Algorithm Parameters . . . . . . . . . . 33 4.4 Refinements to the Methods . . . . . . . . . . . . . . . . . . . . . . . 34 4.5 Benchmarking Using Mackey-Glass and Lorenz Attractor . . . . . . . 35 4.6 Software and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.7 Limitations of the Methods . . . . . . . . . . . . . . . . . . . . . . . 36 4.8 Summary of Methodological Approach . . . . . . . . . . . . . . . . . 37 5 Results 39 5.1 Mackey-Glass Equations and Lorenz Attractor . . . . . . . . . . . . . 39 5.2 Data Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.3 Parameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4 Testing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.4.1 Results From Modified Fitness Function . . . . . . . . . . . . 68 5.4.2 Grid Search For Regularization . . . . . . . . . . . . . . . . . 70 5.4.3 Removing the Stock Data . . . . . . . . . . . . . . . . . . . . 72 5.5 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6 Discussion 75 6.1 Effectiveness of the GA in Improving ESN Predictive Performance . . 75 6.2 Sufficiency of the Genome . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3 Effectiveness of the Enhancements . . . . . . . . . . . . . . . . . . . . 75 6.4 Computational Costs and Time Requirements . . . . . . . . . . . . . 76 6.5 Robustness of GA-Optimized ESNs Across Datasets . . . . . . . . . . 76 6.6 Insights for Trading Strategies . . . . . . . . . . . . . . . . . . . . . . 76 6.7 Comparison with Random Walk Model . . . . . . . . . . . . . . . . . 77 6.8 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 7 Conclusions 79 Bibliography 81 A Appendix A I B Appendix B V C Appendix C IX xii 1 Introduction In an increasingly competitive industrial metals market, protecting inventory value is critical for industrial actors like Northvolt, one of Europe’s fastest-growing battery manufacturers [70]. Materials account for approximately 65% of the total production costs of electric vehicle batteries [45], therefore delivery contracts of batteries usually include a moving prices of said materials [80]. This exposes the whole supply chain to significant financial risk if prices are volatile, highlighting the need for effective mitigation strategies. To address this challenge, hedging strategies involving futures contracts traded on the London Metal Exchange (LME), are employed to stabilize costs and protect inventory value [55][46][75]. 
The LME is the world’s primary commodities exchange for industrial metals, with $15 trillion traded in 2023 [53]. Futures contracts allow traders to lock in prices for metals at a future date, enabling them to offset potential losses caused by price fluctuations [55]. For instance, selling futures while holding an equivalent physical inventory can balance the financial impact of price changes. In this context, accurate short-term price predictions, particularly for LME cash contracts, could improve hedging strategies and inventory management. Predicting financial markets, however, is notoriously challenging. The efficient mar- ket hypothesis suggests that market prices reflect all available information, causing price movements to behave like a random walk, that is inherently unpredictable [21]. Yet, skepticism remains. Some studies propose that financial time series exhibit chaotic dynamics rather than pure randomness, meaning there may be underlying patterns that can be identified and modeled [62]. If chaotic behavior exists, mod- els adept at predicting chaotic systems might offer a competitive edge over other models. To address this, this thesis explores the use of artificial intelligence for predicting short-term cash prices of industrial metals. Specifically, the study applies an Echo State Network (ESN) to forecast next-day cash prices of six metals traded on the LME. The ESN’s parameters are optimized using a Genetic Algorithm (GA), a nature-inspired optimization technique [66]. The performance of the ESN-GA model is evaluated against a random walk baseline, using the metrics mean squared error (MSE), mean absolute error (MAE), R2, and directional accuracy. 1.1 Research Questions This thesis aims to answer the following primary research question: • Can Echo State Networks (ESNs), optimized using a Genetic Algo- 1 1. Introduction rithm (GA), outperform a random walk model in predicting next- day metal futures prices? To address this, the following secondary questions are considered: • Is the GA effective at improving the ESN’s predictive performance? • What are the computational costs and time requirements for using a GA to optimize ESNs? • How robust is the GA-optimized ESN model for different metals and across various datasets? • What patterns or insights emerge from the ESN model’s predictions that could inform trading strategies? 1.2 Overview of Methods The study uses historical data from the LME, including cash contracts, three-month futures contracts, and stock volume for six industrial metals: aluminium, copper, lead, nickel, tin, and zinc accessed through Westmetall [86]. The data is split into warmup, training, validation, and test sets. The ESN, a type of recurrent neural network, is constructed to predict next-day cash prices. Its parameters are tuned using a GA, which iteratively evolves the network design to minimize prediction error. The ESN-GA model’s performance is validated for chaotic systems using the Mackey-Glass and Lorenz Equations. Finally the performance of the ESN-GA model is tested on the industrial metals and benchmarked against a random walk model. 1.3 Significance of the Study This study contributes to the fields of machine learning and financial time series forecasting by exploring the effectiveness of ESN-GA models for industrial metal price predictions. It also bridges the gap between theoretical AI models and prac- tical financial applications, offering insights that could enhance hedging strategies for industrial actors like Northvolt. 
By evaluating the ESN-GA model against the random walk, this research provides a rigorous test of the model’s predictive capa- bilities often lacking in many similar studies on using AI for financial time series predictions [73][83][13][58]. 2 2 Background This section established the practical context and briefly introduces the theoretical frameworks and machine learning techniques that will be applied in the thesis. 2.1 Northvolt and the London Metal Exchange Northvolt is a Swedish battery manufacturer and was the fastest-growing producer in Europe 2023, as highlighted in their sustainability and annual report [70]. The company’s value proposition is centered around producing the world’s greenest bat- teries with industry-leading performance, enabling customers to transition to electric solutions. To manufacture these batteries a wide range of metals and materials are required. This exposes Northvolt to fluctuations in metal prices. As part of its strategy to manage this risk Northvolt incorporates recycling into its processes [71]. In 2023, 6% of the materials in their battery cells were recycled, with an ambitious target of 50% by 2030 [70]. Materials account for a significant portion of the costs in battery manufacturing, approximately 65% according to industry estimates for electric vehicle batteries [45]. While this estimate is not specific to Northvolt, material costs are critical for any battery manufacturer, emphasizing the importance of intelligent procurement strategies. While well structured supply chains are probably the most important part of driving costs down, an important complement is the use of financial tools such as futures [75] [46]. The London Metal Exchange (LME) is the world’s primary commodities exchange for industrial metals. In 2023, it handled trading volumes of 3.5 billion tonnes, and a notional trading value of $15 trillion [20]. The LME facilitates the majority of global financial business related to non-ferrous metals, i.e. metals that do not contain iron, and prevalent in batteries. One of the LME’s key functions is enabling the trading of futures contracts [55]. Fu- tures are agreements that obligate buyers and sellers to transact a specified quantity of a commodity at a predetermined price on a future date [31]. Notably, traders do not need to physically own the underlying metal to trade futures. They can offset contracts by entering equivalent contracts with matching terms on the opposite side. Upon contract expiration, the LME clearing house records a net-zero position, and the trader realizes a profit or loss based on the difference between the purchase and sale prices of the contracts and no metals have to be delivered or received [52]. Futures contracts are an effective tool for reducing inventory risk without the need to buy or sell additional metal. For example, buying a certain amount of a metal and 3 2. Background simultaneously selling an equivalent quantity in futures contracts offsets potential inventory losses due to declining metal prices, as these losses are balanced by gains in the futures contracts. Conversely, increases in inventory value are offset by losses in the futures position. Something similar to this approach is widely employed by industrial players, including Northvolt, to mitigate inventory risks [80]. This study uses two types of LME contracts as data: cash contracts and three-month contracts. • Cash contracts are the shortest-duration contracts on the LME [51]. 
Once a cash contract is entered, it cannot be offset with another contract because no equivalent contract exists to take an opposing position. These contracts effectively determine the final cash value for traders who do not wish to settle their positions physically. The next-day cash price is the primary focus of the predictive model in this thesis. • Three-month contracts are the longest contracts that are settled daily on the LME [51]. As the name implies, they mature three months from the purchase date. These contracts offer insight into market expectations of future price movements. This relationship may hold predictive power for next-day cash prices. The third type of data used in this study is stock data. The LME authorizes a global network of warehouses for storing metals traded on the exchange [54]. When a futures contract is settled, sellers issue warrants for the agreed-upon quantity of metal, which buyers can redeem at these warehouses. Since prices are fundamentally driven by supply and demand, information about available metal stock may provide valuable insights for price predictions. 2.2 The Random Walk and Chaos in Finance The dynamics of financial markets are complex. Unlike physical systems governed by constant fundamental laws, a major contributor to financial market outcomes are the anticipations of its participants [65]. Adding to this complexity is bad in- put data, extraordinary events and malevolent actors. This makes price prediction a daunting challenge. If financial market prices were predictable using readily available formulas, these formulas would create a self-correcting effect: market participants would exploit the predictions, and subsequent prices would quickly reflect the ex- pected outcomes. This leads to the notion that prices fully incorporate available information, with price movements occurring as independent random events when new information emerges. This concept is encapsulated in the Efficient Market Hy- pothesis (EMH), which states that: "A market in which prices always fully reflect available information is called efficient" [21]. The EMH is not meant to be taken literally and recognizes varying degrees of market efficiency. Empirical tests of market efficiency are typically categorized as [21]: 1. Weak: Only historical prices are considered available information. 2. Semi-strong: All publicly available information is considered. 3. Strong: All information, including non-public data, is incorporated. While the validity of the Efficient Market Hypothesis remains a subject of debate 4 2. Background [91], AI models often struggle to outperform a random walk in short-term predic- tions, particularly when relying solely on historical price data [58]. 2.2.1 Is the LME an Efficient Market? There is a case to argue that the LME may not fully align with the definition of an efficient market. The sufficient conditions for market efficiency include [21]: 1. No transaction costs in trading securities. 2. All available information is costlessly accessible to all market participants. 3. All participants agree on the implications of current information for prices. The LME has transaction costs [48], and historical data is not freely available [49], posing barriers for some participants, such as the author of this thesis. However, in practice the only actors directly trading on the LME are institutional, making the economical barriers to information and transactional costs negligible [50]. 
Regarding the last condition, it would be incredibly difficult to show that every single participant agrees on anything, and doing so is entirely out of scope for this thesis. So, while the LME does not technically fulfil the sufficient conditions for an efficient market, it is certainly possible that it is one in practice. An older, brief study of the LME copper market concluded that it probably is, although the statistical evidence was inconclusive [30].

2.2.2 The Random Walk Model

The predictive model derived from the weak form of the EMH is the random walk. A time-discrete random walk has two key properties [21]:
1. Changes in price are independent of each other.
2. Changes in price are identically distributed.
The random walk (RW) model is represented as:
$$S_t = S_{t-1} + \epsilon_t$$
where S_t is the price at time t and ε_t is a noise term with E[ε_t] = 0. This implies that the expected value of the next day's rate of change, r_{t+1}, given the available information, is:
$$E[r_{t+1} \mid I_t] = 0$$
where I_t represents the information set containing historical data up to time t. Extensions and variations of the RW assumption are widely used, for example in the Black-Scholes model for option pricing, where log returns are assumed to be independent and normally distributed [6]. The RW model in this thesis only assumes that expected next-day returns are zero; the best prediction of the next-day price is therefore today's price.

2.2.3 Beyond the Random Walk: Chaotic Dynamics

Despite its prevalence, the random walk model has its skeptics. Some research has found that financial time series are better described as chaotic rather than random [62]. Unlike randomness, chaos implies deterministic yet complex patterns that can be modeled. Accurately modeling a chaotic system requires precise measurements of all relevant variables, which may be infeasible for financial time series. Even so, using some of the relevant variables might produce more accurate predictions than a random walk. Chaotic systems such as the Mackey-Glass and Lorenz equations are examples of differential equations describing the dynamics of complex systems [63][56]. These models will be used in this thesis to demonstrate the ability to predict chaotic systems. The reasoning is that if the LME futures market exhibits chaotic behavior, a model designed for chaotic systems should outperform the random walk. Instead of debating whether markets are inherently chaotic or efficient, this thesis adopts a pragmatic approach: constructing a model that excels at chaotic prediction and then comparing its performance with the random walk model. With some of the inherent challenges of financial time series forecasting established, the next section introduces the model employed in this thesis.

2.3 The ESN-GA Model

Echo State Networks (ESNs) are a category of Reservoir Computing (RC) models that have shown excellent performance in predicting chaotic time series [19][41]. One of the ESN's key advantages is its limited training requirements. This is achieved through the construction of the reservoir, a high-dimensional, non-linear space of sparsely connected nodes whose weights and connections remain fixed [38]. Instead of training the entire network, only a linear readout layer is trained to interpret the states of the reservoir nodes. Since this readout layer is linear and non-recurrent, efficient regression techniques can be employed. The primary challenge in constructing ESNs lies in selecting the parameters defining the reservoir [57].
Questions such as “How many nodes should be used?”, “How should they be connected?”, and “How should the input be scaled?” are critical to the network’s performance. A common approach involves relying on expertise or intuition, essentially educated guessing. When such expertise is unavailable, a systematic and automated method becomes necessary. In this thesis, parameters are selected using a Genetic Algorithm (GA). A GA simulates the process of evolution [66]. Central to evolution is the concept of survival of the fittest, which measures how well an individual adapts to its envi- ronment. In the context of GAs, fitness is quantified using a mathematical function that evaluates how well an individual performs a given task. Here, an individual represents a set of parameters defining an ESN model. The GA process begins by randomly initializing a population of individuals, the first generation, within a defined search space. Each individual is evaluated based on its fitness, which, in this context measures the ESN’s predictive performance on LME cash contracts. To improve the next generation, evolutionary operators such as selective breeding and random mutation are applied: • Selective breeding: Individuals with higher fitness are more likely to be selected for reproduction, creating new offspring by combining the traits of 6 2. Background two selected parents. • Random mutation: Introduces novel variations by randomly altering traits. While often disruptive, mutation is crucial for maintaining genetic diversity and discovering new, potentially superior solutions. This process of selection, crossover, and mutation is repeated over multiple genera- tions, allowing the population to converge toward optimal or near-optimal solutions. Compared to random or grid search methods, GAs often identify high-performing solutions more efficiently [85]. The nested optimization loop, where the GA optimizes the parameters of the ESN whose readout layer is then trained, form the ESN-GA model proposed in this the- sis. This combination leverages the predictive power of ESNs and the optimization capabilities of GAs to address the challenges of parameter selection systematically. 2.4 Related Work Having introduced the relevant topics of the thesis it is worth looking into whether anything similar has been done before. The concept of reservoir computing is not new. It was formally introduced by Herbert Jaeger in 2001 [38]. Similar models were known before but Jaeger’s work is the foundation of what RC is today [40]. Since then a multitude of effective applications have been found [12]. There are also multiple RC libraries available for ease of implementation [68] [15] [16]. However, when pairing the RC model with a GA, using existing libraries can be cumbersome and clunky. Instead a low level custom implementation is favored allowing for control and larger understanding of network design decisions. Creating a low level implementation of RC can be challenging but the work by Lukoševičius, specifically A Practical Guide to Applying Echo State Networks, is a vast resource [57]. The guide is in fact cited in most of the existing libraries. Others have used a genetic algorithm to optimize the parameters of an RC model with positive results [29] [11]. There are also implementations of RC networks specifically for financial time series [84] [89] [88], however none known to the author for predicting metal futures. 
Using ESNs, or reservoir computing for that matter, for financial time series predic- tions is quite niche and there are seemingly no comprehensive literary reviews on the topic. Exemplifying this is the omission of RC models from reviews of deep learn- ing for financial time series predictions [13] [83] [73]. It should also be noted that data from non-publicly traded metal futures, such as the LME data, is also never mentioned in any of the reviews. Further, using the random walk as a benchmark is not common. In the respective 72, 85 and 140 studies in these reviews, using the random walk as a comparative baseline only appears twice, even though the efficient market hypothesis is commonly cited. The explanation for this might be found in a study that replicated 15 best in class deep learning models found in other studies [58]. None of the models replicated in the study managed to outperform the naive approach, called random walk model in this thesis, based on mean squared error over multiple time periods. 7 2. Background 8 3 Theory This section handles the theory required to construct an ESN, optimize its param- eters using a GA and produce predictions. 3.1 Network Topologies A network topology refers to the arrangement or organization of nodes and connec- tions within a network. It is particularly relevant to this subject due to its significant impact on network performance [36] [59] [60]. This section outlines the two different network topologies utilized in this study. The algorithms used create simple adjacency matrices. Such matrices describe in binary terms how the nodes in the network are connected. A one represents a connection, and subsequently a zero represents no connection. The matrices are directional and self-connecting, that is, node a can be connected to node b while node b does not necessarily need to be connected to node a, also node a can be connected to itself. Let A be an n× n adjacency matrix where the elements Aij are defined as: Aij = 1 if i is adjacent to j, 0 otherwise, for all i, j ∈ N. After the simple adjacency matrix is generated, each of the non-zero elements are then replaced to fulfill the echo state property, this is handled later. The new matrix defines all the connections and corresponding weights in the reservoir, hence called the weight matrix. Below are descriptions of the two algorithms used for generating the simple adjacency matrices. 3.1.1 Scale-Free Network The reason for generating scale free networks in this context is to mimic the complex structure of networks occurring in nature, such as the neural network in brains [4]. Scale-free networks can be generated using the Albert-Barabasi model [5] and is achieved through so called preferential attachment. The process follows two general steps: first generate a small number of connected nodes, second add subsequent nodes with preferential attachment. That is, add connections with a proportional probability based on the degree of each node over the total degree of the network [5]. This approach leads to an undirected network without self connections so a slightly 9 3. Theory edited approach inspired by [69] and [3] is taken to make the network directed and self connecting. The process follows: The algorithm begins by initializing a clique C of size 2, consisting of two fully connected nodes. This is represented by an adjacency matrix A ∈ {0, 1}n×n, where Aij = 1 for i, j ∈ {0, 1}, and Aij = 0 otherwise. Define the set of all nodes as N = {0, 1, . . . 
, n − 1}, where n is the total number of nodes in the network. The initial clique C ⊆ N contains two nodes, and the set S = N \ C represents the nodes that are not initially part of the clique.

For each node s_k ∈ S, the algorithm sequentially adds s_k to the clique C, so that C ← C ∪ {s_k}. Once s_k is added to C, edges between s_k and the nodes in C are established based on preferential attachment. Specifically, for each node s_i ∈ C, the following steps are performed:

1. Sample two random values r_1, r_2 ∼ U(0, 1), where U(0, 1) denotes the uniform distribution on the interval [0, 1].
2. Compute the degree deg(s_i) of node s_i, which is the sum of the entries in the i-th row of the adjacency matrix A:
$$\deg(s_i) = \sum_{j=0}^{n-1} A_{ij}$$
3. Let the total degree of the nodes in C be:
$$\deg(C) = \sum_{i \in C} \deg(s_i)$$
4. Define the probability of connecting s_k to s_i as:
$$P(s_k \to s_i) = \frac{\deg(s_i)}{\deg(C)}$$
If r_1 < P(s_k → s_i), then set A_{ik} = 1 (i.e., create a directed edge from s_k to s_i).
5. Similarly, define the probability of connecting s_i to s_k as:
$$P(s_i \to s_k) = \frac{\deg(s_i)}{\deg(C)}$$
If r_2 < P(s_i → s_k), then set A_{ki} = 1 (i.e., create a directed edge from s_i to s_k).

If, after iterating through all nodes in C, node s_k has not been connected to any other node, i.e. no entry in row A_{k·} or column A_{·k} is set to one, the process is repeated until s_k is connected to at least one node in C. This process is repeated for each node s_k ∈ S until all nodes are part of the clique C, and A then defines the scale-free simple adjacency matrix.

3.1.2 Uniformly Random Networks

Generating uniformly random networks requires fewer steps and is commonly used for ESNs [36]. For every entry in the adjacency matrix, a uniformly random number between 0 and 1 is sampled. If the random number is below a specified probability P, the entry in the adjacency matrix is set to one, otherwise zero. That is: let A be an n × n adjacency matrix. For every i, j ∈ N = {0, 1, . . . , n − 1}, let p ∼ U(0, 1), where U(0, 1) denotes the uniform distribution on the interval (0, 1). Then the elements A_{ij} are defined as:
$$A_{ij} = \begin{cases} 1 & \text{if } p < P, \\ 0 & \text{otherwise,} \end{cases} \quad \text{for all } i, j \in N.$$
Thus a uniformly random topology requires a connectivity parameter P, and A defines the uniformly random simple adjacency matrix.

3.2 Theory of Reservoir Computing Models

Reservoir computing is a recurrent neural network framework designed for, among other things, processing temporal data and solving complex dynamical problems [26]. It consists of a fixed, randomly initialized set of recurrent nodes, the so-called reservoir, which captures and transforms input data into a high-dimensional space. All nodes in the reservoir record states, which are then interpreted by a readout layer. A key feature is that only the readout layer is trained. This makes the approach computationally efficient, since training reduces to a least squares problem, while the network remains capable of learning complex patterns [79]. The architecture is particularly well suited to tasks like forecasting, pattern recognition, and modelling chaotic systems [26][41], a class to which financial markets could arguably belong [33]. Regardless, there are promising results for using reservoir computing models to make predictions in financial markets [33][84]. Following is an abstract outline of how ESNs operate, as described in [61].
In short, the process involves handling input to produce output and then optimizing the output towards a goal. Form two separate sequences:
$$D = \{x_t, y_t\}_{t=0}^{T-1}$$
where:
• The set D is the training data.
• x_t is the input signal at time t.
• y_t is the output signal at time t.
• There exists some relationship between the two signals.
The goal is to learn the relationship between x and y from D, and then, after the training period (for t ≥ T), be able to give estimates ŷ_t for an unknown y_t given some new input x_t. When performing a time series prediction, the goal is often to predict y_t = x_{t+τ} given {x_i}_{i≤t}, which is called a τ-step-ahead prediction. Now that the purpose of an ESN has been established, following is an abstract illustration of the network structure:

Figure 3.1: Illustration of a Reservoir Computing Model

A general reservoir computing model, as illustrated above, consists of the input and output layers (sometimes called the input and readout layers) connected to the n-dimensional dynamical space called the reservoir. The reservoir can be described as follows:
$$S_t = F(S_{t-1}, x_t)$$
where
• S_t ∈ R^n is the state of the reservoir at time t, consisting of {s_{i,t}}_{i=0}^{n-1}, the individual states of the reservoir nodes.
• F : R^n × R → R^n is the map describing the dynamics of the reservoir.
This means that the state of the reservoir S_t evolves over time according to the dynamics F and the input x_t. The approximation ŷ_t of y_t is calculated from the reservoir as follows:
$$\hat{y}_t := \sum_{i=0}^{n-1} w_i s_{i,t}$$
That is, the states of the nodes in the reservoir are read and scaled by the output weights {w_i}_{i=0}^{n-1}. The output weights are chosen by making the approximation as close to the output signal as possible on the training data D, that is, ŷ_t ≃ y_t for t ∈ [0, T − 1]. Typically, this is done by a least squares method.

3.2.1 The Echo State Property

For the functionality of the ESN it is essential that the state of the network is uniquely determined by any left-infinite sequence of inputs [32]. This means that, regardless of the initial state of the network, the current state is fully determined by the infinite history of past inputs, ensuring that the influence of any arbitrary initial conditions vanishes over time. This can be guaranteed through the echo state property (ESP). A few formal definitions of the ESP have been proposed, differing in restrictiveness, in the exact ESN formulation, and in practicality [32][37][38][39]. Following is the definition offered by [32].

Being uniquely determined by any left-infinite sequence depends on the input sequence. Therefore, the definition of the ESP is stated by constraining the input sequence's range to a compact set, denoted X. We also need to relate X to the set of all possible states of the nodes in the reservoir, denoted S. This leads to the compactness condition, which states: let F be the map defining the dynamics of the reservoir, where
• F is defined on S × X,
• S ⊂ R^n,
• X is a subset of the input space,
• X and S are compact sets,
• F(S_{t−1}, x_t) ∈ S for x_t ∈ X and any t ∈ Z.
Now, S being compact is guaranteed when the activation function f of the reservoir nodes is bounded. A common example of a bounded activation function is the hyperbolic tangent, tanh. Practically speaking, the inputs x_t will also be bounded, making X a compact set.
Now, let X^{−∞} and S^{−∞} denote the sets of left-infinite input and state sequences, respectively, defined as:
• X^{−∞} := {x_{−∞} = {x_i}_{i=−∞}^{0} | x_t ∈ X for any t ≤ 0}
• S^{−∞} := {S_{−∞} = {S_i}_{i=−∞}^{0} | S_t ∈ S for any t ≤ 0}
A state sequence S_{−∞} is compatible with x_{−∞} when:
• S_t = F(S_{t−1}, x_t), ∀t ≤ 0.
Equipped with this, the definition of the ESP follows: a network with dynamics F : S × X → S satisfying the compactness condition has the echo state property if, for any input sequence x_{−∞} ∈ X^{−∞} and any two state sequences S_{−∞}, K_{−∞} ∈ S^{−∞} compatible with x_{−∞}, it holds that S_0 = K_0.

This definition implies that states that are nearby must be the result of similar input histories [8]. Consequently, the current state should depend more on recent inputs and states than on historic ones. In practice, one way of achieving the echo state property is to assume a compact set of input vectors, which almost always holds, and to adjust the reservoir weight matrix accordingly. So, for the standard ESN, construct the reservoir weight matrix W^res in the following manner to guarantee the ESP [32]:
1. Randomly initialize W^res ∈ R^{n×n} such that all entries w^res_{ij} ≥ 0, i.e., they are non-negative.
2. Scale W^res such that the spectral radius is less than 1, i.e., ρ(W^res) < 1.
3. Change the sign of a desired number of weights w^res_{ij} so that negative connections are introduced.
This guarantees the ESP; a few ways to prove it can be found in [32][39][38]. It should be noted, as explicitly stated in [40] and shown in [32], that randomly initializing W^res and then scaling the spectral radius below unity is not necessary for the ESP, and spectral radii above unity might be optimal for a specific application.

3.2.2 Calculating Output

Previously, the calculation of the output, the approximation ŷ, was described as multiplying the states of the reservoir by the weights of the output layer; that is a simplification. This section outlines the full calculation from input to output. To start with, we define all the vectors, matrices and activation functions used to define the dynamics of the reservoir F and the subsequent calculation of the output.

Figure 3.2: Illustration of a Reservoir Computing Model, Complemented with the Vectors, Matrices and Activation Functions used to Define the Dynamics of the Reservoir

First we have the matrices:
• W^in ∈ R^{n×k}, where dim(x_t) = k, the number of dimensions of the input signal.
• W^res ∈ R^{n×n}, where n is the number of nodes in the reservoir.
• W^out ∈ R^{m×n}, where dim(y_t) = dim(ŷ_t) = m, the number of dimensions of the output signal and its approximation.
Second, the bias vectors:
• b^in ∈ R^n
• b^out ∈ R^m
Note that not all nodes in the reservoir are necessarily connected to the input or output layers. For connected nodes, the entries in the corresponding weight matrix and bias vector are non-zero; for non-connected nodes, the entries are zero.
Third, the state vector of the reservoir:
• S_t ∈ R^n, the reservoir state at time t.
• S_{t,i} denotes the state of node i at time t.
And fourth, the three activation functions:
• f^in, the element-wise activation function of the input layer. Usually just the identity.
• f^res, the element-wise activation function of the reservoir. Usually the hyperbolic tangent.
• f^out, the element-wise activation function of the output layer. Usually the identity or a rectifier.
Last, as suggested in [68], [27] and [10], we introduce a leakage rate, α ∈ (0, 1].
The leakage rate helps control how fast the state of the reservoir changes with new input, allowing the network to learn slower dynamical systems and to recognize strongly time-warped patterns [27]. With all terms defined, the first step is to calculate S_t using the dynamics of the reservoir F driven by x_t:
$$S_t = F(S_{t-1}, x_t)$$
where the map is:
$$S_t = (1 - \alpha)S_{t-1} + \alpha f^{res}\left(W^{res}S_{t-1} + f^{in}\left(W^{in}x_t + b^{in}\right)\right)$$
With the updated state S_t, we can calculate the output ŷ_t:
$$\hat{y}_t = f^{out}\left(W^{out}S_t + b^{out}\right)$$
Having calculated the output, it can be used to train the output weights and/or to make predictions, as discussed in the following sections.

3.2.3 Offline Learning: Ridge-Regression

Offline training entails training the model on a fixed historical dataset. The only coefficients in the model that are optimized are W^out and b^out. Because f^out is typically linear, the calculation of the output can be written as:
$$\hat{y}_t = \bar{W}^{out}\bar{S}_t$$
where W̄^out = [W^out  b^out] appends the bias as an extra column and S̄_t = [S_t; 1] appends a constant one to the state, allowing for simpler notation. The purpose of the ESN is to match the input-output pair [X, Y]. This is done by feeding x_t into the reservoir, updating the states S_t and calculating the output ŷ_t as above, where ŷ_t should be as close to y_t as possible. To achieve this, one can gather all the states produced by X and minimize the squared error with respect to the weights and bias. Assuming we have T time steps of training data, we want to solve the following equation for W̄^out:
$$Y = \bar{W}^{out}\bar{S}$$
where Y ∈ R^{m×T}, S̄ ∈ R^{(n+1)×T}, and W̄^out ∈ R^{m×(n+1)}. Since the system is usually overdetermined, T ≫ n + 1, it generally has no exact solution, and a least-squares solution is sought instead. Several methods exist for this; ridge regression is the most universal and stable [57]:
$$\bar{W}^{out} = Y\bar{S}^{\top}\left(\bar{S}\bar{S}^{\top} + \lambda I\right)^{-1}$$
where ⊤ denotes the transpose, I is the identity matrix, (·)^{−1} represents the matrix inverse, and λ is the regularization coefficient. λ is included to penalize large absolute values in W̄^out. If W̄^out has very large absolute values, tiny differences in S̄ may be amplified. This can mean worse generalization for out-of-sample inputs that produce previously unseen states, and potentially massive errors. Picking λ such that this does not happen is important, and it is therefore included as a parameter that needs optimization.

3.2.4 Online Learning: Stochastic Gradient Descent

Online learning enables the model to adapt to new data. This is desirable when the time series X is non-stationary [72][90], which financial time series can be. Online learning through stochastic gradient descent (SGD) achieves this by, at each time step, shifting the weights and bias such that the squared error becomes smaller. SGD is formulated as follows [67]. The objective of SGD is to minimize the average value of a function L:
$$L(\theta) = \mathbb{E}_{q(p)}[L(\theta, p)]$$
where θ are the parameters we want to optimize and p is a random input to L, for example a sample from the training set. At each step of the optimization, we assume that we observe:
$$L_t(\theta) = L(\theta, p_t)$$
with p_t ∼ q. We also assume that an unbiased estimate of the gradient of L can be computed. This holds if the distribution q(p) is independent of θ, in which case we can use:
$$g_t = \nabla_{\theta} L_t(\theta_t)$$
With that, we can follow the negative gradient and iteratively update the parameters:
$$\theta_{t+1} = \theta_t - \eta_t g_t$$
where η is the learning rate, determining how far each update steps in the negative gradient direction. An illustrative sketch of this update applied to the readout layer is given below.
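To make the update concrete for the readout layer, the following is a minimal sketch of a single SGD step, written in Java in the spirit of the thesis's implementation but not taken from it. It assumes a squared-error loss L_t = ||y_t − W̄^out S̄_t||², an identity output activation, and that the bias has been absorbed into the extended state vector S̄_t = [S_t; 1]; the class, method and parameter names are illustrative.

/**
 * Illustrative sketch of one stochastic-gradient-descent step for the linear
 * readout of an ESN. Assumes a squared-error loss L_t = ||y_t - wOut * s_t||^2,
 * where s_t already contains a constant bias entry, so wOut absorbs b_out.
 * Names and signatures are illustrative, not taken from the thesis code.
 */
public final class ReadoutSgd {

    /**
     * Performs wOut <- wOut + 2 * eta * (y - yHat) * s^T in place.
     *
     * @param wOut readout weights, shape m x (n + 1)
     * @param s    extended reservoir state [S_t; 1], length n + 1
     * @param y    target output at time t, length m
     * @param eta  learning rate
     * @return the prediction yHat made before the update
     */
    public static double[] sgdStep(double[][] wOut, double[] s, double[] y, double eta) {
        int m = wOut.length;
        int n1 = s.length;

        // Forward pass: yHat = wOut * s (f_out is the identity).
        double[] yHat = new double[m];
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int j = 0; j < n1; j++) {
                acc += wOut[i][j] * s[j];
            }
            yHat[i] = acc;
        }

        // The gradient of the squared error w.r.t. wOut is -2 * (y - yHat) * s^T,
        // so stepping against the gradient adds 2 * eta * (y - yHat) * s^T.
        for (int i = 0; i < m; i++) {
            double err = y[i] - yHat[i];
            for (int j = 0; j < n1; j++) {
                wOut[i][j] += 2.0 * eta * err * s[j];
            }
        }
        return yHat;
    }
}

The factor of two from the gradient of the squared error is kept explicit here; in practice it can be absorbed into the learning rate.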
Because the method approximates the error space by a hyper plane and traverses it in the steepest instantaneous direction, η needs to be optimized. If picked too large the new weights will overshoot the optimal solution and too small never adapt to a non-stationary time series. The choice of η is handled in the methods section. 16 3. Theory Now, θ represents the readout layer, that is W out and bout. The calculation of the gradient of the loss function with respect to the two is also handled in the methods section. 3.3 Theory of Genetic Algorithms Thus far a few parameters needed to construct and train ESNs have been introduced. This section outlines some theory of genetic algorithms (GAs), its usefulness in parameter selection and how they can be implemented. GAs mimic Darwinistic evolution [66]. This is why most terminology is borrowed directly from the world of biology. A general genetic algorithm proceeds as follows [66] [85]: Figure 3.3: Schematic of a General Genetic Algorithm Where a more detailed explanation follows: 1. Initialize the population by randomly generating N different chromosomes, ci, where i = 0, . . . , N − 1, from the genome (the search space). 2. Evaluate the individuals (a) Decode the chromosomes into their corresponding parameters, pi, where i = 0, . . . , N − 1. 17 3. Theory (b) Evaluate the objective function g using the parameters. Assign a fitness value to each individual Gi = g(pi), where i = 0, . . . , N − 1. 3. Form the next generation (a) Select two parents, i1 and i2, such that individuals with a better fitness value have a higher probability of being selected. (b) Generate two children, j1 and j2, by crossing the two parents’ chromo- somes ci1 and ci2 . (c) Mutate the two children j1 and j2. (d) Repeat until you have a new generation of N children. 4. Return to step 2 unless the termination criteria have been achieved. Where the different terms are briefly explained as: • Population: The set of individuals with size N . • Individuals: The different sets of parameters being evaluated, encoded as chromosomes. • Decode: Translating the numerical values stored in the chromosome into the parameters. • Objective function: The function being optimized. In the case of this paper, it is the mean squared error of the Echo State Network (ESN) on the validation set. • Fitness value: The value of the objective function for an individual. In the case of mean squared error, lower is better. • Generation: The population during a specific iteration of the algorithm. • Parents: The individuals selected in the current generation to create the next. • Children: The product of the parents after crossing and mutation, constitut- ing the next generation. • Termination Criteria: The conditions that halt the algorithm. This could be a specified number of iterations, a certain fitness value achieved, or a max- imum number of iterations without an improvement in the fitness value. The three operators central to genetic algorithms are selection, crossover and mu- tation [66]. Different versions of the three exist and the ones utilized in this thesis are as follows: Tournament Selection 1. Randomly select ntournament individuals from the population. 2. Order the selected individuals from best to worst. 3. Select the best individual with probability ptournament. Continue down the ordered list until an individual has been selected, or there is only one individual left. In that case, select the last individual. 18 3. 
Theory Single Point Crossover Figure 3.4: Schematic of a Single Point Crossover 1. Given two selected parents, cross them with probability pcross. Otherwise, set the children to be exact copies of the parents. 2. Uniformly at random select a crossover point in the chromosome. 3. Set the first child to be the first parent’s genes before the crossover point, and the second parent’s genes after the crossover point. Vice versa for the second child. Modified Bit-Flipping Mutation Typically the chromosomes are encoded as strings of bits, ones and zeros. Then, during mutation every bit in each chromosome is flipped with a probability pmut. This means that for a random 32-bit string representing a number in a specified range, you have the following distribution of absolute change for a mutation of pmut = 5/32: Figure 3.5: Probability Distribution of Absolute Change in a 32-bit String with pmut = 5/32 19 3. Theory The distribution reveals that smaller changes, relative to the allowed range, are significantly more frequent when flipping bits compared to mutations that alter 90-100% of the allowed range. In essence, bit flipping facilitates both small and large changes, but the likelihood of larger changes diminishes progressively as their magnitude increases. Now, in the implementation in this thesis, instead of using bits, floating point num- bers are used. One way of implementing mutation for floating point encoding is to uniformly pick a number in the allowed range. This loses the property described earlier of bit-flipping mutation. So, the method used in [7] is employed instead. The process follows: Given an allowed range R = [cMin, cMax] and a value c that should be mutated, randomly pick a direction up or down for c. Without loss of generality we assume up was picked. This means the range the mutated value can fall into is R = [c, cMax]. Partition R into a equal parts of width w. Now, we want to pick a sub-range Rsub: Rsub = [c, c + wi] i ∈ Z, i ∼ U[1, a] Then pick a number cmut uniformly at random in Rsub. Let cmut replace c in the chromosome. This method of mutation leads to the following distribution of absolute change given a mutation occurs: Figure 3.6: Probability Distribution of Absolute Change in Modified Bit-Flipping The method enables mutations to result in both large and small changes but smaller changes are more likely, which is the property desired. GA-Parameters With the evolutionary operators defined, to run the genetic algorithm requires the following parameters: 1. N , Population size 2. M , Number of iterations 20 3. Theory 3. ntournament, Tournament size 4. pcross, Crossover probability 5. pmut, Mutation probability 6. [cmin i , cmax i ], range of chromosome i 7. ai accuracy of chromosome i 3.3.1 Premature Convergence A common issue when implementing genetic algorithms is fitness values converging to a local optima and being unable to find children outperforming parents. This is called premature convergence [22]. There are several factors that play a role in premature convergence, but the main culprit is a lack of genetic diversity [64]. Various techniques have been developed to address this issue [28]. While choosing an appropriate approach one should strike a balance between exploration of the search space and exploitation of solutions already found. Three measures are implemented in this thesis where the first two are inspired by [1] and the third by [25] [64]. They follow below. 1. 
Linearly Decreasing Crossover Rate
The idea is that, initially, the population is diverse, and exploiting the best solutions to a higher degree will not cause premature convergence. Later on, when the population converges on a few solutions, a lower crossover rate should allow the algorithm to still explore new solutions. The linearly decreasing crossover rate is implemented as follows:
$$p^{cross}_i = p^{cross}_{max} - \frac{p^{cross}_{max} - p^{cross}_{min}}{M - 1} \cdot i$$
where:
• M is the number of iterations.
• p^cross_i is the crossover rate for iteration i = 0, . . . , M − 1.
• p^cross_max is the initial crossover rate.
• p^cross_min is the final crossover rate.

2. Linearly Increasing Mutation Rate
The idea is similar to the previous technique. Initially, the population is diverse, and a high mutation rate would push the algorithm toward a random search. Later on, a higher mutation rate helps escape local optima. The linearly increasing mutation rate is implemented as follows:
$$p^{mut}_i = p^{mut}_{min} + \frac{p^{mut}_{max} - p^{mut}_{min}}{M - 1} \cdot i$$
where:
• M is the number of iterations.
• p^mut_i is the mutation rate for iteration i = 0, . . . , M − 1.
• p^mut_min is the initial mutation rate.
• p^mut_max is the final mutation rate.

3. Stagnation Reset and Elitism
Since it is difficult to determine a satisfactory fitness level, the algorithm is terminated after a certain number of iterations. In some cases, it converges prematurely despite the efforts mentioned above. One could decide to halt the algorithm after a certain number of stagnated iterations, that is, iterations with very marginal or no progress. The approach taken instead is to reintroduce genetic diversity by resetting the population after fitness progression has stagnated. This almost certainly worsens fitness at first, as the fitness distribution of a reset population is equivalent to that of the first generation. To prevent fitness from regressing, the individual with the best fitness so far is copied directly to the next generation. This is known as elitism. Elitism is applied every iteration, regardless of stagnation resets. Elitism and stagnation resets are implemented as follows:
1. After crossing and mutating in each iteration:
• If fitness improved, set the stagnation counter c_stagnation = 0; otherwise set c_stagnation ← c_stagnation + 1.
2. If c_stagnation = N_stagnation, set all children randomly.
3. After all operators (and possibly resetting), apply elitism:
• Set one of the children to be an exact copy of the best parent.
Note that these techniques introduce additional parameters for the genetic algorithm, such as N_stagnation and the ranges for p^cross and p^mut.
It should also be noted that the selection method used affects premature convergence. Using a ranked system, such as tournament selection, instead of fitness-proportional selection, can potentially mitigate premature convergence [85].

3.4 Overfitting

In most machine learning applications, over-fitting is a concern [34][85][77]. In general, the purpose of machine learning is to use some known input-output pairing to train a model such that it can generalize to unseen data, that is, produce predictions based on novel input. Under certain conditions, the model might fit the seen data very well but fail to make accurate predictions for unseen data; this is known as over-fitting. Usually, this occurs when the training methods over-correct for the noise in the data.
Considering the methods used in this paper, where three machine learning algo- rithms are employed, over-fitting might be a concern. Two remedies for over-fitting are using large datasets and validating the results before testing. Fortunately, the data available is quite extensive. However, validating both GA and the training of the ESN requires splitting the dataset into validation and testing sets for both mod- els. Another common approach is to use cross validation techniques, how this should be done for temporal data and reservoir states is unclear. Instead, the splitting is done as follows: 22 3. Theory • Warm-up Set – washes out the initial random states. • Training Set – used to optimize the readout layer. • Validation Set – validates the ESN training and is used for evaluation in the GA. • Test Set – tests the GA-optimized ESN. With this setup, the issue of over-fitting could arise in the genetic algorithm on the validation set. However, the genetic algorithm is somewhat blunt in the sense that it only indirectly interacts with the noise in the data. To check if the setup is over-fitting the validation set, which is used to train the genetic algorithm, the training error is measured for the best performing individual (set of parameters). If the training error starts to consistently increase, that would be an indication of overfitting the validation set. It is important to emphasize that the test set remains completely unseen by all op- timization algorithms, ensuring it serves as an unbiased evaluation of the model’s performance. The fitness on the validation set is exclusively used to guide the se- lection of the best network. If the genetic algorithm begins to overfit the validation set, this would likely manifest as both an increase in training error and poor perfor- mance on the test set. In such cases, alternative validation techniques or more robust data-splitting strategies should be explored to mitigate overfitting and improve the generalizability of the model. 23 3. Theory 24 4 Methods When building an ESN and running a GA, there are plenty of design choices that needs to be made. There are also plenty of ways to process data and evaluate results. This chapter describes the approaches taken. 4.1 Data The full historical data from the London Metal Exchange (LME) comes at a signif- icant cost [48]. However, Westmetall provides access to a subset of historical data on their website [86]. The "cash settlement" data refers to the closing offer from the daily second ring round of cash futures, while the "three month" data corresponds to the closing offer from the second daily ring round of the three-month futures contract. It is important to note that these are not the official prices determined by the LME. Official prices typically fall between the final bid and offer of the second daily ring round, with final offers generally being slightly higher, though rarely ex- ceeding a single trading tick ($0.5 or $0.1 depending on the metal). Additionally, the "stock" data provided by Westmetall represents the official stock levels recorded by LME-approved warehouses. The metals Westmetall provide data for are aluminium, copper, lead, nickel, tin, zinc. All metals but tin are mentioned in the Northvolt 2023 annual report[70]. What ratios or volumes used in their manufacturing process is not public informa- tion. Unfortunately, the data available on the Westmetall website is not provided in a readily downloadable format. 
As a result, it is necessary to scrape the data table directly from the HTML code. The algorithm employed for this purpose utilizes the Jsoup library [35] for parsing the HTML and the OpenCSV library [78] for handling CSV operations. The process operates as follows:

Algorithm 1 Extract Table Data from HTML and Write to CSV
1: Connect to the target URL
2: Select CSS query "table"
3: for each table in document do
4:   for each row in table (table.select("tr")) do
5:     for each cell in row (row.select("td")) do
6:       Write cell content to CSV
7:       Append comma "," after each cell
8:     end for
9:     Write new line in CSV after all cells in the row are processed
10:  end for
11: end for

This produces a CSV file where each line consists of a date, cash settlement, three month, and stock. The CSV files are then read into Java.

The data is then transformed to reflect the daily percentage change. This transformation results in the loss of the very first data point, as it is used as a reference. Various transformations were tested, including using the raw data or multiplicative changes, but the percentage change consistently yielded the best results. It is also an intuitive and easily interpretable format compared to other options, such as log changes. The transformation is applied to all data types (cash settlement, three month, and stock) as follows. Let i = 0, ..., T − 1, then

percentageChange[i] = \frac{data[i + 1] - data[i]}{data[i]} \times 100

The percentage changes of the different data types are then assembled into input and output matrices x and y to form paired sets for one-step predictions. This is done such that

y_{i-1} = x_i, for i = 1, ..., T

where each pair [x_i, y_i] is used for a one-step prediction, in which x_i, along with its history, aims to predict y_i.

The [x, y] data is then split into a warmup, training, validation, and test set. The warmup set should be large enough to wash out the initial states of the reservoir; 1000 data points of warmup and 2000 of training are suggested in [42]. Considering the finite amount of data in this application, such a ratio between warmup and training (1:2) might be too high. One also has to take into account that training the network with ridge regression requires an overdetermined system of equations. How to split the data is a vast topic of discussion. Ultimately, the split decided upon is:

Phase        Percentage
Warmup       10%
Training     85%
Validation   2.5%
Testing      2.5%

Because the number of available data points differs slightly among the metals, a percentage of the available data is used instead of a fixed number of data points. With the data collected, transformed, and split into different sets, it is time to describe the specifics of the ESN, which is the topic of the next section.
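The transformation, pairing, and split above amount to a few lines of array manipulation. The following Java sketch illustrates them for a single data type for brevity; the method names and the hard-coded ratios mirror the text but are otherwise illustrative, not the thesis code.

```java
// Illustrative sketch of the percentage-change transform, one-step pairing, and data split.
public class DataPrepSketch {

    // percentageChange[i] = (data[i+1] - data[i]) / data[i] * 100; the first raw point is lost.
    static double[] percentageChange(double[] data) {
        double[] pct = new double[data.length - 1];
        for (int i = 0; i < pct.length; i++)
            pct[i] = (data[i + 1] - data[i]) / data[i] * 100.0;
        return pct;
    }

    // Pair inputs and targets for one-step prediction: x[t] = pct[t] predicts y[t] = pct[t+1].
    static double[][][] pairForOneStep(double[] pct) {
        int T = pct.length - 1;
        double[][] x = new double[T][1], y = new double[T][1];
        for (int t = 0; t < T; t++) { x[t][0] = pct[t]; y[t][0] = pct[t + 1]; }
        return new double[][][] { x, y };
    }

    // Index boundaries for the 10% / 85% / 2.5% / 2.5% split used in the thesis.
    static int[] splitBoundaries(int T) {
        int warmupEnd = (int) (0.10 * T);
        int trainEnd  = warmupEnd + (int) (0.85 * T);
        int valEnd    = trainEnd + (int) (0.025 * T);
        return new int[] { warmupEnd, trainEnd, valEnd, T };
    }
}
```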
4.2 The Echo State Network

This section describes how the ESN is initialized and how warmup, training, validation, and testing are performed. The activation functions used are: f^{in} – unity, f^{res} – tanh, and f^{out} – unity. This results in the following formulations for updating the states S_t and calculating the output ŷ_t:

S_t = (1 - α) S_{t-1} + α \tanh(W^{res} S_{t-1} + \bar{W}^{in} \bar{x}_t)
ŷ_t = \bar{W}^{out} \bar{S}_t

where the bar indicates that the matrix or vector has been extended to accommodate a bias term, and tanh(·) denotes the element-wise hyperbolic tangent.

4.2.1 Initializing Networks

To initialize the network, the initial state vector S_0 and the matrices W^{in}, W^{out}, and W^{res} have to be defined and populated. There are a few legitimate choices that can be made. The approach taken is the following:

S_i = 0, ∀ i ∈ [0, n − 1]

where n is the reservoir size. Setting S to reasonable values is later handled by the warmup.

Having all of the nodes in the reservoir connected to both the input and output layers is not always optimal. To control this, two sets of connections to the reservoir nodes are needed, for W^{in} and W^{out}:

NodesIn = {s_i ∈ U_d[0, n − 1] | s_i drawn without replacement, for i ∈ [0, inLength − 1]}
NodesOut = {s_i ∈ U_d[0, n − 1] | s_i drawn without replacement, for i ∈ [0, outLength − 1]}

where U_d denotes the discrete uniform distribution, and inLength and outLength are the numbers of non-zero connections to the reservoir.

Since W^{out} is determined during training with a ridge regression, which does not require calculating output because the desired output is known, there is no need to randomly initialize that matrix. NodesOut is used to determine which nodes' states should be collected to form the design matrix. The entries of \bar{W}^{out} are then calculated as described in the theory section.

W^{in}, on the other hand, needs to be randomly initialized. Controlling the distribution of W^{in} in relation to the input types and the reservoir is a desirable feature. Therefore, an input scale for each of the data types and a bias scale are used. W^{in} is initialized as:

W^{in}_{j,i} ∈ U[−inScale_j, inScale_j]  ∀ i ∈ NodesIn, j ∈ [0, k − 1]
W^{in}_{j,n} ∈ U[−inBiasScale, inBiasScale]  ∀ j ∈ [0, k − 1]

where k is the number of data types.

When it comes to W^{res}, the process becomes more involved, considering the echo state property (ESP). The process is as follows. Let A denote the n × n adjacency matrix, initialized either uniformly random or scale-free as described in the theory section. Initialize the weights as

W^{res}_{i,j} ← A_{i,j} U(0, 1]  ∀ i, j ∈ [0, n − 1]

such that the non-zero elements are set uniformly at random on (0, 1]. The next step is to scale the weights such that W^{res} has a specified spectral radius. This is done by calculating the current spectral radius of W^{res} and then multiplying all weights by the ratio of the wanted spectral radius to the current spectral radius. The spectral radius is approximated using power iteration. The process follows [87][9]:

Algorithm 2 Power Iteration for Largest Eigenvalue
1: Randomly initialize b_k ∈ R^n with ∥b_k∥ ≠ 0; b_k is the initial guess for the dominant eigenvector.
2: for iter = 1 to MaxIterations do
3:   b_{k+1} ← W^{res} b_k
4:   ρ ← (b_{k+1}^T b_{k+1}) / ∥b_{k+1}∥
5:   b_{k+1} ← b_{k+1} / ∥b_{k+1}∥   {normalize b_{k+1}}
6:   if ∥b_{k+1} − b_k∥ < tolerance then
7:     break
8:   end if
9:   b_k ← b_{k+1}   {update for the next iteration}
10: end for

Here, b_k is a normalized vector that approximates the dominant eigenvector of W^{res}, which corresponds to the largest eigenvalue. The spectral radius ρ is approximated as the magnitude of the largest eigenvalue of W^{res}, estimated iteratively using the updated b_{k+1} vector. The power iteration is run with MaxIterations = 1000 and tolerance = 1e−9. This gives a close approximation of the spectral radius of W^{res}, and the weights can then be rescaled to the wanted spectral radius:

W^{res} ← \frac{ρ_{wanted}}{ρ} W^{res}

To make W^{res} include both positive and negative values, approximately half of the non-zero values are randomly set to be negative. This concludes the network initialization.
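A compact Java version of Algorithm 2, together with the rescaling step, could look as follows. It operates on plain double[][] matrices rather than the EJML types used in the thesis code, and the class and method names are illustrative. Note that line 4 of Algorithm 2 reduces to the norm of b_{k+1} when b_k is normalized, which is what the sketch computes.

```java
import java.util.Random;

// Illustrative power-iteration estimate of the spectral radius and rescaling of W_res.
public class SpectralRadiusSketch {

    static double spectralRadius(double[][] w, int maxIterations, double tolerance, Random rng) {
        int n = w.length;
        double[] b = new double[n];
        for (int i = 0; i < n; i++) b[i] = rng.nextDouble() + 1e-3;   // non-zero initial guess
        normalize(b);
        double rho = 0.0;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] bNext = multiply(w, b);                          // b_{k+1} = W_res * b_k
            rho = norm(bNext);                                        // eigenvalue magnitude estimate
            if (rho == 0.0) break;                                    // degenerate matrix
            normalize(bNext);
            if (distance(bNext, b) < tolerance) { break; }            // converged
            b = bNext;
        }
        return rho;
    }

    // Rescale W_res in place so that its spectral radius is approximately rhoWanted.
    static void rescale(double[][] w, double rhoWanted, Random rng) {
        double rho = spectralRadius(w, 1000, 1e-9, rng);
        if (rho == 0.0) return;                                       // nothing to rescale
        double factor = rhoWanted / rho;
        for (double[] row : w) for (int j = 0; j < row.length; j++) row[j] *= factor;
    }

    static double[] multiply(double[][] w, double[] b) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < b.length; j++) out[i] += w[i][j] * b[j];
        return out;
    }
    static double norm(double[] v) { double s = 0; for (double x : v) s += x * x; return Math.sqrt(s); }
    static void normalize(double[] v) { double n = norm(v); for (int i = 0; i < v.length; i++) v[i] /= n; }
    static double distance(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```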
4.2.2 Warmup and Training

Warmup simply means passing input to the reservoir and updating the states. This ensures that the initial state, where all states are zero, does not affect the subsequent training process. When the warmup set has been passed through, training begins. This is done by passing the training set through the reservoir and collecting the states of the output nodes, creating the design matrix. The output weights and bias are then calculated using ridge regression, as described in the theory section.

There are a few ways one could optimize λ separately, since the computational expense is quite low: the same design matrix can be reused and the network does not have to be reinitialized. However, employing, for instance, a logarithmic grid search for λ in conjunction with the GA for the other parameters adds up and ultimately slows down the optimization significantly. λ is therefore also included in the genome of the GA.

4.2.3 Validation and Testing

With the output weights and bias set from the offline training, the model can now produce predictions. To validate the model, the validation set is passed through the reservoir and approximations ŷ_t are calculated. Since the model is adaptive through online training, at each time step the error gradient is calculated and the weights are slightly adjusted. As discussed in the theory section, we need the gradient of the error with respect to the weights and bias. Defining the loss function L_t as follows, we have:

L_t = \frac{1}{2} ∥ŷ_t − y_t∥^2
ŷ_t = f^{out}(z_t)
z_t = \bar{W}^{out} \bar{S}_t

That is, L_t is a function of ŷ_t, which is a function of the activation z_t, which is in turn a function of \bar{W}^{out}. Using the chain rule, the gradient of L_t can be expressed as

\frac{∂L}{∂\bar{W}^{out}} = \frac{∂L}{∂ŷ} \cdot \frac{∂ŷ}{∂z} \cdot \frac{∂z}{∂\bar{W}^{out}}

where:

\frac{∂L_t}{∂ŷ_t} = ŷ_t − y_t
\frac{∂ŷ_t}{∂z_t} = 1, since f^{out} is unity
\frac{∂z_t}{∂\bar{W}^{out}} = \bar{S}_t

and \bar{W}^{out} is updated as:

\bar{W}^{out} ← \bar{W}^{out} − η ∇L_t
\bar{W}^{out} ← \bar{W}^{out} − η (ŷ_t − y_t) \bar{S}_t^T

This is done at every step of both validation and testing. The learning rate η has to be set such that the model neither over-corrects for small errors nor fails to adapt to changing dynamics. The optimization of η is handled by the GA.

When the model has produced predictions for the validation and test sets, the results are measured using a few test metrics. These are the topic of the following section.

4.2.4 Calculating Test Metrics

To measure how well the model makes predictions, the mean squared error (MSE) is used. Since the output has multiple dimensions, the mean squared error is the mean of the MSE over the different dimensions. Given T predictions and m dimensions, it is calculated as

MSE = \frac{1}{Tm} \sum_{i=1}^{T} \sum_{j=1}^{m} (y_{j,i} − ŷ_{j,i})^2

This is the metric used as fitness in the GA. The primary concern of the model is to predict the next-day cash price. Therefore, the MSE of the cash dimension of y and ŷ is the metric of ultimate interest. These values are denoted y_{cash} and ŷ_{cash}, respectively. Given T predictions, MSE_{cash} is calculated as

MSE_{cash} = \frac{1}{T} \sum_{i=1}^{T} (y_{cash,i} − ŷ_{cash,i})^2

In addition to MSE, three other evaluation metrics are calculated to assess the efficacy of the model: mean absolute error (MAE), coefficient of determination (R^2), and directional accuracy. Omitting the cash subscript for simplicity, these metrics are calculated as follows:

MAE = \frac{1}{T} \sum_{i=1}^{T} |y_i − ŷ_i|

R^2 = 1 − \frac{\sum_{i=1}^{T} (y_i − ŷ_i)^2}{\sum_{i=1}^{T} (y_i − \bar{y})^2}

where \bar{y} denotes the mean of y.

Directional Accuracy = \frac{1}{T} \sum_{i=0}^{T−1} 1_{y_i, ŷ_i}, where 1_{y_i, ŷ_i} = 1 if sign(y_i) = sign(ŷ_i), and 0 otherwise.
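A small Java sketch of these metrics, applied to a single output dimension such as the cash series of percentage changes, might look as follows. The method names are illustrative; only the formulas above are taken from the text.

```java
// Illustrative implementation of the evaluation metrics defined above,
// applied to a single output dimension such as the cash percentage change.
public class MetricsSketch {

    static double mse(double[] y, double[] yHat) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += (y[i] - yHat[i]) * (y[i] - yHat[i]);
        return s / y.length;
    }

    static double mae(double[] y, double[] yHat) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += Math.abs(y[i] - yHat[i]);
        return s / y.length;
    }

    static double rSquared(double[] y, double[] yHat) {
        double mean = 0;
        for (double v : y) mean += v;
        mean /= y.length;
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < y.length; i++) {
            ssRes += (y[i] - yHat[i]) * (y[i] - yHat[i]);
            ssTot += (y[i] - mean) * (y[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    // Fraction of days on which the predicted sign matches the actual sign.
    static double directionalAccuracy(double[] y, double[] yHat) {
        int hits = 0;
        for (int i = 0; i < y.length; i++)
            if (Math.signum(y[i]) == Math.signum(yHat[i])) hits++;
        return (double) hits / y.length;
    }
}
```

For the random walk comparison described next, yHat would simply be an array of zeros in percentage-change space.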
Because the model works on the transformed version of the underlying data, it may also be of interest to look at the predictions in the original, reverted form. Since the original values of y are already available, the only thing one has to calculate is the reverted ŷ, which is done as

ŷ^{reverted}_i = x^{original}_i \left(1 + \frac{ŷ^{percent}_i}{100}\right)

Note that the reverted and original superscripts are omitted elsewhere for simplicity.

As outlined in the theory section, the RW model predicts that the expected value of the next price is the same as the current price. In terms of percentage change, the model predicts an expected 0% change every day. This makes its predictions very convenient to calculate. The result metrics for the RW model are calculated in exactly the same way as above, just replacing ŷ_{cash} with 0 and ŷ^{reverted}_{cash} with x^{original}_{cash}. The directional accuracy is not very useful, considering the RW model predicts no directional change; therefore, the ratio of upward daily movements is calculated instead, to give context to the results of the ESN's predictions. It is calculated as

Directional Accuracy = \frac{1}{T} \sum_{i=0}^{T−1} 1_{\{y_i ≥ 0\}}

4.3 Genetic Algorithm

This section outlines the approach taken for the genetic algorithm. Having established all the parameters needed for the initialization, offline, and online training of the ESN, as well as a way of measuring how well the model makes predictions, the setup for the GA can be established. First off is the search space for the parameters, the genome, which is the topic of the following section.

4.3.1 The Genome

The genome defines the search space for the parameters. These are the values used:

Parameter        Min Value   Max Value   Precision
Topology         0           1           1
Connectivity     0.01        0.15        5
LearningRate     0           0.01        5
Regularization   -5          5           5
SpectralRadius   0.5         1           7
LeakageRate      0           1           7
ReservoirSize    500         1200        7
inLength         50          1200*       7
outLength        50          1200*       7
inBiasScale      0           100         5
inFactorScales   0           3           5

The precision is the number of intervals used in the modified bit-flipping mutation; a higher precision value therefore makes smaller changes more likely. The topology parameter is the only non-numerical parameter. Since it has precision 1, initialization and mutation are uniformly random on the interval, and the random value is then rounded to the closest integer: 0 indicates a scale-free topology and 1 a uniformly random topology. The regularization is on a base-10 logarithmic scale; if the gene is encoded as −5, the λ used in the model is decoded as 10^{−5}. Since inLength and outLength cannot exceed the reservoir size (marked with * above), this has to be accounted for when decoding. Since the lengths and sizes represent numbers of nodes, these gene values are rounded to the nearest integer.

Choosing the values for the genome is challenging and mostly done through trial and error, though there are some general guidelines. The spectral radius, for instance, should be scaled below unity. The connectivity should make the reservoir sparse, and the leakage rate is defined between 0 and 1. The reservoir size is limited by computing capacity, which in turn limits the in and out lengths since they cannot be larger than the number of nodes in the reservoir. The out length is also limited by the available training data, considering that ridge regression requires an overdetermined system. The approach to defining the genome was to iteratively test using the metal data and inspect whether any parameter approached a boundary; the limits were then adjusted accordingly.
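The decoding rules just described (rounding the topology gene, mapping the regularization gene through a power of ten, rounding node counts, and clamping the in/out lengths to the reservoir size) can be summarized in a short Java sketch. The EsnParams class and field names are illustrative, not the thesis code.

```java
// Illustrative decoding of a genome (an array of raw gene values) into ESN parameters.
public class GenomeDecodingSketch {

    // Only a subset of the parameters from the genome table above is shown.
    static class EsnParams {
        boolean scaleFree;
        double connectivity, learningRate, lambda, spectralRadius, leakageRate;
        int reservoirSize, inLength, outLength;
    }

    // Gene order follows the genome table above.
    static EsnParams decode(double[] genes) {
        EsnParams p = new EsnParams();
        p.scaleFree      = Math.round(genes[0]) == 0;          // 0 = scale-free, 1 = uniformly random
        p.connectivity   = genes[1];
        p.learningRate   = genes[2];
        p.lambda         = Math.pow(10, genes[3]);             // regularization on a log10 scale
        p.spectralRadius = genes[4];
        p.leakageRate    = genes[5];
        p.reservoirSize  = (int) Math.round(genes[6]);         // node counts are rounded to integers
        p.inLength  = Math.min((int) Math.round(genes[7]), p.reservoirSize);  // clamp to reservoir size
        p.outLength = Math.min((int) Math.round(genes[8]), p.reservoirSize);
        return p;
    }
}
```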
Also note that the precision determines how likely mutation is to take small steps within the allowed range. A higher precision therefore places more emphasis on finding a precise optimal value than on exploring the whole range. A precision of 1 means that, given that the parameter is chosen to be mutated, all values in the range are equally likely during mutation.

4.3.2 Initialization

The population is initialized as follows:

c_{i,j} = min_j + (max_j − min_j) r, for every i ∈ [0, P − 1], j ∈ [0, C − 1], where r ∈ U[0, 1]

Here P is the population size, C is the number of genes for each individual, and min_j, max_j are the minimum and maximum values for each gene, determined by the genome.

4.3.3 Evaluation

Given a set of individuals P_g, the population of generation g, each individual c_i defines an ESN when decoded. For each c_i, an ESN is initialized, warmed up, trained, and then validated with online training. The MSE on the validation set, averaged over all dimensions of the output, is then used as the fitness score. In the case of the metals, this means that parameters defining networks that produce a smaller MSE when predicting the next-day cash, three month, and stock are awarded a better fitness score.

While trying out parameters for the GA, the error on the training set would sometimes explode while the validation error was still improving. With the setup ultimately decided upon, this does not seem to happen. Regardless, the MSE on the training set is tracked, although it is not directly involved in the evaluation.

The GA is run for 60 generations, after which the parameters corresponding to the best individual are stored in a CSV file. Also included in the file is the random seed used to construct the ESN. This is necessary for reproducing the ESN's performance seen during the GA, which is sensitive to the random parts of the initialization.

4.3.4 Evolution

Once all c_i have been evaluated, they are ready to be evolved. To do this, the crossover and mutation probabilities for generation g are calculated as described in the theory section. Using tournament selection, with a tournament size of 5, the parents of the next generation are chosen. With the current crossover probability, they are crossed using one-point crossover, and the children of the next generation are formed. Next, each of the genes in the children is mutated with the current mutation probability, using modified bit-flip mutation with the gene-specific precision in the allowed range. To make sure that performance does not regress, elitism is applied: a random child in the next generation is replaced with a direct copy of the best parent. To make sure that the search space is adequately explored, should the performance of the best individual remain unchanged for 10 generations, the population is re-initialized. This is done before elitism is applied, meaning that performance does not regress despite resetting the population.

4.3.5 Summary of Genetic Algorithm Parameters

To summarize, the parameters used for the genetic algorithm in each of the runs, that is, for both the benchmarking systems and all the metals, are:

Parameter               Value
Iterations              60
Population Size         30
Tournament Probability  0.8
Tournament Size         5
Initial Cross P.        0.9
Final Cross P.          0.5
Initial Mutation P.     0.5/numGenes
Final Mutation P.       2.0/numGenes
Reset Duration          10

Table 4.1: Genetic Algorithm Parameters

Here numGenes is the number of genes in the respective genome. This means that, on average, 0.5 genes per chromosome are mutated initially, increasing to 2 genes per chromosome towards the end of the run.
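One evolution step with these parameters (tournament selection of size 5 with tournament probability 0.8, followed by one-point crossover) can be sketched as below. The helper names are illustrative and the mutation step is omitted, since it was illustrated earlier; this is not the thesis code.

```java
import java.util.Random;

// Illustrative tournament selection and one-point crossover for one evolution step.
public class EvolutionStepSketch {
    static final int TOURNAMENT_SIZE = 5;
    static final double TOURNAMENT_P = 0.8;   // probability of taking the fittest contender
    static final Random RNG = new Random(7);

    // Select one parent index: draw 5 contenders, take the best with probability 0.8,
    // otherwise fall back to a random contender (lower fitness value = better).
    static int tournamentSelect(double[] fitness) {
        int best = -1;
        int[] contenders = new int[TOURNAMENT_SIZE];
        for (int k = 0; k < TOURNAMENT_SIZE; k++) {
            contenders[k] = RNG.nextInt(fitness.length);
            if (best < 0 || fitness[contenders[k]] < fitness[best]) best = contenders[k];
        }
        return RNG.nextDouble() < TOURNAMENT_P ? best : contenders[RNG.nextInt(TOURNAMENT_SIZE)];
    }

    // One-point crossover applied with probability pCross; otherwise the parents are copied.
    static double[][] crossover(double[] a, double[] b, double pCross) {
        double[] c1 = a.clone(), c2 = b.clone();
        if (RNG.nextDouble() < pCross) {
            int cut = 1 + RNG.nextInt(a.length - 1);
            for (int g = cut; g < a.length; g++) { c1[g] = b[g]; c2[g] = a[g]; }
        }
        return new double[][] { c1, c2 };
    }
}
```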
This formulation is often more convenient than using a raw number, especially when the number of genes shifts, as is the case for the Mackey-Glass equations (see Section 4.5).

4.4 Refinements to the Methods

During testing of the methods on the metal dataset, it became evident that a uniform approach across all metals was suboptimal. The primary challenge is ensuring the model's ability to generalize to unseen data in the test set. To address this, three refinements were introduced.

First, the fitness function was modified to focus solely on MSE_{cash}. Guiding the GA to prioritize models that optimize performance in the cash dimension allows it not to be constrained by trade-offs with the other dimensions. This metric also aligns with the primary objective of the report, benchmarking the model against the RW. The calculation of MSE_{cash} follows the procedure detailed in Section 4.2.4.

Second, the optimization of the regularization parameter λ was separated from the GA and handled through a grid search. This approach proceeds as follows:

Algorithm 3 Grid Search for Regularization Parameter λ in the Echo State Network (ESN)
1: for each individual in the population do
2:   Initialize the Echo State Network (ESN)
3:   Wash out the initial states using the warmup set
4:   Gather the reservoir states for the first 90% of the training set, denoting the state matrix S̄_90
5:   for each λ_i = 10^i, i = −3, −2, ..., 2, 3 do
6:     Calculate the output weights and bias: W̄^{out} = Y S̄_90^T (S̄_90 S̄_90^T + λ_i I)^{−1}
7:     Use these weights to make predictions for the remaining 10% of the training set
8:     Compute MSE_all
9:     Reset the reservoir states to the final state in S̄_90, i.e., the state at the end of the first 90% of the training set
10:  end for
11:  Choose the λ_i with the lowest MSE_all
12:  Reset the reservoir states to zero
13:  Proceed with the usual process: warmup, training on the entire training set, and computing MSE_all for predictions on the validation set
14: end for

Third, the stock dimension was removed from the dataset, as it was identified as a possible source of issues.

These refinements all maintain the same data splits for the validation and test sets, ensuring that the results remain directly comparable. While alternative methods, such as cross-validation, could potentially improve generalization, they are challenging to implement due to the temporal structure of the data.

4.5 Benchmarking Using Mackey-Glass and Lorenz Attractor

To assess whether the ESN-GA model is effective at predicting chaotic dynamical systems, the two examples Mackey-Glass and Lorenz are used. The two systems are described by the following equations.

Mackey-Glass equation:

\frac{dx(t)}{dt} = β \frac{x(t − τ)}{1 + x(t − τ)^n} − γ x(t)

where x describes the state of the system, and the parameters used are n = 10, β = 0.2, τ = 17, γ = 0.1.

Lorenz attractor equations:

\frac{dx}{dt} = σ(y − x)
\frac{dy}{dt} = x(ρ − z) − y
\frac{dz}{dt} = xy − βz

where x, y, z determine the state of the system over time, and the parameters used are σ = 10, ρ = 28, β = 8/3.

Solving these ordinary differential equations, given starting values for the variables, amounts in short to discretizing t and integrating numerically. There are methods available for this in the Apache Commons library [2]. However, because training the ESN requires several thousand data points, issues arise: the solvers FirstOrderIntegrator, ClassicalRungeKuttaIntegrator, and DormandPrince853Integrator all stop functioning before a satisfactory number of data points have been produced. This might be an issue of implementation.
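For illustration, a plain fixed-step fourth-order Runge-Kutta integration of the Lorenz system, written directly in Java without any solver library, could look like the sketch below. The step size and initial condition are arbitrary choices for the example, not values taken from the thesis.

```java
// Illustrative fixed-step RK4 integration of the Lorenz system (sigma = 10, rho = 28, beta = 8/3).
public class LorenzRk4Sketch {
    static final double SIGMA = 10.0, RHO = 28.0, BETA = 8.0 / 3.0;

    // Right-hand side of the Lorenz equations.
    static double[] f(double[] s) {
        double x = s[0], y = s[1], z = s[2];
        return new double[] { SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z };
    }

    // One classical RK4 step of size h.
    static double[] rk4Step(double[] s, double h) {
        double[] k1 = f(s);
        double[] k2 = f(add(s, scale(k1, h / 2)));
        double[] k3 = f(add(s, scale(k2, h / 2)));
        double[] k4 = f(add(s, scale(k3, h)));
        double[] next = new double[3];
        for (int i = 0; i < 3; i++)
            next[i] = s[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]);
        return next;
    }

    static double[] add(double[] a, double[] b) {
        return new double[] { a[0] + b[0], a[1] + b[1], a[2] + b[2] };
    }
    static double[] scale(double[] a, double c) {
        return new double[] { a[0] * c, a[1] * c, a[2] * c };
    }

    public static void main(String[] args) {
        double[] state = { 1.0, 1.0, 1.0 };   // arbitrary starting point
        double h = 0.01;                      // arbitrary step size
        for (int step = 0; step < 4000; step++) {
            state = rk4Step(state, h);
            System.out.printf("%f,%f,%f%n", state[0], state[1], state[2]);
        }
    }
}
```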
Luckily, there are readily available implementations that can produce several thousand data points. The code used for the Mackey-Glass system is provided by [14] and that for the Lorenz attractor by [82]. Both implementations are run with the desired parameters, and the results are exported to CSV files and then imported into Java for further processing. Note that the Mackey-Glass system has one dimension. This means a smaller genome is required, considering there is only one input scale to optimize. Both the Lorenz attractor and the metal data have three dimensions.

The same data split is used for both benchmarking systems:

Phase       Number of Data Points
Warmup      1000
Training    2000
Validation  500
Testing     500

4.6 Software and Tools

All code, except for generating the chaotic systems data, was written in the IntelliJ IDE [43], utilizing Java 13. The code is run through a Java virtual machine on a MacBook Air 2017 with macOS Monterey 12.7.6. The following libraries were used, all accessed through Maven [24]:
• Apache Commons Math: version 3.6.1 [2]
• Efficient Java Matrix Library (EJML): version 0.43.1 [76]
• JFreeChart: version 1.0.17 [18]
• Jsoup: version 1.17.2 [35]
• OpenCSV: version 5.9 [78]

During the coding process, both OpenAI's ChatGPT 3.5 and 4.0 [74] and JetBrains AI Assistant 2024.2 and 2024.3 [44] were used. The LaTeX template used for this document was provided by David Frisk [17], and ChatGPT was used to assist with formatting plots, tables, and equations. The Mackey-Glass data was generated using MATLAB [81], with code provided by [14], while the Lorenz attractor data was generated in Python [23], with code sourced from [82].

4.7 Limitations of the Methods

The methods employed in this study have several limitations:
1. Parameter Search: The genetic algorithm used for the parameter search is not guaranteed to find a global optimum but rather a satisfactory solution. Additionally, the complexity of patterns in the data might exceed the representational capacity of the network, constrained by both the number of nodes and available hardware resources.
2. Data Complexity and Availability:
• The patterns in the data might lack sufficient predictive power to consistently outperform a random walk.
• The methods are not restricted to the three data types used but could incorporate additional datasets, which are unavailable to the author.
3. Price Discrepancy: The data does not use the official LME closing price but instead relies on the last offer made on the trading ring. While this difference is often minimal, just a single trading tick, it may still influence results and practical application.
4. Computational Constraints: Limited computational capacity restricts the size of the search space and the ability to thoroughly evaluate all possible configurations.
5. Model Construction Choices: The numerous design decisions made during model construction are not rigorously verified for their individual contributions to performance. Comparing alternate paths or evaluating every choice is infeasible within the given time frame, particularly due to the long runtime required for GA evaluations and subsequent testing.
6. Fitness Function: The fitness function, which evaluates parameter combinations, is not guaranteed to be convex. Optimal solutions may lie outside the explored search space, even if parameters are far from a boundary.
7. Inherent Data Randomness: This setup cannot directly determine whether the data is inherently random.
Additional tests would be required to evaluate this aspect.
8. Statistical Significance: The results do not include statistical significance measures, which limits the ability to provide confidence intervals for predictions.
9. Data Integrity: The data originates from a third-party source, so its accuracy cannot be independently verified.
10. Prediction Horizon: The model is limited to one-day-ahead predictions. Extending the prediction horizon would require modifications to the setup.
11. Multi-Day Predictions: The current setup does not support autonomous multi-day predictions. While implementing a feedback model that uses its own predictions as inputs is straightforward, it is incompatible with stochastic gradient descent, which requires true data for iterative weight updates.
12. Simultaneous Time and Space Dependence: The Mackey-Glass equations exhibit time dependence, while the Lorenz attractor demonstrates spatial dependence. Although the setup is capable of handling these phenomena individually, there is no assurance that the model can accurately predict systems where both time and space dependencies occur simultaneously.
This list is by no means exhaustive.

4.8 Summary of Methodological Approach

This section summarizes the methodological approach employed for optimizing and testing the model.

Data Collection
• For Benchmarking:
  – Generate the data.
  – Split the data into warmup, training, validation, and test sets.
• For Metals:
  – Collect the data from Westmetall.
  – Transform the data into percentage changes.
  – Split the data into warmup, training, validation, and test sets.

Genetic Algorithm
• Initialize the population.
• Loop, for each generation:
  – Evaluate Population:
    ∗ Initialize the ESN.
    ∗ Perform warmup.
    ∗ Train: train the weights using ridge regression.
    ∗ Validate: nudge the weights using gradient descent and calculate the mean squared error (MSE).
  – Evolve Population:
    ∗ Perform selection and crossover.
    ∗ Mutation.
    ∗ Stagnation reset if necessary.
    ∗ Elitism.
• Store the optimized parameters and seed.

Testing Optimized Parameters
• Initialize the ESN with the optimized parameters.
• Perform warmup.
• Train using ridge regression.
• Validate using gradient descent and store the validation metrics.
• Test using gradient descent and store the test metrics.

The four different setups MSEall, MSEcash, grid search, and no-stock are applied to the six different metals. The benchmarking of the model on the Mackey-Glass and Lorenz data is only done with the MSEall setup.

5 Results

This section presents the results. First, the ESN-GA model is tested on the known application of predicting Mackey-Glass and Lorenz attractor data. Then an overview of the metal data is presented, leading up to the results pertaining to parameter optimization and, finally, a performance comparison with a random walk model.

5.1 Mackey-Glass Equations and Lorenz Attractor

This section is included to validate the ESN-GA model on well-established applications, ensuring its functionality. Additionally, it examines whether the GA effectively selects parameters that enhance performance and evaluates the ESN's ability to handle chaotic data. We begin by analyzing the performance progression of the GA on the Mackey-Glass dataset. Given that the Mackey-Glass system features a single spatial dimension with temporal dependencies, strong performance would indicate that the ESN-GA model successfully relates past values to predict future ones.
Figure 5.1: Genetic Algorithm Progress for Mackey-Glass Equations

The training and validation MSE go from 1.628E-5 and 1.167E-8 to 2.795E-8 and 5.253E-12, respectively. Note that the plot is on a logarithmic scale and that the algorithm is still improving upon termination, although progress slows down. There might be a case for running the GA longer; however, this clearly indicates that the GA is capable of selecting parameters superior to random ones. The parameters settled upon are:

Parameter           Value
Topology            Scale-Free
Connectivity        0.1387
Spectral Radius     0.8399
Regularization      2.730e-4
Learning Rate       9.999e-3
Leakage Rate        0.998
Reservoir Size      764
Input Length        358
Output Length       358
Input Bias Scale    15.405
Input Factor Scale  2.994

Table 5.1: GA Optimized Parameters for Mackey-Glass

Re-initializing the network, using the same seed as during the GA, results in the following predictions:

Figure 5.2: Mackey-Glass Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

The predicted values align with the actual values so closely that the difference is not distinguishable at the resolution and line thickness used. This points to the model's ability to predict the behavior of the Mackey-Glass system. The test metrics of the predictions follow:

Phase       MSE        MAE        R-Squared  Directional Accuracy
Training    2.795e-08  1.157e-04  1.000      0.993
Validation  5.250e-12  1.786e-06  1.000      0.998
Testing     1.966e-11  3.032e-06  1.000      0.996

Table 5.2: Mackey-Glass Model: Performance Metrics

The results demonstrate that the model clearly captures the behavior of the Mackey-Glass system. There is still an error, although a small one, and possibly room for improvement. Considering that the values are stored as doubles, which have a precision of around 10^{-16}, this should not cause issues. The GA is still improving after 60 iterations, indicating that the model in its current form has more potential. However, more iterations are not necessary, considering that the results clearly illustrate that the model is capable of predicting the Mackey-Glass system.

With that established, we move on to the Lorenz system. That system has three spatial dimensions that interact, making the complexity greater. If the ESN-GA model can perform well, it establishes that the model is capable of detecting interactions between input variables, not only temporal structures in the data. The Lorenz attractor also has a higher degree of chaos, measured by its Lyapunov exponent, compared to Mackey-Glass for the respective parameters, thus making predictions slightly more difficult. Note that the predictions are one-step, so errors do not accumulate exponentially as they would for multi-step predictions. The GA progress follows:

Figure 5.3: Genetic Algorithm Progress for Lorenz Attractor

The training and validation MSE go from 0.073 and 0.033 to 9.756E-8 and 8.782E-7, respectively. Yet again, this is a marked improvement over the initial random ESN parameters. Here it is more clear that the GA could run for longer, considering the steady improvements until termination. Because there are no obvious signs of premature convergence, the population size and enhancements can be deemed satisfactory, at least in this case.
The following are the parameters the GA settled upon:

Parameter           Value
Topology            Scale-Free
Connectivity        0.1352
Spectral Radius     0.6985
Regularization      1.690e-5
Learning Rate       2.361e-3
Leakage Rate        0.1324
Reservoir Size      967
Input Length        967
Output Length       855
Input Bias Scale    13.029
Input Factor Scale  [0.2937, 0.0741, 0.0804]

Table 5.3: GA Optimized Parameters for Lorenz Attractor

Re-initializing the network with the optimized parameters and seed produces the following predictions:

Figure 5.4: X Values of Lorenz Attractor Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

Figure 5.5: Y Values of Lorenz Attractor Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

Figure 5.6: Z Values of Lorenz Attractor Model Predictions vs Actual for Training, Validation, and Testing (panels: (a) Training, (b) Validation, (c) Testing)

Again, the difference between the actual and predicted values is hard to distinguish in the plots. Clearly, the model is capable of making one-step predictions in multiple dimensions. The test metrics for the three dimensions of the Lorenz attractor model follow:

Variable  Phase       MSE        MAE        R-Squared  Directional Accuracy
X         Training    3.107e-08  1.310e-04  1.000      0.999
X         Validation  1.560e-07  1.307e-04  1.000      0.998
X         Testing     2.081e-09  3.206e-05  1.000      0.998
Y         Training    1.651e-07  2.962e-04  1.000      0.999
Y         Validation  1.766e-06  3.521e-04  1.000      0.996
Y         Testing     1.173e-08  7.243e-05  1.000      0.998
Z         Training    9.651e-08  2.302e-04  1.000      1.000
Z         Validation  7.124e-07  2.644e-04  1.000      0.998
Z         Testing     8.112e-09  6.431e-05  1.000      0.998

Table 5.4: Lorenz Attractor Model: Performance Metrics for X, Y, and Z Variables

The situation is similar for the Lorenz attractor. The model is clearly capable of predicting the behavior of the system, with near-perfect metrics for both R^2 and directional accuracy. The comparatively larger values for MSE and MAE could to some degree be attributed to the larger absolute values. This might call for the use of a ratio-based error metric, but the purpose is not to extensively analyze the results from these two benchmark datasets, only to conclude that the model works for known applications. It is clear that the model is capable of improving on a random selection of parameters and that the ESN model initialized with said parameters is capable of predicting chaotic systems with temporal or spatial dependencies. The next step is to apply the model to the metal data. First we take a look at the LME data.

5.2 Data Overview

The following is the data from the LME provided by Westmetall, from the 2nd of January 2008 to the 2nd of September 2024, for the cash settlement, three month, and stock of six metals. Due to trading being halted, there are some gaps in the data. These missing data are summarized in the table below for the different metals.

Metal      Missing Data Points  Usable Data Points
Aluminium  87                   4131
Copper     7                    4211
Lead       85                   4132
Nickel     99                   4118
Tin        87                   4130
Zinc       85                   4132

Table 5.5: Number of Missing and Usable Data Points for the Metals

It should be noted that the missing data are distributed across the various data types. When one or more dimensions are missing for a given trading day, the entire day is discarded. Given the relatively small proportion of missing data, the remaining dataset is assumed to still provide an accurate representation of reality. A potential issue arises when missing data points occur consecutively over extended periods.
For example, the tin three-month data is missing from February 20 to June 16, 2008, and the nickel three-month data is missing from March 8 to March 25, 2022. In such cases, these gaps are simply skipped, resulting in discontinuities in the time series. For instance, in the case of nickel, the entry directly following March 7, 2022, is March 26, 2022. This requires the assumption that such gaps will not significantly impact predictive performance. The effect of missing data is visible as straight-line segments in the following plots of the datasets, which highlight these discontinuities.

Figure 5.7: Cash Settlements of the Metals

Metal      Mean       Median     Variance
Aluminium  2.045E+3   1.997E+3   1.511E+5
Copper     7.107E+3   7.049E+3   2.486E+6
Lead       2.064E+3   2.079E+3   8.698E+4
Nickel     1.650E+4   1.614E+4   2.563E+7
Tin        2.183E+4   2.059E+4   4.008E+7
Zinc       2.369E+3   2.306E+3   3.264E+5

Table 5.6: Statistics for the Cash

As the plots illustrate, the cash settlements for the different metals are quite different in absolute terms. Lead, zinc, and aluminium appear to behave similarly, being quite stable during the period. This contrasts with nickel and tin, which are more volatile: their price levels in early 2022 were almost five times those of early 2009, before having almost completely reverted by 2024. Next, we inspect the three month data for the metals.

Figure 5.8: Three Month of the Metals

Metal      Mean       Median     Variance
Aluminium  2.069E+3   2.018E+3   1.501E+5
Copper     7.115E+3   7.062E+3   2.479E+6
Lead       2.072E+3   2.090E+3   8.470E+4
Nickel     1.657E+4   1.625E+4   2.565E+7
Tin        2.169E+4   2.050E+4   3.829E+7
Zinc       2.374E+3   2.315E+3   3.095E+5

Table 5.7: Statistics for the Three Month

As expected, the series look similar. To get a sense of the possible predictive power of utilizing the three month data, one can plot the difference between the three month and the cash settlement, where a large deviation indicates a strong conviction in the market about upcoming changes.

Figure 5.9: Cash Settlement Deviation from Three Month

The plot illuminates the notion that the different metals behave slightly differently. Zinc barely shows any deviation at all, while tin at times has massive differences between cash and three month. There might also be some indication that the developments of tin during 2021-2022 are exceptional. This makes predictions using machine learning in general, and reservoir computing models in particular, difficult, considering there might not be enough data for the model to learn from. It should be noted that the plots so far depict absolute terms. Later, relative price changes will be considered, hopefully revealing a more stationary behavior, which is better suited for machine learning despite the online learning scheme. Below is the final type of data used in the study, the stock of the different metals.

Figure 5.10: Stock of the Metals

Metal      Mean       Median     Variance
Aluminium  2.653E+6   2.099E+6   3.099E+12
Copper     2.619E+5   2.429E+5   1.802E+10
Lead       1.574E+5   1.488E+5   8.350E+9
Nickel     2.000E+5   1.615E+5   1.577E+10
Tin        8.290E+3   6.283E+3   4.056E+7
Zinc       4.247E+5   3.358E+5   1.018E+11

Table 5.8: Statistics for the Stock

The plot clearly illustrates the need for using relative terms. The absolute difference between the aluminium stock and the tin stock is massive. The graph should be viewed in light of aluminium volumes sold at the LME in 2023 being around 57 million tons and tin volumes being around 1.2 million tons [53], which naturally will be reflected in the absolute stock kept.
There is also an apparent issue with the way Westmetall collects data from the LME. The nickel stock at the beginning of April 2012 is clearly wrong. On March 30, the nickel stock presented is 99882 tons, which drops to 542 tons on April 2 and stays at this level until April 11, when the stock returns to the more reasonable level of 99330 tons. Since there is no information suggesting that something exceptional happened during this period, the data is suspicious. Luckily, there is a snapshot of the LME website, which posts daily data, indicating that the nickel stock on April 3, 2012, should be 100542 tons [47]. It is probably safe to assume that the software used by Westmetall to collect the data from the LME has somehow removed 100 000 from the actual stock kept. Therefore, 100 000 is added back to the suspicious stock values between March 30 and April 11, 2012, in the data used. No similar cases were found throughout the dataset, but other errors cannot be ruled out, possibly affecting the effectiveness of the training and the validity of the results.

Extrapolating from basic economic theory on supply and demand, there should be a relationship between the stock kept and the prices. This might have predictive power in the model. Therefore, it is of interest to see the cash settlement and stock in the same graph. Some examples are presented below.

Figure 5.11: Lead Cash Settlement and Stock

Figure 5.12: Copper Cash Settlement and Stock

No strikingly obvious relationship is observed in either of the examples. Further speculation is pointless, and the usefulness of the stock will hopefully become evident in the subsequent optimization of the ESN. Next, the percentage changes of the data are presented.

Figure 5.13: Percent Change of Cash Settlement of the Metals

Metal      Mean    Median  Variance
Aluminium  0.0107  0.0000  2.0825
Copper     0.0193  0.0025  2.4216
Lead       0.0146  0.0000  3.9414
Nickel     0.0136  0.0000  5.1395
Tin        0.0353  0.0577  3.8550
Zinc       0.0203  0.0210  3.28