Text Classification with Cellular Automata Networks Exploiting The Reservoir Computing Approach

Master's thesis in Data Science and AI

Oscar Johansson

Department of Microtechnology and Nanoscience
Chalmers University of Technology
Gothenburg, Sweden 2024

© Oscar Johansson, 2024.

Supervisor: Zoran Konkoli, Department of Microtechnology and Nanoscience
Examiner: Zoran Konkoli, Department of Microtechnology and Nanoscience

Master's Thesis 2024
Department of Microtechnology and Nanoscience
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Categorisation of different transition rules based on experiment performance.

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

The thesis focuses on several options for exploiting reservoir computing ideas in the context of text classification, using cellular automata networks as the reservoir. The key innovative aspect of the study is related to finding options for computing with low-density networks, which should be easier to manufacture. The overarching idea is to extend Wolfram's elementary cellular automata rules and implement them within the framework of random cellular automata networks. The primary objectives of this study are as follows: (i) to determine the optimal network topologies, (ii) to identify the most effective transition rules within these networks, and (iii) to develop a robust methodological framework for finding the best networks. By exploring various network configurations, the aim is to uncover the structural characteristics that facilitate efficient information propagation and decision-making within the cellular automata framework. Observations from the study indicate that there exist cellular automata network configurations which can function as efficient reservoirs. Moreover, all considered cellular automata networks are categorised into different groups based on their performance, providing a solid foundation for further research.

Keywords: Reservoir computing, cellular automata networks, cellular automata, text classification, machine learning, natural language processing, supervised learning, information theory, elementary cellular automata.

Acknowledgements

I would like to thank my supervisor, Zoran Konkoli, for all the support and guidance I have received throughout my thesis work. I greatly appreciate your ability to create an encouraging and productive environment, allowing me to feel comfortable and do my best work. Working with you has inspired me to continue on my academic journey and pursue a PhD degree. Thank you. I would also like to extend my thanks to everyone I have met at the Department of Microtechnology and Nanoscience at Chalmers. I greatly value the help I have received and all the conversations we have had. Finally, I am deeply grateful to my friends and family for their interest and support, which continue to motivate me and allow me to aim higher.
Oscar Johansson, Gothenburg, 2024-06-17

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Problem Statement
  1.2 Research Objectives
  1.3 Limitations
  1.4 Contributions
  1.5 Thesis Structure

2 Background
  2.1 Reservoir Computing
    2.1.1 Physical reservoirs
    2.1.2 The readout layer
  2.2 Cellular Automata
    2.2.1 Elementary Cellular Automata
    2.2.2 Cellular Automata Networks
    2.2.3 Cellular Automata Networks as a Reservoir
  2.3 Network Theory
  2.4 Information Theory

3 Reservoir Computing Framework
  3.1 General Architecture
  3.2 Input structure
  3.3 Reservoir
    3.3.1 Network Topology
    3.3.2 Transition Rule
  3.4 Readout Layer

4 Methods
  4.1 Classification
  4.2 Rule Entropy

5 Results
  5.1 Network Topology Performance
  5.2 Transition Rule Performance
    5.2.1 General Performance
    5.2.2 Categorisation based on Accuracy Distribution
    5.2.3 Categorisation of Rules for N_PD-2
    5.2.4 Categorisation of Rules for N_PD-3
    5.2.5 Categorisation of Rules for N_PD-4
  5.3 Information Gain

6 Discussion
  6.1 Finding the Correct Network Topology
  6.2 Selecting an Appropriate Transition Rule
    6.2.1 Categorisations of Elementary Cellular Automata
    6.2.2 The Ideal Transition Rule
    6.2.3 The Inverse Rules Problem
  6.3 Cellular Automata Networks as a Reservoir
  6.4 Reservoir Computing as a Methodological Framework
  6.5 Genetic Algorithm
  6.6 Future Work

7 Conclusion

Bibliography

A Appendix 1

List of Figures

2.1 Visual representation of elementary cellular automata rule 91.
3.1 An overview of the model's architecture.
3.2 Markov models for generating text phrases with different text structures.
5.1 Average performance µ(N_γ, I, R_i) over all experiments for each rule R_i, i ∈ {0, 1, …, 255}, grouped by network model γ. The rules are sorted from worst to best performance within each group, with the x-axis representing the rank of each rule within its respective group.
5.2 Average performance µ(N_PD-δ, I_a,d, R_91) with error bars representing σ²(N_PD-δ, I_a,d, R_91) over 10 runs for each network model PD-δ, where δ ∈ {2, 3, …, 19}. The shaded gray area highlights the region where a significant drop in performance is observed.
5.3 Average normalized prediction accuracy µ(N_PD-α, I′, R_i) over all three input sequences I′ for each rule, based on the network models: (a) PD-2, (b) PD-3, (c) PD-4, and (d) the combined average for all three network models. Rules are presented in numerical order.
5.4 Four groups of distribution types.
5.5 Categorisation of rules on N_PD-2.
5.6 Four of the best rules on N_PD-2.
5.7 Categorisation of rules on N_PD-3.
5.8 Four of the best rules on N_PD-3.
5.9 Categorisation of rules on N_PD-4.
5.10 Four of the best rules on N_PD-4.
6.1 Visualisation of the three elementary cellular automata rules with the best performance on network models N_PD-2, N_PD-3, and N_PD-4, respectively.

List of Tables

5.1 Mean prediction accuracy µ(−, I_a,b, R), µ(−, I_a,c, R), and µ(−, I_a,d, R) without reservoir, i.e., the readout layer is applied directly to the input sequence.
5.2 Numerical results from the experiments, including the rank, mean accuracy, and standard deviation for each rule on each network model, as well as the overall results. Only a subset of the rules is shown; complete results can be found in Table A.1 in Appendix A.
5.3 Average information gain for 2, 3, and 4 predecessors, respectively.
A.1 Full results for N_PD-2, N_PD-3, and N_PD-4 from the experiments, including the rank, mean accuracy, and standard deviation for each rule and network pair. Additionally, average results over all three networks are also included. The rules are ordered based on the overall results.
A.2 Full results for N_BA, N_ER, and N_WS from the experiments, including the rank, mean accuracy, and standard deviation for each rule and network pair. Additionally, average results over all three networks are also included. The rules are ordered based on the overall results.
A.3 Full information gain table, including the entropy H(Y), the information gain for the current ("middle") cell IG_M, and the information gain for predecessors IG_A, IG_B, IG_C, and IG_D, corresponding to the order in which the predecessors appear in the list of predecessors. The information gain is calculated for cells with two, three, and four predecessors, separately.

1 Introduction

Computation fundamentally involves evaluating a function where an input is transformed into an output in a reproducible way [1].
To achieve this, mathematical abstractions representing the input and output must be embodied within physical systems [2]. Traditionally, this transformation is executed through a series of operations on the data using binary logic and electrical signals, structured according to the well-established von Neumann architecture [3]–[5]. In this standard approach, there is a clear distinction between the central processing unit, the memory, and the data transfer pathways between them. This architecture underpins most personal computers and smartphones today. However, it suffers from a significant limitation known as the von Neumann bottleneck [5], [6], which arises from the physical and practical constraints on the number and efficiency of connections between components. As systems scale up, the complexity and overhead of managing these connections increase, leading to inefficiencies and limiting overall performance. To address these challenges, researchers have explored alternative computation methodologies [7].

Unconventional computing encompasses computational approaches that deviate from the traditional von Neumann architecture. This category includes, but is far from limited to, quantum computing, which uses quantum bits (qubits) instead of traditional bits; optical computing, which processes information using light rather than electrical signals; and neuromorphic computing, which mimics the neural structure and functioning of the human brain to perform computations. These unconventional methods offer distinct advantages and application areas where they can outperform or complement conventional techniques.

This thesis explores the application of reservoir computing for text classification. Reservoir computing is a machine learning technique that leverages principles of neuromorphic computing. A typical reservoir computer consists of a dynamical system, known as the reservoir, driven by an external input that alters its state. Instead of analysing the original input directly, the state of the system is analysed and assigned a label using a linear readout layer. In this way, the information contained in the input sequence is transformed into the information represented by the label, constituting the computational process where one type of information is converted into another. The key advantage of this approach is that the readout layer can be simple, and thus easy to train, since the reservoir performs the computation.

A critical question, which will be discussed in this thesis, is how to construct an effective reservoir. This research focuses on a novel approach to building such reservoirs using cellular automata networks. By investigating the properties and dynamics of such networks, this study aims to enhance our understanding of their potential as reservoirs in the context of text classification.

1.1 Problem Statement

Traditional machine learning models frequently encounter limitations, such as large model sizes, time-consuming training processes, and substantial computational resource consumption. These challenges can restrict their deployment and limit their scalability and environmental sustainability. A significant contributing factor to these limitations is the inherent inefficiency of the von Neumann architecture. Reservoir computing has emerged as a promising solution to these issues. Unconventional computing facilitates the development of machine learning models that are smaller, require minimal training, and consume fewer resources.
Such models have the potential to outperform traditional methods in various tasks [8], while offering greater efficiency and reduced costs. Furthermore, these systems could be directly integrated into hardware.

The success of reservoir computing is critically dependent on the underlying dynamical system used as the reservoir. Although certain dynamical systems have demonstrated potential, this area remains underexplored, revealing a significant gap in current research. This study addresses this gap by investigating a versatile dynamical system in Cellular Automata Networks (CAN). Cellular automata networks feature complex dynamics governed by local interactions, rendering them intriguing candidates for effective reservoirs.

Through rigorous numerical experiments, the aim is to evaluate the capabilities of CAN as a reservoir for the classification of text structures, a fundamental problem in natural language processing. The objective is to differentiate between various forms of textual organization, which could ultimately contribute to applications such as language identification, authorship attribution, and the distinction between AI-generated and human-generated text. Given the novel application of CAN in this context, initial investigations will target simpler text structures to establish baseline capabilities and insights.

1.2 Research Objectives

The primary objective is to identify key properties of cellular automata networks that enhance their functionality as reservoir computing systems, particularly for text classification tasks. This goal is part of a broader attempt to understand the characteristics that define an effective reservoir. Specifically, this study aims to achieve the following objectives:

1. To determine the optimal network topology of cellular automata networks in the context of reservoir computing for text classification.
2. To identify the most effective transition rule for cellular automata within these networks.
3. To develop a robust methodological framework for addressing such complex and open-ended challenges.

1.3 Limitations

To determine the optimal network topology of cellular automata networks, it is impractical to conduct an exhaustive search over all network sizes and connection organisations given the exponentially large number of possibilities. Therefore, the scope of this research is strategically narrowed. The study will utilize a fixed network size and focus on a subset of commonly observed random network topologies. By limiting the variety of topologies considered, the aim is to identify the network structures that most effectively support reservoir computing for text classification tasks. Furthermore, for simplicity, only binary classification tasks are considered.

Regarding the most effective transition rule, the investigation will address the challenge of identifying optimal transition rules for cellular automata. The scope is confined to an extension of elementary cellular automata rules applied to graphs. By restricting the ruleset, the study aims to systematically evaluate the impact of different transition rules on the performance of the cellular automata networks in a controlled manner. This approach allows for a focused analysis of how specific rules influence the dynamical properties and computational capabilities of the network.
To develop a robust methodological framework for addressing such complex and open-ended challenges, the primary goal of this research is not to find the highest-performing classifier but to establish a framework that identifies the optimal reservoir for the task. Consequently, the hyperparameters of the readout layer will remain constant throughout the experiments. This decision ensures that the influence of different reservoirs on system performance is assessed independently of the readout layer optimization. By maintaining consistent parameters, the study isolates the effects of network topologies and transition rules, thereby contributing to a robust methodology for future research in this domain.

Given these constraints, it is acknowledged that this study may not discover the optimal cellular automata network and reservoir computing system. However, identifying the best-performing configurations under the specified conditions will provide valuable insights into the key properties of effective cellular automata networks as a reservoir. These findings will lay a solid foundation for future research and broader applications.

1.4 Contributions

This study contributes several key innovations. First, an extension of elementary cellular automata rules is proposed to accommodate more than two neighbours, generalizing their application from a grid structure to a graph structure. This broadens the scope and versatility of cellular automata in computational tasks. Additionally, a novel type of random network is suggested for constructing the cellular automata network. These networks exhibit low connectivity density, and when implemented with the extended elementary cellular automata rules, they perform better than their more common counterparts such as Erdős-Rényi networks, small-world networks, and scale-free networks. Furthermore, implementing a reservoir computing system with cellular automata networks is itself a novel contribution.

1.5 Thesis Structure

This thesis is organized into seven main chapters. Chapter 1 introduces the research problem, objectives, limitations, and contributions of the thesis. Chapter 2 presents the background on reservoir computing, cellular automata, network theory, and information theory, including discussions on physical reservoirs, the readout layer, and the use of cellular automata networks as reservoirs. Chapter 3 outlines the reservoir computing framework, detailing the input, reservoir, and readout layer components, while introducing the various network topologies and transition rules to be examined. Chapter 4 describes the methods employed for classification and rule entropy analysis within the reservoir computing framework. Chapter 5 presents the experimental results, discussing the performance of different network topologies and transition rules. Chapter 6 interprets the results, discussing the selection of network topologies and transition rules, the evaluation framework, and other related topics, while also proposing future research directions. Finally, Chapter 7 summarizes the key findings and highlights the contributions of the research. The thesis is supported by an appendix with complete experiment results.

2 Background

This chapter covers key concepts and theories related to reservoir computing, cellular automata, network theory, and information theory. By establishing this groundwork, the aim is to contextualize the study and highlight the significance of the chosen methodologies and approaches.
2.1 Reservoir Computing

Reservoir computing is a machine learning method, a type of neuromorphic computing centred around the concept of universal spontaneous computation implemented via a reservoir [9]. The reservoir is realised as a high-dimensional dynamical system with a rich internal state space that evolves over time and interacts with its environment [10]. The core idea is that such dynamical systems are inherently capable of computation and information processing [11], [12]. Specifically, the computational power of the dynamical system is assessed by observing the evolution of its state space in response to external inputs. Ultimately, the state of the system is transferred to a simple, linear readout layer that generates the output.

The primary applications of reservoir computing lie in classification and prediction tasks. Since the reservoir is realised by dynamical systems that evolve over time, this approach excels at processing time-dependent and sequential data. Key areas of application include time series prediction, speech recognition, signal processing, bioinformatics, and pattern recognition. These diverse applications highlight the versatility and robustness of reservoir computing in handling complex real-world data.

The concept of reservoir computing emerged as a response to the challenges associated with training neural networks, and it was developed independently by two different research groups. The training process of recurrent neural networks faces several problems, including vanishing and exploding gradients [13], overfitting, high computational costs, and extensive data requirements [14]. To address these issues, the idea to entirely forego training of the network, except for the readout layer, was proposed. This led to the introduction of two novel systems: Echo State Networks [15], [16], which are randomly initialized recurrent neural networks that possess the echo state property, and Liquid State Machines [17], [18], which utilize randomly connected spiking neural networks.

Since the introduction of reservoir computing, the concept of the reservoir component has been generalized to incorporate any arbitrary dynamical system with specific key properties. Firstly, the reservoir should be high-dimensional to project the input into a complex state space, allowing for detailed representations. Secondly, its dynamics should be rich, enabling the separation of similar inputs within the state space. Additionally, the elements within the reservoir must exhibit nonlinear behaviour to effectively process complex patterns in the input data. Lastly, the reservoir should also possess the echo state property, which means the system has a fading memory where the influence of past inputs gradually diminishes.

Within the context of reservoir computing, it is widely believed that optimal performance is achieved when the reservoir operates at the edge of chaos — a critical transition point between order and chaos. Research has shown that dynamical systems can perform complex computations most effectively at this boundary, where the system balances stability and adaptability, enabling efficient information processing [19]. Consequently, reservoirs functioning at the edge of chaos are considered highly desirable for maximizing computational capabilities.
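To make these ideas concrete, the sketch below implements a minimal echo-state-style reservoir computer in Python. It is a generic illustration of the paradigm, not the system studied in this thesis: the reservoir is a fixed random recurrent network with a nonlinear state update, and only a linear readout is trained (here by ridge regression). All names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random reservoir; only the linear readout below is trained.
n_res = 100
W_in = rng.uniform(-0.5, 0.5, n_res)
W = rng.normal(0.0, 1.0, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1: fading memory

def run_reservoir(inputs):
    """Drive the reservoir with a scalar input sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W @ x + W_in * u)  # nonlinear state update
        states.append(x.copy())
    return np.array(states)

# Toy task: reproduce the input delayed by one step.
u = rng.uniform(-1.0, 1.0, 500)
X, y = run_reservoir(u)[:-1], u[1:]

# Ridge-regression readout: the only trained component.
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)
print("training MSE:", np.mean((X @ W_out - y) ** 2))
```

2.1.1 Physical reservoirs

A primary appeal of reservoir computing lies in the ability to leverage physical reservoirs.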
A classic study demonstrated how waves on the surface of a bucket of water could function as a Liquid State Machine for speech recognition tasks, showing that a simple readout layer suffices for classification [20]. Although the practicality of that particular reservoir was not ideal, significant advances have been made in various other fields of physical reservoir computing [21]. Concrete examples of physical reservoirs that have been explored include memristors [22], photonic systems [23], spintronic devices [24], biological systems [25], and mechanical systems [26], [27]. These physical reservoirs offer unique advantages due to their inherent properties. For instance, memristors can emulate synaptic plasticity, photonic systems benefit from the speed of light for high-speed data processing, spintronic devices exploit electron spin for energy-efficient computations, biological systems harness the complex dynamics of neural tissue, and mechanical systems leverage the non-linear dynamics of physical materials.

Simulating physical reservoirs is also a common practice, especially in the early stages of developing reservoir computing systems. Examples include using neural networks, cellular automata, or Boolean networks. This approach is often more practical during the initial development phases of reservoir computing, as it is more cost-efficient and simpler to implement. Simulations allow researchers to test and refine their models before deploying physical systems, which can be more expensive and complicated to build.

2.1.2 The readout layer

The role of the readout layer in reservoir computing is not to perform the main computation; rather, this computation is carried out by the reservoir itself. The readout layer assesses the state of the reservoir and maps these states to output labels or values [9]. Ideally, the readout layer should be kept as simple as possible, often involving a linear transformation of the reservoir's state space. The weights of the readout layer are the only parts of the system that require training. Since the readout layer is simple, it can be trained with a relatively small amount of data, making the training process fast and resource efficient.

This simplicity and efficiency in training are among the primary benefits of using reservoir computing. It allows for reduced computational costs and time compared to traditional methods, which is particularly advantageous when working with large datasets or when rapid training is required. Furthermore, the simplicity of the readout layer contributes to the overall robustness and flexibility of the reservoir computing framework. It can easily adapt to different types of tasks, such as classification, regression, and signal processing, without the need for complex training algorithms. This versatility, combined with the inherent efficiency of the training process, underscores the practical utility of reservoir computing in various real-world applications.

2.2 Cellular Automata

Cellular automata are simple, yet powerful mathematical models used to represent physical, biological, and computational systems. Initially proposed by John von Neumann [28] and extensively studied by Stephen Wolfram [29], [30], cellular automata consist of a grid of cells that interact with their neighbours according to predefined local transition rules. The states of the cells update synchronously in discrete time steps.
Since the cells only interact with their local neighbourhood, the system is decentralised, with no central controller. Despite this simplicity, cellular automata can exhibit complex dynamics and emergent behaviours arising from the interaction of simple rules. This makes them valuable for modelling a wide range of real-world systems.

A notable example of cellular automata that produce complex behaviour from simple rules is Conway's Game of Life (for an introduction see [31] and references therein), which has been demonstrated to simulate a Turing machine [32], a fundamental model of computation capable of performing any calculation that a digital computer can. This capability to simulate a Turing machine highlights the immense computational power inherent in cellular automata. It underscores their potential as computational devices capable of complex data processing and algorithm execution. The Game of Life, thus, serves as a powerful example of how simple, local interactions can lead to complex, global behaviours, making cellular automata a valuable tool in both theoretical and applied computational research.

Figure 2.1: Visual representation of elementary cellular automata rule 91.

2.2.1 Elementary Cellular Automata

Elementary cellular automata are the simplest class of cellular automata, defined on a one-dimensional grid [33], [34]. Each cell interacts with its immediate left and right neighbours. The rule can be seen as a Boolean function that takes three bits x_1, x_2, and x_3 as input and produces a single bit as output, denoted as y = f(x_1, x_2, x_3). A simple combinatorial argument leads to the conclusion that one can construct 2^(2^3) = 256 such rules in total: each of the 2^3 = 8 possible input triplets is assigned one of two outputs. Such a rule explains how to update a cell that is in the state x_2 and is influenced by two neighbouring cells in states x_1 and x_3. The variable y is the new state of the cell after the update. Assume that the automaton of interest is made of N cells c_1, c_2, …, c_{i−1}, c_i, c_{i+1}, …, c_N. Then a cell c_i is updated according to the rule

    c_i(t + 1) = f(c_{i−1}(t), c_i(t), c_{i+1}(t))        (2.1)

These rules are named based on the binary representation of their outputs. For example, if the resulting sequence f(1, 1, 1), f(1, 1, 0), …, f(0, 0, 0) is the binary sequence 01011011₂, as illustrated in Figure 2.1, where the black and white cells represent 1 and 0, respectively, then this function f corresponds to rule 91, since 01011011₂ = 91₁₀. A specific mapping from a particular input to an output for f can also be written as c_{i−1}(t), c_i(t), c_{i+1}(t) → c_i(t + 1). In this notation, the first mapping in rule 91 is written as 111 → 0.

Elementary cellular automata can be categorized based on the patterns they produce over time when starting from a random initial state. Stephen Wolfram introduced four classes of behaviour [33]:

1. Rules that converge to a uniform state.
2. Rules that converge to repetitive or stable states.
3. Rules that produce seemingly random states.
4. Rules that produce complex structures, including both repetitive and stable areas, as well as regions of complicated behaviour.

Class 4 is particularly interesting because it demonstrates a balance between order and chaos, where some structures are stable while others exhibit complex, dynamic behaviours. Notably, rule 110, a member of Class 4, has been proven to be Turing complete [35], meaning it can be used to perform any computation that a universal Turing machine can.
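The rule-number encoding translates directly into a lookup table, as the following sketch shows (illustrative Python, not code from the thesis). Bit k of the rule number is the output assigned to the neighbourhood pattern whose binary value is k, so, for example, bit 7 gives f(1, 1, 1):

```python
import numpy as np

def eca_step(state, rule):
    """One synchronous update of an elementary CA with periodic boundaries."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    patterns = 4 * left + 2 * state + right   # neighbourhood value 0..7 per cell
    table = (rule >> np.arange(8)) & 1        # bit k of the rule number is f for pattern k
    return table[patterns]

# Evolve rule 91 (Figure 2.1) from a random initial state.
state = np.random.default_rng(0).integers(0, 2, 64)
for _ in range(32):
    state = eca_step(state, 91)
```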
Rule 110's Turing completeness highlights the impressive computational capabilities of even the simplest cellular automata.

2.2.2 Cellular Automata Networks

Cellular Automata Networks (CAN) extend traditional cellular automata by replacing the standard grid structure with a graph structure [36]. In the literature, this concept is also known as graph cellular automata [36], cellular automata on graphs [37], or simply CAN [38]. In this setup, each cell corresponds to a node in a graph, where the neighbourhood of a cell is defined by edges connected to the node in the graph, rather than proximity on a grid.

CAN can be defined for both undirected and directed graphs. In the former case, the transition rule of the cellular automata functions similarly to traditional cellular automata, with connected nodes as neighbours and the degree as the neighbourhood size. In the latter case, it is important to distinguish between the in-degree and out-degree of a node to define the neighbourhood for a cell accurately. The neighbourhood corresponds to cells connected through incoming edges, and the neighbourhood size equals the in-degree of the node. To avoid confusion, this type of neighbour will be referred to as a predecessor. Similarly, cells connected by outgoing edges are referred to as successors. Importantly, the transition rule only considers predecessors to determine the next state of a cell.

This adaptation allows for the use of cellular automata to model systems that deviate from the grid structure. By incorporating the flexibility of network topologies, CANs are particularly suited for simulating dynamics on networks found in the real world, such as social networks, transportation networks, and interconnected systems in biology and finance. This network-based approach offers a richer framework for exploring the dynamics of complex systems, where the structure of connections can significantly influence emergent behaviour.

2.2.3 Cellular Automata Networks as a Reservoir

Before discussing cellular automata networks, it is important to note that traditional cellular automata have been successfully employed or discussed as a reservoir in various studies [10], [39]–[42]. In these studies, elementary cellular automata were primarily utilized. Additionally, a key paradigm of reservoir computing is that the readout layer should only access the current state of the reservoir. However, when cellular automata are used, it is common practice for the readout layer to access the entire evolution of the cellular automata, or at least the recent history of the evolution.

Despite the success of traditional cellular automata, the utilization of CAN as a reservoir has not been extensively studied. In the context of reservoir computing, CAN was first introduced to model a memristor network, which was then used as a reservoir to classify protein toxicity [43]. Consequently, the use of CAN remains a novel approach in the field of reservoir computing.

2.3 Network Theory

Network theory has been extensively studied, and for the purpose of this thesis, only a few key concepts are pertinent. Networks can model real-world systems, and the structure of the connections between entities in these systems varies depending on the specific system. The connection structure is commonly known as the network's topology, and there are three network types that are particularly relevant for this study.
The random graph, first introduced by Erdős and Rényi [44], is generated by creating an edge between each pair of nodes with a certain probability p, resulting in what is known as an Erdős-Rényi graph. When discussing random networks, this type of graph is what one usually has in mind.

Building on the work of Erdős and Rényi, Watts and Strogatz introduced the small-world network [45]. Recognizing that few real-world systems are completely random, this topology aims to balance regularity and randomness. It is constructed by starting with a regular network where each node is connected to its nearest neighbours. Subsequently, each edge is rewired to a randomly chosen node with a probability q, thereby creating long-distance connections.

Similarly, the scale-free network was introduced by Barabási and Albert [46] to address the limitations of random networks in modelling certain real-world systems. In a scale-free network, the likelihood of a node acquiring new connections is proportional to its existing degree. Thus, nodes with higher degrees are more likely to attract additional edges. This preferential attachment process results in networks characterized by a few highly connected hubs and many nodes with lower degrees.

These network types — random, small-world, and scale-free — provide a foundational framework for understanding the structural properties of complex networks. The topology of these networks offers unique insights into the dynamics and connectivity patterns that can emerge in real-world systems.

2.4 Information Theory

Information theory, introduced by Claude Shannon [47], provides a mathematical framework for quantifying information. This theory is fundamental in understanding and optimising communication systems, computational devices, and many other applications in modern technology. In this thesis, information theory is used to analyse the efficiency and capabilities of different transition rules on specific networks.

One of the core concepts in information theory is Shannon entropy. Entropy measures the uncertainty or unpredictability associated with a random variable. For a discrete random variable Y with possible outcomes {y_1, y_2, …, y_n}, the Shannon entropy H(Y) is defined as:

    H(Y) = − Σ_{i=1}^{n} p(Y = y_i) log₂ p(Y = y_i)

In this equation, p(Y = y_i) represents the probability of the outcome y_i. Shannon entropy indicates how much information is needed, on average, to describe the state of a random variable. Higher entropy indicates greater unpredictability, while lower entropy signifies more predictability. For example, a fair coin toss, with binary outcomes {Heads, Tails} each having a probability of 0.5, has a Shannon entropy of:

    H(Y) = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit.        (2.2)

In contrast, a biased coin that always lands on heads has zero entropy, as there is no uncertainty in the outcome.
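As a worked illustration of these definitions, a minimal Python sketch (not part of the thesis experiments) reproduces both coin examples:

```python
import numpy as np

def shannon_entropy(probs):
    """H(Y) = -sum_i p(Y = y_i) log2 p(Y = y_i), skipping zero-probability outcomes."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(shannon_entropy([1.0]))       # coin that always lands heads: 0.0 bits
```

3 Reservoir Computing Framework

This chapter introduces the reservoir computing framework and details the implementation and parameter choices of the classification models in this study. The chapter is organized into four sections. The first section describes the general architecture of the system. The second section provides a detailed description of the input text structures. The third section introduces the reservoir, along with the variations in network topologies and transition rules evaluated in this study. Finally, the fourth section presents the concrete implementation of the readout layer.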
3.1 General Architecture

The architecture of the models adheres to standard reservoir computing practices for the case when cellular automata are utilised as the reservoir. The key difference is the substitution of cellular automata with CAN. However, this exchange does not significantly alter the general architecture due to the modular nature of the reservoir computing framework. Figure 3.1 provides an illustration of the model, highlighting its components and their interactions. Note that a CAN with directed edges is used.

Figure 3.1: An overview of the model's architecture.

In this architecture, the input sequence is encoded as a vector and interacts with the reservoir through input nodes. Specifically, each input node is randomly connected to a single cell in the reservoir as a predecessor, and its order as a predecessor is also randomized. Alternative approaches include driving certain cells in the reservoir directly with the input sequence or allowing an input node to have multiple connections to the reservoir. The effects of such alternative implementations have not been studied in this research.

The readout layer has access to more than just the current state of the reservoir. Specifically, the readout layer receives the weighted sum of the states from the last 20 discrete time steps of the reservoir's evolution, w_1 c_1 + w_2 c_2 + … + w_{20N} c_{20N}, where c_i is the state of the i-th cell in the evolution, w_i is the corresponding weight, and N is the network size. In this study, N = 20 is used. The entire evolution spans 100 time steps; however, to allow the system to transition away from its initial random state, only the latter part of the evolution is used. The last 20 time steps are used, given that N = 20, ensuring the ability to track the impact of a single input on each cell within the networks regardless of topology. Additionally, given that the CANs considered in this study are relatively small, the evolution of each cell within the CAN is transferred to the readout layer. An alternative approach would be to transfer the evolution of only a subset of cells, but this option was not explored in this research.

3.2 Input structure

The input consists of phrases with varying text structures. To construct these phrases, the simplest possible non-trivial alphabet, A = {a, b}, comprising only two letters, is used. Thus, each input phrase is represented as a sequence of a and b characters. Prior to being processed by the reservoir, each letter in the input sequence undergoes one-hot encoding. This encoding scheme assigns a unique binary vector to each letter. Since there are two letters, the vector's length is also two, with a = [1, 0] and b = [0, 1]. One input node is then assigned to each position in the vector, corresponding to each letter in A. Specifically, an input node is activated, i.e., its state is set to 1, if the current letter in the input sequence matches the corresponding letter in A; otherwise, the node remains inactive, i.e., its state is set to 0. Consequently, the system is equipped with two input nodes. Although a single input node could suffice due to the binary nature of the input, one-hot encoding is employed for its simplicity and scalability to larger alphabets. This approach ensures that each unique letter is distinctly represented by a separate input node, facilitating a more intuitive input representation for the system.

To introduce structure to the input sequence, each phrase is generated by a Markov model.
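As an illustration of this input pipeline, the sketch below samples a phrase letter by letter and one-hot encodes it (illustrative Python; the probability used corresponds to the biased-random structure described next, and the function names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_phrase(length, p_a=0.7):
    """Sample a phrase in which 'a' occurs with probability p_a at each step."""
    return "".join(rng.choice(["a", "b"], p=[p_a, 1.0 - p_a]) for _ in range(length))

def one_hot(phrase):
    """Encode each letter as its one-hot vector: a -> [1, 0], b -> [0, 1]."""
    return np.array([[1, 0] if ch == "a" else [0, 1] for ch in phrase])

phrase = biased_phrase(10)
encoded = one_hot(phrase)  # shape (10, 2); column i drives input node i over time
```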
The four Markov models considered in this study are displayed in Figure 3.2. Each Markov process initiates in State 0. As the system transitions to new states, letters are generated and appended to the phrase.

Figure 3.2: Markov models for generating text phrases with different text structures. (a) Random text. (b) Biased random text. (c) Alternating text. (d) Alternating text with noise.

Figure 3.2a represents purely random text, where the letters a and b have equal probabilities of occurrence. Figure 3.2b also represents random text; however, the letter a is more likely to appear than b, with probabilities p(a) = 0.7 and p(b) = 0.3. Figure 3.2c generates text that alternates between a and b, although the initial letter in the phrase is random. Similarly, Figure 3.2d also generates alternating text, but this time with noise. Noise is introduced with a small probability p(noise) = 0.1 and consists of the opposite letter in A. These Markov chains produce phrases exhibiting some of the most fundamental properties of text, providing a basis for evaluating the computational capabilities of cellular automata networks in text classification tasks. Specifically, in these tasks, the random text is compared to input with varying letter frequency, high correlation, or correlation with noise present.

3.3 Reservoir

A range of CAN with different combinations of network topologies and transition rules will be considered for the reservoir. In this study, a unique pair of network topology and transition rule will be referred to as a configuration. Consistent with the first two objectives of this thesis — to identify the optimal network topology and transition rule — all other hyperparameters are kept constant. Specifically, the network size is limited to 20 cells, excluding the input nodes. This relatively small network size is deliberately chosen to enhance the training efficiency of the classifier and to showcase the performance capabilities of even the simplest models.

The reservoir operates through a straightforward mechanism. At each discrete time step, the states of the cells are synchronously updated according to the defined transition rule. Input nodes, however, are updated based on the incoming input sequence rather than the transition rule. This iterative update process allows the system to evolve over time, and it is this dynamic evolution of cell states that performs the core computational tasks within the reservoir.

3.3.1 Network Topology

This study will examine four distinct types of network models and evaluate the impact of different transition rules on each model. Three standard network models are employed, specified as Algorithm 1, Algorithm 2, and Algorithm 3. These algorithms are adaptations from previous work on reservoir computing with memristor networks [43] and ensure the formation of weakly connected directed graphs. Additionally, a specialized model developed specifically for this study is defined as Algorithm 4.

Algorithm 1 generates networks based on the Erdős-Rényi (ER) model [44]. These networks consist of a predetermined number of nodes n, with each pair of nodes connected by an edge with a fixed probability p. Let the number of edges be defined as d.
To ensure network connectivity, the algorithm deviates from the classical ER framework by initially constructing a tree structure, to which additional edges are randomly added. This method guarantees connectivity and allows for precise control over the total number of edges, thus guaranteeing a network density p. This approach facilitates more equitable comparisons between different networks, while retaining the fundamental characteristics of the ER model: edges can be created between any two nodes, and the creation of edges remains independent. Furthermore, the network includes driven input nodes, whose states are regulated by an input sequence. These driven nodes are added at random after the network has been created and are unique in that they can only serve as predecessors to other nodes and are not themselves successors to any node.

Algorithm 1 Weakly connected directed Erdős-Rényi model
1: Create a node in a random state.
2: for i = 1, 2, …, n − 1 do
3:   Create a node in a random state.
4:   Create a directed edge from the last node created to a random existing node with random orientation.
5: end for
6: while d < p · n(n − 1) do
7:   Select two random nodes a and b.
8:   if there exists no edge from a to b then
9:     Create an edge from a to b.
10:  end if
11: end while

Algorithm 2 generates networks based on the Watts-Strogatz (WS) model [45], commonly known as small-world networks. These networks comprise a predetermined number of nodes n, with each node initially connected to its k nearest neighbours, where k is an even number. Thereafter, each edge is rewired with a fixed small probability q, reassigning the second node of the edge to a randomly chosen node while avoiding self-loops and duplicate edges, thus creating long-distance connections. In a traditional small-world network, the number of edges d is a multiple of n. To allow d to be any number, the algorithm has been modified by adding more edges to the initial regular structure until the desired edge density p is achieved. This modification ensures that the algorithm produces a network that corresponds exactly to the classical WS network when p(n − 1) is an integer. Otherwise, the network includes some additional edges. Similarly to the previous algorithm, driven input nodes are added to the network after it has been generated.

Algorithm 2 Weakly connected directed Watts-Strogatz model
1: k = 2⌊p(n − 1)⌋
2: Compute the missing edges m = p · n(n − 1) − n · k/2
3: Create n nodes in random states.
4: k = k + 2
5: for i = 1, 2, …, n do
6:   if i = m + 1 then
7:     k = k − 2
8:   end if
9:   for j = 1, 2, …, k/2 do
10:    Create an edge from node i to node i + j mod n with a random direction.
11:    With probability q rewire the edge into a long-distance connection to a random node.
12:  end for
13: end for

Algorithm 3 generates networks based on the Barabási-Albert (BA) model [46]. The algorithm for generating BA networks proceeds as follows: initially, a small number of nodes are established. At each subsequent time step, a new node is introduced along with a fixed number of edges connecting it to the existing nodes. However, this attachment process is not uniform; the probability that a new node will connect to an existing node is proportional to the number of edges that the existing node already possesses, following a preferential attachment mechanism. The model is characterized by three parameters: the initial number of nodes n_0, the number of edges added with each new node h, and the total number of time steps t. For the purposes of this study, these parameters are expressed in terms of the number of nodes n and edge density p, as detailed in [43]. Like the previous algorithms, driven input nodes are incorporated into the network after it has been generated.

Algorithm 3 Weakly connected directed Barabási-Albert model
1: if p < 0.25 then
2:   n_0 = ⌊(n + 1 − √((n + 1)² − 4(p · n(n − 1) + 1))) / 2⌋
3:   h = n_0
4: else
5:   n_0 = ⌊(2n + 1 − √((2n + 1)² − 8(p · n(n − 1) + 1))) / 4⌋
6:   h = 2n_0
7: end if
8: t = n − n_0
9: Create n_0 nodes in a random state.
10: for i = 1, 2, …, t do
11:  Create a node in a random state.
12:  Create h edges from the last node created to random existing nodes with random orientation.
13: end for

Algorithm 4 is not derived from a standard network model. Instead, it is characterized by a fixed number of predecessors for each node. The algorithm generates the network by iterating over all n nodes and assigning α unique predecessors to each node. Consequently, the edge density p is constant for a given α. This approach produces a network with properties like those of an ER network, but with a predetermined number of predecessors for each node. Additionally, driven input nodes are incorporated into the network. Unlike the previous algorithms, which add new edges for the driven nodes, this algorithm randomly replaces existing predecessors with driven nodes, thereby maintaining a constant number of predecessors for each node. Additionally, while this algorithm does not guarantee weak connectivity, the probability that the network is not weakly connected is negligible. Moreover, the reservoir's ability to compute does not depend on weak connectivity. In this study, this network model is referred to as the Fixed Predecessor model (PD-α), where α denotes the number of predecessors.

Algorithm 4 Directed Fixed Predecessor (PD-α) model
1: Create n nodes in random states.
2: for i = 1, 2, …, n do
3:   Randomly assign α predecessors to node i.
4:   Create directed edges from all predecessors to node i.
5: end for

3.3.2 Transition Rule

The transition rule is applied uniformly across each cell in the network, enabling the isolated observation of a single rule's behaviour. This rule is an adaptation of Wolfram's elementary cellular automata rules, extended to accommodate cells with more than two predecessors while retaining the simplicity of the two-neighbour approach. The chosen method integrates arbitrary network topologies with Wolfram's elementary cellular automata rules through a straightforward procedure: when a cell has more than two inputs, a voting mechanism determines its new state. Specifically, the rule is applied to all possible pairs of predecessors, and the cell's new state is decided by a majority vote of these outcomes. Exceptions are made for cells with fewer than two predecessors. If a cell is not connected to any other cell by an incoming edge, its state remains unchanged. If a cell has only one predecessor, it adopts the state of that cell.

For example, consider a cell c_1 that receives input from a list of predecessors P = {c_2, c_3, …, c_n}. It is important to note that for each list of predecessors, a predetermined, randomized ordering unique to each target cell c_1 is assumed. The Wolfram rule f is applied to each pair within P, generating a list of outcomes f(c_i, c_1, c_j), where c_i, c_j ∈ P and i ≠ j. Then, the majority outcome becomes the new state of c_1.
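One possible implementation of this extended rule is sketched below (illustrative Python). Two details are assumptions, since the text leaves them open: "all possible pairs" is read here as all ordered pairs of distinct predecessors, and a tied vote is resolved by keeping the cell's current state.

```python
from itertools import permutations

def rule_table(rule):
    """Bit k of the Wolfram rule number is the output for the input pattern with value k."""
    return [(rule >> k) & 1 for k in range(8)]

def update_cell(state, cell, preds, table):
    """Extended elementary CA rule for a cell with an ordered list of predecessors."""
    if len(preds) == 0:                 # no incoming edges: state unchanged
        return state[cell]
    if len(preds) == 1:                 # one predecessor: adopt its state
        return state[preds[0]]
    if len(preds) == 2:                 # two predecessors: apply the rule directly
        return table[4 * state[preds[0]] + 2 * state[cell] + state[preds[1]]]
    # More than two predecessors: majority vote over all pairs (assumed here to
    # be ordered pairs; ties are assumed to keep the current state, since the
    # text does not specify a tie-breaking convention).
    pairs = list(permutations(preds, 2))
    votes = sum(table[4 * state[ci] + 2 * state[cell] + state[cj]] for ci, cj in pairs)
    if 2 * votes == len(pairs):
        return state[cell]
    return 1 if 2 * votes > len(pairs) else 0

# One synchronous step of a small PD-3-style network under rule 91.
preds = {0: [1, 2, 3], 1: [2, 4, 0], 2: [4, 0, 1], 3: [0, 2, 4], 4: [3, 1, 2]}
state = [1, 0, 1, 1, 0]
table = rule_table(91)
state = [update_cell(state, c, preds[c], table) for c in range(len(state))]
```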
3.4 Readout Layer

The readout layer applies a linear transformation to the evolution of the reservoir. The model then employs the sigmoid activation function to convert the linear combination of input features into a value in (0, 1). The sigmoid function is defined as follows:

    σ(z) = 1 / (1 + e^{−z})

This function maps any real-valued number into the range (0, 1), with σ(0) = 0.5, making it suitable for binary classification tasks. For an input vector x with weights w and a bias term w_bias, the prediction ŷ(x) is calculated using the following equation:

    ŷ(x) = round( σ( Σ_{i=0}^{n−1} w_i x_i + w_bias ) )

Here, n represents the number of features in x. Each weight w_i corresponds to a feature, and w_bias is the bias term added to the linear combination to adjust the decision boundary. If the weighted sum is positive, the prediction becomes 1; otherwise, it is 0.

During the training phase, the weights are adjusted using the gradient descent algorithm to minimize the difference between the predictions and the actual labels. The update rule for the weights is given by:

    w_j ← w_j + (α/m) Σ_{i=1}^{m} ( y^{(i)} − ŷ(x^{(i)}) ) x_j^{(i)},    for j = 0, 1, …, n − 1,

and

    w_bias ← w_bias + (α/m) Σ_{i=1}^{m} ( y^{(i)} − ŷ(x^{(i)}) )

Here, α is the learning rate, m is the number of training examples, x^{(i)} is the input vector of the i-th example, and y^{(i)} is the actual label of the i-th example. Ultimately, the final prediction is determined by computing ŷ(x_test^{(i)}), where x_test^{(i)} is a sample from the test dataset.
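Transcribed directly into Python, the training procedure might look as follows (an illustrative sketch, where X holds one flattened reservoir evolution per row and y the binary labels; the default hyperparameters match the values reported in Chapter 4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_readout(X, y, lr=0.01, epochs=1000):
    """Batch gradient-descent training of the linear readout, following the
    update rules above (y_hat is the rounded prediction, as defined in the text)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        y_hat = np.round(sigmoid(X @ w + b))
        err = y - y_hat
        w += (lr / m) * (X.T @ err)   # weight update
        b += (lr / m) * np.sum(err)   # bias update
    return w, b

def predict(X, w, b):
    return np.round(sigmoid(X @ w + b))
```

4 Methods

This chapter introduces the methods used to investigate the performance and functionality of different reservoirs. The chapter is organized into two sections. The first section details the text classification task, and the second section describes how information theory is applied to assess the system.

4.1 Classification

This study consists of multiple experiments, each comprising numerous classification tasks that follow the same procedure but with different combinations of network models, transition rules, and input text structures. Let N_γ = {N_1, N_2, …, N_100} represent a set of networks generated by a network model γ, where γ ∈ {ER, BA, WS, PD-2, PD-3, PD-4} corresponds to the desired network topology. The target density of the ER, BA, and WS networks is 0.16, while the densities of the PD networks are 0.11, 0.16, and 0.21 for an increasing number of predecessors. Additionally, for the WS network a fixed probability q = 0.1 is chosen for the long-distance connections. Moreover, let R = {R_0, R_1, …, R_255} denote the set of transition rules, where R_i is the extension of elementary cellular automata rule i introduced in this study.

The aim is for the classifier to differentiate between input generated by different text structures α and β. Suppose that I_α,β = {I^{α,β}_1, I^{α,β}_2, …, I^{α,β}_n} represents a set of input text sequences for the classifier, where I^{α,β}_i is an input sequence of either type α or β. Here, α and β define the type of text structure an input instance can be generated by, with α, β ∈ {a, b, c, d}, corresponding to the Markov chains shown in Figure 3.2. Specifically, a is random text, b is biased text, c is alternating text, and d is alternating text with noise. To analyse the performance of the various configurations, two stochastic variables are introduced which are associated with the choices of N_γ, I_α,β, and R. Given the objective to differentiate between rules, each rule R_i in R is considered individually.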
The first variable, µ(N_γ, I_α,β, R_i), denotes the mean classification accuracy for a given combination of network model, input text structure, and transition rule. The second variable, σ²(N_γ, I_α,β, R_i), represents the standard deviation of the classification accuracy. This variable quantifies the variability or spread of the individual classification accuracies across the experiments, providing insight into the consistency of the performance for each configuration.

To obtain µ(N_γ, I_α,β, R_i) and σ²(N_γ, I_α,β, R_i), the following procedure is repeated 100 times:

1. Generate a network based on the network model γ.
2. For each transition rule R_i, use the network and transition rule as the reservoir in a reservoir computing context.
3. Train the system on a training dataset I_α,β.
4. Test the system on a test dataset I′_α,β.
5. Record the prediction accuracy for R_i on the network, as the number of correct predictions divided by the total number of predictions.

After creating 100 unique graphs, µ(N_γ, I_α,β, R_i) is computed as the mean prediction accuracy for the rule R_i, and σ²(N_γ, I_α,β, R_i) is calculated as the standard deviation of the accuracy across different graphs. In this framework, the efficiency of rule R_i is characterized by the metrics µ(N_γ, I_α,β, R_i) and σ²(N_γ, I_α,β, R_i), which indicate average performance and performance consistency, respectively, across the network ensemble within the specified topology and input type. For simplicity, the notation I will be used instead of I_α,β when representing the mean and standard deviation for a network model and rule across all input types.

The scope of the experiments can be summarized as follows. Numerous small models are created. In a single experiment, 256 · 100 = 25600 reservoir computing models are trained and evaluated. In this study, a total of 6 · 3 = 18 experiments are conducted. This means that over all experiments, a total of 25600 · 18 = 460800 models are created and 256 · 6 = 1536 different model configurations are evaluated. The hyperparameters used for the training of the readout layer include a training dataset size of 400, a test dataset size of 100, a learning rate of 0.01, and 1000 passes over the training dataset.

4.2 Rule Entropy

In addition to the experiments, the information gain of predecessors in a single cell will also be computed for all configurations of network topology and transition rules. This type of assessment of CAN has been conducted in previous studies [37]. Information gain is defined as the difference between the entropy of a variable Y and the conditional entropy of Y given a variable X. Here, Y is defined as the next state of a single cell, and X is defined as the current state of either the cell itself or one of its predecessors. This is also known as mutual information. The computation aims to better understand the effects of different configurations when certain parts of the system are known and to highlight important properties of CAN. It is also utilized to categorise transition rules with similar behaviour.

In information theory, the entropy [47] of the binary variable Y is calculated as:

    H(Y) = − Σ_{y∈{0,1}} p(Y = y) log₂ p(Y = y)

where p(Y = y) is the probability of Y being y. Similarly, the conditional entropy of Y given the binary variable X is calculated as:

    H(Y|X) = Σ_{x∈{0,1}} p(X = x) ( − Σ_{y∈{0,1}} p(Y = y|X = x) log₂ p(Y = y|X = x) )

where p(Y = y|X = x) is the conditional probability of Y being y given X = x, and p(X = x) is the probability of X being x.
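Finally, the information gain from knowing X is computed as:

    IG(X) = H(Y) − H(Y|X)

This measures the reduction in uncertainty, i.e., the number of bits needed to describe the outcome, for Y when X is known. For the purposes of this study, it is assumed that all possible current states are equally likely. This simplification facilitates a more straightforward computation of the conditional entropy and, subsequently, the information gain. Therefore, the calculations can be further simplified as:

    H(Y|X) = (1/2) H(Y|X = 0) + (1/2) H(Y|X = 1)

These quantities reduce to a few lines of code. The sketch below (illustrative Python, assuming the equally-likely-states simplification above) computes the information gain of the first predecessor for a rule-91 cell with two predecessors:

```python
import numpy as np
from itertools import product

def entropy_of(outcomes):
    """Entropy of a list of equally likely binary outcomes."""
    p1 = float(np.mean(outcomes))
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * np.log2(p1) - (1.0 - p1) * np.log2(1.0 - p1)

table = [(91 >> k) & 1 for k in range(8)]                  # rule 91 lookup table

# All equally likely current states (x1, c, x2): first predecessor, cell, second predecessor.
states = list(product([0, 1], repeat=3))
Y = [table[4 * x1 + 2 * c + x2] for x1, c, x2 in states]   # next state of the cell

H_Y = entropy_of(Y)
# Condition on the first predecessor: H(Y|X) = 1/2 H(Y|X=0) + 1/2 H(Y|X=1).
H_Y_given_X = 0.5 * entropy_of([y for (x1, _, _), y in zip(states, Y) if x1 == 0]) \
            + 0.5 * entropy_of([y for (x1, _, _), y in zip(states, Y) if x1 == 1])
print("IG =", H_Y - H_Y_given_X)
```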
4.2 Rule Entropy

In addition to the experiments, the information gain of predecessors in a single cell is computed for all configurations of network topology and transition rules. This type of assessment of CAN has been conducted in previous studies [37]. Information gain, also known as mutual information, is defined as the difference between the entropy of a variable Y and the conditional entropy of Y given a variable X. Here, Y is the next state of a single cell, and X is the current state of either the cell itself or one of its predecessors. The computation aims to better understand the effects of different configurations when certain parts of the system are known and to highlight important properties of CAN. It is also utilized to categorise transition rules with similar behaviour.

In information theory, the entropy [47] of the binary variable Y is calculated as

\[ H(Y) = -\sum_{y \in \{0,1\}} p(Y = y) \log_2 p(Y = y) \]

where p(Y = y) is the probability of Y being y. Similarly, the conditional entropy of Y given the binary variable X is calculated as

\[ H(Y \mid X) = \sum_{x \in \{0,1\}} p(X = x) \left( -\sum_{y \in \{0,1\}} p(Y = y \mid X = x) \log_2 p(Y = y \mid X = x) \right) \]

where p(Y = y | X = x) is the conditional probability of Y being y given X = x, and p(X = x) is the probability of X being x. Finally, the information gain from knowing X is computed as

\[ IG(X) = H(Y) - H(Y \mid X) \]

This measures the reduction in uncertainty, i.e., the number of bits needed to describe the outcome of Y, when X is known. For the purposes of this study, it is assumed that all possible current states are equally likely. This simplification facilitates a more straightforward computation of the conditional entropy and, subsequently, the information gain. The conditional entropy therefore simplifies to

\[ H(Y \mid X) = \tfrac{1}{2} H(Y \mid X = 0) + \tfrac{1}{2} H(Y \mid X = 1) \]

In this study, the outcome consists of a single bit, which means that the maximum possible entropy is H(Y) = 1, indicating that the outcome is equally likely to be 0 or 1. Conversely, the minimum entropy is H(Y) = 0, which means that the outcome is always the same. Consequently, IG(X) = 1 implies that both outcomes were equally likely before, but with the additional knowledge of X the outcome becomes certain. In contrast, IG(X) = 0 indicates that knowledge of X provides no additional information about the outcome.
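Under the uniform-state assumption, these quantities can be computed directly from a rule's truth table. The sketch below does this for a plain elementary rule over a (left, current, right) neighbourhood, which corresponds to the two-predecessor case; it is an illustration of the definitions above rather than the code used in the thesis.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable with p = p(Y = 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(rule, position):
    """IG(X) = H(Y) - H(Y|X) for an elementary rule, assuming all eight
    neighbourhood states (left, current, right) are equally likely.

    position: 0 = first predecessor (left), 1 = current cell, 2 = second
    predecessor (right)."""
    triples = [(l, c, r) for l in (0, 1) for c in (0, 1) for r in (0, 1)]
    out = {t: (rule >> (4*t[0] + 2*t[1] + t[2])) & 1 for t in triples}
    H_Y = binary_entropy(sum(out.values()) / 8)
    H_Y_given_X = 0.0
    for x in (0, 1):  # condition on the chosen cell being in state x
        subset = [out[t] for t in triples if t[position] == x]
        H_Y_given_X += 0.5 * binary_entropy(sum(subset) / 4)
    return H_Y - H_Y_given_X
```

For example, information_gain(71, 1) evaluates to 0 while information_gain(71, 0) is approximately 0.19, matching the values later reported for rules 29 and 71 on N_PD-2 in Section 5.3.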
5 Results

This chapter presents the results from the text classification tasks, organized into three sections, each offering a different perspective on the findings. The first section examines the impact of different network topologies on prediction accuracy. The second explores how various transition rules affect prediction accuracy. Finally, the third details the information gain associated with different combinations of network topologies and transition rules.

5.1 Network Topology Performance

The overall performance of the different network models on the text classification tasks is summarized in Figure 5.1. The average prediction accuracy, µ(N_γ, I, R_i), is calculated for each combination of network model and transition rule across all input types. The results are then grouped by γ, and within each group the transition rules are ordered from worst to best, based on their prediction accuracy.

Figure 5.1: Average performance µ(N_γ, I, R_i) over all experiments for each rule R_i, i ∈ {0, 1, ..., 255}, grouped by network model γ. The rules are sorted from worst to best performance within each group, with the x-axis representing the rank of each rule within its respective group.

This implies that the i-th best-performing rule for each γ does not have to be the same rule across all groups. Instead, the i-th position within each group represents the rule with the i-th best performance among rules applied to that specific network model. Moving forward, N_γ is used instead of γ to emphasize that a set of network instances of the network model γ is utilised in all experiments, rather than a single instance of γ.

The analysis of the graph reveals a distinct hierarchy in the performance of the different N_γ. The highest overall performance is observed for N_PD-2, followed by N_PD-3, N_WS, N_BA, and N_ER, respectively. An exception is noted for N_PD-4, where the prediction accuracy does not follow the same trend as the other network models: while N_PD-4 exhibits poor overall performance, it demonstrates excellent peak performance compared to the other network models. Additionally, the number n of transition rules that fail to provide meaningful computation, indicated by µ(N_γ, I, R_i) ≈ 0.50, is approximately the same for each N_γ, with n ≈ 75.

The observed results indicate that the most promising performance with the proposed transition rules is found in network models with a fixed number of predecessors, specifically N_PD-2, N_PD-3, and N_PD-4. Therefore, for the sake of simplicity, the primary focus will be on these network models moving forward. The full results for N_BA, N_ER, and N_WS can instead be found in Appendix A, Table A.2.

Moreover, network models with more than four predecessors do not appear to offer any additional computational capabilities, as demonstrated in Figure 5.2. The experiment focuses exclusively on the best-performing rule, R_91, for N_PD-4, as this rule seems to exhibit properties that are well suited for network models with more than two predecessors. Additionally, the investigation only considers one type of input, I^(a,d). Given these limitations compared to the original experiments, the average performance is evaluated over 10 runs for each network model N_PD-δ, δ ∈ {2, 3, ..., 19}. The results strongly suggest that for δ > 4 there is no meaningful computation performed by the network. Consequently, these types of network models are not investigated further.

Figure 5.2: Average performance µ(N_PD-δ, I^(a,d), R_91) with error bars representing σ²(N_PD-δ, I^(a,d), R_91) over 10 runs for each network model PD-δ, δ ∈ {2, 3, ..., 19}. The shaded gray area highlights the region where a significant drop in performance is observed.

For comparative purposes, a simple network consisting solely of the two driven input nodes was also evaluated. This configuration is equivalent to the readout layer directly processing the input sequence, which violates the fundamental principle of reservoir computing that the readout layer should not perform any computation. Nonetheless, this comparison is crucial, as it provides a baseline to determine whether the other network models genuinely enhance the computational capabilities of the system. Table 5.1 presents the prediction accuracy when this simple network is used as the reservoir. As in the previous experiments, the readout layer processes the last 20 states of the network evolution. The results indicate that while the readout layer can distinguish between random input and input of type B, it fails to differentiate between inputs of type C or D.

Table 5.1: Mean prediction accuracy µ(−, I^(a,b), R), µ(−, I^(a,c), R), and µ(−, I^(a,d), R) without a reservoir, i.e., with the readout layer applied directly to the input sequence.

 Input type     |  B   |  C   |  D
 Mean accuracy  | 0.74 | 0.50 | 0.50

5.2 Transition Rule Performance

In this section on transition rules, the analysis begins with an overview of the general performance of the rules. Then, a detailed examination is conducted for each network, aiming to categorize the rules based on their performance and to highlight the best-performing rules for each network topology. This specifically involves analysing the distribution of prediction accuracy across different types of input.

5.2.1 General Performance

The performance of each transition rule given a network model is summarised in Figure 5.3. The average prediction accuracy is normalised to highlight the relative performance of the rules. Specifically, the prediction accuracy for each input type is normalised over all rules, so that the best-performing rule for a specific input type receives the value 1 and the worst-performing rule the value 0. The normalised average performance is denoted µ(N_γ, I′, R_i), where I′ indicates that the prediction accuracy for each input type is normalized before combining the averages of the three input types.
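This per-input-type normalisation is an ordinary min-max rescaling, sketched below; acc_b, acc_c, and acc_d are hypothetical arrays, each holding the 256 mean accuracies for one network model and one input type.

```python
import numpy as np

def minmax(acc):
    """Rescale mean accuracies over all 256 rules so that the best rule
    for this input type maps to 1 and the worst to 0."""
    acc = np.asarray(acc, dtype=float)
    return (acc - acc.min()) / (acc.max() - acc.min())

# mu(N_gamma, I', R_i): normalise each input type first, then average.
# mu_prime = np.mean([minmax(acc_b), minmax(acc_c), minmax(acc_d)], axis=0)
```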
The results are presented in a grid format, where each square corresponds to the normalised prediction accuracy of a single rule, µ(N_γ, I′, R_i), with the rules displayed in numerical order. Recall that a rule comprises eight mappings from current state to next state and is named after the binary representation of the output cell in these mappings. Consequently, all rules in the same row are identical with respect to the first four mappings, while all rules in the same column share the same last four mappings. There are four grids presented in total: the first corresponds to N_PD-2, the second to N_PD-3, and the third to N_PD-4. The fourth grid is a composite of the previous three, representing the average value of those grids for each rule, denoted µ(N_PD-2,3,4, I′, R_i).

Figure 5.3: Average normalized prediction accuracy µ(N_PD-δ, I′, R_i) over all three input sequences I′ for each rule, based on the network models: (a) PD-2, (b) PD-3, (c) PD-4, and (d) the combined average for all three network models. Rules are presented in numerical order.

While specific rules will be discussed in detail later in this section, some general trends can already be observed across these four grids. Most notably, the performance of each rule is not random, and certain rules are more effective at processing input than others. Some rules exhibit consistently poor performance, effectively erasing any information introduced into the network. Conversely, there are rules that consistently demonstrate superior information processing capabilities compared to others. In N_PD-2, a wide range of rules exhibit decent performance; however, as the number of predecessors increases, only a few rules stand out. Furthermore, there is no universally optimal rule that performs best across all types of network models. Nonetheless, several rules consistently rank among the top performers across all network models considered.

Several general trends can be observed in the grids. Odd-numbered rules tend to perform better than even-numbered ones. This advantage arises from the fact that odd-numbered rules include the mapping 000 → 1, which prevents networks with many cells in state 0 from becoming stuck in that state. Similarly, rules numbered 127 and below outperform those numbered 128 and above; this distinction is due to the mapping 111 → 0 present in the lower-numbered rules. The large rectangle of poorly performing rules in the bottom right corner can be attributed to the fact that all of these rules share the three mappings 111 → 1, 110 → 1, and 011 → 1. This combination increases the likelihood that the network will converge to an all-1 state: for these rules, the only scenario in which a cell in state 1 can transition to state 0 is when both of its predecessors are in state 0.

As the number of predecessors increases, fewer rules exhibit superior performance relative to others. A factor contributing to this trend is the majority voting mechanism. For N_PD-2, this is not a concern, since there are only two predecessors. In the case of N_PD-3, there are three predecessors, resulting in three pairs of combinations. Given that the number of pairs is odd, the impact of the majority vote remains relatively insignificant. However, for N_PD-4, there are four predecessors, leading to a total of six pairwise combinations. With an even number of pairs, situations arise where a tiebreaker is required. In such instances, the tiebreaker rule specifies that the number of votes for state 1 must exceed those for state 0. This requirement can negatively affect the information processing capabilities of certain rules.
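Based on this description, the extended rule can be sketched as follows. The pairing convention, in which the earlier predecessor acts as the left neighbour and the current cell always sits in the middle, is our reading of the text and of the predecessor-order observations in Section 5.3; the thesis's actual implementation may differ in its details.

```python
from itertools import combinations

def extended_rule(rule, current, predecessors):
    """Next state of a cell under the extension of elementary rule `rule`.

    Every pair of predecessors votes by applying the elementary rule to the
    neighbourhood (left, current, right), with the earlier predecessor in the
    ordered list acting as the left neighbour. The next state is 1 only if
    the votes for state 1 strictly outnumber the votes for state 0, so a tie
    yields 0."""
    votes = [(rule >> (4*left + 2*current + right)) & 1
             for left, right in combinations(predecessors, 2)]
    return 1 if sum(votes) > len(votes) - sum(votes) else 0
```

With two predecessors there is a single pair, so this reduces to the plain elementary rule; with four predecessors the six votes can tie, which is where the tiebreaker above takes effect.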
The results are presented in Table 5.2. The rules are ordered from best to worst performing based on the overall prediction accuracy across all input types and the three network models. Additionally, the table includes the standard deviation for each rule to indicate the consistency of its performance, as well as its rank among all rules for each network model. The highest prediction accuracies are 75.9%, 72.2%, and 77.1% for the three network models with increasing numbers of predecessors, respectively. The standard deviation is relatively high, approximately 10% for most rules, indicating variability in their performance. The exception is the set of rules that consistently exhibit poor performance, which have a standard deviation of around 0%.

The results suggest that the exact ranking of rules should be interpreted with caution. Although each input type-network model-rule configuration is run 100 times, this does not eliminate variability in the data. Nevertheless, certain trends are evident and significant. Notably, the performance of rules varies substantially across different network models.

Table 5.2: Numerical results from the experiments, including the rank, mean accuracy, and standard deviation for each rule on each network model, as well as the overall results. Only a subset of the rules is shown; complete results can be found in Table A.1 in Appendix A.
 R_i |      N_PD-2       |      N_PD-3       |      N_PD-4       |      Overall
     | rank   µ     σ²   | rank   µ     σ²   | rank   µ     σ²   | rank   µ     σ²
  91 |  61  0.678  0.078 |   2  0.717  0.100 |   1  0.771  0.103 |   1  0.722  0.093
 123 |  53  0.681  0.105 |   3  0.716  0.098 |   4  0.742  0.096 |   2  0.713  0.100
  94 |  41  0.699  0.124 |  12  0.688  0.105 |   6  0.707  0.097 |   3  0.698  0.109
 144 |   3  0.755  0.150 |  20  0.686  0.165 |  19  0.638  0.161 |   4  0.693  0.159
  16 |   4  0.754  0.151 |  24  0.683  0.168 |  17  0.641  0.166 |   5  0.693  0.162
  56 |   6  0.744  0.119 |  19  0.686  0.137 |  16  0.644  0.132 |   6  0.691  0.129
   2 |   5  0.746  0.151 |  16  0.686  0.169 |  20  0.636  0.163 |   7  0.689  0.161
  98 |   7  0.742  0.114 |  15  0.687  0.142 |  18  0.639  0.124 |   8  0.689  0.127
 126 |  59  0.679  0.078 |  30  0.675  0.080 |   5  0.707  0.086 |   9  0.687  0.081
 116 |  14  0.731  0.125 |  36  0.668  0.135 |  12  0.658  0.094 |  10  0.686  0.118
 130 |   8  0.740  0.153 |  27  0.681  0.168 |  21  0.635  0.164 |  11  0.685  0.162
  30 |  56  0.679  0.092 |   7  0.694  0.088 |   9  0.674  0.095 |  12  0.683  0.092
  24 |  10  0.738  0.149 |  29  0.678  0.158 |  22  0.628  0.155 |  13  0.681  0.154
  46 |  21  0.720  0.122 |  45  0.661  0.131 |  11  0.659  0.093 |  14  0.680  0.115
  66 |  15  0.730  0.147 |  28  0.680  0.163 |  25  0.625  0.153 |  15  0.679  0.154
 118 |  46  0.696  0.102 |  25  0.682  0.130 |  14  0.657  0.103 |  16  0.678  0.112
  62 |  48  0.693  0.109 |  26  0.681  0.139 |  13  0.658  0.095 |  17  0.678  0.114
 107 | 102  0.633  0.076 |  60  0.648  0.089 |   3  0.750  0.099 |  18  0.677  0.088
 121 | 100  0.635  0.072 |  68  0.642  0.083 |   2  0.751  0.099 |  19  0.676  0.085
  86 |  73  0.664  0.088 |  17  0.686  0.096 |  10  0.663  0.098 |  20  0.671  0.094
  48 |   9  0.739  0.151 |  33  0.670  0.162 |  35  0.590  0.131 |  21  0.666  0.148
  34 |  16  0.728  0.152 |  34  0.670  0.163 |  34  0.593  0.132 |  22  0.664  0.149
 131 |  38  0.702  0.106 |   5  0.699  0.136 |  36  0.590  0.063 |  23  0.663  0.102
  29 |   2  0.757  0.106 |  11  0.689  0.129 |  72  0.537  0.080 |  24  0.661  0.105
 145 |  40  0.699  0.101 |   9  0.693  0.137 |  38  0.587  0.067 |  25  0.660  0.101
  71 |   1  0.759  0.111 |  14  0.687  0.138 |  79  0.530  0.071 |  26  0.658  0.107
 125 |  26  0.715  0.100 |   8  0.693  0.096 |  49  0.558  0.066 |  27  0.655  0.088
  65 |  30  0.712  0.095 |  18  0.686  0.095 |  47  0.565  0.104 |  28  0.654  0.098
 176 |  37  0.705  0.147 |  31  0.671  0.159 |  39  0.584  0.105 |  29  0.653  0.137
 122 | 118  0.609  0.081 |  50  0.658  0.077 |   8  0.692  0.082 |  30  0.653  0.080
 ...
 133 |  45  0.697  0.121 |  10  0.692  0.109 |  69  0.540  0.050 |  36  0.643  0.093
 ...
  33 |  49  0.692  0.102 |   1  0.722  0.098 | 104  0.511  0.032 |  40  0.642  0.077
 ...
 129 |  54  0.680  0.082 |  22  0.685  0.084 |  80  0.530  0.046 |  45  0.631  0.071
 ...
 135 |  58  0.679  0.097 |   6  0.694  0.091 |  92  0.517  0.035 |  48  0.630  0.074
 ...
  37 |  62  0.677  0.074 |   4  0.707  0.091 | 183  0.500  0.035 |  51  0.628  0.066
 ...
  90 | 187  0.507  0.064 |  47  0.660  0.085 |   7  0.700  0.101 |  60  0.622  0.084
 ...
 246 |  39  0.701  0.162 |  71  0.641  0.159 | 204  0.500  0.000 |  75  0.614  0.107
 ...
 191 |  44  0.698  0.160 |  72  0.640  0.155 | 204  0.500  0.000 |  77  0.613  0.105
 ...
 247 |  42  0.699  0.165 |  75  0.637  0.155 | 204  0.500  0.000 |  79  0.612  0.107
 ...
 110 | 125  0.598  0.099 | 104  0.577  0.085 |  28  0.608  0.091 |  88  0.595  0.092
 ...
  15 | 143  0.562  0.068 | 118  0.551  0.068 |  52  0.556  0.100 | 121  0.556  0.079
 ...
  85 | 142  0.562  0.075 | 125  0.543  0.071 |  55  0.552  0.102 | 128  0.552  0.083
 ...
  54 | 157  0.544  0.063 | 190  0.502  0.034 | 109  0.508  0.042 | 168  0.518  0.047
 ...
   0 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
  32 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 250 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 251 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 254 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 255 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 ...
 222 | 245  0.499  0.025 | 256  0.497  0.022 | 146  0.501  0.019 | 252  0.499  0.022
 200 | 254  0.498  0.025 | 247  0.498  0.024 | 165  0.501  0.022 | 253  0.499  0.024
  19 | 256  0.496  0.031 | 237  0.500  0.021 | 158  0.501  0.021 | 254  0.499  0.025
 221 | 244  0.500  0.022 | 242  0.499  0.022 | 253  0.497  0.021 | 255  0.499  0.022
 132 | 251  0.499  0.022 | 253  0.497  0.023 | 254  0.497  0.026 | 256  0.498  0.024

For example, the best-performing rule on N_PD-2 is only ranked 26th overall, while the best-performing rule on N_PD-3 is ranked 40th. This indicates that no single rule universally excels across all network models. However, some rules consistently achieve around 70% prediction accuracy across all network models, suggesting a degree of generalization capability. These findings highlight the importance of considering specific network model characteristics when evaluating rule performance.

5.2.2 Categorisation based on Accuracy Distribution

A more detailed evaluation and categorization can be achieved by examining the accuracy distribution. This distribution reveals the spread of accuracy for individual instances of configurations with different input type, network model, and transition rule. Based on the location and size of peaks in the distributions, rules can be grouped into four categories with similar patterns by applying a few simple criteria. Figure 5.4 illustrates examples of these distributions using kernel density estimation plots, highlighting the core properties of each group.

Figure 5.4: Four groups of distribution types: (a) Group 1, (b) Group 2, (c) Group 3, and (d) Group 4.

Group 1 consists of rules that offer no valuable computation. An example of such a rule is depicted in Figure 5.4a, where the distributions for all three input types are centered around 0.5. This indicates that, regardless of the network instance, the rule effectively erases any information introduced into the network.

Group 2 comprises rules similar to those in Group 1 but with one key difference: these distributions exhibit a clear tail for one or more input types, suggesting that, for a small subset of network instances, the rule can perform computations on the network. An example is shown in Figure 5.4b.

Group 3 includes rules whose distributions demonstrate a clear ability to perform computations on the networks, but where this capability depends on the network instance. This is indicated by a substantial portion of the distribution's density being located above 60% accuracy, while there remains a significant peak at 0.5 for at least one input type. An example is provided in Figure 5.4c, where each input type exhibits two peaks: one at 50% accuracy and another at a higher accuracy.

Group 4 consists of the best-performing rules, characterized by the absence of a significant peak at 50% accuracy for any of the input types. These model configurations display the ability to generalize effectively and represent the best options identified in this study for computation on CAN. An example is illustrated in Figure 5.4d, where the accuracy is consistently high.

The categorization is based on several metrics that differentiate rules across the various groups. In summary, rules where all peaks are below 60% accuracy are allocated to Group 1. Rules where all the highest peaks exceed 60% accuracy, with no significant peak below 60%, are placed in Group 4. Rules with significant peaks both above and below 60% accuracy are assigned to Group 3. The remaining rules, which exhibit a tail in their distribution, are categorized into Group 2.
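These criteria can be phrased as a small decision function. In the sketch below, peaks is assumed to be a list of accuracy locations of the significant peaks found in a rule's distributions, pooled over the three input types, and has_tail a precomputed flag for a clear tail; both are hypothetical stand-ins for the exact peak metrics used in the thesis.

```python
def assign_group(peaks, has_tail, threshold=0.60):
    """Map a rule's accuracy-distribution peaks to Groups 1-4."""
    above = any(p >= threshold for p in peaks)
    below = any(p < threshold for p in peaks)
    if above and below:
        return 3                   # significant peaks on both sides of 60%
    if above:
        return 4                   # all significant peaks above 60%
    return 2 if has_tail else 1    # no peak above 60%: the tail decides 2 vs 1
```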
Previous results have suggested that the performance of individual rules is closely tied to the network model. Therefore, the categorization is conducted separately for each of the three network models, rather than combining them.

5.2.3 Categorisation of Rules for N_PD-2

The categorisation of rules for N_PD-2 is depicted in Figure 5.5. There are a total of nine rules in Group 4, specifically rules 29, 56, 65, 71, 98, 118, 125, 131, and 145. Most rules fall into Group 3, while the remaining rules are evenly distributed between Groups 1 and 2. It is evident that a significant number of rules are effective on this network model, although most are dependent on the specific network instance.

Figure 5.5: Categorisation of rules on N_PD-2.

The distributions of a selected set of four rules are illustrated in Figure 5.6. These rules were chosen for their superior performance with one or more input types. The top-performing rules for N_PD-2 are rules 71 and 29. These two rules are mirrored versions of each other; they are symmetrical and share the same general behaviour. The distribution for Rule 71 is shown in Figure 5.6a, and Rule 29 displays a comparable distribution. These rules perform exceptionally well on alternating input, perform on biased input similarly to a pure readout layer, and handle alternating input with noise effectively.

Following rules 71 and 29, the next best rule is Rule 144, with Rules 16 and 2, among others, demonstrating similar behaviour and performance. These rules are characterized by incorporating one of the mappings 100 → 1 and 001 → 1, with most other outputs in state 0. The distribution for Rule 144 is depicted in Figure 5.6b. It can be observed that for certain network instances this model configuration performs exceptionally well, but there is also a significant peak around 50% accuracy, highlighting the rule's network dependency. Consequently, these types of rules are not included in Group 4.

Next, rules 56 and 98, which are mirrored versions of each other and exhibit similar behaviour, show notable performance. The distribution for Rule 56 is presented in Figure 5.6c. This distribution resembles that of Rule 144, with the rule performing best on alternating text, followed by alternating text with noise, and finally biased text. The peaks at 50% accuracy are less pronounced, although the overall performance is slightly lower.

A rule that does not generally perform well but excels with biased text is Rule 85, along with Rule 15, which shares similar behaviour. The distribution for Rule 85 is shown in Figure 5.6d. It is evident that this rule cannot process alternating text effectively; however, it performs exceptionally well with biased text. Although the average performance is still lower than using the readout layer directly on the input, the peak performance is better. The peak at 50% accuracy indicates that the rule's performance is still dependent on the network instance, meaning it does not work uniformly well across all network instances.

Figure 5.6: Four of the best rules on N_PD-2: (a) Rule 71, (b) Rule 144, (c) Rule 56, (d) Rule 85.

5.2.4 Categorisation of Rules for N_PD-3

The categorization of rules for N_PD-3 is illustrated in Figure 5.7. A total of four rules fall into Group 4: rules 30, 33, 37, and 65.
The majority of rules are classified into Group 3, while the remaining rules are evenly distributed between Groups 1 and 2. Similar to N_PD-2, it is evident that many rules are effective on this network model, but their performance is largely dependent on the specific network instance. However, the number of effective rules on N_PD-3 is lower than on N_PD-2.

Figure 5.7: Categorisation of rules on N_PD-3.

The distributions of a selected set of four rules are illustrated in Figure 5.8. These rules were chosen for their exceptional performance with one or more input types. Similar to N_PD-2, the best rules come in pairs with analogous performance and behavior. Moreover, for the three pairs discussed here, one rule in the pair is classified in Group 3 and the other in Group 4, despite their near-identical characteristics. Interestingly, the rule in Group 4 is not necessarily the one with the highest performance, highlighting the limitations of using purely metric-based criteria for grouping. Nevertheless, the rules in Group 4 are among the best performers, indicating that the groups are still meaningful.

The top-performing rule is Rule 33, which shares similar performance and behaviour with Rule 123. The distribution for Rule 33 is depicted in Figure 5.8a. Unlike the best rules for N_PD-2, this model configuration performs better on biased text than on text with noise. The overall performance is also approximately 4% lower than the best performer on N_PD-2.

The second best-performing rule is Rule 91, which shares similar performance and behaviour with Rule 37. The distribution for Rule 91 is shown in Figure 5.8b. These rules perform better on text with noise than on biased text. In fact, the performance on biased text is relatively poor, with many runs centred around 50% accuracy; this is offset by better performance on alternating text and text with noise compared to Rule 33. Additionally, Rules 30 and 135 exhibit different behaviour but show similar distributions.

A rule that matches the average performance of the aforementioned rules is Rule 131. The distribution for Rule 131 is depicted in Figure 5.8c. It is evident that Rule 131 heavily depends on the network instance, but when the setup is optimal, its performance surpasses that of its counterparts.

Finally, the last rule in Group 4 that has not yet been mentioned is Rule 65, which was also part of Group 4 for N_PD-2. The distribution for Rule 65 is shown in Figure 5.8d. The peaks just surpass the 60% accuracy threshold, which explains why the average performance is relatively poor despite the rule belonging to Group 4. The same trend was observed for this rule on N_PD-2.

Figure 5.8: Four of the best rules on N_PD-3: (a) Rule 33, (b) Rule 91, (c) Rule 131, (d) Rule 65.

5.2.5 Categorisation of Rules for N_PD-4

The categorisation of rules for N_PD-4 is illustrated in Figure 5.9. A total of two rules are classified in Group 4: rules 91 and 121. Most rules fall into Group 1, while the remaining rules are evenly split between Groups 2 and 3. It is evident that the number of effective rules for this network model is significantly more limited compared to the previous models.

The distributions of the four best-performing rules are shown in Figure 5.10. For N_PD-4, the four rules that perform significantly better than the others are, in order: 91, 121, 107, and 123. Additionally, four more rules (126, 94, 90, and 122) stand out among the rest. Of these rules, 91 and 121 belong to Group 4.
The distributions for the top four rules exhibit many similarities. They all perform well on alternating text, and their performance on biased text is comparable. The primary difference lies in their performance on text with noise. Compared to rules on other network models, the performance of these rules is slightly lower than the best-performing configurations on N_PD-2. However, Rule 91 on N_PD-4 stands out as the best overall performing rule, with an accuracy of 77.1%, compared to 75.9% for Rule 71 on N_PD-2. It is important to note that there is some variance in the prediction accuracy, so the true difference may be smaller or larger than observed.

Figure 5.9: Categorisation of rules on N_PD-4.

Figure 5.10: Four of the best rules on N_PD-4: (a) Rule 91, (b) Rule 121, (c) Rule 107, (d) Rule 123.

5.3 Information Gain

The information gain of the rules on the different network models was also examined. The average information gain IG(·) is presented in Table 5.3; the full results can be viewed in Table A.3 in Appendix A. X_current represents the current cell, X_a the first predecessor, X_b the second predecessor, and so on.

Table 5.3: Average information gain for 2, 3, and 4 predecessors, respectively.

 N_γ     | H(Y)  | IG(X_current) | IG(X_a) | IG(X_b) | IG(X_c) | IG(X_d) | Mean IG(X_a-d)
 N_PD-2  | 0.902 | 0.122         | 0.122   | 0.122   |   -     |   -     | 0.122
 N_PD-3  | 0.857 | 0.167         | 0.104   | 0.010   | 0.104   |   -     | 0.072
 N_PD-4  | 0.806 | 0.176         | 0.062   | 0.019   | 0.019   | 0.062   | 0.040

The observations are as expected. For networks with two predecessors, all three cells, including the current cell, exhibit the same level of information gain. For networks with three predecessors, the current cell generates the most information gain and has the greatest influence on the outcome; in this configuration, the middle predecessor, which is considered once as a left neighbour and once as a right neighbour, has the least influence. A similar trend is observed for networks with four predecessors, where the current cell again exerts the most influence and the middle cells have the least.

Additionally, the average information gain for a single predecessor, mean IG(X_a-d), diminishes as the number of predecessors increases. This indicates that individual predecessors influence the cell less when there are more of them, giving the current cell a larger impact on the outcome. Furthermore, the uncertainty H(Y) in the absence of prior information decreases for networks with more predecessors, indicating that denser networks become more predictable.

Information gain is also employed to group similar rules based on behaviour rather than performance. Several important observations can be highlighted. For N_PD-4, rules 91, 94, 107, 121, and 122 exhibit identical information gain values for the current cell and all predecessors, at 0.0084. This is also true for rules 90 and 126; however, in these cases, the information gain from knowledge of either the current cell or a single predecessor is 0. Similarly, for Rule 123, which is among the top 8 performers on N_PD-4, there is no information gain from a single predecessor, but there is information gain from knowledge of the current cell.

Among the best rules for N_PD-3, there is more variation in information gain. Rules 91 and 37 show a consistent information gain of 0.018 from the state of either the current cell or a single predecessor. Rules 123 and 33 exhibit no information gain from a single predecessor but do show information gain from the state of the current cell. However, for rules 30, 131, and 135, the information gain from predecessors depends on their order: in all three cases, knowledge of the state of the first predecessor yields significantly higher information gain than that of any other predecessor.

The best rules for N_PD-2 can also be described using information gain.
Without prior information, the uncertainty of the outcome when using rules 29 or 71 is 1, meaning the states 0 and 1 are equally likely. Knowledge of the current cell's state produces no information gain, while knowledge of either predecessor provides an information gain of 0.19. Rules 56 and 98 yield the same information gain from the current cell or a single predecessor, at 0.049. This is also true for rules 2 and 16, but with an information gain of 0.14. Rule 144 is asymmetrical: knowledge of the first predecessor provides an information gain of 0.31, while knowledge of the current cell or the second predecessor yields no information gain.

It is important to note that these observations are not unique to the best-performing rules. Many rules with the same information gain distribution as the best rules do not perform well. Additionally, it is generally observed that the average information gain from a single predecessor for the best-performing rules is relatively low compared to other rules. This could suggest that the combined values of multiple predecessors are necessary to extract meaningful information from the system. It is this interplay among multiple predecessors that likely creates the complex and interesting computational behaviour observed in the best-performing rules.

6 Discussion

This chapter is organized into six sections. The first three sections address the use of cellular automata as a reservoir from various perspectives: the performance of different network models and transition rules is discussed, as well as the performance of the overall system. The fourth section discusses reservoir computing as a methodological framework. The fifth section addresses the utilisation of a genetic algorithm to optimise cellular automata networks, while the last section introduces potential directions for future research.

6.1 Finding the Correct Network Topology

It appears that a network topology should have a certain amount of structure to function well. This was shown by networks with a fixed number of predecessors outperforming their counterparts. Similarly, a network with more constraints on the connections, like the small-world network, consistently performs better than a more random network like the Erdős-Rényi network. This is in line with the edge-of-chaos idea, which emphasizes the importance of balancing the chaotic and deterministic nature of the reservoir. The results indicate that, for the transition rules examined in this study, a network cannot be too dense, as this causes the system to deviate from this edge. It is also clear that trivial connections, where a cell has no predecessor or only one, should be avoided, as these provide no additional computation.

The findings suggest that different network topologies do not produce significant differences in performance, unlike factors such as density and a uniform predecessor count. Among the networks examined in this study, N_PD-2, where all cells have two predecessors, appears to be closest to the edge of chaos. Although other configurations sometimes outperform rules on N_PD-2, these instances are few and the performance difference is negligible.
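As a concrete illustration of such a topology, the sketch below shows one plausible way to sample a network in which every cell has exactly δ randomly chosen predecessors. It reflects how the PD models are described in this thesis, but the actual construction (for instance, the handling of the two driven input nodes or of self-connections) may differ.

```python
import random

def sample_pd_network(n_cells, delta, seed=None):
    """Sample a random CAN topology where each cell has exactly `delta`
    predecessors, here drawn from the cells other than itself. Returns an
    ordered predecessor tuple per cell; the order matters, since the first
    predecessor acts as the left neighbour in the extended transition rule."""
    rng = random.Random(seed)
    return [tuple(rng.sample([j for j in range(n_cells) if j != i], delta))
            for i in range(n_cells)]
```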
While these configurations should not be entirely disregarded, it is more sensible to focus on the network model that consistently performs well across a wide range of transition rules.

The N_PD-2 network model was kept simple but can be further optimized. Potential areas for improvement include examining the order of predecessors, allowing for dynamic or varying network sizes, exploring small-world or scale-free N_PD-2 networks, and identifying specific connection patterns that enhance performance. However, caution is necessary when optimizing the system: overfitting remains a significant risk, and optimizing for a particular problem or input type might reduce the network's generalization capacity. From a practical perspective, a network with a sparse and fixed number of predecessors is also advantageous. Each connection and each piece of transition logic incurs a cost in a physical system, making simpler systems easier to implement in physical hardware.

6.2 Selecting an Appropriate Transition Rule

It is evident that the selection of a suitable transition rule depends on the underlying network. For certain network models, such as N_PD-4, determining an appropriate rule amounts to identifying the single rule that actually performs computation effectively, as the options are limited. In other cases, there are more options and multiple factors to consider when choosing the best transition rule for a specific task.

6.2.1 Categorisations of Elementary Cellular Automata

Several properties are consistently observed in the best-performing transition rules, and it is essential to examine these properties in greater depth to better understand their impact. To facilitate this discussion, Figure 6.1 illustrates a selection of the top-performing rules from the experiments as elementary cellular automata rules.

Figure 6.1: Visualisation of the three elementary cellular automata rules with the best performance on network models N_PD-2, N_PD-3, and N_PD-4, respectively: Rule 71 (output bits 01000111 for neighbourhoods 111 down to 000), Rule 33 (00100001), and Rule 91 (01011011).

First and foremost, elementary cellular automata have been extensively studied in prior research, leading to established categorisations of the rules, such as Wolfram's four classes of behaviour. For elementary cellular automata, these categorizations are typically made on a one-dimensional grid rather than on the networks studied here. Consequently, they cannot be applied directly to this study but serve as useful guidelines. This also means that the categorisations are most relevant for N_PD-2 networks, as the transition rule on that network model is the most similar to a traditional elementary cellular automata rule, with only two neighbours. For this reason, the underlying elementary cellular automaton is considered in this section, rather than the extended version.

For example, totalistic cellular automata are rules that depend solely on the total value of the cells in the neighbourhood, i.e., the current cell and its predecessors. This means that the mappings can be divided into four groups: (111; total value 3), (110, 101, 011; 2), (100, 010, 001; 1), and (000; 0), and within each group the outcome must be identical. Among the elementary cellular automata rules classified as totalistic, Rule 126 ranks 9th overall, while Rule 129 is 45th, with the other totalistic rules performing significantly worse. Rules 126 and 129 can be described as rules where, if all cells in the neighbourhood are in the same state (mappings 111 and 000), the output is one state, and when the states differ, the output is the opposite state. While this setup appears to compute, the other totalistic rules do not. The results indicate that this categorization does not appear to be particularly relevant for this problem.
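Both this totalistic categorisation and the outer-totalistic categorisation discussed next can be checked mechanically from a rule's eight output bits. The following is a small sketch of such checks, written here for illustration rather than taken from the thesis.

```python
def rule_output(rule, l, c, r):
    """Output bit of elementary rule `rule` for neighbourhood (left, current, right)."""
    return (rule >> (4*l + 2*c + r)) & 1

def is_totalistic(rule):
    """True if the output depends only on the total value l + c + r."""
    groups = {}
    for l in (0, 1):
        for c in (0, 1):
            for r in (0, 1):
                groups.setdefault(l + c + r, set()).add(rule_output(rule, l, c, r))
    return all(len(outs) == 1 for outs in groups.values())

def is_outer_totalistic(rule):
    """True if the output depends only on the outer total l + r together with
    the current cell c, i.e. swapping the two outer cells never changes it."""
    return (rule_output(rule, 1, 1, 0) == rule_output(rule, 0, 1, 1)
            and rule_output(rule, 1, 0, 0) == rule_output(rule, 0, 0, 1))

# e.g. is_totalistic(126) -> True, is_outer_totalistic(91) -> True
```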
Another categorisation is outer-totalistic cellular automata, where the output depends on the total value of the predecessors and the state of the current cell separately. The mappings can then be divided into three groups: (111, 101; total value 2¹), (110, 100, 011, 001; 1), and (010, 000; 0), and within each group the outcome must either be identical, keep the state of the current cell, or flip the state of the current cell. There are 64 outer-totalistic elementary cellular automata rules. Among the best overall performing rules, those classified as outer-totalistic include rules 33, 37, 90, 91, 94, 122, 123, 126, 129, and 133. Rule 90, ranked 60th overall, is the worst among these, and Rule 91, ranked 1st overall, is of course the best. There is significant overlap between these rules and the best-performing rules on N_PD-3 and N_PD-4, indicating that properties tied to outer-totalistic rules become useful as the number of predecessors increases beyond two. This is not surprising, as the order of predecessors becomes less important due to the extension of the rule, making rules that consider only the density, not the order, more relevant. However, this categorization does not appear to be relevant for N_PD-2, as the rules listed above are not the best performers on that network model.

¹ The value is computed from the first and the last digit both being one; the digit in the middle is ignored.

Wolfram's four classes of behaviour provide another natural categorisation of elementary cellular automata. The most interesting rules are those belonging to Class 4, which exhibit complex behaviour, in contrast to the predictable patterns of Classes 1 and 2 and the chaotic behaviour of Class 3. On the classic one-dimensional grid, rules commonly recognized as Class 4 include Rule 54 and Rule 110. These rules are ranked 168th and 88th, respectively, in overall performance in this study's experiments. This suggests that behaviour observed on a one-dimensional grid does not directly translate to a network context, or alternatively, that a Class 4 rule does not necessarily make a great reservoir.

6.2.2 The Ideal Transition Rule

Given the absence of an established categorization that accurately describes the results of this study for all networks, it is necessary to examine the transition rules in more detail. The effectiveness of transition rules is heavily dependent on both the network and the input types. Consequently, it is challenging to determine whether a well-performing rule generalizes effectively or whether it benefits significantly from the specific networks and input types used in this study. However, it is evident that the transition rule plays a more critical role than the network topology, as there is a much larger difference in performance between different rules than between network models. It is tempting to conclude that outer-totalis