Text Classification with Cellular Automata Networks Exploiting The Reservoir Computing Approach

Master's thesis in Data Science and AI

Oscar Johansson

Department of Microtechnology and Nanoscience
Chalmers University of Technology
Gothenburg, Sweden 2024

© Oscar Johansson, 2024.

Supervisor: Zoran Konkoli, Department of Microtechnology and Nanoscience
Examiner: Zoran Konkoli, Department of Microtechnology and Nanoscience

Master's Thesis 2024
Department of Microtechnology and Nanoscience
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Categorisation of different transition rules based on experiment performance.

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

The thesis focuses on several options for exploiting reservoir computing ideas in the context of text classification, using cellular automata networks as the reservoir. The key innovative aspect of the study is related to finding options for computing with low-density networks, which should be easier to manufacture. The overarching idea is to extend Wolfram's elementary cellular automata rules and implement them within the framework of random cellular automata networks. The primary objectives of this study are as follows: (i) to determine the optimal network topologies, (ii) to identify the most effective transition rules within these networks, and (iii) to develop a robust methodological framework for finding the best networks. By exploring various network configurations, the aim is to uncover the structural characteristics that facilitate efficient information propagation and decision-making within the cellular automata framework. Observations from the study indicate that there exist cellular automata network configurations which can function as efficient reservoirs. Moreover, all considered cellular automata networks are categorised into different groups based on their performance, providing a solid foundation for further research.

Keywords: Reservoir computing, cellular automata networks, cellular automata, text classification, machine learning, natural language processing, supervised learning, information theory, elementary cellular automata.

Acknowledgements

I would like to thank my supervisor, Zoran Konkoli, for all the support and guidance I have received throughout my thesis work. I greatly appreciate your ability to create an encouraging and productive environment, allowing me to feel comfortable and do my best work. Working with you has inspired me to continue on my academic journey and pursue a PhD degree. Thank you. I would also like to extend my thanks to everyone I have met at the Department of Microtechnology and Nanoscience at Chalmers. I greatly value the help I have received and all the conversations we have had. Finally, I am deeply grateful to my friends and family for their interest and support, which continue to motivate me and allow me to aim higher.
Oscar Johansson, Gothenburg, 2024-06-17

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Problem Statement
  1.2 Research Objectives
  1.3 Limitations
  1.4 Contributions
  1.5 Thesis Structure

2 Background
  2.1 Reservoir Computing
    2.1.1 Physical reservoirs
    2.1.2 The readout layer
  2.2 Cellular Automata
    2.2.1 Elementary Cellular Automata
    2.2.2 Cellular Automata Networks
    2.2.3 Cellular Automata Networks as a Reservoir
  2.3 Network Theory
  2.4 Information Theory

3 Reservoir Computing Framework
  3.1 General Architecture
  3.2 Input structure
  3.3 Reservoir
    3.3.1 Network Topology
    3.3.2 Transition Rule
  3.4 Readout Layer

4 Methods
  4.1 Classification
  4.2 Rule Entropy

5 Results
  5.1 Network Topology Performance
  5.2 Transition Rule Performance
    5.2.1 General Performance
    5.2.2 Categorisation based on Accuracy Distribution
    5.2.3 Categorisation of Rules for N_PD-2
    5.2.4 Categorisation of Rules for N_PD-3
    5.2.5 Categorisation of Rules for N_PD-4
  5.3 Information Gain

6 Discussion
  6.1 Finding the Correct Network Topology
  6.2 Selecting an Appropriate Transition Rule
    6.2.1 Categorisations of Elementary Cellular Automata
    6.2.2 The Ideal Transition Rule
    6.2.3 The Inverse Rules Problem
  6.3 Cellular Automata Networks as a Reservoir
  6.4 Reservoir Computing as a Methodological Framework
  6.5 Genetic Algorithm
  6.6 Future Work

7 Conclusion

Bibliography

A Appendix 1

List of Figures

2.1 Visual representation of elementary cellular automata rule 91.
3.1 An overview of the model's architecture.
3.2 Markov models for generating text phrases with different text structures.
5.1 Average performance µ(N_γ, I, R_i) over all experiments for each rule R_i, i ∈ {0, 1, …, 255}, grouped by network model γ. The rules are sorted from worst to best performance within each group, with the x-axis representing the rank of each rule within its respective group.
5.2 Average performance µ(N_PD-δ, I_a,d, R_91) with error bars representing σ²(N_PD-δ, I_a,d, R_91) over 10 runs for each network model PD-δ, where δ ∈ {2, 3, …, 19}. The shaded gray area highlights the region where a significant drop in performance is observed.
5.3 Average normalized prediction accuracy µ(N_PD-α, I′, R_i) over all three input sequences I′ for each rule, based on the network models: (a) PD-2, (b) PD-3, (c) PD-4, and (d) the combined average for all three network models. Rules are presented in numerical order.
5.4 Four groups of distribution types.
5.5 Categorisation of rules on N_PD-2.
5.6 Four of the best rules on N_PD-2.
5.7 Categorisation of rules on N_PD-3.
5.8 Four of the best rules on N_PD-3.
5.9 Categorisation of rules on N_PD-4.
5.10 Four of the best rules on N_PD-4.
6.1 Visualisation of the three elementary cellular automata rules with the best performance on network models N_PD-2, N_PD-3, and N_PD-4, respectively.

List of Tables

5.1 Mean prediction accuracy µ(−, I_a,b, R), µ(−, I_a,c, R), and µ(−, I_a,d, R) without reservoir, i.e., the readout layer is applied directly to the input sequence.
5.2 Numerical results from the experiments, including the rank, mean accuracy, and standard deviation for each rule on each network model, as well as the overall results. Only a subset of the rules is shown; complete results can be found in Table A.1 in Appendix A.
5.3 Average information gain for 2, 3, and 4 predecessors, respectively.
A.1 Full results for N_PD-2, N_PD-3, and N_PD-4 from the experiments, including the rank, mean accuracy, and standard deviation for each rule and network pair. Additionally, average results over all three networks are also included. The rules are ordered based on the overall results.
A.2 Full results for N_BA, N_ER, and N_WS from the experiments, including the rank, mean accuracy, and standard deviation for each rule and network pair. Additionally, average results over all three networks are also included. The rules are ordered based on the overall results.
A.3 Full information gain table, including the entropy H(Y), the information gain for the current ("middle") cell IG_M, and the information gain for predecessors IG_A, IG_B, IG_C, and IG_D, corresponding to the order in which the predecessors appear in the list of predecessors. The information gain is calculated for cells with two, three, and four predecessors, separately.

1 Introduction

Computation fundamentally involves evaluating a function where an input is transformed into an output in a reproducible way [1].
To achieve this, mathematical abstractions representing the input and output must be embodied within physical systems [2]. Traditionally, this transformation is executed through a series of operations on the data using binary logic and electrical signals, structured according to the well-established von Neumann architecture [3]–[5]. In this standard approach, there is a clear distinction between the central processing unit, the memory, and the data transfer pathways between them. This architecture underpins most personal computers and smartphones today. However, it suffers from a significant limitation known as the von Neumann bottleneck [5], [6], which arises from the physical and practical constraints on the number and efficiency of connections between components. As systems scale up, the complexity and overhead of managing these connections increase, leading to inefficiencies and limiting overall performance. To address these challenges, researchers have explored alternative computation methodologies [7].

Unconventional computing encompasses computational approaches that deviate from the traditional von Neumann architecture. This category includes, but is far from limited to, quantum computing, which uses quantum bits (qubits) instead of traditional bits; optical computing, which processes information using light rather than electrical signals; and neuromorphic computing, which mimics the neural structure and functioning of the human brain to perform computations. These unconventional methods offer distinct advantages and application areas where they can outperform or complement conventional techniques.

This thesis explores the application of reservoir computing for text classification. Reservoir computing is a machine learning technique that leverages principles of neuromorphic computing. A typical reservoir computer consists of a dynamical system, known as the reservoir, driven by an external input that alters its state. Instead of analysing the original input directly, the state of the system is analysed and assigned a label using a linear readout layer. In this way, the information contained in the input sequence is transformed into the information represented by the label, constituting the computational process where one type of information is converted into another. The key advantage of this approach is that the readout layer can be simple, and thus easy to train, since the reservoir performs the computation.

A critical question, which will be discussed in this thesis, is how to construct an effective reservoir. This research focuses on a novel approach to building such reservoirs using cellular automata networks. By investigating the properties and dynamics of such networks, this study aims to enhance our understanding of their potential as reservoirs in the context of text classification.

1.1 Problem Statement

Traditional machine learning models frequently encounter limitations, such as large model sizes, time-consuming training processes, and substantial computational resource consumption. These challenges can restrict their deployment and limit their scalability and environmental sustainability. A significant contributing factor to these limitations is the inherent inefficiency of the von Neumann architecture. Reservoir computing has emerged as a promising solution to these issues. Unconventional computing facilitates the development of machine learning models that are smaller, require minimal training, and consume fewer resources.
Such models have the potential to outperform traditional methods in various tasks [8], while offering greater efficiency and reduced costs. Furthermore, these systems could be directly integrated into hardware.

The success of reservoir computing is critically dependent on the underlying dynamical system used as the reservoir. Although certain dynamical systems have demonstrated potential, this area remains underexplored, revealing a significant gap in current research. This study addresses this gap by investigating a versatile dynamical system in Cellular Automata Networks (CAN). Cellular automata networks feature complex dynamics governed by local interactions, rendering them intriguing candidates for effective reservoirs.

Through rigorous numerical experiments, the aim is to evaluate the capabilities of CAN as a reservoir for the classification of text structures, a fundamental problem in natural language processing. The objective is to differentiate between various forms of textual organization, which could ultimately contribute to applications such as language identification, authorship attribution, and the distinction between AI-generated and human-generated text. Given the novel application of CAN in this context, initial investigations will target simpler text structures to establish baseline capabilities and insights.

1.2 Research Objectives

The primary objective is to identify key properties of cellular automata networks that enhance their functionality as reservoir computing systems, particularly for text classification tasks. This goal is part of a broader attempt to understand the characteristics that define an effective reservoir. Specifically, this study aims to achieve the following objectives:

1. To determine the optimal network topology of cellular automata networks in the context of reservoir computing for text classification.
2. To identify the most effective transition rule for cellular automata within these networks.
3. To develop a robust methodological framework for addressing such complex and open-ended challenges.

1.3 Limitations

To determine the optimal network topology of cellular automata networks, it is impractical to conduct an exhaustive search over all network sizes and connection organisations given the exponentially large number of possibilities. Therefore, the scope of this research is strategically narrowed. The study will utilize a fixed network size and focus on a subset of commonly observed random network topologies. By limiting the variety of topologies considered, the aim is to identify the network structures that most effectively support reservoir computing for text classification tasks. Furthermore, for simplicity, only binary classification tasks are considered.

Regarding the most effective transition rule, the investigation will address the challenge of identifying optimal transition rules for cellular automata. The scope is confined to an extension of elementary cellular automata rules applied to graphs. By restricting the ruleset, the study aims to systematically evaluate the impact of different transition rules on the performance of the cellular automata networks in a controlled manner. This approach allows for a focused analysis of how specific rules influence the dynamical properties and computational capabilities of the network.
To develop a robust methodological framework for addressing such complex and open-ended challenges, the primary goal of this research is not to find the highest-performing classifier but to establish a framework that identifies the optimal reservoir for the task. Consequently, the hyperparameters of the readout layer will remain constant throughout the experiments. This decision ensures that the influence of different reservoirs on system performance is assessed independently of the readout layer optimization. By maintaining consistent parameters, the study isolates the effects of network topologies and transition rules, thereby contributing to a robust methodology for future research in this domain.

Given these constraints, it is acknowledged that this study may not discover the optimal cellular automata network and reservoir computing system. However, identifying the best-performing configurations under the specified conditions will provide valuable insights into the key properties of effective cellular automata networks as a reservoir. These findings will lay a solid foundation for future research and broader applications.

1.4 Contributions

This study contributes several key innovations. First, an extension of elementary cellular automata rules is proposed to accommodate more than two neighbours, generalizing their application from a grid structure to a graph structure. This broadens the scope and versatility of cellular automata in computational tasks. Additionally, a novel type of random network is suggested for constructing the cellular automata network. These networks exhibit low connectivity density, and when implemented with the extended elementary cellular automata rules, they perform better than their more common counterparts such as Erdős-Rényi networks, small-world networks, and scale-free networks. Furthermore, implementing a reservoir computing system with cellular automata networks is itself a novel contribution.

1.5 Thesis Structure

This thesis is organized into seven main chapters. Chapter 1 introduces the research problem, objectives, limitations, and contributions of the thesis. Chapter 2 presents the background on reservoir computing, cellular automata, network theory, and information theory, including discussions on physical reservoirs, the readout layer, and the use of cellular automata networks as reservoirs. Chapter 3 outlines the reservoir computing framework, detailing the input, reservoir, and readout layer components, while introducing the various network topologies and transition rules to be examined. Chapter 4 describes the methods employed for classification and rule entropy analysis within the reservoir computing framework. Chapter 5 presents the experimental results, discussing the performance of different network topologies and transition rules. Chapter 6 interprets the results, discussing the selection of network topologies and transition rules, the evaluation framework, and other related topics, while also proposing future research directions. Finally, Chapter 7 summarizes the key findings and highlights the contributions of the research. The thesis is supported by an appendix with complete experiment results.

2 Background

This chapter covers key concepts and theories related to reservoir computing, cellular automata, network theory, and information theory. By establishing this groundwork, the aim is to contextualize the study and highlight the significance of the chosen methodologies and approaches.
2.1 Reservoir Computing

Reservoir computing is a machine learning method, a type of neuromorphic computing centred around the concept of universal spontaneous computation implemented via a reservoir [9]. The reservoir is realised as a high-dimensional dynamical system with a rich internal state space that evolves over time and interacts with its environment [10]. The core idea is that such dynamical systems are inherently capable of computation and information processing [11], [12]. Specifically, the computational power of the dynamical system is assessed by observing the evolution of its state space in response to external inputs. Ultimately, the state of the system is transferred to a simple, linear readout layer that generates the output.

The primary applications of reservoir computing lie in classification and prediction tasks. Since the reservoir is realised by dynamical systems that evolve over time, this approach excels at processing time-dependent and sequential data. Key areas of application include time series prediction, speech recognition, signal processing, bioinformatics, and pattern recognition. These diverse applications highlight the versatility and robustness of reservoir computing in handling complex real-world data.

The concept of reservoir computing emerged as a response to the challenges associated with training neural networks, and it was developed independently by two different research groups. The training process of recurrent neural networks faces several problems, including vanishing and exploding gradients [13], overfitting, high computational costs, and extensive data requirements [14]. To address these issues, the idea to entirely forego training of the network, except for the readout layer, was proposed. This led to the introduction of two novel systems: Echo State Networks [15], [16], which are randomly initialized recurrent neural networks that possess the echo state property, and Liquid State Machines [17], [18], which utilize randomly connected spiking neural networks.

Since the introduction of reservoir computing, the concept of the reservoir component has been generalized to incorporate any arbitrary dynamical system with specific key properties. Firstly, the reservoir should be high-dimensional to project the input into a complex state space, allowing for detailed representations. Secondly, its dynamics should be rich, enabling the separation of similar inputs within the state space. Additionally, the elements within the reservoir must exhibit nonlinear behaviour to effectively process complex patterns in the input data. Lastly, the reservoir should also possess the echo state property, which means the system has a fading memory where the influence of past inputs gradually diminishes.

Within the context of reservoir computing, it is widely believed that optimal performance is achieved when the reservoir operates at the edge of chaos — a critical transition point between order and chaos. Research has shown that dynamical systems can perform complex computations most effectively at this boundary, where the system balances stability and adaptability, enabling efficient information processing [19]. Consequently, reservoirs functioning at the edge of chaos are considered highly desirable for maximizing computational capabilities.
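To make these ideas concrete, the sketch below implements a minimal echo-state-style reservoir computer in Python. It is a generic illustration of the paradigm, not the system studied in this thesis: the reservoir is a fixed random recurrent network with a nonlinear state update, and only a linear readout is trained (here by ridge regression). All names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random reservoir; only the linear readout below is trained.
n_res = 100
W_in = rng.uniform(-0.5, 0.5, n_res)
W = rng.normal(0.0, 1.0, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1: fading memory

def run_reservoir(inputs):
    """Drive the reservoir with a scalar input sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W @ x + W_in * u)  # nonlinear state update
        states.append(x.copy())
    return np.array(states)

# Toy task: reproduce the input delayed by one step.
u = rng.uniform(-1.0, 1.0, 500)
X, y = run_reservoir(u)[:-1], u[1:]

# Ridge-regression readout: the only trained component.
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)
print("training MSE:", np.mean((X @ W_out - y) ** 2))
```

2.1.1 Physical reservoirs

A primary appeal of reservoir computing lies in the ability to leverage physical reservoirs.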
A classic study demonstrated how waves on the surface of a bucket of water could function as a Liquid State Machine for speech recognition tasks, showing that a simple readout layer suffices for classification [20]. Although the practicality of that particular reservoir was not ideal, significant advances have been made in various other fields of physical reservoir computing [21]. Concrete examples of physical reservoirs that have been explored include memristors [22], photonic systems [23], spintronic devices [24], biological systems [25], and mechanical systems [26], [27]. These physical reservoirs offer unique advantages due to their inherent properties. For instance, memristors can emulate synaptic plasticity, photonic systems benefit from the speed of light for high-speed data processing, spintronic devices exploit electron spin for energy-efficient computations, biological systems harness the complex dynamics of neural tissue, and mechanical systems leverage the non-linear dynamics of physical materials.

Simulating physical reservoirs is also a common practice, especially in the early stages of developing reservoir computing systems. Examples include using neural networks, cellular automata, or Boolean networks. This approach is often more practical during the initial development phases of reservoir computing, as it is more cost-efficient and simpler to implement. Simulations allow researchers to test and refine their models before deploying physical systems, which can be more expensive and complicated to build.

2.1.2 The readout layer

The role of the readout layer in reservoir computing is not to perform the main computation; rather, this computation is carried out by the reservoir itself. The readout layer assesses the state of the reservoir and maps these states to output labels or values [9]. Ideally, the readout layer should be kept as simple as possible, often involving a linear transformation of the reservoir's state space. The weights of the readout layer are the only parts of the system that require training. Since the readout layer is simple, it can be trained with a relatively small amount of data, making the training process fast and resource efficient.

This simplicity and efficiency in training are among the primary benefits of using reservoir computing. It allows for reduced computational costs and time compared to traditional methods, which is particularly advantageous when working with large datasets or when rapid training is required. Furthermore, the simplicity of the readout layer contributes to the overall robustness and flexibility of the reservoir computing framework. It can easily adapt to different types of tasks, such as classification, regression, and signal processing, without the need for complex training algorithms. This versatility, combined with the inherent efficiency of the training process, underscores the practical utility of reservoir computing in various real-world applications.

2.2 Cellular Automata

Cellular automata are simple, yet powerful mathematical models used to represent physical, biological, and computational systems. Initially proposed by John von Neumann [28] and extensively studied by Stephen Wolfram [29], [30], cellular automata consist of a grid of cells that interact with their neighbours according to predefined local transition rules. The states of the cells update synchronously in discrete time steps.
Since the cells only interact with their local neighbourhood, the system is decentralised, with no central controller. Despite this simplicity, cellular automata can exhibit complex dynamics and emergent behaviours arising from the interaction of simple rules. This makes them valuable for modelling a wide range of real-world systems.

A notable example of cellular automata that produce complex behaviour from simple rules is Conway's Game of Life (for an introduction see [31] and references therein), which has been demonstrated to simulate a Turing machine [32], a fundamental model of computation capable of performing any calculation that a digital computer can. This capability to simulate a Turing machine highlights the immense computational power inherent in cellular automata. It underscores their potential as computational devices capable of complex data processing and algorithm execution. The Game of Life, thus, serves as a powerful example of how simple, local interactions can lead to complex, global behaviours, making cellular automata a valuable tool in both theoretical and applied computational research.

Figure 2.1: Visual representation of elementary cellular automata rule 91.

2.2.1 Elementary Cellular Automata

Elementary cellular automata are the simplest class of cellular automata, defined on a one-dimensional grid [33], [34]. Each cell interacts with its immediate left and right neighbours. The rule can be seen as a Boolean function that takes three bits x_1, x_2, and x_3 as input and produces a single bit as output, denoted as y = f(x_1, x_2, x_3). A simple combinatorial argument leads to the conclusion that one can construct 2^(2^3) = 256 such rules in total: each of the 2^3 = 8 possible input triplets is assigned one of two outputs. Such a rule explains how to update a cell that is in the state x_2 and is influenced by two neighbouring cells in states x_1 and x_3. The variable y is the new state of the cell after the update. Assume that the automaton of interest is made of N cells c_1, c_2, …, c_{i−1}, c_i, c_{i+1}, …, c_N. Then a cell c_i is updated according to the rule

    c_i(t + 1) = f(c_{i−1}(t), c_i(t), c_{i+1}(t))        (2.1)

These rules are named based on the binary representation of their outputs. For example, if the resulting sequence f(1, 1, 1), f(1, 1, 0), …, f(0, 0, 0) is the binary sequence 01011011₂, as illustrated in Figure 2.1, where the black and white cells represent 1 and 0, respectively, then this function f corresponds to rule 91, since 01011011₂ = 91₁₀. A specific mapping from a particular input to an output for f can also be written as c_{i−1}(t), c_i(t), c_{i+1}(t) → c_i(t + 1). In this notation, the first mapping in rule 91 is written as 111 → 0.

Elementary cellular automata can be categorized based on the patterns they produce over time when starting from a random initial state. Stephen Wolfram introduced four classes of behaviour [33]:

1. Rules that converge to a uniform state.
2. Rules that converge to repetitive or stable states.
3. Rules that produce seemingly random states.
4. Rules that produce complex structures, including both repetitive and stable areas, as well as regions of complicated behaviour.

Class 4 is particularly interesting because it demonstrates a balance between order and chaos, where some structures are stable while others exhibit complex, dynamic behaviours. Notably, rule 110, a member of Class 4, has been proven to be Turing complete [35], meaning it can be used to perform any computation that a universal Turing machine can.
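The rule-number encoding translates directly into a lookup table, as the following sketch shows (illustrative Python, not code from the thesis). Bit k of the rule number is the output assigned to the neighbourhood pattern whose binary value is k, so, for example, bit 7 gives f(1, 1, 1):

```python
import numpy as np

def eca_step(state, rule):
    """One synchronous update of an elementary CA with periodic boundaries."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    patterns = 4 * left + 2 * state + right   # neighbourhood value 0..7 per cell
    table = (rule >> np.arange(8)) & 1        # bit k of the rule number is f for pattern k
    return table[patterns]

# Evolve rule 91 (Figure 2.1) from a random initial state.
state = np.random.default_rng(0).integers(0, 2, 64)
for _ in range(32):
    state = eca_step(state, 91)
```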
Rule 110's Turing completeness highlights the impressive computational capabilities of even the simplest cellular automata.

2.2.2 Cellular Automata Networks

Cellular Automata Networks (CAN) extend traditional cellular automata by replacing the standard grid structure with a graph structure [36]. In the literature, this concept is also known as graph cellular automata [36], cellular automata on graphs [37], or simply CAN [38]. In this setup, each cell corresponds to a node in a graph, where the neighbourhood of a cell is defined by edges connected to the node in the graph, rather than proximity on a grid.

CAN can be defined for both undirected and directed graphs. In the former case, the transition rule of the cellular automata functions similarly to traditional cellular automata, with connected nodes as neighbours and the degree as the neighbourhood size. In the latter case, it is important to distinguish between the in-degree and out-degree of a node to define the neighbourhood for a cell accurately. The neighbourhood corresponds to cells connected through incoming edges, and the neighbourhood size equals the in-degree of the node. To avoid confusion, this type of neighbour will be referred to as a predecessor. Similarly, cells connected by outgoing edges are referred to as successors. Importantly, the transition rule only considers predecessors to determine the next state of a cell.

This adaptation allows for the use of cellular automata to model systems that deviate from the grid structure. By incorporating the flexibility of network topologies, CANs are particularly suited for simulating dynamics on networks found in the real world, such as social networks, transportation networks, and interconnected systems in biology and finance. This network-based approach offers a richer framework for exploring the dynamics of complex systems, where the structure of connections can significantly influence emergent behaviour.

2.2.3 Cellular Automata Networks as a Reservoir

Before discussing cellular automata networks, it is important to note that traditional cellular automata have been successfully employed or discussed as a reservoir in various studies [10], [39]–[42]. In these studies, elementary cellular automata were primarily utilized. Additionally, a key paradigm of reservoir computing is that the readout layer should only access the current state of the reservoir. However, when cellular automata are used, it is common practice for the readout layer to access the entire evolution of the cellular automata, or at least the recent history of the evolution.

Despite the success of traditional cellular automata, the utilization of CAN as a reservoir has not been extensively studied. In the context of reservoir computing, CAN was first introduced to model a memristor network, which was then used as a reservoir to classify protein toxicity [43]. Consequently, the use of CAN remains a novel approach in the field of reservoir computing.

2.3 Network Theory

Network theory has been extensively studied, and for the purpose of this thesis, only a few key concepts are pertinent. Networks can model real-world systems, and the structure of the connections between entities in these systems varies depending on the specific system. The connection structure is commonly known as the network's topology, and there are three network types that are particularly relevant for this study.
The random graph, first introduced by Erdős and Rényi [44], is generated by creating an edge between each pair of nodes with a certain probability p, resulting in what is known as an Erdős-Rényi graph. When discussing random networks, this type of graph is what one usually has in mind.

Building on the work of Erdős and Rényi, Watts and Strogatz introduced the small-world network [45]. Recognizing that few real-world systems are completely random, this topology aims to balance regularity and randomness. It is constructed by starting with a regular network where each node is connected to its nearest neighbours. Subsequently, each edge is rewired to a randomly chosen node with a probability q, thereby creating long-distance connections.

Similarly, the scale-free network was introduced by Barabási and Albert [46] to address the limitations of random networks in modelling certain real-world systems. In a scale-free network, the likelihood of a node acquiring new connections is proportional to its existing degree. Thus, nodes with higher degrees are more likely to attract additional edges. This preferential attachment process results in networks characterized by a few highly connected hubs and many nodes with lower degrees.

These network types — random, small-world, and scale-free — provide a foundational framework for understanding the structural properties of complex networks. The topology of these networks offers unique insights into the dynamics and connectivity patterns that can emerge in real-world systems.

2.4 Information Theory

Information theory, introduced by Claude Shannon [47], provides a mathematical framework for quantifying information. This theory is fundamental in understanding and optimising communication systems, computational devices, and many other applications in modern technology. In this thesis, information theory is used to analyse the efficiency and capabilities of different transition rules on specific networks.

One of the core concepts in information theory is Shannon entropy. Entropy measures the uncertainty or unpredictability associated with a random variable. For a discrete random variable Y with possible outcomes {y_1, y_2, …, y_n}, the Shannon entropy H(Y) is defined as:

    H(Y) = − Σ_{i=1}^{n} p(Y = y_i) log₂ p(Y = y_i)

In this equation, p(Y = y_i) represents the probability of the outcome y_i. Shannon entropy indicates how much information is needed, on average, to describe the state of a random variable. Higher entropy indicates greater unpredictability, while lower entropy signifies more predictability. For example, a fair coin toss, with binary outcomes {Heads, Tails} each having a probability of 0.5, has a Shannon entropy of:

    H(Y) = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit.        (2.2)

In contrast, a biased coin that always lands on heads has zero entropy, as there is no uncertainty in the outcome.
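As a worked illustration of these definitions, a minimal Python sketch (not part of the thesis experiments) reproduces both coin examples:

```python
import numpy as np

def shannon_entropy(probs):
    """H(Y) = -sum_i p(Y = y_i) log2 p(Y = y_i), skipping zero-probability outcomes."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(shannon_entropy([1.0]))       # coin that always lands heads: 0.0 bits
```

3 Reservoir Computing Framework

This chapter introduces the reservoir computing framework and details the implementation and parameter choices of the classification models in this study. The chapter is organized into four sections. The first section describes the general architecture of the system. The second section provides a detailed description of the input text structures. The third section introduces the reservoir, along with the variations in network topologies and transition rules evaluated in this study. Finally, the fourth section presents the concrete implementation of the readout layer.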
3.1 General Architecture

The architecture of the models adheres to standard reservoir computing practices for the case when cellular automata are utilised as the reservoir. The key difference is the substitution of cellular automata with CAN. However, this exchange does not significantly alter the general architecture due to the modular nature of the reservoir computing framework. Figure 3.1 provides an illustration of the model, highlighting its components and their interactions. Note that a CAN with directed edges is used.

Figure 3.1: An overview of the model's architecture.

In this architecture, the input sequence is encoded as a vector and interacts with the reservoir through input nodes. Specifically, each input node is randomly connected to a single cell in the reservoir as a predecessor, and its order as a predecessor is also randomized. Alternative approaches include driving certain cells in the reservoir directly with the input sequence or allowing an input node to have multiple connections to the reservoir. The effects of such alternative implementations have not been studied in this research.

The readout layer has access to more than just the current state of the reservoir. Specifically, the readout layer receives the weighted sum of the states from the last 20 discrete time steps of the reservoir's evolution, w_1 c_1 + w_2 c_2 + … + w_{20N} c_{20N}, where c_i is the state of the i-th cell in the evolution, w_i is the corresponding weight, and N is the network size. In this study, N = 20 is used. The entire evolution spans 100 time steps; however, to allow the system to transition away from its initial random state, only the latter part of the evolution is used. The last 20 time steps are used, given that N = 20, ensuring the ability to track the impact of a single input on each cell within the networks regardless of topology. Additionally, given that the CANs considered in this study are relatively small, the evolution of each cell within the CAN is transferred to the readout layer. An alternative approach would be to transfer the evolution of only a subset of cells, but this option was not explored in this research.

3.2 Input structure

The input consists of phrases with varying text structures. To construct these phrases, the simplest possible non-trivial alphabet, A = {a, b}, comprising only two letters, is used. Thus, each input phrase is represented as a sequence of a and b characters. Prior to being processed by the reservoir, each letter in the input sequence undergoes one-hot encoding. This encoding scheme assigns a unique binary vector to each letter. Since there are two letters, the vector's length is also two, with a = [1, 0] and b = [0, 1]. One input node is then assigned to each position in the vector, corresponding to each letter in A. Specifically, an input node is activated, i.e., its state is set to 1, if the current letter in the input sequence matches the corresponding letter in A; otherwise, the node remains inactive, i.e., its state is set to 0. Consequently, the system is equipped with two input nodes. Although a single input node could suffice due to the binary nature of the input, one-hot encoding is employed for its simplicity and scalability to larger alphabets. This approach ensures that each unique letter is distinctly represented by a separate input node, facilitating a more intuitive input representation for the system.

To introduce structure to the input sequence, each phrase is generated by a Markov model.
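As an illustration of this input pipeline, the sketch below samples a phrase letter by letter and one-hot encodes it (illustrative Python; the probability used corresponds to the biased-random structure described next, and the function names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_phrase(length, p_a=0.7):
    """Sample a phrase in which 'a' occurs with probability p_a at each step."""
    return "".join(rng.choice(["a", "b"], p=[p_a, 1.0 - p_a]) for _ in range(length))

def one_hot(phrase):
    """Encode each letter as its one-hot vector: a -> [1, 0], b -> [0, 1]."""
    return np.array([[1, 0] if ch == "a" else [0, 1] for ch in phrase])

phrase = biased_phrase(10)
encoded = one_hot(phrase)  # shape (10, 2); column i drives input node i over time
```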
The four Markov models considered in this study are displayed in Figure 3.2. Each Markov process initiates in State 0. As the system transitions to new states, letters are generated and appended to the phrase.

Figure 3.2: Markov models for generating text phrases with different text structures. (a) Random text. (b) Biased random text. (c) Alternating text. (d) Alternating text with noise.

Figure 3.2a represents purely random text, where the letters a and b have equal probabilities of occurrence. Figure 3.2b also represents random text; however, the letter a is more likely to appear than b, with probabilities p(a) = 0.7 and p(b) = 0.3. Figure 3.2c generates text that alternates between a and b, although the initial letter in the phrase is random. Similarly, Figure 3.2d also generates alternating text, but this time with noise. Noise is introduced with a small probability p(noise) = 0.1 and consists of the opposite letter in A. These Markov chains produce phrases exhibiting some of the most fundamental properties of text, providing a basis for evaluating the computational capabilities of cellular automata networks in text classification tasks. Specifically, in these tasks, the random text is compared to input with varying letter frequency, high correlation, or correlation with noise present.

3.3 Reservoir

A range of CAN with different combinations of network topologies and transition rules will be considered for the reservoir. In this study, a unique pair of network topology and transition rule will be referred to as a configuration. Consistent with the first two objectives of this thesis — to identify the optimal network topology and transition rule — all other hyperparameters are kept constant. Specifically, the network size is limited to 20 cells, excluding the input nodes. This relatively small network size is deliberately chosen to enhance the training efficiency of the classifier and to showcase the performance capabilities of even the simplest models.

The reservoir operates through a straightforward mechanism. At each discrete time step, the states of the cells are synchronously updated according to the defined transition rule. Input nodes, however, are updated based on the incoming input sequence rather than the transition rule. This iterative update process allows the system to evolve over time, and it is this dynamic evolution of cell states that performs the core computational tasks within the reservoir.

3.3.1 Network Topology

This study will examine four distinct types of network models and evaluate the impact of different transition rules on each model. Three standard network models are employed, specified as Algorithm 1, Algorithm 2, and Algorithm 3. These algorithms are adaptations from previous work on reservoir computing with memristor networks [43] and ensure the formation of weakly connected directed graphs. Additionally, a specialized model developed specifically for this study is defined as Algorithm 4.

Algorithm 1 generates networks based on the Erdős-Rényi (ER) model [44]. These networks consist of a predetermined number of nodes n, with each pair of nodes connected by an edge with a fixed probability p. Let the number of edges be defined as d.
To ensure network connectivity, the algorithm deviates from the classical ER framework by initially constructing a tree structure, to which additional edges are randomly added. This method guarantees connectivity and allows for precise control over the total number of edges, thus guaranteeing a network density p. This approach facilitates more equitable comparisons between different networks, while retaining the fundamental characteristics of the ER model: edges can be created between any two nodes, and the creation of edges remains independent. Furthermore, the network includes driven input nodes, whose states are regulated by an input sequence. These driven nodes are added at random after the network has been created and are unique in that they can only serve as predecessors to other nodes and are not themselves successors to any node.

Algorithm 1 Weakly connected directed Erdős-Rényi model
1: Create a node in a random state.
2: for i = 1, 2, …, n − 1 do
3:   Create a node in a random state.
4:   Create a directed edge from the last node created to a random existing node with random orientation.
5: end for
6: while d < p · n(n − 1) do
7:   Select two random nodes a and b.
8:   if there exists no edge from a to b then
9:     Create an edge from a to b.
10:  end if
11: end while

Algorithm 2 generates networks based on the Watts-Strogatz (WS) model [45], commonly known as small-world networks. These networks comprise a predetermined number of nodes n, with each node initially connected to its k nearest neighbours, where k is an even number. Thereafter, each edge is rewired with a fixed small probability q, reassigning the second node of the edge to a randomly chosen node while avoiding self-loops and duplicate edges, thus creating long-distance connections. In a traditional small-world network, the number of edges d is a multiple of n. To allow d to be any number, the algorithm has been modified by adding more edges to the initial regular structure until the desired edge density p is achieved. This modification ensures that the algorithm produces a network that corresponds exactly to the classical WS network when p(n − 1) is an integer. Otherwise, the network includes some additional edges. Similarly to the previous algorithm, driven input nodes are added to the network after it has been generated.

Algorithm 2 Weakly connected directed Watts-Strogatz model
1: k = 2⌊p(n − 1)⌋
2: Compute the missing edges m = p · n(n − 1) − n · k/2
3: Create n nodes in random states.
4: k = k + 2
5: for i = 1, 2, …, n do
6:   if i = m + 1 then
7:     k = k − 2
8:   end if
9:   for j = 1, 2, …, k/2 do
10:    Create an edge from node i to node i + j mod n with a random direction.
11:    With probability q rewire the edge into a long-distance connection to a random node.
12:  end for
13: end for

Algorithm 3 generates networks based on the Barabási-Albert (BA) model [46]. The algorithm for generating BA networks proceeds as follows: initially, a small number of nodes are established. At each subsequent time step, a new node is introduced along with a fixed number of edges connecting it to the existing nodes. However, this attachment process is not uniform; the probability that a new node will connect to an existing node is proportional to the number of edges that the existing node already possesses, following a preferential attachment mechanism. The model is characterized by three parameters: the initial number of nodes n_0, the number of edges added with each new node h, and the total number of time steps t. For the purposes of this study, these parameters are expressed in terms of the number of nodes n and edge density p, as detailed in [43]. Like the previous algorithms, driven input nodes are incorporated into the network after it has been generated.

Algorithm 3 Weakly connected directed Barabási-Albert model
1: if p < 0.25 then
2:   n_0 = ⌊(n + 1 − √((n + 1)² − 4(p · n(n − 1) + 1))) / 2⌋
3:   h = n_0
4: else
5:   n_0 = ⌊(2n + 1 − √((2n + 1)² − 8(p · n(n − 1) + 1))) / 4⌋
6:   h = 2n_0
7: end if
8: t = n − n_0
9: Create n_0 nodes in a random state.
10: for i = 1, 2, …, t do
11:  Create a node in a random state.
12:  Create h edges from the last node created to random existing nodes with random orientation.
13: end for

Algorithm 4 is not derived from a standard network model. Instead, it is characterized by a fixed number of predecessors for each node. The algorithm generates the network by iterating over all n nodes and assigning α unique predecessors to each node. Consequently, the edge density p is constant for a given α. This approach produces a network with properties like those of an ER network, but with a predetermined number of predecessors for each node. Additionally, driven input nodes are incorporated into the network. Unlike the previous algorithms, which add new edges for the driven nodes, this algorithm randomly replaces existing predecessors with driven nodes, thereby maintaining a constant number of predecessors for each node. Additionally, while this algorithm does not guarantee weak connectivity, the probability that the network is not weakly connected is negligible. Moreover, the reservoir's ability to compute does not depend on weak connectivity. In this study, this network model is referred to as the Fixed Predecessor model (PD-α), where α denotes the number of predecessors.

Algorithm 4 Directed Fixed Predecessor (PD-α) model
1: Create n nodes in random states.
2: for i = 1, 2, …, n do
3:   Randomly assign α predecessors to node i.
4:   Create directed edges from all predecessors to node i.
5: end for

3.3.2 Transition Rule

The transition rule is applied uniformly across each cell in the network, enabling the isolated observation of a single rule's behaviour. This rule is an adaptation of Wolfram's elementary cellular automata rules, extended to accommodate cells with more than two predecessors while retaining the simplicity of the two-neighbour approach. The chosen method integrates arbitrary network topologies with Wolfram's elementary cellular automata rules through a straightforward procedure: when a cell has more than two inputs, a voting mechanism determines its new state. Specifically, the rule is applied to all possible pairs of predecessors, and the cell's new state is decided by a majority vote of these outcomes. Exceptions are made for cells with fewer than two predecessors. If a cell is not connected to any other cell by an incoming edge, its state remains unchanged. If a cell has only one predecessor, it adopts the state of that cell.

For example, consider a cell c_1 that receives input from a list of predecessors P = {c_2, c_3, …, c_n}. It is important to note that for each list of predecessors, a predetermined, randomized ordering unique to each target cell c_1 is assumed. The Wolfram rule f is applied to each pair within P, generating a list of outcomes f(c_i, c_1, c_j), where c_i, c_j ∈ P and i ≠ j. Then, the majority outcome becomes the new state of c_1.
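One possible implementation of this extended rule is sketched below (illustrative Python). Two details are assumptions, since the text leaves them open: "all possible pairs" is read here as all ordered pairs of distinct predecessors, and a tied vote is resolved by keeping the cell's current state.

```python
from itertools import permutations

def rule_table(rule):
    """Bit k of the Wolfram rule number is the output for the input pattern with value k."""
    return [(rule >> k) & 1 for k in range(8)]

def update_cell(state, cell, preds, table):
    """Extended elementary CA rule for a cell with an ordered list of predecessors."""
    if len(preds) == 0:                 # no incoming edges: state unchanged
        return state[cell]
    if len(preds) == 1:                 # one predecessor: adopt its state
        return state[preds[0]]
    if len(preds) == 2:                 # two predecessors: apply the rule directly
        return table[4 * state[preds[0]] + 2 * state[cell] + state[preds[1]]]
    # More than two predecessors: majority vote over all pairs (assumed here to
    # be ordered pairs; ties are assumed to keep the current state, since the
    # text does not specify a tie-breaking convention).
    pairs = list(permutations(preds, 2))
    votes = sum(table[4 * state[ci] + 2 * state[cell] + state[cj]] for ci, cj in pairs)
    if 2 * votes == len(pairs):
        return state[cell]
    return 1 if 2 * votes > len(pairs) else 0

# One synchronous step of a small PD-3-style network under rule 91.
preds = {0: [1, 2, 3], 1: [2, 4, 0], 2: [4, 0, 1], 3: [0, 2, 4], 4: [3, 1, 2]}
state = [1, 0, 1, 1, 0]
table = rule_table(91)
state = [update_cell(state, c, preds[c], table) for c in range(len(state))]
```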
3.4 Readout Layer

The readout layer applies a linear transformation to the evolution of the reservoir. The model then employs the sigmoid activation function to convert the linear combination of input features into a value in (0, 1). The sigmoid function is defined as follows:

    σ(z) = 1 / (1 + e^{−z})

This function maps any real-valued number into the range (0, 1), with σ(0) = 0.5, making it suitable for binary classification tasks. For an input vector x with weights w and a bias term w_bias, the prediction ŷ(x) is calculated using the following equation:

    ŷ(x) = round( σ( Σ_{i=0}^{n−1} w_i x_i + w_bias ) )

Here, n represents the number of features in x. Each weight w_i corresponds to a feature, and w_bias is the bias term added to the linear combination to adjust the decision boundary. If the weighted sum is positive, the prediction becomes 1; otherwise, it is 0.

During the training phase, the weights are adjusted using the gradient descent algorithm to minimize the difference between the predictions and the actual labels. The update rule for the weights is given by:

    w_j ← w_j + (α/m) Σ_{i=1}^{m} ( y^{(i)} − ŷ(x^{(i)}) ) x_j^{(i)},    for j = 0, 1, …, n − 1,

and

    w_bias ← w_bias + (α/m) Σ_{i=1}^{m} ( y^{(i)} − ŷ(x^{(i)}) )

Here, α is the learning rate, m is the number of training examples, x^{(i)} is the input vector of the i-th example, and y^{(i)} is the actual label of the i-th example. Ultimately, the final prediction is determined by computing ŷ(x_test^{(i)}), where x_test^{(i)} is a sample from the test dataset.
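Transcribed directly into Python, the training procedure might look as follows (an illustrative sketch, where X holds one flattened reservoir evolution per row and y the binary labels; the default hyperparameters match the values reported in Chapter 4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_readout(X, y, lr=0.01, epochs=1000):
    """Batch gradient-descent training of the linear readout, following the
    update rules above (y_hat is the rounded prediction, as defined in the text)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        y_hat = np.round(sigmoid(X @ w + b))
        err = y - y_hat
        w += (lr / m) * (X.T @ err)   # weight update
        b += (lr / m) * np.sum(err)   # bias update
    return w, b

def predict(X, w, b):
    return np.round(sigmoid(X @ w + b))
```

4 Methods

This chapter introduces the methods used to investigate the performance and functionality of different reservoirs. The chapter is organized into two sections. The first section details the text classification task, and the second section describes how information theory is applied to assess the system.

4.1 Classification

This study consists of multiple experiments, each comprising numerous classification tasks that follow the same procedure but with different combinations of network models, transition rules, and input text structures. Let N_γ = {N_1, N_2, …, N_100} represent a set of networks generated by a network model γ, where γ ∈ {ER, BA, WS, PD-2, PD-3, PD-4} corresponds to the desired network topology. The target density of the ER, BA, and WS networks is 0.16, while the densities of the PD networks are 0.11, 0.16, and 0.21 for an increasing number of predecessors. Additionally, for the WS network a fixed probability q = 0.1 is chosen for the long-distance connections. Moreover, let R = {R_0, R_1, …, R_255} denote the set of transition rules, where R_i is the extension of elementary cellular automata rule i introduced in this study.

The aim is for the classifier to differentiate between input generated by different text structures α and β. Suppose that I_α,β = {I^{α,β}_1, I^{α,β}_2, …, I^{α,β}_n} represents a set of input text sequences for the classifier, where I^{α,β}_i is an input sequence of either type α or β. Here, α and β define the type of text structure an input instance can be generated by, with α, β ∈ {a, b, c, d}, corresponding to the Markov chains shown in Figure 3.2. Specifically, a is random text, b is biased text, c is alternating text, and d is alternating text with noise. To analyse the performance of the various configurations, two stochastic variables are introduced which are associated with the choices of N_γ, I_α,β, and R. Given the objective to differentiate between rules, each rule R_i in R is considered individually.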
The first variable, µ(N_γ, I_α,β, R_i), denotes the mean classification accuracy for a given combination of network model, input text structure, and transition rule. The second variable, σ²(N_γ, I_α,β, R_i), represents the standard deviation of the classification accuracy. This variable quantifies the variability or spread of the individual classification accuracies across the experiments, providing insight into the consistency of the performance for each configuration.

To obtain µ(N_γ, I_α,β, R_i) and σ²(N_γ, I_α,β, R_i), the following procedure is repeated 100 times:

1. Generate a network based on the network model γ.
2. For each transition rule R_i, use the network and transition rule as the reservoir in a reservoir computing context.
3. Train the system on a training dataset I_α,β.
4. Test the system on a test dataset I′_α,β.
5. Record the prediction accuracy for R_i on the network, as the number of correct predictions divided by the total number of predictions.

After creating 100 unique graphs, µ(N_γ, I_α,β, R_i) is computed as the mean prediction accuracy for the rule R_i, and σ²(N_γ, I_α,β, R_i) is calculated as the standard deviation of the accuracy across different graphs. In this framework, the efficiency of rule R_i is characterized by the metrics µ(N_γ, I_α,β, R_i) and σ²(N_γ, I_α,β, R_i), which indicate average performance and performance consistency, respectively, across the network ensemble within the specified topology and input type. For simplicity, the notation I will be used instead of I_α,β when representing the mean and standard deviation for a network model and rule across all input types.

The scope of the experiments can be summarized as follows. Numerous small models are created. In a single experiment, 256 · 100 = 25600 reservoir computing models are trained and evaluated. In this study, a total of 6 · 3 = 18 experiments are conducted. This means that over all experiments, a total of 25600 · 18 = 460800 models are created and 256 · 6 = 1536 different model configurations are evaluated. The hyperparameters used for the training of the readout layer include a training dataset size of 400, a test dataset size of 100, a learning rate of 0.01, and 1000 passes over the training dataset.

4.2 Rule Entropy

In addition to the experiments, the information gain of predecessors in a single cell will also be computed for all configurations of network topology and transition rules. This type of assessment of CAN has been conducted in previous studies [37]. Information gain is defined as the difference between the entropy of a variable Y and the conditional entropy of Y given a variable X. Here, Y is defined as the next state of a single cell, and X is defined as the current state of either the cell itself or one of its predecessors. This is also known as mutual information. The computation aims to better understand the effects of different configurations when certain parts of the system are known and to highlight important properties of CAN. It is also utilized to categorise transition rules with similar behaviour.

In information theory, the entropy [47] of the binary variable Y is calculated as:

    H(Y) = − Σ_{y∈{0,1}} p(Y = y) log₂ p(Y = y)

where p(Y = y) is the probability of Y being y. Similarly, the conditional entropy of Y given the binary variable X is calculated as:

    H(Y|X) = Σ_{x∈{0,1}} p(X = x) ( − Σ_{y∈{0,1}} p(Y = y|X = x) log₂ p(Y = y|X = x) )

where p(Y = y|X = x) is the conditional probability of Y being y given X = x, and p(X = x) is the probability of X being x.
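Finally, the information gain from knowing X is computed as:

    IG(X) = H(Y) − H(Y|X)

This measures the reduction in uncertainty, i.e., the number of bits needed to describe the outcome, for Y when X is known. For the purposes of this study, it is assumed that all possible current states are equally likely. This simplification facilitates a more straightforward computation of the conditional entropy and, subsequently, the information gain. Therefore, the calculations can be further simplified as:

    H(Y|X) = (1/2) H(Y|X = 0) + (1/2) H(Y|X = 1)

These quantities reduce to a few lines of code. The sketch below (illustrative Python, assuming the equally-likely-states simplification above) computes the information gain of the first predecessor for a rule-91 cell with two predecessors:

```python
import numpy as np
from itertools import product

def entropy_of(outcomes):
    """Entropy of a list of equally likely binary outcomes."""
    p1 = float(np.mean(outcomes))
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * np.log2(p1) - (1.0 - p1) * np.log2(1.0 - p1)

table = [(91 >> k) & 1 for k in range(8)]                  # rule 91 lookup table

# All equally likely current states (x1, c, x2): first predecessor, cell, second predecessor.
states = list(product([0, 1], repeat=3))
Y = [table[4 * x1 + 2 * c + x2] for x1, c, x2 in states]   # next state of the cell

H_Y = entropy_of(Y)
# Condition on the first predecessor: H(Y|X) = 1/2 H(Y|X=0) + 1/2 H(Y|X=1).
H_Y_given_X = 0.5 * entropy_of([y for (x1, _, _), y in zip(states, Y) if x1 == 0]) \
            + 0.5 * entropy_of([y for (x1, _, _), y in zip(states, Y) if x1 == 1])
print("IG =", H_Y - H_Y_given_X)
```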
4.2 Rule Entropy

In addition to the experiments, the information gain of predecessors in a single cell is computed for all configurations of network topology and transition rules. This type of assessment of CAN has been conducted in previous studies [37]. Information gain, also known as mutual information, is defined as the difference between the entropy of a variable Y and the conditional entropy of Y given a variable X. Here, Y is the next state of a single cell, and X is the current state of either the cell itself or one of its predecessors. The computation aims to better understand the effects of different configurations when certain parts of the system are known and to highlight important properties of CAN. It is also utilized to categorise transition rules with similar behaviour.

In information theory, the entropy [47] of the binary variable Y is calculated as

\[ H(Y) = -\sum_{y \in \{0,1\}} p(Y = y) \log_2 p(Y = y) \]

where p(Y = y) is the probability of Y being y. Similarly, the conditional entropy of Y given the binary variable X is calculated as

\[ H(Y \mid X) = \sum_{x \in \{0,1\}} p(X = x) \left( -\sum_{y \in \{0,1\}} p(Y = y \mid X = x) \log_2 p(Y = y \mid X = x) \right) \]

where p(Y = y | X = x) is the conditional probability of Y being y given X = x, and p(X = x) is the probability of X being x. Finally, the information gain from knowing X is computed as

\[ IG(X) = H(Y) - H(Y \mid X) \]

This measures the reduction in uncertainty, i.e., the number of bits needed to describe the outcome of Y, when X is known. For the purposes of this study, it is assumed that all possible current states are equally likely. This simplification facilitates a more straightforward computation of the conditional entropy and, subsequently, the information gain. The conditional entropy therefore simplifies to

\[ H(Y \mid X) = \tfrac{1}{2} H(Y \mid X = 0) + \tfrac{1}{2} H(Y \mid X = 1) \]

In this study, the outcome consists of a single bit, which means that the maximum possible entropy is H(Y) = 1, indicating that the outcome is equally likely to be 0 or 1. Conversely, the minimum entropy is H(Y) = 0, which means that the outcome is always the same. Consequently, IG(X) = 1 implies that both outcomes were equally likely before, but with the additional knowledge of X the outcome becomes certain. In contrast, IG(X) = 0 indicates that knowledge of X provides no additional information about the outcome.
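Under the uniform-state assumption, these quantities can be computed directly from a rule's truth table. The sketch below does this for a plain elementary rule over a (left, current, right) neighbourhood, which corresponds to the two-predecessor case; it is an illustration of the definitions above rather than the code used in the thesis.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable with p = p(Y = 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(rule, position):
    """IG(X) = H(Y) - H(Y|X) for an elementary rule, assuming all eight
    neighbourhood states (left, current, right) are equally likely.

    position: 0 = first predecessor (left), 1 = current cell, 2 = second
    predecessor (right)."""
    triples = [(l, c, r) for l in (0, 1) for c in (0, 1) for r in (0, 1)]
    out = {t: (rule >> (4*t[0] + 2*t[1] + t[2])) & 1 for t in triples}
    H_Y = binary_entropy(sum(out.values()) / 8)
    H_Y_given_X = 0.0
    for x in (0, 1):  # condition on the chosen cell being in state x
        subset = [out[t] for t in triples if t[position] == x]
        H_Y_given_X += 0.5 * binary_entropy(sum(subset) / 4)
    return H_Y - H_Y_given_X
```

For example, information_gain(71, 1) evaluates to 0 while information_gain(71, 0) is approximately 0.19, matching the values later reported for rules 29 and 71 on N_PD-2 in Section 5.3.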
5 Results

This chapter presents the results from the text classification tasks, organized into three sections, each offering a different perspective on the findings. The first section examines the impact of different network topologies on prediction accuracy. The second explores how various transition rules affect prediction accuracy. Finally, the third details the information gain associated with different combinations of network topologies and transition rules.

5.1 Network Topology Performance

The overall performance of the different network models on the text classification tasks is summarized in Figure 5.1. The average prediction accuracy, µ(N_γ, I, R_i), is calculated for each combination of network model and transition rule across all input types. The results are then grouped by γ, and within each group the transition rules are ordered from worst to best, based on their prediction accuracy.

Figure 5.1: Average performance µ(N_γ, I, R_i) over all experiments for each rule R_i, i ∈ {0, 1, ..., 255}, grouped by network model γ. The rules are sorted from worst to best performance within each group, with the x-axis representing the rank of each rule within its respective group.

This implies that the i-th best-performing rule for each γ does not have to be the same rule across all groups. Instead, the i-th position within each group represents the rule with the i-th best performance among rules applied to that specific network model. Moving forward, N_γ is used instead of γ to emphasize that a set of network instances of the network model γ is utilised in all experiments, rather than a single instance of γ.

The analysis of the graph reveals a distinct hierarchy in the performance of the different N_γ. The highest overall performance is observed for N_PD-2, followed by N_PD-3, N_WS, N_BA, and N_ER, respectively. An exception is noted for N_PD-4, where the prediction accuracy does not follow the same trend as the other network models: while N_PD-4 exhibits poor overall performance, it demonstrates excellent peak performance compared to the other network models. Additionally, the number n of transition rules that fail to provide meaningful computation, indicated by µ(N_γ, I, R_i) ≈ 0.50, is approximately the same for each N_γ, with n ≈ 75.

The observed results indicate that the most promising performance with the proposed transition rules is found in network models with a fixed number of predecessors, specifically N_PD-2, N_PD-3, and N_PD-4. Therefore, for the sake of simplicity, the primary focus will be on these network models moving forward. The full results for N_BA, N_ER, and N_WS can instead be found in Appendix A, Table A.2.

Moreover, network models with more than four predecessors do not appear to offer any additional computational capabilities, as demonstrated in Figure 5.2. The experiment focuses exclusively on the best-performing rule, R_91, for N_PD-4, as this rule seems to exhibit properties that are well suited for network models with more than two predecessors. Additionally, the investigation only considers one type of input, I^(a,d). Given these limitations compared to the original experiments, the average performance is evaluated over 10 runs for each network model N_PD-δ, δ ∈ {2, 3, ..., 19}. The results strongly suggest that for δ > 4 there is no meaningful computation performed by the network. Consequently, these types of network models are not investigated further.

Figure 5.2: Average performance µ(N_PD-δ, I^(a,d), R_91) with error bars representing σ²(N_PD-δ, I^(a,d), R_91) over 10 runs for each network model PD-δ, δ ∈ {2, 3, ..., 19}. The shaded gray area highlights the region where a significant drop in performance is observed.

For comparative purposes, a simple network consisting solely of the two driven input nodes was also evaluated. This configuration is equivalent to the readout layer directly processing the input sequence, which violates the fundamental principle of reservoir computing that the readout layer should not perform any computation. Nonetheless, this comparison is crucial, as it provides a baseline to determine whether the other network models genuinely enhance the computational capabilities of the system. Table 5.1 presents the prediction accuracy when this simple network is used as the reservoir. As in the previous experiments, the readout layer processes the last 20 states of the network evolution. The results indicate that while the readout layer can distinguish between random input and input of type B, it fails to differentiate between inputs of type C or D.

Table 5.1: Mean prediction accuracy µ(−, I^(a,b), R), µ(−, I^(a,c), R), and µ(−, I^(a,d), R) without a reservoir, i.e., with the readout layer applied directly to the input sequence.

 Input type     |  B   |  C   |  D
 Mean accuracy  | 0.74 | 0.50 | 0.50

5.2 Transition Rule Performance

In this section on transition rules, the analysis begins with an overview of the general performance of the rules. Then, a detailed examination is conducted for each network, aiming to categorize the rules based on their performance and to highlight the best-performing rules for each network topology. This specifically involves analysing the distribution of prediction accuracy across different types of input.

5.2.1 General Performance

The performance of each transition rule given a network model is summarised in Figure 5.3. The average prediction accuracy is normalised to highlight the relative performance of the rules. Specifically, the prediction accuracy for each input type is normalised over all rules, so that the best-performing rule for a specific input type receives the value 1 and the worst-performing rule the value 0. The normalised average performance is denoted µ(N_γ, I′, R_i), where I′ indicates that the prediction accuracy for each input type is normalized before combining the averages of the three input types.
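This per-input-type normalisation is an ordinary min-max rescaling, sketched below; acc_b, acc_c, and acc_d are hypothetical arrays, each holding the 256 mean accuracies for one network model and one input type.

```python
import numpy as np

def minmax(acc):
    """Rescale mean accuracies over all 256 rules so that the best rule
    for this input type maps to 1 and the worst to 0."""
    acc = np.asarray(acc, dtype=float)
    return (acc - acc.min()) / (acc.max() - acc.min())

# mu(N_gamma, I', R_i): normalise each input type first, then average.
# mu_prime = np.mean([minmax(acc_b), minmax(acc_c), minmax(acc_d)], axis=0)
```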
The results are presented in a grid format, where each square corresponds to the normalised prediction accuracy of a single rule, µ(N_γ, I′, R_i), with the rules displayed in numerical order. Recall that a rule comprises eight mappings from current state to next state and is named after the binary representation of the output cell in these mappings. Consequently, all rules in the same row are identical with respect to the first four mappings, while all rules in the same column share the same last four mappings. There are four grids presented in total: the first corresponds to N_PD-2, the second to N_PD-3, and the third to N_PD-4. The fourth grid is a composite of the previous three, representing the average value of those grids for each rule, denoted µ(N_PD-2,3,4, I′, R_i).

Figure 5.3: Average normalized prediction accuracy µ(N_PD-δ, I′, R_i) over all three input sequences I′ for each rule, based on the network models: (a) PD-2, (b) PD-3, (c) PD-4, and (d) the combined average for all three network models. Rules are presented in numerical order.

While specific rules will be discussed in detail later in this section, some general trends can already be observed across these four grids. Most notably, the performance of each rule is not random, and certain rules are more effective at processing input than others. Some rules exhibit consistently poor performance, effectively erasing any information introduced into the network. Conversely, there are rules that consistently demonstrate superior information processing capabilities compared to others. In N_PD-2, a wide range of rules exhibit decent performance; however, as the number of predecessors increases, only a few rules stand out. Furthermore, there is no universally optimal rule that performs best across all types of network models. Nonetheless, several rules consistently rank among the top performers across all network models considered.

Several general trends can be observed in the grids. Odd-numbered rules tend to perform better than even-numbered ones. This advantage arises from the fact that odd-numbered rules include the mapping 000 → 1, which prevents networks with many cells in state 0 from becoming stuck in that state. Similarly, rules numbered 127 and below outperform those numbered 128 and above; this distinction is due to the mapping 111 → 0 present in the lower-numbered rules. The large rectangle of poorly performing rules in the bottom right corner can be attributed to the fact that all of these rules share the three mappings 111 → 1, 110 → 1, and 011 → 1. This combination increases the likelihood that the network will converge to an all-1 state: for these rules, the only scenario in which a cell in state 1 can transition to state 0 is when both of its predecessors are in state 0.

As the number of predecessors increases, fewer rules exhibit superior performance relative to others. A factor contributing to this trend is the majority voting mechanism. For N_PD-2, this is not a concern, since there are only two predecessors. In the case of N_PD-3, there are three predecessors, resulting in three pairs of combinations. Given that the number of pairs is odd, the impact of the majority vote remains relatively insignificant. However, for N_PD-4, there are four predecessors, leading to a total of six pairwise combinations. With an even number of pairs, situations arise where a tiebreaker is required. In such instances, the tiebreaker rule specifies that the number of votes for state 1 must exceed those for state 0. This requirement can negatively affect the information processing capabilities of certain rules.
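Based on this description, the extended rule can be sketched as follows. The pairing convention, in which the earlier predecessor acts as the left neighbour and the current cell always sits in the middle, is our reading of the text and of the predecessor-order observations in Section 5.3; the thesis's actual implementation may differ in its details.

```python
from itertools import combinations

def extended_rule(rule, current, predecessors):
    """Next state of a cell under the extension of elementary rule `rule`.

    Every pair of predecessors votes by applying the elementary rule to the
    neighbourhood (left, current, right), with the earlier predecessor in the
    ordered list acting as the left neighbour. The next state is 1 only if
    the votes for state 1 strictly outnumber the votes for state 0, so a tie
    yields 0."""
    votes = [(rule >> (4*left + 2*current + right)) & 1
             for left, right in combinations(predecessors, 2)]
    return 1 if sum(votes) > len(votes) - sum(votes) else 0
```

With two predecessors there is a single pair, so this reduces to the plain elementary rule; with four predecessors the six votes can tie, which is where the tiebreaker above takes effect.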
The results are presented in Table 5.2. The rules are ordered from best to worst performing based on the overall prediction accuracy across all input types and the three network models. Additionally, the table includes the standard deviation for each rule to indicate the consistency of its performance, as well as its rank among all rules for each network model. The highest prediction accuracies are 75.9%, 72.2%, and 77.1% for the three network models with increasing numbers of predecessors, respectively. The standard deviation is relatively high, approximately 10% for most rules, indicating variability in their performance. The exception is the set of rules that consistently exhibit poor performance, which have a standard deviation of around 0%.

The results suggest that the exact ranking of rules should be interpreted with caution. Although each input type-network model-rule configuration is run 100 times, this does not eliminate variability in the data. Nevertheless, certain trends are evident and significant. Notably, the performance of rules varies substantially across different network models.

Table 5.2: Numerical results from the experiments, including the rank, mean accuracy, and standard deviation for each rule on each network model, as well as the overall results. Only a subset of the rules is shown; complete results can be found in Table A.1 in Appendix A.
 R_i |      N_PD-2       |      N_PD-3       |      N_PD-4       |      Overall
     | rank   µ     σ²   | rank   µ     σ²   | rank   µ     σ²   | rank   µ     σ²
  91 |  61  0.678  0.078 |   2  0.717  0.100 |   1  0.771  0.103 |   1  0.722  0.093
 123 |  53  0.681  0.105 |   3  0.716  0.098 |   4  0.742  0.096 |   2  0.713  0.100
  94 |  41  0.699  0.124 |  12  0.688  0.105 |   6  0.707  0.097 |   3  0.698  0.109
 144 |   3  0.755  0.150 |  20  0.686  0.165 |  19  0.638  0.161 |   4  0.693  0.159
  16 |   4  0.754  0.151 |  24  0.683  0.168 |  17  0.641  0.166 |   5  0.693  0.162
  56 |   6  0.744  0.119 |  19  0.686  0.137 |  16  0.644  0.132 |   6  0.691  0.129
   2 |   5  0.746  0.151 |  16  0.686  0.169 |  20  0.636  0.163 |   7  0.689  0.161
  98 |   7  0.742  0.114 |  15  0.687  0.142 |  18  0.639  0.124 |   8  0.689  0.127
 126 |  59  0.679  0.078 |  30  0.675  0.080 |   5  0.707  0.086 |   9  0.687  0.081
 116 |  14  0.731  0.125 |  36  0.668  0.135 |  12  0.658  0.094 |  10  0.686  0.118
 130 |   8  0.740  0.153 |  27  0.681  0.168 |  21  0.635  0.164 |  11  0.685  0.162
  30 |  56  0.679  0.092 |   7  0.694  0.088 |   9  0.674  0.095 |  12  0.683  0.092
  24 |  10  0.738  0.149 |  29  0.678  0.158 |  22  0.628  0.155 |  13  0.681  0.154
  46 |  21  0.720  0.122 |  45  0.661  0.131 |  11  0.659  0.093 |  14  0.680  0.115
  66 |  15  0.730  0.147 |  28  0.680  0.163 |  25  0.625  0.153 |  15  0.679  0.154
 118 |  46  0.696  0.102 |  25  0.682  0.130 |  14  0.657  0.103 |  16  0.678  0.112
  62 |  48  0.693  0.109 |  26  0.681  0.139 |  13  0.658  0.095 |  17  0.678  0.114
 107 | 102  0.633  0.076 |  60  0.648  0.089 |   3  0.750  0.099 |  18  0.677  0.088
 121 | 100  0.635  0.072 |  68  0.642  0.083 |   2  0.751  0.099 |  19  0.676  0.085
  86 |  73  0.664  0.088 |  17  0.686  0.096 |  10  0.663  0.098 |  20  0.671  0.094
  48 |   9  0.739  0.151 |  33  0.670  0.162 |  35  0.590  0.131 |  21  0.666  0.148
  34 |  16  0.728  0.152 |  34  0.670  0.163 |  34  0.593  0.132 |  22  0.664  0.149
 131 |  38  0.702  0.106 |   5  0.699  0.136 |  36  0.590  0.063 |  23  0.663  0.102
  29 |   2  0.757  0.106 |  11  0.689  0.129 |  72  0.537  0.080 |  24  0.661  0.105
 145 |  40  0.699  0.101 |   9  0.693  0.137 |  38  0.587  0.067 |  25  0.660  0.101
  71 |   1  0.759  0.111 |  14  0.687  0.138 |  79  0.530  0.071 |  26  0.658  0.107
 125 |  26  0.715  0.100 |   8  0.693  0.096 |  49  0.558  0.066 |  27  0.655  0.088
  65 |  30  0.712  0.095 |  18  0.686  0.095 |  47  0.565  0.104 |  28  0.654  0.098
 176 |  37  0.705  0.147 |  31  0.671  0.159 |  39  0.584  0.105 |  29  0.653  0.137
 122 | 118  0.609  0.081 |  50  0.658  0.077 |   8  0.692  0.082 |  30  0.653  0.080
 ...
 133 |  45  0.697  0.121 |  10  0.692  0.109 |  69  0.540  0.050 |  36  0.643  0.093
 ...
  33 |  49  0.692  0.102 |   1  0.722  0.098 | 104  0.511  0.032 |  40  0.642  0.077
 ...
 129 |  54  0.680  0.082 |  22  0.685  0.084 |  80  0.530  0.046 |  45  0.631  0.071
 ...
 135 |  58  0.679  0.097 |   6  0.694  0.091 |  92  0.517  0.035 |  48  0.630  0.074
 ...
  37 |  62  0.677  0.074 |   4  0.707  0.091 | 183  0.500  0.035 |  51  0.628  0.066
 ...
  90 | 187  0.507  0.064 |  47  0.660  0.085 |   7  0.700  0.101 |  60  0.622  0.084
 ...
 246 |  39  0.701  0.162 |  71  0.641  0.159 | 204  0.500  0.000 |  75  0.614  0.107
 ...
 191 |  44  0.698  0.160 |  72  0.640  0.155 | 204  0.500  0.000 |  77  0.613  0.105
 ...
 247 |  42  0.699  0.165 |  75  0.637  0.155 | 204  0.500  0.000 |  79  0.612  0.107
 ...
 110 | 125  0.598  0.099 | 104  0.577  0.085 |  28  0.608  0.091 |  88  0.595  0.092
 ...
  15 | 143  0.562  0.068 | 118  0.551  0.068 |  52  0.556  0.100 | 121  0.556  0.079
 ...
  85 | 142  0.562  0.075 | 125  0.543  0.071 |  55  0.552  0.102 | 128  0.552  0.083
 ...
  54 | 157  0.544  0.063 | 190  0.502  0.034 | 109  0.508  0.042 | 168  0.518  0.047
 ...
   0 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
  32 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 250 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 251 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 254 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 255 | 232  0.500  0.000 | 222  0.500  0.000 | 204  0.500  0.000 | 232  0.500  0.000
 ...
 222 | 245  0.499  0.025 | 256  0.497  0.022 | 146  0.501  0.019 | 252  0.499  0.022
 200 | 254  0.498  0.025 | 247  0.498  0.024 | 165  0.501  0.022 | 253  0.499  0.024
  19 | 256  0.496  0.031 | 237  0.500  0.021 | 158  0.501  0.021 | 254  0.499  0.025
 221 | 244  0.500  0.022 | 242  0.499  0.022 | 253  0.497  0.021 | 255  0.499  0.022
 132 | 251  0.499  0.022 | 253  0.497  0.023 | 254  0.497  0.026 | 256  0.498  0.024

For example, the best-performing rule on N_PD-2 is only ranked 26th overall, while the best-performing rule on N_PD-3 is ranked 40th. This indicates that no single rule universally excels across all network models. However, some rules consistently achieve around 70% prediction accuracy across all network models, suggesting a degree of generalization capability. These findings highlight the importance of considering specific network model characteristics when evaluating rule performance.

5.2.2 Categorisation based on Accuracy Distribution

A more detailed evaluation and categorization can be achieved by examining the accuracy distribution. This distribution reveals the spread of accuracy for individual instances of configurations with different input type, network model, and transition rule. Based on the location and size of peaks in the distributions, rules can be grouped into four categories with similar patterns by applying a few simple criteria. Figure 5.4 illustrates examples of these distributions using kernel density estimation plots, highlighting the core properties of each group.

Figure 5.4: Four groups of distribution types: (a) Group 1, (b) Group 2, (c) Group 3, and (d) Group 4.

Group 1 consists of rules that offer no valuable computation. An example of such a rule is depicted in Figure 5.4a, where the distributions for all three input types are centered around 0.5. This indicates that, regardless of the network instance, the rule effectively erases any information introduced into the network.

Group 2 comprises rules similar to those in Group 1 but with one key difference: these distributions exhibit a clear tail for one or more input types, suggesting that, for a small subset of network instances, the rule can perform computations on the network. An example is shown in Figure 5.4b.

Group 3 includes rules whose distributions demonstrate a clear ability to perform computations on the networks, but where this capability depends on the network instance. This is indicated by a substantial portion of the distribution's density being located above 60% accuracy, while there remains a significant peak at 0.5 for at least one input type. An example is provided in Figure 5.4c, where each input type exhibits two peaks: one at 50% accuracy and another at a higher accuracy.

Group 4 consists of the best-performing rules, characterized by the absence of a significant peak at 50% accuracy for any of the input types. These model configurations display the ability to generalize effectively and represent the best options identified in this study for computation on CAN. An example is illustrated in Figure 5.4d, where the accuracy is consistently high.

The categorization is based on several metrics that differentiate rules across the various groups. In summary, rules where all peaks are below 60% accuracy are allocated to Group 1. Rules where all the highest peaks exceed 60% accuracy, with no significant peak below 60%, are placed in Group 4. Rules with significant peaks both above and below 60% accuracy are assigned to Group 3. The remaining rules, which exhibit a tail in their distribution, are categorized into Group 2.
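These criteria can be phrased as a small decision function. In the sketch below, peaks is assumed to be a list of accuracy locations of the significant peaks found in a rule's distributions, pooled over the three input types, and has_tail a precomputed flag for a clear tail; both are hypothetical stand-ins for the exact peak metrics used in the thesis.

```python
def assign_group(peaks, has_tail, threshold=0.60):
    """Map a rule's accuracy-distribution peaks to Groups 1-4."""
    above = any(p >= threshold for p in peaks)
    below = any(p < threshold for p in peaks)
    if above and below:
        return 3                   # significant peaks on both sides of 60%
    if above:
        return 4                   # all significant peaks above 60%
    return 2 if has_tail else 1    # no peak above 60%: the tail decides 2 vs 1
```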
Previous results have suggested that the performance of individual rules is closely tied to the network model. Therefore, the categorization is conducted separately for each of the three network models, rather than combining them.

5.2.3 Categorisation of Rules for N_PD-2

The categorisation of rules for N_PD-2 is depicted in Figure 5.5. There are a total of nine rules in Group 4, specifically rules 29, 56, 65, 71, 98, 118, 125, 131, and 145. Most rules fall into Group 3, while the remaining rules are evenly distributed between Groups 1 and 2. It is evident that a significant number of rules are effective on this network model, although most are dependent on the specific network instance.

Figure 5.5: Categorisation of rules on N_PD-2.

The distributions of a selected set of four rules are illustrated in Figure 5.6. These rules were chosen for their superior performance with one or more input types. The top-performing rules for N_PD-2 are rules 71 and 29. These two rules are mirrored versions of each other; they are symmetrical and share the same general behaviour. The distribution for Rule 71 is shown in Figure 5.6a, and Rule 29 displays a comparable distribution. These rules perform exceptionally well on alternating input, perform on biased input similarly to a pure readout layer, and handle alternating input with noise effectively.

Following rules 71 and 29, the next best rule is Rule 144, with Rules 16 and 2, among others, demonstrating similar behaviour and performance. These rules are characterized by incorporating one of the mappings 100 → 1 and 001 → 1, with most other outputs in state 0. The distribution for Rule 144 is depicted in Figure 5.6b. It can be observed that for certain network instances this model configuration performs exceptionally well, but there is also a significant peak around 50% accuracy, highlighting the rule's network dependency. Consequently, these types of rules are not included in Group 4.

Next, rules 56 and 98, which are mirrored versions of each other and exhibit similar behaviour, show notable performance. The distribution for Rule 56 is presented in Figure 5.6c. This distribution resembles that of Rule 144, with the rule performing best on alternating text, followed by alternating text with noise, and finally biased text. The peaks at 50% accuracy are less pronounced, although the overall performance is slightly lower.

A rule that does not generally perform well but excels with biased text is Rule 85, along with Rule 15, which shares similar behaviour. The distribution for Rule 85 is shown in Figure 5.6d. It is evident that this rule cannot process alternating text effectively; however, it performs exceptionally well with biased text. Although the average performance is still lower than using the readout layer directly on the input, the peak performance is better. The peak at 50% accuracy indicates that the rule's performance is still dependent on the network instance, meaning it does not work uniformly well across all network instances.

Figure 5.6: Four of the best rules on N_PD-2: (a) Rule 71, (b) Rule 144, (c) Rule 56, (d) Rule 85.

5.2.4 Categorisation of Rules for N_PD-3

The categorization of rules for N_PD-3 is illustrated in Figure 5.7. A total of four rules fall into Group 4: rules 30, 33, 37, and 65.
The majority of rules are classified into Group 3, while the remaining rules are evenly distributed between Groups 1 and 2. Similar to N_PD-2, it is evident that many rules are effective on this network model, but their performance is largely dependent on the specific network instance. However, the number of effective rules on N_PD-3 is lower than on N_PD-2.

Figure 5.7: Categorisation of rules on N_PD-3.

The distributions of a selected set of four rules are illustrated in Figure 5.8. These rules were chosen for their exceptional performance with one or more input types. Similar to N_PD-2, the best rules come in pairs with analogous performance and behavior. Moreover, for the three pairs discussed here, one rule in the pair is classified in Group 3 and the other in Group 4, despite their near-identical characteristics. Interestingly, the rule in Group 4 is not necessarily the one with the highest performance, highlighting the limitations of using purely metric-based criteria for grouping. Nevertheless, the rules in Group 4 are among the best performers, indicating that the groups are still meaningful.

The top-performing rule is Rule 33, which shares similar performance and behaviour with Rule 123. The distribution for Rule 33 is depicted in Figure 5.8a. Unlike the best rules for N_PD-2, this model configuration performs better on biased text than on text with noise. The overall performance is also approximately 4% lower than the best performer on N_PD-2.

The second best-performing rule is Rule 91, which shares similar performance and behaviour with Rule 37. The distribution for Rule 91 is shown in Figure 5.8b. These rules perform better on text with noise than on biased text. In fact, the performance on biased text is relatively poor, with many runs centred around 50% accuracy; this is offset by better performance on alternating text and text with noise compared to Rule 33. Additionally, Rules 30 and 135 exhibit different behaviour but show similar distributions.

A rule that matches the average performance of the aforementioned rules is Rule 131. The distribution for Rule 131 is depicted in Figure 5.8c. It is evident that Rule 131 heavily depends on the network instance, but when the setup is optimal, its performance surpasses that of its counterparts.

Finally, the last rule in Group 4 that has not yet been mentioned is Rule 65, which was also part of Group 4 for N_PD-2. The distribution for Rule 65 is shown in Figure 5.8d. The peaks just surpass the 60% accuracy threshold, which explains why the average performance is relatively poor despite the rule belonging to Group 4. The same trend was observed for this rule on N_PD-2.

Figure 5.8: Four of the best rules on N_PD-3: (a) Rule 33, (b) Rule 91, (c) Rule 131, (d) Rule 65.

5.2.5 Categorisation of Rules for N_PD-4

The categorisation of rules for N_PD-4 is illustrated in Figure 5.9. A total of two rules are classified in Group 4: rules 91 and 121. Most rules fall into Group 1, while the remaining rules are evenly split between Groups 2 and 3. It is evident that the number of effective rules for this network model is significantly more limited compared to the previous models.

The distributions of the four best-performing rules are shown in Figure 5.10. For N_PD-4, the four rules that perform significantly better than the others are, in order: 91, 121, 107, and 123. Additionally, four more rules (126, 94, 90, and 122) stand out among the rest. Of these rules, 91 and 121 belong to Group 4.
The distributions for the top four rules exhibit many similarities. They all perform well on alternating text, and their performance on biased text is comparable. The primary difference lies in their performance on text with noise. Compared to rules on other network models, the performance of these rules is slightly lower than the best-performing configurations on N_PD-2. However, Rule 91 on N_PD-4 stands out as the best overall performing rule, with an accuracy of 77.1%, compared to 75.9% for Rule 71 on N_PD-2. It is important to note that there is some variance in the prediction accuracy, so the true difference may be smaller or larger than observed.

Figure 5.9: Categorisation of rules on N_PD-4.

Figure 5.10: Four of the best rules on N_PD-4: (a) Rule 91, (b) Rule 121, (c) Rule 107, (d) Rule 123.

5.3 Information Gain

The information gain of the rules on the different network models was also examined. The average information gain IG(·) is presented in Table 5.3; the full results can be viewed in Table A.3 in Appendix A. X_current represents the current cell, X_a the first predecessor, X_b the second predecessor, and so on.

Table 5.3: Average information gain for 2, 3, and 4 predecessors, respectively.

 N_γ     | H(Y)  | IG(X_current) | IG(X_a) | IG(X_b) | IG(X_c) | IG(X_d) | Mean IG(X_a-d)
 N_PD-2  | 0.902 | 0.122         | 0.122   | 0.122   |   -     |   -     | 0.122
 N_PD-3  | 0.857 | 0.167         | 0.104   | 0.010   | 0.104   |   -     | 0.072
 N_PD-4  | 0.806 | 0.176         | 0.062   | 0.019   | 0.019   | 0.062   | 0.040

The observations are as expected. For networks with two predecessors, all three cells, including the current cell, exhibit the same level of information gain. For networks with three predecessors, the current cell generates the most information gain and has the greatest influence on the outcome; in this configuration, the middle predecessor, which is considered once as a left neighbour and once as a right neighbour, has the least influence. A similar trend is observed for networks with four predecessors, where the current cell again exerts the most influence and the middle cells have the least.

Additionally, the average information gain for a single predecessor, mean IG(X_a-d), diminishes as the number of predecessors increases. This indicates that individual predecessors influence the cell less when there are more of them, giving the current cell a larger impact on the outcome. Furthermore, the uncertainty H(Y) in the absence of prior information decreases for networks with more predecessors, indicating that denser networks become more predictable.

Information gain is also employed to group similar rules based on behaviour rather than performance. Several important observations can be highlighted. For N_PD-4, rules 91, 94, 107, 121, and 122 exhibit identical information gain values for the current cell and all predecessors, at 0.0084. This is also true for rules 90 and 126; however, in these cases, the information gain from knowledge of either the current cell or a single predecessor is 0. Similarly, for Rule 123, which is among the top 8 performers on N_PD-4, there is no information gain from a single predecessor, but there is information gain from knowledge of the current cell.

Among the best rules for N_PD-3, there is more variation in information gain. Rules 91 and 37 show a consistent information gain of 0.018 from the state of either the current cell or a single predecessor. Rules 123 and 33 exhibit no information gain from a single predecessor but do show information gain from the state of the current cell. However, for rules 30, 131, and 135, the information gain from predecessors depends on their order: in all three cases, knowledge of the state of the first predecessor yields significantly higher information gain than that of any other predecessor.

The best rules for N_PD-2 can also be described using information gain.
Without prior information, the uncertainty of the outcome when using rules 29 or 71 is 1, meaning the states 0 and 1 are equally likely. Knowledge of the current cell's state produces no information gain, while knowledge of either predecessor provides an information gain of 0.19. Rules 56 and 98 yield the same information gain from the current cell or a single predecessor, at 0.049. This is also true for rules 2 and 16, but with an information gain of 0.14. Rule 144 is asymmetrical: knowledge of the first predecessor provides an information gain of 0.31, while knowledge of the current cell or the second predecessor yields no information gain.

It is important to note that these observations are not unique to the best-performing rules. Many rules with the same information gain distribution as the best rules do not perform well. Additionally, it is generally observed that the average information gain from a single predecessor for the best-performing rules is relatively low compared to other rules. This could suggest that the combined values of multiple predecessors are necessary to extract meaningful information from the system. It is this interplay among multiple predecessors that likely creates the complex and interesting computational behaviour observed in the best-performing rules.

6 Discussion

This chapter is organized into six sections. The first three sections address the use of cellular automata as a reservoir from various perspectives: the performance of different network models and transition rules is discussed, as well as the performance of the overall system. The fourth section discusses reservoir computing as a methodological framework. The fifth section addresses the utilisation of a genetic algorithm to optimise cellular automata networks, while the last section introduces potential directions for future research.

6.1 Finding the Correct Network Topology

It appears that a network topology should have a certain amount of structure to function well. This was shown by networks with a fixed number of predecessors outperforming their counterparts. Similarly, a network with more constraints on the connections, like the small-world network, consistently performs better than a more random network like the Erdős-Rényi network. This is in line with the edge-of-chaos idea, which emphasizes the importance of balancing the chaotic and deterministic nature of the reservoir. The results indicate that, for the transition rules examined in this study, a network cannot be too dense, as this causes the system to deviate from this edge. It is also clear that trivial connections, where a cell has no predecessor or only one, should be avoided, as these provide no additional computation.

The findings suggest that different network topologies do not produce significant differences in performance, unlike factors such as density and a uniform predecessor count. Among the networks examined in this study, N_PD-2, where all cells have two predecessors, appears to be closest to the edge of chaos. Although other configurations sometimes outperform rules on N_PD-2, these instances are few and the performance difference is negligible.
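As a concrete illustration of such a topology, the sketch below shows one plausible way to sample a network in which every cell has exactly δ randomly chosen predecessors. It reflects how the PD models are described in this thesis, but the actual construction (for instance, the handling of the two driven input nodes or of self-connections) may differ.

```python
import random

def sample_pd_network(n_cells, delta, seed=None):
    """Sample a random CAN topology where each cell has exactly `delta`
    predecessors, here drawn from the cells other than itself. Returns an
    ordered predecessor tuple per cell; the order matters, since the first
    predecessor acts as the left neighbour in the extended transition rule."""
    rng = random.Random(seed)
    return [tuple(rng.sample([j for j in range(n_cells) if j != i], delta))
            for i in range(n_cells)]
```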
While these configurations should not be entirely disregarded, it is more sensible to focus on the network model that consistently performs well across a wide range of transition rules.

The N_PD-2 network model was kept simple but can be further optimized. Potential areas for improvement include examining the order of predecessors, allowing for dynamic or varying network sizes, exploring small-world or scale-free N_PD-2 networks, and identifying specific connection patterns that enhance performance. However, caution is necessary when optimizing the system: overfitting remains a significant risk, and optimizing for a particular problem or input type might reduce the network's generalization capacity. From a practical perspective, a network with a sparse and fixed number of predecessors is also advantageous. Each connection and each piece of transition logic incurs a cost in a physical system, making simpler systems easier to implement in physical hardware.

6.2 Selecting an Appropriate Transition Rule

It is evident that the selection of a suitable transition rule depends on the underlying network. For certain network models, such as N_PD-4, determining an appropriate rule amounts to identifying the single rule that actually performs computation effectively, as the options are limited. In other cases, there are more options and multiple factors to consider when choosing the best transition rule for a specific task.

6.2.1 Categorisations of Elementary Cellular Automata

Several properties are consistently observed in the best-performing transition rules, and it is essential to examine these properties in greater depth to better understand their impact. To facilitate this discussion, Figure 6.1 illustrates a selection of the top-performing rules from the experiments as elementary cellular automata rules.

Figure 6.1: Visualisation of the three elementary cellular automata rules with the best performance on network models N_PD-2, N_PD-3, and N_PD-4, respectively: Rule 71 (output bits 01000111 for neighbourhoods 111 down to 000), Rule 33 (00100001), and Rule 91 (01011011).

First and foremost, elementary cellular automata have been extensively studied in prior research, leading to established categorisations of the rules, such as Wolfram's four classes of behaviour. For elementary cellular automata, these categorizations are typically made on a one-dimensional grid rather than on the networks studied here. Consequently, they cannot be applied directly to this study but serve as useful guidelines. This also means that the categorisations are most relevant for N_PD-2 networks, as the transition rule on that network model is the most similar to a traditional elementary cellular automata rule, with only two neighbours. For this reason, the underlying elementary cellular automaton is considered in this section, rather than the extended version.

For example, totalistic cellular automata are rules that depend solely on the total value of the cells in the neighbourhood, i.e., the current cell and its predecessors. This means that the mappings can be divided into four groups: (111; total value 3), (110, 101, 011; 2), (100, 010, 001; 1), and (000; 0), and within each group the outcome must be identical. Among the elementary cellular automata rules classified as totalistic, Rule 126 ranks 9th overall, while Rule 129 is 45th, with the other totalistic rules performing significantly worse. Rules 126 and 129 can be described as rules where, if all cells in the neighbourhood are in the same state (mappings 111 and 000), the output is one state, and when the states differ, the output is the opposite state. While this setup appears to compute, the other totalistic rules do not. The results indicate that this categorization does not appear to be particularly relevant for this problem.
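Both this totalistic categorisation and the outer-totalistic categorisation discussed next can be checked mechanically from a rule's eight output bits. The following is a small sketch of such checks, written here for illustration rather than taken from the thesis.

```python
def rule_output(rule, l, c, r):
    """Output bit of elementary rule `rule` for neighbourhood (left, current, right)."""
    return (rule >> (4*l + 2*c + r)) & 1

def is_totalistic(rule):
    """True if the output depends only on the total value l + c + r."""
    groups = {}
    for l in (0, 1):
        for c in (0, 1):
            for r in (0, 1):
                groups.setdefault(l + c + r, set()).add(rule_output(rule, l, c, r))
    return all(len(outs) == 1 for outs in groups.values())

def is_outer_totalistic(rule):
    """True if the output depends only on the outer total l + r together with
    the current cell c, i.e. swapping the two outer cells never changes it."""
    return (rule_output(rule, 1, 1, 0) == rule_output(rule, 0, 1, 1)
            and rule_output(rule, 1, 0, 0) == rule_output(rule, 0, 0, 1))

# e.g. is_totalistic(126) -> True, is_outer_totalistic(91) -> True
```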
Another categorisation is outer-totalistic cellular automata, where the output depends on the total value of the predecessors and the state of the current cell separately. The mappings can then be divided into three groups: (111, 101; total value 2¹), (110, 100, 011, 001; 1), and (010, 000; 0), and within each group the outcome must either be identical, keep the state of the current cell, or flip the state of the current cell. There are 64 outer-totalistic elementary cellular automata rules. Among the best overall performing rules, those classified as outer-totalistic include rules 33, 37, 90, 91, 94, 122, 123, 126, 129, and 133. Rule 90, ranked 60th overall, is the worst among these, and Rule 91, ranked 1st overall, is of course the best. There is significant overlap between these rules and the best-performing rules on N_PD-3 and N_PD-4, indicating that properties tied to outer-totalistic rules become useful as the number of predecessors increases beyond two. This is not surprising, as the order of predecessors becomes less important due to the extension of the rule, making rules that consider only the density, not the order, more relevant. However, this categorization does not appear to be relevant for N_PD-2, as the rules listed above are not the best performers on that network model.

¹ The value is computed from the first and the last digit both being one; the digit in the middle is ignored.

Wolfram's four classes of behaviour provide another natural categorisation of elementary cellular automata. The most interesting rules are those belonging to Class 4, which exhibit complex behaviour, in contrast to the predictable patterns of Classes 1 and 2 and the chaotic behaviour of Class 3. On the classic one-dimensional grid, rules commonly recognized as Class 4 include Rule 54 and Rule 110. These rules are ranked 168th and 88th, respectively, in overall performance in this study's experiments. This suggests that behaviour observed on a one-dimensional grid does not directly translate to a network context, or alternatively, that a Class 4 rule does not necessarily make a great reservoir.

6.2.2 The Ideal Transition Rule

Given the absence of an established categorization that accurately describes the results of this study for all networks, it is necessary to examine the transition rules in more detail. The effectiveness of transition rules is heavily dependent on both the network and the input types. Consequently, it is challenging to determine whether a well-performing rule generalizes effectively or whether it benefits significantly from the specific networks and input types used in this study. However, it is evident that the transition rule plays a more critical role than the network topology, as there is a much larger difference in performance between different rules than between network models. It is tempting to conclude that outer-totalis