A Self-Trained Chess Engine for Atomic Chess Leveraging self-training for the creation of chess bots in un- derexplored variants Bachelor’s thesis in Computer science and engineering Hannes Adolfsson David Lewis Anton Rahmn Sigge Rajamäe Edvin Rungardt Marco Tafani Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2025 Bachelor’s thesis 2025 A Self-Trained Chess Engine for Atomic Chess Leveraging self-training for the creation of chess bots in underexplored variants Hannes Adolfsson David Lewis Anton Rahmn Sigge Rajamäe Edvin Rungardt Marco Tafani Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2025 A Self-Trained Chess Engine for Atomic Chess Hannes Adolfsson David Lewis Anton Rahmn Sigge Rajamäe Edvin Rungardt Marco Tafani © Hannes Adolfsson, David Lewis, Anton Rahmn, Sigge Rajamäe, Edvin Rungardt, Marco Tafani 2025. Supervisor: Andreas Abel, Department of Computer Science and Engineering Examiner: Patrik Jansson and Arne Linde, Department of Computer Science and Engineering Graded by teacher: Yehia Abd Alrahman, Department of Computer Science and Engineering Bachelor’s Thesis 2025 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX Gothenburg, Sweden 2025 3 Abstract Chess has long been a benchmark for artificial intelligence (AI) research due to its complexity and well-defined rules. Recent advances, such as AlphaZero, introduced self-learning AI through reinforcement learning and self-play, achieving superhuman performance without prior strategic knowledge, relying solely on the rules of the game. AlphaZero defeated the world-champion chess engine Stockfish after only four hours of training, leveraging large-scale computational resources to rapidly learn and refine its strategies. This thesis presents the development of a chess engine for the chess variant Atomic Chess. The engine was developed in C++ and trained through self-play and re- inforcement learning, taking inspiration from AlphaZero’s approach. This project explores the extent to which a chess engine with this approach is feasible for the average enthusiast. Cost-effective cloud-based virtual machine instances with powerful hardware were rented to manage training workloads. Given limited computational resources, we opted for a data-centric approach, focusing on refining the training pipeline to max- imize the training data that could be produced, rather than hyperparameter tuning and experimenting with neural network architectures. The final engine was trained on approximately 450,000 self-play games in roughly 150 hours. The final engine was deployed on the chess platform Lichess and achieved an ELO- rating of 1,729, which corresponded to the top 10th percentile of Atomic Chess players on Lichess. These results demonstrate that it is possible to achieve a com- petitive Atomic Chess engine within a budget of 3,000 SEK for cloud computation. This shows that strong self-play reinforcement learning agents for niche games can be developed without requiring large-scale computing infrastructure. These results highlight the viability of accessible, low-budget AI research for underexplored game variants. Sammandrag Schack har sedan länge använts för att evaluera forskning inom artificiell intelli- gens på grund av spelets komplexitet samt väldefinierade regler. Nya framsteg, såsom AlphaZero, introducerade självlärande AI genom förstärkningsinlärning och självspelande som uppnådde övermänsklig prestanda utan någon tidigare strategisk kunskap, enbart utifrån spelets regler. AlphaZero besegrade den dåvarande bästa schackmotorn, Stockfish, efter endast ett fåtal timmars träning, där den utnyttjade storskalig hårdvaruinfrastruktur för att snabbt bli bättre och finslipa sina strategier. Denna avhandling presenterar utvecklingen av en schackmotor för schackvarianten Atomic Chess. Motorn utvecklades i C++ och tränade genom självspelande och förstärkningsinlärning, likt AlphaZero. Detta projekt utforskar om dessa metoder är rimliga för den vanlige entusiasten. Kostnadseffektiva molnbaserade virtuella maskininstanser med kraftfull hårdvara hyrdes för att hantera träningen. Med tanke på den begränsade tillgången till hårdvara valde vi ett datacentrerat tillvägagångssätt med fokus på att förfina trän- ingspipelinen för att maximera mängden träningsdata som kunde produceras, istället för att justera hyperparametrar och experimentera med neuronnätverksarkitekturer. Den slutgiltiga modellen tränades genom att spela approximativt 450,000 partier mot sig själv under en sammanlagd träningsperiod på 150 timmar. Den resulterande schackmotorn gjordes tillgänglig på schackplattformen Lichess och uppnådde en ELO-rating på 1,729, vilket motsvarar den översta 10:e percentilen av Atomic Chess-spelare på Lichess. Resultaten visade att det är möjligt att skapa en motor för Atomic Chess som spelar på en hög nivå inom en budget av 3,000 SEK för att hyra hårdvara. Detta visar att starka självspelande förstärkningsin- lärningsmotorer gjorda för nischade spel kan utvecklas utan att behöva storskalig hårdvaruinfrastruktur. Dessa resultat belyser möjligheten med lättillgänglig, låg- budgetbaserad AI-forskning för outforskade spelvarianter. Acknowledgements We would like to extend our gratitude to our supervisor, Andreas Abel, who has offered us his guidance and support throughout this entire project. Furthermore, the evaluation of the project would have been significantly more difficult had it not been for the players and creators of bots who competed against our engine on Lichess. Lastly, we would like to show appreciation to the many people who have developed the open source chess platform lichess.org. Hannes Adolfsson, David Lewis, Anton Rahmn, Sigge Rajamäe, Edvin Rungardt, Marco Tafani. Gothenburg, June 2025. Contents 1 Introduction 7 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Theory 9 2.1 Atomic Chess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 Justification for Chess Variant Selection . . . . . . . . . . . . 9 2.1.2 Atomic Chess Rules . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Chess Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Standard Monte Carlo Tree Search . . . . . . . . . . . . . . . 12 2.3.2 AlphaZero’s Monte Carlo Tree Search . . . . . . . . . . . . . . 13 2.4 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.1 Neural Network Layers . . . . . . . . . . . . . . . . . . . . . . 18 2.4.2 Learning Process of Neural Networks . . . . . . . . . . . . . . 20 2.4.3 Residual Networks . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.4 AlphaZero Network Construction . . . . . . . . . . . . . . . . 23 2.5 Bitboards and Magic Number Move Generation . . . . . . . . . . . . 23 2.6 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6.1 Monte Carlo Tree Search Parallelization . . . . . . . . . . . . 25 2.6.2 Training Parallelization . . . . . . . . . . . . . . . . . . . . . . 25 2.7 Previous Bachelor Theses and Related Work . . . . . . . . . . . . . . 26 3 Implementation 27 3.1 Implementation Overview . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2 Game Rules and Move Generation . . . . . . . . . . . . . . . . . . . . 28 3.2.1 Bitboard-Based Representation . . . . . . . . . . . . . . . . . 28 3.2.2 Standard Chess Move Generation . . . . . . . . . . . . . . . . 29 3.2.3 Handling Piece Constraints . . . . . . . . . . . . . . . . . . . 29 3.2.4 Atomic Chess Extension . . . . . . . . . . . . . . . . . . . . . 30 3.3 Monte Carlo Tree Search Implementation . . . . . . . . . . . . . . . . 30 3.3.1 Monte Carlo Tree Search Structure . . . . . . . . . . . . . . . 30 3.3.2 Batching Selections . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.3 Final Move Choice . . . . . . . . . . . . . . . . . . . . . . . . 32 2 Contents 3.4 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.5 Training Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.1 Self Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.2 Calculating Loss . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.3 Replay Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.6 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4 Method 37 4.1 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2 Evaluation of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.1 Internal Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.2 External Evaluation . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 Hardware Utilization and Cost . . . . . . . . . . . . . . . . . . . . . . 39 5 Results 40 5.1 Engine Strength . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.1.1 Model Comparisons . . . . . . . . . . . . . . . . . . . . . . . . 40 5.1.2 Final Playing Strength . . . . . . . . . . . . . . . . . . . . . . 42 5.2 Neural Network and Training . . . . . . . . . . . . . . . . . . . . . . 44 5.2.1 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.2.2 Learning Verification . . . . . . . . . . . . . . . . . . . . . . . 44 5.2.3 Training Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.2.4 Virtual Machine Costs . . . . . . . . . . . . . . . . . . . . . . 46 5.3 Move Generation Component . . . . . . . . . . . . . . . . . . . . . . 46 5.4 Monte Carlo Tree Search Optimizations . . . . . . . . . . . . . . . . . 46 5.4.1 Batching Selections . . . . . . . . . . . . . . . . . . . . . . . . 46 5.4.2 Virtual Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.5 Time Analysis of the Engine . . . . . . . . . . . . . . . . . . . . . . . 47 6 Discussion 49 6.1 Analysis of Final Product and Results . . . . . . . . . . . . . . . . . 49 6.1.1 Final Engine Strength Discussion . . . . . . . . . . . . . . . . 49 6.1.2 Neural Network and Training Process . . . . . . . . . . . . . . 50 6.1.3 Computational Performance Analysis . . . . . . . . . . . . . . 51 6.1.4 Training Data Volume . . . . . . . . . . . . . . . . . . . . . . 51 6.2 Engineering Decisions and Their Effects . . . . . . . . . . . . . . . . 52 6.2.1 Effect of Replay Buffer . . . . . . . . . . . . . . . . . . . . . . 52 6.2.2 Reflection on Programming Language . . . . . . . . . . . . . . 52 6.2.3 Financial Implications . . . . . . . . . . . . . . . . . . . . . . 52 6.3 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6.3.1 Symmetry Augmentation . . . . . . . . . . . . . . . . . . . . . 53 6.3.2 Improving the Middle and End-game Strength . . . . . . . . . 54 6.3.3 Additional Neural Network Inputs . . . . . . . . . . . . . . . . 54 6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.4.1 Question 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.4.2 Question 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3 List of Figures 2.1 Example of white winning a game by the unique explosion mechanic in Atomic Chess. The move results in the capture of black’s knight, king and rook while leaving the pawns unaffected. . . . . . . . . . . . 10 2.2 The left picture showcases an illegal move, since capturing the black queen would result in removing the black knight, in turn opening up a check on white’s king. The right showcases a very similar situa- tion, differing only by a black king, where the same move would be considered legal since it captures black’s king. . . . . . . . . . . . . . 10 2.3 Example of white performing an en passant move. The move results in an explosion around the ending tile of the attacking pawn. . . . . . 11 2.4 The different steps to Monte Carlo Tree Search visualised. Each node has a "visit count/value" pair. "MCTS-steps" by Robert Moss is li- censed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/?ref=openverse. . . . 14 2.5 Example of tree after two iterations where N is visit count, W is value and P is policy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.6 A visual representation of a perceptron. "Perceptron moj" by Mayranna is licensed under CC BY-SA 3.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/3.0/?ref=openverse. . . . 17 2.7 A visual representation of a multilayer neural network. "Multilayer Neural Network" by John Salatas is licensed under CC BY-SA 3.0. To view a copy of this license, visit https://creativecommons.org/licenses/by- sa/3.0/?ref=openverse. . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.8 Example of pattern detectable by 3 × 3 kernel. The diagonal forms a continuous line of values. . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.9 Example of convolutional layer with kernel where 4 grids have been forwarded together. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.10 Visual representation of the difference between learning rate sizes. . . 21 2.11 Visual representation of a residual block. . . . . . . . . . . . . . . . . 23 2.12 Calculation of blockers for a rook in the bottom left corner. . . . . . . 24 3.1 Conversion from a board position to a tensor. . . . . . . . . . . . . . 33 3.2 Structure of the neural network. . . . . . . . . . . . . . . . . . . . . . 34 4 List of Figures 5.1 ELO progression during the final training phase. The last model, Model-8000 was set to an arbitrary ELO of 1,000, the other ELOs are then calculated in relation to their win rate against Model-8000 . 42 5.2 Atomic Chess ELO distribution and the projects final engines place- ment on the chess platform Lichess [17] . . . . . . . . . . . . . . . . 43 5.3 Probabilities for the known best move from MCTS and the neural network in a fixed mate-in-2 test. MCTS ran with 10,000 iterations per move. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.4 Speed up of the engine using different numbers of selections that are batched and sent to the network. . . . . . . . . . . . . . . . . . . . . 47 5 List of Tables 4.1 Rented hardware and associated costs . . . . . . . . . . . . . . . . . . 39 5.1 Comparing win rate model-0 to untrained . . . . . . . . . . . . . . . 40 5.2 Model-8000’s win rates against the other models over 500 matches (1,000 for both). Config: MCTS-iterations: 3,000 . . . . . . . . . . . 41 5.3 Model-4000’s win rates against the other models over 500 matches (1,000 for both). Config: MCTS-iterations: 3,000 . . . . . . . . . . . 41 5.4 The win rate of the model when playing 1,000 games against itself . . 42 5.5 Final cost per unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.6 Average number of select collisions per move in MCTS with and with- out virtual loss for 1,000 and 5,000 iterations per move. . . . . . . . . 47 5.7 Execution Time and Proportion by Component . . . . . . . . . . . . 48 6 1 Introduction 1.1 Background Chess has been a central subject in artificial intelligence (AI) research for decades due to its complexity and well-defined rules. The development of chess engines has evolved significantly over time, from brute-force search methods to sophisticated AI-driven approaches. A major milestone in this evolution was Deep Blue’s victory over Garry Kasparov, the chess world champion at the time, in 1997 [1]. Deep Blue was used by IBM as a demonstration of their processing prowess, but also became an example that handcrafted functions for the evaluation of chess positions could beat even the strongest human players. In 2017, a new paradigm emerged with AlphaZero, which showcased the effective- ness of self-learning AI [2]. Unlike traditional engines with hand-crafted evaluation functions, AlphaZero relied solely on reinforcement learning and self-play, where it in simple terms played games against itself, learned by reinforcement from those games and then repeated this process to continue improving on itself iteratively. With this self-learning approach, AlphaZero progressed from having no prior knowl- edge of how to play chess to becoming the world’s strongest chess engine within 4 hours of training. Chess variants modify the standard chess rules, such as board size, piece movement, or win conditions. These variations present new opportunities for AI research due to the lack the well-established theory and human expertise in these variants when compared that which exists for regular chess. This makes them an ideal testing ground for self-playing AI engines, which develop strong strategies entirely on their own. Since the creation of AlphaZero, computational strength of computer hardware has dramatically improved. The increased interest in machine learning has also led to improvements specifically in hardware that specializes in this. These factors result in a lower barrier of entry to train an AI model similar to AlphaZero and can therefore be achieved for a lower price today compared to in 2017 when AlphaZero was created. This is what this project aims to investigate. 7 1. Introduction 1.2 Purpose This project seeks to build a self-play engine for the chess variant named Atomic Chess, utilizing reinforcement learning to discover effective strategies without hand- crafted heuristics. It also aims to investigate how AI can successfully adapt to unfamiliar strategic environments while working within a level of expenditure on hardware that keeps the training easily replicable, and to discover which optimiza- tions and techniques are the most impactful in improving the performance and training of a model as to train as efficiently as possible on limited hardware. To be more specific, the research questions that this project aims to answer are as follows: • Q1: How feasible is it to train a chess engine through self-play, similar to AlphaZero, using computational resources available to the common enthusiast? • Q2: How can a chess engine using a self-training model be designed and optimized to work at this scope? 8 2 Theory This chapter outlines the theoretical foundations of our Atomic Chess engine. We begin by describing the rules and unique mechanics of Atomic Chess, followed by an overview of chess engines and AlphaZero, the self-learning chess engine that this project is attempting to replicate. Key concepts in Monte Carlo Tree Search (MCTS) and neural networks are then introduced, focusing on the modifications and techniques relevant to AlphaZero and our implementation. The chapter con- cludes with a discussion of move generation techniques, bitboard representations, and related work, providing context for the design decisions presented in Chapter 3. 2.1 Atomic Chess Atomic Chess is a variant of chess that preserves the fundamental movement rules of the original game while introducing a unique capture mechanic, where captures trigger explosions that remove surrounding pieces. This combination results in a game that is both familiar and strategically distinct. 2.1.1 Justification for Chess Variant Selection There were several motivations for selecting Atomic Chess as the focus of this project. Firstly, the explosion mechanic accelerates the rate at which pieces are removed from the board, often leading to significantly shorter games compared to classical chess. This characteristic allows for a higher throughput of self-play games during training, which is particularly advantageous when computational resources are limited. The increased number of training iterations per unit time can lead to faster learning progress and more rapid model refinement. Secondly, because Atomic Chess shares the same piece movements as standard chess, a baseline implementation of classical chess could be reused to verify the correctness of the move generation system using established testing methodologies. Once validated, the Atomic Chess–specific rules could be layered on top, as detailed later in Section 2.1.2. This approach simplifies implementation by isolating the variant-specific logic. Finally, while Atomic Chess is relatively well-known and supported by major chess platforms such as Lichess, it lacks the depth of established theory found in classi- 9 2. Theory cal chess. The absence of extensive opening books and end-game tablebases makes it a particularly suitable environment for a self-learning engine, as it must develop strategies independently without relying on pre-existing human knowledge or heuris- tics. 2.1.2 Atomic Chess Rules According to the rules outlined by Lichess [3] as well as further verification through testing scenarios on Lichess’ Atomic Chess mode, pieces in Atomic Chess have the same moves as in regular chess. During a capture, both pieces are removed from play, and an explosion occurs on the ending tile of the capturing piece, which removes all non-pawn pieces in the 8 adjacent tiles in a 3 × 3 area as shown in Figure 2.1. A capture may neither explode your own king nor may it reveal a regular check on your king, unless it also explodes the enemy king in the process as shown in Figure 2.2. If a player is in check they can either evade it through regular means, but also by exploding the enemy king or the checking piece. Since both pieces involved in a capture explode, kings can not capture. This also allows kings to occupy tiles next to one another since the other king is not a threat. During en passant, both pawns involved in the capture are taken out of play, and the explosion is centred around the tile the attacking pawn lands on, shown in Figure 2.3. Figure 2.1: Example of white winning a game by the unique explosion mechanic in Atomic Chess. The move results in the capture of black’s knight, king and rook while leaving the pawns unaffected. Figure 2.2: The left picture showcases an illegal move, since capturing the black queen would result in removing the black knight, in turn opening up a check on white’s king. The right showcases a very similar situation, differing only by a black king, where the same move would be considered legal since it captures black’s king. 10 2. Theory Figure 2.3: Example of white performing an en passant move. The move results in an explosion around the ending tile of the attacking pawn. 2.2 Chess Engines Chess engines are computer programs made to play chess. For an engine to be able to play well, it needs to know what valid moves there are in a given chess position. To do this, the engine requires a way to represent the board to the computer. In this project bitboards are used for this. The engine then needs a way to generate what valid moves are available from a given board position. Relevant theories to this will be described in Section 2.5. With these parts, a typical chess engine searches for the best moves by simulating legal moves for both players and evaluating the resulting positions from these move sequences. Standard chess is a well studied game and therefore several heuristics to evaluate board positions have been developed. These heuristics are also used by engines in conjunction with different algorithms to rank future moves based upon their strength. Examples of chess engines that use established chess heuristics are Deep Blue and Stockfish [1][4]. Atomic Chess does not have established strategies and heuristics to the same extent, making a different method of evaluation useful, which brings us to AlphaZero. AlphaZero, introduced by Silver et al. [2], employs a generalized learning approach based on reinforcement learning and self-play. The system was designed to master the games of chess, shogi, and Go without any prior knowledge beyond the rules of the game. AlphaZero utilizes a neural network to search and evaluate chess moves and posi- tions. The neural network is given the state of the chess board and returns a value and move policy. This is used to guide the Monte Carlo Tree Search algorithm, further discussed in Section 2.3. The value of a board position is a value between -1 (black win) and 1 (white win) and the move policy is a list of probabilities corre- sponding to different moves. To train the neural network, AlphaZero plays games against itself (self-play) and trains the network on these games. Once a game of self-play reaches a result, a data set is created based on the resulting search tree. This is then compared to new 11 2. Theory predictions on those positions and used to adjust the network to bring future predic- tions closer to the actual result (explained further in Section 2.4.2). As the process is repeated the neural network gets better at evaluating moves and board states which eventually, over many training iterations, results in AlphaZero discovering novel strategies beyond what humans understand. A key distinction between AlphaZero and its predecessor, AlphaGo Zero, is its ability to generalize across multiple board games using the same learning framework. The results presented in the paper demonstrate that AlphaZero surpassed Stockfish, one of the strongest traditional chess engines, within a matter of hours of training. The success of AlphaZero underscores the effectiveness of self-learning AI in complex decision-making tasks. By removing the dependency on human expertise, AlphaZero exemplifies the potential of reinforcement learning to discover novel and highly effi- cient strategies in strategic environments. Its approach has since influenced further AI research within non-deterministic board games such as a simplified version of Chinese Dark Chess where it converges towards optimal play [5]. Additionally, Al- phaZero also has influence in broader AI research such as the prediction of protein structures with AlphaFold [6], although not using the same algorithm or network. 2.3 Monte Carlo Tree Search Monte Carlo Tree Search (MCTS) was not only used in AlphaZero but is a com- mon search algorithm used in decision-making problems, particularly in games with large state spaces and high branching factors. Unlike other search algorithms like minimax, which rely on exhaustive exploration of the game tree, MCTS uses a prob- abilistic approach to balance exploration of new moves and exploitation of known promising moves. This makes it particularly well-suited for games like chess and its variants, where the number of possible positions is too large to evaluate exhaustively. MCTS was chosen for this project because of its proven success in self-learning chess engines like AlphaZero. Its ability to dynamically balance exploration and exploitation makes it ideal for discovering strategies in a domain with limited prior knowledge, such as Atomic Chess. By combining MCTS with neural network-guided evaluations, our engine can efficiently navigate the game tree and develop a deep understanding of the variant’s unique dynamics. 2.3.1 Standard Monte Carlo Tree Search The implementation of MCTS used by AlphaZero features some key differences to the standard implementation. To better understand these differences, it is important to establish how standard MCTS works. MCTS is a tree search algorithm where each node represents a game-state and contains some statistics. It starts off with the root node which represents the current game state and performs search iterations according to a budget, which can be a set number of iterations or a time limit. Each MCTS search iteration operates in four main phases [7]: 12 2. Theory 1. Selection: Starting from the root node, the algorithm traverses the tree by selecting moves that maximize the Upper Confidence Bound for Trees (UCT) formula. 2. Expansion: When the algorithm reaches a leaf node (a node that has not been fully explored), it expands the tree by adding one or more child nodes representing possible moves from that position. 3. Simulation: From the selected leaf node starts a rollout that plays out random moves until it reaches a terminal state in the game (a player winning or a draw). 4. Back propagation: After the simulation, the result is propagated back up the tree (from the leaf node that started the rollout), updating the statistics of each node along the path. The selection phase uses the Upper Confidence Bound applied to Trees (UCT) which, for all valid moves possible from a position, is the following formula: w n + C · √ ln(N) n (2.1) The w n term is the winrate of the move corresponding to a node, where w is the value and n is the visit count of said node. The N variable is the visit count of the current state. The two terms added represent how exploitative and explorative the search algorithm is. The win rate is the exploitative part whilst the second term is the explorative. The explorative term has a constant C, known as the exploration parameter, which affects how likely the algorithm is to explore different moves instead of exploiting the same high win rate move. The parameter is set depending on the application and how you want the search to behave. 2.3.2 AlphaZero’s Monte Carlo Tree Search The differences to standard MCTS mainly revolve around the fact that AlphaZero integrates a neural network into the search algorithm. Since neural networks will be covered later in Section 2.4, we can for the time being think of the neural network as an oracle f . Given the current state of the chess board (s) our oracle will give us a value for how good the current position is for the current player (v) and a move policy for how good each move from this position is (p): f(s) = (v, p) (2.2) The selection phase for this version of MCTS uses a different formula known as Predictor UCT (PUCT). For each valid move a from state s, the PUCT formula 13 2. Theory Figure 2.4: The different steps to Monte Carlo Tree Search visualised. Each node has a "visit count/value" pair. "MCTS-steps" by Robert Moss is licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/?ref=openverse. 14 2. Theory looks as follows [2]: Q(s, a) + C · P (s, a) · √ N(s) 1 + N(s, a) (2.3) Compared to the regular UCT formula (Equation 2.1), there are some differences, mainly the addition of the prior probability P (s, a) from the neural network. The Q(s, a) term is the action value. The second term contains the exploration constant C, prior probability of action a as P (s, a) and the visit counts N. Next in the ex- pansion phase there are some big differences. When the search has reached a leaf node it expands the tree by creating new nodes for every valid move available. It then performs f(s) on the current board state to get a value and move policy which it assigns to the corresponding nodes. After the expansion is done, there is no sim- ulation with random rollouts, instead the value given by f(s) is propagated. Once it has propagated up to the root of the search tree we have completed one iteration. Figure 2.5 shows what the search tree looks like after two iterations from the begin- ning of the game. Firstly it expands the root node with every valid opening move for white and assigns a move policy value, P. Since it is the root it does not back- propagate any value and only increases the visit count of the root. Next iteration it chooses Knight to h3 as the most promising move and expands it with all the opening moves for black. It also assigns the board position a value of 0.1, which is then backpropagated to the root. 15 2. Theory Figure 2.5: Example of tree after two iterations where N is visit count, W is value and P is policy. Having introduced the search algorithm used by AlphaZero we can now hone in on the oracle f(s) which is a neural network. 2.4 Neural Networks Artificial neural networks are computational models that are inspired by the struc- ture and function of neural networks in the biological brain and are an integral part of AlphaZero. The simplest neural networks are single artificial neurons, called perceptrons, that can be used to classify inputs. Looking at Figure 2.6 we see a per- ceptron that takes in the inputs x, and multiplies them with the weights w. These weights are used to tell the perceptron what parts of the input are important. For example if we want to classify an image the different inputs could be different pixels. With these inputs we compute the weighted sum and also add the bias θ, which is fed into an activation function. 16 2. Theory Figure 2.6: A visual representation of a perceptron. "Perceptron moj" by Mayranna is licensed under CC BY-SA 3.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/3.0/?ref=openverse. In Figure 2.6 the activation function used is a simple step function that gives 1 as an output if the sum exceeds zero and otherwise outputs 0. AlphaZero utilizes tanh and rectified linear unit (ReLU) as activation functions in its neural network: tanh(x) = ex − e−x ex + e−x (2.4) ReLU(x) = max(0, x) (2.5) Tanh is a function that outputs a number between -1 and 1. ReLU is a function that sets any negative values to 0 and otherwise uses the original input [8]. Perceptrons on their own are not particularly complex so for a neural network to be “smart” we need to combine multiple perceptrons into layers. For a neural network we will always have an input and output layer but we might also have layers between them known as hidden layers. The structure of layers can be seen in Figure 2.7. Every perceptron in the first hidden layer is connected to each of the three inputs, and same for the second hidden layer but it is connected to the first hidden layer. The output layer has two perceptrons delivering 2 values, so the whole network implements a function: R3 → R2. 17 2. Theory Figure 2.7: A visual representation of a multilayer neural network. "Multilayer Neural Network" by John Salatas is licensed under CC BY-SA 3.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/3.0/?ref=openverse. 2.4.1 Neural Network Layers We may say that a neural network is a structured set of interconnected layers. The number and types vary between networks and applications. In the following sections, fully connected and convolutional layers are explained since they are relevant to chess engine networks. Fully Connected Layers The layers displayed in Figure 2.7 are so called fully connected layers. This is because the perceptrons in each layer has a connection to each perceptron in the previous layer. While this wiring structure is feasible for small networks, it has very poor scalability for more complex and deep networks such as those required for image identification, for example. If we want to identify an image that is 256 × 256 pixels, we need 65,536 inputs. Now say we have a fully connected hidden layer that has 20 perceptrons. This gives us 1,310,720 weights that we need to update when we backpropagate through the network. For tasks such as image recognition it has therefore become much more common to use convolutional layers. Convolutional layers In many tasks involving grid-like data structures such as images or chess boards, it is important to detect patterns that depend on the relative positions of neighboring elements. A convolutional layer is a type of neural network layer specifically designed for this. Rather than connecting every input to every output neuron as in a fully 18 2. Theory connected layer, a convolutional layer applies a small, learnable function called a kernel or filter over small regions of the input at a time [9]. In the 2D case, the kernel is a small matrix with a fixed size that moves across the input grid. At each position, it performs a calculation based on the input values it covers. The distance it moves between each step is called the stride. To ensure the output size remains consistent with the input, convolutional layers can use a technique called zero padding, which adds extra rows and columns of zeros around the edges of the input grid. This concept also generalizes to higher dimensions. Multiple 2D grids can be forwarded through a convolutional layer together the re- sulting operations will treat the input as a tensor of size x × y × num_grids. This allows for related information stored into equal sized 2D grids to be processed to- gether, such as the RGB colour values of a coloured image. In this case, assuming we wanted set the 2D convolutions to be 3 × 3 in size, the effective size of the 3D convolution would be 3 × 3 × 3 to account for the 3 RGB layers. Each convolution would then compute each kernel position using a set of 3 × 3 × 3 weights to produce a final value. The reason for their usage in fields such as image recognition and local pattern recognition in applications like board games is their ability to handle bigger inputs, unlike regular layers that would need a great number of perceptrons and connections and have a higher likelihood of overfitting and have difficulties learning meaningful patterns from training. Convolutional layers are a great fit for tabletop games like chess, played on a two dimensional grid since they can recognize patterns regardless of where they appear on the board, in other words, they have positional invariance. Figure 2.8: Example of pattern detectable by 3 × 3 kernel. The diagonal forms a continuous line of values. 19 2. Theory Figure 2.9: Example of convolutional layer with kernel where 4 grids have been forwarded together. 2.4.2 Learning Process of Neural Networks Neural networks are not “smart” by design, in fact the weights start out with random values. To get “smart” it needs to learn by producing guesses and improving based upon them, also called reinforcement learning. To get an idea of the process we will first go through a general overview to then get into specifics about AlphaZero. Neural Networks can be seen as a function that, when provided an input, should produce an expected output. This pair between input and expected output creates the basis for training the network. Giving the neural network an input should hopefully give us the expected output, but for a newly created neural network with randomized weights, this is highly unlikely. Comparing the difference between actual output with the expected output we can compute the loss, which is a measure for how “wrong” the predicted output from the neural network was [8]. The neural network is improved by reducing this loss, which is done by adjusting the network’s weights accordingly. Each weight in the network contributes differently to the overall loss and during backpropagation, we compute how much a small change in each weight would affect the loss. This value is called the gradient for that specific weight. By backpropagat- ing the loss through the network and calculating these gradients, we can determine in which direction to adjust each weight to reduce the loss. The amount the weights are changed is proportional to the magnitude of the gradient that was calculated. How much we move towards the direction which decreases the loss is also influenced by the learning rate, and the bigger the learning rate the more change in weight. This process of using training data to find out how “wrong” the neural network is repeats until we find an optimal solution to our problem. When deciding the 20 2. Theory learning rate we want to find one that is neither too high nor too low. Figure 2.10 shows the downsides when using learning rates which are too high or too low. The weight θi and the effect of it on the loss function l(θi) are the axes. For the high learning rate example the weight is initialized low, but misses the global minima because of the slope and learning rate, and settles in a less optimal value. For the low learning rate example we only find the closest minima, which is not the global minima. Figure 2.10: Visual representation of the difference between learning rate sizes. If we look at AlphaZero’s input and output, it takes in the current state of the chess board as the input and outputs a value and a move policy [2]. To get the expected output for the neural network we will use the results we get from the MCTS algo- rithm. A vector can be formed where each value is the number of visits to that node divided by total number of visits to all nodes from the starting position. With these values we can compare the output of the neural network with the move probabilities we get from our MCTS to compute a loss. This setup reflects a reinforcement learn- ing framework, where the model learns to improve through self-play. AlphaZero’s loss function utilizes a sum of cross-entropy losses between the MCTS policy pre- diction and a new model prediction with the mean-squared error between the value prediction with the game result. L2-regularization is also applied to prevent the weights in the model from endlessly growing. The loss function can be seen below in Equation 2.6: 21 2. Theory l = (z − v)2 − πT log p + c||θ||2 (2.6) The first part of the loss function is the mean-squared error of the value v given by the neural network and z is the result of the self-played game, with it being 1 if the player who’s turn it is won and -1 if they lost. The second part is the cross-entropy loss of the policy p given by the network. The vector p is multiplied by a transposed vector π which is the policy based on the tree search mentioned earlier. The last term is used for regularization for large weights and consists of the network param- eters (the weights), θ, and a regularization coefficient, c. The coefficient determines how much to punish large weights to reduce overfitting. AlphaZero addresses the issue of learning rate size by using different learning rates depending on how far it is into training [2]. At the start of training the learning rate is set to 0.2 and is then updated depending on how many times it has updated the weights in the network. After 100,000 updates the learning rate changes to 0.02, after 300,000 it changes to 0.002 and after 500,000 it changes to 0.0002. 2.4.3 Residual Networks When backpropagating through deep neural networks we run the risk of getting vanishing or exploding gradients. As we travel backwards through the network to update weights the impact of the loss is either lessened or blown out of proportion due to the chain rule. To prevent this we can use something called a residual network, also known as ResNet. These networks introduce skip connections that bypass layers of the neural network and combine the output of the skipped layers with the input to the block, referred to as residual blocks. With this skip connection there is an alternate, less computationally heavy road backwards through the network that means earlier layers also get meaningful weight updates. This has proven to be effective at training and optimizing deep neural networks and is used in AlphaZero’s network construction (discussed further in Section 2.4.4) [2][10]. Batch normalization is a widely used technique in residual blocks to stabilize and accelerate training in deep neural networks. It works by normalizing the inputs within a batch of inputs, adjusting them to have zero mean and variance based on the batch’s statistics. This process reduces internal covariate shift, allowing the network to train faster and more effectively by maintaining more stable gradients throughout the layers [11]. In Figure 2.11 we see a residual block with a skip connection going above it. In the F (X) path we have convolutional layers (Conv) and batch normalizations (BN) which are added before each ReLU function. After the second batch normalization, the skip connection is added to F (X) before the next ReLU function which is the end of the residual block. 22 2. Theory Figure 2.11: Visual representation of a residual block. 2.4.4 AlphaZero Network Construction The Neural Network in use by AlphaZero as described in the Supplementary Ma- terials [2] consists of a body, which is responsible for the majority of calculations in the network, which then feeds into two separate heads. The heads consist of a value head and a policy head, each of which are responsible for extracting features valuable for either the value or policy from the main body prediction and format- ting them to become the value and policy output. The body consists of 20 parts, the first of which is a convolutional layer which is responsible for batch-normalizing the inputs into a range around 0 and rectifying all negative values to 0. This is followed by 19 residual blocks as seen in Figure 2.11. Each convolution layer applies 256 filters of kernel size 3 × 3 with stride 1. After the body it splits into the value and policy heads. The policy head applies an additional rectified, batch-normalized convolutional layer, followed by a final convolution of 73 filters. The value head applies an additional rectified, batch-normalized convolution of 1 filter of kernel size 1 × 1 with stride 1, followed by a rectified linear layer of size 256 and a tanh-linear layer of size 1. 2.5 Bitboards and Magic Number Move Genera- tion Bitboards and magic numbers are used to calculate the legal moves from any given position. Bitboards are an efficient way to represent chess positions using 64-bit integers, where each bit corresponds to a square on the chessboard. This repre- sentation allows for fast manipulation of piece positions using bitwise operations, 23 2. Theory rather than traditional array-based approaches for storing a 2D space like a chess- board. By leveraging the inherent parallelism of bitwise operations, bitboards enable highly optimized and computationally efficient legal move generation. Since there are multiple pieces and piece types in chess, each type of piece typically has its own associated bitboard. The main advantage on the computation side is allowing for the pre-computation of possible moves for sliding pieces, making it possible to avoid evaluating whether each position the sliding piece could move to is blocked. Instead, since each bitboard is also a 64-bit number, all of them can be combined using OR operations into a single bitboard containing information regarding positions of all pieces and possible blockers. By then executing a bitwise AND operation on the combined bitboard and a mask of possible moves for the piece, keeping only an oversight of pieces capable of blocking the sliding piece. Since the bitboard containing information of all pieces can be computed once at the start of move generation for the turn, only the bitwise AND operation has to be performed to calculate all blockers making it extremely fast. Figure 2.12: Calculation of blockers for a rook in the bottom left corner. At this stage, the only thing that remains is calculating which spaces the piece is able to move to. As shown in Figure 2.12, there are a total of 12 spaces which can be occupied by a blocking piece, meaning that in that instance, there exists 212 or 4096 possibilities of blockers for a rook positioned in the absolute bottom-left square. This is a small enough number that it is possible to pre-compute all possible movements per blocker configuration and store them into a lookup table for later, meaning that the computation can be simplified to a simple value lookup making it extremely fast as well. This process can then be repeated for all 64 squares as well as once for both rooks and bishops creating a full set of lookup tables for all sliding pieces, with queens being able to use both rook and bishop lookup tables sequentially for their movements. After the lookup, an additional check is carried out to see if the target square is of the same colour as the piece the calculations are being carried out for, in which case the move is rejected as invalid. After the allowed moves have been pre-generated, they can be stored into a hashmap 24 2. Theory which allows for compact storage and psuedo-constant lookup times. One further optimization that is often performed is utilizing a custom hashing algorithm utilizing magic numbers to avoid collisions. Since we can know all of the possible keys that will be input into the hashmap, it is also possible to pre-compute a magic number which will allow us to avoid collisions when looking up values, meaning we can skip the key comparison step as well as ensure constant time on lookups. The most common version of this hashing algorithm is to multiply the bitboard by the magic number, then keep only the most a set number of highest bits decided by the index position of the moving piece. Originally, these magic numbers were pre-computed using a brute-force search during development, due to each piece type and square needing their own number. These numbers have been widely disseminated onto the Internet since. 2.6 Parallelization Parallelization is critical in this project due to the computational demands of training a self-learning chess engine. The two main processes that benefit from parallelization are MCTS and neural network training. 2.6.1 Monte Carlo Tree Search Parallelization It is not trivial to effectively parallelize the MCTS algorithm in the case of a self- learning engine. One common strategy for pure MCTS is to do root parallelization, where multiple threads create separate MCTS trees in parallel and later sum the visit counts to get a more accurate result. However, this strategy will not work in the case of AlphaZero’s modified MCTS implementation. The reason for this is that the random rollout step in the algorithm is switched out for a neural network that always values a position the same way. This results in multiple identical trees if multiple threads start from the same root, as the same path through the tree will always be chosen. An alternative approach is tree parallelization, where multiple threads operate on the same search tree simultaneously. However, this can lead to performance issues, as threads tend to compete for access to the nodes with the highest PUCT scores, resulting in frequent contention. To address this, virtual losses are used, a proven technique for reducing such conflicts [12]. The core idea is to temporarily assign a “virtual loss” to nodes selected by other threads, artificially reducing their PUCT scores. This discourages other threads from choosing the same high-priority nodes, spreading the exploration more effectively and reducing the risk of redundant com- putations. 2.6.2 Training Parallelization Similarly, neural network training involves processing large amounts of self-play data, which can be accelerated through parallelization. By distributing the workload 25 2. Theory across multiple CPU or GPU cores, the training process becomes more efficient, al- lowing for faster iterations and better utilization of limited computational resources. One example would be to utilize multiple threads to play individual games, meaning that the number of moves generated to train using could scale with the number of available cores on a CPU, assuming there is no bottleneck on the GPU. Given the constrained hardware available for this project, parallelization is not just a performance optimization but a necessity. It enables the engine to complete more training iterations and play more games within a fixed time-frame, ultimately im- proving the engine’s playing strength and the quality of the learned strategies. 2.7 Previous Bachelor Theses and Related Work Projects similar to ours have been done as bachelor’s thesis before, and we had access to two of those reports [13] [14]. By examining the challenges, limitations, and insights from earlier work, we can identify strategies to improve the current project and avoid common pitfalls. Note that these projects made engines for different chess variants, differing in complexity to Atomic Chess. Below, we summarize the key takeaways from previous bachelor theses’ findings that are relevant to this project. • Challenges and Limitations: Both projects encountered significant chal- lenges related to overfitting and computational resources. Overfitting was a recurring issue, where models performed well on training data but poorly on validation data. Techniques such as weight decay, dropout layers, and increas- ing dataset size were suggested as potential solutions. Additionally, compu- tational constraints were a major bottleneck, particularly for training deep neural networks and running MCTS with high iteration counts. These limita- tions highlight the importance of balancing model complexity with available hardware and optimizing training efficiency. • Neural Network Architectures: One project successfully implemented a convolutional neural network (CNN) with a reduced number of hidden layers (13 layers instead of AlphaZero’s 19) to manage computational constraints. Despite the simplification, the network demonstrated good learning capabili- ties, suggesting that smaller-scale projects can achieve meaningful results with simpler architectures. 26 3 Implementation This chapter aims to explain the components of the final product and how they are constructed. Firstly, an overview of the entire implementation is given. The chapter then goes into more detail explaining the individual components. The chapter will build on most of the concepts that were introduced in Chapter 2. 3.1 Implementation Overview The architecture consists of four core components: a game representation system, a neural network, a Monte Carlo Tree Search (MCTS) algorithm, and a training framework. The board states and the move generation logic are implemented using bitboards and allows the program to compute legal moves for Atomic Chess. The engine was developed in C++, primarily due to the language’s strong perfor- mance characteristics and its compatibility with Libtorch, the C++ distribution of PyTorch. C++ also has a wide variety of other libraries which are made with perfor- mance in mind, which makes it a fitting language for a project where the importance of execution speed is paramount. The network made with LibTorch adopts a residual network architecture similar to that used in AlphaZero, but scaled down in size. Our implementation processes board states through 12 residual blocks, producing two outputs: a policy vector where each value is an evaluation by the network on how advantageous that op- tion would be if played, with the entire vector being normalized. representing the probability distribution over legal moves, as well as a scalar value estimating how advantageous the current position is for the turn player. To guide decision-making during gameplay, the engine employs Monte Carlo Tree Search. The search algorithm uses the PUCT formula to balance exploration and exploitation and performs backpropagation to update node statistics based on net- work evaluations. Batch processing is used to accelerate inference across multiple nodes. Training is organized into iterations where the engine plays games against itself, 27 3. Implementation collects positions from the game in a replay buffer, and updates the neural network based on a batch of sampled data from the replay buffer. To train, debug, and evalutate the models and more, several modes were developed to help perform these tasks. These are the main modes: • Arena: for head-to-head matches between two models. • Perft: for validating move generation correctness. • Dojo: for self-play training. • FEN: for starting games from specific board positions. • Lichess-UCI: Implements UCI (Universal Chess Interface), this is used to deploy the bot on Lichess.org. Additionally, a graphical interface was introduced where the board state can be seen, both during training and of course when playing against the engine. Now we will be diving deeper into the individual components and their implemen- tation details. 3.2 Game Rules and Move Generation Move generation is a foundational subsystem of the chess engine, responsible for enumerating all legal actions from a given board state. This component is used in conjunction with the neural network, which returns the move policy with the moves it thinks have the highest win probability. The move generator’s list of legal moves is then used to invalidate all the illegal moves that the neural network suggested, leaving a complete move policy. The implementation was carried out in two distinct phases: initially supporting standard chess to ensure correctness and ease of verification, followed by the in- corporation of Atomic Chess mechanics. This staged development strategy enabled reliable testing and debugging prior to introducing the additional complexity of the atomic variant, and was feasible as the atomic features are simple to add on top of regular chess rules. 3.2.1 Bitboard-Based Representation Bitboards are used to represent the board state, where the 64 squares of the board are represented using 64-bit integers. Separate bitboards are maintained for each player’s pieces, so there are 12 bitboards in total, 6 unique piece types per player. Each bitboard also encodes the occupancy and attack patterns for every piece type on the board. 28 3. Implementation 3.2.2 Standard Chess Move Generation Legal move generation is sorted by how the pieces move, with specialized logic for sliding, non-sliding, pawns and castling. Sliding Pieces (Bishop, Queen, Rook) : Sliding piece attacks are computed using directional occupancy masks and magic bitboards, explained in Section 2.5. Meaning it works by multiplying the occupancy on relevant squares by a magic number. Then it shifts the result, and using it as an index into a precomputed lookup table. This table stores all possible attack patterns for each unique blocker scenario along a given direction. Non-Sliding Pieces (King, Knight): Kings and Knights always have the same movement pattern, and cannot be blocked by other pieces in the same way, except if allied piece is on the square already. But this means that the non-sliding piece attack sets are stored in static lookup tables that are indexes by their square. Pawns: Pawns are more complex as they have many unique features. They require asymmetric logic for forward movement as the piece only moves in one direction relative to the player. Diagonal captures, are easy to compute as they work the same way as a non-sliding piece. Promotions is handled using separate logic that simply checks whether a pawn has reached the last rank. Similarly, single and double advances are checked depending on rank. Lastly, en passant legality is verified after move generation to ensure it does not expose the king to discovered checks. Castling: Castling rights are validated via bitwise tests: verifying that both the king and the rook in question have not moved, that the path is unoccupied, and that intermediate squares are not under attack. 3.2.3 Handling Piece Constraints Check Detection: All opposing piece attacks masks are checked to determine if the side to move is in check or not. When a king is under check, its legal moves are restricted to three actions. These are either capturing the checking piece, blocking its line of attack, or moving the king to a non-attacked square. Pin Detection: Pinned pieces are identified by scanning all straight lines extend- ing from the king’s square, in the same manner a queen can move. If a friendly piece lies on the same line between the king and an opposing bishop, rook, or queen, and there are no other pieces blocking the path, it is considered pinned. Such a piece is then restricted to moving only along that line, if at all. Legal move generation begins with a set of pseudo-legal moves, which are then filtered by applying the constraints described above. The filtering logic then excludes moves that violate these constraints. 29 3. Implementation 3.2.4 Atomic Chess Extension After implementing and verifying that the standard rules of chess detailed above worked, the Atomic Chess additions were added. Explosion Capture: After a capture, the piece that made the capture is removed. Then a 3x3 explosion mask is applied to the area. The board state is updated accordingly by clearing the relevant bits in each affected bitboard, leaving pawns unaffected, as they are immune to nearby explosions. Modified Legality Checks: Captures that would result in the player’s own king being removed are disallowed. This requires a re-evaluation of the resulting board state post-explosion, ensuring that all remaining pieces are valid under the variant’s rules. Pinned pieces are allowed to move in a way that checks their own king if the resulting move would result in capturing an opposing king through an explosion. This requires the logic handling pinned pieces to be re-evaluated as pinned pieces could never be moved in regular chess. Impact on Attack Evaluation: Attack generation routines were extended to consider the effects of explosions on adjacent pieces, influencing king safety and positional evaluation. 3.3 Monte Carlo Tree Search Implementation Monte Carlo Tree Search (MCTS) was chosen as the engines decision-maker because of its use in AlphaZero, where they proved that it is an effective implementation when using neural networks and in environments with high branching factor, such as a complex game like chess that can alter the board state drastically in just a few moves. The implementation is based on the PUCT formula similar to AlphaZero’s implementation. 3.3.1 Monte Carlo Tree Search Structure A tree-node data structure is the core of the MCTS, which represents a node in the game state search tree. Each node maintains the following: • A reference to its parent and a list of children. • The move that led to the node. • The game state the node represents • PUCT statistics: total accumulated value, visit count, and the prior probabil- ity provided by the neural network. 30 3. Implementation The MCTS steps are similar to what was described in Section 2.3.2. In our imple- mentation, the Monte Carlo tree is constructed by first initializing a root node and then performing a series of iterations, where each iteration executes the following steps. 1. Selection: Selection works by, at each level of depth, selecting the child node corresponding to the move with the highest PUCT score, calculated as: PUCT(a) = w n + p · C · √ N 1 + n (3.1) where n is the visit count of move a. The total visit count of the current parent node is N , the average value (win rate) of simulations through move a is then w n , the prior probability of move a provided by the neural network is p and C is the exploration constant. 2. Expansion and Evaluation: When the search reaches a leaf node, an expand function is called. This function performs the following steps: • Generates all legal moves from the current game state. • Evaluates the position using the neural network, which returns a policy vector (prior probabilities) and a scalar value (position evaluation). • Applies softmax normalization to the policy outputs. • Initializes child nodes for each legal move using the normalized priors. 3. Backpropagation: Once a leaf node has been expanded and evaluated, the scalar value is propagated back through the visited path. During this phase, the total value and visit count for each node along the path are updated. The value is inverted from positive to negative at each step to reflect the opponent’s perspective. 3.3.2 Batching Selections In order to leverage the strength of GPU parallel processing, some additional modi- fications to the original MCTS algorithm were made. In the original algorithm, only one node could be selected at a time and have its game-state be processed by the neural network. This however, does not leverage the parallel nature of the GPU. Instead we want to send a big batch of multiple game-states to the network and evaluate them all at the same time. So by running multiple selections and batching their game-states for a batch evaluation, the search will be much faster on a GPU. To make this change work, already selected nodes are skipped during the selection phase to not evaluate duplicate positions. But this introduces an issue where all 31 3. Implementation selections will try to go down the same path in the tree. This is resolved through the use of virtual loss. 3.3.3 Final Move Choice Once exploration has ended, a move suggestion can be extracted from the tree structure. Two different strategies are implemented for final move selection and are used depending on the circumstance. The first strategy is choosing the node with the highest visit count, pure exploitation. This is used during real play against other opponents, because we want to play what the neural network thinks is best. Nodes having high visit counts indicate that the move is likely to be good because it has been selected more often than others using the PUCT-formula. The second strategy is proportional sampling, which results in more exploration. This is used during self-play to encourage variations in the training games. The strategy works by looking at all the root’s children’s visit-counts and divide them by the total visits for all the children to get a list of probabilities. Then choosing the move randomly based on the probabilities. For instance, if you are at one node with three child nodes, with 70, 20 and 10 visit-counts respectively. Then we will randomly pick between them, with the nodes having a 70%, 20% and 10% chance of being chosen respectively. This will most often result in the best move being picked, but allows for less visited nodes to be expanded and explored as well. 3.4 Neural Network The neural network used in the engine is a deep convolutional neural network, similar in structure to that of AlphaZero, but scaled down in size. Network Input To process the game state with the neural network, the board must first be encoded as tensors. Each of the 12 bitboards are converted into an 8x8 tensor. These tensors are then stacked along the channel dimension, forming a final single tensor with 12 channels, similar to how colour channels work in image processing. This is illustrated in Figure 3.1. This final tensor is the input to the neural network without any other information. Additionally, because chess is both spatially and rule-symmetric, we can mirror the board vertically and swap the colours of the pieces when it is black’s turn. This ensures that the network always sees the position from the perspective of the player to move. As a result, the network does not need to learn separate strategies for white and black, or learn to shift its perspective. 32 3. Implementation Figure 3.1: Conversion from a board position to a tensor. Network Structure Once the game-state has been converted into a tensor and passed into the neural network as an input. The input tensor of size 8 × 8 × 12 first goes through an initial convolutional layer consisting of 128 filters, each with a kernel size of 3 × 3 × 12 and a stride of 1, which produces a tensor of size 8 × 8 × 128. This is followed by 12 residual blocks, each with 128 filters of kernel size 3 × 3 × 128 and stride of 1, outputting tensors of the same size as the input, as illustrated in Figure 2.11. Due to the resource constraints of this project, a shallower network was implemented than the 19 residual blocks used by AlphaZero as was discussed in Section 2.4.4. Lastly, the network needs to output the policy and value heads. These are respon- sible for guiding the search and evaluating the game state. The policy head begins with a 2 filter 1 × 1 × 128 convolution that reduces the number of channels to 2, yielding a 8 × 8 × 2 tensor that is flattened to a vector of size 128. This is followed by a fully connected layer that outputs the move policy head with all the move probabilities, where the element at each position in the output represents a move. For instance, the first element in the output (a1a2) represents a move for square a1 to a2 which contains the likelihood that this move is the best move. The value head starts similarly to the policy head with a 1 × 1 × 128 convolution to reduce the channels to 1, followed by a ReLU activation. The resulting tensor is then flattened to a vector of size 64, and passed through a ReLU-activated hidden layer outputting 33 3. Implementation the same size, before being reduced to a single value through a last perceptron. The final output is a single scalar value representing the expected outcome of the game from the current position. This separation of the network value and policy head allows the network to perform both move prediction and position evaluation. The overall structure of the network is illustrated in Figure 3.2. Figure 3.2: Structure of the neural network. 3.5 Training Setup The training setup, like most components, draws inspiration from AlphaZero. The iterative process goes through two main phases, generating training game by self- play, then using this data to update the neural network. 3.5.1 Self Play During training, the engine goes through multiple training steps. For every training step, the engine uses self-play to generate a set amount of training data. By using proportional sampling in self-play, see Section 3.3.3, we ensure a diverse set of train- ing games. Each data sample consists of a chess position, the MCTS probability distribution for that position and the final outcome of the game. By training the network on these data samples, the model should gradually improve as more generations are completed. The assumption here is that the MCTS move distribution and final outcome is more accurate than the output from the network. So by having the model continuously try to mimic this, it should improve a small amount each training step. 3.5.2 Calculating Loss The loss between a data sample and the model’s prediction is calculated in two parts. First, the value loss is calculated to determine how accurately the model can predict the final outcome of the game. This is calculated as the mean squared error between the final game outcome (z), and the model’s prediction (v), where a 1 represents a win, 0 a draw, and -1 a loss for the acting player. The term then becomes: 34 3. Implementation (z − v)2 (3.2) Second, the policy loss determines how accurately the model can predict what the best moves are. This is calculated as the cross-entropy between the MCTS move probability distribution and the model’s policy head. This term becomes: −πT log p (3.3) These two values are then added together and the resulting loss function is written in Equation 3.4. l = (z − v)2 − πT log p (3.4) A Stochastic Descent Optimizer is used to optimize the model. The optimizer is originally set with a learning rate of 0.01 and a momentum of 0.9. The optimizer is also set to a weight decay of 1e−4. The loss Equation 3.4 does not include a term responsible for L2-regularization unlike the formula used by AlphaZero, as seen in Equation 2.6. This is because the weight decay parameter in the optimizer will ensure that L2-regularization is carried out each time the optimizer is used. 3.5.3 Replay Buffer After a batch of training data is created, it is stored in a replay buffer of training samples. By using a buffer of training examples, the model can sample from a diverse and robust dataset, not only the latest games. Over time, the oldest data is overwritten to prevent the model from training on outdated training data. Another feature added to prioritize newer training data is “prioritized experience replay” [15]. By using this, newer samples are prioritized when sampling from the replay buffer. The size of the replay buffer varied during development but was set between 100,000 to 200,000 samples. For each training step, a mini-batch of 4,096 is sampled from the replay buffer and used to update the model, similar to what AlphaZero did [2]. For each of these training steps, about 2,056 additional samples are generated through self-play and added to the buffer. 3.6 Parallelization Parallelization enables multiple training games to run concurrently, significantly in- creasing the speed at which data samples can be generated. At the start of training, 35 3. Implementation a variable number of permanent worker threads is spawned. All workers indepen- dently play games and, at the end of each game, create samples and add them to the common replay buffer. Once enough samples (per training step) have been created, the first worker to end its game is responsible for launching the training routine that updates the weights of the network, at which point all other threads remain idle. Importantly, workers may still have ongoing games while weights are updated, but they are never selecting moves, i.e. performing MCTS due to the back-propagation algorithm not being threadsafe. In practice, this was simply done by using a reader-writer mutex on the actual model object, where updating the weights of the network is considered a writing action, and selecting a move is considered a reading action. In essence, all workers shared a single model, and updates to the weights of the neural network were coordinated in a round-robin fashion. 36 4 Method This chapter describes the methods used to train and assess the strength of our engine. 4.1 Training Process Training a large convolutional neural network and generating self-play games is very computationally costly, therefore, we opted to train only a single neural network model and not try any other architectures. Neural network weights between different architecture can not be reused or transferred to a different network structure. If we had explored multiple different architectures, then we would essentially be wasting large amounts of time computing training models that would later be discarded, so we decided against this. Instead we trusted the general neural network structure that AlphaZero had de- scribed and stuck with the scaled-down version of it described in Section 3.4. There- fore, the majority of the time spent after creating an initial working training pipeline was spent refining the pipeline as opposed to hyperparameter tuning the network. The main adjustments to the training pipeline are the following: Multithreading optimizations: The self-play and training loops were better optimized to better utilize the available hardware and reduce bottlenecks, resulting in a higher throughput of games. The primary changes were running the self-playing games in parallel and letting an idle thread start the training routine without waiting for the others to finish playing games. Resizing of replay buffer: The size of the replay buffer was altered in the middle of training from 100,000 to 200,000, helping to balance the trade-off between data freshness and remembering long-term strategies. Adjustments to learning rate: The learning rate was also tuned manually dur- ing the training to prevent stagnation and continue the learning effectively. One problem with this manual tweaking and altering of the training pipeline is that 37 4. Method it makes tracking progress more difficult as there are more changing variables. Mak- ing it difficult to pinpoint exactly what is contributing and the magnitude of the contribution to the learning process. It is also hard to quickly benchmark if the changes made were in fact an improvement as training is a very slow process and it takes hours before a noticeable difference in playing strength can be observed. Therefore we mostly tweaked based on our hypothesis around the previously men- tioned trade-off between data freshness and remembering long-term strategies. 4.2 Evaluation of Models Throughout the training, multiple training checkpoints were made between training iterations. To evaluate the strength of these models and also the final strength of the best model, two methods for evaluation are used. 4.2.1 Internal Evaluation To compare our different models an arena mode was developed. In this mode two models play against each other for a set number of games. The first model plays as white and the second model as black. The number of wins for white, black and the amount of draws that the games result in are then outputted. The set number of games per side was set to 500, then the sides were reversed for another 500 games, to create 1,000 games between two models in total. This makes it easy to calculate the win rate between the models on white / black / both. 4.2.2 External Evaluation To evaluate the final model, the chess engine was deployed on Lichess.org on a dedicated bot account. This was done with the chess engine mode Lichess-UCI, where a computer runs our engine and upon receiving a match request from Lichess using the Universal Chess Interface (UCI) it accepts the request and plays the game. In this mode the engine always plays with 100% exploitation. The engine was hosted on a consumer grade PC with an Intel i5-12400f CPU and a NVIDIA 3060TI GPU. The MCTS-iterations were set to 50,000, which results in a thinking time of approximately 10 seconds per move. Other players and chess engines with established ELO-ratings can be challenged from a bot account, which allows us to get an ELO for our engine. Bot accounts are however not allowed to queue for the standard “find a game” function, which limits the players we could find to those who were on the Atomic Chess top 200 leader board. Additionally, there are not many other bot accounts on Lichess that play Atomic Chess at an intermediate level, with most playing either on a very high level or a very low level. Lichess also provides data as to where the bots ELO-rating ranks in terms of the entire active playerbase, which is a valuable metric better at clarifying its playing strength rather than a somewhat arbitrary ELO-rating. 38 4. Method 4.3 Hardware Utilization and Cost To handle the computational demands of training deep reinforcement learning mod- els we utilized cloud-based GPU virtual machines (VMs) from TensorDock [16]. TensorDock is a cloud computing service specializing in on-demand GPU instances well-suited for machine learning workloads. We rented two VMs, one was mostly running the current best available version of the training pipeline, ensuring uninterrupted progress toward a final model. A second VM was periodically brought online to test modifications to the training pipeline or to evaluate experimental configurations. This approach allowed us continue to improve the training without risking ending up without a final satisfactory product. To optimize costs we powered VMs on and off as needed. Because TensorDock charges different hourly rates depending on whether a VM is active or idle, this flexibility helped us stay within the projects budget constraints. The total target expenditure for the entire project was 3,000 SEK The rented hardware configurations and their associated costs are shown in Table 4.1. Table 4.1: Rented hardware and associated costs Unit GPU CPU SEK/h (On) SEK/h (Off) 1 NVIDIA RTX A6000 PCIE 48GB Intel Xeon Gold 6226 (14 cores) 4 0.1 2 NVIDIA A100 SXM4 80GB AMD EPYC 7513 (32 cores) 14 0.1 39 5 Results This chapter presents the outcomes of the project by looking at the playing strength achieved by our engine, as well as comparing different models and evaluating the key components of the system individually. 5.1 Engine Strength In this section we present the strength of our best performing engine in the form of comparisons to earlier versions and also as an ELO progression. 5.1.1 Model Comparisons After choosing the final training setup, after some experimenting and making sure the training pipeline worked, the model went through 8,000 training steps using a replay buffer of 200,000 positions. In Table 5.2, we can see the win rates for the final model with 8,000 iteration steps matched against the models at previous steps in the training phase. It is important to note that less than 1% of the games typically resulted in draws, therefore games ending in draws were discarded when calculating the win rates. The number associated with each model represents the number of training iterations it has undergone during the final training phase. However, this number does not reflect the total number of training iterations from the start of the project. The final training phase was conducted on an already partially trained model, referred to as Model-0. The performance of Model-0, which can be seen in Table 5.1, shows that it beat a completely untrained model with randomized weights in 98.0 % of games. Table 5.1: Comparing win rate model-0 to untrained Untrained Model-0 (white) 99.2% Model-0 (black) 96.8% Model-0 (both) 98.0% 40 5. Results The training that was previously done on Model-0 was done with a suboptimal training pipeline where we sampled many smaller batches every time we generated new positions, in contrast to the current one where a batch of 4,096 is sampled each training iteration, increasing stability. Mostly this part of the training was done in order to test that the training was working at all, and due to the limited compute resources, we decided to keep the model that the initial training had given us in order to not waste resources. This unfortunately resulted in us not having any presentable data on the initial training, yet it is important to note that this early training was minimal and only accounts for a small factor of the training that was done in the final model. This can be seen in Table 5.2 where Model-8000 has a win-rate of 83.6% against Model-0. Table 5.2: Model-8000’s win rates against the other models over 500 matches (1,000 for both). Config: MCTS-iterations: 3,000 Untrained M-0 M-2000 M-4000 M-6000 M-8000 Model-8000 (white) 100% 88.8% 56.1% 52.3% 57.1% - Model-8000 (black) 100% 78.4% 49.7% 44.4% 44.4% - Model-8000 (both) 100% 83.6% 52.9% 48.4% 50.8% - Something that can be observed in Table 5.2 is that Model-4000 seems to outperform Model-8000. To further investigate this, we ran the same test again with Model-4000 as the main model to see if it was a superior model. Those results can be seen in Table 5.3. Table 5.3: Model-4000’s win rates against the other models over 500 matches (1,000 for both). Config: MCTS-iterations: 3,000 Untrained M-0 M-2000 M-4000 M-6000 M-8000 Model-4000 (white) 100% 86.0% 54.8% - 55.1% 55.6% Model-4000 (black) 100% 81.5 % 53.0% - 48.6% 47.7% Model-4000 (both) 100% 83.7 % 53.9% - 51.9% 51.6% Comparing the win rates for the two models we see that Model-4000 is the slightly stronger model overall, which is a result of its better performance when playing with the black pieces. Model-8000 might be worse overall but has a higher win rate than Model-4000 when playing white. To help visualize these results we can convert the win rates into difference in ELO rating, a more commonly understood metric of strength in chess. This can be done with the simple formulas that dictate how the ELO system in chess works. An arbitrary ELO rating of 1,000 is set for Model-8000 which will help us see how the ELO changed in the final training phase. 41 5. Results Figure 5.1: ELO progression during the final training phase. The last model, Model-8000 was set to an arbitrary ELO of 1,000, the other ELOs are then calculated in relation to their win rate against Model-8000 . Another interesting metric to look at Model-4000’s and Model-8000’s relative strength when playing white compared to black. Looking at Table 5.4 we can observe the same thing that we found when comparing the win rates of the models against other models in Tables 5.2 and 5.3. Model-8000 is stronger than Model-4000 at playing the white pieces, but worse when playing the black pieces. Table 5.4: The win rate of the model when playing 1,000 games against itself White Black Model-4000 52.9% 47.1% Model-8000 55.9% 44.1% 5.1.2 Final Playing Strength By looking at the results in Tables 5.3 and 5.2, Model-4000 was chosen to be matched against other bots online to get a strength estimate determined by the ELO rating. After 40 games against players and other chess engines, the engine got an ELO rating of 1,729. On Lichess this corresponds to a rating better than 90% of Atomic Chess players that week, which can be seen in Figure 5.2. 42 5. Results Figure 5.2: Atomic Chess ELO distribution and the projects final engines place- ment on the chess platform Lichess [17] The accuracy of this rating is hard to judge. The only atomic players we were able to find and play against were the players on the top 200 leader-board, all having a ELO rating above 1,950. Additionally, there are not many different Atomic Chess engines on Lichess, so the engine was either matched against engines much stronger or much weaker than itself. A few observations were made while playing against the engine and watching it play stronger players on Lichess. The engine has learned several sharp tactics and strategies in the opening and general strategies that hold true for all parts of the game. But when it comes to tactics specific to the mid- and end-game, the engine becomes noticeably weaker. The engine also does not have a great understanding of being resourceful with its pieces. For instance, when being several pieces down in material, the engine still plays moves that trades its own pieces for the opponents or for some positional winning. This goes against a common playing principle in chess that states you should not trade down pieces when you are behind, because you will be left in a situation with only pawns while the opponents has major pieces (rook, queen, etc.) left. It should be noted that we are not very familiar with Atomic Chess specific tactics and it is possible the engine is making good moves that we simply do not understand and therefore discount them as mistakes. 43 5. Results Additionally, the engine is not very good at time management. The engine currently is fully idle during the opponents turn, which means the engine wastes time every turn as when its turn starts, it still has no idea what it wants to do. A similar issue is the fact that the engine does not take its time remaining at all into account, it only does a hardcoded set number of MCTS-iterations, which typically takes the same time for every turn. This is opposed to a human that naturally make moves quicker when low on time in order to not lose on time. 5.2 Neural Network and Training In this section we cover the results of the training of our neural network. 5.2.1 Training Process Several training strategies were explored to identify the most effective approach. Two key factors were found to significantly impact training outcomes: replay buffer size and sampling strategy. With a very small replay buffer, fitting only the samples generated from previous training step, the model did not seem to learn anything. Even with 8 hours of training the model could not out-perform an untrained model. When matched against the untrained model it achieved a win rate of only 47% for the 500 games played (ignoring the draws). However, increasing the buffer size allowed the model to surpass an untrained version within the same training duration, highlighting the importance of sample diversity for effective learning. Initial experiments also involved sampling the entire replay buffer during each train- ing step. While this approach led to rapid loss reduction and short-term performance gains, it introduced significant overfitting. Models trained this way struggled when matched against their earliest versions, indicating a lack of generalization. To address this the strategy was adjusted to use mini-batches for each training step. This change dramatically improved long-term model strength, allowing it to consistently outperform all previous versions. For our final implementation we found that using a mini-batch size of 4,096, while generating 2,048 new positions per step, provided an optimal balance. This effectively resulted in each sample being utilized approximately twice, promoting both stability and sustained improvement. 5.2.2 Learning Verification To assess whether the network could learn simple tactical patterns a mate-in-2 test was conducted. The model played 50 games against itself from a fixed position where a forced mate was achievable. Figure 5.3 shows the move probability predicted by both MCTS and the neural network over training iterations. MCTS consistently favoured the correct move and after a few games the network started favouring that move as well. 44 5. Results Figure 5.3: Probabilities for the known best move from MCTS and the neural network in a fixed mate-in-2 test. MCTS ran with 10,000 iterations per move. 5.2.3 Training Metrics The systems throughput during training was measured. With the final configuration (12 residual blocks, 128 channels, 800 MCTS iterations per move on unit 2 in Table 4.1 with all 32 cores) the system achieved: • Games generated per hour: 3,200 • Positions generated: 16, 384, 000 = 2, 048 Samples Training Step · 8, 000 Training Steps These results highlight the computational demands of reinforcement learning using MCTS and neural networks, even on simplified chess variants. Generating data samples concurrently was likely the single most important optimization step, as the throughput of generated samples per unit time increased linearly with the number of workers up to 15 workers, on the hardware configuration used for training. In a later version of the program we were able to scale the number of workers up even further to around 64, but due to unresolved bugs and time constraints these changes were not used during final training. 45 5. Results 5.2.4 Virtual Machine Costs Table 5.5: Final cost per unit Unit Expenditure in SEK 1 1,212.32 2 627.39 The results show that the total expenditure for all virtual environments totalled 1839.71 SEK, which is under the goal of 3,000 SEK set for the project. 5.3 Move Generation Component To verify the correctness of move generation and evaluate its performance, we built a routine for conducting Perft tests. Perft tests take a position as input and output the number of positions, or nodes, found for a given depth, as well as how long the search took. Additionally, Perft divide tests are used to find bugs in move generation, where the number of nodes found for each move from the root position is also recorded. With this, we were able to verify the correctness and measure the speed of our move generation. Running on an AMD Ryzen 7 5800H, a consumer grade CPU, using all 16 cores, our Perft routine found roughly 250,000,000 to 360,000,000 nodes per second, depending on the root position. The output of our Perft routines was compared to the output of Fairy Stockfish’s Perft routines for at least ten important positions, totalling over a hundred billion positions, ensuring the same total number of nodes was found for each position [18]. In the engine the move generation runs on independent cores. The single-core per- formance was benchmarked by repeatedly generating all legal moves for randomly generated positions. The system consistently achieved a rate of 44,519,649 positions per second. 5.4 Monte Carlo Tree Search Optimizations In this section we present the results of different optimization methods for MCTS. 5.4.1 Batching Selections By batching together multiple game-states before sending them to the neural net- work, the performance increased significantly. This takes advantage of the GPUs efficiency when it comes to parallel computations. In Table 5.4 we can see what speed-up was achieved for different number of selections per batch. Our implemen- tation also used virtual loss to minimize competition between threads when it comes to selecting multiple nodes. 46 5. Results Figure 5.4: Speed up of the engine using different numbers of selections that are batched and sent to the network. 5.4.2 Virtual Loss Virtual loss was used in the engine to decrease the number of collisions when selecting multiple MCTS nodes during the selection phase. To test the effects of virtual loss, the engine was run with and without virtual loss with different numbers of MCTS iterations. In Table 5.6 we can see that the results show a significant decrease in collisions when using virtual loss. Additionally, while more MCTS iterations lead to more collisions, the increase is less than proportional with virtual loss. However, it is important to note that virtual loss optimizations might lead to distortion when it comes to the final visit counts for the MCTS algorithm. Though, by observing the games the engine played, no significant decrease in strength seemed to happen. Table 5.6: Average number of select collisions per move in MCTS with and without virtual loss for 1,000 and 5,000 iterations per move. MCTS Iterations No Virtual Loss Virtual Loss 1000 11 000 300 5000 48 000 380 5.5 Time Analysis of the Engine This section focuses exclusively on the inference-time performance of the engine during gameplay. The training phase, while computationally intensive in total, represents a minor proportion of runtime relative to the inference cost per move and 47 5. Results is not a limiting factor in real-time decision-making. As such, the analysis presented here concentrates solely on the performance of move generation, neural network inference, and MCTS search during self-play games. To identify computational bottlenecks in the engine, runtime profiling was conducted during a standard self-play game on a PC with a Intel i5-12400F CPU and an NVIDIA RTX 3060 GPU. These measurements will differ across different hardware configurations. The measurements include both the time distribution between major components and absolute execution times per move. The results are summarized in Table 5.7. Table 5.7: Execution Time and Proportion by Component Component Time Per Move Percentage of Total Time Neural Network Inference 120 ms 65% Monte Carlo Tree Search 65 ms 33% Generating Legal Moves < 1 ms < 1% The results show that neural network inference and MCTS search together account for nearly all execution time during gameplay, with neural network inference alone taking approximately twice as long as the MCTS search. Move generation, on the other hand, is computationally negligible, requiring under 1 millisecond per move and less than 1% of total execution time. 48 6 Discussion This chapter discusses the project results, limitations encountered, and lessons learned throughout development. It analyzes the final results in relation to the original project goals. The chapter then covers suggestions for future work and ex- plores directions for improvement. Lastly, the chapter concludes with answers to our research questions. 6.1 Analysis of Final Product and Results This section discusses the results from Chapter 5. We begin by reflecting on the final engine strength. 6.1.1 Final Engine Strength Discussion The final engine achieved an ELO rating of 1,729, placing it in the top 10th percentile of the active Atomic Chess player base on Lichess. Even though these results do not match the super-human performance of AlphaZero, they match the ambitions for this project. The training pipeline was therefore shown to be effective in producing a decent engine, yet the results also revealed several limitations that merit reflection. One of the more interesting findings was the inconsistent performance progres- sion across training iterations. Unexpectedly, Model-4000 ultimately outperformed Model-8000 in overall strength, particularly when playing with the black pieces. This suggests that more training does not necessarily correlate with better perfor- mance between the training iteration 4,000 and 8,000. The reasons for this will be discussed further in Section 6.1.2. The engine also demonstrated some suboptimal patterns of play. It demonstrated strong opening play but became noticeably weaker in mid- and end-game decision- making. For instance, the engine had a tendency to trade down pieces even while behind in material, which goes against chess strategy fundamentals and indicates a lack of long-term planning. One possible reason for this is that Atomic Chess simply has inherently more emphasis on the opening since there are many openings that lead to a quick checkmate. This rewards aggressive, short-term strategies that 49 6. Discussion sacrifice material in order to find a checkmate early. Despite this, it is still unde- sirable to be weak in the end-game, so there are several potential augmentations to the training pipeline that could counteract this issue. Some solutions could be starting a few self-play games from middle- or end-game positions, or adjusting the exploration/exploitation balance. These ideas will be discussed in Section 6.3.2. Finally its lack of time management further limited its practical playing strength. A simple fix to would be a variable MCTS iteration count that is based on the time setting for the match. This would mean a longer thinking period for longer time controls and setting a cutoff for the amount of time left, forcing the MCTS iterations to shrink down from the fixed MCTS iteration count to match the remaining time. Another possibility would be to use an additional machine learning algorithm to adjust the amount of MCTS iterations. The engine being idle during the opponents turn also causes a large time-loss. The obvious solution to this issue would be to continue MCTS indefinitely when waiting for the opponents move. When the opponent finally makes a move, the engine could then use the relevant subtree that has been built corresponding to that move. 6.1.2 Neural Network and Training Process A key part of the project is the neural network and training setup, which due to the limitations in the hardware, the complexity of the network and number of training samples that can be generated, is constrained. Therefore, unlike AlphaZero and other strong self-learning engines, we used a relatively shallow network structure due to the inference time benefits. There is an important trade-off between the speed of neural network inference and the complexity of the network. A shallow network enables faster position evalua- tions, which allows the engine to generate a larger volume of training data through self-play games in a given amount of time. This increases the quantity of training examples, which can help reduce variance in the learning process. Additionally, the shorter inference time allows for more MCTS iterations to be ran when searching for moves. On the other hand, having a too shallow of a network would create a bottleneck where the neural network can not find any more patterns beyond a cer- tain point due to being too simple. A deeper network will be able to identify more complex patterns, resulting in an engine that can find better strategies. It hard to tell whether our model size was a bottleneck or not. Looking at the hypothetical ELO progression plot we can see that the progress seems to halt after 2,000 training iterations. The reason for this could be that the model is not complex enough to find new patterns beyond this playing level. It