Reinforcement Learning-Based Cell Balancing for Electric Vehicles Master’s thesis in Computer science and engineering GIOVANNI MAZZOLO MATEI SCHIOPU Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2024 Master’s thesis 2024 Reinforcement Learning-Based Cell Balancing for Electric Vehicles GIOVANNI MAZZOLO MATEI SCHIOPU Department of Computer Science and Engineering Chalmers University of Technology Gothenburg, Sweden 2024 Reinforcement Learning-Based Cell Balancing for Electric Vehicles GIOVANNI MAZZOLO MATEI SCHIOPU © GIOVANNI MAZZOLO, MATEI SCHIOPU, 2024. Supervisor: Dr. Yang Xu, Volvo Group Advisor and Examiner: Pedro Petersen Moura Trancoso, Department of Computer Sci- ence and Engineering Master’s Thesis 2024 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX Gothenburg, Sweden 2024 iv Reinforcement Learning-Based Cell Balancing for Electric Vehicles GIOVANNI MAZZOLO MATEI SCHIOPU Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract Lithium-ion battery packs are comprised of hundreds to thousands of individual cells which, even though manufactured uniformly, exhibit small variations in their character- istics that impact their behavior during operation. These differences cause cells’ State of Charge (SOC) to become unbalanced, which can, in turn, reduce the capacity uti- lization efficiency of the pack [1]. Additionally, battery cells age differently over time, and fast-aged cells can cause packs with healthy cells to be retired early, without fully taking advantage of each cell. When a battery has deteriorated to around 80% of its total capacity, it is retired from electric vehicle usage [2]. To maintain batteries functioning correctly, cell SOC balancing must be done on battery packs. However, balancing the SOC of cells provides a window of opportunity to also include cells’ health into the balancing equation, aiming for the homogenization of cell aging, allowing to thoroughly utilize a battery’s resources. In this way, it is possible to both keep batteries in operating condition and potentially increase their lifespan. In this work, we develop and research a multi-cell simulation framework and Reinforce- ment Learning (RL) methodologies to explore the potential of cell SOC and health balancing. We propose an active balancing strategy for re-configurable cell topology with RL, in which instead of transferring energy between high SOC cells to low SOC cells, cell utilization is modulated so that the power consumption is optimally distributed based on each cell’s SOC. This strategy is applied to SOC balancing, as well as SOC and State of Health (SOH) balancing simultaneously, to potentially allow for an exhaustive utilization of the battery’s potential. Keywords: Battery, cell balancing, reinforcement learning, lithium-ion batteries, automo- tive, computer science, engineering, deep learning. v Acknowledgements We would like to express our deepest gratitude to our supervisor Dr. Yang Xu during our thesis project at Volvo Group, for always showing kindness, patience, willingness to help, and prioritizing assisting us on the project even during busy days. He helped us broaden our views, offered advice, and motivated us to do our best and to continue learning. 
Without his great experience, knowledge, and commitment to help, this work would not have been possible nor have been as enjoyable as it was. Even when at times we made mistakes or when we were not at our best, he made sure to always stay positive and supported us to keep moving forward. We sincerely thank you. Many thanks to the Volvo BMS team for the feedback, and guidance and for welcoming us to the team during our thesis work. Thank you for the amazing insights into the world of lithium batteries. We would also like to thank our Chalmers supervisor, Pedro Petersen Moura Trancoso for supporting us during the thesis work. This Master’s thesis was conducted at Volvo GTT in conjunction with Chalmers Univer- sity of Technology, Department of Computer Science and Engineering. Giovanni Mazzolo and Matei Schiopu, Gothenburg, 2024-06-26 vii Contents Acronyms xi Nomenclature xiii List of Figures xv List of Tables xvii 1 Introduction 1 1.1 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background 7 2.1 Battery Management Systems . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 State of Charge (SOC) . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 State of Health (SOH) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Causes of imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Cell Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5.1 Passive Cell Balancing . . . . . . . . . . . . . . . . . . . . . . . 12 2.5.2 Active Cell Balancing . . . . . . . . . . . . . . . . . . . . . . . 13 2.6 Cell Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.1 Equivalent-circuit Model (ECM) . . . . . . . . . . . . . . . . . 13 2.7 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Methods 21 3.1 Balancing Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 Battery Simulation Environment . . . . . . . . . . . . . . . . . . . . . . 22 3.2.1 Cell Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.2 Drive Cycle Profiles . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.3 Cell Simulation States . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.4 Cell Balancing Simulation . . . . . . . . . . . . . . . . . . . . . 31 3.3 Reinforcement Learning Model . . . . . . . . . . . . . . . . . . . . . . 33 ix Contents 3.3.1 RL environment . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.2 Action sampling rate . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Reward Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.4 Action/Observation space normalization . . . . . . . . . . . . . 38 3.3.5 Algorithm analysis . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.6 PPO vs. TD3 vs. SAC . . . . . . . . . . . . . . . . . . . . . . 44 3.3.7 Training techniques . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5 RL Tooling . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.6 MATLAB-Python Interfacing . . . . . . . . . . . . . . . . . . . . . . . 55 3.6.1 Memory-map Interface Implementation . . . . . . . . . . . . . . 56 3.7 Running the Cell Simulation Environment . . . . . . . . . . . . . . . . . 59 3.7.1 System specifications . . . . . . . . . . . . . . . . . . . . . . . 60 4 Results 63 4.1 Control Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 SOC only Active Balancing . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Active SOC and SOH Balancing . . . . . . . . . . . . . . . . . . . . . . 69 4.3.1 SOC and SOH Balancing with 1% Threshold . . . . . . . . . . . 71 4.3.2 SOC and SOH Balancing with 2.5% Threshold . . . . . . . . . . 74 4.3.3 SOC and SOH Balancing with 3.5% Threshold . . . . . . . . . . 77 4.3.4 SOC and SOH Balancing with 5% Threshold . . . . . . . . . . . 80 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5 Conclusion 85 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Bibliography 87 A Appendix I A.1 Python Cell Simulation Execution Example . . . . . . . . . . . . . . . . I A.2 Additional Simulation Runs . . . . . . . . . . . . . . . . . . . . . . . . IV x Acronyms Below is the list of acronyms that have been used throughout this thesis: BMS Battery management system RL Reinforcement learning ML Machine learning AI Artificial intelligence EV(s) Electric vehicle(s) SOC State of charge SOH State of health SOQ State of capacity SOR State of resistance SOX State of charge, health, capacity, resistance DOD Depth of discharge ET Energy throughput PPO Proximal policy optimization SAC Soft Actor-Critic DPG Deep Policy Gradient DDPG Deep Deterministic Policy Gradient TD3 Twin Delayed DDPG OCV Open circuit voltage DQN Deep Q-Network RC Resistor–capacitor PID Proportional–integral–derivative MPC Model Predictive Control ADAM Adaptive Moment Estimation xi Contents gSDE Generalized State-Dependent Exploration EM Electro-chemical Models ECM Equivalent-circuit Models DDM Data-driven Models UCB Upper Confidence Bound RUL Remaining-useful-life BPNN Back Propagation Neural Network RBNN Radial Basis Neural Network LSTM Long Short Term Memory ANN Artificial Neural Network xii Nomenclature Reinforcement learning variables γ Discount factor π Policy σ Logical sigmoid function ET (s, i) Energy throughput of a cell in a certain state R(t) Return. Total sum of rewards starting from timestep t onwards, mod- ified by discount factor rt Reward assigned at timestep t Other symbols Cj Balancing feedback for power modulation Physics notations η Coulombic Efficiency Ah Ampere hour (Amp × hour) C − rate Charge rate i(t) Current iapp(t) Discharge or charging current ibalance(t) Balancing current ileakage(t) Leakage current inet(t) Total current iself−discharge(t) Self discharge current P Power Q Capacity qi Cell capacity R Resistance xiii Nomenclature R0 Cell R0 resistance v(t) Voltage vOC(t) Open Circuit Voltage vRC(t) RC circuit Voltage vt(t) Terminal Voltage z(t) State of charge iRC(t) RC circuit current xiv List of Figures 2.1 The composition of EV battery packs . . . . . . . . . . . . . . . . . . . 7 2.2 The immediate effects of cell imbalance . . . . . . . . . . . . . . . . . . 11 2.3 First order Thevenin model of a cell. . . . . . . . . . . . . . . . . . . . 14 2.4 RL training principles . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 15 2.5 RL agent state-action-reward unit . . . . . . . . . . . . . . . . . . . . . 15 2.6 Taxonomy of RL algorithms . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1 Balancing focused re-configurable cell pack topology. . . . . . . . . . . . 22 3.2 ”P14” cell model of OCV and SOC relationship, T=25 C°. . . . . . . . . 24 3.3 Highway (us06) and urban (udds) drive cycle profiles. . . . . . . . . . . 28 3.4 State flow chart of the cell pack simulation. . . . . . . . . . . . . . . . 29 3.5 Cell pack SOCs and SOC difference during discharging, charging, and resting phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.6 State transition within a training episode . . . . . . . . . . . . . . . . . 34 3.7 Control feedback loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.8 Logistic sigmoid curve . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.9 Probability distribution - normal (Gaussian) distribution . . . . . . . . . 45 3.10 SAC mean episode reward - 2.5 million training steps to convergence . . 47 3.11 TD3 training instability - evaluation episodes . . . . . . . . . . . . . . . 48 3.12 Curriculum learning environment phases . . . . . . . . . . . . . . . . . 49 3.13 TD3 - Diverging after convergence . . . . . . . . . . . . . . . . . . . . 50 3.14 Illustration of two different processes sharing memory through a memory- mapped file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.15 Illustration of the memory-mapped files used for passing data between processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.16 Organization of the address spaces of the memory-mapped files. . . . . . 58 4.1 Simulation of 10 cells on the passive topology with no balancing. . . . . 63 4.2 Simulation of 10 cells on the active topology with no balancing. . . . . . 64 4.3 Passive balancing simulation of 10 cells - utilization = [10, 2]. . . . . . . 66 4.4 Close-up of 4.3 of steps 2025 to 2150. . . . . . . . . . . . . . . . . . . 67 4.5 Active balancing simulation of 10 cells - utilization = [10,2]. . . . . . . . 68 4.6 Active balancing simulation of 10 cells - close-up. . . . . . . . . . . . . . 69 4.7 Active balancing for SOC only (0% threshold) . . . . . . . . . . . . . . 71 4.7 Active balancing for SOC and SOH - 1% threshold - case 1. . . . . . . . 72 4.8 Active balancing for SOC and SOH - 1% threshold - case 2. . . . . . . . 73 xv List of Figures 4.9 Active balancing for SOC and SOH - 1% threshold - case 3. . . . . . . . 74 4.10 Active balancing for SOC and SOH - 2.5% threshold - case 1. . . . . . . 75 4.11 Active balancing for SOC and SOH - 2.5% threshold - case 2. . . . . . . 76 4.12 Active balancing for SOC and SOH - 2.5% threshold - case 3. . . . . . . 77 4.13 Active balancing for SOC and SOH - 3.5% threshold - case 1. . . . . . . 78 4.14 Active balancing for SOC and SOH - 3.5% threshold - case 2. . . . . . . 78 4.15 Active balancing for SOC and SOH - 3.5% threshold - case 3. . . . . . . 79 4.16 Active balancing for SOC and SOH - 5% threshold - case 1. . . . . . . . 80 4.17 SOC and SOC 5% threshold - case 1 - First 4000 steps. . . . . . . . . . 81 A.1 Active SOC balancing test - 1000 cycles - utilization [10, 2]. . . . . . . . IV A.2 Active SOC balancing test - 1000 cycles - utilization [20, 2]. . . . . . . . V xvi List of Tables 3.1 ”P14” cell model parameters. . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Cell simulation initial parameters. . . . . . . . . . . . . . . . . . . . . . 
25 3.3 Simulation parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4 EvalCallback function parameters . . . . . . . . . . . . . . . . . . . . . 51 3.5 SAC training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6 Hyperparameters used within the experiments - SAC . . . . . . . . . . . 54 3.7 Cell simulation parameters description . . . . . . . . . . . . . . . . . . . 59 3.8 System specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.1 Simulation training parameters for SOC only balancing. . . . . . . . . . 65 4.2 Simulation parameters of the trained SOC and SOH balancing models. . 70 4.3 Simulation parameters for testing SOC and SOH balancing models. . . . 70 xvii List of Tables xviii 1 Introduction Electric vehicles (EVs) have seen a massive surge in popularity over the past years, with reports from the International Energy Agency showing over 26 million EVs on the road in 2022, with more on the rise [3], giving way to a sustainable transportation method for people and industries. The increase in demand for electric vehicles goes hand in hand with battery demand, which comes with its own unique set of challenges. The battery pack comprises the most expensive component in EVs, with prices of $200/kWh [4]. Once the capacity of a battery degrades to a certain level, around 80%, the pack is deemed expired and must be replaced. The preservation of battery lifetime is thus integral and highly desirable in the field. Maximizing the amount of charge within a battery pack is yet another important research subject within battery technologies, to grant vehicles as much range as possible. Cell balancing is the process of bringing all cells in a pack to similar charge levels. Both battery lifetime and performance depend on the balancing mechanism of the battery. A lack of balancing will quickly render a battery unusable [5]. The widely-used industry balancing applications are using relatively simple topologies using control algorithms that are not battery health-aware and cannot effectively make use of all of the sensor data at their disposal. Traditional control mechanisms that efficiently follow multiple balancing objectives are difficult to develop, as they rely on precise modeling of battery pack relationships [6] [7]. Due to the non-linear and estimative nature of battery dynamics, when multiple objectives are pursued, the complexity of developing such control systems grows exponentially. Traditional balancing topologies rely on passive balancing, which wastes the charge from the cells, or active balancing, which attempts to transfer energy between cells until they are balanced. Both of these balancing methods happen during resting periods when the battery is not in use [8]. Within this thesis, we aim to introduce a cell balancing method that functions during vehicle operation by modulating cell utilization via reconfigurable cells according to the charge and capabilities of each cell. We develop a reinforcement learning (RL) balancing controller for this topology, training it to handle the complexity of such a topology, and attempt to balance the State of Charge (SOC) in a health-aware manner, preserving the State of Health (SOH). 1 1. Introduction 1.1 Research objectives We propose a new reconfigurable cell balancing topology, which aims to balance cells dur- ing operation. 
Our proposed topology utilizes a reconfigurable battery solution alongside power electronics that allows balancing to take place during cell operation, introducing the possibility of balancing outside rest phases - which is the norm for current battery management systems. Such a topology requires a complex controller that can interface with it and follow the balancing objective. RL agents can infer the required control actions without the requirement of relationship modeling used in methods such as MPC (Model predictive control) [7], while also being able to provide strong predictive behavior. This topology makes use of a high-dimensional continuous action space and observation space, in a highly non-linear environment, a challenging RL task, and we aim to provide an RL balancing controller that can effectively train to handle such an environment, as well as attempt to follow multiple training objectives. The aim of this thesis is to introduce a reinforcement learning approach to cell balancing. The intended outcome is to create a cell balancing controller built from a reinforcement learning agent, as well as provide the simulation environment for the training and testing of the agent. Designing a balancing controller that can take many factors into account using traditional control theory methods, for a non-linear environment, gets exponen- tially harder with each objective as there are difficulties in modeling these non-linear relationships between each new factor. We aim to train an RL controller agent to learn these relationships and make use of them in battery balancing, and simultaneously pursue multiple objectives such as SOC balancing in a health-aware manner. 1.2 Related work With the rise of electromobility, research in the area of battery systems has expanded rapidly. The first component of battery management systems (BMS) to be interfaced with AI/ML technology has been parameter and state estimation and prediction. Sev- eral papers discuss the applications of supervised and unsupervised learning in order to predict the values of SOC, SOH, capacity, battery aging, and other functioning param- eters. There have been extensive studies targeting the estimation of SOC, SOH, and Remaining-useful-life (RUL) for battery systems using machine learning. However, the cell balancing aspect of BMS remains an open question in the field of machine learn- ing, specifically when it comes to exploring the capabilities of reinforcement learning for battery controllers. Harwardt et al. [9] consider a reinforcement learning method that targets passive and active cell balancing using a cell simulation that contains a dynamic thermal model. However, this method is limited to balancing one cell per timestep and making powerful assumptions, such as constant voltage. Y. Yang et al. [10] explore a fast-charge RL battery control method using DQN for minimizing charging time in a balancing-aware manner. By using the generalization of 2 1. Introduction a neural network, the RL agent handles balancing control during charging. Duraisamy et al. [11] propose cell balancing methods that utilize back propagation neural network (BPNN), radial basis neural network (RBNN), and Long Short-Term Memory (LSTM) models to select an optimal resistor for passive battery balancing. Each model’s parameters are based on SOC, temperature rise, balancing time, and C-rate and they are compared and evaluated on a 3-cell and 3-resistor simulation environment. 
The study treats the case of the switched shunt resistor passive balancing topology and proposes a set of additional resistors that the model can swap between at will, comparing it to the limited capabilities of a weaker topology. Z. Xia et al. [12] utilize an artificial neural network (ANN) to estimate the battery capacity, which then feeds into a controller to manage the balancing of the battery cells. Both of the presented studies rely on machine learning techniques only for battery measurements, estimating their states. B. Jiang et al. [13] study a Deep Neural Network RL method of controlling a reconfig- urable cell topology based around switches, using a discrete action space. This balancing method considers the only SOC of cells for the RL agent’s balancing objective. The technical basis for battery pack modeling and simulation methods used for developing the new topology is adapted from G. Plett’s publications [14] [8]. These works provide exhaustive descriptions of physics-based models of lithium-ion cells and state-of-the-art applications of equivalent-circuit models used for battery management and control. The battery pack simulations of the project were adapted from these works. The reinforcement learning algorithms studied within this thesis and used to develop the RL controller are based on the following works. T. Haarnoja, et al. [15] describe Soft Actor-Critic (SAC), an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning, which is utilized in our work to train RL controllers. S. Fujimoto et al. [16] propose a novel mechanism that builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and results in the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) which we experiment with in our work. T. P. Lillicrap et al. [17] define the Deep Deterministic Policy Gradient, the actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and the principles that would later serve as the basis for TD3 and SAC algorithms. J. Schulman et al. [18] propose Proximal Policy Optimization (PPO), policy gradient methods for reinforcement learning which alternate between sampling data through interaction with the environment, and optimizing an objective function using stochastic gradient ascent. Existing studies treat cases of 3 cells [9], [11] or 4 cells [10], while others utilize machine learning only for estimation rather than direct control [12]. We seek to provide a scalable RL methodology that introduces state-of-the-art RL algorithms to the field and provides insight into RL development methods in the context of EV battery controllers. The existing methods handle balancing during resting [9], [11], or charging [10] phases. A study [13] proposes a reconfigurable topology based around switches, a discrete action space, coupled to an RL controller which only takes SOC into consideration in the reward function. We introduce a power-electronics battery balancing topology, with a 3 1. Introduction continuous action space and high-dimensional continuous observation space, which can be active during battery discharge during normal operation of a vehicle, coupled with an RL controller capable of handling this continuous action space to balance SOCs as well as maintain SOH through health-aware balancing. 
1.3 Methodology Within our work, we propose and analyze applications of reinforcement learning agents for battery pack balancing control, as a method to follow multiple objectives within balancing. This is done within a dynamic battery pack simulation based on experimental cell datasets. Within the project, we adapt the simulation formulas to our proposed topology and link it up to simulated road-usage powder demands profiles, synthesized from a drive-cycle dataset, and altered by road conditions. This simulation is then attached and synchronized with reinforcement learning algorithms within our training framework, in order to train and produce RL agents which can fulfill the role of controllers for the BMS of the EV battery in an online RL way. We undergo a thorough analysis and experimentation of reinforcement learning algorithms, training techniques, optimizations, and reward design in order to find the most suitable algorithm for training RL agents for battery pack environments and provide insight into development methods for the complex high-dimensional and non-linear learning task of health-aware battery balancing. 1.4 Organization The thesis is formatted into several chapters. The theoretical background and foun- dational information are presented in the ’Background’ chapter, which is split into the battery pack, balancing, and RL theory. The implementation and experiments are dis- cussed in the ’Methodology’ chapter which presents the work done on the cell simulation, RL algorithms, and the interfacing of the two. The findings are showcased in the ’Re- sults’ chapter. Discussion based on the findings and final thoughts for applications are presented in the ’Conclusion’ chapter, as well as proposals for future improvements and subsequent work. 1.5 Limitations The electrical and electronics design aspect of the topology makes use of abstractions, especially in the case of duty-cycling power electronics. In-depth design and simulation of these elements are not within the scope of the work. Training of the RL model for this project was done using cell simulations that are based on the characterization of real cells, which are modeled using a first-order RC circuit, which we consider sufficient for the needs of this project. 4 1. Introduction 1.6 Ethical Considerations All datasets used within this work for cell modeling and drive cycles are public and contain no identifiable vehicle data. Reinforcement learning models, especially Deep Learning variants, are highly complex and raise difficulties in interpreting their decisions. This lack of transparency can lead to issues for safety considerations. Any EV applications that attempt to incorporate RL/ML technologies within their workflow must undergo extremely thorough testing and verification, especially in the case of battery technologies which can be subject to leaks of hazardous material or smoke, fire, and explosion as a result of thermal runaway. 5 1. Introduction 6 2 Background 2.1 Battery Management Systems Battery management systems (BMS) are software applications that oversee the safe and efficient operation of battery cells and the battery packs as a whole [19]. Figure 2.1: The composition of EV battery packs A battery pack can contain hundreds of cells, and an electric vehicle (EV) can contain multiple battery packs, pictured in Figure 2.1. Traditionally, EV manufacturers use the same cell type in vehicle packs, as mismatches add unnecessary complexity to functioning and manufacturing. 
However, cells of the same type, from the same manufacturer and manufacturing batches, are not identical in characteristics. Current manufacturing limitations and industry demand do not allow precise uniformity among the cells [14] [8].

A BMS is responsible for keeping track of data from sensors across the battery pack, such as voltages, currents, and temperatures, and uses it to maintain uniform usage in a non-uniform environment. Some sensors are present at cell level, on each cell, such as the ones for current and voltage, while others, such as thermal sensors, can be placed across several sections of the battery pack. This data is used to manage temperature, charge C-rate, discharge profile, and other functions of the pack [19].

The raw data from the sensors is then used to estimate and predict 'states' for the cells. State of Charge (SOC), State of Health (SOH) (including State of Capacity (SOQ) and State of Resistance (SOR)), and many other states are estimated by the BMS to determine the status of the battery pack. SOC, for example, can be used to directly inform the power management unit of the remaining charge and, coupled with additional vehicle usage data, determine the remaining range of the vehicle and offer an estimation of how much longer the vehicle can run under assumed average usage [14].

2.2 State of Charge (SOC)

Arguably the most important estimation performed in the battery pack is the State of Charge. It is crucial to the basic functionalities of a vehicle; however, SOC has many more important uses that are not directly visible to the end user of an EV. As we cannot directly measure SOC, we must rely on estimation methods [14].

Rechargeable lithium-ion cells are small packages of different shapes and sizes that store electricity through chemical reactions. Cell usage, characterized by the charge/discharge rate, known as the cycling rate (C-rate), and by voltage, must stay within manufacturer-defined limits [14]. State of Charge is bound to these limitations. If a cell were to over-charge beyond its factory limit, chemical reactions within the cell would lead to irreversible damage and a permanent loss of capabilities in the cell. The same goes for over-discharging. A cell is never 'fully discharged' of its electric charge, but rather partially discharged down to a lower voltage limit which it must never pass [19]. When discharging, this usage bound is called Depth of Discharge (DOD), and it measures how much of a battery's capacity has been used relative to its total capacity. DOD has a direct impact on the degradation of a cell, as higher DOD generally results in a larger volume change of active particles during cycling, increasing stress and leading to cracking and cell degradation [20].

The definition of SOC varies depending on the method used for the estimation. The most widely used methods are open circuit voltage, Coulomb counting, impedance spectroscopy, and model-based or ANN-based approaches.

In the Open Circuit Voltage (OCV) method, SOC is defined through a one-to-one relationship to OCV. The battery is disconnected from the load and left to settle its chemical reactions, and then the OCV is measured. It is then compared to a SOC-OCV lookup table and the SOC is determined [21]. This method, however, can require long relaxation times until a lithium-ion battery is fully settled. The lookup table depends on the battery chemistry, and the SOC-OCV relationship can diverge from the one-to-one estimation with temperature changes and aging.
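As a small illustration of the lookup step in the OCV method, the sketch below interpolates SOC from a rested open-circuit voltage. The SOC-OCV pairs are invented placeholder values for illustration only, not the measured table of any particular chemistry or of the cells used in this work.

```python
import numpy as np

# Placeholder SOC-OCV pairs standing in for a measured lookup table
# (invented values; real tables are chemistry- and temperature-specific).
SOC_POINTS = np.array([0.0, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
OCV_POINTS = np.array([3.0, 3.45, 3.55, 3.65, 3.75, 3.95, 4.2])   # volts

def soc_from_ocv(ocv_measured):
    """Estimate SOC from a rested open-circuit voltage by table interpolation."""
    return float(np.interp(ocv_measured, OCV_POINTS, SOC_POINTS))

print(soc_from_ocv(3.7))   # roughly 0.5 for this placeholder table
```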
In the Coulomb counting method, SOC is defined as the integration of current [21]. Given an accurate initial SOC estimation, coulomb counting keeps track of the current that flows in and out of the battery by accumulating the charge that is being transferred [22]. However, this method relies on the accuracy of the counting sensors, and inaccuracies from defects or electronics wear can accumulate into large estimation errors [21]. Electrochemical impedance spectroscopy functions by injecting small amplitude AC sig- nals to a battery at different frequencies. Parameters measured from these signals are then used as indicators of SOC. This method is not suitable for online measurements, while the battery is being used, [21] and usage is mostly restricted to lab environments. 8 2. Background The model-based methods are the most suitable for online measurements and involve modeling of the electrical, chemical, or combinations of both properties of a specific battery. The most widely used algorithms for this method are variations of the Kalman filter [23]. For our project, we utilize coulomb counting to track SOCs. We aim for a simulation of a battery within a vehicle in usage, rather than within a lab environment, and thus we make use of an accurate online, real-time, measurement method. 2.3 State of Health (SOH) The State of Health (SOH) is another critical parameter for BMSs that aims to estimate the loss of charge storage capabilities of a battery, by use and aging [24]. Fundamentally, it is the monitoring of the battery’s parameters whose degradation process is, usually, slow yet influences the battery’s performance. Over time, a battery’s total capacity gets reduced - this is known as capacity fade. Capacity fading occurs due to the cell’s structural deterioration as well as chemical side effects over time. Furthermore, battery cell aging causes its internal resistances to increase, for similar reasons. It is essential to have an accurate measurement of the capacity and internal resistances, given that they are major contributing parameters for estimating SOC and calculating energy [14]. Similar to SOC, SOH is also a parameter obtained through estimation and several meth- ods enable it, such as [24]: Characteristic Quantity Method, Model-based Method (Equiv- alent Circuit Model, Electro-chemical Model, Electro-chemical Impedance Spectroscopy), Fusion Method and Data-driven Methods (Neural-networks, Fuzzy Logic, Support Vector Machine, Parameter Identification). An influential aspect of a battery’s health is the C-rate, which defines how quickly the cell acquires or releases charge. C-rate has many implications on a cell’s lifetime and capacity, the most important one being that high C-rates lead to drastic loss of capacity and loss of power for the cell [8] [20]. Essentially, batteries that are under extreme load demands or that are charged at a very high rate can exhibit faster degradation, leading to reduced lifetime. Many elements are considered to influence the degradation of SOH in a battery during operation. The most drastic effects arise from overcharging and over-discharging, as well as operation at inadequate temperatures [8]. Within our work, we treat these effects via charge/discharge safety mechanisms and temperature abstractions within the battery simulation, and focus on the effects of cell balancing and cell usage. We use energy throughput as an indicator for preventing SOH degradation. 
By attempting uniform energy throughput for each cell while balancing, we aim to decrease the overall degradation of cells and limit the non-uniform degradation of cells in a pack, in order to increase the lifetime of battery packs.

Some state-of-the-art works [12], [25] that include SOH in the balancing strategy model SOH as the capacity fade of the cells. However, capacity fading can be caused by several factors, including internal chemical reactions and cell operation. Cell chemical deterioration is not considered for this work, as it is not available within our simulation capabilities or dataset. Instead, we consider SOH as the energy throughput of the cells. Treating cell throughput as SOH allows the RL model to be trained on data that updates consistently and quickly, in addition to energy throughput being linked to cell aging [20].

2.4 Causes of imbalance

Imbalance is introduced by any combination of factors that make the SOCs of the cells diverge from one another. A common factor is the different Coulombic efficiency of cells. Two cells may start in the same SOC state and quickly diverge during charging due to their different efficiencies η [8]. The following formula describes how the SOC z of a cell is estimated while the cell is being charged

$z(t) = z(0) - \frac{1}{Q}\int_0^t \eta(\tau)\, i_{\text{app}}(\tau)\, d\tau$    (2.1)

where Q is the capacity, η is the Coulombic efficiency, and $i_{\text{app}}$ is the charging current of the cell. Notice that η directly scales the charging current of the cell, and each cell's η parameter is unique; even a slight difference can cause a cascading unbalancing effect over time.

Imbalance can also occur due to different total current loads applied to the cells. The current that passes through a cell originates from multiple sources. Besides the main application current, the load requested from the battery pack for the operation of the vehicle, we also have the self-discharge current as well as the leakage current [8].

$i_{\text{net}}(t) = i_{\text{app}}(t) + i_{\text{self-discharge}}(t) + i_{\text{leakage}}(t)$    (2.2)

The self-discharge rate differs from cell to cell and refers to the internal current flow within a battery cell when it is not connected to any external circuit. Self-discharge is a natural phenomenon that occurs in all types of batteries over time due to ongoing chemical reactions within the battery, even when not in use, as well as impurities and imperfections in manufacturing which create small pathways for current to flow internally. These chemical reactions accelerate with temperature increases, as well as with high SOC values held over long periods [8].

Leakage current refers to the small amount of current that powers the attached BMS electronic circuitry. Manufacturers of battery packs and BMS components specify the power consumption of their BMS circuitry in datasheets [8]. The effects of these additional currents are seen in every state of the battery: charging, discharging, and resting. They are permanently active and vary with the parameters of the pack and cell itself [8].

2.5 Cell Balancing

The definitions of a balanced pack differ from application to application. A balanced pack is commonly defined as a pack where all of the SOCs of its cells are the same. However, this is a very harsh requirement that depends strongly on very accurate measurements and cell manipulation techniques which are not usually available in an EV system.
The SOCs in an EV battery pack vary greatly during usage and the estimations from measurements can be inconsistent, as the cells are not measured in an isolated lab environment [21]. We therefore introduce a balance threshold rather than an exact equality relationship. In a cell pack with SOC values between 0 and 100%, we compute the difference between the smallest SOC value and every other cell in the pack. We assign a threshold for the maximum SOC difference between the cells, which within our experiments takes values between 1% and 3.5%. Cells within the pack with a difference smaller than the set threshold are considered balanced.

Due to the non-uniform characteristics of cells in a pack, which diverge even further through usage and aging, cells in a pack begin to have very different SOCs. As the cells are charged and discharged, these SOC variations grow larger and larger to the point where, for example, one cell can be below 50% SOC when the rest are almost at 100%. The short-term effects are a significant reduction of battery capacity and a waste of energy, reducing the range of an EV to a fraction of the intended baseline.

Figure 2.2: The immediate effects of cell imbalance

When such differences happen, a BMS limits the battery capacity to that of the weakest cell in the pack, pictured in Figure 2.2. Continuous usage of the vehicle with these imbalances can lead to rapid deterioration of the cells, rendering the whole pack unusable. An improper BMS implementation on an EV suffering from cell imbalance can lead to a full stop in the middle of a highway and similar safety risks [8].

These imbalances, if left untended over a longer period, lead to rapid degradation of the battery cells through accelerated aging and abuse. Erroneous charging/discharging of imbalanced batteries, usually due to measurement inaccuracies under stress, leads to overcharge and over-discharge. This produces irreversible chemical reactions in the batteries, which reduce their capabilities and generate heat [5].

Imbalance in cells influences internal resistance. Cells with higher internal resistance tend to dissipate more heat during charging and discharging. This increased heat generation can exacerbate temperature differences within the battery pack, potentially leading to localized hotspots where thermal runaway may be initiated in a heavily degraded cell [8]. Imbalanced cells may also experience voltage spikes or current surges during charging or discharging cycles, especially when transitioning between different operational states. These irregularities in voltage and current can induce stress on the cells and lead to thermal runaway if not properly managed by the battery management system [5].

The sum of these factors can start, or contribute to, an incident of thermal runaway. As a lithium-ion cell's temperature increases, the chemical reactions in the cell get faster and faster, leading to a self-feeding reaction. This catastrophic effect of self-heating, known as thermal runaway, causes the expansion of the cells, the production of smoke and fire, and finally leads to the explosion of the cell pack. If the BMS detects the development of this effect too late and a battery enters a state of thermal runaway, it usually cannot be stopped, even with outside thermal influence from cooling mechanisms; it can only be prevented before the thermal runaway has been initiated [5].
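As a concrete illustration of the threshold-based balance criterion introduced at the beginning of this section, and of the divergence mechanism of Equations (2.1) and (2.2), the following Python sketch steps two cells with slightly different Coulombic efficiencies through the same charging current and then applies the threshold check. It is a minimal sketch with invented variable names and a 1 s discretization; it is not the simulation code used in this work.

```python
import numpy as np

def soc_step(soc, i_app, q_ah, eta, i_self=0.001, i_leak=0.002, dt=1.0):
    """One discrete Coulomb-counting step of Equations (2.1) and (2.2).
    i_app is the applied current in A (positive on discharge, negative on charge);
    eta is applied only while the net current is charging the cell."""
    i_net = i_app + i_self + i_leak                  # Equation (2.2)
    eff = eta if i_net < 0 else 1.0
    return soc - eff * i_net * dt / (q_ah * 3600.0)  # discretized Equation (2.1)

def max_soc_spread(socs):
    """Largest deviation of any cell from the weakest (lowest-SOC) cell."""
    socs = np.asarray(socs, dtype=float)
    return float((socs - socs.min()).max())

# Two cells, identical starting SOC, slightly different Coulombic efficiencies.
socs = np.array([0.20, 0.20])
etas = np.array([0.999, 0.997])
for _ in range(1800):                                # 30 minutes of 1C charging
    for k in range(2):
        socs[k] = soc_step(socs[k], i_app=-14.53, q_ah=14.53, eta=etas[k])

threshold = 0.025                                    # 2.5% balance threshold
spread = max_soc_spread(socs)
print(f"SOC spread after charging: {spread:.4f}, balanced: {spread <= threshold}")
```

A single half-cycle produces only a small spread, but because the efficiency mismatch acts on every charge, the deviation accumulates over repeated cycles, which is the cascading effect described above.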
2.5.1 Passive Cell Balancing Because of these issues and the critical flaw of lithium-ion packs, a BMS must regularly balance the battery cells and maintain a certain degree of uniformity for the SOCs. There are many methods for ensuring this balance, each requiring different circuitry and electronics within the pack to enable it, and they are categorized into two categories: passive and active cell balancing [8] [26]. In terms of energy, passive balancing usually is classified as dissipative, whereas active balancing is non-dissipative [27]. Passive cell balancing is the simplest form of balancing, where cells are discharged until they reach similar SOCs. The excess charge is transferred into heat through a consumer, a resistor. The main benefits of these methods are simple implementations and relatively cheap components. There is no need for advanced algorithms. In some implementations, the BMS has no direct influence on the passive balancing mechanism, which functions purely at the circuit level [8]. The downside of these methods is that battery charge is wasted destructively. Energy may not be simply removed from the cells and is instead transformed directly into heat. The resistors are placed close to the cells and this leads to the heating of the pack. Due to this effect, as well as the inaccuracy of SOC estimations during charge/discharge, it can be dangerous to allow this balancing to happen outside resting periods, where there is no outside current applied to the cells. Applying this method during charge/discharge comes with the danger of overheating the battery which, even if thermal runaway is avoided, still contributes to the deterioration of the cells [8] [26]. 12 2. Background 2.5.2 Active Cell Balancing Active balancing, as a non-dissipative type does not waste cells’ energy to achieve balance. Most active cell balancing implementations rely on transferring energy from high SOC cells to low SOC cells. There is some waste of energy on the circuitry needed to enable the energy transferring between cells, but lower in comparison to the passive balancing strategies [8]. Naturally, transferring energy between different cells requires more complex mechanisms, some of which are not effective enough to be considered as good alternatives to already implemented passive balancing designs. Not only in complexity, but voltages between imbalanced cells need to be relatively large so that cells can balance quickly. This means that low differences in cell voltages, even though they can be unbalanced, can make the energy transfer too slow, unable to achieve balance in the necessary amount of time without the need for additional circuitry that can bypass this limitation. 2.6 Cell Model There are a variety of different models utilized to simulate battery cell behaviors, some of which are Electro-chemical Models (EM), Equivalent-circuit Models (ECM), and Data- driven Models (DDM) [28], each with different advantages and disadvantages. In this project, we utilize the ECM model for cell and cell pack simulation. 2.6.1 Equivalent-circuit Model (ECM) ECM models the electrical behavior of battery cells through circuit theory, utilizing electrical elements such as resistors, capacitors, inductors, and voltage/current sources. The ECM model used is based on the Thevenin circuit model, one of the most widely used ECM models for cell battery simulation, given that it can more accurately represent the dynamic behaviors of the cells [28]. 
Due to the Thevenin ECM being a linear and time-invariant circuit model, it does not accurately capture the nonlinear and time-variant physical behaviors of the cells. Additionally, for the model to be accurate, real cells must be measured and parameterized to tune the model parameters for precise simulation. This can be challenging given the non-linear behaviors that cells show depending on different parameters such as temperature, SOC, and rates of charge and discharge [8], [28].

Figure 2.3: First order Thevenin model of a cell.

In this model, OCV is defined as the Open Circuit Voltage, which is the voltage of the cell measured without load. The OCV is also a necessary parameter to determine the SOC of a given cell. The cell voltage is modeled as shown in equation (2.3).

$v(t) = \mathrm{OCV}(z(t)) - R_1\, i_{R_1}(t) - R_0\, i(t)$    (2.3)

2.7 Reinforcement Learning

Reinforcement learning is a branch of machine learning. Machine learning, as a concept, is a computational approach to learning. Reinforcement learning uses 'interaction' as its object of learning. The 'learner' in an RL setting is frequently called the 'agent'. The objective of the agent is to solve problems that are infeasible to approach through traditional algorithmic means [29]. The usual applications of RL algorithms are problems with an often large set of variables in a non-uniform environment, that is, variables which influence each other non-linearly [29].

Reinforcement learning is formalized as control optimization in an incompletely-known Markov decision process. The learning agent must sense the state of an environment and take actions to affect that state. RL is often referred to as a form of unsupervised learning; however, the objectives of these two methods are different. Unsupervised learning tries to find a structure hidden in collections of unlabeled data, whereas reinforcement learning tries to maximize a reward sequence rather than trying to find a hidden structure. There is some overlap in these objectives, as finding a structure would be beneficial for RL in most situations, but it does not guarantee the maximization of a reward [29].

Figure 2.4: RL training principles

Figure 2.4 illustrates the dataflow of RL training. An environment (i.e. the battery pack) sends out states to the RL agent. The agent provides the best course of action for that state that it has learned through experience so far. The action alters the environment, leading to a new state. This state is assigned a reward, according to the reward function, and the feedback is given to the RL agent. After enough state-transition-reward pairs are gathered, the RL agent goes through a training step in which it updates its course of action.

An RL agent's target for interaction is the environment. An environment can be defined as the setting in which our agent can act and influence elements within. It can be several integer variables, a set of continuous arrays, or a complex three-dimensional simulation. The definition of an environment depends on the problem which we seek to solve. The RL agent is offered an interface through which it can interact with the environment. This is the Action Space of our environment, and it represents the variables that are manipulable by our agent. The variables that cannot be directly influenced by our agent are termed the Observation Space. A 'snapshot' of the Observation Space at a certain timestep is called a 'state'.
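To make the notions of Action Space and Observation Space concrete for a balancing task, the sketch below defines continuous spaces in the style of the Gymnasium API. The bounds, shapes, and layout here are illustrative assumptions only, not the exact spaces used by the environment developed in this work, which is described in Chapter 3.

```python
import numpy as np
from gymnasium import spaces

N_CELLS = 10  # pack size used in the simulations of this thesis

# Action: one continuous modulation command per cell (illustrative bounds).
action_space = spaces.Box(low=0.0, high=1.0, shape=(N_CELLS,), dtype=np.float32)

# Observation: per-cell SOC and SOH stacked into one continuous vector
# (an assumed layout, not the thesis' actual observation design).
observation_space = spaces.Box(low=0.0, high=1.0, shape=(2 * N_CELLS,), dtype=np.float32)

print(action_space.sample())       # a random point in the continuous action space
print(observation_space.shape)     # (20,)
```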
States are linked together by the actions which the agent takes between them. We start interaction with an environment in an initial state, and each action the agent takes leads to another state. Actions are sequential and begin from the resulting state of the previous action [29].

Figure 2.5: RL agent state-action-reward unit

Each action taken upon an initial state and leading to a final state is assigned a reward, formalized in Equation (2.4). The reward is defined by a reward function that represents the target of our RL agent, the objective which it must achieve through its actions. The resulting return is weighted by a discount factor γ, which takes values between 0 and 1, see Equations (2.5) and (2.6). The discount factor determines how much the RL agent 'looks ahead' as it tries to maximize the rewards. The largest term of the sum is the immediate reward for a state-action pair; however, with a large discount factor, the future rewards of the following actions gain more importance. An agent may thus take an action that has a low immediate reward but allows for much higher rewards in future states. The agent then develops efficient long-term strategies that can secure high total rewards by taking into consideration the development of states. An RL agent with a discount factor of 0 is known as 'myopic', and ignores future considerations in its decision-making. In practice the discount factor is capped just below 1, commonly at 0.99, which ensures that rewards in the distant future eventually approach 0 and that the immediate reward carries more weight than any individual future reward. The predictive focus of an RL algorithm must be tuned to the specific task it aims to solve, and the discount factor is critical in defining this focus [29].

$r_t = R(s_t, a_t)$    (2.4)

The RL objective is to select a policy that maximizes the expected reward sum when the agent acts according to that policy. A policy is a stochastic rule by which the agent selects actions as a function of states. A policy's value functions ($v_\pi$ and $q_\pi$) assign to each state, or state-action pair, the expected return from that state or state-action pair, given that the agent uses the policy (2.5). The value function of a state s under a policy π, denoted $v_\pi(s)$, is the expected return when starting in s and following π thereafter. Similarly, we define the value of taking action a in state s under a policy π, denoted $q_\pi(s, a)$, as the expected return starting from s, taking the action a, and thereafter following policy π (2.6) [29] [16].

$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$    (2.5)

$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$    (2.6)

These value functions are estimated through experience: an agent follows the policy π and, for each state encountered, computes an average of the total reward returns that have followed that state, which converges to the state's value $v_\pi(s)$ as the number of times that state is encountered nears infinity. Keeping a separate average for each action in each state, we converge to the action values $q_\pi(s, a)$. The value functions define an ordering over policies. A policy π is defined as better than π′ if its expected total reward is greater than that of π′ in all states; formally, π > π′ if and only if $v_\pi(s) > v_{\pi'}(s)$ for all states s. The RL task is to find the optimal policy π∗ which is greater than all other policies, formalized in the following equations (2.7), (2.8), (2.9) [29].

$v_*(s) = \max_\pi v_\pi(s)$    (2.7)
$q_*(s, a) = \max_\pi q_\pi(s, a)$    (2.8)

$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\right]$    (2.9)

Traditional greedy algorithms in non-linear environments show severely limited performance. A greedy algorithm begins by sampling an action, and will then continuously take that action, as it is the 'most rewarding' action at its disposal. It makes no effort to explore the other actions it can take, as doing so is not according to its imperative of maximum immediate reward, which it can only compute from the actions it has taken. Thus, the same course of action is repeated ad infinitum with no improvements, never discovering the better alternatives offered by the other actions. RL, in contrast, is defined by its capacity for effective exploration of a large environment [29].

A big challenge in designing RL systems is the balance between exploration and exploitation. To accrue a large reward, the agent must take actions that it has tried in the past and knows will produce an effective reward. However, to discover such actions the agent must try actions that it has not attempted before. Our agent must exploit what it has already tried, and explore to discover new methods for achieving rewards. By choosing only one, we fail the other, and a balance must be struck. This trade-off dilemma is specific to reinforcement learning and has no equivalent in supervised or unsupervised learning [29] [30] [31].

RL algorithms utilize techniques such as ε-greedy, softmax, or Upper Confidence Bound (UCB) to encourage the agent to explore a wide range of actions and states in the early stages of learning. This exploration helps the agent discover effective strategies and solutions that might not be immediately apparent. By systematically balancing exploration with exploitation, RL ensures that the agent can both discover new strategies and refine them to maximize long-term rewards, eventually arriving at a final optimal policy [29].

The biggest bottleneck in RL is samples. RL algorithms need many state-action-reward samples to train efficiently. The number of samples varies by task and environment but usually reaches several million state transitions before convergence. An RL algorithm converges when further exploitation yields no further improvements and the exploration mechanism tapers off. For example, some algorithms utilize an entropy parameter that gradually decreases according to loss functions based on the accuracy of predicted rewards, while others introduce noise to the actions of the agent to ensure exploration [15] [16].

There are two approaches to RL when it comes to sample-gathering: Online RL and Offline RL. In offline RL, samples are provided from interactions with other actors (i.e. other BMS control algorithms). These interactions are recorded from logged data and tagged with rewards. In the training process, the RL agent cannot directly act upon the environment and instead learns from the actions of other controllers. In online RL, an environment is simulated and the RL agent interacts directly with the simulation to generate state-action-reward samples. This approach works best when there is no large amount of usable logged data available to train the agent with.
Examples of RL infrastructures in the industry initially start with an online approach when branching into the area, and gradually move to Offline once they have the possibility of generating enough logged data from products, or through digital twins. A digital twin is a simulation that we can say is certifiably identical to its real-world counterpart. Online RL may also offer more possibilities for exploration than Offline learning from traditional controllers. However, the accuracy of the simulations remains a key issue. As we are experimenting with a novel topology for our approach, there is no widely avail- able large bank of logged BMS interaction data. Within our work, we rely on simulations to generate the necessary samples for the training process, via direct interaction from the RL agent in an online RL setup. Figure 2.6: Taxonomy of RL algorithms Reinforcement learning algorithms are classified into two large categories: model-based and model-free. Model-based algorithms learn an explicit model of the environment, in the form of tran- sition probabilities or a predictive model. They utilize this model to plan and make decisions about actions to take. That is to say, a model-based approach implies the possibility of predicting rewards and states, during training, before an action is taken. Rather than relying on trial and error, the model-based approach uses approximations of the environment to view the possible outcomes before taking a step [32]. 18 2. Background Model-based RL approaches require certainty that the approximation fully captures the essence of the environment, one that is identical to the ground truth. It is also crucial for this approximation to be very efficient in generating predictions, otherwise the time-to- train becomes infeasible. These approaches work best with easily modeled environments with few or no non-linear factors, such as games of Chess and Go [32]. Model-free approaches are applied directly to an environment, and learn through trial and error. A model-free agent must take an action, and see the results, to be able to learn from it. This grants much more flexibility and eliminates the risks of inaccurate models training unusable agents. It comes with the cost of less sample efficiency, as we need to run more simulations for the agent to find the best course of action [32]. Within the model-free methods, we have several classifications: Policy Optimization, and Q-learning methods. Q-learning algorithms are also known as ’off-policy’. They learn a value function, which is then used to derive a policy. Specifically, it learns the action-value function (Q-function), which predicts the expected utility of taking a given action in a given state and following a certain policy thereafter. It does so by estimating the Bellman equation. A big advantage of off-policy methods is that they can use data collected at any point during training, even from past actions from previous, less-trained, versions of the agent. The most popular algorithm representative of this class, which spawned this branch of RL, is Deep Q-Network (DQN) [33]. Policy optimization, or ’on-policy’ methods can only use data collected with the lat- est version of their policy. They learn a policy function that directly maps states to actions. The approach of policy-based methods makes them very stable and able to handle continuous action spaces, and smoothly converge to an optimal policy [18]. On- policy algorithms are ideal for scenarios with a large and continuous action space. 
3 Methods

3.1 Balancing Strategy

As discussed previously, balancing can be categorized as dissipative and non-dissipative. For this project, we propose a non-dissipative balancing strategy that does not rely on transferring charge between cells. We explore the idea of discharging each cell at a different rate, using the normal operational power draw of the vehicle, in order to achieve balance. For this to occur, cells must be balanced during discharge phases, that is, during operation.

Balancing during operation allows the cell pack to be utilized fully, by letting higher-SOC or stronger cells take over part of the work of low-SOC or weak cells. In this way, low-SOC or weak cells are used less, and the overall power demand is distributed equitably between all cells so that aging averages out in a similar manner across the pack, rather than having some cells age more quickly than others. Ideally, when a cell pack reaches the end of its lifetime, all cells should have been fully exploited.

To allow for balancing during operation, we utilize a re-configurable battery pack topology, illustrated in Figure 3.1. In this battery pack, each cell is connected to a power electronics circuit unit whose role is to modulate the amount of charge that the individual cell provides. Each circuit unit, for the purposes of this project, is connected in series with the other units, forming a pack. By modulating the amount of charge each cell provides, it is possible to draw the needed charge from each cell during operation with the aim of achieving balance.

In Figure 3.1, the buck-boost converter represents the power electronics section of the circuit, while the controlling switch represents the circuitry that handles the charge flow accordingly. This is a representation of what the controller power electronics could be expected to look like; the electrical design and its considerations are out of scope for this work. Any similar circuit or system can be used instead of the presented one, as long as the handling circuit is capable of executing the tasks demanded by this work.

Figure 3.1: Balancing-focused re-configurable cell pack topology.

3.2 Battery Simulation Environment

This section provides details about the cell simulation, the data used for the simulations, the interfacing between the simulation environment and the machine learning training environment, and how balancing is handled in the simulation.

3.2.1 Cell Simulation

The cell simulation code was leveraged from previous works [8]; to fit the objectives of this project, the code was modified and extended. The cell simulation reproduces a cell's electrical parameters over time based on the demanded power profile, utilizing the electrical cell model described in Section 2.6 Cell Model. Every cell simulation uses the same cell model and parameters, shown in Table 3.1, which correspond to cell model "P14" in the cell simulation environment. More cell models are available for this work, however only cell model "P14" was used for the final experimental results.

Table 3.1: "P14" cell model parameters.
Parameter   Value          Unit                Description
T           25             degrees Celsius     Temperature of operation of the cell
q           14.53          ampere-hours (Ah)   Cell capacity
eta         0.999          ratio, unitless     Coulombic efficiency
r0          0.00178096     ohms                Series resistance parameter
r           6.477208e-04   ohms                Resistor-capacitor resistance
rc          0.823683       ohms                R-C resistance
rt          2.5e-4         ohms                Cell tab resistance

The temperature parameter for this work has been set to 25 °C. Most cell parameters change with temperature to varying degrees, but remain relatively similar. Since modeling temperature dependence would greatly increase the complexity and running time of the simulation, the temperature is kept constant for the purposes of this work.

Each cell is simulated in a pack of 10 cells which have slight differences in their parameters, in order to simulate factory irregularities and aging differences and to capture the unbalancing characteristics of the cells. If the cells in the simulation had exactly the same parameters, unbalancing would never occur, since they would all behave in exactly the same way. To avoid this, some cell parameters have small variations. This is meant to emulate real cells, given that the manufacturing process cannot possibly yield virtually identical cells, so this parameter variance is reasonable. To have some control over the randomness of the parameter values, each random parameter is seeded via a seed parameter that can be freely chosen for each simulation; this allows simulations to be reproduced exactly while still providing reproducible yet random cell configurations.

The parameters q, eta and r0 for cell i are randomized as follows:

Q_i = q - 0.25 + 0.5 α_i    (3.1)
Eta_i = eta - 0.002 β_i    (3.2)
R0_i = r0 - 0.0005 + 0.0015 γ_i    (3.3)

where α_i, β_i, γ_i ∼ Unif(0, 1). Here α_i, β_i, γ_i and α_j, β_j, γ_j are independent when i ≠ j.

During simulation, different voltages and currents are calculated depending on the power being demanded from each cell, and ultimately the SOC of each cell is computed. SOC is one of the primary values used for balancing; it is also used by the simulation itself, given that cells start from a set initial SOC value from which the rest of the parameters are obtained.

The open circuit voltage (OCV) v_OC is also part of the cell model parameters. The cell open circuit voltage v_OC is a function of the cell SOC and temperature, defined as

v_OC = OCVfromSOC(SOC, T, model)    (3.4)

where SOC is the current state of charge of the cell, T is the temperature, model is the set of "P14" cell model parameters, and OCVfromSOC() is a lookup-table function that returns the cell's v_OC for the given inputs. The cell model contains all the parameters shown in Table 3.1 as well as the voltage-SOC relationship of the simulated cell. This data belongs to the cell parameters we utilize and was obtained through cell characterization tests and measurements, which we leverage in this work. Figure 3.2 illustrates the relationship between v_OC and SOC, where SOC is the free variable.

Figure 3.2: "P14" cell model OCV-SOC relationship, T = 25 °C.

The initial SOC value is the same for all cells; it is the free variable of the simulation and must be set at initialization. It is set to the upper SOC limit at the start of the simulation:

SOC(0) = maxSOC    (3.5)

Later, once the simulation has calculated the voltages and currents flowing through the cells, the SOC is recalculated and updated, and the simulation cycle begins again.
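As a concrete illustration of the seeded randomization in Equations (3.1)-(3.3) and the lookup of Equation (3.4), the sketch below generates a reproducible set of cell parameters and interpolates an OCV value from SOC. The function names and the small OCV table are placeholders only; the thesis relies on measured "P14" characterization data.

```python
# Sketch of seeded cell-parameter randomization (Eqs. 3.1-3.3) and an OCV-from-SOC
# lookup (Eq. 3.4). Function names and the small OCV table are illustrative only.
import numpy as np

def make_cells(n_cells, seed, q=14.53, eta=0.999, r0=0.00178096):
    rng = np.random.default_rng(seed)            # one seed -> reproducible pack
    cells = []
    for _ in range(n_cells):
        alpha, beta, gamma = rng.uniform(0.0, 1.0, size=3)
        cells.append({
            "q":   q - 0.25 + 0.5 * alpha,        # Eq. (3.1)
            "eta": eta - 0.002 * beta,            # Eq. (3.2)
            "r0":  r0 - 0.0005 + 0.0015 * gamma,  # Eq. (3.3)
        })
    return cells

# Placeholder OCV-SOC table; the real curve comes from cell characterization tests.
SOC_GRID = np.array([0.0, 0.25, 0.50, 0.75, 1.0])
OCV_GRID = np.array([3.2, 3.45, 3.70, 3.95, 4.2])

def ocv_from_soc(soc):
    """Interpolating stand-in for the OCVfromSOC lookup of Eq. (3.4)."""
    return np.interp(soc, SOC_GRID, OCV_GRID)

pack = make_cells(n_cells=10, seed=42)            # same seed -> same "random" pack
```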
By defining the maximum and minimum limits for SOC, maxSOC and minSOC respectively, we can use Equation (3.4) to calculate the corresponding maximum and minimum voltage limits of the cell, maxVlim and minVlim respectively, shown in Table 3.2. The simulation parameters that are set at initialization are listed in Table 3.2.

Table 3.2: Cell simulation initial parameters.

Parameter   Value    Unit        Description
maxSOC      0.95     percentage  Upper SOC limit for the cell
minSOC      0.10     percentage  Lower SOC limit for the cell
maxVlim     4.095    volts       Upper voltage limit for the cell
minVlim     3.5185   volts       Lower voltage limit for the cell
leak_c      0.01     ampere      Leakage current of the cell
Tsd         20       Celsius     Self-discharge cell temperature

The values for maxSOC, minSOC, leak_c and Tsd are defined by the simulation, while maxVlim and minVlim are obtained from the OCV-SOC relationship of the cell model via Equation (3.4).

The leakage current, parameter leak_c, aims to simulate cells slowly losing charge over time due to an external factor. This mimics cells becoming discharged because circuitry draws on the cell's charge for computation or control, such as powering the BMS control circuit.

Additionally, cells can exhibit self-discharge behavior; the simulation treats the self-discharge of the cell as an additional current. This is a different current from the leakage current, and it is controlled by the parameter Tsd, the self-discharge cell temperature. The parameters leak_c and Tsd for cell i are randomized as follows:

Leak_c_i = leak_c + 0.002 α_i    (3.6)
Tsd_i = Tsd + 10 β_i    (3.7)

where α_i, β_i ∼ Unif(0, 1). Here α_i, β_i and α_j, β_j are independent when i ≠ j.

Once the initial parameters are defined, the simulation can start calculating the several variables that depend on the discharge power profile used. Cells for passive balancing are considered connected in series and simulated as such, while cells for reconfigurable active balancing are simulated individually, with no direct connection to other cells. For both passive and active balancing, the current i_net represents the current flowing through any cell and is defined as

i_net(t) = i_app(t) + i_self-discharge(t) + i_leakage(t)    (3.8)

where i_app(t) is the discharge current, dependent on the power profile, i_self-discharge(t) is the self-discharge current and i_leakage(t) is the leakage current. The discharge current i_app is initialized as

i_app(0) = 0    (3.9)

Cell pack simulation for the passive balancing topology:

\sum_{j=1}^{N} (v_{OC,j} - v_{RC,j}) \, i - \sum_{j=1}^{N} R_{0,j} \, i^2 = P \times N    (3.10)

where N is the number of cells in the battery pack, i = i_app is the current through the pack, and P is the power usage of a single cell, read from the drive cycle profile. For cell j, v_{OC,j} is the open circuit voltage, v_{RC,j} is the RC circuit voltage, and R_{0,j} is the series resistance. To calculate i_app with all voltages positive, we use

i_app = \frac{\sum_{j=1}^{N} (v_{OC,j} - v_{RC,j}) - \sqrt{\left( \sum_{j=1}^{N} (v_{OC,j} - v_{RC,j}) \right)^{2} - 4 (P \times N) \sum_{j=1}^{N} R_{0,j}}}{2 \sum_{j=1}^{N} R_{0,j}}    (3.11)

Cell pack simulation for the reconfigurable active balancing topology: for cell j, j ∈ {1, 2, · · · , N},

(v_{OC,j} - v_{RC,j}) \, i_j - R_{0,j} \, i_j^{2} = P    (3.12)

where i_j = i_app,j is the discharging current, P is the power usage of a single cell, obtained from the drive cycle profile, v_{OC,j} is the open circuit voltage, v_{RC,j} is the RC circuit voltage, and R_{0,j} is the series resistance.
To calculate i_app,j with all voltages positive, we use

i_app,j = \frac{(v_{OC,j} - v_{RC,j}) - \sqrt{(v_{OC,j} - v_{RC,j})^{2} - 4 P R_{0,j}}}{2 R_{0,j}}    (3.13)

The self-discharge current i_self-discharge(t) for cell j is defined as

i_self-discharge,j(t) = \frac{v_{OC,j} - v_{RC,j}}{\left( (-20 + 0.4 \, Tsd) \times SOC_j + (35 - 0.5 \, Tsd) \right) \times 1000}    (3.14)

The leakage current i_leakage of cell j remains constant, with the value given by Equation (3.6), and is defined as

i_leakage,j = Leak_c_j    (3.15)

The RC voltage v_RC(t) of any cell i is the voltage across the RC components of the ECM model, which is necessary for calculating the terminal voltage v_t(t) of the cell. The RC voltage v_RC(t) can be calculated as

v_RC,i(t) = r_i × i_RC,i(t)    (3.16)

where r_i is the RC resistance, from Table 3.1, and i_RC,i(t) is the current through the RC components of the cell model. The RC current i_RC(t) of cell i is defined by

i_RC,i(0) = 0    (3.17)
i_RC,i(t + 1) = rc_i × i_RC,i(t) + (1 - rc_i) × i_net,i(t)    (3.18)

where rc_i is the RC resistance value (cell parameter from Table 3.1) and i_net,i(t) is the current flowing through the cell.

With the cell OCV voltage v_OC and the RC voltage v_RC, the terminal voltage v_t of cell i can be calculated as

v_t,i(t) = v_OC,i(t) - v_RC,i(t) - i_net,i(t) × r0_i    (3.19)

where v_t,i(t) is the terminal voltage, v_OC,i(t) is the open circuit voltage, v_RC,i(t) is the RC voltage, i_net,i(t) is the current flowing through the cell and r0_i is the r0 resistance parameter of the cell.

Finally, the SOC of the cell can be calculated, which becomes the new SOC value for the next simulation iteration. The SOC of cell i is calculated as

SOC_i(t + 1) = SOC_i(t) - \frac{1}{3600} \times \frac{i_{net,i}(t)}{q_i}    (3.20)

where i_net,i(t) is the current flowing through the cell, and q_i is the cell's capacity.

Additionally, in this work we use State of Health (SOH) as a secondary balancing parameter alongside SOC. For SOH, the power expended by every cell is captured and accumulated. The SOH calculation for cell i is defined as

SOH_i(0) = |(v_OC,i(0) - v_RC,i(0)) × i_net,i(0)|    (3.21)
SOH_i(t) = SOH_i(t - 1) + |(v_OC,i(t) - v_RC,i(t)) × i_net,i(t)|    (3.22)
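To show how the equations above combine in a single discharge time step of one cell, the following simplified sketch implements Equations (3.13) and (3.16)-(3.20). Variable names are assumptions, `cell` is assumed to be a dictionary holding the Table 3.1 parameters (q, r0, r, rc), and the leakage and self-discharge currents of Equation (3.8) are omitted for brevity.

```python
# One simplified discharge step for a single cell (illustrative sketch).
# Implements Eqs. (3.13), (3.16), (3.18), (3.19), (3.20); leakage and
# self-discharge currents are left out to keep the example short.
import math

def cell_step(soc, i_rc, p_demand, cell, ocv_from_soc, dt=1.0):
    v_oc = ocv_from_soc(soc)                                    # Eq. (3.4) lookup
    v_rc = cell["r"] * i_rc                                     # Eq. (3.16)
    # Discharge current: solve (v_oc - v_rc)*i - r0*i^2 = P, smaller root, Eq. (3.13)
    i_app = ((v_oc - v_rc)
             - math.sqrt((v_oc - v_rc) ** 2 - 4.0 * p_demand * cell["r0"])) / (2.0 * cell["r0"])
    i_net = i_app                                               # Eq. (3.8) without leakage terms
    v_t = v_oc - v_rc - i_net * cell["r0"]                      # terminal voltage, Eq. (3.19)
    i_rc_next = cell["rc"] * i_rc + (1.0 - cell["rc"]) * i_net  # RC current update, Eq. (3.18)
    soc_next = soc - (dt / 3600.0) * i_net / cell["q"]          # SOC update, Eq. (3.20)
    return soc_next, i_rc_next, v_t
```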
3.2.2 Drive Cycle Profiles

The demanded power for each cell refers to the power output each cell is expected to provide at any given moment. The simulation converts this power value into the current that must flow through the cells in order to supply the demanded power, using dynamic variables such as voltage together with the internal cell parameters. From the power demand, the current expected to flow through each cell is also calculated. The current flowing through each cell will either charge or discharge it, depending on which state of operation the cells are in.

Since the cell simulation requires a power demand profile that makes cells provide energy and become discharged over time, the simulation uses drive cycle profiles based on real-world data, leveraged from previous works. This allows cells to be discharged at a rate as close to a real-world scenario as possible, so that more realistic cell discharge behaviors are captured. The drive cycle profiles are functions of expected power demand versus time; in the simulation, power is measured in watts and time in seconds. There are multiple drive cycle profiles of an EV's power demand over time in a given driving environment, such as a city, a highway or an urban area. Each of these scenarios has very different power demand characteristics: on a highway the power demand is highest, since maintaining a high speed is energy-intensive, in contrast to driving slowly through an urban area with many stopping points.

Figure 3.3: Highway (us06) and urban (udds) drive cycle profiles.

Figure 3.3 shows the power demand profiles for one cell. The highway drive cycle profile is "us06" and the urban drive cycle profile is "udds", both leveraged from [34]. These are the two main profiles used in the simulation experiments. The "us06" profile discharges cells at a much higher rate and has a short duration, which means that cells under constant discharge with this profile need to be charged more often. The "udds" profile, on the other hand, has a lower power demand and a longer duration, which translates to cells discharging much more slowly and requiring charging less often.

3.2.3 Cell Simulation States

The cell simulation accounts for three different states that cells can be in: charging, discharging, and resting. These three states aim to imitate the normal operation of cells in a real system. Cells become discharged during usage, are charged when they reach a certain lower threshold, and rest when the system is not in operation. Figure 3.4 shows the state flow of the cell simulation.

Figure 3.4: State flow chart of the cell pack simulation.

The simulation time scale depends on the power profiles, and each simulation step simulates one second of operation. At each step, the SOC and other variables are calculated.

During the "discharging" phase, cells become discharged according to the demanded power profile as well as the leakage and self-discharge currents. At every time step, the SOC is calculated for each cell, and the cell SOCs and voltages are then checked; if they fall below a certain threshold, the simulation state changes to "charging". Any cell i is considered discharged when

v_t,i(t) ≤ minVlim  OR  SOC_i ≤ minSOC    (3.23)

where v_t,i(t) is the terminal voltage of the cell and SOC_i is the SOC of the cell. If either condition is true, the simulation state changes to "charging".

In order to control how long cells are under operation and at rest, two counter variables are used. The internal parameter usageCounter controls the time cells spend charging and discharging, and the internal parameter restingCounter controls how long cells rest. After usageCounter finishes, restingCounter starts. Once restingCounter ends, this is counted as one cycle. The number of cycles the simulation runs for is provided by the user during the set-up and configuration of the simulation environment (Section 3.7), illustrated in Figure 3.4 as N. Once the simulation has executed N cycles, it ends.

In the "charging" state, cells are charged at a rate of 6.6 kW. Once any cell in the pack has reached its maximum voltage or SOC threshold, charging stops and the simulation returns to the discharging state. During charging, cells cannot receive all of the supplied charge due to inefficiencies in energy transfer. The parameter that defines this transfer efficiency is the Coulombic efficiency, eta (η), as defined in Table 3.1. This parameter affects the charging current directly.
When charging, the current i_net(t) flowing through the cells is defined as

i_net(t) = i_app(t) × eta + i_self-discharge(t) + i_leakage    (3.24)

where i_app(t) is the charging current, i_self-discharge(t) is the self-discharge current and i_leakage(t) is the leakage current. Any cell i is considered charged when

v_t,i(t) ≥ maxVlim    (3.25)

where v_t,i(t) is the terminal voltage of the cell.

The simulation reaches the "resting" state after usageCounter ends, and remains in it until restingCounter ends. Essentially, after the simulation has been "discharging" for a certain amount of time, it changes to "resting" to emulate cells under no load. During this state, cells are only subject to the leakage and self-discharge currents. If cells were left resting indefinitely, they would eventually discharge completely.

During the resting state, cells are neither charged nor discharged, which means the only currents flowing through the cells are the leakage and self-discharge currents. As such, any cell is considered at rest when the current i_app(t) is zero. Starting from Equation (3.8), the resting current is defined by

i_app(t) = 0    (3.26)

and thus

i_net(t) = 0 + i_self-discharge(t) + i_leakage    (3.27)
i_net(t) = i_self-discharge(t) + i_leakage    (3.28)

where i_net(t) is the total current flowing through any cell during rest.

Figure 3.5 shows how the SOCs of different cells evolve over time during several simulation cycles and phases, together with a plot of the difference between the maximum and minimum SOC; this illustrates how cells tend to drift out of balance over time. Note that the SOC differences grow steadily over time when no balancing policy is controlling the simulation.

Figure 3.5: Cell pack SOCs and SOC difference during discharging, charging, and resting phases.

3.2.4 Cell Balancing Simulation

Balancing in the simulation is handled for both passive and reconfigurable active balancing. For this to work together with the RL model training environment, the simulation has a feedback loop through which it obtains the actions the RL model wants to take and applies them to the simulation, directly affecting the balancing calculations. This means the simulation cannot continue until it has received actions from the RL model, given that this is an online training methodology.

Due to resource limitations when executing the simulation environment and passing information to the RL training environment, data passing had to be down-sampled between processes. Sending every simulation step to the RL training environment is very slow, since it involves transferring information to another process and waiting for that process to compute the next action and return it to the simulation. As a result, the simulation uses a down-sampling factor of 30, which drastically reduced simulation and training time. As a consequence, the actions applied when balancing persist for 30 simulation steps, until the next value is received and applied.

Given that active and passive balancing occur at different stages of cell operation, the balancing strategy and feedback differ for each. For passive balancing, cells are discharged over time during the resting phase only. For the active balancing approach presented in this project, balancing occurs during cell operation only.
In any phase where cells cannot be balanced, they will naturally drift out of balance if no balancing strategy is applied.

Passive balancing feedback is treated in the simulation as a discharging current that flows through each cell individually. Each cell becomes discharged based on the balancing strategy of the RL model. The RL balancing model has full control over the cell discharge current, and its feedback is applied as-is to the cells. Since this balancing can only occur during resting windows, the current feedback is only applied during the resting phase and ignored in the other phases. When the resting phase ends, whose timing the RL agent cannot control, balancing no longer occurs. For passive balancing, each cell is balanced independently, only during rest, at which point the current flowing through the resting cells is defined by Equation (3.28). To include balancing, an additional balancing current i_balance is added to the equation. The total current i_net(t) flowing through cell i during passive balancing is

i_net,i(t) = i_balance,i(t) + i_self-discharge,i(t) + i_leakage,i    (3.29)

where i_net,i(t) is the total current flowing through the cell, i_self-discharge,i(t) is the self-discharge current of the cell, i_leakage,i is the leakage current of the cell and i_balance,i is the balancing current of the cell. The balancing current is controlled externally to the simulation.

Active balancing is approached differently in this work. Because active balancing occurs during the discharge phase, cells almost always have current flowing through them. With the balancing strategy proposed in this project, each cell has a different amount of current flowing through it, controlled by the power electronics. The power electronics enable the balancing-during-operation strategy by allowing varying amounts of charge to be drawn from individual cells at any given time. For this, the balancing feedback is treated as a percentage that represents the share of power each cell will provide relative to the power demanded by the simulation profile. For example, if a cell pack is demanded a set amount of power from each cell, the balancing feedback can distribute the demanded power differently across the cells, allowing low-SOC cells to provide less charge and high-SOC cells to provide the rest. In this way, the power delivered by the pack is the same, but each cell contributes a different amount.

Active balancing with this work's approach occurs only during discharge. By modulating the power demand of each cell, the amount of charge each cell provides differs, reflected in the discharge current i_app,j of each cell, as shown in Equation (3.13). In order to control the power of each cell, we introduce a control variable C, obtained externally from the RL model, which acts as the balancing factor. Each cell has a different control value C, and it directly affects the power usage P of each cell. Normally, when there is no balancing, each cell is demanded the same power P, and the pack power demanded is P × N, where N is the total number of cells. This must always hold in order to provide the correct power demanded from the pack. As such, the control variable C must also respect this constraint; thus the following must always hold true for cell j, j ∈ {1, 2, · · · , N}:
\sum_{j=1}^{N} (P \times C_j) = P \times N    (3.30)

where C_j is the control variable, obtained from the RL model, P is the power demanded from a single cell, obtained from the drive cycle profile, and N is the total number of cells.

Additionally, the control variable C of any cell must never be 0. This case is not realistic and could cause harm to a real cell pack. For this reason, the control variable C must always lie in a safe range, which we define as follows: for cell j, j ∈ {1, 2, · · · , N},

0.5 ≤ C_j ≤ 1.5    (3.31)

where C_j is the control variable of the cell. With the control variable C defined, the discharge current i_app of any cell during active balancing is defined as

i_app,j = \frac{(v_{OC,j} - v_{RC,j}) - \sqrt{(v_{OC,j} - v_{RC,j})^{2} - 4 (P \times C_j) R_{0,j}}}{2 R_{0,j}}    (3.32)

where i_app,j is the discharging current, P is the power demand for a single cell, obtained from the drive cycle profile, C_j is the modulating control variable, obtained from the RL model, v_{OC,j} is the open circuit voltage, v_{RC,j} is the RC circuit voltage, and R_{0,j} is the series resistance.

The balancing feedback for each cell is not controlled or changed in the simulation itself; it must therefore be provided correctly from outside, since it is applied as-is.

3.3 Reinforcement Learning Model

This section describes the reinforcement learning algorithms and methods devised to train on the battery pack simulation.

3.3.1 RL environment

The simulation environment is interfaced with the RL agent via an action space and an observation space. The action space represents the parameters of the environment which can be directly altered by the RL agent. We have a different action space for each balancing topology. For passive balancing, the action space is the balancing current, the additional current discharged from each cell by the BMS. The additional balancing discharge current values are between 0 and 1 ampere, where 0 means no additional discharge current. For reconfigurable active balancing during discharge, the action space represents the usage percentage of each cell, an abstraction of the duty-cycling of the power electronics linked to each cell. The usage percentages are between 50% and 150% for each cell. The total usage values of the cells must always provide the same total power to the EV, according to the drive-cycle demand. The vehicle power demand is not altered by the topology and controller; they can only alter the way power is drawn from the cells, by requesting more power from some cells and less from others to compensate.

Figure 3.6: State transition within a training episode

Both topologies use the same observation space, which consists of the SOC values and the energy throughput of each cell in the pack. Each state represents an instance of the observation space, as seen in Figure 3.6. The agent observes a given state, chooses an action according to its latest policy, and broadcasts the action, which is then applied to the environment, leading to the following state. This state transition is then assigned a reward according to the reward function. The process continues until the transition to the final state.
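One possible way to express these interfaces in code is with gym-style space declarations; the sketch below uses the Gymnasium library as an assumed wrapper (whether this matches the project's actual tooling is not specified here) and also shows a simple projection of raw usage factors onto the constraints of Equations (3.30) and (3.31).

```python
# Gym-style declaration of the action/observation spaces described above, plus a
# simple projection of raw usage factors onto Eqs. (3.30)-(3.31). Assumed tooling
# (Gymnasium); not necessarily what the thesis implementation uses.
import numpy as np
from gymnasium import spaces

N_CELLS = 10

# Passive balancing: additional discharge current per cell, 0..1 A
passive_action_space = spaces.Box(low=0.0, high=1.0, shape=(N_CELLS,), dtype=np.float32)

# Reconfigurable active balancing: per-cell usage factor C_j in [0.5, 1.5]
active_action_space = spaces.Box(low=0.5, high=1.5, shape=(N_CELLS,), dtype=np.float32)

# Observation: SOC and accumulated energy throughput for each cell
observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2 * N_CELLS,), dtype=np.float32)

def project_usage_factors(raw_c):
    """Clip to the safe range (Eq. 3.31) and rescale so the factors sum to N (Eq. 3.30);
    the final clip can leave a small residual in the sum for extreme inputs."""
    c = np.clip(raw_c, 0.5, 1.5)
    c = c * N_CELLS / c.sum()
    return np.clip(c, 0.5, 1.5)
```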
State transitions in RL are designed to follow the Markov property, given in Equation (3.33):

P(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x_{n+1} \mid X_n = x_n), \quad \forall n \in \mathbb{N}    (3.33)

The Markov property asserts that the current state of the environment encapsulates all relevant information needed to determine the future state, rendering the history of previous states and actions unnecessary. Formally, in a Markov Decision Process (MDP), the probability distribution of transitioning to the next state, conditioned on both the current state and action, is independent of past states and actions. This property simplifies the learning process by allowing RL agents to make decisions based solely on the current state, facilitating more efficient and scalable algorithms [29].

The reward function must take this property into consideration and only assign rewards for the transition from one state to the next, according to the intermediate action. Individual state-transition rewards must not take past or future rewards into account. The mechanisms for such behaviour are handled at higher levels within the RL algorithm, not within the state transitions themselves.

3.3.2 Action sampling rate

As sample generation is bound by computational power in online RL, we make efforts to optimize the sample generation speed, at the cost of granularity in actions. Instead of computing a new action for each subsequent timestep of the simulation, we apply an action for multiple timesteps before interrupting, receiving the resulting state, and assigning the reward for the transition according to the reward function. The frequency at which a new state is produced depends on the sampling rate of the simulation. All states are linked together by actions, except the initial state of an episode. An episode is a full run of a simulation for a set number of cycles. For our balancing objective, we simulate many charge-discharge cycles across the lifetime of a pack within a vehicle.

In order to reduce the computational load, by limiting the interruptions of the simulation for additional action commands from the RL agent, we sample the environment and return an action once every 30 in-simulation seconds, which constitutes a sampling rate of 30. We thus reduce the time needed to complete a simulation by reducing the amount of context switching.

This sampling choice also affects how the reward function interacts with the environment and shapes our learning objective. As actions last for several timesteps, aggressive BMS actions such as large discharge currents can leave the cells unbalanced, which the RL agent learns to account for. As a tangential benefit, this acts as a limit on high discharge rates in the case of the passive topology, since high C-rates are undesirable in battery systems: they significantly shorten the lifetime of a pack. The RL agent is given 30 timesteps in which to gradually discharge the battery towards a balanced state, rather than attempting to do so on a second-by-second basis; experimentation has shown that the latter consistently leads to the use of maximum C-rates between seconds unless C-rate limits are included as part of the reward function.

The C-rate limiting offered by this approach reduces the complexity of the reward function, which allows for much more reliable discovery of optimal policies. As the complexity of the reward function increases, so does the difficulty for the algorithm of discovering relationships between state parameters and actions.
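A common way to realize such a sampling rate is an action-repeat wrapper around the simulation; the sketch below (hypothetical interface, simplified reward handling) holds each agent action for 30 one-second simulation steps before control returns to the agent.

```python
# Action-repeat wrapper realizing a sampling rate of 30: each action chosen by the
# agent is held for 30 one-second simulation steps. Hypothetical interface, not the
# actual project tooling; reward handling is simplified to one call per transition.
class ActionRepeatWrapper:
    def __init__(self, sim_env, repeat=30):
        self.sim_env = sim_env
        self.repeat = repeat

    def reset(self):
        return self.sim_env.reset()

    def step(self, action):
        obs, done = None, False
        for _ in range(self.repeat):          # hold the same action for `repeat` seconds
            obs, done = self.sim_env.step(action)
            if done:                          # e.g. the episode's cycle budget ran out
                break
        reward = self.sim_env.reward(obs)     # reward assigned to the resulting transition
        return obs, reward, done
```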
When rewards are delayed, however, it can become challenging for the agent to understand which actions contributed to the eventual reward. This is known as the temporal credit assignment problem. Methods like eligibility traces in TD learning help mitigate this by attributing rewards to actions taken earlier.

3.3.3 Reward Design

There is no guarantee of convergence to an optimal policy for RL algorithms, and each new training cycle can lead to different results. An improperly defined reward function can easily lead to poor results. Achieving a truly optimal policy is better understood as an aspirational goal rather than a guaranteed outcome.

'Frequency' and 'scale' matter greatly for actions and rewards, as they directly influence the learning process and the efficiency of the policy. If some rewards are scarce, the agent will never discover how to obtain them and develop the optimal policy. The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states [29]. Rewards that appear in states with low frequency are called sparse rewards. In environments that provide only sparse rewards, the agent struggles to learn effectively because it receives little information about the states it visits and cannot infer what constitutes a good or bad action.

An example of a traditional and simplistic RL reward function is a reward of +1 when all cells are balanced (i.e. when all cells are almost equal in SOC). The simulation starts, and the cells start off in a fresh, balanced state. Because of this, the agent is rewarded at the beginning and then stops being rewarded the moment the cells unbalance. After unbalancing, the agent gets no feedback from the reward about any improvements that push the cells closer to a balanced state; none of its actions can be quantified as an 'improvement'. It only receives +1 in the state of absolute balance, and 0 otherwise. Once the rewards become 0 after the unbalancing, the agent has no indicator of whether it is getting close to balance again or not, and the exploration process fails. Whatever actions the agent takes after the unbalance, it relies entirely on the cells reaching the 'balanced' state again by chance. These events are so infrequent that the cumulative changes in policy are insignificant. No policy can be learned from such a reward in this environment.

Policies are developed through improvement and iteration: the agent starts off with a bad policy and, through trial and error, gradually converges to the best policy. If there is no logical string of improving rewards with which to develop a policy, offering a significant reward change relative to the other rewards in the space, the agent cannot discover an optimum. If supplied only with scarce 'maximum' rewards, the agent cannot infer the path to reach them. Experimentation has shown that, with scarce rewards, during the first few steps of the simulation almost any action taken on the balanced battery predominantly leads to faster unbalancing. The only 'correct' action is inaction, and no further policy can be discerned, as no rewards point the agent towards restoring a balanced state.

Since a reward cannot be given only during a balanced state, it must be given in stages that eventually lead the policy to the maximum reward. This technique is called 'reward shaping'.
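To make the contrast concrete, the sketch below compares the sparse reward just described with a shaped alternative based on the SOC spread of the pack; the formulas are illustrative examples only, not the reward functions used in this work's experiments.

```python
# Sparse vs. shaped reward for SOC balancing (illustrative formulas only).
import numpy as np

def sparse_reward(socs, tol=0.01):
    """+1 only when the whole pack is within `tol` SOC spread, 0 otherwise."""
    return 1.0 if np.max(socs) - np.min(socs) <= tol else 0.0

def shaped_reward(socs):
    """Dense signal: the smaller the SOC spread, the larger the reward."""
    spread = np.max(socs) - np.min(socs)
    return 1.0 - spread                      # every small improvement is rewarded

socs = np.array([0.80, 0.78, 0.83, 0.79])
print(sparse_reward(socs), shaped_reward(socs))   # 0.0 vs. 0.95 for this slightly unbalanced pack
```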
Through reward shaping, we have experimented with reward functions which give the agent 'hints' about how close it is to balancing the cells, through partial rewards. Maximum rewards are given once the objectives are fully achieved.

A common mistake in reward shaping is to once again disregard the frequency and scale of rewards. If we design a partial reward which is large and does not lead on to the maximum reward, the agent will remain focused on the partial reward and never progress to the maximum reward. In an experimental example, a reward which sums instances of +0.1 for each balanced cell can very easily get stuck on balancing only half the cells, or only 9 cells out of a total of 10. The reward difference between that and the states where all 10 cells are balanced is so small (0.9 vs. 1.0) that the policy fails to update. The agent does not consider a variation of just 0.1, occurring with very low frequency, significant enough to update the whole policy, because the discovered rewards are already strong and deviating from them would lose the perceived progress (effectively giving up exploitation); exploration thus fails and the agent never learns to balance all 10 cells.

By introducing a second or third reward factor, such as energy throughput, new considerations must be taken into account. Challenges arise from the increased complexity of balancing multiple objectives and from ensuring that the agent learns an optimal policy that appropriately considers all reward components. Different reward factors might represent conflicting objectives. For example, cell balance and complete energy-throughput uniformity are objectives that partially work against each other, depending on the characteristics of the cells and their behaviour during operation. Balancing these conflicting objectives requires careful tuning to avoid preferential behavior, where the agent consistently ignores one objective in favor of another. Introducing new reward factors can also lead to unforeseen consequences: an agent might exploit loopholes in the reward structure to achieve high rewards without actually performing the intended task. An example of such a loophole was found while experimenting with purely negative reward structures, where the maximum possible reward per state is 0 and all other rewards are negative.

Control theory has many concepts that overlap with RL, and it is the preferred control method in traditional BMS systems. In control theory, the equivalent of the RL 'reward' is cost. While RL typically aims to maximize cumulative rewards, control theory generally focuses on minimizing cumulative costs. A setpoint is established as the reference of the control system, for example in a PID (Proportional-Integral-Derivative) or MPC (Model Predictive Control) system. The cost grows as the measurements diverge from the reference setpoint, the desired values.

Figure 3.7: Control feedback loop

By testing the principles of cost in RL training and relying on purely negative rewards in experiments, several observations were made. The objective shifts to minimizing a cost rather than maximizing a reward. As the agent is 'punished' with more negative rewards the longer it remains in an undesirable state, it prioritizes quickly exiting that state. This introduces 'speed' as a primary factor of the reward function.
However, as the agent tries to escape the future 'punishments' given for each step spent in a disadvantageous position, it quickly discovers that it can forcibly end the simulation early and stop any future costs by crashing the battery pack through over-discharge.

The agent has thus 'minimized' the accumulated costs by stopping the episode from generating any additional cost-incurring states, destroying the battery pack and finishing the simulation. Cost-based behavior has proven difficult to work around in an RL environment, and dangerous in a safety-critical vehicle application. The target environment of the agent must employ strict and thorough constraints in order to function with a cost-based behavior. Working with such a method requires hard constraints that the agent cannot violate, such as safety limits on state and action variables. These restraints would, as a result, heavily limit the possible actions of the BMS controller and the ability of the agent to achieve its task using the full range of possible actions.

Applying a larger penalty at the end of a forced pack over-discharge has also not proven to be an effective way of avoiding such behavior. As stated previously, sparse rewards are not an effective tool in policy optimization. Since the over-discharge state is the last state in the training cycle, the large lump penalty skews the policy into attributing the penalty to the immediately preceding action. Because the over-discharge crash is discovered with very high frequency as part of the learning process, the agent consistently defaults to this behavior under a cost-based reward function, no matter the adjustments to cost weights or additional costs for crashes.

3.3.4 Action/Observation space normalization

For RL algorithms, effective exploration and learning depend on appropriate scaling and normalization of the action and observation spaces, as well as of the issued rewards. Such normalization techniques ensure stable and efficient learning processes for RL agents.

The action and observation spaces in RL environments can vary widely in scale and magnitude, posing challenges for RL algorithms in learning policies and behaviors effectively. Normalization techniques aim to address these challenges by bringing consistency and stability to these spaces. Proper normalization facilitates smoother convergence, improves exploration-exploitation trade-offs, and enhances the generalization capabilities of RL agents across diverse environments. The functions which serve as the basic building