Reinforcement Learning-Based Cell Balancing for Electric Vehicles Master’s thesis in Computer science and engineering GIOVANNI MAZZOLO MATEI SCHIOPU Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2024 Master’s thesis 2024 Reinforcement Learning-Based Cell Balancing for Electric Vehicles GIOVANNI MAZZOLO MATEI SCHIOPU Department of Computer Science and Engineering Chalmers University of Technology Gothenburg, Sweden 2024 Reinforcement Learning-Based Cell Balancing for Electric Vehicles GIOVANNI MAZZOLO MATEI SCHIOPU © GIOVANNI MAZZOLO, MATEI SCHIOPU, 2024. Supervisor: Dr. Yang Xu, Volvo Group Advisor and Examiner: Pedro Petersen Moura Trancoso, Department of Computer Sci- ence and Engineering Master’s Thesis 2024 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX Gothenburg, Sweden 2024 iv Reinforcement Learning-Based Cell Balancing for Electric Vehicles GIOVANNI MAZZOLO MATEI SCHIOPU Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract Lithium-ion battery packs are comprised of hundreds to thousands of individual cells which, even though manufactured uniformly, exhibit small variations in their character- istics that impact their behavior during operation. These differences cause cells’ State of Charge (SOC) to become unbalanced, which can, in turn, reduce the capacity uti- lization efficiency of the pack [1]. Additionally, battery cells age differently over time, and fast-aged cells can cause packs with healthy cells to be retired early, without fully taking advantage of each cell. When a battery has deteriorated to around 80% of its total capacity, it is retired from electric vehicle usage [2]. To maintain batteries functioning correctly, cell SOC balancing must be done on battery packs. However, balancing the SOC of cells provides a window of opportunity to also include cells’ health into the balancing equation, aiming for the homogenization of cell aging, allowing to thoroughly utilize a battery’s resources. In this way, it is possible to both keep batteries in operating condition and potentially increase their lifespan. In this work, we develop and research a multi-cell simulation framework and Reinforce- ment Learning (RL) methodologies to explore the potential of cell SOC and health balancing. We propose an active balancing strategy for re-configurable cell topology with RL, in which instead of transferring energy between high SOC cells to low SOC cells, cell utilization is modulated so that the power consumption is optimally distributed based on each cell’s SOC. This strategy is applied to SOC balancing, as well as SOC and State of Health (SOH) balancing simultaneously, to potentially allow for an exhaustive utilization of the battery’s potential. Keywords: Battery, cell balancing, reinforcement learning, lithium-ion batteries, automo- tive, computer science, engineering, deep learning. v Acknowledgements We would like to express our deepest gratitude to our supervisor Dr. Yang Xu during our thesis project at Volvo Group, for always showing kindness, patience, willingness to help, and prioritizing assisting us on the project even during busy days. He helped us broaden our views, offered advice, and motivated us to do our best and to continue learning. 
Without his great experience, knowledge, and commitment to help, this work would not have been possible nor have been as enjoyable as it was. Even when at times we made mistakes or when we were not at our best, he made sure to always stay positive and supported us to keep moving forward. We sincerely thank you. Many thanks to the Volvo BMS team for the feedback, and guidance and for welcoming us to the team during our thesis work. Thank you for the amazing insights into the world of lithium batteries. We would also like to thank our Chalmers supervisor, Pedro Petersen Moura Trancoso for supporting us during the thesis work. This Master’s thesis was conducted at Volvo GTT in conjunction with Chalmers Univer- sity of Technology, Department of Computer Science and Engineering. Giovanni Mazzolo and Matei Schiopu, Gothenburg, 2024-06-26 vii Contents Acronyms xi Nomenclature xiii List of Figures xv List of Tables xvii 1 Introduction 1 1.1 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background 7 2.1 Battery Management Systems . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 State of Charge (SOC) . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 State of Health (SOH) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Causes of imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Cell Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5.1 Passive Cell Balancing . . . . . . . . . . . . . . . . . . . . . . . 12 2.5.2 Active Cell Balancing . . . . . . . . . . . . . . . . . . . . . . . 13 2.6 Cell Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.1 Equivalent-circuit Model (ECM) . . . . . . . . . . . . . . . . . 13 2.7 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Methods 21 3.1 Balancing Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 Battery Simulation Environment . . . . . . . . . . . . . . . . . . . . . . 22 3.2.1 Cell Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.2 Drive Cycle Profiles . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.3 Cell Simulation States . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.4 Cell Balancing Simulation . . . . . . . . . . . . . . . . . . . . . 31 3.3 Reinforcement Learning Model . . . . . . . . . . . . . . . . . . . . . . 33 ix Contents 3.3.1 RL environment . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.2 Action sampling rate . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Reward Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.4 Action/Observation space normalization . . . . . . . . . . . . . 38 3.3.5 Algorithm analysis . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.6 PPO vs. TD3 vs. SAC . . . . . . . . . . . . . . . . . . . . . . 44 3.3.7 Training techniques . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5 RL Tooling . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.6 MATLAB-Python Interfacing . . . . . . . . . . . . . . . . . . . . . . . 55 3.6.1 Memory-map Interface Implementation . . . . . . . . . . . . . . 56 3.7 Running the Cell Simulation Environment . . . . . . . . . . . . . . . . . 59 3.7.1 System specifications . . . . . . . . . . . . . . . . . . . . . . . 60 4 Results 63 4.1 Control Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 SOC only Active Balancing . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Active SOC and SOH Balancing . . . . . . . . . . . . . . . . . . . . . . 69 4.3.1 SOC and SOH Balancing with 1% Threshold . . . . . . . . . . . 71 4.3.2 SOC and SOH Balancing with 2.5% Threshold . . . . . . . . . . 74 4.3.3 SOC and SOH Balancing with 3.5% Threshold . . . . . . . . . . 77 4.3.4 SOC and SOH Balancing with 5% Threshold . . . . . . . . . . . 80 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5 Conclusion 85 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Bibliography 87 A Appendix I A.1 Python Cell Simulation Execution Example . . . . . . . . . . . . . . . . I A.2 Additional Simulation Runs . . . . . . . . . . . . . . . . . . . . . . . . IV x Acronyms Below is the list of acronyms that have been used throughout this thesis: BMS Battery management system RL Reinforcement learning ML Machine learning AI Artificial intelligence EV(s) Electric vehicle(s) SOC State of charge SOH State of health SOQ State of capacity SOR State of resistance SOX State of charge, health, capacity, resistance DOD Depth of discharge ET Energy throughput PPO Proximal policy optimization SAC Soft Actor-Critic DPG Deep Policy Gradient DDPG Deep Deterministic Policy Gradient TD3 Twin Delayed DDPG OCV Open circuit voltage DQN Deep Q-Network RC Resistor–capacitor PID Proportional–integral–derivative MPC Model Predictive Control ADAM Adaptive Moment Estimation xi Contents gSDE Generalized State-Dependent Exploration EM Electro-chemical Models ECM Equivalent-circuit Models DDM Data-driven Models UCB Upper Confidence Bound RUL Remaining-useful-life BPNN Back Propagation Neural Network RBNN Radial Basis Neural Network LSTM Long Short Term Memory ANN Artificial Neural Network xii Nomenclature Reinforcement learning variables γ Discount factor π Policy σ Logical sigmoid function ET (s, i) Energy throughput of a cell in a certain state R(t) Return. Total sum of rewards starting from timestep t onwards, mod- ified by discount factor rt Reward assigned at timestep t Other symbols Cj Balancing feedback for power modulation Physics notations η Coulombic Efficiency Ah Ampere hour (Amp × hour) C − rate Charge rate i(t) Current iapp(t) Discharge or charging current ibalance(t) Balancing current ileakage(t) Leakage current inet(t) Total current iself−discharge(t) Self discharge current P Power Q Capacity qi Cell capacity R Resistance xiii Nomenclature R0 Cell R0 resistance v(t) Voltage vOC(t) Open Circuit Voltage vRC(t) RC circuit Voltage vt(t) Terminal Voltage z(t) State of charge iRC(t) RC circuit current xiv List of Figures 2.1 The composition of EV battery packs . . . . . . . . . . . . . . . . . . . 7 2.2 The immediate effects of cell imbalance . . . . . . . . . . . . . . . . . . 11 2.3 First order Thevenin model of a cell. . . . . . . . . . . . . . . . . . . . 14 2.4 RL training principles . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 15 2.5 RL agent state-action-reward unit . . . . . . . . . . . . . . . . . . . . . 15 2.6 Taxonomy of RL algorithms . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1 Balancing focused re-configurable cell pack topology. . . . . . . . . . . . 22 3.2 ”P14” cell model of OCV and SOC relationship, T=25 C°. . . . . . . . . 24 3.3 Highway (us06) and urban (udds) drive cycle profiles. . . . . . . . . . . 28 3.4 State flow chart of the cell pack simulation. . . . . . . . . . . . . . . . 29 3.5 Cell pack SOCs and SOC difference during discharging, charging, and resting phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.6 State transition within a training episode . . . . . . . . . . . . . . . . . 34 3.7 Control feedback loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.8 Logistic sigmoid curve . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.9 Probability distribution - normal (Gaussian) distribution . . . . . . . . . 45 3.10 SAC mean episode reward - 2.5 million training steps to convergence . . 47 3.11 TD3 training instability - evaluation episodes . . . . . . . . . . . . . . . 48 3.12 Curriculum learning environment phases . . . . . . . . . . . . . . . . . 49 3.13 TD3 - Diverging after convergence . . . . . . . . . . . . . . . . . . . . 50 3.14 Illustration of two different processes sharing memory through a memory- mapped file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.15 Illustration of the memory-mapped files used for passing data between processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.16 Organization of the address spaces of the memory-mapped files. . . . . . 58 4.1 Simulation of 10 cells on the passive topology with no balancing. . . . . 63 4.2 Simulation of 10 cells on the active topology with no balancing. . . . . . 64 4.3 Passive balancing simulation of 10 cells - utilization = [10, 2]. . . . . . . 66 4.4 Close-up of 4.3 of steps 2025 to 2150. . . . . . . . . . . . . . . . . . . 67 4.5 Active balancing simulation of 10 cells - utilization = [10,2]. . . . . . . . 68 4.6 Active balancing simulation of 10 cells - close-up. . . . . . . . . . . . . . 69 4.7 Active balancing for SOC only (0% threshold) . . . . . . . . . . . . . . 71 4.7 Active balancing for SOC and SOH - 1% threshold - case 1. . . . . . . . 72 4.8 Active balancing for SOC and SOH - 1% threshold - case 2. . . . . . . . 73 xv List of Figures 4.9 Active balancing for SOC and SOH - 1% threshold - case 3. . . . . . . . 74 4.10 Active balancing for SOC and SOH - 2.5% threshold - case 1. . . . . . . 75 4.11 Active balancing for SOC and SOH - 2.5% threshold - case 2. . . . . . . 76 4.12 Active balancing for SOC and SOH - 2.5% threshold - case 3. . . . . . . 77 4.13 Active balancing for SOC and SOH - 3.5% threshold - case 1. . . . . . . 78 4.14 Active balancing for SOC and SOH - 3.5% threshold - case 2. . . . . . . 78 4.15 Active balancing for SOC and SOH - 3.5% threshold - case 3. . . . . . . 79 4.16 Active balancing for SOC and SOH - 5% threshold - case 1. . . . . . . . 80 4.17 SOC and SOC 5% threshold - case 1 - First 4000 steps. . . . . . . . . . 81 A.1 Active SOC balancing test - 1000 cycles - utilization [10, 2]. . . . . . . . IV A.2 Active SOC balancing test - 1000 cycles - utilization [20, 2]. . . . . . . . V xvi List of Tables 3.1 ”P14” cell model parameters. . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Cell simulation initial parameters. . . . . . . . . . . . . . . . . . . . . . 
25 3.3 Simulation parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4 EvalCallback function parameters . . . . . . . . . . . . . . . . . . . . . 51 3.5 SAC training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6 Hyperparameters used within the experiments - SAC . . . . . . . . . . . 54 3.7 Cell simulation parameters description . . . . . . . . . . . . . . . . . . . 59 3.8 System specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.1 Simulation training parameters for SOC only balancing. . . . . . . . . . 65 4.2 Simulation parameters of the trained SOC and SOH balancing models. . 70 4.3 Simulation parameters for testing SOC and SOH balancing models. . . . 70 xvii List of Tables xviii 1 Introduction Electric vehicles (EVs) have seen a massive surge in popularity over the past years, with reports from the International Energy Agency showing over 26 million EVs on the road in 2022, with more on the rise [3], giving way to a sustainable transportation method for people and industries. The increase in demand for electric vehicles goes hand in hand with battery demand, which comes with its own unique set of challenges. The battery pack comprises the most expensive component in EVs, with prices of $200/kWh [4]. Once the capacity of a battery degrades to a certain level, around 80%, the pack is deemed expired and must be replaced. The preservation of battery lifetime is thus integral and highly desirable in the field. Maximizing the amount of charge within a battery pack is yet another important research subject within battery technologies, to grant vehicles as much range as possible. Cell balancing is the process of bringing all cells in a pack to similar charge levels. Both battery lifetime and performance depend on the balancing mechanism of the battery. A lack of balancing will quickly render a battery unusable [5]. The widely-used industry balancing applications are using relatively simple topologies using control algorithms that are not battery health-aware and cannot effectively make use of all of the sensor data at their disposal. Traditional control mechanisms that efficiently follow multiple balancing objectives are difficult to develop, as they rely on precise modeling of battery pack relationships [6] [7]. Due to the non-linear and estimative nature of battery dynamics, when multiple objectives are pursued, the complexity of developing such control systems grows exponentially. Traditional balancing topologies rely on passive balancing, which wastes the charge from the cells, or active balancing, which attempts to transfer energy between cells until they are balanced. Both of these balancing methods happen during resting periods when the battery is not in use [8]. Within this thesis, we aim to introduce a cell balancing method that functions during vehicle operation by modulating cell utilization via reconfigurable cells according to the charge and capabilities of each cell. We develop a reinforcement learning (RL) balancing controller for this topology, training it to handle the complexity of such a topology, and attempt to balance the State of Charge (SOC) in a health-aware manner, preserving the State of Health (SOH). 1 1. Introduction 1.1 Research objectives We propose a new reconfigurable cell balancing topology, which aims to balance cells dur- ing operation. 
Our proposed topology utilizes a reconfigurable battery solution alongside power electronics that allows balancing to take place during cell operation, introducing the possibility of balancing outside rest phases - which is the norm for current battery management systems. Such a topology requires a complex controller that can interface with it and follow the balancing objective. RL agents can infer the required control actions without the requirement of relationship modeling used in methods such as MPC (Model predictive control) [7], while also being able to provide strong predictive behavior. This topology makes use of a high-dimensional continuous action space and observation space, in a highly non-linear environment, a challenging RL task, and we aim to provide an RL balancing controller that can effectively train to handle such an environment, as well as attempt to follow multiple training objectives. The aim of this thesis is to introduce a reinforcement learning approach to cell balancing. The intended outcome is to create a cell balancing controller built from a reinforcement learning agent, as well as provide the simulation environment for the training and testing of the agent. Designing a balancing controller that can take many factors into account using traditional control theory methods, for a non-linear environment, gets exponen- tially harder with each objective as there are difficulties in modeling these non-linear relationships between each new factor. We aim to train an RL controller agent to learn these relationships and make use of them in battery balancing, and simultaneously pursue multiple objectives such as SOC balancing in a health-aware manner. 1.2 Related work With the rise of electromobility, research in the area of battery systems has expanded rapidly. The first component of battery management systems (BMS) to be interfaced with AI/ML technology has been parameter and state estimation and prediction. Sev- eral papers discuss the applications of supervised and unsupervised learning in order to predict the values of SOC, SOH, capacity, battery aging, and other functioning param- eters. There have been extensive studies targeting the estimation of SOC, SOH, and Remaining-useful-life (RUL) for battery systems using machine learning. However, the cell balancing aspect of BMS remains an open question in the field of machine learn- ing, specifically when it comes to exploring the capabilities of reinforcement learning for battery controllers. Harwardt et al. [9] consider a reinforcement learning method that targets passive and active cell balancing using a cell simulation that contains a dynamic thermal model. However, this method is limited to balancing one cell per timestep and making powerful assumptions, such as constant voltage. Y. Yang et al. [10] explore a fast-charge RL battery control method using DQN for minimizing charging time in a balancing-aware manner. By using the generalization of 2 1. Introduction a neural network, the RL agent handles balancing control during charging. Duraisamy et al. [11] propose cell balancing methods that utilize back propagation neural network (BPNN), radial basis neural network (RBNN), and Long Short-Term Memory (LSTM) models to select an optimal resistor for passive battery balancing. Each model’s parameters are based on SOC, temperature rise, balancing time, and C-rate and they are compared and evaluated on a 3-cell and 3-resistor simulation environment. 
The study treats the case of the switched shunt resistor passive balancing topology and proposes a set of additional resistors that the model can swap between at will, comparing it to the limited capabilities of a weaker topology. Z. Xia et al. [12] utilize an artificial neural network (ANN) to estimate the battery capacity, which then feeds into a controller to manage the balancing of the battery cells. Both of the presented studies rely on machine learning techniques only for battery measurements, estimating their states. B. Jiang et al. [13] study a Deep Neural Network RL method of controlling a reconfig- urable cell topology based around switches, using a discrete action space. This balancing method considers the only SOC of cells for the RL agent’s balancing objective. The technical basis for battery pack modeling and simulation methods used for developing the new topology is adapted from G. Plett’s publications [14] [8]. These works provide exhaustive descriptions of physics-based models of lithium-ion cells and state-of-the-art applications of equivalent-circuit models used for battery management and control. The battery pack simulations of the project were adapted from these works. The reinforcement learning algorithms studied within this thesis and used to develop the RL controller are based on the following works. T. Haarnoja, et al. [15] describe Soft Actor-Critic (SAC), an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning, which is utilized in our work to train RL controllers. S. Fujimoto et al. [16] propose a novel mechanism that builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and results in the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) which we experiment with in our work. T. P. Lillicrap et al. [17] define the Deep Deterministic Policy Gradient, the actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and the principles that would later serve as the basis for TD3 and SAC algorithms. J. Schulman et al. [18] propose Proximal Policy Optimization (PPO), policy gradient methods for reinforcement learning which alternate between sampling data through interaction with the environment, and optimizing an objective function using stochastic gradient ascent. Existing studies treat cases of 3 cells [9], [11] or 4 cells [10], while others utilize machine learning only for estimation rather than direct control [12]. We seek to provide a scalable RL methodology that introduces state-of-the-art RL algorithms to the field and provides insight into RL development methods in the context of EV battery controllers. The existing methods handle balancing during resting [9], [11], or charging [10] phases. A study [13] proposes a reconfigurable topology based around switches, a discrete action space, coupled to an RL controller which only takes SOC into consideration in the reward function. We introduce a power-electronics battery balancing topology, with a 3 1. Introduction continuous action space and high-dimensional continuous observation space, which can be active during battery discharge during normal operation of a vehicle, coupled with an RL controller capable of handling this continuous action space to balance SOCs as well as maintain SOH through health-aware balancing. 
1.3 Methodology Within our work, we propose and analyze applications of reinforcement learning agents for battery pack balancing control, as a method to follow multiple objectives within balancing. This is done within a dynamic battery pack simulation based on experimental cell datasets. Within the project, we adapt the simulation formulas to our proposed topology and link it up to simulated road-usage powder demands profiles, synthesized from a drive-cycle dataset, and altered by road conditions. This simulation is then attached and synchronized with reinforcement learning algorithms within our training framework, in order to train and produce RL agents which can fulfill the role of controllers for the BMS of the EV battery in an online RL way. We undergo a thorough analysis and experimentation of reinforcement learning algorithms, training techniques, optimizations, and reward design in order to find the most suitable algorithm for training RL agents for battery pack environments and provide insight into development methods for the complex high-dimensional and non-linear learning task of health-aware battery balancing. 1.4 Organization The thesis is formatted into several chapters. The theoretical background and foun- dational information are presented in the ’Background’ chapter, which is split into the battery pack, balancing, and RL theory. The implementation and experiments are dis- cussed in the ’Methodology’ chapter which presents the work done on the cell simulation, RL algorithms, and the interfacing of the two. The findings are showcased in the ’Re- sults’ chapter. Discussion based on the findings and final thoughts for applications are presented in the ’Conclusion’ chapter, as well as proposals for future improvements and subsequent work. 1.5 Limitations The electrical and electronics design aspect of the topology makes use of abstractions, especially in the case of duty-cycling power electronics. In-depth design and simulation of these elements are not within the scope of the work. Training of the RL model for this project was done using cell simulations that are based on the characterization of real cells, which are modeled using a first-order RC circuit, which we consider sufficient for the needs of this project. 4 1. Introduction 1.6 Ethical Considerations All datasets used within this work for cell modeling and drive cycles are public and contain no identifiable vehicle data. Reinforcement learning models, especially Deep Learning variants, are highly complex and raise difficulties in interpreting their decisions. This lack of transparency can lead to issues for safety considerations. Any EV applications that attempt to incorporate RL/ML technologies within their workflow must undergo extremely thorough testing and verification, especially in the case of battery technologies which can be subject to leaks of hazardous material or smoke, fire, and explosion as a result of thermal runaway. 5 1. Introduction 6 2 Background 2.1 Battery Management Systems Battery management systems (BMS) are software applications that oversee the safe and efficient operation of battery cells and the battery packs as a whole [19]. Figure 2.1: The composition of EV battery packs A battery pack can contain hundreds of cells, and an electric vehicle (EV) can contain multiple battery packs, pictured in Figure 2.1. Traditionally, EV manufacturers use the same cell type in vehicle packs, as mismatches add unnecessary complexity to functioning and manufacturing. 
However, cells of the same type, from the same manufacturer and manufacturing batches, are not identical in characteristics. Current manufacturing limitations and industry demand do not allow precise uniformity among the cells [14] [8].

A BMS is responsible for keeping track of data from sensors across the battery pack, such as voltages, currents, and temperatures, and uses it to maintain uniform usage in a non-uniform environment. Some sensors are present at cell level, on each cell, such as the ones for current and voltage, while others, such as thermal sensors, can be placed across several sections of the battery pack. This data is used to manage temperature, charge C-rate, discharge profile, and other functions of the pack [19].

The raw data from the sensors is then used to estimate and predict 'states' for the cells. State of Charge (SOC), State of Health (SOH) (including State of Capacity (SOQ) and State of Resistance (SOR)), and many other states are estimated by the BMS to determine the status of the battery pack. SOC, for example, can be used to directly inform the power management unit of the remaining charge and, coupled with additional vehicle usage data, determine the remaining range of the vehicle and offer an estimation of how much longer the vehicle can run under assumed average usage [14].

2.2 State of Charge (SOC)

Arguably the most important estimation performed in the battery pack is the State of Charge. It is crucial to the basic functionalities of a vehicle; however, SOC has many more important uses that are not directly visible to the end user of an EV. As we cannot directly measure SOC, we must rely on estimation methods [14].

Rechargeable lithium-ion cells are small packages of different shapes and sizes that store electricity through chemical reactions. Cell usage, characterized by the charge/discharge rate, known as the cycling rate (C-rate), and by voltage, must stay within manufacturer-defined limits [14]. State of Charge is bound to these limitations. If a cell were to over-charge beyond its factory limit, chemical reactions within the cell would lead to irreversible damage and a permanent loss of capabilities in the cell. The same goes for over-discharging. A cell is never 'fully discharged' of its electric charge, but rather partially discharged down to a lower voltage limit which it must never pass [19]. When discharging, this usage bound is called Depth of Discharge (DOD), and it measures how much of a battery's capacity has been used relative to its total capacity. DOD has a direct impact on the degradation of a cell, as higher DOD generally results in a larger volume change of active particles during cycling, increasing stress and leading to cracking and cell degradation [20].

The definition of SOC varies depending on the method used for the estimation. The most widely used methods are open circuit voltage, Coulomb counting, impedance spectroscopy, and model-based or ANN-based approaches.

In the Open Circuit Voltage (OCV) method, SOC is defined through a one-to-one relationship to OCV. The battery is disconnected from the load and left to settle its chemical reactions, and then the OCV is measured. It is then compared to a SOC-OCV lookup table and the SOC is determined [21]. This method, however, can require long relaxation times until a lithium-ion battery is fully settled. The lookup table depends on the battery chemistry, and the SOC-OCV relationship can diverge from the one-to-one estimation with temperature changes and aging.
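As a small illustration of the lookup step in the OCV method, the sketch below interpolates SOC from a rested open-circuit voltage. The SOC-OCV pairs are invented placeholder values for illustration only, not the measured table of any particular chemistry or of the cells used in this work.

```python
import numpy as np

# Placeholder SOC-OCV pairs standing in for a measured lookup table
# (invented values; real tables are chemistry- and temperature-specific).
SOC_POINTS = np.array([0.0, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
OCV_POINTS = np.array([3.0, 3.45, 3.55, 3.65, 3.75, 3.95, 4.2])   # volts

def soc_from_ocv(ocv_measured):
    """Estimate SOC from a rested open-circuit voltage by table interpolation."""
    return float(np.interp(ocv_measured, OCV_POINTS, SOC_POINTS))

print(soc_from_ocv(3.7))   # roughly 0.5 for this placeholder table
```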
In the Coulomb counting method, SOC is defined as the integration of current [21]. Given an accurate initial SOC estimation, coulomb counting keeps track of the current that flows in and out of the battery by accumulating the charge that is being transferred [22]. However, this method relies on the accuracy of the counting sensors, and inaccuracies from defects or electronics wear can accumulate into large estimation errors [21]. Electrochemical impedance spectroscopy functions by injecting small amplitude AC sig- nals to a battery at different frequencies. Parameters measured from these signals are then used as indicators of SOC. This method is not suitable for online measurements, while the battery is being used, [21] and usage is mostly restricted to lab environments. 8 2. Background The model-based methods are the most suitable for online measurements and involve modeling of the electrical, chemical, or combinations of both properties of a specific battery. The most widely used algorithms for this method are variations of the Kalman filter [23]. For our project, we utilize coulomb counting to track SOCs. We aim for a simulation of a battery within a vehicle in usage, rather than within a lab environment, and thus we make use of an accurate online, real-time, measurement method. 2.3 State of Health (SOH) The State of Health (SOH) is another critical parameter for BMSs that aims to estimate the loss of charge storage capabilities of a battery, by use and aging [24]. Fundamentally, it is the monitoring of the battery’s parameters whose degradation process is, usually, slow yet influences the battery’s performance. Over time, a battery’s total capacity gets reduced - this is known as capacity fade. Capacity fading occurs due to the cell’s structural deterioration as well as chemical side effects over time. Furthermore, battery cell aging causes its internal resistances to increase, for similar reasons. It is essential to have an accurate measurement of the capacity and internal resistances, given that they are major contributing parameters for estimating SOC and calculating energy [14]. Similar to SOC, SOH is also a parameter obtained through estimation and several meth- ods enable it, such as [24]: Characteristic Quantity Method, Model-based Method (Equiv- alent Circuit Model, Electro-chemical Model, Electro-chemical Impedance Spectroscopy), Fusion Method and Data-driven Methods (Neural-networks, Fuzzy Logic, Support Vector Machine, Parameter Identification). An influential aspect of a battery’s health is the C-rate, which defines how quickly the cell acquires or releases charge. C-rate has many implications on a cell’s lifetime and capacity, the most important one being that high C-rates lead to drastic loss of capacity and loss of power for the cell [8] [20]. Essentially, batteries that are under extreme load demands or that are charged at a very high rate can exhibit faster degradation, leading to reduced lifetime. Many elements are considered to influence the degradation of SOH in a battery during operation. The most drastic effects arise from overcharging and over-discharging, as well as operation at inadequate temperatures [8]. Within our work, we treat these effects via charge/discharge safety mechanisms and temperature abstractions within the battery simulation, and focus on the effects of cell balancing and cell usage. We use energy throughput as an indicator for preventing SOH degradation. 
By attempting uniform energy throughput for each cell while balancing, we aim to decrease the overall degradation of cells and limit the non-uniform degradation of cells in a pack, in order to increase the lifetime of battery packs.

Some state-of-the-art works [12], [25] that include SOH in the balancing strategy model SOH as the capacity fade of the cells. However, capacity fading can be caused by several factors, including internal chemical reactions and cell operation. Cell chemical deterioration is not considered for this work, as it is not available within our simulation capabilities or dataset. Instead, we consider SOH as the energy throughput of the cells. Treating cell throughput as SOH allows the RL model to be trained on data that updates consistently and quickly, in addition to energy throughput being linked to cell aging [20].

2.4 Causes of imbalance

Imbalance is introduced by any combination of factors that make the SOCs of the cells diverge from one another. A common factor is the different Coulombic efficiency of cells. Two cells may start in the same SOC state and quickly diverge during charging due to their different efficiencies η [8]. The following formula describes how the SOC z of a cell is estimated while the cell is being charged

$z(t) = z(0) - \frac{1}{Q}\int_0^t \eta(\tau)\, i_{\text{app}}(\tau)\, d\tau$    (2.1)

where Q is the capacity, η is the Coulombic efficiency, and $i_{\text{app}}$ is the charging current of the cell. Notice that η directly scales the charging current of the cell, and each cell's η parameter is unique; even a slight difference can cause a cascading unbalancing effect over time.

Imbalance can also occur due to different total current loads applied to the cells. The current that passes through a cell originates from multiple sources. Besides the main application current, the load requested from the battery pack for the operation of the vehicle, we also have the self-discharge current as well as the leakage current [8].

$i_{\text{net}}(t) = i_{\text{app}}(t) + i_{\text{self-discharge}}(t) + i_{\text{leakage}}(t)$    (2.2)

The self-discharge rate differs from cell to cell and refers to the internal current flow within a battery cell when it is not connected to any external circuit. Self-discharge is a natural phenomenon that occurs in all types of batteries over time due to ongoing chemical reactions within the battery, even when not in use, as well as impurities and imperfections in manufacturing which create small pathways for current to flow internally. These chemical reactions accelerate with temperature increases, as well as with high SOC values held over long periods [8].

Leakage current refers to the small amount of current that powers the attached BMS electronic circuitry. Manufacturers of battery packs and BMS components specify the power consumption of their BMS circuitry in datasheets [8]. The effects of these additional currents are seen in every state of the battery: charging, discharging, and resting. They are permanently active and vary with the parameters of the pack and cell itself [8].

2.5 Cell Balancing

The definitions of a balanced pack differ from application to application. A balanced pack is commonly defined as a pack where all of the SOCs of its cells are the same. However, this is a very harsh requirement that depends strongly on very accurate measurements and cell manipulation techniques which are not usually available in an EV system.
The SOCs in an EV battery pack vary greatly during usage and the estimations from measurements can be inconsistent, as the cells are not measured in an isolated lab environment [21]. We therefore introduce a balance threshold rather than an exact equality relationship. In a cell pack with SOC values between 0 and 100%, we compute the difference between the smallest SOC value and every other cell in the pack. We assign a threshold for the maximum SOC difference between the cells, which within our experiments takes values between 1% and 3.5%. Cells within the pack with a difference smaller than the set threshold are considered balanced.

Due to the non-uniform characteristics of cells in a pack, which diverge even further through usage and aging, cells in a pack begin to have very different SOCs. As the cells are charged and discharged, these SOC variations grow larger and larger to the point where, for example, one cell can be below 50% SOC when the rest are almost at 100%. The short-term effects are a significant reduction of battery capacity and a waste of energy, reducing the range of an EV to a fraction of the intended baseline.

Figure 2.2: The immediate effects of cell imbalance

When such differences happen, a BMS limits the battery capacity to that of the weakest cell in the pack, pictured in Figure 2.2. Continuous usage of the vehicle with these imbalances can lead to rapid deterioration of the cells, rendering the whole pack unusable. An improper BMS implementation on an EV suffering from cell imbalance can lead to a full stop in the middle of a highway and similar safety risks [8].

These imbalances, if left untended over a longer period, lead to rapid degradation of the battery cells through accelerated aging and abuse. Erroneous charging/discharging of imbalanced batteries, usually due to measurement inaccuracies under stress, leads to overcharge and over-discharge. This produces irreversible chemical reactions in the batteries, which reduce their capabilities and generate heat [5].

Imbalance in cells influences internal resistance. Cells with higher internal resistance tend to dissipate more heat during charging and discharging. This increased heat generation can exacerbate temperature differences within the battery pack, potentially leading to localized hotspots where thermal runaway may be initiated in a heavily degraded cell [8]. Imbalanced cells may also experience voltage spikes or current surges during charging or discharging cycles, especially when transitioning between different operational states. These irregularities in voltage and current can induce stress on the cells and lead to thermal runaway if not properly managed by the battery management system [5].

The sum of these factors can start, or contribute to, an incident of thermal runaway. As a lithium-ion cell's temperature increases, the chemical reactions in the cell get faster and faster, leading to a self-feeding reaction. This catastrophic effect of self-heating, known as thermal runaway, causes the expansion of the cells, the production of smoke and fire, and finally leads to the explosion of the cell pack. If the BMS detects the development of this effect too late and a battery enters a state of thermal runaway, it usually cannot be stopped, even with outside thermal influence from cooling mechanisms; it can only be prevented before the thermal runaway has been initiated [5].
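As a concrete illustration of the threshold-based balance criterion introduced at the beginning of this section, and of the divergence mechanism of Equations (2.1) and (2.2), the following Python sketch steps two cells with slightly different Coulombic efficiencies through the same charging current and then applies the threshold check. It is a minimal sketch with invented variable names and a 1 s discretization; it is not the simulation code used in this work.

```python
import numpy as np

def soc_step(soc, i_app, q_ah, eta, i_self=0.001, i_leak=0.002, dt=1.0):
    """One discrete Coulomb-counting step of Equations (2.1) and (2.2).
    i_app is the applied current in A (positive on discharge, negative on charge);
    eta is applied only while the net current is charging the cell."""
    i_net = i_app + i_self + i_leak                  # Equation (2.2)
    eff = eta if i_net < 0 else 1.0
    return soc - eff * i_net * dt / (q_ah * 3600.0)  # discretized Equation (2.1)

def max_soc_spread(socs):
    """Largest deviation of any cell from the weakest (lowest-SOC) cell."""
    socs = np.asarray(socs, dtype=float)
    return float((socs - socs.min()).max())

# Two cells, identical starting SOC, slightly different Coulombic efficiencies.
socs = np.array([0.20, 0.20])
etas = np.array([0.999, 0.997])
for _ in range(1800):                                # 30 minutes of 1C charging
    for k in range(2):
        socs[k] = soc_step(socs[k], i_app=-14.53, q_ah=14.53, eta=etas[k])

threshold = 0.025                                    # 2.5% balance threshold
spread = max_soc_spread(socs)
print(f"SOC spread after charging: {spread:.4f}, balanced: {spread <= threshold}")
```

A single half-cycle produces only a small spread, but because the efficiency mismatch acts on every charge, the deviation accumulates over repeated cycles, which is the cascading effect described above.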
2.5.1 Passive Cell Balancing Because of these issues and the critical flaw of lithium-ion packs, a BMS must regularly balance the battery cells and maintain a certain degree of uniformity for the SOCs. There are many methods for ensuring this balance, each requiring different circuitry and electronics within the pack to enable it, and they are categorized into two categories: passive and active cell balancing [8] [26]. In terms of energy, passive balancing usually is classified as dissipative, whereas active balancing is non-dissipative [27]. Passive cell balancing is the simplest form of balancing, where cells are discharged until they reach similar SOCs. The excess charge is transferred into heat through a consumer, a resistor. The main benefits of these methods are simple implementations and relatively cheap components. There is no need for advanced algorithms. In some implementations, the BMS has no direct influence on the passive balancing mechanism, which functions purely at the circuit level [8]. The downside of these methods is that battery charge is wasted destructively. Energy may not be simply removed from the cells and is instead transformed directly into heat. The resistors are placed close to the cells and this leads to the heating of the pack. Due to this effect, as well as the inaccuracy of SOC estimations during charge/discharge, it can be dangerous to allow this balancing to happen outside resting periods, where there is no outside current applied to the cells. Applying this method during charge/discharge comes with the danger of overheating the battery which, even if thermal runaway is avoided, still contributes to the deterioration of the cells [8] [26]. 12 2. Background 2.5.2 Active Cell Balancing Active balancing, as a non-dissipative type does not waste cells’ energy to achieve balance. Most active cell balancing implementations rely on transferring energy from high SOC cells to low SOC cells. There is some waste of energy on the circuitry needed to enable the energy transferring between cells, but lower in comparison to the passive balancing strategies [8]. Naturally, transferring energy between different cells requires more complex mechanisms, some of which are not effective enough to be considered as good alternatives to already implemented passive balancing designs. Not only in complexity, but voltages between imbalanced cells need to be relatively large so that cells can balance quickly. This means that low differences in cell voltages, even though they can be unbalanced, can make the energy transfer too slow, unable to achieve balance in the necessary amount of time without the need for additional circuitry that can bypass this limitation. 2.6 Cell Model There are a variety of different models utilized to simulate battery cell behaviors, some of which are Electro-chemical Models (EM), Equivalent-circuit Models (ECM), and Data- driven Models (DDM) [28], each with different advantages and disadvantages. In this project, we utilize the ECM model for cell and cell pack simulation. 2.6.1 Equivalent-circuit Model (ECM) ECM models the electrical behavior of battery cells through circuit theory, utilizing electrical elements such as resistors, capacitors, inductors, and voltage/current sources. The ECM model used is based on the Thevenin circuit model, one of the most widely used ECM models for cell battery simulation, given that it can more accurately represent the dynamic behaviors of the cells [28]. 
Due to the Thevenin ECM being a linear and time-invariant circuit model, it does not accurately capture the nonlinear and time-variant physical behaviors of the cells. Additionally, for the model to be accurate, real cells must be measured and parameterized to tune the model parameters for precise simulation. This can be challenging given the non-linear behaviors that cells show depending on different parameters such as temperature, SOC, and rates of charge and discharge [8], [28].

Figure 2.3: First order Thevenin model of a cell.

In this model, OCV is defined as the Open Circuit Voltage, which is the voltage of the cell measured without load. The OCV is also a necessary parameter to determine the SOC of a given cell. The cell voltage is modeled as shown in equation (2.3).

$v(t) = \mathrm{OCV}(z(t)) - R_1\, i_{R_1}(t) - R_0\, i(t)$    (2.3)

2.7 Reinforcement Learning

Reinforcement learning is a branch of machine learning. Machine learning, as a concept, is a computational approach to learning. Reinforcement learning uses 'interaction' as its object of learning. The 'learner' in an RL setting is frequently called the 'agent'. The objective of the agent is to solve problems that are infeasible to approach through traditional algorithmic means [29]. The usual applications of RL algorithms are problems with an often large set of variables in a non-uniform environment, that is, variables which influence each other non-linearly [29].

Reinforcement learning is formalized as control optimization in an incompletely-known Markov decision process. The learning agent must sense the state of an environment and take actions to affect that state. RL is often referred to as a form of unsupervised learning; however, the objectives of these two methods are different. Unsupervised learning tries to find a structure hidden in collections of unlabeled data, whereas reinforcement learning tries to maximize a reward sequence rather than trying to find a hidden structure. There is some overlap in these objectives, as finding a structure would be beneficial for RL in most situations, but it does not guarantee the maximization of a reward [29].

Figure 2.4: RL training principles

Figure 2.4 illustrates the dataflow of RL training. An environment (i.e. the battery pack) sends out states to the RL agent. The agent provides the best course of action for that state that it has learned through experience so far. The action alters the environment, leading to a new state. This state is assigned a reward, according to the reward function, and the feedback is given to the RL agent. After enough state-transition-reward pairs are gathered, the RL agent goes through a training step in which it updates its course of action.

An RL agent's target for interaction is the environment. An environment can be defined as the setting in which our agent can act and influence elements within. It can be several integer variables, a set of continuous arrays, or a complex three-dimensional simulation. The definition of an environment depends on the problem which we seek to solve. The RL agent is offered an interface through which it can interact with the environment. This is the Action Space of our environment, and it represents the variables that are manipulable by our agent. The variables that cannot be directly influenced by our agent are termed the Observation Space. A 'snapshot' of the Observation Space at a certain timestep is called a 'state'.
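To make the notions of Action Space and Observation Space concrete for a balancing task, the sketch below defines continuous spaces in the style of the Gymnasium API. The bounds, shapes, and layout here are illustrative assumptions only, not the exact spaces used by the environment developed in this work, which is described in Chapter 3.

```python
import numpy as np
from gymnasium import spaces

N_CELLS = 10  # pack size used in the simulations of this thesis

# Action: one continuous modulation command per cell (illustrative bounds).
action_space = spaces.Box(low=0.0, high=1.0, shape=(N_CELLS,), dtype=np.float32)

# Observation: per-cell SOC and SOH stacked into one continuous vector
# (an assumed layout, not the thesis' actual observation design).
observation_space = spaces.Box(low=0.0, high=1.0, shape=(2 * N_CELLS,), dtype=np.float32)

print(action_space.sample())       # a random point in the continuous action space
print(observation_space.shape)     # (20,)
```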
States are linked together by the actions which the agent takes between them. We start interaction with an environment in an initial state, and each action the agent takes leads to another state. Actions are sequential and begin from the resulting state of the previous action [29].

Figure 2.5: RL agent state-action-reward unit

Each action taken upon an initial state and leading to a final state is assigned a reward, formalized in Equation (2.4). The reward is defined by a reward function that represents the target of our RL agent, the objective which it must achieve through its actions. The resulting return is weighted by a discount factor γ, which takes values between 0 and 1, see Equations (2.5) and (2.6). The discount factor determines how much the RL agent 'looks ahead' as it tries to maximize the rewards. The largest term of the sum is the immediate reward for a state-action pair; however, with a large discount factor, the future rewards of the following actions gain more importance. An agent may thus take an action that has a low immediate reward but allows for much higher rewards in future states. The agent then develops efficient long-term strategies that can secure high total rewards by taking into consideration the development of states. An RL agent with a discount factor of 0 is known as 'myopic', and ignores future considerations in its decision-making. In practice the discount factor is capped just below 1, commonly at 0.99, which ensures that rewards in the distant future eventually approach 0 and that the immediate reward carries more weight than any individual future reward. The predictive focus of an RL algorithm must be tuned to the specific task it aims to solve, and the discount factor is critical in defining this focus [29].

$r_t = R(s_t, a_t)$    (2.4)

The RL objective is to select a policy that maximizes the expected reward sum when the agent acts according to that policy. A policy is a stochastic rule by which the agent selects actions as a function of states. A policy's value functions ($v_\pi$ and $q_\pi$) assign to each state, or state-action pair, the expected return from that state or state-action pair, given that the agent uses the policy (2.5). The value function of a state s under a policy π, denoted $v_\pi(s)$, is the expected return when starting in s and following π thereafter. Similarly, we define the value of taking action a in state s under a policy π, denoted $q_\pi(s, a)$, as the expected return starting from s, taking the action a, and thereafter following policy π (2.6) [29] [16].

$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$    (2.5)

$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$    (2.6)

These value functions are estimated through experience: an agent follows the policy π and, for each state encountered, computes an average of the total reward returns that have followed that state, which converges to the state's value $v_\pi(s)$ as the number of times that state is encountered nears infinity. Keeping a separate average for each action in each state, we converge to the action values $q_\pi(s, a)$. The value functions define an ordering over policies. A policy π is defined as better than π′ if its expected total reward is greater than that of π′ in all states; formally, π > π′ if and only if $v_\pi(s) > v_{\pi'}(s)$ for all states s. The RL task is to find the optimal policy π∗ which is greater than all other policies, formalized in the following equations (2.7), (2.8), (2.9) [29].

$v_*(s) = \max_\pi v_\pi(s)$    (2.7)
$q_*(s, a) = \max_\pi q_\pi(s, a)$    (2.8)

$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\right]$    (2.9)

Traditional greedy algorithms in non-linear environments show severely limited performance. A greedy algorithm begins by sampling an action, and will then continuously take that action, as it is the 'most rewarding' action at its disposal. It makes no effort to explore the other actions it can take, as doing so is not according to its imperative of maximum immediate reward, which it can only compute from the actions it has taken. Thus, the same course of action is repeated ad infinitum with no improvements, never discovering the better alternatives offered by the other actions. RL, in contrast, is defined by its capacity for effective exploration of a large environment [29].

A big challenge in designing RL systems is the balance between exploration and exploitation. To accrue a large reward, the agent must take actions that it has tried in the past and knows will produce an effective reward. However, to discover such actions the agent must try actions that it has not attempted before. Our agent must exploit what it has already tried, and explore to discover new methods for achieving rewards. By choosing only one, we fail the other, and a balance must be struck. This trade-off dilemma is specific to reinforcement learning and has no equivalent in supervised or unsupervised learning [29] [30] [31].

RL algorithms utilize techniques such as ε-greedy, softmax, or Upper Confidence Bound (UCB) to encourage the agent to explore a wide range of actions and states in the early stages of learning. This exploration helps the agent discover effective strategies and solutions that might not be immediately apparent. By systematically balancing exploration with exploitation, RL ensures that the agent can both discover new strategies and refine them to maximize long-term rewards, eventually arriving at a final optimal policy [29].

The biggest bottleneck in RL is samples. RL algorithms need many state-action-reward samples to train efficiently. The number of samples varies by task and environment but usually reaches several million state transitions before convergence. An RL algorithm converges when further exploitation yields no further improvements and the exploration mechanism tapers off. For example, some algorithms utilize an entropy parameter that gradually decreases according to loss functions based on the accuracy of predicted rewards, while others introduce noise to the actions of the agent to ensure exploration [15] [16].

There are two approaches to RL when it comes to sample-gathering: Online RL and Offline RL. In offline RL, samples are provided from interactions with other actors (i.e. other BMS control algorithms). These interactions are recorded from logged data and tagged with rewards. In the training process, the RL agent cannot directly act upon the environment and instead learns from the actions of other controllers. In online RL, an environment is simulated and the RL agent interacts directly with the simulation to generate state-action-reward samples. This approach works best when there is no large amount of usable logged data available to train the agent with.
Examples of RL infrastructures in the industry initially start with an online approach when branching into the area, and gradually move to Offline once they have the possibility of generating enough logged data from products, or through digital twins. A digital twin is a simulation that we can say is certifiably identical to its real-world counterpart. Online RL may also offer more possibilities for exploration than Offline learning from traditional controllers. However, the accuracy of the simulations remains a key issue. As we are experimenting with a novel topology for our approach, there is no widely avail- able large bank of logged BMS interaction data. Within our work, we rely on simulations to generate the necessary samples for the training process, via direct interaction from the RL agent in an online RL setup. Figure 2.6: Taxonomy of RL algorithms Reinforcement learning algorithms are classified into two large categories: model-based and model-free. Model-based algorithms learn an explicit model of the environment, in the form of tran- sition probabilities or a predictive model. They utilize this model to plan and make decisions about actions to take. That is to say, a model-based approach implies the possibility of predicting rewards and states, during training, before an action is taken. Rather than relying on trial and error, the model-based approach uses approximations of the environment to view the possible outcomes before taking a step [32]. 18 2. Background Model-based RL approaches require certainty that the approximation fully captures the essence of the environment, one that is identical to the ground truth. It is also crucial for this approximation to be very efficient in generating predictions, otherwise the time-to- train becomes infeasible. These approaches work best with easily modeled environments with few or no non-linear factors, such as games of Chess and Go [32]. Model-free approaches are applied directly to an environment, and learn through trial and error. A model-free agent must take an action, and see the results, to be able to learn from it. This grants much more flexibility and eliminates the risks of inaccurate models training unusable agents. It comes with the cost of less sample efficiency, as we need to run more simulations for the agent to find the best course of action [32]. Within the model-free methods, we have several classifications: Policy Optimization, and Q-learning methods. Q-learning algorithms are also known as ’off-policy’. They learn a value function, which is then used to derive a policy. Specifically, it learns the action-value function (Q-function), which predicts the expected utility of taking a given action in a given state and following a certain policy thereafter. It does so by estimating the Bellman equation. A big advantage of off-policy methods is that they can use data collected at any point during training, even from past actions from previous, less-trained, versions of the agent. The most popular algorithm representative of this class, which spawned this branch of RL, is Deep Q-Network (DQN) [33]. Policy optimization, or ’on-policy’ methods can only use data collected with the lat- est version of their policy. They learn a policy function that directly maps states to actions. The approach of policy-based methods makes them very stable and able to handle continuous action spaces, and smoothly converge to an optimal policy [18]. On- policy algorithms are ideal for scenarios with a large and continuous action space. 
3 Methods

3.1 Balancing Strategy

As discussed previously, balancing can be categorized as dissipative and non-dissipative. For this project, we propose a non-dissipative balancing strategy that does not rely on transferring charge between cells. We explore the idea of discharging each cell at a different rate, using the normal operational power draw of the vehicle, in order to achieve balance. For this to occur, cells must be balanced during discharge phases, that is, during operation.

Balancing during operation allows the cell pack to be utilized fully, by letting higher-SOC or stronger cells take over part of the work of low-SOC or weak cells. In this way, low-SOC or weak cells are used less, and the overall power demand is distributed equitably between all cells so that aging averages out in a similar manner across the pack, rather than having some cells age more quickly than others. Ideally, when a cell pack reaches the end of its lifetime, all cells should have been fully exploited.

To allow for balancing during operation, we utilize a re-configurable battery pack topology, illustrated in Figure 3.1. In this battery pack, each cell is connected to a power electronics circuit unit whose role is to modulate the amount of charge that the individual cell provides. Each circuit unit, for the purposes of this project, is connected in series with the other units, forming a pack. By modulating the amount of charge each cell provides, it is possible to draw the needed charge from each cell during operation with the aim of achieving balance.

In Figure 3.1, the buck-boost converter represents the power electronics section of the circuit, while the controlling switch represents the circuitry that handles the charge flow accordingly. This is a representation of what the controller power electronics could be expected to look like; the electrical design and its considerations are out of scope for this work. Any similar circuit or system can be used instead of the presented one, as long as the handling circuit is capable of executing the tasks demanded by this work.

Figure 3.1: Balancing-focused re-configurable cell pack topology.

3.2 Battery Simulation Environment

This section provides details about the cell simulation, the data used for the simulations, the interfacing between the simulation environment and the machine learning training environment, and how balancing is handled in the simulation.

3.2.1 Cell Simulation

The cell simulation code was leveraged from previous works [8]; to fit the objectives of this project, the code was modified and extended. The cell simulation reproduces a cell's electrical parameters over time based on the demanded power profile, utilizing the electrical cell model described in Section 2.6 Cell Model. Every cell simulation uses the same cell model and parameters, shown in Table 3.1, which correspond to cell model "P14" in the cell simulation environment. More cell models are available for this work, however only cell model "P14" was used for the final experimental results.

Table 3.1: "P14" cell model parameters.
Parameter   Value          Unit                Description
T           25             degrees Celsius     Temperature of operation of the cell
q           14.53          ampere-hours (Ah)   Cell capacity
eta         0.999          ratio, unitless     Coulombic efficiency
r0          0.00178096     ohms                Series resistance parameter
r           6.477208e-04   ohms                Resistor-capacitor resistance
rc          0.823683       ohms                R-C resistance
rt          2.5e-4         ohms                Cell tab resistance

The temperature parameter for this work has been set to 25 °C. Most cell parameters change with temperature to varying degrees, but remain relatively similar. Since modeling temperature dependence would greatly increase the complexity and running time of the simulation, the temperature is kept constant for the purposes of this work.

Each cell is simulated in a pack of 10 cells which have slight differences in their parameters, in order to simulate factory irregularities and aging differences and to capture the unbalancing characteristics of the cells. If the cells in the simulation had exactly the same parameters, unbalancing would never occur, since they would all behave in exactly the same way. To avoid this, some cell parameters have small variations. This is meant to emulate real cells, given that the manufacturing process cannot possibly yield virtually identical cells, so this parameter variance is reasonable. To have some control over the randomness of the parameter values, each random parameter is seeded via a seed parameter that can be freely chosen for each simulation; this allows simulations to be reproduced exactly while still providing reproducible yet random cell configurations.

The parameters q, eta and r0 for cell i are randomized as follows:

Q_i = q - 0.25 + 0.5 α_i    (3.1)
Eta_i = eta - 0.002 β_i    (3.2)
R0_i = r0 - 0.0005 + 0.0015 γ_i    (3.3)

where α_i, β_i, γ_i ∼ Unif(0, 1). Here α_i, β_i, γ_i and α_j, β_j, γ_j are independent when i ≠ j.

During simulation, different voltages and currents are calculated depending on the power being demanded from each cell, and ultimately the SOC of each cell is computed. SOC is one of the primary values used for balancing; it is also used by the simulation itself, given that cells start from a set initial SOC value from which the rest of the parameters are obtained.

The open circuit voltage (OCV) v_OC is also part of the cell model parameters. The cell open circuit voltage v_OC is a function of the cell SOC and temperature, defined as

v_OC = OCVfromSOC(SOC, T, model)    (3.4)

where SOC is the current state of charge of the cell, T is the temperature, model is the set of "P14" cell model parameters, and OCVfromSOC() is a lookup-table function that returns the cell's v_OC for the given inputs. The cell model contains all the parameters shown in Table 3.1 as well as the voltage-SOC relationship of the simulated cell. This data belongs to the cell parameters we utilize and was obtained through cell characterization tests and measurements, which we leverage in this work. Figure 3.2 illustrates the relationship between v_OC and SOC, where SOC is the free variable.

Figure 3.2: "P14" cell model OCV-SOC relationship, T = 25 °C.

The initial SOC value is the same for all cells; it is the free variable of the simulation and must be set at initialization. It is set to the upper SOC limit at the start of the simulation:

SOC(0) = maxSOC    (3.5)

Later, once the simulation has calculated the voltages and currents flowing through the cells, the SOC is recalculated and updated, and the simulation cycle begins again.
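As a concrete illustration of the seeded randomization in Equations (3.1)-(3.3) and the lookup of Equation (3.4), the sketch below generates a reproducible set of cell parameters and interpolates an OCV value from SOC. The function names and the small OCV table are placeholders only; the thesis relies on measured "P14" characterization data.

```python
# Sketch of seeded cell-parameter randomization (Eqs. 3.1-3.3) and an OCV-from-SOC
# lookup (Eq. 3.4). Function names and the small OCV table are illustrative only.
import numpy as np

def make_cells(n_cells, seed, q=14.53, eta=0.999, r0=0.00178096):
    rng = np.random.default_rng(seed)            # one seed -> reproducible pack
    cells = []
    for _ in range(n_cells):
        alpha, beta, gamma = rng.uniform(0.0, 1.0, size=3)
        cells.append({
            "q":   q - 0.25 + 0.5 * alpha,        # Eq. (3.1)
            "eta": eta - 0.002 * beta,            # Eq. (3.2)
            "r0":  r0 - 0.0005 + 0.0015 * gamma,  # Eq. (3.3)
        })
    return cells

# Placeholder OCV-SOC table; the real curve comes from cell characterization tests.
SOC_GRID = np.array([0.0, 0.25, 0.50, 0.75, 1.0])
OCV_GRID = np.array([3.2, 3.45, 3.70, 3.95, 4.2])

def ocv_from_soc(soc):
    """Interpolating stand-in for the OCVfromSOC lookup of Eq. (3.4)."""
    return np.interp(soc, SOC_GRID, OCV_GRID)

pack = make_cells(n_cells=10, seed=42)            # same seed -> same "random" pack
```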
By defining the maximum and minimum limits for SOC, maxSOC and minSOC respectively, we can use Equation (3.4) to calculate the corresponding maximum and minimum voltage limits of the cell, maxVlim and minVlim respectively, shown in Table 3.2. The simulation parameters that are set at initialization are listed in Table 3.2.

Table 3.2: Cell simulation initial parameters.

Parameter   Value    Unit        Description
maxSOC      0.95     percentage  Upper SOC limit for the cell
minSOC      0.10     percentage  Lower SOC limit for the cell
maxVlim     4.095    volts       Upper voltage limit for the cell
minVlim     3.5185   volts       Lower voltage limit for the cell
leak_c      0.01     ampere      Leakage current of the cell
Tsd         20       Celsius     Self-discharge cell temperature

The values for maxSOC, minSOC, leak_c and Tsd are defined by the simulation, while maxVlim and minVlim are obtained from the OCV-SOC relationship of the cell model via Equation (3.4).

The leakage current, parameter leak_c, aims to simulate cells slowly losing charge over time due to an external factor. This mimics cells becoming discharged because circuitry draws on the cell's charge for computation or control, such as powering the BMS control circuit.

Additionally, cells can exhibit self-discharge behavior; the simulation treats the self-discharge of the cell as an additional current. This is a different current from the leakage current, and it is controlled by the parameter Tsd, the self-discharge cell temperature. The parameters leak_c and Tsd for cell i are randomized as follows:

Leak_c_i = leak_c + 0.002 α_i    (3.6)
Tsd_i = Tsd + 10 β_i    (3.7)

where α_i, β_i ∼ Unif(0, 1). Here α_i, β_i and α_j, β_j are independent when i ≠ j.

Once the initial parameters are defined, the simulation can start calculating the several variables that depend on the discharge power profile used. Cells for passive balancing are considered connected in series and simulated as such, while cells for reconfigurable active balancing are simulated individually, with no direct connection to other cells. For both passive and active balancing, the current i_net represents the current flowing through any cell and is defined as

i_net(t) = i_app(t) + i_self-discharge(t) + i_leakage(t)    (3.8)

where i_app(t) is the discharge current, dependent on the power profile, i_self-discharge(t) is the self-discharge current and i_leakage(t) is the leakage current. The discharge current i_app is initialized as

i_app(0) = 0    (3.9)

Cell pack simulation for the passive balancing topology:

\sum_{j=1}^{N} (v_{OC,j} - v_{RC,j}) \, i - \sum_{j=1}^{N} R_{0,j} \, i^2 = P \times N    (3.10)

where N is the number of cells in the battery pack, i = i_app is the current through the pack, and P is the power usage of a single cell, read from the drive cycle profile. For cell j, v_{OC,j} is the open circuit voltage, v_{RC,j} is the RC circuit voltage, and R_{0,j} is the series resistance. To calculate i_app with all voltages positive, we use

i_app = \frac{\sum_{j=1}^{N} (v_{OC,j} - v_{RC,j}) - \sqrt{\left( \sum_{j=1}^{N} (v_{OC,j} - v_{RC,j}) \right)^{2} - 4 (P \times N) \sum_{j=1}^{N} R_{0,j}}}{2 \sum_{j=1}^{N} R_{0,j}}    (3.11)

Cell pack simulation for the reconfigurable active balancing topology: for cell j, j ∈ {1, 2, · · · , N},

(v_{OC,j} - v_{RC,j}) \, i_j - R_{0,j} \, i_j^{2} = P    (3.12)

where i_j = i_app,j is the discharging current, P is the power usage of a single cell, obtained from the drive cycle profile, v_{OC,j} is the open circuit voltage, v_{RC,j} is the RC circuit voltage, and R_{0,j} is the series resistance.
To calculate i_app,j with all voltages positive, we use

i_app,j = \frac{(v_{OC,j} - v_{RC,j}) - \sqrt{(v_{OC,j} - v_{RC,j})^{2} - 4 P R_{0,j}}}{2 R_{0,j}}    (3.13)

The self-discharge current i_self-discharge(t) for cell j is defined as

i_self-discharge,j(t) = \frac{v_{OC,j} - v_{RC,j}}{\left( (-20 + 0.4 \, Tsd) \times SOC_j + (35 - 0.5 \, Tsd) \right) \times 1000}    (3.14)

The leakage current i_leakage of cell j remains constant, with the value given by Equation (3.6), and is defined as

i_leakage,j = Leak_c_j    (3.15)

The RC voltage v_RC(t) of any cell i is the voltage across the RC components of the ECM model, which is necessary for calculating the terminal voltage v_t(t) of the cell. The RC voltage v_RC(t) can be calculated as

v_RC,i(t) = r_i × i_RC,i(t)    (3.16)

where r_i is the RC resistance, from Table 3.1, and i_RC,i(t) is the current through the RC components of the cell model. The RC current i_RC(t) of cell i is defined by

i_RC,i(0) = 0    (3.17)
i_RC,i(t + 1) = rc_i × i_RC,i(t) + (1 - rc_i) × i_net,i(t)    (3.18)

where rc_i is the RC resistance value (cell parameter from Table 3.1) and i_net,i(t) is the current flowing through the cell.

With the cell OCV voltage v_OC and the RC voltage v_RC, the terminal voltage v_t of cell i can be calculated as

v_t,i(t) = v_OC,i(t) - v_RC,i(t) - i_net,i(t) × r0_i    (3.19)

where v_t,i(t) is the terminal voltage, v_OC,i(t) is the open circuit voltage, v_RC,i(t) is the RC voltage, i_net,i(t) is the current flowing through the cell and r0_i is the r0 resistance parameter of the cell.

Finally, the SOC of the cell can be calculated, which becomes the new SOC value for the next simulation iteration. The SOC of cell i is calculated as

SOC_i(t + 1) = SOC_i(t) - \frac{1}{3600} \times \frac{i_{net,i}(t)}{q_i}    (3.20)

where i_net,i(t) is the current flowing through the cell, and q_i is the cell's capacity.

Additionally, in this work we use State of Health (SOH) as a secondary balancing parameter alongside SOC. For SOH, the power expended by every cell is captured and accumulated. The SOH calculation for cell i is defined as

SOH_i(0) = |(v_OC,i(0) - v_RC,i(0)) × i_net,i(0)|    (3.21)
SOH_i(t) = SOH_i(t - 1) + |(v_OC,i(t) - v_RC,i(t)) × i_net,i(t)|    (3.22)
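To show how the equations above combine in a single discharge time step of one cell, the following simplified sketch implements Equations (3.13) and (3.16)-(3.20). Variable names are assumptions, `cell` is assumed to be a dictionary holding the Table 3.1 parameters (q, r0, r, rc), and the leakage and self-discharge currents of Equation (3.8) are omitted for brevity.

```python
# One simplified discharge step for a single cell (illustrative sketch).
# Implements Eqs. (3.13), (3.16), (3.18), (3.19), (3.20); leakage and
# self-discharge currents are left out to keep the example short.
import math

def cell_step(soc, i_rc, p_demand, cell, ocv_from_soc, dt=1.0):
    v_oc = ocv_from_soc(soc)                                    # Eq. (3.4) lookup
    v_rc = cell["r"] * i_rc                                     # Eq. (3.16)
    # Discharge current: solve (v_oc - v_rc)*i - r0*i^2 = P, smaller root, Eq. (3.13)
    i_app = ((v_oc - v_rc)
             - math.sqrt((v_oc - v_rc) ** 2 - 4.0 * p_demand * cell["r0"])) / (2.0 * cell["r0"])
    i_net = i_app                                               # Eq. (3.8) without leakage terms
    v_t = v_oc - v_rc - i_net * cell["r0"]                      # terminal voltage, Eq. (3.19)
    i_rc_next = cell["rc"] * i_rc + (1.0 - cell["rc"]) * i_net  # RC current update, Eq. (3.18)
    soc_next = soc - (dt / 3600.0) * i_net / cell["q"]          # SOC update, Eq. (3.20)
    return soc_next, i_rc_next, v_t
```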
3.2.2 Drive Cycle Profiles

The demanded power for each cell refers to the power output each cell is expected to provide at any given moment. The simulation converts this power value into the current that must flow through the cells in order to supply the demanded power, using dynamic variables such as voltage together with the internal cell parameters. From the power demand, the current expected to flow through each cell is also calculated. The current flowing through each cell will either charge or discharge it, depending on which state of operation the cells are in.

Since the cell simulation requires a power demand profile that makes cells provide energy and become discharged over time, the simulation uses drive cycle profiles based on real-world data, leveraged from previous works. This allows cells to be discharged at a rate as close to a real-world scenario as possible, so that more realistic cell discharge behaviors are captured. The drive cycle profiles are functions of expected power demand versus time; in the simulation, power is measured in watts and time in seconds. There are multiple drive cycle profiles of an EV's power demand over time in a given driving environment, such as a city, a highway or an urban area. Each of these scenarios has very different power demand characteristics: on a highway the power demand is highest, since maintaining a high speed is energy-intensive, in contrast to driving slowly through an urban area with many stopping points.

Figure 3.3: Highway (us06) and urban (udds) drive cycle profiles.

Figure 3.3 shows the power demand profiles for one cell. The highway drive cycle profile is "us06" and the urban drive cycle profile is "udds", both leveraged from [34]. These are the two main profiles used in the simulation experiments. The "us06" profile discharges cells at a much higher rate and has a short duration, which means that cells under constant discharge with this profile need to be charged more often. The "udds" profile, on the other hand, has a lower power demand and a longer duration, which translates to cells discharging much more slowly and requiring charging less often.

3.2.3 Cell Simulation States

The cell simulation accounts for three different states that cells can be in: charging, discharging, and resting. These three states aim to imitate the normal operation of cells in a real system. Cells become discharged during usage, are charged when they reach a certain lower threshold, and rest when the system is not in operation. Figure 3.4 shows the state flow of the cell simulation.

Figure 3.4: State flow chart of the cell pack simulation.

The simulation time scale depends on the power profiles, and each simulation step simulates one second of operation. At each step, the SOC and other variables are calculated.

During the "discharging" phase, cells become discharged according to the demanded power profile as well as the leakage and self-discharge currents. At every time step, the SOC is calculated for each cell, and the cell SOCs and voltages are then checked; if they fall below a certain threshold, the simulation state changes to "charging". Any cell i is considered discharged when

v_t,i(t) ≤ minVlim  OR  SOC_i ≤ minSOC    (3.23)

where v_t,i(t) is the terminal voltage of the cell and SOC_i is the SOC of the cell. If either condition is true, the simulation state changes to "charging".

In order to control how long cells are under operation and at rest, two counter variables are used. The internal parameter usageCounter controls the time cells spend charging and discharging, and the internal parameter restingCounter controls how long cells rest. After usageCounter finishes, restingCounter starts. Once restingCounter ends, this is counted as one cycle. The number of cycles the simulation runs for is provided by the user during the set-up and configuration of the simulation environment (Section 3.7), illustrated in Figure 3.4 as N. Once the simulation has executed N cycles, it ends.

In the "charging" state, cells are charged at a rate of 6.6 kW. Once any cell in the pack has reached its maximum voltage or SOC threshold, charging stops and the simulation returns to the discharging state. During charging, cells cannot receive all of the supplied charge due to inefficiencies in energy transfer. The parameter that defines this transfer efficiency is the Coulombic efficiency, eta (η), as defined in Table 3.1. This parameter affects the charging current directly.
When charging, the current i_net(t) flowing through the cells is defined as

i_net(t) = i_app(t) × eta + i_self-discharge(t) + i_leakage    (3.24)

where i_app(t) is the charging current, i_self-discharge(t) is the self-discharge current and i_leakage(t) is the leakage current. Any cell i is considered charged when

v_t,i(t) ≥ maxVlim    (3.25)

where v_t,i(t) is the terminal voltage of the cell.

The simulation reaches the "resting" state after usageCounter ends, and remains in it until restingCounter ends. Essentially, after the simulation has been "discharging" for a certain amount of time, it changes to "resting" to emulate cells under no load. During this state, cells are only subject to the leakage and self-discharge currents. If cells were left resting indefinitely, they would eventually discharge completely.

During the resting state, cells are neither charged nor discharged, which means the only currents flowing through the cells are the leakage and self-discharge currents. As such, any cell is considered at rest when the current i_app(t) is zero. Starting from Equation (3.8), the resting current is defined by

i_app(t) = 0    (3.26)

and thus

i_net(t) = 0 + i_self-discharge(t) + i_leakage    (3.27)
i_net(t) = i_self-discharge(t) + i_leakage    (3.28)

where i_net(t) is the total current flowing through any cell during rest.

Figure 3.5 shows how the SOCs of different cells evolve over time during several simulation cycles and phases, together with a plot of the difference between the maximum and minimum SOC; this illustrates how cells tend to drift out of balance over time. Note that the SOC differences grow steadily over time when no balancing policy is controlling the simulation.

Figure 3.5: Cell pack SOCs and SOC difference during discharging, charging, and resting phases.

3.2.4 Cell Balancing Simulation

Balancing in the simulation is handled for both passive and reconfigurable active balancing. For this to work together with the RL model training environment, the simulation has a feedback loop through which it obtains the actions the RL model wants to take and applies them to the simulation, directly affecting the balancing calculations. This means the simulation cannot continue until it has received actions from the RL model, given that this is an online training methodology.

Due to resource limitations when executing the simulation environment and passing information to the RL training environment, data passing had to be down-sampled between processes. Sending every simulation step to the RL training environment is very slow, since it involves transferring information to another process and waiting for that process to compute the next action and return it to the simulation. As a result, the simulation uses a down-sampling factor of 30, which drastically reduced simulation and training time. As a consequence, the actions applied when balancing persist for 30 simulation steps, until the next value is received and applied.

Given that active and passive balancing occur at different stages of cell operation, the balancing strategy and feedback differ for each. For passive balancing, cells are discharged over time during the resting phase only. For the active balancing approach presented in this project, balancing occurs during cell operation only.
In any phase where cells cannot be balanced, they will naturally drift out of balance if no balancing strategy is applied.

Passive balancing feedback is treated in the simulation as a discharging current that flows through each cell individually. Each cell becomes discharged based on the balancing strategy of the RL model. The RL balancing model has full control over the cell discharge current, and its feedback is applied as-is to the cells. Since this balancing can only occur during resting windows, the current feedback is only applied during the resting phase and ignored in the other phases. When the resting phase ends, whose timing the RL agent cannot control, balancing no longer occurs. For passive balancing, each cell is balanced independently, only during rest, at which point the current flowing through the resting cells is defined by Equation (3.28). To include balancing, an additional balancing current i_balance is added to the equation. The total current i_net(t) flowing through cell i during passive balancing is

i_net,i(t) = i_balance,i(t) + i_self-discharge,i(t) + i_leakage,i    (3.29)

where i_net,i(t) is the total current flowing through the cell, i_self-discharge,i(t) is the self-discharge current of the cell, i_leakage,i is the leakage current of the cell and i_balance,i is the balancing current of the cell. The balancing current is controlled externally to the simulation.

Active balancing is approached differently in this work. Because active balancing occurs during the discharge phase, cells almost always have current flowing through them. With the balancing strategy proposed in this project, each cell has a different amount of current flowing through it, controlled by the power electronics. The power electronics enable the balancing-during-operation strategy by allowing varying amounts of charge to be drawn from individual cells at any given time. For this, the balancing feedback is treated as a percentage that represents the share of power each cell will provide relative to the power demanded by the simulation profile. For example, if a cell pack is demanded a set amount of power from each cell, the balancing feedback can distribute the demanded power differently across the cells, allowing low-SOC cells to provide less charge and high-SOC cells to provide the rest. In this way, the power delivered by the pack is the same, but each cell contributes a different amount.

Active balancing with this work's approach occurs only during discharge. By modulating the power demand of each cell, the amount of charge each cell provides differs, reflected in the discharge current i_app,j of each cell, as shown in Equation (3.13). In order to control the power of each cell, we introduce a control variable C, obtained externally from the RL model, which acts as the balancing factor. Each cell has a different control value C, and it directly affects the power usage P of each cell. Normally, when there is no balancing, each cell is demanded the same power P, and the pack power demanded is P × N, where N is the total number of cells. This must always hold in order to provide the correct power demanded from the pack. As such, the control variable C must also respect this constraint; thus the following must always hold true for cell j, j ∈ {1, 2, · · · , N}:
\sum_{j=1}^{N} (P \times C_j) = P \times N    (3.30)

where C_j is the control variable, obtained from the RL model, P is the power demanded from a single cell, obtained from the drive cycle profile, and N is the total number of cells.

Additionally, the control variable C of any cell must never be 0. This case is not realistic and could cause harm to a real cell pack. For this reason, the control variable C must always lie in a safe range, which we define as follows: for cell j, j ∈ {1, 2, · · · , N},

0.5 ≤ C_j ≤ 1.5    (3.31)

where C_j is the control variable of the cell. With the control variable C defined, the discharge current i_app of any cell during active balancing is defined as

i_app,j = \frac{(v_{OC,j} - v_{RC,j}) - \sqrt{(v_{OC,j} - v_{RC,j})^{2} - 4 (P \times C_j) R_{0,j}}}{2 R_{0,j}}    (3.32)

where i_app,j is the discharging current, P is the power demand for a single cell, obtained from the drive cycle profile, C_j is the modulating control variable, obtained from the RL model, v_{OC,j} is the open circuit voltage, v_{RC,j} is the RC circuit voltage, and R_{0,j} is the series resistance.

The balancing feedback for each cell is not controlled or changed in the simulation itself; it must therefore be provided correctly from outside, since it is applied as-is.

3.3 Reinforcement Learning Model

This section describes the reinforcement learning algorithms and methods devised to train on the battery pack simulation.

3.3.1 RL environment

The simulation environment is interfaced with the RL agent via an action space and an observation space. The action space represents the parameters of the environment which can be directly altered by the RL agent. We have a different action space for each balancing topology. For passive balancing, the action space is the balancing current, the additional current discharged from each cell by the BMS. The additional balancing discharge current values are between 0 and 1 ampere, where 0 means no additional discharge current. For reconfigurable active balancing during discharge, the action space represents the usage percentage of each cell, an abstraction of the duty-cycling of the power electronics linked to each cell. The usage percentages are between 50% and 150% for each cell. The total usage values of the cells must always provide the same total power to the EV, according to the drive-cycle demand. The vehicle power demand is not altered by the topology and controller; they can only alter the way power is drawn from the cells, by requesting more power from some cells and less from others to compensate.

Figure 3.6: State transition within a training episode

Both topologies use the same observation space, which consists of the SOC values and the energy throughput of each cell in the pack. Each state represents an instance of the observation space, as seen in Figure 3.6. The agent observes a given state, chooses an action according to its latest policy, and broadcasts the action, which is then applied to the environment, leading to the following state. This state transition is then assigned a reward according to the reward function. The process continues until the transition to the final state.
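One possible way to express these interfaces in code is with gym-style space declarations; the sketch below uses the Gymnasium library as an assumed wrapper (whether this matches the project's actual tooling is not specified here) and also shows a simple projection of raw usage factors onto the constraints of Equations (3.30) and (3.31).

```python
# Gym-style declaration of the action/observation spaces described above, plus a
# simple projection of raw usage factors onto Eqs. (3.30)-(3.31). Assumed tooling
# (Gymnasium); not necessarily what the thesis implementation uses.
import numpy as np
from gymnasium import spaces

N_CELLS = 10

# Passive balancing: additional discharge current per cell, 0..1 A
passive_action_space = spaces.Box(low=0.0, high=1.0, shape=(N_CELLS,), dtype=np.float32)

# Reconfigurable active balancing: per-cell usage factor C_j in [0.5, 1.5]
active_action_space = spaces.Box(low=0.5, high=1.5, shape=(N_CELLS,), dtype=np.float32)

# Observation: SOC and accumulated energy throughput for each cell
observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2 * N_CELLS,), dtype=np.float32)

def project_usage_factors(raw_c):
    """Clip to the safe range (Eq. 3.31) and rescale so the factors sum to N (Eq. 3.30);
    the final clip can leave a small residual in the sum for extreme inputs."""
    c = np.clip(raw_c, 0.5, 1.5)
    c = c * N_CELLS / c.sum()
    return np.clip(c, 0.5, 1.5)
```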
State transitions in RL are designed to follow the Markov property, given in Equation (3.33):

P(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x_{n+1} \mid X_n = x_n), \quad \forall n \in \mathbb{N}    (3.33)

The Markov property asserts that the current state of the environment encapsulates all relevant information needed to determine the future state, rendering the history of previous states and actions unnecessary. Formally, in a Markov Decision Process (MDP), the probability distribution of transitioning to the next state, conditioned on both the current state and action, is independent of past states and actions. This property simplifies the learning process by allowing RL agents to make decisions based solely on the current state, facilitating more efficient and scalable algorithms [29].

The reward function must take this property into consideration and only assign rewards for the transition from one state to the next, according to the intermediate action. Individual state-transition rewards must not take past or future rewards into account. The mechanisms for such behaviour are handled at higher levels within the RL algorithm, not within the state transitions themselves.

3.3.2 Action sampling rate

As sample generation is bound by computational power in online RL, we make efforts to optimize the sample generation speed, at the cost of granularity in actions. Instead of computing a new action for each subsequent timestep of the simulation, we apply an action for multiple timesteps before interrupting, receiving the resulting state, and assigning the reward for the transition according to the reward function. The frequency at which a new state is produced depends on the sampling rate of the simulation. All states are linked together by actions, except the initial state of an episode. An episode is a full run of a simulation for a set number of cycles. For our balancing objective, we simulate many charge-discharge cycles across the lifetime of a pack within a vehicle.

In order to reduce the computational load, by limiting the interruptions of the simulation for additional action commands from the RL agent, we sample the environment and return an action once every 30 in-simulation seconds, which constitutes a sampling rate of 30. We thus reduce the time needed to complete a simulation by reducing the amount of context switching.

This sampling choice also affects how the reward function interacts with the environment and shapes our learning objective. As actions last for several timesteps, aggressive BMS actions such as large discharge currents can leave the cells unbalanced, which the RL agent learns to account for. As a tangential benefit, this acts as a limit on high discharge rates in the case of the passive topology, since high C-rates are undesirable in battery systems: they significantly shorten the lifetime of a pack. The RL agent is given 30 timesteps in which to gradually discharge the battery towards a balanced state, rather than attempting to do so on a second-by-second basis; experimentation has shown that the latter consistently leads to the use of maximum C-rates between seconds unless C-rate limits are included as part of the reward function.

The C-rate limiting offered by this approach reduces the complexity of the reward function, which allows for much more reliable discovery of optimal policies. As the complexity of the reward function increases, so does the difficulty for the algorithm of discovering relationships between state parameters and actions.
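A common way to realize such a sampling rate is an action-repeat wrapper around the simulation; the sketch below (hypothetical interface, simplified reward handling) holds each agent action for 30 one-second simulation steps before control returns to the agent.

```python
# Action-repeat wrapper realizing a sampling rate of 30: each action chosen by the
# agent is held for 30 one-second simulation steps. Hypothetical interface, not the
# actual project tooling; reward handling is simplified to one call per transition.
class ActionRepeatWrapper:
    def __init__(self, sim_env, repeat=30):
        self.sim_env = sim_env
        self.repeat = repeat

    def reset(self):
        return self.sim_env.reset()

    def step(self, action):
        obs, done = None, False
        for _ in range(self.repeat):          # hold the same action for `repeat` seconds
            obs, done = self.sim_env.step(action)
            if done:                          # e.g. the episode's cycle budget ran out
                break
        reward = self.sim_env.reward(obs)     # reward assigned to the resulting transition
        return obs, reward, done
```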
When rewards are delayed, however, it can become challenging for the agent to understand which actions contributed to the eventual reward. This is known as the temporal credit assignment problem. Methods like eligibility traces in TD learning help mitigate this by attributing rewards to actions taken earlier.

3.3.3 Reward Design

There is no guarantee of convergence to an optimal policy for RL algorithms, and each new training cycle can lead to different results. An improperly defined reward function can easily lead to poor results. Achieving a truly optimal policy is better understood as an aspirational goal rather than a guaranteed outcome.

'Frequency' and 'scale' matter greatly for actions and rewards, as they directly influence the learning process and the efficiency of the policy. If some rewards are scarce, the agent will never discover how to obtain them and develop the optimal policy. The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states [29]. Rewards that appear in states with low frequency are called sparse rewards. In environments that provide only sparse rewards, the agent struggles to learn effectively because it receives little information about the states it visits and cannot infer what constitutes a good or bad action.

An example of a traditional and simplistic RL reward function is a reward of +1 when all cells are balanced (i.e. when all cells are almost equal in SOC). The simulation starts, and the cells start off in a fresh, balanced state. Because of this, the agent is rewarded at the beginning and then stops being rewarded the moment the cells unbalance. After unbalancing, the agent gets no feedback from the reward about any improvements that push the cells closer to a balanced state; none of its actions can be quantified as an 'improvement'. It only receives +1 in the state of absolute balance, and 0 otherwise. Once the rewards become 0 after the unbalancing, the agent has no indicator of whether it is getting close to balance again or not, and the exploration process fails. Whatever actions the agent takes after the unbalance, it relies entirely on the cells reaching the 'balanced' state again by chance. These events are so infrequent that the cumulative changes in policy are insignificant. No policy can be learned from such a reward in this environment.

Policies are developed through improvement and iteration: the agent starts off with a bad policy and, through trial and error, gradually converges to the best policy. If there is no logical string of improving rewards with which to develop a policy, offering a significant reward change relative to the other rewards in the space, the agent cannot discover an optimum. If supplied only with scarce 'maximum' rewards, the agent cannot infer the path to reach them. Experimentation has shown that, with scarce rewards, during the first few steps of the simulation almost any action taken on the balanced battery predominantly leads to faster unbalancing. The only 'correct' action is inaction, and no further policy can be discerned, as no rewards point the agent towards restoring a balanced state.

Since a reward cannot be given only during a balanced state, it must be given in stages that eventually lead the policy to the maximum reward. This technique is called 'reward shaping'.
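To make the contrast concrete, the sketch below compares the sparse reward just described with a shaped alternative based on the SOC spread of the pack; the formulas are illustrative examples only, not the reward functions used in this work's experiments.

```python
# Sparse vs. shaped reward for SOC balancing (illustrative formulas only).
import numpy as np

def sparse_reward(socs, tol=0.01):
    """+1 only when the whole pack is within `tol` SOC spread, 0 otherwise."""
    return 1.0 if np.max(socs) - np.min(socs) <= tol else 0.0

def shaped_reward(socs):
    """Dense signal: the smaller the SOC spread, the larger the reward."""
    spread = np.max(socs) - np.min(socs)
    return 1.0 - spread                      # every small improvement is rewarded

socs = np.array([0.80, 0.78, 0.83, 0.79])
print(sparse_reward(socs), shaped_reward(socs))   # 0.0 vs. 0.95 for this slightly unbalanced pack
```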
Through reward shaping, we have experimented with reward functions which give the agent 'hints' about how close it is to balancing the cells, through partial rewards. Maximum rewards are given once the objectives are fully achieved.

A common mistake in reward shaping is to once again disregard the frequency and scale of rewards. If we design a partial reward which is large and does not lead on to the maximum reward, the agent will remain focused on the partial reward and never progress to the maximum reward. In an experimental example, a reward which sums instances of +0.1 for each balanced cell can very easily get stuck on balancing only half the cells, or only 9 cells out of a total of 10. The reward difference between that and the states where all 10 cells are balanced is so small (0.9 vs. 1.0) that the policy fails to update. The agent does not consider a variation of just 0.1, occurring with very low frequency, significant enough to update the whole policy, because the discovered rewards are already strong and deviating from them would lose the perceived progress (effectively giving up exploitation); exploration thus fails and the agent never learns to balance all 10 cells.

By introducing a second or third reward factor, such as energy throughput, new considerations must be taken into account. Challenges arise from the increased complexity of balancing multiple objectives and from ensuring that the agent learns an optimal policy that appropriately considers all reward components. Different reward factors might represent conflicting objectives. For example, cell balance and complete energy-throughput uniformity are objectives that partially work against each other, depending on the characteristics of the cells and their behaviour during operation. Balancing these conflicting objectives requires careful tuning to avoid preferential behavior, where the agent consistently ignores one objective in favor of another. Introducing new reward factors can also lead to unforeseen consequences: an agent might exploit loopholes in the reward structure to achieve high rewards without actually performing the intended task. An example of such a loophole was found while experimenting with purely negative reward structures, where the maximum possible reward per state is 0 and all other rewards are negative.

Control theory has many concepts that overlap with RL, and it is the preferred control method in traditional BMS systems. In control theory, the equivalent of the RL 'reward' is cost. While RL typically aims to maximize cumulative rewards, control theory generally focuses on minimizing cumulative costs. A setpoint is established as the reference of the control system, for example in a PID (Proportional-Integral-Derivative) or MPC (Model Predictive Control) system. The cost grows as the measurements diverge from the reference setpoint, the desired values.

Figure 3.7: Control feedback loop

By testing the principles of cost in RL training and relying on purely negative rewards in experiments, several observations were made. The objective shifts to minimizing a cost rather than maximizing a reward. As the agent is 'punished' with more negative rewards the longer it remains in an undesirable state, it prioritizes quickly exiting that state. This introduces 'speed' as a primary factor of the reward function.
However, as the agent tries to escape the future 'punishments' given for each step spent in a disadvantageous position, it quickly discovers that it can forcibly end the simulation early and stop any future costs by crashing the battery pack through over-discharge.

The agent has thus 'minimized' the accumulated costs by stopping the episode from generating any additional cost-incurring states, destroying the battery pack and finishing the simulation. Cost-based behavior has proven difficult to work around in an RL environment, and dangerous in a safety-critical vehicle application. The target environment of the agent must employ strict and thorough constraints in order to function with a cost-based behavior. Working with such a method requires hard constraints that the agent cannot violate, such as safety limits on state and action variables. These restraints would, as a result, heavily limit the possible actions of the BMS controller and the ability of the agent to achieve its task using the full range of possible actions.

Applying a larger penalty at the end of a forced pack over-discharge has also not proven to be an effective way of avoiding such behavior. As stated previously, sparse rewards are not an effective tool in policy optimization. Since the over-discharge state is the last state in the training cycle, the large lump penalty skews the policy into attributing the penalty to the immediately preceding action. Because the over-discharge crash is discovered with very high frequency as part of the learning process, the agent consistently defaults to this behavior under a cost-based reward function, no matter the adjustments to cost weights or additional costs for crashes.

3.3.4 Action/Observation space normalization

For RL algorithms, effective exploration and learning depend on appropriate scaling and normalization of the action and observation spaces, as well as of the issued rewards. Such normalization techniques ensure stable and efficient learning processes for RL agents.

The action and observation spaces in RL environments can vary widely in scale and magnitude, posing challenges for RL algorithms in learning policies and behaviors effectively. Normalization techniques aim to address these challenges by bringing consistency and stability to these spaces. Proper normalization facilitates smoother convergence, improves exploration-exploitation trade-offs, and enhances the generalization capabilities of RL agents across diverse environments. The functions which serve as the basic building