Adaptive Radar Illuminations with Deep Reinforcement Learning Master’s thesis Albin Ekelund Karlsson Samuel Sandelius DEPARTMENT OF MATHEMATICAL SCIENCES CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2023 www.chalmers.se www.chalmers.se Master’s thesis 2023 Adaptive Radar Illuminations with Deep Reinforcement Learning Illumination Scheduling for Long Range Surveillance Radar with the use of Proximal Policy Optimization Albin Ekelund Karlsson Samuel Sandelius Department of Mathematical Science Chalmers University of Technology Gothenburg, Sweden 2023 Adaptive Radar Illuminations with Deep Reinforcement Learning Albin Ekelund Karlsson Samuel Sandelius © Albin Ekelund Karlsson, 2023. © Samuel Sandelius, 2023. Supervisor: Adam Andersson, Saab AB Examiner: Peter Helgesson, Department of Mathematical Science Master’s Thesis 2023 Department of Mathematical Sciences Division of Applied Mathematics and Statistics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2023 iv Abstract A modern radar antenna can direct its energy electronically without inertia or the need for mechanically steering. This opens up several degrees of freedom such as transmission direction and illumination time, and thus also the potential to optimise operation in real-time. Long range surveillance radars solve the trade-off between searching for new targets and tracking known targets. This optimisation is often rule-based. In recent years, Reinforcement Learning (RL) Algorithms have been able to effi- ciently solve increasingly difficult tasks, such as mastering game strategies or solv- ing complex control tasks. In this thesis we show that reinforcement learning can outperform such rule-based approaches for a simulated radar. Keywords: Reinforcement Learning RL, Radar Target Tracking, Partially Observed Markov Decision Process POMDP, Active Electronically Scanned Array Antenna, Airborne Surveillance Radar v Acknowledgements We would like to express our sincere gratitude to the following individuals for the invaluable support and guidance throughout this project. First and foremost, we would like to extend our deepest appreciation to our su- pervisor Adam Andersson for his exceptional guidance and mentorship through the project. His guidance has been an essential part in shaping the direction and quality of this thesis, and we are truly grateful for his commitment, patience and dedication to our academic and professional development. We would also like to thank our examiner Peter Helgesson for his commitment and meticulous evaluation and feedback during this project. His constructive criticism have greatly contributed to the refinement of this thesis. We are grateful for his thorough assessment and valuable suggestions, which have enhanced the quality of our thesis. Lastly we would also like to extend our gratitude to Saab for this incredible opportu- nity to conduct a project within such and interesting subject, but also for providing us with the necessary resources, facilities, expertise. Thank you. Albin Ekelund Karlsson, Samuel Sandelius, Gothenburg, June 2023 vii Contents List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
2 2 Theory 5 2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Reinforcement Learning with Neural Networks . . . . . . . . . . . . . 7 2.2.1 Deep Q Learning . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Policy Gradient methods . . . . . . . . . . . . . . . . . . . . . 9 2.2.3 Actor - Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.4 Trust Region Policy Optimization . . . . . . . . . . . . . . . . 11 2.2.5 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . 11 2.3 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Bayesian Filtering . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Hungarian method . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.2.1 Mahalanobis distance . . . . . . . . . . . . . . . . . 15 3 Radar Simulator 17 3.1 Radar basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 Pulse Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.2 Waveform and Integration Time . . . . . . . . . . . . . . . . . 18 3.1.3 Pulse Repetition Frequency and Resolving Measurements . . . 18 3.2 Simulator Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Initializing the Search Area . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4.1 Spawning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4.2 Spawn Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5.1 The Lobe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5.2 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5.3 Control Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.5.4 Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 ix Contents 3.5.5 False Alarms . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 Target Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.6.1 Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.2 Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7 Simulation Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.8 Baseline Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.9 Selecting Waveform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.9.1 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.9.2 Track . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4 Methods 31 4.1 MDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.2 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3 State Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3.1 Input Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.4 Implementation Specific Details . . . . . . . . . . . . . . . . . . . . . 35 4.4.1 Varying Time Step Size . . . . . . . . . . . . . . . . . . . . . 35 4.4.2 Performance Score and Reward . . . . . . . . . . . . . . . . . 35 4.4.2.1 Performance Score . . . . . . . . . . . . . . . . . . . 35 4.4.2.2 Reward and Discount . . . . . . . . . . . . . . . . . 36 4.4.3 Discounted Reward Fast Algorithm . . . . . . . . . . . . . . . 36 4.4.4 Advantage Estimation . . . . . . . . . . . . . . . . . . . . . . 38 4.4.5 Training Algorithm . 
. . . . . . . . . . . . . . . . . . . . . . . 39 4.4.6 Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.5 Selecting Waveform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5.1 Feature Stack Architecture . . . . . . . . . . . . . . . . . . . . 41 4.5.2 Fully Connected Architecture . . . . . . . . . . . . . . . . . . 42 4.5.3 SNR-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . 44 4.6 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.7 Input State Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.7.1 No Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.7.2 Input features . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.7.3 Radial Distance, RCS, Radial Velocity . . . . . . . . . . . . . 45 4.7.4 Search Angle and Track Angle . . . . . . . . . . . . . . . . . . 45 4.7.5 Observation History . . . . . . . . . . . . . . . . . . . . . . . 46 4.7.6 Confidence Level . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.7.7 Zero Doppler Detectability . . . . . . . . . . . . . . . . . . . . 46 4.7.8 Track Volume Derivative . . . . . . . . . . . . . . . . . . . . . 47 4.7.9 Selected Features . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.8 Network Structure Optimization . . . . . . . . . . . . . . . . . . . . . 47 4.8.1 Feature stack . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.8.1.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . 48 4.8.2 Track Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.8.2.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . 48 5 Results 49 5.1 Agent Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 x Contents 5.2 Agent Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2.1 Action Distribution . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2.2 Reilluminations . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.3 Time Before Confidence Upgrade . . . . . . . . . . . . . . . . 53 5.2.4 Lost Tracks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.3 Added Realism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.3.1 Resolving Measurements . . . . . . . . . . . . . . . . . . . . . 55 5.3.2 Tracker Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.3.3 False Detections . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.3.4 Resolving measurements, tracker delay and false detections . . 57 5.3.5 General Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 58 6 Conclusion 59 6.1 Future Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Bibliography 61 A POMDP & ρPOMDP I B Simulator parameters used III xi Contents xii List of Figures 1.1 Simple illustration of the principle behind radar sensing. The small circle segments originate from the antenna to the left and propagate to the right. The waves are reflected on the object in all directions (dotted circles), eventually returning to the sender. . . . . . . . . . . 1 2.1 The MDP process and interaction between agent and environment. . 6 2.2 Example structure of a neural network. Neurons are labeled nl i for neuron i in layer l and forward connected to the next layer as shown by the blue arrows. This network has four input features and produces three outputs through three hidden layers of five neurons each. . . . . 8 3.1 The shape of the surveillance area. 
The smaller circle arc colored red on is the only part of the border where new targets may not spawn. The angle between the black middle and illumination angle (red line) is the azimuth. The dotted line shows the lobe width when illuminating in the direction of the red line. . . . . . . . . . . . . . . 22 3.2 Visualization of how the C value function looks when illuminating in azimuth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 A snapshot of the tracker state with true targets drawn as well. Trans- parent circles represent the location of a true target. Filled circles represent location estimates in tracks with the fill colors yellow, blue, and black representing tracks of confidence levels candidate, tenta- tive, and confirmed, respectively. Additionally, velocity vectors are drawn from the track locations, orange arrows for the true velocity vector, and filled arrows with the corresponding color of confidence level for the track estimates. The red line represents the radar lobe center. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1 Diagram describing the flow of the training loop used for this project. 31 4.2 The simulation loop describing the policy’s interaction with the en- vironment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3 State transitions and observations and belief state updates are all considered in the belief state transition probabilities P̃ . . . . . . . . . 32 4.4 Illustration of the idea behind the network architecture. Weights labeled wij are shared across nodes connected to different track features. 34 4.5 Entropy contribution as a function of the output probability of a single output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 xiii List of Figures 4.6 Expected integration time necessary until detection in terms of SNR. Since SNR depends on the waveform selected, the SNR on the x-axis shows normalized SNR equivalent to using 128 pulses. 256 and 512 pulses are assumed to increase the SNR by 3 and 6, respectively. . . . 42 4.7 Number of pulses used when tracking targets with SNR shown on the x-axis. The SNR values are estimated SNR for 128 pulses of integration. 42 4.8 Illustration of the idea behind the fully connected network architecture. 43 4.9 Number of pulses used when tracking targets with SNR shown on the x-axis. The SNR values are estimated SNR for 128 pulses of integration. 43 4.10 Number of pulses used when tracking targets with SNR shown on the x-axis. The SNR values are estimated SNR for 128 pulses of integration. 44 5.1 Trajectory scores over a training session. Blue line shows the average over three trajectories in one episode. Orange line shows the aver- age score over the last 50 episodes (average score over the last 150 trajectories). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Number of tracks over the course of the trajectory for the RL Agent (a) and the baseline implementation (b) . . . . . . . . . . . . . . . . 50 5.3 500 trajectories each with their median confirmed track volume as well as 25th to 75th percentile ranges. . . . . . . . . . . . . . . . . . . 51 5.4 Actions taken over the course of the trajectory for the RL Agent (a) and the baseline implementation (b). Bins on the x-axis are labeled "action #pulses", so that "search 128" represents search actions using 128 pulses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
52 5.5 Square root of the number of reilluminations over the course of the trajectory for the RL Agent (a) and the baseline implementation (b) . 53 5.6 Histogram of the time taken to upgrade tracks for the agent and baseline respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.7 Removed tracks over the course of the trajectory for the RL Agent (a) and the baseline implementation (b) . . . . . . . . . . . . . . . . 55 5.8 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 56 5.9 Tracker history over 20 trajectories. . . . . . . . . . . . . . . . . . . . 56 5.10 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 56 5.11 Tracker history over 20 trajectories. . . . . . . . . . . . . . . . . . . . 56 5.12 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 57 5.13 Tracker history over 20 trajectories. . . . . . . . . . . . . . . . . . . . 57 5.14 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 58 5.15 Tracker hitosry over 20 trajectories. . . . . . . . . . . . . . . . . . . . 58 xiv List of Tables 3.1 Integration times for different number of pulses used. . . . . . . . . . 18 4.1 Encoding of input features . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Hyperparameters used for training. Learning rates were found by grid search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Table of all features used in the forward selection . . . . . . . . . . . 46 4.4 The 4 different network architectures used for optimizing the feature stack of the neural network . . . . . . . . . . . . . . . . . . . . . . . . 47 4.5 3 different network architectures used for network structure optimiza- tion of the track stack . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.1 Performance Scores, average score over 10 trajectories of 1500 seconds each . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 B.1 Parameters used when running the simplest version of our environment III xv List of Tables xvi 1 Introduction The surveillance task of a radar concerns the search of unknown targets and the tracking of known targets within the radar search volume. Sensing with a pulse Doppler radar works by sending pulses of electromagnetic radiation which reflects at distant objects. The reflections are recorded by the radar and through data processing, targets are detected and tracked. A simple illustration in Figure 1.1 shows the propagation and reflection. Figure 1.1: Simple illustration of the principle behind radar sensing. The small circle segments originate from the antenna to the left and propagate to the right. The waves are reflected on the object in all directions (dotted circles), eventually returning to the sender. We consider an Active Electronically Scanned Array (AESA) antenna which consists of multiple smaller antennas that can take on different phase delays, allowing the direction of the pulse to be centered in a specified direction. Older radar systems are typically constructed with a static radiation pattern produced by the geometry of the aperture. Here, the energy is directed by mechanically rotating the entire aper- ture which is a slow process compared to changing the phase delays in an AESA antenna [3]. Tracking a target means correctly associating multiple detections with known tracks (series of detections originating from a single target) and, in this way, building tra- jectories for the discovered targets. 
When performing an illumination, two control inputs mainly affect the illumination outcome; the direction (angle of the illumi- nation) and integration time (duration of illumination). The trade-off between searching for new targets and tracking known targets needs to be considered in order to maximize the number of tracks held. The primary cost of each action is the time spent as tracks should be frequently updated. 1 1. Introduction Traditionally, AESA based radar illuminations were scheduled using rule based pri- ority lists with different local optimization of illumination features such as integra- tion time. While such a scheme can result in very good tracking performance, there is good reason to believe that an algorithm that applies a policy based on global optimization may improve the performance. This thesis proposes a method of ap- proaching this problem by Reinforcement Learning (RL) to find a policy (probability distribution over actions) that performs better than the baseline (priority lists) in a simulated environment. 1.1 Problem formulation In our RL approach, an agent (decision maker derived from the policy) is set to perform the radar scheduling in a Python based simulation environment. A train- ing method that is typically used in RL problems is Proximal Policy Optimization (PPO) [14] which has proven especially effective in control tasks. For this project, PPO was mainly used due to its simplicity and generally good performance with- out requiring much prior parameter tuning. This allowed us to easily adapt the algorithm for solving the illumination scheduling problem. 1.1.1 Evaluation The agent is trained to schedule radar illuminations to maximize the total number of tracks held while also minimizing the time that new targets remain undiscov- ered. The performance is compared against the baseline as well as a radar which only searches for new targets by perpetually scanning the search volume, never re-illuminating existing tracks. 1.2 Context In the present project, the target tracking process is optimized at a high decision- making level, making use of traditional tracking methods in order to discretize the state and action spaces. A first attempt at realizing this approach was made by Nathanson in his master’s thesis [8] using a simulated environment and an agent trained using the PPO algorithm. While the simulation environment, tracking al- gorithm, and reinforcement learning agent were implemented and trained, the deep network agent did not outperform the baseline (priority lists), and the study was not conclusive in whether or not the training method could be successful after a more careful hyperparameter tunning. A strong hypothesis for this was that the simulator had too many layers of realism to it and the task was too hard to start with. Since Nathanson carried out his work in his thesis [8], Saab’s simulator environment has been updated with more options and tunable parameters, allowing, for example, the radar model to be simplified and apply a variable level of realism. As of this, 2 1. Introduction the problem has been simplified and the focus was initially shifted to optimising the training algorithm on a simpler problem, only later adding some of those realistic limitations and complexities to the environment. In the end we have added all the realism of Nathansson and obtained promising results. 3 1. 
2 Theory

In this chapter we introduce the theoretical foundations of concepts used throughout this project, specifically regarding Markov Decision Processes (MDP), Reinforcement Learning (RL) and an elementary tracking methodology.

Markov decision processes constitute a mathematical formalism used to study sequential decision-making where state transitions are Markovian (independent of previous history) and controlled by discrete actions. They underpin both dynamic programming and reinforcement learning. In Section 2.1 the concept of MDPs is discussed generally and in the context of this project.

Reinforcement learning is a type of machine learning technique which optimizes an agent's decision making through trial and error in an environment. The agent receives feedback based on state transitions, known as rewards, which determine how desirable each state transition is. In Section 2.2 we introduce different approaches to solving RL problems such as deep Q-learning, Policy Gradient (PG) methods, Actor-Critics, Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO). In Chapter 4, the implementation used in this project is explained in more detail.

Lastly, we discuss target tracking, which is the process of estimating target trajectories (tracks) out of discrete measurements. Traditional tracking algorithms are used in this project in order to create simple, discrete state and action spaces and to evaluate the performance. Section 2.3 covers Bayesian filtering and the Hungarian method, which are techniques used for this purpose.

2.1 Markov Decision Processes

In reinforcement learning, problems are typically modelled as Markov decision processes. An MDP is defined as a tuple (S, A, P, R) where S is the set of available environment states s, A is the set of available actions a, R(s, s′) is the reward for the state transition from s to s′, and P(s_{t+1} = s′ | s_t = s, a_t = a) is the transition probability from s to s′ given an action a between timesteps t and t + 1. Lastly, we have the policy π(a|s), i.e., the probability distribution of actions a given a state s. While (S, A, P, R) are given by the problem, π is to be optimized. A simple illustration shows the process in Figure 2.1 below. Note how the MDP only depends on the current and previous time step; previous history is not necessary to determine the future in an MDP.

Figure 2.1: The MDP process and interaction between agent and environment.

The policy π is a probability distribution over actions, so we can rewrite the transition probabilities given π as P_π = P_π(s_{t+1} = s′ | s_t = s). The objective is to find the policy π that maximizes some cumulative function of rewards, usually the expected discounted return R̂, through the Q-value Q_π(s_t, a_t),

$$Q_\pi(s_t, a_t) = \mathbb{E}_\pi\left[\hat{R}(s_t) \,\middle|\, s_t, a_t\right], \quad \text{where} \quad \hat{R}(s_{t_0}) = \sum_{t=t_0}^{\infty} \gamma^t R(s_t, s_{t+1})$$

and γ ∈ (0, 1) is the discount factor. It is used to determine the importance of earlier rewards as opposed to rewards far into the future, as well as improving numerical stability by guaranteeing that the sum converges. A policy that maximizes Q(s, a) is called an optimal policy π∗. Policies can be stochastic or deterministic (greedy). A stochastic policy is a probability distribution over actions and is optimal if, for all s ∈ S, it maximizes

$$\mathbb{E}_{a\sim\pi}[Q_\pi(s, a)] = \sum_{a\in A} \pi(a|s)\, Q_\pi(s, a).$$

An optimal deterministic policy can be written as a function of the state input

$$\pi^*(s) = \underset{a}{\operatorname{argmax}}\; Q_\pi(s, a).$$
(2.1) To aid in finding a solution to an MDP or to be able to evaluate the performance of the MDP policy, a value function Vπ(s) = Eπ [ R̂(s)|s ] can be used. The value func- tion predicts the estimated reward from a state s following the policy π. Basically, it disregards the action taken and only considers the expected return from the state itself. The value implicitly depends on π since it affects the transition probabilities Pπ(s, s′). The value can be defined recursively as Vπ(s) := ∑ s′ Pπ(s′|s) (R(s, s′) + γV (s′)) , (2.2) which allows us to rewrite the Q-value in terms of value as 6 2. Theory Qπ(s, a) = ∑ s′ Pπ(s′|a) (R(s, s′) + γV (s′)) . (2.3) With this formulation, we assume that the state S is known. If this assumption cannot be made, it is typical to instead treat the problem as a Partially Observable Markov Decision Process (POMDP) or the extension of the POMDP, the ρPOMDP, explained in Appendix A [1] [22]. The POMDP is an extension of the MDP frame- work which allows the agent to take actions based on partial and noisy information while still maximizing the expected return from true states s. It does this by utiliz- ing a belief state b, a distribution over s. The POMDP can be extended further into a ρPOMDP where the agent can be rewarded on the belief state b directly rather than s. 2.2 Reinforcement Learning with Neural Networks Similar to most modern machine learning, RL commonly uses neural networks. In recent years, deep Neural Networks have been used to solve increasingly difficult control tasks [19] [20] [16]. A neural network is composed of a set of functions re- ferred to as neurons each with their own set of tunable parameters. Multiple neurons are typically arranged in forward-connected layers where a multidimensional input is propagated forward through the network, producing an output, see [5] for further information about neural networks. In short, the input is propagated through the network’s hidden layers and the output can be parsed off the last layer, although many different, more complex architectures do exist. The hidden layers consist of neurons which are neither used as input nor output, and merely contribute with extra parameters (weights). All neurons otherwise operate in a similar manner and typically process inputs by first weighting the it using the neuron’s weight parame- ter, then summing over all weighted inputs. A bias term is added according to the neuron’s bias parameter and is finally passed through a nonlinear activation func- tion, predetermined with no parameter dependence. The result is sent to the next neuron layer or read off output from the network, see Figure 2.2 for an example of a small neural network and its neuron structure. With this arrangement, the network parameters can be optimised by minimizing a loss function between the input and the output, typically with the help of stochastic gradient descent. With such a large number of parameters and through the use of linear as well as nonlinear functions, even intricate nonlinear dependencies between the input and output can be learned. For Reinforcement Learning, the network outputs are directly fed to the input of some control task, and the loss is determined through some reward after stepping through the environment. The loss function is set up so that minimizing the loss corresponds to maximizing some total reward. The neural network can take the role of the policy πθ(s, a) with parameters θ and can be used to solve an MDP. 7 2. 
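To make the last point concrete, the following is a minimal PyTorch sketch of a feed-forward policy network with a softmax output, of the same kind as the network illustrated in Figure 2.2. The class name, layer sizes and the four-dimensional state are illustrative placeholders and not taken from the implementation used in this thesis (see Chapter 4).

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Small MLP mapping a state vector to a probability distribution over actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # The softmax keeps the output a valid probability distribution pi_theta(a|s).
        return torch.softmax(self.net(state), dim=-1)

# Sampling an action from the stochastic policy for a hypothetical 4-dimensional state.
policy = PolicyNetwork(state_dim=4, n_actions=3)
probs = policy(torch.randn(4))
action = torch.multinomial(probs, num_samples=1).item()
```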
Figure 2.2: Example structure of a neural network. Neurons are labeled n_i^{(l)} for neuron i in layer l and forward connected to the next layer as shown by the blue arrows. This network has four input features and produces three outputs through three hidden layers of five neurons each.

2.2.1 Deep Q Learning

Deep Q learning is one of the more straightforward RL algorithms used to train a neural network and there are many references on the topic, for instance the textbook [18]. "Deep" in this case refers to the use of neural networks as opposed to lookup tables, allowing the agent to interpolate in a large action space where sampling Q-values for the entire state space may not be feasible. In deep Q learning, the idea is to have the network approximate the Q-value Q(s, a) and find the optimal (deterministic) policy π∗(s) = arg max_a Q(s, a). This is done by minimizing the loss

$$L_t = \left( r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t) \right)^2.$$

Here, r_t = R(s_t, s_{t+1}) and Q̂(s, a) is the network output for action a, given input s. In order to determine the optimal parameters θ, rewards are sampled over many trajectories by exploration, typically via the ϵ-greedy policy: at every timestep, with a probability ϵ, pick a random action. Otherwise, pick the action with the highest Q-value as in the optimal policy π∗(s), referred to as exploitation. In order to solve an MDP using deep Q learning, exploration is necessary in order to sample rewards from states which may still be unknown to the agent. Exploitation is used not only to fine-tune the parameter optimization itself, but also to sample the state transition probability P_π(s_{t+1} = s′ | s_t = s, a_t = a) where π may be close to the optimal policy.

There are a few downsides to deep Q learning which make it less practical than other methods in certain situations; some of them are listed below.

• Relatively low sample efficiency, as only a single weight update is performed for each trajectory sample. In short, there is no way to know the optimal step size for the gradient descent algorithm, making convergence slow.

• The Q-value initialization may have a completely different mean compared to the discounted rewards obtained from the environment. While true for any algorithm, deep Q learning makes no attempt to normalize the sampled rewards.

• The trade-off between exploration and exploitation includes an additional hyperparameter ϵ that requires tuning. On top of that, exploration occurs at random, even if the agent could potentially act with more certainty in some cases compared to others [11].

• Off-policy learning. As we shall get into later, deep Q learning is an off-policy RL method, meaning the agent does not act according to what it thinks is optimal during training due to the exploration parameter forcing exploration at random. As a result, it is more difficult to estimate the state transition probability P_π(s′|s) and it may take longer to train.

The problems listed impact most RL algorithms, but there are strategies to mitigate them. We get into these types of algorithms in the following sections.

2.2.2 Policy Gradient methods

Policy Gradient (PG) [18], [10] is a term often used in RL. The principle of policy gradient is to learn a parameterized policy for selecting actions directly, instead of indirectly via a Q-value estimate.
However, value functions can still be used to aid the training procedure. When learning the policy parameters, the gradient of an objective function J(θ) is used. Here the vector θ denotes the parameter vector of the policy π_θ(a|s). The method seeks to maximize the objective function

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r(\tau)],$$

where τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T) is a trajectory of length T, s_t is the state at time t, a_t is the chosen action at time t, r_t is the reward at time t, and r(τ) is the total return over the trajectory τ. The goal of policy gradient methods is to use gradient ascent to directly maximize J(θ) in the direction of ∇J(θ), using the update

$$\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t),$$

where ∇J(θ_t) is the gradient of J(θ) and α is the learning rate parameter.

Before we derive the gradient with the policy gradient theorem, we first introduce a common trick used in deep learning, referred to as the "log trick":

$$f(x)\,\nabla_\theta \log f(x) = f(x)\,\frac{\nabla_\theta f(x)}{f(x)} = \nabla_\theta f(x).$$

Using this together with the policy gradient theorem, the gradient of J(θ) can be written as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ r(\tau)\, \nabla \log \pi_\theta(\tau) \right] \quad (2.4)$$
$$= \mathbb{E}_{\pi_\theta}\left[ r(\tau) \sum_{t=1}^{T} \nabla \log \pi_\theta(s_t, a_t) \right]. \quad (2.5)$$

Here ∇ log π_θ(s_t, a_t) is the score function, or log-derivative of the policy obtained with the "log trick" introduced above, which measures the sensitivity of the policy to changes in θ. The full proof of (2.4) can be found in [18, p. 325].

Modifying the policy directly in this way has its advantages. The most obvious is that there is no longer a need for the ϵ-greedy strategy. The entropy (describing the spread of the action distribution) of π(s, a) is typically high for an untrained network with randomly initialized parameters, which promotes exploration. The entropy decreases over time if the agent manages to differentiate between good and bad actions through r(τ), so that we can directly sample P_π(s′|s) while simultaneously converging to the optimal policy [11].

2.2.3 Actor - Critic

A popular approach in modern RL is to use an actor-critic architecture (see for instance [21]). This combines features from value-based methods, such as Q-learning, with policy-based methods. As we saw in the previous section, the gradient of the objective function ∇_θJ(θ) flips its sign together with the total return r(τ). Since neural networks are typically designed in a way that keeps the policy's output probabilities normalized (using for example a softmax output layer [4]), one could potentially consider, for instance, positive-only rewards and still converge to an optimal policy. It is however both intuitive and sample efficient to consider positive rewards for "good" actions and negative rewards for "bad" actions, since this is directly reflected in the objective gradient. The goal of the actor-critic architecture is to achieve this effect.

The approach involves the use of both a policy function π_θ, referred to as the actor, and a value function estimate V_ϕ, referred to as the critic. Both of these can be separate neural networks with parameters θ and ϕ respectively. We use essentially the same objective function as for the policy gradient in (2.4), but replace the total return with the advantage A(s, a) = Q_π(s, a) − V_π(s). The resulting objective function is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ A(s_t, a_t) \sum_{t=1}^{T} \nabla \log \pi_\theta(s_t, a_t) \right]. \quad (2.6)$$

The advantage is not usually known, so it is typically estimated using V_ϕ(s) and the trajectory return r(τ), for example as Â(s, a) = r(τ) − V_ϕ(s).
This changes the dynamics where previously the sign of the reward directly influenced the sign of the objective gradient. Instead, the sign depends on whether the obtained return is greater or smaller than the estimated value V ϕ. Remember that the value function is implicitly dependent on the policy π and the state transition probabilities Pπ(st+1 = s′|st = s), so positive advantage comes from actions which were better than what the critic expected based on previous trajectories. The value estimator can be trained directly on the total return r(τ) by, for example minimizing squared error. 10 2. Theory 2.2.4 Trust Region Policy Optimization Trust Region Policy Optimization (TRPO) is a policy gradient method which seeks to further improve the stability of convergence of the policy update and does this by using a parameter to constraint the maximum step length during each update [15]. The change in the policy will not deviate to far from its earlier policy when updat- ing, which allows for decreased risk of instability or convergence to local optima. The main idea behind TRPO is to optimize the surrogate function which is used to approximate the true performance objective while also constraining the policy update to a trust region around the current policy. The surrogate function is the so- called clipped surrogate which is a modification of the previously discussed objective function: LCPI(θ) = Es,a∼πθold [ πθ(a|s) πθold(a|s)Aπθold (s, a) ] , where θold are the parameters that were used when sampling the trajectory and A πθold t (s, a) is the advantage function constructed by those policy parameters. The update is then optimised within the trust region defined by a maximally allowed Kullback-Leibler divergence which is stated as a constraint. The idea of keeping a trust region is to make sure that the update does not differ too much from the current policy. However TRPO comes with some disadvantages. Calculating the KL divergence can be computationally expensive and time consuming based on the dimensions of the action space. Moreover, from a practical point of view, the hands- on implementation is also quite complicated. 2.2.5 Proximal Policy Optimization Proximal policy optimization (PPO) is a Policy gradient method that seeks to cap- ture the spirit of TRPO, but in a way which is both more efficient and simpler. Instead of solving the problem with a complex second-order method, it uses a first- order method and applies some extra tricks not to move too far away from the earlier policy. There are two variants of PPO, but in this thesis, we will focus on the variant known as PPO-clip. Instead of using the KL divergence as a constraint, PPO uses specialized clipping for the objective function to limit the distance one can move from the earlier policy, see [14] for further information. The updating scheme follows θk+1 = arg max θ Es,a∼πθk [L(s, a, θk, θ)] where L is given by L(s, a, θk, θ) = min ( πθ(a|s) πθk(a|s)Aπθk (s, a), g(ϵ, Aπθk (s, a)) ) and g(ϵ, A) = (1 + ϵ)A A ≥ 0 (1− ϵ)A A < 0. 11 2. Theory Here, ϵ is the clipping parameter. With this clipping method, the objective function is updated using the regular policy gradient objective until the policy diverges too far from the original policy. At that point, the objective is clipped, causing the gradient to become zero, effectively halting the learning process until a new trajectory is sampled. This makes the learning process much simpler compared to optimizing the policy completely within a trust region. 
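The clipped surrogate translates almost directly into code. Below is a minimal PyTorch sketch of the PPO-clip loss, assuming that log-probabilities under the old and the current policy as well as advantage estimates are already available for a sampled batch; the function name and the value ϵ = 0.2 are illustrative choices, not taken from the implementation in Chapter 4.

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective, returned negated so it can be minimized."""
    # Probability ratio pi_theta(a|s) / pi_theta_k(a|s), computed from log-probabilities.
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] reproduces g(eps, A) inside the minimum.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

When the ratio moves outside [1 − ϵ, 1 + ϵ] in the direction favoured by the advantage, the clipped term is selected by the minimum and its gradient with respect to θ is zero, which is exactly the halting behaviour described above.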
2.3 Tracking A real-world radar observation is a signal composed of informative target data and random noise from the environment. In order to get reasonable estimates of the target states, the noisy parts must be eliminated or at least reduced. One way to achieve noise reduction is by utilizing so-called Bayesian filtering. When estimating target states, this estimation is called a track. However, Bayesian filtering only solves the problem of target estimation, not the problem of multiple target tracking. Solving this problem requires some association to distinguish between tracks, which is essential in radar surveillance. One way to solve this assignment problem is the so-called Hungarian method. This method was chosen because of its simplicity and was already implemented in the simulator. However, in modern implementations, methods like multi-hypothesis tracking [2] or Multi-Object Tracking [6] are instead used. 2.3.1 Bayesian Filtering Filtering theory is a branch of statistics dealing with the problem of estimating hidden states of a Markov process via noisy measurements. Bayesian filtering theory is simply the Bayesian formulation of this theory [13]. It calculates the marginal posterior distribution (or filtering distribution) P (xk|y1:k) of state xk at timestep k based on the measurement history up to the current timestep y1:k. The posterior distribution P (xk|y1:k) and the predicted distribution P (xk|y1:k−1) at timestep k are updated in a recursive manner. The recursion starts from the prior distribution x0, thereafter the prediction distribution of state xk is calculated at time k using the Chapman equation: p(xk|y1:k−1) = ∫ p(xk|xk−1)p(xk−1|y1:k−1)dxk−1 (2.7) Then, using Bayes’ rule and given the measurement yk at timestep k, the posterior distribution of state xk can be calculated according to p(xk|y1:k) = p(yk|xk)p(xk|y1:k−1) p(yk|y1:k−1) . (2.8) Here p(yk|y1:k−1) is the normalization constant given as p(yk|y1:k−1) = ∫ p(yk|xk)p(xk|y1:k−1)dxk. (2.9) 12 2. Theory The recursive functions from the Bayesian equation work for linear Gaussian and nonlinear Gaussian state space models. In the case of a linear model with Gaus- sian noise, the filter distribution is also Gaussian. In the linear case the state and measurement models read: xk = Ak−1xk−1 + qk−1 yk = Hkxk + rk (2.10) Here xk is the state, yk is the measurements (observations), qk−1 ∼ N(0, Qk) is the process noise and rk ∼ N(0, Rk) is the measurement noise, Ak−1 is the transition matrix of the model, and Hk the measurement model matrix. The closed-form solution to this system is called the Kalman filter (KF) [13] given from (2.7), (2.8), (2.9) and reads p(xk|y1:k−1) = N(xk|m− k , P − K ) (2.11) p(xk|y1:k) = N(xk|mk, Pk) (2.12) p(yk|y1:k−1) = N(yk|HKm− k , Sk) (2.13) where p(xk|y1:k−1) is the predicted distribution, p(xk|y1:k) is the posterior distribu- tion and p(yk|y1:k−1) the predictive posterior distribution. The parameters above can be computed in two steps; prediction and update. In the prediction step we predict the state estimate m− k and the error covariance P − k according to m− k = Ak−1mk−1 P − k = Ak−1Pk−1A T k−1 + Qk−1. In the update step, we compute the measurement residual vk the covaraince residual Sk, the Kalman gain Kk, update the state estimate mk and error covariance Pk as follows: vk = yk −Hkm− k , Sk = HkP − k HT k + Rk, Kk = P − k HT k S−1 k mk = m− k + Kkvk Pk = P − k −KkSkKT k Starting from the prior mean and covariance m0 and P0. 
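The prediction and update steps above can be stated compactly in code. The following NumPy sketch assumes that the model matrices A_{k−1}, H_k, Q_{k−1} and R_k are given; the function names are illustrative and do not refer to the simulator code.

```python
import numpy as np

def kf_predict(m, P, A, Q):
    """Prediction step: m_k^- = A m_{k-1} and P_k^- = A P_{k-1} A^T + Q."""
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    return m_pred, P_pred

def kf_update(m_pred, P_pred, y, H, R):
    """Update step computing the residual, residual covariance, Kalman gain and posterior."""
    v = y - H @ m_pred                   # measurement residual v_k
    S = H @ P_pred @ H.T + R             # residual covariance S_k
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain K_k
    m = m_pred + K @ v                   # posterior mean m_k
    P = P_pred - K @ S @ K.T             # posterior covariance P_k
    return m, P
```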
While the Kalman filter gives an exact solution to the filtering problem in the linear case, in many applications, models are nonlinear, which makes the Kalman filter not applicable. Fortunately, there are approximate versions of the Kalman filter theory for nonlinear models. One is called the extended Kalman filter (EKF), which approximates the non-gaussian filter distribution using linearisation. The main idea of the EKF is to utilize the first-order Taylor approximation to assume a Gaussian approximation of the filter densities around the last prediction [13]. Using the same 13 2. Theory assumptions for the EKF as for the KF, with the differences that f and h are differentiable, the filter density can be approximated as p(xk|z1:k−1) ≃ N(xk|m− k , P − K ) (2.14) p(xk|z1:k) ≃ N(xk|mk, Pk) (2.15) p(zk|z1:k−1) ≃ N(zk|HKm− k , Sk). (2.16) From that, the prediction of the state estimate m− k and the error covariance P − k instead becomes m− k = f(mk−1) P − k = Fx(mk−1)Pk−1F T x (mk−1) + Qk−1. And the updates of the measurement residual vk, the covariance residual Sk and the Kalman gain Kk instead becomes vk = yk − h(m− k ), Sk = Hx(m− k )P − k HT x (m− k ) + Rk, Kk = P − k HT x (m− k )S−1 k , mk = m− k + Kkvk, Pk = P − k −KkSkKT k . Here Hx(m− k ) and Fx(mk−1) are the jacobian of f and h of state x evaluated at m− k 2.3.2 Hungarian method The Hungarian method, or the Hungarian algorithm, is an optimization algorithm used to solve assignment problems. The optimization problem involves finding the most optimal assignment, in this case, for a set of detection to a set of tracks, given a cost in the way of a distance between the detections and tracks. The Hungarian method is an algorithm that was developed by Harold Kuhn in 1955. Kuhn named the algorithm Hungarian because of two Hungarian mathematicians, Dénes Kőnig and Jenő Egerváry, since the algorithm was based on their earlier works [7]. Imagine you obtain a set of detections and have a group of already existing tracks, and you want to assign each detection to a track in the most efficient way possible. Since each of the detections has a given distance to each track, the goal is to find the assignment which minimizes the overall distance. The Hungarian method solves this problem by utilizing a matrix representation of distance. The matrix consists of columns representing the tracks and rows repre- senting the detections. Each cell in the matrix contains the distance associated with assigning a particular detection to a particular track. The method begins by identifying paths called "augmenting paths" in the matrix representation. Here, an augmenting path is a path that begins with an unassigned 14 2. Theory detection and moves to an unassigned track. The goal is to find a set of paths cov- ering all the unassigned cells in the matrix. Once the augmenting paths are found, the next step in the method is to proceed to modify the assignments based on these paths. It starts by selecting the path with the lowest distance. It thereafter updates the assignment along this path, switching the currently assigned detections with the unassigned ones until all the cells in the matrix are assigned. By iteratively finding paths and updating assignments, the Hungarian method eventually finds an optimal solution where the overall distance is minimized. 2.3.2.1 Mahalanobis distance The algorithm used to generate the cost matrix in the Hungarian algorithm is the Mahalanobis distance. 
Mahalanobis distance measures the distance between a point x, in our case a detection, and a distribution D ∼ N(µ, P), with the track's position µ and the track's covariance matrix P. The distance is measured in how many standard deviations the detection x is from the track position µ, given the uncertainty of the track P:

$$d_M(x, \mu; P) = \sqrt{(x - \mu)^T P^{-1} (x - \mu)}. \quad (2.17)$$

3 Radar Simulator

The simulator used for this project is based on the simulator written and used by Nathanson [8]. However, for this project, the codebase has been updated with more variable parameters, allowing us to change the complexity of the simulation environment. This chapter presents how the simulator works, the features relevant to this project, and some fundamental theory necessary to understand the reasoning behind the implementation.

3.1 Radar basics

RAdio Detection And Ranging (Radar) is a technique used to detect and gain information about distant targets with the help of radio waves. A radar sends electromagnetic radiation in a direction such that it reflects off a target. A receiver picks up the reflection to extract information about the target. In particular, its position can be determined by the response time, and a Doppler shift reveals its radial velocity towards the radar. The Doppler effect is manifested through a shift in frequency as a wave bounces off a moving target. The constant velocity of electromagnetic waves makes it possible to accurately measure the Doppler effect with signal processing [9].

Radars can be separated into two categories: those that send waves continuously while simultaneously listening, and those that repeatedly transmit a fraction of the time and listen the rest of the time. Both techniques have advantages and disadvantages, but in the context of airborne radars, the most common radar is the second one, the pulse Doppler radar. The primary advantage of the pulse Doppler radar as opposed to continuous wave (CW) radar is the increased range and accurate localization of targets due to their concentrated pulses of large gain [9].

3.1.1 Pulse Radar

Airborne radars usually have one antenna used for both transmitting and receiving. By the time the reflection of the pulse returns to the antenna, the power is greatly diminished and must be noticeably higher than the background noise to detect an object reliably. The signal energy received from a target is given by

$$S_e = \frac{P_{avg}\, A_e^2\, t_{ot} \cdot RCS}{4\pi R^4 \lambda^2 L}, \quad (3.1)$$

where S_e is the signal energy, P_avg is the average transmitter power, A_e is the effective antenna area, R is the range, RCS is the radar cross section of the target, t_ot is the time-on-target (integration time), λ is the wavelength and L represents the general losses obtained, e.g., from processing and filtering. We also have the expression for the noise energy, which is given by

$$N_e = k T_s,$$

where k is Boltzmann's constant and T_s is the system noise temperature. Given these two equations, we can calculate the SNR (for further reading, see [9]):

$$SNR = S_e / N_e = \frac{P_{avg}\, A_e^2\, t_{ot} \cdot RCS}{4\pi R^4 \lambda^2 L k T_s}.$$

Since most of these variables are constant, we can write the expression for SNR using an equivalent constant K. We also include the area A_e with its dependence on the azimuth angle from the radar θ extracted:

$$SNR_i = K \cdot \frac{RCS \cdot A_e^2\, t_{ot}}{R^4} = K' \cdot \frac{RCS \cdot \cos(\theta)^2\, t_{ot}}{R^4}. \quad (3.2)$$

3.1.2 Waveform and Integration Time

In radar surveillance, the transmitted signal is often referred to as the waveform.
The waveform has four characteristics, carrier frequency, pulse width, modulation within each pulse or pulse to pulse, and rate at which the pulses are transmitted (pulse repetition frequency, see Section 3.1.3). However, when referring to the waveform, we mean the pulse width regarding the number of pulses sent. Since both the carrier frequency and Modulation are assumed to be constant. In the simulation, the avail- able waveforms contain wi ∈ {128, 256, 512} pulses and differ only by the number of pulses used. Integration time is the time it takes for one illumination, i.e., the time for all wi pulses to be sent, plus a burn-in time taken for the first pulse to return from the distant radar horizon. Recorded data during the burn-in time is challenging to process because the data are non-stationary, contaminated with pulses from the illumination before, and not used when processing the signal. As for the total integration time, we denote it I128, I256, I512 for the different waveforms, respectively which can be found in Table 3.1. Number of pulses 128 256 512 Time (s) 0.01197 0.02093 0.03885 Table 3.1: Integration times for different number of pulses used. 3.1.3 Pulse Repetition Frequency and Resolving Measure- ments The radar sends multiple pulses at a given frequency to obtain a measurement from a target. The pulse transmission rate is called the Pulse Repetition Frequency 18 3. Radar Simulator (PRF). However, even if one sends multiple pulses using the same frequency, knowing which response corresponds to what pulse is impossible. As a result, using the response time to determine the distance to the target results in ambiguity under the assumption that the response could be coming from any of the transmitted pulses. This issue can be solved by resolving the measurement by varying the PRF in consecutive measurements [9, Chapter 15]. This can resolve the ambiguity by shifting the response time of all pulses but the one corresponding to the actual response, allowing us to single out the accurate distance. After that, by measuring the differences in frequency from the sent signal to the obtained, the target’s velocity can be estimated by the size of the frequency shift. The shift is described as follows: fD = −2vrad λ , where vrad is the radial velocity and λ is the wavelength of the radiowave. The Doppler frequency resolution depends on the number of pulses sent. By properties of the discrete Fourier transform, Doppler frequency, just as range, is measured with an ambiguity that needs to be resolved. The simulator emulates the need to resolve measurements by requiring multiple detections before a non-associated measurement can be sampled. Alternatively, a detection can be matched with an existing track without the need to resolve since it can be correlated with the track’s position and velocity instead. In order to fulfill this requirement, the azimuth angle increments in the search action are reduced to produce overlapping lobes. A parameter m specifies the number of lobes that should all cover a mutual azimuth angle, and the angle increment size is determined to fulfill that requirement (see details in Section 3.5.3). With this, obtaining up to m detections from the same target is possible, all while adjusting the PRF between each detection. A parameter n specifies how many detections of different PRFs are necessary out of the m overlapping pulses to consider a measure- ment resolved. 
If less than n matching detections are obtained, no unambiguous measurement is sampled and any unresolved detections are discarded. 3.2 Simulator Overview The simulator is built to simulate a phased array radar (see e.g [9]) using an AESA antenna. The antenna is computer-controlled, allowing the radar to steer the di- rection of the radio waves with significant gain without moving the antenna itself, resulting in minimal delay. The radar system can be controlled and used in a simulated environment, providing targets that can be detected based on the radar equation 3.1. 3.3 Initializing the Search Area All activity happens inside the defined search area, a circle sector of radius r ∈ [rmin, rmax] = [20, 450] km and angle θ ∈ [−75, 75]°. Nothing is simulated outside of 19 3. Radar Simulator this area. If a target moves outside the boundary, it is immediately terminated. A constant n = 30 specifies roughly the desired number of targets within the search area (once an equilibrium state has been reached). Initially, ⌊0.6 · n⌋ targets are spawned on the border. The simulation then runs for 500 seconds (simulation time) before the radar is initialized in order for a steady state of evenly distributed targets to be present before tracking begins. 3.4 Targets The state of a target is described by the position and velocity state vector vk = [xk, ẋk, yk, ẏk]T , turn rate θ̇, turn initiation rate expected value ρ and radar cross section RCS. After each time step k of size ∆tk, the state vector vk is multiplied by the transition matrix X vk+1 = X · vk (3.3) with X =  1 sin(θ̇∆t) θ̇ 0 −(1−cos(θ̇∆t)) θ̇ 0 cos(θ̇∆t) 0 − sin(θ̇∆t) 0 1−cos(θ̇∆t) θ̇ 1 sin(θ̇∆t) θ̇ 0 sin(θ̇∆t) 0 cos(θ̇∆t)  . (3.4) In case θ̇ = 0, instead use the matrix limθ̇→0 X := ( limθ̇→0 X(θ̇)ij ) . At the time of spawning, each target is given a permanent turn rate expected value ρ ∼ U(10−4, 10−3) turn initiations per second, and the time of the most recent turn initiation is recorded as tprevious, set to 0 initially. A turn is then initiated at time tinitiate ∼ tprevious + Exp(ρ) where Exp(ρ) is the exponential distribution with expected value 1/ρ. A turn duration is sampled at each turn initiation tduration ∼ Exp (U (0.05, 0.1)), so the turn ends at tend = tinitiate + tduration. A new turn rate is sampled at each turn initiation θ̇ ∼ U(0.001, 0.02) rad/s, and the target keeps a constant θ̇ turn rate during the entire maneuver in positive or negative angle direction with equal odds. After the turn is completed, a new initiation time is sampled, and the turn rate is set back to zero. 3.4.1 Spawning Any new target receives the following permanent properties until it is terminated: • Velocity u ∼ U(400, 800) m/s • RCS ∼ N (U(−10, 10), 2) • ρ ∼ U(10−4, 10−3) where N (µ, σ) is the normal distribution with mean µ and standard deviation σ. RCS is the radar cross section used in (3.1). The following properties are sampled at the time of spawning but may change during the simulation: 20 3. Radar Simulator • Position (x, y) • Heading θ • Turn rate θ̇ = 0 The position is sampled uniformly on the border of the surveillance area, excluding the smaller circle arc, see Figure 3.1 where it is the red part of the border. The heading is sampled as the angle θ from the target position to a uniformly sampled position z = (x, y) within the surveillance area. 3.4.2 Spawn Rate During the simulation, a spawn rate ω(t) < 1 sets the expected frequency of new targets being spawned per second. 
Initially, the spawn rate is set to ω(0) = n 600 450000 = 0.04 with n = 30. The idea is to spawn targets at roughly the replacement rate for n targets with an average lifetime at around 450000 600 = 750 s. The spawn rate changes at an interval of 750 s with the intended effect of changing the spawn rate after most targets since the last spawn rate update has been termi- nated. The idea is not to let the target intensity reach an equilibrium for too long before the spawn rate is changed, which is then done according to ω(t) ∼ clip ( N ( ω(0), ω(0) 10 ) , 0, 1 ) , (3.5) where clip(x, xmin, xmax) =  xmin if x ≤ xmin x if xmin < x < xmax xmax if x > xmax . After each whole second of simulation time, a new target is spawned with probability ω(t). 3.5 Radar In this section, we describe how the different parts of the radar are integrated and used in the simulation. 3.5.1 The Lobe The lobe width in boresight (zero degrees azimuth) is given by Φ = 2°, and the lobe size changes according to ωϕ = Φ/ cos(ϕ) where ϕ is the lobe’s angle to the or- thogonal direction from the antenna surface (see Figure 3.1). This effect originates from the fact that the lobe width depends on the antenna width, and since phased array antennas do not rotate, the effective width when sending a pulse at an angle is smaller than when sending it straight forward. Detection probability decays at the edges of the lobe, see below. 21 3. Radar Simulator Azimuth lobe width Figure 3.1: The shape of the surveillance area. The smaller circle arc colored red on is the only part of the border where new targets may not spawn. The angle between the black middle and illumination angle (red line) is the azimuth. The dotted line shows the lobe width when illuminating in the direction of the red line. 3.5.2 Detection When the lobe overlaps with a target, a detection is sampled via a probability distribution. If the detection is successful, a unique target id is returned to the radar used for resolving a measurement. The probability of detection pD is calculated based on the signal to noise ratio (SNR) using the following formula with ξ as the detection threshold and σ as the duty factor, where the duty factor is the ratio between the pulse width and the Pulse Repetition Interval (PRI, inverse PRF): pD = (1− σ) · C · exp ( −ξ 1+SNR ) for SNR > 1 0 otherwise, (3.6) where C depends on the angle difference between the center of the lobe and the target ϕtr = |ϕt−ϕr|, as the detection probability diminishes at the edge of the lobe depending on the lobe decay parameter τ = 0.5: C = 1 for τ · ωϕ/2− ϕtr > 0 1 + (τ ·ωϕ/2−ϕtr) (ωϕ(1−τ)/2) otherwise (3.7) The lobe decay τ specifies the proportion of the lobe angle in the center of the lobe, where C = 1. For the remaining part of the lobe width, C decays linearly to 0, see Figure 3.2 22 3. 
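The detection model in Equations (3.6) and (3.7) can be summarized in a few lines. The sketch below is illustrative only: it assumes the SNR and the detection threshold ξ are given in linear scale and that the duty factor σ and the lobe decay τ are known parameters; the function names are not taken from the simulator.

```python
import math

def lobe_factor(phi_tr: float, lobe_width: float, tau: float = 0.5) -> float:
    """C factor from Equation (3.7): 1 in the central part of the lobe, then
    decaying linearly to 0 towards the lobe edge."""
    flat_edge = tau * lobe_width / 2.0
    if flat_edge - phi_tr > 0:
        return 1.0
    # Linear decay; clamped at 0 for angle differences outside the lobe.
    return max(0.0, 1.0 + (flat_edge - phi_tr) / (lobe_width * (1.0 - tau) / 2.0))

def detection_probability(snr: float, phi_tr: float, lobe_width: float,
                          xi: float, duty_factor: float, tau: float = 0.5) -> float:
    """p_D from Equation (3.6), with snr and xi in linear (not dB) scale."""
    if snr <= 1.0:
        return 0.0
    return (1.0 - duty_factor) * lobe_factor(phi_tr, lobe_width, tau) * math.exp(-xi / (1.0 + snr))
```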
Figure 3.2: Visualization of how the C value function looks when illuminating in azimuth; the plot shows the C value against the angle difference in degrees for a lobe of width 2 degrees.

The signal-to-noise ratio, which was derived from the radar equation (3.1), is converted into decibels, where the decibel value is given by value_dB = 10 log(value), and we instead get
\[
\mathrm{SNR}_{dB} = K_{dB} + 10 \log(\mathrm{PRI} \cdot \rho) - 40 \log(r) - r \cdot \kappa + \mathrm{RCS}_{dB} + 20 \log\!\big(\cos(\phi_t)^{\gamma}\big) + 10 \log(B), \tag{3.8}
\]
where B is the blindness factor between 0 and 1 arising from the blind zone in radial velocity; targets that move too slowly cannot be distinguished from the earth's surface and become harder to detect if moving slower than rv2. If the target moves slower than rv1, it is undetectable, so B = 0. Another contribution to the B factor (related to the mod operator) comes from the fact that the same antenna is used for sending and receiving, causing a risk of some pulses being received at the same time as new ones are sent, making them undetectable. The expression for the B factor is
\[
B = \min\!\left(1, \max\!\left(\epsilon, \frac{\mathrm{mod}\!\left(|r_v|, \frac{\lambda}{4 \cdot \mathrm{PRI}_R}\right) - r_{v1}}{r_{v2} - r_{v1}}\right)\right), \tag{3.9}
\]
\[
K_{dB} = 225 - 10 \log(\mathrm{PRI} \cdot 128). \tag{3.10}
\]
Here, rv is the target radial velocity, λ is the wavelength, PRI_R is the uniformly randomly sampled pulse repetition interval, PRI is the mean PRI, ϵ = 10⁻⁶, rv1 = 15 m/s is the minimum detectable radial velocity, rv2 = 40 m/s is the minimum radial velocity with full detectability (B = 1), ρ is the number of pulses, γ = 1.5 is the decay exponent, and κ = 4 · 10⁻⁶ is the atmospheric damping.

3.5.3 Control Inputs

The control inputs for the radar are to search or to track (often referred to as re-illuminate, not to be confused with tracks in the tracker).

If the search action is selected, the radar initially sets the lobe angle φ equal to the minimum azimuth angle in the search area (φ = −75°). It keeps the radar angle used in the most recent search in an internal memory φ_old, so that the next time the radar is commanded to search, it sets the radar angle to
\[
\varphi = \varphi_{old} + \frac{\Phi}{\cos(\varphi_{old}) \cdot m}.
\]
Since Φ/cos(φ) is the effective lobe width, the increment in azimuth angle between illuminations is varied accordingly in order to produce m lobes which overlap within one effective lobe width. If m is set to one, lobes barely touch but do not overlap with each other.

If the track action is selected, an existing track must be specified. The lobe angle is then set equal to the estimated azimuth angle towards the track. With every action, the waveform wi ∈ {128, 256, 512} must also be selected.

3.5.4 Measurements

After receiving n detections with the same target id in the last m searches, or after obtaining a detection through re-illumination, a measurement containing information about the target is sampled. From a target with coordinates (x, y) and velocity (ẋ, ẏ) (assuming the radar is positioned at the origin), we measure the radial position, the azimuth angle, and the velocities in Cartesian coordinates, (r̂, φ̂, ˆ̇x, ˆ̇y), with noise according to
\[
\hat{r} = \mathcal{N}\!\left(\sqrt{x^2 + y^2},\, 100\right), \qquad
\hat{\varphi} = \mathcal{N}\!\left(\varphi,\, \frac{0.12}{\cos(\varphi)}\right),
\]
\[
\hat{\dot{x}} = \mathcal{N}(\dot{x}, 5), \qquad \hat{\dot{y}} = \mathcal{N}(\dot{y}, 5).
\]
Additionally, the RCS and SNR are measured:
\[
\widehat{\mathrm{RCS}}_{dB} = \mathcal{N}(\mathrm{RCS}_{dB}, 2), \qquad
\widehat{\mathrm{SNR}}_{dB} = \mathcal{N}(\mathrm{SNR}_{dB}, 2).
\]
The RCS_dB for a target is its average RCS_dB given at the spawning time. The SNR_dB for the target is given by (3.2).
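The following is a sketch of how such a noisy measurement could be sampled (Python/NumPy); the azimuth convention relative to boresight and all names are illustrative assumptions on our part, not the simulator's interface.

import numpy as np

rng = np.random.default_rng()

def sample_measurement(x, y, x_dot, y_dot, rcs_db, snr_db):
    """Noisy measurement of a detected target, with the radar at the origin,
    following the distributions in Section 3.5.4."""
    r = np.hypot(x, y)
    phi = np.arctan2(y, x)  # assumed azimuth convention (angle from boresight)
    return {
        "range": rng.normal(r, 100.0),                    # r_hat
        "azimuth": rng.normal(phi, 0.12 / np.cos(phi)),   # phi_hat
        "x_dot": rng.normal(x_dot, 5.0),
        "y_dot": rng.normal(y_dot, 5.0),
        "rcs_db": rng.normal(rcs_db, 2.0),
        "snr_db": rng.normal(snr_db, 2.0),
    }

# Example target roughly 150 km from the radar.
z = sample_measurement(x=120e3, y=90e3, x_dot=-250.0, y_dot=300.0,
                       rcs_db=3.0, snr_db=15.0)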
Notice that we allow measurement of the full two-dimensional velocity, while a real radar only measures radial velocity. We motivate this as follows: the target tracker is basic and inferior to a real tracker, and helping it with extra velocity information compensates for this to some extent.

3.5.5 False Alarms

In the real world, where noise is unavoidable, there is a chance that random noise spikes cause a candidate track to be created despite not corresponding to an actual target. In this project, noise is sampled for measurements after enough detections from a target. However, background noise from sources other than targets is not simulated. Instead, false measurements are sampled at random, with a probability pfalse of occurring after each action. When a false measurement occurs, the azimuth angle of the false track is sampled uniformly around the lobe center, within the lobe width. We want the false measurement to be uniformly distributed within the area covered by the lobe, so we sample the radial distance r according to
\[
r \sim \sqrt{U(r_{\min}^2, r_{\max}^2)},
\]
since the area covered by the lobe increases in proportion to the radius. The heading angle θ is sampled as θ ∼ U(0, 2π). The measured velocity, SNR, and RCS are sampled with the same distributions as for newly spawned targets. The measurement is then sent to the tracker in the same way as true measurements.

3.6 Target Tracking

Whenever the radar receives a measurement, it is sent to the tracker. The tracker can do one of two things with the measurement: associate it with an existing track and update its state (the prediction of the target's properties), or create a new track with its own prediction. Each track contains estimates of all measured properties of the target and their covariances. The tracker is also able to predict how the tracks change over time. Some of the properties held in a track can be visualized and matched with true targets, as shown in Figure 3.3.

Figure 3.3: A snapshot of the tracker state with true targets drawn as well. Transparent circles represent the location of a true target. Filled circles represent location estimates in tracks, with the fill colors yellow, blue, and black representing tracks of confidence levels candidate, tentative, and confirmed, respectively. Additionally, velocity vectors are drawn from the track locations: orange arrows for the true velocity vector, and filled arrows with the corresponding confidence-level color for the track estimates. The red line represents the radar lobe center.

A new track is always initialized with the confidence level "candidate". The confidence level is upgraded to tentative once two additional measurements have been associated with this particular track. Finally, it is upgraded to confirmed if another two measurements have been associated. The confidence level does not affect the track in any way; it is only used as information for the policy as well as in the reward function of the MDP.

The tracking of individual targets is done with an EKF, described in Section 2.3.1, where each track gets its own EKF and where the process and measurement noise are defined as
\[
Q = \sigma_x^2
\begin{pmatrix}
t^3/4 & t^2/2 & 0 & 0 \\
t^2/2 & t & 0 & 0 \\
0 & 0 & t^3/4 & t^2/2 \\
0 & 0 & t^2/2 & t
\end{pmatrix},
\qquad
R = \begin{pmatrix}
\sigma_r & 0 & 0 & 0 \\
0 & \sigma_{\sin\varphi} & 0 & 0 \\
0 & 0 & \sigma_{\dot{x}} & 0 \\
0 & 0 & 0 & \sigma_{\dot{x}}
\end{pmatrix},
\]
with the parameters

Parameter   Value
σr          100 m
σẋ          5 m/s
σsinφ       0.12
σx          50 m

The Jacobian of the measurement model with respect to the state, used by the EKF, is
\[
J_h\big((x, \dot{x}, y, \dot{y})\big) =
\begin{pmatrix}
\frac{x}{r} & 0 & \frac{y}{r} & 0 \\
-\frac{y \cdot x}{r^3} & 0 & \frac{1}{r} - \frac{y^2}{r^3} & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.
\]
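As an illustration, the Jacobian above can be written as follows (Python/NumPy sketch; the measurement ordering (r, sin φ, ẋ, ẏ) is inferred from the diagonal of R and is our assumption, as are the names).

import numpy as np

def measurement_jacobian(state: np.ndarray) -> np.ndarray:
    """Jacobian J_h of the measurement (r, sin(phi), x_dot, y_dot)
    with respect to the state [x, x_dot, y, y_dot]."""
    x, _, y, _ = state
    r = np.hypot(x, y)
    return np.array([
        [x / r,         0.0, y / r,                 0.0],
        [-y * x / r**3, 0.0, 1.0 / r - y**2 / r**3, 0.0],
        [0.0,           1.0, 0.0,                   0.0],
        [0.0,           0.0, 0.0,                   1.0],
    ])

# Linearization point taken from a track's current state estimate.
J = measurement_jacobian(np.array([120e3, -250.0, 90e3, 300.0]))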
3.6.1 Association

The association of measurements to tracks is done in two different ways. The Hungarian algorithm is only used in search mode. For tracking/re-illumination mode, we instead use association boxes. The association boxes are generated around the targets, with dimensions three times the standard deviations of the diagonal covariance matrix in radial position, radial velocity, and radar angle. If a target is found inside the association box (a detection within all three deviations), that yields a detection. Re-illuminations do not create new track instances; they only update existing ones.

3.6.2 Termination

A track can be terminated in two different ways. Firstly, a track can be removed if the radar misses it a given number of times, depending on its confidence level. For candidates, a maximum of 9 subsequent misses since the last detection is allowed; the 10th miss terminates the track. For tentative and confirmed tracks, the numbers are 12 and 15, respectively. A miss is counted for each action taken where the radar lobe is within one lobe width of the target but does not grant a detection. Secondly, a track is terminated if too much time has passed since the last detection. We call this time the coast time, and the maximum allowed is 30 seconds.

3.7 Simulation Loop

The simulation uses an adaptive time step size based on the integration time of the radar pulse. This allows the simulation to run efficiently at the lowest resolution possible, with hardly any visible artifacts from the radar's perspective. Changes in target trajectories are updated at a much lower frequency than the rate at which the radar operates, so any difference in target behavior due to varying time step sizes is minuscule. However, small differences in simulated trajectories build up over time, creating large differences in the long run. Since targets are terminated once they leave the search area, individual target trajectories are replaced by new ones before large differences have a chance to arise.

Algorithm 1 below describes the order in which the steps of the simulation are taken.

Algorithm 1 Simulation Loop
1. Initialize search area and spawn ⌊0.6n⌋ targets
2. Simulate for 500 seconds while integrating movement and spawning new targets
3. Initialize radar
while t