Adaptive Radar Illuminations with Deep Reinforcement Learning Master’s thesis Albin Ekelund Karlsson Samuel Sandelius DEPARTMENT OF MATHEMATICAL SCIENCES CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2023 www.chalmers.se www.chalmers.se Master’s thesis 2023 Adaptive Radar Illuminations with Deep Reinforcement Learning Illumination Scheduling for Long Range Surveillance Radar with the use of Proximal Policy Optimization Albin Ekelund Karlsson Samuel Sandelius Department of Mathematical Science Chalmers University of Technology Gothenburg, Sweden 2023 Adaptive Radar Illuminations with Deep Reinforcement Learning Albin Ekelund Karlsson Samuel Sandelius © Albin Ekelund Karlsson, 2023. © Samuel Sandelius, 2023. Supervisor: Adam Andersson, Saab AB Examiner: Peter Helgesson, Department of Mathematical Science Master’s Thesis 2023 Department of Mathematical Sciences Division of Applied Mathematics and Statistics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2023 iv Abstract A modern radar antenna can direct its energy electronically without inertia or the need for mechanically steering. This opens up several degrees of freedom such as transmission direction and illumination time, and thus also the potential to optimise operation in real-time. Long range surveillance radars solve the trade-off between searching for new targets and tracking known targets. This optimisation is often rule-based. In recent years, Reinforcement Learning (RL) Algorithms have been able to effi- ciently solve increasingly difficult tasks, such as mastering game strategies or solv- ing complex control tasks. In this thesis we show that reinforcement learning can outperform such rule-based approaches for a simulated radar. Keywords: Reinforcement Learning RL, Radar Target Tracking, Partially Observed Markov Decision Process POMDP, Active Electronically Scanned Array Antenna, Airborne Surveillance Radar v Acknowledgements We would like to express our sincere gratitude to the following individuals for the invaluable support and guidance throughout this project. First and foremost, we would like to extend our deepest appreciation to our su- pervisor Adam Andersson for his exceptional guidance and mentorship through the project. His guidance has been an essential part in shaping the direction and quality of this thesis, and we are truly grateful for his commitment, patience and dedication to our academic and professional development. We would also like to thank our examiner Peter Helgesson for his commitment and meticulous evaluation and feedback during this project. His constructive criticism have greatly contributed to the refinement of this thesis. We are grateful for his thorough assessment and valuable suggestions, which have enhanced the quality of our thesis. Lastly we would also like to extend our gratitude to Saab for this incredible opportu- nity to conduct a project within such and interesting subject, but also for providing us with the necessary resources, facilities, expertise. Thank you. Albin Ekelund Karlsson, Samuel Sandelius, Gothenburg, June 2023 vii Contents List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
2 2 Theory 5 2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Reinforcement Learning with Neural Networks . . . . . . . . . . . . . 7 2.2.1 Deep Q Learning . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Policy Gradient methods . . . . . . . . . . . . . . . . . . . . . 9 2.2.3 Actor - Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.4 Trust Region Policy Optimization . . . . . . . . . . . . . . . . 11 2.2.5 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . 11 2.3 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Bayesian Filtering . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Hungarian method . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.2.1 Mahalanobis distance . . . . . . . . . . . . . . . . . 15 3 Radar Simulator 17 3.1 Radar basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 Pulse Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.2 Waveform and Integration Time . . . . . . . . . . . . . . . . . 18 3.1.3 Pulse Repetition Frequency and Resolving Measurements . . . 18 3.2 Simulator Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Initializing the Search Area . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4.1 Spawning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4.2 Spawn Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5.1 The Lobe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5.2 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5.3 Control Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.5.4 Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 ix Contents 3.5.5 False Alarms . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 Target Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.6.1 Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.2 Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7 Simulation Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.8 Baseline Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.9 Selecting Waveform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.9.1 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.9.2 Track . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4 Methods 31 4.1 MDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.2 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3 State Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3.1 Input Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.4 Implementation Specific Details . . . . . . . . . . . . . . . . . . . . . 35 4.4.1 Varying Time Step Size . . . . . . . . . . . . . . . . . . . . . 35 4.4.2 Performance Score and Reward . . . . . . . . . . . . . . . . . 35 4.4.2.1 Performance Score . . . . . . . . . . . . . . . . . . . 35 4.4.2.2 Reward and Discount . . . . . . . . . . . . . . . . . 36 4.4.3 Discounted Reward Fast Algorithm . . . . . . . . . . . . . . . 36 4.4.4 Advantage Estimation . . . . . . . . . . . . . . . . . . . . . . 38 4.4.5 Training Algorithm . 
. . . . . . . . . . . . . . . . . . . . . . . 39 4.4.6 Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.5 Selecting Waveform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5.1 Feature Stack Architecture . . . . . . . . . . . . . . . . . . . . 41 4.5.2 Fully Connected Architecture . . . . . . . . . . . . . . . . . . 42 4.5.3 SNR-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . 44 4.6 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.7 Input State Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.7.1 No Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.7.2 Input features . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.7.3 Radial Distance, RCS, Radial Velocity . . . . . . . . . . . . . 45 4.7.4 Search Angle and Track Angle . . . . . . . . . . . . . . . . . . 45 4.7.5 Observation History . . . . . . . . . . . . . . . . . . . . . . . 46 4.7.6 Confidence Level . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.7.7 Zero Doppler Detectability . . . . . . . . . . . . . . . . . . . . 46 4.7.8 Track Volume Derivative . . . . . . . . . . . . . . . . . . . . . 47 4.7.9 Selected Features . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.8 Network Structure Optimization . . . . . . . . . . . . . . . . . . . . . 47 4.8.1 Feature stack . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.8.1.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . 48 4.8.2 Track Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.8.2.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . 48 5 Results 49 5.1 Agent Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 x Contents 5.2 Agent Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2.1 Action Distribution . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2.2 Reilluminations . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.3 Time Before Confidence Upgrade . . . . . . . . . . . . . . . . 53 5.2.4 Lost Tracks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.3 Added Realism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.3.1 Resolving Measurements . . . . . . . . . . . . . . . . . . . . . 55 5.3.2 Tracker Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.3.3 False Detections . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.3.4 Resolving measurements, tracker delay and false detections . . 57 5.3.5 General Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 58 6 Conclusion 59 6.1 Future Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Bibliography 61 A POMDP & ρPOMDP I B Simulator parameters used III xi Contents xii List of Figures 1.1 Simple illustration of the principle behind radar sensing. The small circle segments originate from the antenna to the left and propagate to the right. The waves are reflected on the object in all directions (dotted circles), eventually returning to the sender. . . . . . . . . . . 1 2.1 The MDP process and interaction between agent and environment. . 6 2.2 Example structure of a neural network. Neurons are labeled nl i for neuron i in layer l and forward connected to the next layer as shown by the blue arrows. This network has four input features and produces three outputs through three hidden layers of five neurons each. . . . . 8 3.1 The shape of the surveillance area. 
The smaller circle arc colored red on is the only part of the border where new targets may not spawn. The angle between the black middle and illumination angle (red line) is the azimuth. The dotted line shows the lobe width when illuminating in the direction of the red line. . . . . . . . . . . . . . . 22 3.2 Visualization of how the C value function looks when illuminating in azimuth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 A snapshot of the tracker state with true targets drawn as well. Trans- parent circles represent the location of a true target. Filled circles represent location estimates in tracks with the fill colors yellow, blue, and black representing tracks of confidence levels candidate, tenta- tive, and confirmed, respectively. Additionally, velocity vectors are drawn from the track locations, orange arrows for the true velocity vector, and filled arrows with the corresponding color of confidence level for the track estimates. The red line represents the radar lobe center. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1 Diagram describing the flow of the training loop used for this project. 31 4.2 The simulation loop describing the policy’s interaction with the en- vironment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3 State transitions and observations and belief state updates are all considered in the belief state transition probabilities P̃ . . . . . . . . . 32 4.4 Illustration of the idea behind the network architecture. Weights labeled wij are shared across nodes connected to different track features. 34 4.5 Entropy contribution as a function of the output probability of a single output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 xiii List of Figures 4.6 Expected integration time necessary until detection in terms of SNR. Since SNR depends on the waveform selected, the SNR on the x-axis shows normalized SNR equivalent to using 128 pulses. 256 and 512 pulses are assumed to increase the SNR by 3 and 6, respectively. . . . 42 4.7 Number of pulses used when tracking targets with SNR shown on the x-axis. The SNR values are estimated SNR for 128 pulses of integration. 42 4.8 Illustration of the idea behind the fully connected network architecture. 43 4.9 Number of pulses used when tracking targets with SNR shown on the x-axis. The SNR values are estimated SNR for 128 pulses of integration. 43 4.10 Number of pulses used when tracking targets with SNR shown on the x-axis. The SNR values are estimated SNR for 128 pulses of integration. 44 5.1 Trajectory scores over a training session. Blue line shows the average over three trajectories in one episode. Orange line shows the aver- age score over the last 50 episodes (average score over the last 150 trajectories). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Number of tracks over the course of the trajectory for the RL Agent (a) and the baseline implementation (b) . . . . . . . . . . . . . . . . 50 5.3 500 trajectories each with their median confirmed track volume as well as 25th to 75th percentile ranges. . . . . . . . . . . . . . . . . . . 51 5.4 Actions taken over the course of the trajectory for the RL Agent (a) and the baseline implementation (b). Bins on the x-axis are labeled "action #pulses", so that "search 128" represents search actions using 128 pulses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
52 5.5 Square root of the number of reilluminations over the course of the trajectory for the RL Agent (a) and the baseline implementation (b) . 53 5.6 Histogram of the time taken to upgrade tracks for the agent and baseline respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.7 Removed tracks over the course of the trajectory for the RL Agent (a) and the baseline implementation (b) . . . . . . . . . . . . . . . . 55 5.8 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 56 5.9 Tracker history over 20 trajectories. . . . . . . . . . . . . . . . . . . . 56 5.10 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 56 5.11 Tracker history over 20 trajectories. . . . . . . . . . . . . . . . . . . . 56 5.12 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 57 5.13 Tracker history over 20 trajectories. . . . . . . . . . . . . . . . . . . . 57 5.14 Training scores for the agent. . . . . . . . . . . . . . . . . . . . . . . 58 5.15 Tracker hitosry over 20 trajectories. . . . . . . . . . . . . . . . . . . . 58 xiv List of Tables 3.1 Integration times for different number of pulses used. . . . . . . . . . 18 4.1 Encoding of input features . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Hyperparameters used for training. Learning rates were found by grid search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Table of all features used in the forward selection . . . . . . . . . . . 46 4.4 The 4 different network architectures used for optimizing the feature stack of the neural network . . . . . . . . . . . . . . . . . . . . . . . . 47 4.5 3 different network architectures used for network structure optimiza- tion of the track stack . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.1 Performance Scores, average score over 10 trajectories of 1500 seconds each . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 B.1 Parameters used when running the simplest version of our environment III xv List of Tables xvi 1 Introduction The surveillance task of a radar concerns the search of unknown targets and the tracking of known targets within the radar search volume. Sensing with a pulse Doppler radar works by sending pulses of electromagnetic radiation which reflects at distant objects. The reflections are recorded by the radar and through data processing, targets are detected and tracked. A simple illustration in Figure 1.1 shows the propagation and reflection. Figure 1.1: Simple illustration of the principle behind radar sensing. The small circle segments originate from the antenna to the left and propagate to the right. The waves are reflected on the object in all directions (dotted circles), eventually returning to the sender. We consider an Active Electronically Scanned Array (AESA) antenna which consists of multiple smaller antennas that can take on different phase delays, allowing the direction of the pulse to be centered in a specified direction. Older radar systems are typically constructed with a static radiation pattern produced by the geometry of the aperture. Here, the energy is directed by mechanically rotating the entire aper- ture which is a slow process compared to changing the phase delays in an AESA antenna [3]. Tracking a target means correctly associating multiple detections with known tracks (series of detections originating from a single target) and, in this way, building tra- jectories for the discovered targets. 
When performing an illumination, two control inputs mainly affect the illumination outcome; the direction (angle of the illumi- nation) and integration time (duration of illumination). The trade-off between searching for new targets and tracking known targets needs to be considered in order to maximize the number of tracks held. The primary cost of each action is the time spent as tracks should be frequently updated. 1 1. Introduction Traditionally, AESA based radar illuminations were scheduled using rule based pri- ority lists with different local optimization of illumination features such as integra- tion time. While such a scheme can result in very good tracking performance, there is good reason to believe that an algorithm that applies a policy based on global optimization may improve the performance. This thesis proposes a method of ap- proaching this problem by Reinforcement Learning (RL) to find a policy (probability distribution over actions) that performs better than the baseline (priority lists) in a simulated environment. 1.1 Problem formulation In our RL approach, an agent (decision maker derived from the policy) is set to perform the radar scheduling in a Python based simulation environment. A train- ing method that is typically used in RL problems is Proximal Policy Optimization (PPO) [14] which has proven especially effective in control tasks. For this project, PPO was mainly used due to its simplicity and generally good performance with- out requiring much prior parameter tuning. This allowed us to easily adapt the algorithm for solving the illumination scheduling problem. 1.1.1 Evaluation The agent is trained to schedule radar illuminations to maximize the total number of tracks held while also minimizing the time that new targets remain undiscov- ered. The performance is compared against the baseline as well as a radar which only searches for new targets by perpetually scanning the search volume, never re-illuminating existing tracks. 1.2 Context In the present project, the target tracking process is optimized at a high decision- making level, making use of traditional tracking methods in order to discretize the state and action spaces. A first attempt at realizing this approach was made by Nathanson in his master’s thesis [8] using a simulated environment and an agent trained using the PPO algorithm. While the simulation environment, tracking al- gorithm, and reinforcement learning agent were implemented and trained, the deep network agent did not outperform the baseline (priority lists), and the study was not conclusive in whether or not the training method could be successful after a more careful hyperparameter tunning. A strong hypothesis for this was that the simulator had too many layers of realism to it and the task was too hard to start with. Since Nathanson carried out his work in his thesis [8], Saab’s simulator environment has been updated with more options and tunable parameters, allowing, for example, the radar model to be simplified and apply a variable level of realism. As of this, 2 1. Introduction the problem has been simplified and the focus was initially shifted to optimising the training algorithm on a simpler problem, only later adding some of those realistic limitations and complexities to the environment. In the end we have added all the realism of Nathansson and obtained promising results. 3 1. 
2 Theory

In this chapter we introduce the theoretical foundations of concepts used throughout this project, specifically regarding Markov Decision Processes (MDP), Reinforcement Learning (RL) and an elementary tracking methodology.

Markov decision processes constitute a mathematical formalism used to study sequential decision-making where state transitions are Markovian (independent of previous history) and controlled by discrete actions. They underpin both dynamic programming and reinforcement learning. In Section 2.1 the concept of MDPs is discussed generally and in the context of this project.

Reinforcement learning is a type of machine learning technique which optimizes an agent's decision making through trial and error in an environment. The agent receives feedback based on state transitions, known as rewards, which determine how desirable each state transition is. In Section 2.2 we introduce different approaches to solving RL problems such as deep Q-learning, Policy Gradient (PG) methods, Actor-Critics, Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO). In Chapter 4, the implementation used in this project is explained in more detail.

Lastly, we discuss target tracking, which is the process of estimating target trajectories (tracks) out of discrete measurements. Traditional tracking algorithms are used in this project in order to create simple, discrete state and action spaces and to evaluate the performance. Section 2.3 covers Bayesian filtering and the Hungarian method, which are techniques used for this purpose.

2.1 Markov Decision Processes

In reinforcement learning, problems are typically modelled as Markov decision processes. An MDP is defined as a tuple (S, A, P, R) where S is the set of available environment states s, A is the set of available actions a, R(s, s′) is the reward for the state transition from s to s′, and P(s_{t+1} = s′ | s_t = s, a_t = a) is the transition probability from s to s′ given an action a between timesteps t and t + 1. Lastly, we have the policy π(a|s), i.e., the probability distribution of actions a given a state s. While (S, A, P, R) are given by the problem, π is to be optimized. A simple illustration shows the process in Figure 2.1 below. Note how the MDP only depends on the current and previous time step; previous history is not necessary to determine the future in an MDP.

Figure 2.1: The MDP process and interaction between agent and environment.

The policy π is a probability distribution over actions, so we can rewrite the transition probabilities given π as P_π = P_π(s_{t+1} = s′ | s_t = s). The objective is to find the policy π that maximizes some cumulative function of rewards, usually the expected discounted return R̂, through the Q-value Q_π(s_t, a_t),

$$Q_\pi(s_t, a_t) = \mathbb{E}_\pi\left[\hat{R}(s_t) \,\middle|\, s_t, a_t\right], \quad \text{where} \quad \hat{R}(s_{t_0}) = \sum_{t=t_0}^{\infty} \gamma^t R(s_t, s_{t+1})$$

and γ ∈ (0, 1) is the discount factor. It is used to determine the importance of earlier rewards as opposed to rewards far into the future, as well as improving numerical stability by guaranteeing that the sum converges. A policy that maximizes Q(s, a) is called an optimal policy π∗. Policies can be stochastic or deterministic (greedy). A stochastic policy is a probability distribution over actions and is optimal if, for all s ∈ S, it maximizes

$$\mathbb{E}_{a\sim\pi}[Q_\pi(s, a)] = \sum_{a\in A} \pi(a|s)\, Q_\pi(s, a).$$

An optimal deterministic policy can be written as a function of the state input

$$\pi^*(s) = \underset{a}{\operatorname{argmax}}\; Q_\pi(s, a).$$
(2.1) To aid in finding a solution to an MDP or to be able to evaluate the performance of the MDP policy, a value function Vπ(s) = Eπ [ R̂(s)|s ] can be used. The value func- tion predicts the estimated reward from a state s following the policy π. Basically, it disregards the action taken and only considers the expected return from the state itself. The value implicitly depends on π since it affects the transition probabilities Pπ(s, s′). The value can be defined recursively as Vπ(s) := ∑ s′ Pπ(s′|s) (R(s, s′) + γV (s′)) , (2.2) which allows us to rewrite the Q-value in terms of value as 6 2. Theory Qπ(s, a) = ∑ s′ Pπ(s′|a) (R(s, s′) + γV (s′)) . (2.3) With this formulation, we assume that the state S is known. If this assumption cannot be made, it is typical to instead treat the problem as a Partially Observable Markov Decision Process (POMDP) or the extension of the POMDP, the ρPOMDP, explained in Appendix A [1] [22]. The POMDP is an extension of the MDP frame- work which allows the agent to take actions based on partial and noisy information while still maximizing the expected return from true states s. It does this by utiliz- ing a belief state b, a distribution over s. The POMDP can be extended further into a ρPOMDP where the agent can be rewarded on the belief state b directly rather than s. 2.2 Reinforcement Learning with Neural Networks Similar to most modern machine learning, RL commonly uses neural networks. In recent years, deep Neural Networks have been used to solve increasingly difficult control tasks [19] [20] [16]. A neural network is composed of a set of functions re- ferred to as neurons each with their own set of tunable parameters. Multiple neurons are typically arranged in forward-connected layers where a multidimensional input is propagated forward through the network, producing an output, see [5] for further information about neural networks. In short, the input is propagated through the network’s hidden layers and the output can be parsed off the last layer, although many different, more complex architectures do exist. The hidden layers consist of neurons which are neither used as input nor output, and merely contribute with extra parameters (weights). All neurons otherwise operate in a similar manner and typically process inputs by first weighting the it using the neuron’s weight parame- ter, then summing over all weighted inputs. A bias term is added according to the neuron’s bias parameter and is finally passed through a nonlinear activation func- tion, predetermined with no parameter dependence. The result is sent to the next neuron layer or read off output from the network, see Figure 2.2 for an example of a small neural network and its neuron structure. With this arrangement, the network parameters can be optimised by minimizing a loss function between the input and the output, typically with the help of stochastic gradient descent. With such a large number of parameters and through the use of linear as well as nonlinear functions, even intricate nonlinear dependencies between the input and output can be learned. For Reinforcement Learning, the network outputs are directly fed to the input of some control task, and the loss is determined through some reward after stepping through the environment. The loss function is set up so that minimizing the loss corresponds to maximizing some total reward. The neural network can take the role of the policy πθ(s, a) with parameters θ and can be used to solve an MDP. 7 2. 
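To make the last point concrete, the following is a minimal PyTorch sketch of a feed-forward policy network with a softmax output, of the same kind as the network illustrated in Figure 2.2. The class name, layer sizes and the four-dimensional state are illustrative placeholders and not taken from the implementation used in this thesis (see Chapter 4).

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Small MLP mapping a state vector to a probability distribution over actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # The softmax keeps the output a valid probability distribution pi_theta(a|s).
        return torch.softmax(self.net(state), dim=-1)

# Sampling an action from the stochastic policy for a hypothetical 4-dimensional state.
policy = PolicyNetwork(state_dim=4, n_actions=3)
probs = policy(torch.randn(4))
action = torch.multinomial(probs, num_samples=1).item()
```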
Figure 2.2: Example structure of a neural network. Neurons are labeled n_i^{(l)} for neuron i in layer l and forward connected to the next layer as shown by the blue arrows. This network has four input features and produces three outputs through three hidden layers of five neurons each.

2.2.1 Deep Q Learning

Deep Q learning is one of the more straightforward RL algorithms used to train a neural network and there are many references on the topic, for instance the textbook [18]. "Deep" in this case refers to the use of neural networks as opposed to lookup tables, allowing the agent to interpolate in a large action space where sampling Q-values for the entire state space may not be feasible. In deep Q learning, the idea is to have the network approximate the Q-value Q(s, a) and find the optimal (deterministic) policy π∗(s) = arg max_a Q(s, a). This is done by minimizing the loss

$$L_t = \left( r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t) \right)^2.$$

Here, r_t = R(s_t, s_{t+1}) and Q̂(s, a) is the network output for action a, given input s. In order to determine the optimal parameters θ, rewards are sampled over many trajectories by exploration, typically via the ϵ-greedy policy: at every timestep, with a probability ϵ, pick a random action. Otherwise, pick the action with the highest Q-value as in the optimal policy π∗(s), referred to as exploitation. In order to solve an MDP using deep Q learning, exploration is necessary in order to sample rewards from states which may still be unknown to the agent. Exploitation is used not only to fine-tune the parameter optimization itself, but also to sample the state transition probability P_π(s_{t+1} = s′ | s_t = s, a_t = a) where π may be close to the optimal policy.

There are a few downsides to deep Q learning which make it less practical than other methods in certain situations; some of them are listed below.

• Relatively low sample efficiency, as only a single weight update is performed for each trajectory sample. In short, there is no way to know the optimal step size for the gradient descent algorithm, making convergence slow.

• The Q-value initialization may have a completely different mean compared to the discounted rewards obtained from the environment. While true for any algorithm, deep Q learning makes no attempt to normalize the sampled rewards.

• The trade-off between exploration and exploitation includes an additional hyperparameter ϵ that requires tuning. On top of that, exploration occurs at random, even if the agent could potentially act with more certainty in some cases compared to others [11].

• Off-policy learning. As we shall get into later, deep Q learning is an off-policy RL method, meaning the agent does not act according to what it thinks is optimal during training due to the exploration parameter forcing exploration at random. As a result, it is more difficult to estimate the state transition probability P_π(s′|s) and it may take longer to train.

The problems listed impact most RL algorithms, but there are strategies to mitigate them. We get into these types of algorithms in the following sections.

2.2.2 Policy Gradient methods

Policy Gradient (PG) [18], [10] is a term often used in RL. The principle of policy gradient is to learn a parameterized policy for selecting actions directly, instead of indirectly via a Q-value estimate.
However, value functions can still be used to aid the training procedure. When learning the policy parameters, the gradient of an objective function J(θ) is used. Here the vector θ denotes the parameter vector of the policy π_θ(a|s). The method seeks to maximize the objective function

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r(\tau)],$$

where τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T) is a trajectory of length T, s_t is the state at time t, a_t is the chosen action at time t, r_t is the reward at time t, and r(τ) is the total return over the trajectory τ. The goal of policy gradient methods is to use gradient ascent to directly maximize J(θ) in the direction of ∇J(θ), using the update

$$\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t),$$

where ∇J(θ_t) is the gradient of J(θ) and α is the learning rate parameter.

Before we derive the gradient with the policy gradient theorem, we first introduce a common trick used in deep learning, referred to as the "log trick":

$$f(x)\,\nabla_\theta \log f(x) = f(x)\,\frac{\nabla_\theta f(x)}{f(x)} = \nabla_\theta f(x).$$

Using this together with the policy gradient theorem, the gradient of J(θ) can be written as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ r(\tau)\, \nabla \log \pi_\theta(\tau) \right] \quad (2.4)$$
$$= \mathbb{E}_{\pi_\theta}\left[ r(\tau) \sum_{t=1}^{T} \nabla \log \pi_\theta(s_t, a_t) \right]. \quad (2.5)$$

Here ∇ log π_θ(s_t, a_t) is the score function, or log-derivative of the policy obtained with the "log trick" introduced above, which measures the sensitivity of the policy to changes in θ. The full proof of (2.4) can be found in [18, p. 325].

Modifying the policy directly in this way has its advantages. The most obvious is that there is no longer a need for the ϵ-greedy strategy. The entropy (describing the spread of the action distribution) of π(s, a) is typically high for an untrained network with randomly initialized parameters, which promotes exploration. The entropy decreases over time if the agent manages to differentiate between good and bad actions through r(τ), so that we can directly sample P_π(s′|s) while simultaneously converging to the optimal policy [11].

2.2.3 Actor - Critic

A popular approach in modern RL is to use an actor-critic architecture (see for instance [21]). This combines features from value-based methods, such as Q-learning, with policy-based methods. As we saw in the previous section, the gradient of the objective function ∇_θJ(θ) flips its sign together with the total return r(τ). Since neural networks are typically designed in a way that keeps the policy's output probabilities normalized (using for example a softmax output layer [4]), one could potentially consider, for instance, positive-only rewards and still converge to an optimal policy. It is however both intuitive and sample efficient to consider positive rewards for "good" actions and negative rewards for "bad" actions, since this is directly reflected in the objective gradient. The goal of the actor-critic architecture is to achieve this effect.

The approach involves the use of both a policy function π_θ, referred to as the actor, and a value function estimate V_ϕ, referred to as the critic. Both of these can be separate neural networks with parameters θ and ϕ respectively. We use essentially the same objective function as for the policy gradient in (2.4), but replace the total return with the advantage A(s, a) = Q_π(s, a) − V_π(s). The resulting objective function is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ A(s_t, a_t) \sum_{t=1}^{T} \nabla \log \pi_\theta(s_t, a_t) \right]. \quad (2.6)$$

The advantage is not usually known, so it is typically estimated using V_ϕ(s) and the trajectory return r(τ), for example as Â(s, a) = r(τ) − V_ϕ(s).
This changes the dynamics where previously the sign of the reward directly influenced the sign of the objective gradient. Instead, the sign depends on whether the obtained return is greater or smaller than the estimated value V ϕ. Remember that the value function is implicitly dependent on the policy π and the state transition probabilities Pπ(st+1 = s′|st = s), so positive advantage comes from actions which were better than what the critic expected based on previous trajectories. The value estimator can be trained directly on the total return r(τ) by, for example minimizing squared error. 10 2. Theory 2.2.4 Trust Region Policy Optimization Trust Region Policy Optimization (TRPO) is a policy gradient method which seeks to further improve the stability of convergence of the policy update and does this by using a parameter to constraint the maximum step length during each update [15]. The change in the policy will not deviate to far from its earlier policy when updat- ing, which allows for decreased risk of instability or convergence to local optima. The main idea behind TRPO is to optimize the surrogate function which is used to approximate the true performance objective while also constraining the policy update to a trust region around the current policy. The surrogate function is the so- called clipped surrogate which is a modification of the previously discussed objective function: LCPI(θ) = Es,a∼πθold [ πθ(a|s) πθold(a|s)Aπθold (s, a) ] , where θold are the parameters that were used when sampling the trajectory and A πθold t (s, a) is the advantage function constructed by those policy parameters. The update is then optimised within the trust region defined by a maximally allowed Kullback-Leibler divergence which is stated as a constraint. The idea of keeping a trust region is to make sure that the update does not differ too much from the current policy. However TRPO comes with some disadvantages. Calculating the KL divergence can be computationally expensive and time consuming based on the dimensions of the action space. Moreover, from a practical point of view, the hands- on implementation is also quite complicated. 2.2.5 Proximal Policy Optimization Proximal policy optimization (PPO) is a Policy gradient method that seeks to cap- ture the spirit of TRPO, but in a way which is both more efficient and simpler. Instead of solving the problem with a complex second-order method, it uses a first- order method and applies some extra tricks not to move too far away from the earlier policy. There are two variants of PPO, but in this thesis, we will focus on the variant known as PPO-clip. Instead of using the KL divergence as a constraint, PPO uses specialized clipping for the objective function to limit the distance one can move from the earlier policy, see [14] for further information. The updating scheme follows θk+1 = arg max θ Es,a∼πθk [L(s, a, θk, θ)] where L is given by L(s, a, θk, θ) = min ( πθ(a|s) πθk(a|s)Aπθk (s, a), g(ϵ, Aπθk (s, a)) ) and g(ϵ, A) = (1 + ϵ)A A ≥ 0 (1− ϵ)A A < 0. 11 2. Theory Here, ϵ is the clipping parameter. With this clipping method, the objective function is updated using the regular policy gradient objective until the policy diverges too far from the original policy. At that point, the objective is clipped, causing the gradient to become zero, effectively halting the learning process until a new trajectory is sampled. This makes the learning process much simpler compared to optimizing the policy completely within a trust region. 
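The clipped surrogate translates almost directly into code. Below is a minimal PyTorch sketch of the PPO-clip loss, assuming that log-probabilities under the old and the current policy as well as advantage estimates are already available for a sampled batch; the function name and the value ϵ = 0.2 are illustrative choices, not taken from the implementation in Chapter 4.

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective, returned negated so it can be minimized."""
    # Probability ratio pi_theta(a|s) / pi_theta_k(a|s), computed from log-probabilities.
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] reproduces g(eps, A) inside the minimum.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

When the ratio moves outside [1 − ϵ, 1 + ϵ] in the direction favoured by the advantage, the clipped term is selected by the minimum and its gradient with respect to θ is zero, which is exactly the halting behaviour described above.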
2.3 Tracking A real-world radar observation is a signal composed of informative target data and random noise from the environment. In order to get reasonable estimates of the target states, the noisy parts must be eliminated or at least reduced. One way to achieve noise reduction is by utilizing so-called Bayesian filtering. When estimating target states, this estimation is called a track. However, Bayesian filtering only solves the problem of target estimation, not the problem of multiple target tracking. Solving this problem requires some association to distinguish between tracks, which is essential in radar surveillance. One way to solve this assignment problem is the so-called Hungarian method. This method was chosen because of its simplicity and was already implemented in the simulator. However, in modern implementations, methods like multi-hypothesis tracking [2] or Multi-Object Tracking [6] are instead used. 2.3.1 Bayesian Filtering Filtering theory is a branch of statistics dealing with the problem of estimating hidden states of a Markov process via noisy measurements. Bayesian filtering theory is simply the Bayesian formulation of this theory [13]. It calculates the marginal posterior distribution (or filtering distribution) P (xk|y1:k) of state xk at timestep k based on the measurement history up to the current timestep y1:k. The posterior distribution P (xk|y1:k) and the predicted distribution P (xk|y1:k−1) at timestep k are updated in a recursive manner. The recursion starts from the prior distribution x0, thereafter the prediction distribution of state xk is calculated at time k using the Chapman equation: p(xk|y1:k−1) = ∫ p(xk|xk−1)p(xk−1|y1:k−1)dxk−1 (2.7) Then, using Bayes’ rule and given the measurement yk at timestep k, the posterior distribution of state xk can be calculated according to p(xk|y1:k) = p(yk|xk)p(xk|y1:k−1) p(yk|y1:k−1) . (2.8) Here p(yk|y1:k−1) is the normalization constant given as p(yk|y1:k−1) = ∫ p(yk|xk)p(xk|y1:k−1)dxk. (2.9) 12 2. Theory The recursive functions from the Bayesian equation work for linear Gaussian and nonlinear Gaussian state space models. In the case of a linear model with Gaus- sian noise, the filter distribution is also Gaussian. In the linear case the state and measurement models read: xk = Ak−1xk−1 + qk−1 yk = Hkxk + rk (2.10) Here xk is the state, yk is the measurements (observations), qk−1 ∼ N(0, Qk) is the process noise and rk ∼ N(0, Rk) is the measurement noise, Ak−1 is the transition matrix of the model, and Hk the measurement model matrix. The closed-form solution to this system is called the Kalman filter (KF) [13] given from (2.7), (2.8), (2.9) and reads p(xk|y1:k−1) = N(xk|m− k , P − K ) (2.11) p(xk|y1:k) = N(xk|mk, Pk) (2.12) p(yk|y1:k−1) = N(yk|HKm− k , Sk) (2.13) where p(xk|y1:k−1) is the predicted distribution, p(xk|y1:k) is the posterior distribu- tion and p(yk|y1:k−1) the predictive posterior distribution. The parameters above can be computed in two steps; prediction and update. In the prediction step we predict the state estimate m− k and the error covariance P − k according to m− k = Ak−1mk−1 P − k = Ak−1Pk−1A T k−1 + Qk−1. In the update step, we compute the measurement residual vk the covaraince residual Sk, the Kalman gain Kk, update the state estimate mk and error covariance Pk as follows: vk = yk −Hkm− k , Sk = HkP − k HT k + Rk, Kk = P − k HT k S−1 k mk = m− k + Kkvk Pk = P − k −KkSkKT k Starting from the prior mean and covariance m0 and P0. 
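The prediction and update steps above can be stated compactly in code. The following NumPy sketch assumes that the model matrices A_{k−1}, H_k, Q_{k−1} and R_k are given; the function names are illustrative and do not refer to the simulator code.

```python
import numpy as np

def kf_predict(m, P, A, Q):
    """Prediction step: m_k^- = A m_{k-1} and P_k^- = A P_{k-1} A^T + Q."""
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    return m_pred, P_pred

def kf_update(m_pred, P_pred, y, H, R):
    """Update step computing the residual, residual covariance, Kalman gain and posterior."""
    v = y - H @ m_pred                   # measurement residual v_k
    S = H @ P_pred @ H.T + R             # residual covariance S_k
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain K_k
    m = m_pred + K @ v                   # posterior mean m_k
    P = P_pred - K @ S @ K.T             # posterior covariance P_k
    return m, P
```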
While the Kalman filter gives an exact solution to the filtering problem in the linear case, in many applications, models are nonlinear, which makes the Kalman filter not applicable. Fortunately, there are approximate versions of the Kalman filter theory for nonlinear models. One is called the extended Kalman filter (EKF), which approximates the non-gaussian filter distribution using linearisation. The main idea of the EKF is to utilize the first-order Taylor approximation to assume a Gaussian approximation of the filter densities around the last prediction [13]. Using the same 13 2. Theory assumptions for the EKF as for the KF, with the differences that f and h are differentiable, the filter density can be approximated as p(xk|z1:k−1) ≃ N(xk|m− k , P − K ) (2.14) p(xk|z1:k) ≃ N(xk|mk, Pk) (2.15) p(zk|z1:k−1) ≃ N(zk|HKm− k , Sk). (2.16) From that, the prediction of the state estimate m− k and the error covariance P − k instead becomes m− k = f(mk−1) P − k = Fx(mk−1)Pk−1F T x (mk−1) + Qk−1. And the updates of the measurement residual vk, the covariance residual Sk and the Kalman gain Kk instead becomes vk = yk − h(m− k ), Sk = Hx(m− k )P − k HT x (m− k ) + Rk, Kk = P − k HT x (m− k )S−1 k , mk = m− k + Kkvk, Pk = P − k −KkSkKT k . Here Hx(m− k ) and Fx(mk−1) are the jacobian of f and h of state x evaluated at m− k 2.3.2 Hungarian method The Hungarian method, or the Hungarian algorithm, is an optimization algorithm used to solve assignment problems. The optimization problem involves finding the most optimal assignment, in this case, for a set of detection to a set of tracks, given a cost in the way of a distance between the detections and tracks. The Hungarian method is an algorithm that was developed by Harold Kuhn in 1955. Kuhn named the algorithm Hungarian because of two Hungarian mathematicians, Dénes Kőnig and Jenő Egerváry, since the algorithm was based on their earlier works [7]. Imagine you obtain a set of detections and have a group of already existing tracks, and you want to assign each detection to a track in the most efficient way possible. Since each of the detections has a given distance to each track, the goal is to find the assignment which minimizes the overall distance. The Hungarian method solves this problem by utilizing a matrix representation of distance. The matrix consists of columns representing the tracks and rows repre- senting the detections. Each cell in the matrix contains the distance associated with assigning a particular detection to a particular track. The method begins by identifying paths called "augmenting paths" in the matrix representation. Here, an augmenting path is a path that begins with an unassigned 14 2. Theory detection and moves to an unassigned track. The goal is to find a set of paths cov- ering all the unassigned cells in the matrix. Once the augmenting paths are found, the next step in the method is to proceed to modify the assignments based on these paths. It starts by selecting the path with the lowest distance. It thereafter updates the assignment along this path, switching the currently assigned detections with the unassigned ones until all the cells in the matrix are assigned. By iteratively finding paths and updating assignments, the Hungarian method eventually finds an optimal solution where the overall distance is minimized. 2.3.2.1 Mahalanobis distance The algorithm used to generate the cost matrix in the Hungarian algorithm is the Mahalanobis distance. 
Mahalanobis distance measures the distance between a point x, in our case a detection, and a distribution D ∼ N(µ, P), with the track's position µ and the track's covariance matrix P. The distance is measured in how many standard deviations the detection x is from the track position µ, given the uncertainty of the track P:

$$d_M(x, \mu; P) = \sqrt{(x - \mu)^T P^{-1} (x - \mu)}. \quad (2.17)$$

3 Radar Simulator

The simulator used for this project is based on the simulator written and used by Nathanson [8]. However, for this project, the codebase has been updated with more variable parameters, allowing us to change the complexity of the simulation environment. This chapter presents how the simulator works, the features relevant to this project, and some fundamental theory necessary to understand the reasoning behind the implementation.

3.1 Radar basics

RAdio Detection And Ranging (Radar) is a technique used to detect and gain information about distant targets with the help of radio waves. A radar sends electromagnetic radiation in a direction such that it reflects off a target. A receiver picks up the reflection to extract information about the target. In particular, its position can be determined by the response time, and a Doppler shift reveals its radial velocity towards the radar. The Doppler effect is manifested through a shift in frequency as a wave bounces off a moving target. The constant velocity of electromagnetic waves makes it possible to accurately measure the Doppler effect with signal processing [9].

Radars can be separated into two categories: those that send waves continuously while simultaneously listening, and those that repeatedly transmit a fraction of the time and listen the rest of the time. Both techniques have advantages and disadvantages, but in the context of airborne radars, the most common radar is the second one, the pulse Doppler radar. The primary advantage of the pulse Doppler radar as opposed to continuous wave (CW) radar is the increased range and accurate localization of targets due to their concentrated pulses of large gain [9].

3.1.1 Pulse Radar

Airborne radars usually have one antenna used for both transmitting and receiving. By the time the reflection of the pulse returns to the antenna, the power is greatly diminished and must be noticeably higher than the background noise to detect an object reliably. The signal energy received from a target is given by

$$S_e = \frac{P_{avg}\, A_e^2\, t_{ot} \cdot RCS}{4\pi R^4 \lambda^2 L}, \quad (3.1)$$

where S_e is the signal energy, P_avg is the average transmitter power, A_e is the effective antenna area, R is the range, RCS is the radar cross section of the target, t_ot is the time-on-target (integration time), λ is the wavelength and L represents the general losses obtained, e.g., from processing and filtering. We also have the expression for the noise energy, which is given by

$$N_e = k T_s,$$

where k is Boltzmann's constant and T_s is the system noise temperature. Given these two equations, we can calculate the SNR (for further reading, see [9]):

$$SNR = S_e / N_e = \frac{P_{avg}\, A_e^2\, t_{ot} \cdot RCS}{4\pi R^4 \lambda^2 L k T_s}.$$

Since most of these variables are constant, we can write the expression for SNR using an equivalent constant K. We also include the area A_e with its dependence on the azimuth angle from the radar θ extracted:

$$SNR_i = K \cdot \frac{RCS \cdot A_e^2\, t_{ot}}{R^4} = K' \cdot \frac{RCS \cdot \cos(\theta)^2\, t_{ot}}{R^4}. \quad (3.2)$$

3.1.2 Waveform and Integration Time

In radar surveillance, the transmitted signal is often referred to as the waveform.
The waveform has four characteristics, carrier frequency, pulse width, modulation within each pulse or pulse to pulse, and rate at which the pulses are transmitted (pulse repetition frequency, see Section 3.1.3). However, when referring to the waveform, we mean the pulse width regarding the number of pulses sent. Since both the carrier frequency and Modulation are assumed to be constant. In the simulation, the avail- able waveforms contain wi ∈ {128, 256, 512} pulses and differ only by the number of pulses used. Integration time is the time it takes for one illumination, i.e., the time for all wi pulses to be sent, plus a burn-in time taken for the first pulse to return from the distant radar horizon. Recorded data during the burn-in time is challenging to process because the data are non-stationary, contaminated with pulses from the illumination before, and not used when processing the signal. As for the total integration time, we denote it I128, I256, I512 for the different waveforms, respectively which can be found in Table 3.1. Number of pulses 128 256 512 Time (s) 0.01197 0.02093 0.03885 Table 3.1: Integration times for different number of pulses used. 3.1.3 Pulse Repetition Frequency and Resolving Measure- ments The radar sends multiple pulses at a given frequency to obtain a measurement from a target. The pulse transmission rate is called the Pulse Repetition Frequency 18 3. Radar Simulator (PRF). However, even if one sends multiple pulses using the same frequency, knowing which response corresponds to what pulse is impossible. As a result, using the response time to determine the distance to the target results in ambiguity under the assumption that the response could be coming from any of the transmitted pulses. This issue can be solved by resolving the measurement by varying the PRF in consecutive measurements [9, Chapter 15]. This can resolve the ambiguity by shifting the response time of all pulses but the one corresponding to the actual response, allowing us to single out the accurate distance. After that, by measuring the differences in frequency from the sent signal to the obtained, the target’s velocity can be estimated by the size of the frequency shift. The shift is described as follows: fD = −2vrad λ , where vrad is the radial velocity and λ is the wavelength of the radiowave. The Doppler frequency resolution depends on the number of pulses sent. By properties of the discrete Fourier transform, Doppler frequency, just as range, is measured with an ambiguity that needs to be resolved. The simulator emulates the need to resolve measurements by requiring multiple detections before a non-associated measurement can be sampled. Alternatively, a detection can be matched with an existing track without the need to resolve since it can be correlated with the track’s position and velocity instead. In order to fulfill this requirement, the azimuth angle increments in the search action are reduced to produce overlapping lobes. A parameter m specifies the number of lobes that should all cover a mutual azimuth angle, and the angle increment size is determined to fulfill that requirement (see details in Section 3.5.3). With this, obtaining up to m detections from the same target is possible, all while adjusting the PRF between each detection. A parameter n specifies how many detections of different PRFs are necessary out of the m overlapping pulses to consider a measure- ment resolved. 
If less than n matching detections are obtained, no unambiguous measurement is sampled and any unresolved detections are discarded. 3.2 Simulator Overview The simulator is built to simulate a phased array radar (see e.g [9]) using an AESA antenna. The antenna is computer-controlled, allowing the radar to steer the di- rection of the radio waves with significant gain without moving the antenna itself, resulting in minimal delay. The radar system can be controlled and used in a simulated environment, providing targets that can be detected based on the radar equation 3.1. 3.3 Initializing the Search Area All activity happens inside the defined search area, a circle sector of radius r ∈ [rmin, rmax] = [20, 450] km and angle θ ∈ [−75, 75]°. Nothing is simulated outside of 19 3. Radar Simulator this area. If a target moves outside the boundary, it is immediately terminated. A constant n = 30 specifies roughly the desired number of targets within the search area (once an equilibrium state has been reached). Initially, ⌊0.6 · n⌋ targets are spawned on the border. The simulation then runs for 500 seconds (simulation time) before the radar is initialized in order for a steady state of evenly distributed targets to be present before tracking begins. 3.4 Targets The state of a target is described by the position and velocity state vector vk = [xk, ẋk, yk, ẏk]T , turn rate θ̇, turn initiation rate expected value ρ and radar cross section RCS. After each time step k of size ∆tk, the state vector vk is multiplied by the transition matrix X vk+1 = X · vk (3.3) with X =  1 sin(θ̇∆t) θ̇ 0 −(1−cos(θ̇∆t)) θ̇ 0 cos(θ̇∆t) 0 − sin(θ̇∆t) 0 1−cos(θ̇∆t) θ̇ 1 sin(θ̇∆t) θ̇ 0 sin(θ̇∆t) 0 cos(θ̇∆t)  . (3.4) In case θ̇ = 0, instead use the matrix limθ̇→0 X := ( limθ̇→0 X(θ̇)ij ) . At the time of spawning, each target is given a permanent turn rate expected value ρ ∼ U(10−4, 10−3) turn initiations per second, and the time of the most recent turn initiation is recorded as tprevious, set to 0 initially. A turn is then initiated at time tinitiate ∼ tprevious + Exp(ρ) where Exp(ρ) is the exponential distribution with expected value 1/ρ. A turn duration is sampled at each turn initiation tduration ∼ Exp (U (0.05, 0.1)), so the turn ends at tend = tinitiate + tduration. A new turn rate is sampled at each turn initiation θ̇ ∼ U(0.001, 0.02) rad/s, and the target keeps a constant θ̇ turn rate during the entire maneuver in positive or negative angle direction with equal odds. After the turn is completed, a new initiation time is sampled, and the turn rate is set back to zero. 3.4.1 Spawning Any new target receives the following permanent properties until it is terminated: • Velocity u ∼ U(400, 800) m/s • RCS ∼ N (U(−10, 10), 2) • ρ ∼ U(10−4, 10−3) where N (µ, σ) is the normal distribution with mean µ and standard deviation σ. RCS is the radar cross section used in (3.1). The following properties are sampled at the time of spawning but may change during the simulation: 20 3. Radar Simulator • Position (x, y) • Heading θ • Turn rate θ̇ = 0 The position is sampled uniformly on the border of the surveillance area, excluding the smaller circle arc, see Figure 3.1 where it is the red part of the border. The heading is sampled as the angle θ from the target position to a uniformly sampled position z = (x, y) within the surveillance area. 3.4.2 Spawn Rate During the simulation, a spawn rate ω(t) < 1 sets the expected frequency of new targets being spawned per second. 
Initially, the spawn rate is set to ω(0) = n 600 450000 = 0.04 with n = 30. The idea is to spawn targets at roughly the replacement rate for n targets with an average lifetime at around 450000 600 = 750 s. The spawn rate changes at an interval of 750 s with the intended effect of changing the spawn rate after most targets since the last spawn rate update has been termi- nated. The idea is not to let the target intensity reach an equilibrium for too long before the spawn rate is changed, which is then done according to ω(t) ∼ clip ( N ( ω(0), ω(0) 10 ) , 0, 1 ) , (3.5) where clip(x, xmin, xmax) =  xmin if x ≤ xmin x if xmin < x < xmax xmax if x > xmax . After each whole second of simulation time, a new target is spawned with probability ω(t). 3.5 Radar In this section, we describe how the different parts of the radar are integrated and used in the simulation. 3.5.1 The Lobe The lobe width in boresight (zero degrees azimuth) is given by Φ = 2°, and the lobe size changes according to ωϕ = Φ/ cos(ϕ) where ϕ is the lobe’s angle to the or- thogonal direction from the antenna surface (see Figure 3.1). This effect originates from the fact that the lobe width depends on the antenna width, and since phased array antennas do not rotate, the effective width when sending a pulse at an angle is smaller than when sending it straight forward. Detection probability decays at the edges of the lobe, see below. 21 3. Radar Simulator Azimuth lobe width Figure 3.1: The shape of the surveillance area. The smaller circle arc colored red on is the only part of the border where new targets may not spawn. The angle between the black middle and illumination angle (red line) is the azimuth. The dotted line shows the lobe width when illuminating in the direction of the red line. 3.5.2 Detection When the lobe overlaps with a target, a detection is sampled via a probability distribution. If the detection is successful, a unique target id is returned to the radar used for resolving a measurement. The probability of detection pD is calculated based on the signal to noise ratio (SNR) using the following formula with ξ as the detection threshold and σ as the duty factor, where the duty factor is the ratio between the pulse width and the Pulse Repetition Interval (PRI, inverse PRF): pD = (1− σ) · C · exp ( −ξ 1+SNR ) for SNR > 1 0 otherwise, (3.6) where C depends on the angle difference between the center of the lobe and the target ϕtr = |ϕt−ϕr|, as the detection probability diminishes at the edge of the lobe depending on the lobe decay parameter τ = 0.5: C = 1 for τ · ωϕ/2− ϕtr > 0 1 + (τ ·ωϕ/2−ϕtr) (ωϕ(1−τ)/2) otherwise (3.7) The lobe decay τ specifies the proportion of the lobe angle in the center of the lobe, where C = 1. For the remaining part of the lobe width, C decays linearly to 0, see Figure 3.2 22 3. 
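The detection model in Equations (3.6) and (3.7) can be summarized in a few lines. The sketch below is illustrative only: it assumes the SNR and the detection threshold ξ are given in linear scale and that the duty factor σ and the lobe decay τ are known parameters; the function names are not taken from the simulator.

```python
import math

def lobe_factor(phi_tr: float, lobe_width: float, tau: float = 0.5) -> float:
    """C factor from Equation (3.7): 1 in the central part of the lobe, then
    decaying linearly to 0 towards the lobe edge."""
    flat_edge = tau * lobe_width / 2.0
    if flat_edge - phi_tr > 0:
        return 1.0
    # Linear decay; clamped at 0 for angle differences outside the lobe.
    return max(0.0, 1.0 + (flat_edge - phi_tr) / (lobe_width * (1.0 - tau) / 2.0))

def detection_probability(snr: float, phi_tr: float, lobe_width: float,
                          xi: float, duty_factor: float, tau: float = 0.5) -> float:
    """p_D from Equation (3.6), with snr and xi in linear (not dB) scale."""
    if snr <= 1.0:
        return 0.0
    return (1.0 - duty_factor) * lobe_factor(phi_tr, lobe_width, tau) * math.exp(-xi / (1.0 + snr))
```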
Figure 3.2: Visualization of how the C value function looks when illuminating in azimuth; the plot shows the C value against the angle difference in degrees for a lobe of width 2 degrees.

The signal-to-noise ratio, which was derived from the radar equation (3.1), is converted into decibels, where the decibel value is given by value_dB = 10 log(value), and we instead get
\[
\mathrm{SNR}_{dB} = K_{dB} + 10 \log(\mathrm{PRI} \cdot \rho) - 40 \log(r) - r \cdot \kappa + \mathrm{RCS}_{dB} + 20 \log\!\big(\cos(\phi_t)^{\gamma}\big) + 10 \log(B), \tag{3.8}
\]
where B is the blindness factor between 0 and 1 arising from the blind zone in radial velocity; targets that move too slowly cannot be distinguished from the earth's surface and become harder to detect if moving slower than rv2. If the target moves slower than rv1, it is undetectable, so B = 0. Another contribution to the B factor (related to the mod operator) comes from the fact that the same antenna is used for sending and receiving, causing a risk of some pulses being received at the same time as new ones are sent, making them undetectable. The expression for the B factor is
\[
B = \min\!\left(1, \max\!\left(\epsilon, \frac{\mathrm{mod}\!\left(|r_v|, \frac{\lambda}{4 \cdot \mathrm{PRI}_R}\right) - r_{v1}}{r_{v2} - r_{v1}}\right)\right), \tag{3.9}
\]
\[
K_{dB} = 225 - 10 \log(\mathrm{PRI} \cdot 128). \tag{3.10}
\]
Here, rv is the target radial velocity, λ is the wavelength, PRI_R is the uniformly randomly sampled pulse repetition interval, PRI is the mean PRI, ϵ = 10⁻⁶, rv1 = 15 m/s is the minimum detectable radial velocity, rv2 = 40 m/s is the minimum radial velocity with full detectability (B = 1), ρ is the number of pulses, γ = 1.5 is the decay exponent, and κ = 4 · 10⁻⁶ is the atmospheric damping.

3.5.3 Control Inputs

The control inputs for the radar are to search or to track (often referred to as re-illuminate, not to be confused with tracks in the tracker).

If the search action is selected, the radar initially sets the lobe angle φ equal to the minimum azimuth angle in the search area (φ = −75°). It keeps the radar angle used in the most recent search in an internal memory φ_old, so that the next time the radar is commanded to search, it sets the radar angle to
\[
\varphi = \varphi_{old} + \frac{\Phi}{\cos(\varphi_{old}) \cdot m}.
\]
Since Φ/cos(φ) is the effective lobe width, the increment in azimuth angle between illuminations is varied accordingly in order to produce m lobes which overlap within one effective lobe width. If m is set to one, lobes barely touch but do not overlap with each other.

If the track action is selected, an existing track must be specified. The lobe angle is then set equal to the estimated azimuth angle towards the track. With every action, the waveform wi ∈ {128, 256, 512} must also be selected.

3.5.4 Measurements

After receiving n detections with the same target id in the last m searches, or after obtaining a detection through re-illumination, a measurement containing information about the target is sampled. From a target with coordinates (x, y) and velocity (ẋ, ẏ) (assuming the radar is positioned at the origin), we measure the radial position, the azimuth angle, and the velocities in Cartesian coordinates, (r̂, φ̂, ˆ̇x, ˆ̇y), with noise according to
\[
\hat{r} = \mathcal{N}\!\left(\sqrt{x^2 + y^2},\, 100\right), \qquad
\hat{\varphi} = \mathcal{N}\!\left(\varphi,\, \frac{0.12}{\cos(\varphi)}\right),
\]
\[
\hat{\dot{x}} = \mathcal{N}(\dot{x}, 5), \qquad \hat{\dot{y}} = \mathcal{N}(\dot{y}, 5).
\]
Additionally, the RCS and SNR are measured:
\[
\widehat{\mathrm{RCS}}_{dB} = \mathcal{N}(\mathrm{RCS}_{dB}, 2), \qquad
\widehat{\mathrm{SNR}}_{dB} = \mathcal{N}(\mathrm{SNR}_{dB}, 2).
\]
The RCS_dB for a target is its average RCS_dB given at the spawning time. The SNR_dB for the target is given by (3.2).
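The following is a sketch of how such a noisy measurement could be sampled (Python/NumPy); the azimuth convention relative to boresight and all names are illustrative assumptions on our part, not the simulator's interface.

import numpy as np

rng = np.random.default_rng()

def sample_measurement(x, y, x_dot, y_dot, rcs_db, snr_db):
    """Noisy measurement of a detected target, with the radar at the origin,
    following the distributions in Section 3.5.4."""
    r = np.hypot(x, y)
    phi = np.arctan2(y, x)  # assumed azimuth convention (angle from boresight)
    return {
        "range": rng.normal(r, 100.0),                    # r_hat
        "azimuth": rng.normal(phi, 0.12 / np.cos(phi)),   # phi_hat
        "x_dot": rng.normal(x_dot, 5.0),
        "y_dot": rng.normal(y_dot, 5.0),
        "rcs_db": rng.normal(rcs_db, 2.0),
        "snr_db": rng.normal(snr_db, 2.0),
    }

# Example target roughly 150 km from the radar.
z = sample_measurement(x=120e3, y=90e3, x_dot=-250.0, y_dot=300.0,
                       rcs_db=3.0, snr_db=15.0)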
Notice that we allow measurement of the full two-dimensional velocity, while a real radar only measures radial velocity. We motivate this as follows: the target tracker is basic and inferior to a real tracker, and helping it with extra velocity information compensates for this to some extent.

3.5.5 False Alarms

In the real world, where noise is unavoidable, there is a chance that random noise spikes cause a candidate track to be created despite not corresponding to an actual target. In this project, noise is sampled for measurements after enough detections from a target. However, background noise from sources other than targets is not simulated. Instead, false measurements are sampled at random, with a probability pfalse of occurring after each action. When a false measurement occurs, the azimuth angle of the false track is sampled uniformly around the lobe center, within the lobe width. We want the false measurement to be uniformly distributed within the area covered by the lobe, so we sample the radial distance r according to
\[
r \sim \sqrt{U(r_{\min}^2, r_{\max}^2)},
\]
since the area covered by the lobe increases in proportion to the radius. The heading angle θ is sampled as θ ∼ U(0, 2π). The measured velocity, SNR, and RCS are sampled with the same distributions as for newly spawned targets. The measurement is then sent to the tracker in the same way as true measurements.

3.6 Target Tracking

Whenever the radar receives a measurement, it is sent to the tracker. The tracker can do one of two things with the measurement: associate it with an existing track and update its state (the prediction of the target's properties), or create a new track with its own prediction. Each track contains estimates of all measured properties of the target and their covariances. The tracker is also able to predict how the tracks change over time. Some of the properties held in a track can be visualized and matched with true targets, as shown in Figure 3.3.

Figure 3.3: A snapshot of the tracker state with true targets drawn as well. Transparent circles represent the location of a true target. Filled circles represent location estimates in tracks, with the fill colors yellow, blue, and black representing tracks of confidence levels candidate, tentative, and confirmed, respectively. Additionally, velocity vectors are drawn from the track locations: orange arrows for the true velocity vector, and filled arrows with the corresponding confidence-level color for the track estimates. The red line represents the radar lobe center.

A new track is always initialized with the confidence level "candidate". The confidence level is upgraded to tentative once two additional measurements have been associated with this particular track. Finally, it is upgraded to confirmed if another two measurements have been associated. The confidence level does not affect the track in any way; it is only used as information for the policy as well as in the reward function of the MDP.

The tracking of individual targets is done with an EKF, described in Section 2.3.1, where each track gets its own EKF and where the process and measurement noise are defined as
\[
Q = \sigma_x^2
\begin{pmatrix}
t^3/4 & t^2/2 & 0 & 0 \\
t^2/2 & t & 0 & 0 \\
0 & 0 & t^3/4 & t^2/2 \\
0 & 0 & t^2/2 & t
\end{pmatrix},
\qquad
R = \begin{pmatrix}
\sigma_r & 0 & 0 & 0 \\
0 & \sigma_{\sin\varphi} & 0 & 0 \\
0 & 0 & \sigma_{\dot{x}} & 0 \\
0 & 0 & 0 & \sigma_{\dot{x}}
\end{pmatrix},
\]
with the parameters

Parameter   Value
σr          100 m
σẋ          5 m/s
σsinφ       0.12
σx          50 m

The Jacobian of the measurement model with respect to the state, used by the EKF, is
\[
J_h\big((x, \dot{x}, y, \dot{y})\big) =
\begin{pmatrix}
\frac{x}{r} & 0 & \frac{y}{r} & 0 \\
-\frac{y \cdot x}{r^3} & 0 & \frac{1}{r} - \frac{y^2}{r^3} & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.
\]
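As an illustration, the Jacobian above can be written as follows (Python/NumPy sketch; the measurement ordering (r, sin φ, ẋ, ẏ) is inferred from the diagonal of R and is our assumption, as are the names).

import numpy as np

def measurement_jacobian(state: np.ndarray) -> np.ndarray:
    """Jacobian J_h of the measurement (r, sin(phi), x_dot, y_dot)
    with respect to the state [x, x_dot, y, y_dot]."""
    x, _, y, _ = state
    r = np.hypot(x, y)
    return np.array([
        [x / r,         0.0, y / r,                 0.0],
        [-y * x / r**3, 0.0, 1.0 / r - y**2 / r**3, 0.0],
        [0.0,           1.0, 0.0,                   0.0],
        [0.0,           0.0, 0.0,                   1.0],
    ])

# Linearization point taken from a track's current state estimate.
J = measurement_jacobian(np.array([120e3, -250.0, 90e3, 300.0]))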
3.6.1 Association

The association of measurements to tracks is done in two different ways. The Hungarian algorithm is only used in search mode. For tracking/re-illumination mode, we instead use association boxes. The association boxes are generated around the targets, with dimensions three times the standard deviations of the diagonal covariance matrix in radial position, radial velocity, and radar angle. If a target is found inside the association box (a detection within all three deviations), that yields a detection. Re-illuminations do not create new track instances; they only update existing ones.

3.6.2 Termination

A track can be terminated in two different ways. Firstly, a track can be removed if the radar misses it a given number of times, depending on its confidence level. For candidates, a maximum of 9 subsequent misses since the last detection is allowed; the 10th miss terminates the track. For tentative and confirmed tracks, the numbers are 12 and 15, respectively. A miss is counted for each action taken where the radar lobe is within one lobe width of the target but does not grant a detection. Secondly, a track is terminated if too much time has passed since the last detection. We call this time the coast time, and the maximum allowed is 30 seconds.

3.7 Simulation Loop

The simulation uses an adaptive time step size based on the integration time of the radar pulse. This allows the simulation to run efficiently at the lowest resolution possible, with hardly any visible artifacts from the radar's perspective. Changes in target trajectories are updated at a much lower frequency than the rate at which the radar operates, so any difference in target behavior due to varying time step sizes is minuscule. However, small differences in simulated trajectories build up over time, creating large differences in the long run. Since targets are terminated once they leave the search area, individual target trajectories are replaced by new ones before large differences have a chance to arise.

Algorithm 1 below describes the order in which the steps of the simulation are taken.

Algorithm 1 Simulation Loop
1. Initialize search area and spawn ⌊0.6n⌋ targets
2. Simulate for 500 seconds while integrating movement and spawning new targets
3. Initialize radar
while t