Detecting and Tracking Regions of Interest for Remote Measurement of Vital Parameters Estimation and Tracking of Keypoints Using Object Detection in Visual and Thermal Footage Master’s thesis in Physics MADELEINE MÜLLER DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2022 www.chalmers.se i www.chalmers.se Master’s thesis 2022 Detecting and Tracking Regions of Interest for Remote Measurement of Vital Parameters Estimation and Tracking of Keypoints Using Object Detection in Visual and Thermal Footage MADELEINE MÜLLER Department of Physics Chalmers University of Technology Gothenburg, Sweden 2022 Detecting and Tracking Regions of Interest for Remote Measurement of Vital Pa- rameters Estimation and Tracking of Keypoints Using Object Detection in Visual and Ther- mal Footage. MADELEINE MÜLLER © MADELEINE MÜLLER, 2022. Supervisor: Farzad Kamrani and Marianela Garcia, Swedish Defence Research Agency Examiner: Christian Forssén, Department of Physics Master’s Thesis 2022 Department of Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Illustration of the facial grid implemented to model the facial keypoints spatial relation in the thermal domain. Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2022 iv Detecting and Tracking Regions of Interest for Remote Measurement of Vital Pa- rameters Using Deep Learning Estimation and Tracking of Keypoints Using Object Detection in Visual and Ther- mal Footage Madeleine Müller Department of Physics Chalmers University of Technology Abstract The initial assessment of a mass casualty incident is essential to e�ectively conduct a rescue operation. The survival rate is a�ected by the complexity of the incident, and it is therefore imperative to enhance the operational capacities of emergency medical services and civil protection agencies in mass casualty incidents. This the- sis investigates the possibilities for an unmanned aerial vehicle (UAV) to detect and track regions of interest for remote measurement of vital parameters in visual and thermal footage for first response triage purposes. The regions of interest are the nose, mouth, and chest, and the UAV characteristic taken under consideration in this thesis is image blur due to random camera motion. In this thesis, we take an ob- ject detection approach and implement the keypoint estimation framework KAPAO and the tracking algorithm SORT in several di�erent experimental setups. Using KAPAO and SORT, we achieve a good result. For the detection in the thermal domain, the model created by transferring knowledge from the visual to the ther- mal domain achieves the highest performance. We also consider adversarial training on random motion blur, however the result shows a minimal impact on the model performance in the presence of characteristic low-altitude UAV motion blur. Re- garding the tracking of the regions of interest, the result concludes that the SORT algorithm improves the performance compared to assigning tracking identification numbers based on frame-to-frame di�erences. The result shows that the distance to the subjects and the image quality impacts the performance. Compared with previous work on remote measurement of vital parameters, the algorithms of this thesis achieve a nearly perfect score on corresponding distances. If the distances are realizable in a UAV triage application is however unknown and has to be in- vestigated further. Moreover, the work of this thesis problematizes the low-altitude UAV motion blur which poses a potential limitation in a potential UAV triage ap- plication. An alternative could hence be to use optical stabilization measurement for blur reduction. v Acknowledgements I would like to thank my supervisors at FOI, Farzad Kamrani and Marianela Garcia, for their participation, encouragement, and guidance. Furthermore, I would also like to thank my examiner at Chalmers Prof. Christian Forssén for his advice and support during the project. Finally, I would also like to thank the ones helping me review my work. Madeleine Müller, Stockholm, October 2022 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis project listed in alphabetical order: ANN Artificial Neural Networks CNN Convolutional Neural Network COCO Microsoft Common Objects in Context FOI Swedish Defence Research Agency IOU Intersection of Union KAPAO Keypoints And Poses As Objects MOTA Multiple Object Tracking Accuracy NMS Non-maximum Suppression OKS Object Keypoint Similarity PPG Photoplethysmography ROI Region of Interest rPPG Remote Photoplethysmography SORT Simple Online Realtime Tracking TFW Thermal Faces in the Wild UAV Unmanned Aerial Vehicle YOLO You Only Look Once ix Contents List of Acronyms ix 1 Introduction 1 1.1 Problem Definition and Purpose . . . . . . . . . . . . . . . . . . . . . 1 1.2 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Swedish Defence Research Agency . . . . . . . . . . . . . . . . . . . . 2 1.5 Report Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Background 5 2.1 Prehospital Triage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Remote Measurement of Vital Parameters . . . . . . . . . . . . . . . 5 2.2.1 Remote Measurement of Heart Rate . . . . . . . . . . . . . . 6 2.2.2 Remote Measurement of Body Temperature . . . . . . . . . . 6 2.2.3 Remote Measurement of Respiratory Rate . . . . . . . . . . . 6 2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 Theory 9 3.1 A Brief Introduction to Deep Learning . . . . . . . . . . . . . . . . . 9 3.1.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 9 3.1.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 11 3.1.2.1 Convolutional Layers . . . . . . . . . . . . . . . . . . 11 3.1.2.2 Pooling Layers . . . . . . . . . . . . . . . . . . . . . 11 3.2 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.1 You Only Look Once . . . . . . . . . . . . . . . . . . . . . . . 12 3.3 Keypoint Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.3.1 Multi-Person Keypoint Estimation . . . . . . . . . . . . . . . 15 3.3.2 Modeling Keypoints and Poses as Objects . . . . . . . . . . . 15 3.4 Thermal Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.5 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.6 Adversarial Perturbations . . . . . . . . . . . . . . . . . . . . . . . . 17 3.6.1 Motion Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.7 Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.7.1 Simple Online Realtime Tracking . . . . . . . . . . . . . . . . 19 3.8 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.8.1 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . 19 xi Contents 3.8.2 Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.8.3 Multiple Object Tracking Accuracy . . . . . . . . . . . . . . . 20 4 Method 23 4.1 Keypoint Estimation Datasets . . . . . . . . . . . . . . . . . . . . . . 23 4.1.1 The Common Objects in Context Dataset . . . . . . . . . . . 23 4.1.2 Thermal Faces in the Wild . . . . . . . . . . . . . . . . . . . . 25 4.2 Video Datasets for Tracking . . . . . . . . . . . . . . . . . . . . . . . 26 4.2.1 300 Videos in the Wild . . . . . . . . . . . . . . . . . . . . . . 26 4.2.2 RGBT234 Dataset . . . . . . . . . . . . . . . . . . . . . . . . 27 4.3 Motion Blur Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 28 4.4 Triangulation of the Forehead and Chest . . . . . . . . . . . . . . . . 29 4.5 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.6.1 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.6.2 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5 Results 35 5.1 Blur Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.2 Detection in the Thermal Domain . . . . . . . . . . . . . . . . . . . . 35 5.3 Tracking of ROIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6 Discussion 39 6.1 Detection in the Visual Domain . . . . . . . . . . . . . . . . . . . . . 39 6.2 Detection in the Thermal Domain . . . . . . . . . . . . . . . . . . . . 40 6.3 Dataset Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.4 Tracking of ROIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.4.1 Evaluation on the 300-VW Dataset . . . . . . . . . . . . . . . 41 6.4.2 Evaluation on the RGBT234 Dataset . . . . . . . . . . . . . . 42 6.5 Adaption to UAV Applications . . . . . . . . . . . . . . . . . . . . . 42 7 Conclusions 43 7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Bibliography 45 xii 1 Introduction A mass casualty incident is referred to as an event where the number of casualties exceeds the available medical resources. Mass injuries can, for example, be caused by transportation accidents, terrorism, fires, or natural disasters. The initial assessment of a mass casualty incident is essential to e�ectively conduct a rescue operation. As a first response, triage is performed by medical sta� at the scene. Triage is a medical procedure of evaluating and classifying injured based on their vital signs to prioritize patient care. This becomes a rather demanding task as the magnitude and complexity of the mass casualty incident increases, which can decrease the survival rate [1]. To address the di�culties faced in mass casualty incident operations the Swedish Defence Research Agency (FOI) is involved in a research project led by the Swedish Transport Administration. The project aims to develop an unmanned aerial vehicle (UAV) system integrated with artificial intelligence for search and rescue purposes, to simplify operations and increase the survival rate. The idea is to use a UAV equipped with sensors such as a standard color (RGB) camera and a thermographic camera to collect data from the scene of the incident. As a part of the assessment, the UAV needs to be able to measure vital parameters, and in order to measure a vital parameter, the UAV has to be able to detect and track the corresponding region of interest (ROI) where the measurement can be performed. 1.1 Problem Definition and Purpose This thesis project aims to investigate the possibilities for a UAV to detect and track ROIs for remote measurement of vital parameters. Specific ROIs are the forehead, nose, mouth, and chest, where body temperature, respiration rate, and pulse can be measured for triage [2]. There are state-of-the-art deep learning frameworks for the detection of human body features. However, they are not able to detect all the ROIs mentioned, nor are they robust enough to be implemented in a UAV application [3]. Therefore, this thesis aims to develop and evaluate a robust method for detecting and tracking the mentioned ROIs in visual and thermal footage in multi-person scenarios. Robust is here defined as the ability to handle perturbation caused by characteristic low-altitude UAV motion. As there to this point are no publicly available low- altitude visible or thermal UAV data for detection and tracking of mentioned ROIs, the UAV motion characteristics are to be simulated on still images. The task at hand is to be accomplished using deep learning combined with adversarial training. 1 1. Introduction 1.2 Research Question This thesis project aims to answer the research question: • To what extent is it possible to detect and track regions of interest for remote measurement of vital parameters in RGB and thermal footage and in the presence of characteristic low-altitude UAV motion blur? 1.3 Scope and Limitations • This project aims to detect and track regions of interest for the measurement of vital parameters in RGB and thermal footage. • The ROIs for this project are the forehead, mouth, nose, and chest. Due to a lack of annotations, the chest will only be evaluated for detection in the visual domain. • The project focuses on implementing deep learning models for keypoint esti- mation for detection and tracking of mentioned ROIs. • The UAV characteristic taken under consideration in this project is UAV mo- tion blur. For example, considering di�erent UAV pitch angles are outside the scope of this project. • The camera motion blur is to be simulated from still images and evaluation of the e�ectiveness with respect to real UAV motion blur is outside the scope of this project. • This project does not employ real data collected in a mass casualty incident. Real data collected in a mass casualty incident is considered sensitive as it con- tains medical information, and using such data would by Swedish law require an ethical analysis preformed by the Ethics Review Authority1. By excluding sensitive data, this thesis can focus on the technical aspects rather than the ethical aspects of data usage. • Field testing and performing measurements of vital parameters are outside the scope of this project. Such activities would be required before deploying the final system. Moreover, it would also require the approval of the Ethics Review Authority due to the ethical aspects concerning human test subjects. 1.4 Swedish Defence Research Agency This thesis is carried out in collaboration with FOI. FOI is a defence research insti- tute and a government authority operating under the Swedish Ministry of Defence. The agency conducts research within the area of safety and security of the soci- ety, including strategic decision-making and crisis management [5]. On behalf of 1The mission of the Swedish Ethics Review Authority is to protect the individuals and the human values within research, and any research carried out on human requires their approval. The Swedish Ethics Review Authority considers the ethical aspects concerning the project and only approves a project and whether the possible benefits of the results exceed the endangerment of the persons subjected for investigation [4]. 2 1. Introduction the Swedish Transport Administration and the EU project Nightingale, FOI is de- veloping a UAV triage application for mass casualty incidents with the ultimate goal of increasing the survival rate. This thesis project aims to contribute to FOI’s UAV triage application research. Furthermore, one should note that this is a civil project. 1.5 Report Overview This report is organized as follows: Chapter 2 provides background knowledge of triage and remote vital parameter measurement, as well as related works; Chapter 3 provides the theoretical foundation of the project; Chapter 4 presents the method- ology and the experiments conducted; Chapter 5 presents the results; Chapter 6 provides a discussion of the results and the research question along with some eth- ical aspects in Section 6.3, and lastly Chapter 7 which presents the conclusion and future works. 3 1. Introduction 4 2 Background This chapter is to provide a background for remote triage measurement. Section 2.1 describes the fundamentals of prehospital triage performed at the scene of the in- cident. Section 2.2 presents existing methods for remote measurement of the vital parameters presented in Section 2.1 along with corresponding regions of interest in this thesis. Finally, in Section 2.3 related work concerning the detection and tracking of ROIs for vital parameter measurement is presented. 2.1 Prehospital Triage Triage is the medical procedure of evaluating and classifying injured to prioritize patient care implemented when the demand exceeds the available resource. There are di�erent types of triage, and prehospital triage is referred to as the first response triage performed on the scene of the incident. Prehospital triage aims to assess the situation in order to e�ectively conduct a rescue operation. Indicators of which prehospital triage assessments are built upon are: • ambulatory, • clear airways, • respiratory rate, • radial/peripheral pulse, • and level of consciousness [6]. Ambulatory, the ability to walk, is a primary divider. Walking requires a su�cient central nervous system and blood pressure, and walking persons will thereby be down-prioritized. The second point of assessment is clear airways. If the person cannot breathe, it will get down-prioritized due to a low survival rate. If clear airways are the assessment of if the person is breathing, the respiratory rate is the assessment of how the person is breathing. The respiratory rate is an indicator of trauma to the airways and lungs, and an abnormal breathing pattern will get prioritized. The pulse parameter is used to estimate blood pressure to detect life-threatening internal and external bleeding. The last point is the level of consciousness which is assessed by whether the person can follow commands or not. Verbal and motor responses are indicators of neurological function [6]. 2.2 Remote Measurement of Vital Parameters This section presents di�erent methods for remote measurement of vital parameters of interest for triage assessment. The ROI in this thesis is limited to the ROI for 5 2. Background remote measurement of heart rate, body temperature, and respiration rate, which are presented below. 2.2.1 Remote Measurement of Heart Rate The commonly implemented method for remote measurement of heart rate and oxy- gen saturation is remote photoplethysmography (rPPG) [2]. Remote PPG operates on the same basis as traditional PPG, which illuminates the skin and measures the variation of light absorption. The light is absorbed by the blood in the capillaries, and the absorption is correlated to the dilation and constriction of the capillaries from which the heart rate can be measured [7]. The forehead is a region of interest for rPPG measurement due to its large vascularization and thin skin [8]. Therefore, the forehead constitutes an ROI in this project. 2.2.2 Remote Measurement of Body Temperature The forehead is also an ROI for remote measurement of body temperature as the forehead temperature is highly correlated to the internal body temperature due to the large vascularization and thin skin. The temperature at the forehead can therefore be used to detect hyperthermia and hypothermia. This correlation cannot be seen in other parts of the body; the temperature in limbs can be highly di�erent from the core body temperature [8]. 2.2.3 Remote Measurement of Respiratory Rate Di�erent types of sensor data can be used to measure the respiratory rate remotely. The respiratory rate can be estimated remotely by acoustic analysis, analysis of temperature fluctuation due to respiration, chest motion analysis, and rPPG. The acoustic-based methods utilize the breathing sound to determine the respira- tory rate. Such methods have proven e�ective but also sensitive to background noise [2]. The sensitivity poses a di�culty in unconstrained environments, which makes acoustic analysis unsuitable for UAV triage applications. The respiration rate can be estimated remotely by analyzing the temperature di�er- ence in inhaling and exhaling air in thermal footage [2]. Detecting and tracking of mouth and nose in thermal images are hence of interest in this thesis. Temperature fluctuation-based methods have proven to be sensitive to motion [2] which make the tracking aspect particularly relevant. Estimation of respiration rate can also be performed through respiration motion analysis. Chest motion analysis has proven to be a robust method for respiration rate estimation in multi-person scenarios [2]. Moreover, it can be performed on di�erent types of sensor data such as RGB, depth, and thermal footage. Detection of the chest is hence within the scope of this project. The chest area is a large ROI which poses an advantage. However, clothing can potentially obstruct the analysis of chest motion, which can become problematic in outdoor applications. Respiration rates can be observed in rPPG signals as respiratory arrhythmia af- fects the heart rate and blood volume signal. Hence, the respiration rate can be 6 2. Background modulated from the rPPG signal [9]. The fundamentals of rPPG are described in Section 2.2.1. 2.3 Related Work In the aftermath of the COVID-19 pandemic, the detection of ROIs for remote mea- surement of infection indicators has gained more attention. Previous research by Rodriguez-Lozano et al. [8] and Muller et al. [10] propose two methods for segmen- tation of the forehead and the nose region in thermal images. Rodriguez-Lozano et al. implement a trigonometric segmentation method where the forehead is seg- mented by fitting an ellipse onto the face and extracting the upper part of the ellipse [8]. In di�erence from Rodriguez-Lozano et al., Muller et al. implement a deep learning- based method for segmenting the forehead, nose, mouth, lower face, and eyebrows. Muller et al. implement a conditional generative adversarial network (cGAN), trained on a custom annotated database they have created. The dataset is, however, small and limited to controlled lab environments. In addition to segmentation, Muller et al. successfully extract the temperature levels at the di�erent ROIs [10]. Another approach for detecting and tracking ROIs is triangulation from existing keypoints. Djeldjli [11] presents a method for remote assessment of blood pressure and arterial sti�ness in RGB footage. The measurements are performed remotely at the forehead, triangulated from the eyebrows and the facial bounding box, using the facial keypoint estimator Dlib [12] and the OpenCV [13] face detector. Yang et al. [14] propose a method for remote measurement of blood pressure, heart rate, and respiration rate in RGB and thermal footage, including detection of the forehead and nostrils. Yang et al. implement RetrinaFace [15], a deep learning framework for detecting facial keypoints, from which the ROIs are triangulated based on di�erent facial feature distances. Although vital parameters are measured in visual and thermal domains, the ROIs are only detected in the RGB videos. The defections are transferred to corresponding thermal videos via an image alignment process made possible as the videos are recorded simultaneously. Compared with previous work, this thesis is not limited to controlled settings. In this thesis, we implement data collected under uncontrolled settings in outdoor environments that represent a potential UAV triage application. Moreover, we are considering more considerable distances to the subject than in previous work. 7 2. Background 8 3 Theory This chapter contains the theory for this thesis focusing on the computer vision ele- ments. Section 3.1 provides a brief introduction to deep learning and convolutional neural network. Section 3.3 the fundamentals of keypoint estimation are described along with a keypoint estimation framework of interest for this thesis project. As motion blur is to be considered in this thesis, Section 3.6 provides the theory re- garding adversarial perturbation and motion blur kernels. Furthermore, Section 3.7 is dedicated to object tracking and the tracking algorithm implemented in this the- sis. 3.1 A Brief Introduction to Deep Learning Deep learning algorithms are based upon artificial neural network architectures, and deep learning models are consequently often referred to as deep neural net- works. Deep neural networks are extensive models characterized by their complex- ity. Training a deep learning model requires a considerable amount of data and computational power [16]. This section describes some fundamental deep learning concepts of interest for this thesis. 3.1.1 Artificial Neural Networks Artificial neural networks (ANN), or simply neural networks, are computational models inspired by biological neural networks. An ANN consists of artificial neurons, or nodes, which are connected and arranged in layers. An illustration of a basic ANN is illustrated in Figure 3.1. The first and last layers of an ANN correspond to the input and output layers, while the intermediate layers are referred to as the hidden layers. Moreover, the depth of a neural network is given by the number of hidden layers [16]. Mathematically, a node is a weighted summation passed through an activation func- tion. Given the input values x1, ...xm, the corresponding weights w1, ...wm, and an activation function „(·) the output of the node is given by y = „ A mÿ i=1 xiwi + b B . The weights and the bias therm b added to the weighted sum are learnable param- eters determined by training. The weights control the strength of the connection of the nodes, i.e., how the inputs influence the output, and the bias terms regulate 9 3. Theory Hidden Layers Output LayerInput Layer Figure 3.1: An example of a feed-forward ANN. The circles correspond to nodes which are arranged in layers. the flexibility of the nodes by shifting the activation functions. Di�erent functions can be employed for activation. A commonly implemented activation function is the sigmoid function „(z) = 1 1 + e≠z , which returns a value between 0 and 1. The functionality of a node is further illustrated in Figure 3.2 where z corresponds to the weighed summation. Input Values Weights Node Activation Function Output Bias Therm Figure 3.2: Illustration of an artificial neuron. An ANN can be described as a non-linear transformation F which maps an input X œ Rm◊N on to an output Y œ Rn◊N ; F : Rm◊N æ Rn◊N where N denotes the number of samples while m and n correspond to the input and output dimensions respectively. The transformation is governed by the weights and the biases, ◊ = {Wl, bl}L+1 l=1 , of the trainable layers. There are L + 1 trainable layer corresponding to the the hidden layers L along with the output layer. The output is generated by 10 3. Theory propagating the input through the ANN and the output can therefore be described as Y = F (X, ◊) = „L+1(„L(...(„1(X)))) [17]. Training a neural network is the process of tuning the parameters ◊ to improve model performance. The objective is to minimize the di�erence between the actual and the predicted output described by the loss function L(F (X, ◊), Ŷ ). Accordingly, the training process is a minimization problem where the optimal parameter is given by ◊̂ = argmin ◊ L(F (X, ◊), Ŷ ). The training of a neural network is an iterative process where the parameters are learned through stochastic gradient descent optimization [17]. 3.1.2 Convolutional Neural Networks Convolutional neural network (CNN) is a frequently implemented architecture within computer vision. CNN neurons are arranged in three dimensions; the height and width corresponding to the spatial dimension and the depth equivalent to the depth of the input. The three-dimensional representation makes CNNs suitable for im- age processing. CNN architecture comprises three types of layers; the convolutional layers and pooling layer for feature learning, and the fully connected layers for clas- sification [18]. The fully connected layers are conventional ANN layers described above, while convolutional layers and pooling layers are described below. 3.1.2.1 Convolutional Layers Convolutional layers pose as filters for feature extraction, which consist of convolu- tional kernels corresponding to matrices of weights learned by training. The kernels are passed sequentially over the input for which the kernel is multiplied with each subregion, followed by a summation to obtain a feature map for activation. Hence, the size of the acquired feature map is determined by the size of the kernel, as well as the division of subregions. A convolutional operation is illustrated in Figure 3.3. A neuron in a convolutional layer is only connected to neurons in the preceding layer within reach of the convolutional kernel. Due to the local connectivity of the conditional layers, the initial layers of a CNN will capture local features while the latter ones will capture global features [17, 18]. 3.1.2.2 Pooling Layers A pooling layer can follow a convolutional layer to reduce the spatial dimension further. This is known as downsampling, which reduces the complexity of the model. There are two pooling operators: the max pooling operator, which extracts the maximum value from a specific region, and the average pooling operator, which extracts the average value. The dimension is reduced by dividing a feature map into subregions and by letting a pooling operation act on the regions. Hence, the reduction is determined by the size of the regions [17, 18]. The downsampling process is illustrated in Figure 3.4. Note that the pooling layer acts on each feature map independently to maintain the depth. 11 3. Theory Figure 3.3: Illustration of a convolutional operation. The filter acts on the input layer and the size of the output is determined by the dimension of the filter and how the filter is traversed. 3.2 Object Detection Object detection is the computer vision task of detecting and localizing objects within images and videos. Deep learning-based object detectors are commonly di- vided into one-stage and two-stage models. The two-stage detectors consist of a region proposal step for the prediction of object bounding boxes and a classification step for the classification of the predicted bounding boxes. In contrast to the two- stage detectors, the one-stage detectors perform region proposal and classification simultaneously [19]. The one-stage detectors are known for their inference speed, making them suitable for UAV applications, compared with the two-stage detectors, which demand greater computational resources [20]. The following section is ded- icated to the one-stage object detector You Only Look Once implemented in this thesis. 3.2.1 You Only Look Once You Only Look Once [21, 22, 23, 24, 25] (YOLO) is a state-of-the-art one-stage object detector known for its inference speed, making it suitable for real-time appli- cations. As a one-stage detector, YOLO performs object proposal and classification simultaneously by dividing the input images into an S ◊S grid. Within each cell, B object bounding boxes are predicted and assigned a confidence score reflecting upon the accuracy of the corresponding bounding box. This process yields a surplus of bounding boxes which are passed through a non-maximum suppression (NMS) filter to determine the final detections. The NMS removes bounding boxes with confi- dence scores below a given threshold and then suppresses the bounding boxes with a high intersection over union (IOU) with high confidence class-specified bounding boxes. The YOLO pipeline is illustrated in Figure 3.5. 12 3. Theory 0 4 4 7 3 9 3 2 8 5 6 9 4 3 1 8 9 7 8 9 4 4 5 6 Average Pooling Max Pooling Figure 3.4: Max pooling and average pooling operations. The max pooling oper- ator extracts the maximum value of the region, and the average pooling operation yields the average value of the region. The YOLO model was initially proposed by Redmon et al. in 2015, after which four subsequent versions have been released. The original version, YOLOv1 [21], consists of a CNN of 24 convolutional layers for feature extraction followed by two fully connected layers for the prediction of bounding boxes. The updated second version, YOLOv2 [23], released the year after, uses a Darknet-19 backbone for fea- ture extraction. Darknet-19 is a CNN architecture of 19 convolutional layers and five max pooling layers. Moreover, in the second version, the fully connected layers are removed and replaced with anchor boxes for the prediction of bounding boxes. The architectural updates show an increment in inference speed. In the third version, YOLOv3 [22], also by Redmon et al., the backbone is further up- dated to a Darknet-53 architecture (to be described in more detail in Section 4.5). Moreover, YOLOv3 employs a feature pyramid network (FPN) for feature fusion which concatenates the feature maps from di�erent layers of the backbone. Fur- thermore, in the third version, residual block, skip connections, and up-sampling are introduced. These changes show a significant improvement in accuracy and performance on small objects. The fourth version, YOLOv4 [24], was released by Bochkovskiy et al. in 2020. Com- pared to its successor, YOLOv4 relies on a CSPDarknet53 backbone, and the FPN is replaced with a spatial pyramid pooling (SPP) layer and path aggregation network (PANet) for feature aggregation at multiple scales. Bochkovskiy et al. introduce a “Bag of Freebies” and a “Bag of Specials” to further improve the performance. The bag of freebies is data augmentation techniques to expand the dataset arti- ficially. Data augmentation techniques employed are photometric, such as bright- ness, contrast, hue, and saturation distortion, and geometric distortions such as random scaling, cropping, and flipping. The most recent version as of March 2022, is the fifth version YOLOv5 which is an open-source project maintained by Ul- tralytics1. YOLOv5 is a PyTorch implementation of YOLOv4, making it more user-friendly [25]. 1https://ultralytics.com/yolov5 13 https://ultralytics.com/yolov5 3. Theory SxS grid on input image Bounding boxes+confidence Probability map Final detection Figure 3.5: Illustration over the YOLO pipeline. The input image is divided into an S ◊ S grid. For each cell, B object bounding boxes are predicted and assigned a confidence score along with the class probabilities. The predictions are then fused with NMS to obtain the final predictions. 3.3 Keypoint Estimation Vision-based keypoint estimation aims to detect and localize points of interest in visual input data such as images and videos. In particular, person keypoint detection aims to estimate human body landmarks. Conventional human body landmarks annotations are: eyes, ears, nose, shoulders, elbows, hands, hips, knees, and feet, which are visualized in Figure 3.6. To this point, CNN-based heatmap regression is the prevalent approach for keypoint estimation. In a keypoint heatmap, the grid constitutes the input image, and the pixels encode the probability of the corresponding pixel being a keypoint. Given that the model aims to determine N keypoints, there will be N keypoint heatmaps containing the spatial confidence distribution of the corresponding keypoint. CNN- based heatmap regression leverage CNN to regress the heatmaps onto the input images, and the final keypoint predictions are obtained by extracting the maximum indices of the heatmaps [26, 27]. The heatmap-based methods have become the dominating keypoint estimation ap- proach due to their performance. However, generating and processing the heatmaps are computationally costly. High-resolution heatmaps are required to achieve accu- rate results, and the accuracy/speed trade-o� is one of the major drawbacks of the heatmap-based approaches [26]. 14 3. Theory Figure 3.6: Human body keypoint annotations. 3.3.1 Multi-Person Keypoint Estimation Multi-person keypoint estimation can be categorized into top-down and bottom- up methods. The challenge in a multi-person setting is associating the detected keypoints with the corresponding persons, which becomes especially challenging when people occlude each other. Top-down approaches employ a person detector and a single-person keypoint estima- tor. The detector generates human detection bounding boxes, and the single-person keypoint estimator is then applied to each bounding box separately to create a multi-person keypoint estimation. Top-down approaches are thereby classified as two-stage processes. The bottom-up methods simultaneously detect all keypoints and assign them to the corresponding person [27]. Hence, bottom-up methods are considered to be one-stage approaches. The bottom-up approaches are generally faster; however, they are less accurate than their top-down counterparts [26, 27]. On the other hand, the top-down approaches are more sensitive to person occlusion as occlusion makes the person detection more likely to fail [27]. 3.3.2 Modeling Keypoints and Poses as Objects To address the heatmap drawbacks, McNally et al. introduced the heatmap-free multi-person keypoint detector KAPAO [26] (Keypoints And Poses As Objects). KAPAO models keypoint and their spatial relations as objects within a YOLOv5 framework. This heatmap-free development has significantly improved the accuracy and inference speed compared to heatmap-based state-of-the-art keypoint estimation frameworks [26]. KAPAO is an object detection-based keypoint estimator which treats keypoints and poses as objects. The keypoint objects are defined by a detection bounding box of equal width and height and its center coordinates. In contrast, the pose objects comprise of a traditional object bounding box along with its associated set 15 3. Theory of keypoint objects. The keypoint objects are dedicated to keypoints with strong local features such as nose and eyes. Unlike the keypoint objects, the pose objects contain spatial relation information. The global understanding of the pose object makes it suitable for estimation of keypoint lacking in local features such as hips and shoulders [26]. KAPAO employs a YOLO-based network for the detection of keypoint and pose objects. The network takes an input image and maps it onto output grids containing the predicted keypoints and pose objects. Rather than using a bottom-up approach, the detection of keypoints and pose objects occur simultaneously. Subsequently, the output grids are passed to NMS to eliminate redundant proposals by suppressing candidates with low confidence scores by comparing the overlaps. The NMS is applied on the keypoints and pose objects separately before being fused with a matching algorithm to obtain the final body pose prediction [26]. The pipeline is illustrated in Figure 3.7. NMS( ) NMS( ) Figure 3.7: An illustration of the KAPAO framework. The detection network N maps the input image I onto a set of output grids Ĝ containing both the pre- dicted keypoint Ô k and pose objects Ô p. The predictions are then separately filtered through an NMS before being fused by the matching algorithm –, which produces the final prediction P̂. 3.4 Thermal Imaging The visual and thermal domains have been taken under consideration in this project, and this section presents the basics of thermal imaging. The electromagnetic spec- trum is divided into subspectrums based on energy levels. The visible range includes the wavelengths visible to the naked eye, and the infrared (IR) region includes wave- lengths of 0.7–1000 µm [28]. Furthermore, the IR spectrum is divided into subcat- egories; near, short, mid, long, and far wavelength IR. According to Wien’s law, the human body emits a wavelength of 9.4 µm, which falls within the range of long wavelength infrared radiation (LWIR) of 8-15 µm [28]. Hence, LWIR sensors can be used for human detection. The electromagnetic field can be used to create a visual representation of an object. RGB, or true color imaging, refers to imaging within the visible spectra. Likewise, thermal imaging can be applied to visualize thermal energy, i.e., IR radiation. A thermal heat signature consists of a spectral and spatial intensity distribution from apparent temperature di�erences to the background [29]. In contrast with visual imaging, thermal imaging is less sensitive to illumination degradation and visual obstruction, which poses an advantage in poor weather condi- tions and at night. On the other hand, thermal imaging depends on the surrounding 16 3. Theory environment and background temperatures. Lighting conditions, geographic loca- tions, weather conditions, time of capture, and materials reflective properties are examples of factors that impact the surrounding environment transmission. More- over, one should be aware that atmospheric transmission caused by, e.g., humidity and aerosols also a�ects the thermal image resolution [29]. 3.5 Transfer Learning Transfer learning within machine learning aims to transfer knowledge gained from solving one problem to solving another, di�erent but related problem. Transfer learning can be used for generalization across domains and transferring knowledge from a data-rich to a data-poor domain. In the context of this thesis, transfer learning can be employed for transferring knowledge from the visual to the thermal domain [30]. Rather than initializing a model with random weights, knowledge can be transferred by implementing the weights of a pre-trained model. Knowledge can be transferred by freezing layers and by fine-tuning. Freezing methods employs the pre-trained model without modifications. The weights of the pre-trained model are said to be frozen, meaning that they will not be adjusted during training. The model is adapted by adding additional layers to be modified by further training on the target data. Fine-tuning, rather than freezing the weights, adjusts the pre-trained models by further training the model on the target data [30]. 3.6 Adversarial Perturbations Computer vision models are known to be sensitive to adversarial attacks, i.e., mis- classification due to perturbations in input data. This is a well-documented issue that poses a threat in real-world implementations [3, 31, 32, 33]. An adversarial perturbation is a noise that causes misclassification when added to an image. It can be quasi-imperceptible, i.e., invisible to a human eye, and it can be universal and cause misclassification when added to any arbitrary image [32]. Motion blur is an example of an adversarial that can cause misclassification. There are two common defense strategies for dealing with adversarial perturbations: 1) model alternation and 2) image prepossessing [32, 34, 35]. Adversarial training is a commonly implemented defense strategy for increasing model robustness by including adversarial samples in the training dataset. Image prepossessing, on the other hand, aims to remove the adversarial perturbation rather than alternating the model [35]. 3.6.1 Motion Blur State-of-the-art networks are typically trained and evaluated on large high-quality artifact-free datasets. Using perfect data for training causes a decrease in perfor- mance in the presence of quality distortion such as motion blur [33, 36]. Possible 17 3. Theory solutions for handling motion blur are adversarial training on motion blur and de- blurring, which both require knowledge about the blur kernel. An image subjected to motion blur can mathematically be defined as y = k ú x + n, (3.1) where x corresponds to the equivalent sharp image, k the blur kernel, n some additive noise, and ú is the convolution operator [37, 38]. As the equation implies, if the kernel is known, it is possible to extract the sharp images from their blurry counterpart. However, determining the true kernel is not a trivial task. By the definition in Equation 3.1, motion blur can be simulated by implementing synthetic motion blur kernels. A selection of di�erent synthetic blur kernels is pre- sented in Figure 3.8. The disk kernel in Figure 3.8 (a) can be used to simulate defocus blur, and the oriented box kernel in Figure 3.8 (b) can be applied to sim- ulate linear motion blur [33]. The kernel in Figure 3.8 (c) is a random kernel that can be used for simulating random motion [33, 38, 39, 40]. In order to capture the blur kernel for a UAV, one has to take the complex motion pattern of a UAV into account, including irregular motion caused by hovering, wind, or turbulence. (a) Defocus. (b) Linear motion. (c) Random motion. Figure 3.8: Di�erent synthetic blur kernels. Vasiljevic et al. [33] investigate the image blur impact on state-of-the-art detection networks. They implement the types of synthetic blur kernels presented in Figure 3.8 and observe a decrease in performance. Moreover, they are able to improve the performance by fine-tuning and adversarial training. Vasiljevic et al. observe that similar static distribution tends to generalize among di�erent kernels, i.e., fine-tuning on defocus blur increases the model performance on camera shake blur and vice versa. It is a significant finding, however, in a more recent study by Sayed et al. [31] conclude the opposite. 3.7 Object Tracking Object tracking within computer vision refers to the task of tracking objects across consecutive video frames. Tracking-by-detection is a common approach to multiple object tracking relying on an object detector. A joint tracking and detection model 18 3. Theory employs an object detection on the video sequence frames for detection and a track- ing algorithm for data association across the frames to obtain the trajectories of the objects. Keypoint tracking and object tracking share the same objectives, and as object trackers, keypoint trackers typically take a two-stage approach [41]. To the best of this author’s knowledge, there is no de-facto standard approach for keypoint tracking. Since we are taking an object-based approach to keypoint detection in this thesis, we will do the same for keypoint tracking. 3.7.1 Simple Online Realtime Tracking Simple Online Realtime Tracking (SORT) [42] is a multiple object tracking frame- work introduced by Bewley et al. in 2016. SORT relies on an object detector, a state estimation model, and a data association algorithm. The estimation model is a Kalman filter that estimates the next position of an object by extrapolating the motion of the object. SORT implements the Hungarian optimization algorithm [43] to associate objects across frames. The Hungarian algorithm operates on an assign- ment cost matrix of the IOU distances among the detected objects in frame t and previously detected objects in frame t ≠ 1. 3.8 Evaluation Metrics This section presents di�erent metrics for evaluation used in this thesis. The evalua- tion metrics are based upon the binary prediction results presented in the confusion matrix in Figure 3.9. True Positive (TP) False Positive (FP) False Negative (FN) True Negative (TN) Po si tiv e N eg at iv e Positive Negative Ground True Pr ed ic te d Figure 3.9: The confusion matrix for the definition of the classification metrics. The columns correspond to the true labels and the rows the predicted labels. TP corresponds to a hit (a correct classified positive), TN to a correct rejection (a correct classified negative), while FP is a false alarm (incorrectly positive-classified), and FN a miss (incorrectly negative-classified). 3.8.1 Precision and Recall Precision and Recall are key metrics for the evaluation of binary classification mod- els. The Precision score (P) is defined as the fraction of true positives and actual 19 3. Theory positives: P = TP TP + FP . (3.2) The Recall score (R) corresponds to the true positive rate: R = TP TP + FN . (3.3) According to Equations 3.2 and 3.3, a high Precision score corresponds to a low FP rate, and a high Recall score corresponds to a low FN rate. A model should preferably achieve both. The trade-o� between the Precision and Recall is usually visualized with a precision-recall curve from which an optimal threshold can be determined. 3.8.2 Average Precision The Average Precision (AP) is an accuracy measurement commonly used for bench- marking keypoint estimators. The AP corresponds to the area under the precision- recall curve, which can be calculated by summarizing the weighted mean of the Precision and Recall scores at each threshold n, AP = ÿ n (Rn ≠ Rn≠1)Pn. A high-performance model should ideally achieve a high AP score. There are di�er- ent adaptations of the AP score. The AP score is calculated separately for each class, i.e., the classes are averaged independently. Another variant is the mean Average Precision (mAP) which corresponds to the average AP over all classes [27]. 3.8.3 Multiple Object Tracking Accuracy Multiple Object Tracking Accuracy (MOTA) is an evaluation metric for multiple object tracking algorithms. The MOTA metric incorporates three error rates; the ratio of misses, the ratio of false alarms, and the ratio of ID mismatches [44]. An ID mismatch is equivalent to an ID switch and is denoted as IDSW. The MOTA is defined as MOTA = 1 ≠ q t FNt + FPt + IDSWtq t GTt (3.4) where error rates and the number of true objects GT are calculated per frame t [45]. A high-performance tracker should obtain a high MOTA score. 20 3. Theory t t+2t+1 t+3 t+4 t+5 t+6 Mismatch False Positive Miss Figure 3.10: Illustration of the di�erent MOTA labels. The di�erent shapes rep- resent objects captured over seven frames where ‡i are the corresponding ID. The False Positive corresponds to a detection where there is no object, the mismatch oc- curs when the ID switches object, and the miss is when an object goes undetected. 21 3. Theory 22 4 Method This chapter presents the methodology behind experiments conducted to answer the research question “To what extent is it possible to detect and track regions of interest for remote measurement of vital parameters in RGB and thermal footage and in the presence of characteristic low-altitude UAV motion blur?”. The experiments consist of six use cases described in Section 4.6. Due to the lack of public available keypoint annotated data collected by UAVs, synthetic motion blur have been applied to still images to simulate UAV motion characteristics focusing on the irregular motion pattern of a UAV. The datasets implemented for keypoint estimation are presented in Section 4.1 while the blur augmentation is described in Section 4.3. 4.1 Keypoint Estimation Datasets This section presents the datasets used for the keypoint detection task and the alter- ations made for customizing the mouth keypoint. The datasets are summarized in Table 4.1. The forehead and chest are not included in the standard keypoint anno- tations and have been triangulated from the existing annotations. The triangulation process is further described in Section 4.4. Table 4.1: Technical information about the datasets used for the detection task. Keypoints refers to the number of annotated keypoint labels. Dataset Modality Keypoints (Total/Facial) Number of Images Train Val Test COCO [46] RGB 17/5 118 287 5000 40 671 COCO-WholeBody [47] RGB 113/68 118 287 5000 40 671 TFW outdoor [48] Thermal 5/5 5916 664 1600 TFW indoor [48] RGB+Thermal 5/5 7200 864 2160 4.1.1 The Common Objects in Context Dataset The original KAPAO model [26] is trained on the Microsoft Common Objects in Context (COCO) dataset [46]; an established large-scale object recognition dataset of everyday objects in the wild. COCO is an RGB dataset annotated with 17 human keypoints. A selection of images from the COCO dataset is presented in Figure 4.1, while the COCO keypoints are shown in Figure 3.6. COCO is annotated with five 23 4. Method facial keypoints: eyes, ears, and nose. There is an extension of the COCO dataset, which is the COCO-WholeBody dataset [47]. The data remains the same, but COCO-WholeBody is annotated with an additional 133 human keypoints, of which 68 are facial keypoints. The additional facial landmarks of COCO-WholeBody are presented in Figure 4.2. Figure 4.1: Sample images from the COCO dataset. In this thesis, a custom dataset has been created by adding a mouth keypoint to the original COCO dataset. This mouth keypoint has been derived from the COCO- WholeBody dataset. The COCO keypoints are annotated in standard COCO for- mat, where each keypoint is given by a pixel location, x- and y-coordinate, along with a visibility criterion. The COCO keypoint visibility flags are; v = 0: not labeled (in which case x = y = 0), v = 1: labeled but occluded, v = 2: labeled and visible [49]. COCO-WholeBody incorporates other visibility flags representing a reliability crite- rion that can either be True or False. Compared to COCO, COCO-WholeBody does not di�er between occluded and non-labeled scenarios, and keypoints with visibility v > 0 are considered reliable [47]. In this thesis, the additional mouth keypoint has been derived by averaging the COCO-WholeBody mouth keypoints with a non-zero visibility criterion. The extracted mouth keypoints have been given visibility v = 2 to match the standard COCO format. In addition to the keypoint labels, COCO includes a skeleton label to relate/connect the keypoint and to create a spatial understanding. The original COCO keypoint skeleton is visualized in Figure 4.3. One should note that this skeleton does not directly translate into the KAPAO pose object. In this thesis, the additional mouth keypoint has been connected directly to the nose to create the custom skeleton. 24 4. Method Figure 4.2: The additional 68 COCO-WholeBody facial keypoints. Figure 4.3: The standard COCO skeleton. 4.1.2 Thermal Faces in the Wild Thermal Faces in the Wild (TFW) [48] published in 2022 is a thermal dataset an- notated with facial keypoint. TFW contains data collected in controlled indoor, semi-controlled indoor, and uncontrolled outdoor settings. The outdoor data are multi-person scenarios collected in di�erent environments under unconstrained set- tings and are hence the most representative of a potential UAV triage application. Two images from the TFW dataset collected under uncontrolled outdoor settings are presented in Figure 4.4 for exemplification. The TFW dataset has been manu- ally annotated with a facial bounding box and five facial keypoints; eyes, nose, and the outer corners of the mouth. Ultimately, these facial keypoints have created the underlying constraints for the triangulation of the forehead in this project. The TFW dataset is annotated with two mouth keypoints. In this thesis, these have been averaged to obtain a single mouth keypoint annotation. Moreover, all 25 4. Method Figure 4.4: Samples from the TFW dataset. the annotations have been considered visible as nothing else is stated and have been assigned the visibility flag v = 2 according to the standard COCO keypoint formation. 4.2 Video Datasets for Tracking This section presents the datasets used to evaluate the tracking algorithm: the 300 Videos in the Wild [50, 51, 52] dataset and the RGBT234 [53] dataset. The datasets are summarized in Table 4.2. To this point, there is no public available in-the-wild RGB video dataset annotated with all keypoint of interest in this thesis. Therefore, tracking of the chest has been left for future work. There is also a lack of publicly available thermal keypoint annotated video dataset; hence, a multimodal dataset has been considered. Table 4.2: Summary of the datasets used for evaluation of the tracking algorithms. Dataset Modality Number of Sequences K ey po in t A nn ot at ed M ov in g C am er a O cc ul sio n 300-VW [50, 51, 52] RGB 114 � � RGBT234 [53] RGB+Thermal 234 � � 4.2.1 300 Videos in the Wild The iBug 300 Videos in the Wild (300-VW) dataset [50, 51, 52] has been used to evaluate the tracking of ROIs in the visual domain. 300-VW is an RGB video dataset for facial landmark tracking annotated with the 68 facial keypoint presented in Figure 4.2. The dataset consists of video sequences acquired under uncontrolled 26 4. Method settings and includes various poses, facial expressions, illumination settings, and occlusion. Moreover, the data is divided into three categories based on di�culty. The categories are well-lit, mild occlusion, and challenging and the videos of category challenging have been used for evaluation in this thesis. Figure 4.5: A sample frame from the iBug 300-VW dataset. 4.2.2 RGBT234 Dataset The RGBT234 dataset [53] has been used to evaluate the tracking of ROIs in the thermal domain. RGBT234 is a bimodal multi-person video dataset of aligned RGB and thermal video pairs collected under uncontrolled conditions. A frame pair from the dataset can be seen in Figure 4.6. The dataset consists of a wide range of videos in terms of settings, and the most representative sequence have been selected for the task at hand. However, the RGBT234 dataset is not keypoint annotated and has hence been annotated using the bimodality in this project. The thermal video has been annotated by applying the RGB keypoint estimator to the corresponding RGB video. This approach is inspired by Chen et al. [54]. Figure 4.6: An RGB and thermal frame pair from the RGBT234 dataset. 27 4. Method 4.3 Motion Blur Augmentation As a pre-processing step, the data have been augmented with synthetic blur to sim- ulate random motion blur. This section presents the process of generating random motion blur kernels implemented in this project. The approach follows previous work by Boracchi and Foi [40]. The blur kernels have been created using a Markov process followed by sub-pixel linear interpolation. A Markov process is a stochastic process where the next state is determined based on only the current state of the system. The motion trajecto- ries have been continuously sampled on a 2-dimensional grid, where the following position is determined based on the velocity and previous position. The algorithm is described in further detail in Algorithm 1. Three perturbations govern the process: a Gaussian, an impulsive, and an inertial perturbation. The Gaussian perturbation corresponds to a smaller Gaussian deviation, while the inertial term is a larger de- viation. In a UAV application, the Gaussian term could be caused by hovering and the inertial perturbation due to wind gusts. The impulsive perturbation is a counter term that counteracts the other perturbations [38, 40]. If the perturbations equal zero, then the motion will be linear. Algorithm 1 Random trajectory generator Parameters: M -number of iterations, Lmax-max length of trajectory, I-inertia, ps-probability of an impulsive perturbation, pb-probability of an inertial perturbation, pg-probability of a Gaussian perturbation, „-initial angle, x-the trajectory vector. 1: v0 Ω cos „ + i sin „ 2: v Ω v0 · Lmax/(M ≠ 1) 3: for t=1 to M-1 do 4: if randn < pb · ps then Û randn≥ N (µ, ‡ 2) 5: nextDirection Ω 2v · e i(fi(randn≠0.5)) 6: else 7: nextDirection Ω 0 8: end if 9: dv Ω nextDirection+ps(pg(randn+irandn) · I · x[t] · Lmax/(M ≠ 1)) 10: v Ω v + dv 11: v Ω (v/|v|) · Lmax/(M ≠ 1) 12: x[t + 1] Ω x[t] + v 13: end for The obtained motion trajectories have been converted into point spread functions (PSFs) by sub-pixel linear interpolation. Sub-pixel linear interpolation is a sampling method for transforming from sub-pixel resolution to pixel resolution by linear inter- 28 4. Method polation. The PSFs have been generated by sampling the trajectories on a pixel grid and performing linear interpolation along each axis. A selection of obtained PSFs are presented in Figure 4.7 for exemplification. The data was blurred by convolving the PSFs onto the images using openCV [13] and the blurred output created with the PSFs of Figure 4.7 are presented in Figure 4.8. (a) Kernel A. (b) Kernel B. (d) Kernel D.(c) Kernel C. Figure 4.7: Di�erent random motion blur kernels with di�erent standard devia- tions. 4.4 Triangulation of the Forehead and Chest The additional ROIs of this thesis that are not annotated have been triangulated from the existing keypoint annotations as a post-processing step. The forehead has been triangulated from the eyes and nose for all use cases. Given an x-axis along eye level and a y-axis intersecting the nose at (0, -dnose), the forehead keypoint has been located at coordinate (0, 0.5dnose). The chest keypoint has been triangulated from the hips and the shoulders. If the shoulders and the hips constitute a square, then the chest has been defined in the center of the upper half of the square. Triangulation has the drawback of breaking in the absence of a keypoints the triangulation relies upon. Hence, one should be aware that triangulated ROIs are more sensitive to occlusion than their independent counterparts. Moreover, one should note that the distances for triangulation are not chosen on specific scientific grounds for vital parameter measurement in this project. 4.5 Network Architecture This section presents the architecture and training procedures of the implemented keypoint estimator. The small KAPAO version has been implemented in this the- sis project which relies on a CSPDarkNet53 backbone for feature extraction. The DarkNet53 is a CNN architecture of 53 layers comprising 3 ◊ 3 and 1 ◊ 1 kernels. To enhance the learning capabilities of the CNN, CSPDarkNet53 employs a Cross Stage Partial Network (CSPNet) strategy in which the gradient flow is divided and propagated through the networks in di�erent paths. It is achieved by dividing the feature map of the base layer into two and by fusing them through a cross-stage hierarchy [55]. The architecture is further described in Table 4.3. 29 4. Method (a) Convolved with kernel A. (b) Convolved with kernel B. (c) Convolved with kernel C. (d) Convolved with kernel D. Figure 4.8: The blurred result obtained when applying the blur kernels in Figure 4.7. Table 4.3: CSPDarknet53 architecture, the backbone to YOLOv5s version 6.0. Type of Layer Filter Size Repetitions Conv 64 3 ◊ 3 1◊ Conv 128 3 ◊ 3 1◊ C3 128 1 ◊ 1 3◊ Conv 256 3 ◊ 3 1◊ C3 256 1 ◊ 1 6◊ Conv 512 3 ◊ 3 1◊ C3 512 1 ◊ 1 9◊ Conv 1024 3 ◊ 3 1◊ C3 1024 1 ◊ 1 3◊ SPPF 1024 5 ◊ 5 1◊ Following the approach of McNally et al. [26], the use cases not relying on fine-tuning have been initialized with the weights of YOLOv5s to decrease the training time. According to the default setting of YOLOv5s, also employed by McNally et al., a stochastic gradient descent (SGD) optimizer and an initial learning rate of 0.01 have been used for training. For the use cases not relying on fine-tuning, the training has been performed during 250 epochs using batch size 32. The model fine-tuned on thermal data has been initialized with the weights of the RGB model, and the 30 4. Method learning rate decreased to 0.001. Moreover, the number of epochs has been set to 501. The data has been augmented during the training process to expand the dataset artificially. Once more, we have followed YOLOv5s and McNally et al. [26] and used the augmentation techniques and corresponding probabilities in Table 4.4. The HSV techniques alter the color and mosaic creates new images by combining multiple images. Table 4.4: Augmentation techniques and corresponding probabilities were imple- mented for the expansion of the datasets. Method Probability HSV-Hue 0.015 HSV-Saturation 0.7 HSV-Value 0.4 Translate 0.1 Scale 0.9 Flip left-right 0.5 Mosaic 1.0 4.6 Experimental Setup This section describes the experiments conducted to answer the research question. The experiments conducted for the detection and tracking tasks are described sep- arately below. 4.6.1 Detection The di�erent scenarios for keypoint detection that have been taken under consider- ation are presented in Table 4.5. An RGB and thermal baseline have been created for reference purposes, trained on the modified COCO dataset and TFW outdoor data respectively. As the impact of characteristic UVA motion blur is of interest in this project, an RGB model adapted to motion blur have been created. This has been achieved by adversarial training on a blurred version of the COCO dataset, where 1/3 of the training data is perturbed with synthetic motion blur as described in Section 4.3. The impact of the blur has been evaluated by comparing the RGB baseline and the adopted model’s performance in the presence of synthetic blur. Three additional use cases have been considered to further investigate the properties of the thermal domain and KAPAO. As the TFW outdoor data is a relatively small dataset, an expanded model has been introduced trained on the combined TFW outdoor and indoor data. The temperature ranges of an outdoor versus indoor 1The readers unfamiliar with the fundamental concepts of training neural networks are referred to [56] for further reading. 31 4. Method Table 4.5: The use cases for which the experiments have been conducted. Use Case Training Data Evaluation Data RGB baseline COCO COCO COCO blurred TFW outdoor RGB blurred COCO blurred COCO blurred Thermal baseline TFW outdoor TFW outdoor RGB fine-tuned on thermal COCO+TFW outdoor TFW outdoor Thermal expanded TFW outdoor+TFW indoor TFW outdoor Thermal with modified pose object TFW outdoor TFW outdoor environment can largely di�erentiate, and it is not trivial that combining them will increase performance. Another approach taken under consideration is fine-tuning from the visual to the thermal domain to compensate for the limited amount of available thermal keypoint annotated data. The last model adopts an alternative pose object. Rather than implementing the modified COCO skeleton, the face grid illustrated in Figure 4.9 have been adopted. The aim of redefining the pose object is to better understand the modeling of local and global features in the thermal domain. Like the thermal baseline, all the additional thermal models have been evaluated on the TFW outdoor dataset for comparison. Figure 4.9: Illustration of the facial grid used as a pose object. The image is taken from the TFW indoor dataset. 4.6.2 Tracking For the tracking of ROIs in the visual and thermal domains, tracking algorithm SORT has been implemented according to the theory of Section 3.7. The best- performing keypoint estimators of the respective domain have been used for detec- tion of ROIs. In addition to the SORT algorithm, a naive tracking algorithm where the tracking IDs are assigned based on frame-to-frame di�erences has been employed for reference purpose. The experiments conducted referring to tracking are listed in Table 4.6. 32 4. Method Table 4.6: The use cases for the tracking task. Use Case Keypoint Estimator Evaluation Data Naive algorithm RGB 300-VW challenging RGB RGBT234 video sequence Thermal RGBT234 video sequence SORT RGB 300-VW challenging RGB RGBT234 video sequence Thermal RGBT234 video sequence As previously mentioned, the thermal video dataset is not keypoint annotated. Hence, the chosen RGBT-234 video has been annotated with the RGB baseline. The RGB keypoint estimator has been applied to the RGB modality and trans- ferred onto the thermal modality to evaluate the performance. 4.7 Evaluation The metrics described in Section 3.8 have been implemented according to the stan- dard COCO metrics for keypoint detection. The COCO evaluation metric for key- point detection relies on an object keypoint similarity (OKS) measurement defined as, OKS = ÿ i exp C ≠d 2 i 2s2Ÿ2 i D ” (vi > 0) M ÿ i ” (vi > 0) . (4.1) The OKS represents the average keypoint similarity across the keypoint labels where di corresponds to the Euclidean distance between a detected keypoint and corre- sponding ground-true annotation, and vi the annotated keypoints visibility. Note that the visibility of the detected keypoint is not taken under consideration. The sŸi corresponds to the keypoint standard deviation where s denotes the object scale, and Ÿi is a keypoint constant. The object scale is defined as the square root of the segmented object area and the keypoint constant Ÿi = 2‡i where ‡i corresponds to the keypoint standard deviation for the object scale. By the COCO standard ‡i for human keypoint detection are given by [0.026, 0.025, 0.035, 0.079, 0.072, 0.062, 0.107, 0.087, 0.089] for the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles, respectively. These values have been derived from the COCO evaluation dataset [49]. In this project, the additional mouth keypoint has been assigned the same ‡i as the nose. The primary evaluation metric for keypoint detection is mAP at OKS = 0.50 : 0.05 : 0.95 and OKS = 0.50. An OKS threshold determines the AP for keypoint evaluation. Keypoints with an OKS exceeding the threshold are considered TP, and vice versa for FP, from which the AP can be determined. According to the standard COCO metrics for keypoint detection the mAP at OKS=0.50:0.05:0.95 corresponds the mean mAP over OKS=0.50, 0.55, 0.60, ..., 0.95 [49]. From here on, mAP at OKS=0.50:0.05:0.95 will be denoted as mAP and mAP at OKS=0.50 as mAP.50. 33 4. Method In addition to the COCO standard metric for keypoint detection, the PoseTrack evaluation metric has been considered. PoseTrack is a benchmark for human pose estimation and tracking comprising of three tasks, 1) single-frame pose estimation, 2) pose estimation in videos, and 3) pose tracking in the wild. A geodesic point similarity (GSP) has been adopted for the estimation in the thermal domain. The GSP mimics the COCO OKS in Equation 4.1, but rather than relying on COCO- specific normalization instances, a mean geodesic distance is used as a normalization factor [57]. The ground truth eye-eye distance is commonly used as a normalization constant in facial landmark detectors [58]. However, the ground truth eye-nose distance has been used for normalization in this project as it is less sensitive to occlusion and di�erent poses. For tracking, PoseTrack employs the MOT metric described in Section 3.8. The PoseTrack evaluation server has been implemented to evaluate the tracking of key- point. The server reports the MOTA score for each keypoint label and as an average over all keypoint labels [57]. 34 5 Results This chapter presents the results of the conducted experiments presented in Chap- ter 4. The results referring to detection are presented in Sections 5.1 and 5.2, while the tracking results are shown in Section 5.3. 5.1 Blur Impact This section presents the impact of the synthetic motion blur and the adversarial training in the visual domain. The performance of the RGB models, previously de- scribed in Section 4.6, on di�erent blurred versions of the COCO evaluation dataset is presented in Table 5.1. Note that the blur percentage reports the percentage of evaluation images augmented with synthetic blur. The baseline model achieved a mAP=0.619 and mAP.50=0.8 on the original COCO evaluation dataset. Table 5.1: The performances of the RGB baseline and the model trained on syn- thetic blur on the di�erent blurred versions of the COCO evaluation dataset. The Blur Percentage reports the percentage of images augmented with artificial blur. The scores have been calculated according to the COCO standard metric for key- point evaluation, where the additional mouth keypoint has been assigned the same keypoint standard deviation as the nose. Use Case Blur Percentage mAP@[OKS=0.50:0.05:0.95] mAP@[OKS=0.50] RGB baseline 0 0.62 0.80 33 0.53 0.77 100 0.34 0.62 RGB blurred 33 0.53 0.79 100 0.36 0.67 5.2 Detection in the Thermal Domain This section is dedicated to the results regarding detection of the ROIs in the ther- mal domain. The models’ performance on the thermal TFW outdoor dataset is presented in Table 5.2. As shown in Table 5.2, the model fine-tuned from visual to thermal domain yields the highest scores while the RGB baseline achieved the lowest mAP. However, in terms of mAP.50 , the RGB baseline obtained a significantly high 35 5. Results score compared with the acquired mAP. One should note that the reported scores in Table 5.2 have been determined using the geodesic normalization described in Section 4.7. Table 5.2: The detection scores for the di�erent models were determined on the TFW outdoor evaluation dataset. The scores have been calculated with the eye-nose distance as a normalization factor. User Case mAP@[OKS=0.50:0.05:0.95] mAP@[OKS=0.50] RGB baseline 0.21 0.73 Thermal baseline 0.52 0.54 RGB fine tuned on thermal 0.86 0.89 Thermal expanded 0.64 0.66 Thermal with modified pose object 0.73 0.77 5.3 Tracking of ROIs The results obtained by the naive algorithm employed for reference and the SORT algorithm are presented in Tables 5.3 and 5.4. The tracking algorithms have been evaluated according to the PoseTrack standard on the subset of the 300-VW dataset labeled challenging, and a selected sequence from the RGBT234. In addition to the ROIs for vital parameter measurement, the MOTA scores for the eyes used for trian- gulation of the forehead have been reported. According to equation 3.4, the MOTA score ranges from negative infinity to one. A high performing tracking algorithm should achieve a MOTA score greater than zero implying that the number of true observations exceeds the total number of misses, false alarms, and ID mismatches. One can observe that the algorithms achieve a nearly perfect score on the 300-VW dataset, compared to the scores on the RGBT234 dataset, which are significantly lower. Moreover, the result shows that the implementation of the SORT algorithm increased the performance. Table 5.3: The obtained MOTA scores for the naive tracking algorithm. The MOTA is reported per keypoint, and the Total corresponds to the average score over the keypoint labels. Use Case Dataset Nose Left Eye Right Eye Mouth Forehead Total RGB baseline 300-VW 0.981 0.985 0.990 0.976 0.972 0.976 RGB baseline RGBT234 -0.956 -0.955 -0.955 -0.957 -1.000 -0.972 Fine-tuned on TFW RGBT234 -0.674 -0.674 -0.674 -0.694 -0.998 -0.743 36 5. Results Table 5.4: The obtained MOTA scores for the SORT algorithm. The MOTA is reported per keypoint, and the Total corresponds to the average score over the keypoint labels. Use Case Dataset Nose Left Eye Right Eye Mouth Forehead Total RGB baseline 300-VW 0.989 0.991 0.991 0.988 0.989 0.989 RGB baseline RGBT234 -0.837 -0.838 -0.839 -0.885 -0.893 -0.858 Fine-tuned on TFW RGBT234 -0.441 -0.434 -0.444 -0.501 -0.550 -0.474 37 5. Results 38 6 Discussion In this chapter, the results in Chapter 5 are discussed in relation to the research question. In addition, the adaption of this thesis for a potential UAV triage appli- cation is discussed as well as the ethical concern focusing on dataset biases. 6.1 Detection in the Visual Domain The RGB baseline can be compared with the state-of-the-art KAPAO model to gain some understanding of the performance. The RGB baseline obtains a mAP=0.619 and mAP.50=0.800 which is similar accuracy to the original small KAPAO model which achieves a mAP=0.638 and mAP.50=0.884. Training of RGB baseline is computationally demanding and requires a considerable amount of time. Hence the number of training epochs has not been further investigated in this project, and one could probably achieve higher AP scores by prolonging the training. Moreover, only the small KAPAO version has been the object of investigation in this project, as inference speed has been prioritized. One should note that implementing the larger version would likely increase the accuracy as the original large KAPAO model achieves a mAP=0.703. As expected, the results in Table 5.1 demonstrate a decrease in performance in the presence of motion blur. The result shows that adversarial training can improve performance, although the di�erence is marginal compared to the result by Vasil- jevic et al. [33]. Vasiljevic et al. regain moste of the lost accuracy by fine-tuning their pre-trained model on synthetic blur. One should note that there are two sig- nificant di�erences compared to the work of Vasiljevic et al.: 1) they are considering conventional objects rather than keypoint objects, and 2) the implementation of blur kernels. Firstly, keypoint objects are smaller than conventional objects, and small objects are more sensitive to blur [59]. Hence the object size could explain the results. Secondly, Vasiljevic et al. generate 100 di�erent blur kernels, which are applied randomly rather than applying truly randomized blur kernels as in this project. This can explain why generalization among kernels cannot be observed to the same extent. As adversarial training appears to not be very e�ective in this case, it might be interesting to investigate possible image prepossessing alternatives for deblurring. In a UAV application, that could be optical stabilizing measures. Another possible option could be to remove blur artifacts by deconvolution of the kernels with a deblurring generative adversarial network (GAN). 39 6. Discussion 6.2 Detection in the Thermal Domain The results in Table 5.2 conclude that the RGB model fine-tuned on thermal data achieves the highest scores of mAP=0.856 and mAP.50=0.890 on the TFW outdoor dataset. The performance of the fine-tuned model provides evidence for that it is possible to generalize from the visual to the thermal domain. However, the perfor- mance of the RGB baseline is inconclusive. The RGB baseline achieves the lowest mAP=0.205, but the mAP.50=0.728 which is considerably higher. Deviation in an- notation standards between the datasets could explain the di�ering AP score of the RGB baseline. Observing the two additional thermal models, the result in Table 5.2 shows that the performance increased by including the indoor data for training and implementing the face grid. The relatively poor performance of the thermal baseline could reflect upon the size of the dataset but also the thermal feature space. Firstly, the increase in performance due to the expansion of training data is not trivial as the outdoor and indoor settings di�er considerably in the thermal domain. It would be interesting to investigate if this is a model-specific behavior, possibly enabled by the pose object, or not. Secondly, McNally et al. [26] state that the keypoint objects are intended for keypoint with local features and pose objects for keypoints with global features. As facial keypoints are categorized as keypoints with local features, it is interesting that the modification of the pose objects has such a big impact on the performance. It would be interesting to investigate this behavior further as it speaks for the innovative height of the pose object. The TFW dataset is a relatively newly published dataset; hence, there is limited published work regarding the dataset. The authors of the dataset [48] provide two baselines, a YOLOv5 and a YOLOv5Face model, both trained on their dataset. Kuzdeuov et al. use a normalized mean error for evaluation. However, it is not stated how they have handled false detections, making it impossible to compare the results. 6.3 Dataset Biases To understand the limitations of a deep learning model, one has to be aware of dataset biases. Amplification of biases within the training dataset is a well-documented problem within computer vision [60]. For example, a deep-learning model can be biased in terms of gender, ethnicity, and age, which are ethical concerns. Studies have been conducted to investigate biases within the COCO dataset. Zhao et al. [61] conclude that the COCO dataset is biased toward skin color and gender at image and instance levels. According to their study, the COCO dataset contains 7.5 times more light-skinned subjects than dark-skinned subjects and two times more males than females. Moreover, Zhao et al. conclude that there are visual di�erences and that darker-skinned subjects more frequently appear in outdoor settings whiles light-skinned subjects more regularly appear in indoor environments. Although COCO is a large-scale dataset, it possesses biases that could cause ethical concerns in a real-world application. 40 6. Discussion Biases within the thermal domain are more complex compared to biases in the visual domain. For example, one should be aware that the heat signatures of a red house made of wood and a red brick house may di�er due to the di�erent thermal properties of the materials. The houses can appear the same on a cloudy day, but their heat signatures will di�er in direct sunlight. A thermal dataset can, for example, be biased in terms of lighting conditions, geographic locations, weather conditions, time of day or year of capture, and due to material properties. Hence, the size and diversity of the thermal data are essential. Biases within the TFW dataset have not been investigated, but one can assume it possesses biases due to its small size. Analyzing bias propagation becomes especially important in a potential health care application to avoid discrimination. In addition, biases regarding health status have to be taken under consideration in a medical application. The data used in this project are limited to healthy subjects, and the models are consequently biased towards healthy subjects. Scenarios likely to occur in a mass casualty incident such as bruising, blood, and loss of limbs have not been investigated. The same applies in the thermal domain, as skin temperature is correlated to health status. In a potential mass casualty incident application, domain-specific features such as those mentioned must be investigated to ensure unbiased models. 6.4 Tracking of ROIs The tracking of ROIs has been evaluated on two di�erent datasets in terms of image quality and distance to the observed subjects. The di�erence in distance likely explains the di�erence in MOTA scores as the distance directly relates to the pixel area of the ROIs. Due to data shortage, further quantifying the distance dependency has been left to future work. 6.4.1 Evaluation on the 300-VW Dataset The results in Tables 5.3 and 5.4 show that both the naive and the SORT algorithm achieve high accuracy on the 300-VW dataset, although only evaluated on the videos of category challenging. The forehead and mouth achieve the lowest MOTA scores, likely due to triangulation and lip movements. Hence, one can conclude that it is favorable to use the nose for triangulation rather than triangulate from the mouth. The result shows that implementing the SORT algorithm improves the tracking score, although the scores were nearly perfect to start with. An aspect to consider is the relative size and motions of the ROIs. The 300-VW dataset mainly consists of videos where the ROIs cover larger pixel areas, which is favorable from a tracking perspective. Nevertheless, the remaining question is what is a suitable distance for remote vital parameter measurement. Related research on remote vital parameter measurement by Yang et al. [14] report a method that performs well on a 0.6 ≠ 1.2m distance in indoor settings, which according to Yang et al. is a considerably large distance compared to previous works. If the distance is realizable in a UAV application is debatable; however, the range falls within distances featured in the 300-VW dataset. 41 6. Discussion 6.4.2 Evaluation on the RGBT234 Dataset The evaluation on the RGBT345 dataset does not achieve as high MOTA scores as the results obtained on the 300-VW dataset (see Table 5.3 and 5.4). Implement- ing the SORT algorithm improves the scores for all use cases. However, the large distances and the small relative motions of the ROIs are presumably limiting the performance of the algorithm. Based on visual observations, detection failures and identification failings likely cause the low MOTA scores. This behavior does not seem specific to the thermal domain, as the same could be observed in the RGB modality. The performance of a tracking algorithm is ultimately determined by the perfor- mance of the detector. Small objects with weak appearance and features decrease the performance of object detectors [59] which is likely what we observed here. As the ROIs cover small pixel areas, it is presumable that the detector is sensitive to larger distances and that the combination of larger distances and poor image qual- ity becomes especially problematic. One could assumably increase the performance by shortening the detection distances, but there are other approaches to improve the performance on small objects. A prominent solution is YOLO-Z, introduced by Benjumea et al. [62]. YOLO-Z is a YOLOv5 model modified on an architectural level, improving the detection abilities of small objects without any significant speed trade-o�. 6.5 Adaption to UAV Applications This project focuses on characteristic UAV motion blur. However, for a potential UAV implementation, additional aspects concerning the UAV domain have to be considered. The bird’s-eye view and the pitch angle of the UAV are particular features to be taken into account. Another aspect of bringing attention to is hardware limitations. Challenges for deploying a computer vision application on a UAV platform are: 1) the energy consumption, the application has to consume minimal power to minimize the impact on the UAV flight time, 2) the memory consumption and computational power as the UAV payload capability are limited, and 3) the data processing, the input data has to be processed with low latency to be applicable for real-time applications [20]. The detection and tracking of ROIs will only be a part of a large UAV triage application, making these aspects even more relevant. 42 7 Conclusions The objective of this thesis has been to investigate the possibilities for a UAV to detect and track ROIs for remote measurement of vital parameters in the visual and thermal domains. The ROIs have been the forehead, nose, mouth, and chest, and the UAV characteristic taken under consideration is motion blur due to random camera motion. In this project, we have taken an object detection approach to keypoint estimation and tracking. The state-of-the-art keypoint detector KAPAO and the tracking algorithm SORT have been implemented and evaluated in several experimental setups. Various metrics have been used to assess the performance of the keypoint estimators and tracking algorithms. The keypoint estimators have been evaluated using mAP and di�erent keypoint object similarity measurements. For the evaluation of the tracking performance, MOTA has been used. For detecting ROIs in the thermal domain, the model created by transferring knowl- edge from the visual to the thermal domain by fine-tuning showed the highest perfor- mance. Furthermore, as the second best performing model in the thermal domain, the expansion of the pose object improved the performance significantly. This result demonstrates the innovative use of the spatial information. Adversarial training on motion blur had a minimal impact on the performance in the presence of blur. The result problematizes the sizes of the ROIs and the motion characteristics of low-altitude UAV flights. Since no generalization among random blur kernels could be observed, the results support the use of optical stabilization in a possible UAV triage application. Regarding tracking of ROIs, the result concludes that the SORT algorithm improved the performance for all use cases. In addition, one can observe that both the SORT algorithm and the naive approach achieved an almost perfect score on the 300-VW dataset. This is a significant result as the dataset represent the distances of previous research on remote measure of vital parameters. The result shows that the pixel area is relevant, and it can be concluded that the distance and image quality impact the performance. This poses a potential limitation in a UAV triage application. 7.1 Future Work In this project, we have been using public available keypoint annotated datasets to avoid manual annotation of data. The UAV motion characteristics have been simulated, and further adopting the models for UAV triage application has been left for future work. Challenges to be met have previously been discussed in Sections 6.3 and 6.5. With respect to the UAV triage characteristic, possible directions for future 43 7. Conclusions work could be to 1) collect and manually annotate UAV data from a mass casualty incident or 2) synthesize data and simulate corresponding scenarios. An idea could be to create synthetic data by segmenting and adding humans to UAV footage. The research question is related to limitations associated with remote measurement of vital parameters for triage. It would be beneficial to further investigate to what extent it is possible to remotely measure vital parameters to define constraints re- garding this project in terms of accuracy, distances, and angles. Such limitations would put this project in context and make it possible to tune the evaluation metrics accordingly. This project aims to detect and track the ROIs in the visual and thermal domains. It would be interesting to investigate the properties of the thermal domain further. A possible direction of future studies could be to investigate the thermal dataset biases and implications of using apparent temperature di�erences for imaging. As there is a limited public available keypoint annotated thermal data, another direction of future studies could be to explore the possibility of transforming RGB images into thermal images using, for example, GANs. The tracking algorithm SORT has been implemented in this thesis project. How- ever, other possible tracking algorithms might be of interest. DeepSORT [63] is a successor to SORT which additionally employs a re-identification network. The re-identification network is a CNN trained to identify object similarity to reduce identity switches. The original DeepSORT re-identification network is trained on human objects, and a possibility could be to create a re-identification dataset for the ROIs. However, that has been left for future work due to limitations. Before deploying the models in a real-world application, dataset biases and bias propagation has to be addressed. Aspects to be taken under consideration have been previously mentioned in Section 6.3, which have to be investigated to be ad- dressed accordingly. A possible idea could be to expand the dataset artificially using GANs for image synthesis to increase the diversity of the model and eliminate biased behavior. Several ethical aspects have to be investigated before deploying the model in a real-world triage application. In a medical application, privacy concerns become particularly important and should hence be analyzed further. Storage of the patents personal data, patient privacy and consent, as well as reconstruction of training data, for example, has to be addressed before a potential deployment. 44 Bibliography [1] J. Bazyar, M. Farrokhi, and H. Khankeh, “Triage systems in mass casualty incidents and disasters: A review study with a worldwide approach,” Open Ac- cess Macedonian Journal of Medical Sciences (OAMJMS), vol. 7, no. 3, p. 482, 2019. DOI: http://dx.doi.org/10.3889/oamjms.2019.119. [2] J. Rantakokko, M. G. Lozano, G. Tolt, L. Thors, and A. Bucht, “Evakuering av skadade med obemannade farkoster,” tech. rep., Totalförsvarets forskningsin- stitut (FOI), 2022. ISSN: 1650-1942. [3] K. Khabarlak and L. Koriashkina, “Fast facial landmark detection and appli- cations: A survey,” arXiv:2101.10808, 2021. [4] Etikprövningsmyndigheten, “Värnar människan i forskning.” URL:https:// etikprovningsmyndigheten.se, accessed: October 25 2022. [5] Swedish Defence Research Agency (FOI), “About FOI.” URL: https://www. foi.se/en/foi/about-foi.html, accessed: Mars 2 2022. [6] A. Khorram-Manesh, J. Nordling, E. Carlström, K. Goniewicz, R. Faccin- cani, and F. M. Burkle, “A translational triage research development tool: Standardizing prehospital triage decision-making systems in mass casualty incidents,” Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine (SJTREM), vol. 29, no. 1, pp. 1–13, 2021. DOI: http://dx.doi. org/10.1186/s13049-021-00932-z. [7] E. Lee, E. Chen, and C.-Y. Lee, “Meta-rPPG: Remote heart rate estimation using a transductive meta-learner,” in Proceedings of European Conference on Computer Vision (ECCV), pp. 392–409, Springer, 2020. DOI: http://dx.doi. org/10.1007/978-3-030-58583-9_24. [8] F. J. Rodriguez-Lozano, F. León-García, M. Ruiz de Adana, J. M. Palomares, and J. Olivares, “Non-invasive forehead segmentation in thermographic imag- ing,” Sensors, vol. 19, no. 19, p. 4096, 2019. DOI: https://doi.org/10.3390/ s19194096. [9] V. Hartmann, H. Liu, F. Chen, W. Hong, S. Hughes, and D. Zheng, “Toward accurate extraction of respiratory frequency from the photoplethysmogram: Ef- fect of measurement site,” Frontiers in Physiology, vol. 10, p. 732, 2019. DOI: http://dx.doi.org/10.3389/fphys.2019.00732. 45 http://dx.doi.org/10.3889/oamjms.2019.119 https://etikprovningsmyndigheten.se https://etikprovningsmyndigheten.se https://www.foi.se/en/foi/about-foi.html https://www.foi.se/en/foi/about-foi.html http://dx.doi.org/10.1186/s13049-021-00932-z http://dx.doi.org/10.1186/s13049-021-00932-z http://dx.doi.org/10.1007/978-3-030-58583-9_24 http://dx.doi.org/10.1007/978-3-030-58583-9_24 https://doi.org/10.3390/s19194096 https://doi.org/10.3390/s19194096 http://dx.doi.org/10.3389/fphys.2019.00732 Bibliography [10] D. Müller, A. Ehlen, and B. Valeske, “Convolutional neural networks for se- mantic segmentation as a tool for multiclass face analysis in thermal infrared,” Journal of Nondestructive Evaluation, vol. 40, no. 1, pp. 1–10, 2021. DOI: http://dx.doi.org/10.1007/s10921-020-00740-y. [11] D. Djeldjli, F. Bousefsaf, C. Maaoui, F. Bereksi-Reguig, and A. Pruski, “Remote estimation of pulse wave features related to arterial sti�ness and blood pressure using a camera,” Biomedical Signal Processing and Control, vol. 64, p. 102242, 2021. DOI: http://dx.doi.org/10.1016/j.bspc.2020.102242. [12] D. E. King, “Dlib-ml: A machine learning toolkit,” The Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009. [13] G. Bradski and A. Kaehler, “OpenCV,” Dr. Dobb’s Journal of Software Tools, vol. 3, p. 2, 2000. [14] F. Yang, S. He, S. Sadanand, A. Yusuf, and M. Bolic, “Contactless measurement of vital signs using thermal and RGB cameras: A study of COVID 19-related health monitoring,” Sensors, vol. 22, no. 2, p. 627, 2022. DOI: http://dx. doi.org/10.3390/s22020627. [15] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “RetinaFace: Single- shot multi-level face localisation in the wild,” in Proceedings of IEEEConference on Computer Vision and Pattern Recognition (CVPR), pp. 5203–5212, 2020. DOI: http://dx.doi.org/10.1109/CVPR42600.2020.00525. [16] S. Skansi, “Feedforward neural networks,” in Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence, pp. 79–105, Cham: Springer, 2018. DOI: https://doi.org/10.1007/978-3-319-73004-2_4. [17] H. H. Aghdam and E. J. Heravi, “Convolutional neural networks,” in Guide to Convolutional Neural Networks: A Practical Application to Tra�c-Sign De- tection and Classification, pp. 85–130, Cham: Springer, 2017. DOI: https: //doi.org/10.1007/978-3-319-57550-6_3. [18] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” arXiv:1511.08458, 2015. [19] Y. Pang and J. Cao, Deep Learning in Object Detection, pp. 19–57. Singapore: Springer Singapore, 2019. DOI: https://doi.org/10.1007/ 978-981-10-5152-4_2. [20] S. Vaddi, E�cient object detection model for real-time UAV applications. PhD thesis, Iowa State University, 2019. DOI: http://dx.doi.org/10.5539/cis. v14n1p45. [21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 779–788, 2016. DOI: http://dx.doi.org/10.1109/CVPR.2016.91. 46 http://dx.doi.org/10.1007/s10921-020-00740-y http://dx.doi.org/10.1016/j.bspc.2020.102242 http://dx.doi.org/10.3390/s22020627 http://dx.doi.org/10.3390/s22020627 http://dx.doi.org/10.1109/CVPR42600.2020.00525 https://doi.org/10.1007/978-3-319-73004-2_4 https://doi.org/10.1007/978-3-319-57550-6_3 https://doi.org/10.1007/978-3-319-57550-6_3 https://doi.org/10.1007/978-981-10-5152-4_2 https://doi.org/10.1007/978-981-10-5152-4_2 http://dx.doi.org/10.5539/cis.v14n1p45 http://dx.doi.org/10.5539/cis.v14n1p45 http://dx.doi.org/10.1109/CVPR.2016.91 Bibliography [22] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv:1804.02767, 2018. [23] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” in Pro- ceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 7263–7271, 2017. DOI: http://dx.doi.org/10.1109/CVPR. 2017.690. [24] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv:2004.10934, 2020. [25] U. Nepal and H. Eslamiat, “Comparing YOLOv3, YOLOv4 and YOLOv5 for autonomous landing spot detection in faulty uavs,” Sensors, vol. 22, no. 2, p. 464, 2022. DOI: http://dx.doi.org/10.3390/s22020464. [26] W. McNally, K. Vats, A. Wong, and J. McPhee, “Rethinking keypoint repre- sentations: Modeling keypoints and poses as objects for multi-person human pose estimation,” arXiv:2111.08557, 2021. [27] C. Zheng, W. Wu, T. Yang, S. Zhu, C. Chen, R. Liu, J. Shen, N. Kehtar- navaz, and M. Shah, “Deep learning-based human pose estimation: A eurvey,” arXiv:2012.13392, 2020. [28] National Aeronautics and Space Administration, Science Mission Direc- torate, “Infrared waves,” 2010. URL: http://science.nasa.gov/ems/07_ infraredwaves, accessed: May 12 2022. [29] M. Vollmer, “Infrared thermal imaging,” in Computer Vision: A Reference Guide, pp. 666–670, Springer, 2021. DOI: http://dx.doi.org/10.1007/ 978-3-030-63416-2_844. [30] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020. DOI: http://dx.doi.org/10.1109/JPROC.2020. 3004555. [31] M. Sayed and G. Brostow, “Improved handling of motion blur in online object detection,” in Proceedings of IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), pp. 1706–1716, 2021. DOI: http://dx.doi.org/ 10.1109/CVPR46437.2021.00175. [32] A. Chaubey, N. Agrawal, K. Barnwal, K. K. Guliani, and P. Mehta, “Universal adversarial perturbations: A survey,” arXiv:2005.08087, 2020. [33] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich, “Examining the impact of blur on recognition by convolutional networks,” arXiv:1611.05760, 2016. [34] P. Samangouei, M. Kabkab, and R. Chellappa, “Defense-GAN: Protecting clas- sifiers against adversarial attacks using generative models,” arXiv:1805.06605, 2018. 47 http://dx.doi.org/10.1109/CVPR.2017.690 http://dx.doi.org/10.1109/CVPR.2017.690 http://dx.doi.org/10.3390/s22020464 http://science.nasa.gov/ems/07_infraredwaves http://science.nasa.gov/ems/07_infraredwaves http://dx.doi.org/10.1007/978-3-030-63416-2_844 http://dx.doi.org/10.1007/978-3-030-63416-2_844 http://dx.doi.org/10.1109/JPROC.2020.3004555 http://dx.doi.org/10.1109/JPROC.2020.3004555 http://dx.doi.org/10.1109/CVPR46437.2021.00175 http://dx.doi.org/10.1109/CVPR46437.2021.00175 Bibliography [35] N. Akhtar, A. Mian, N. Kardan, and M. Shah, “Advances in adversarial attacks and defenses in computer vision: A survey,” IEEE Access, vol. 9, pp. 155161– 155196, 2021. DOI: http://dx.doi.org/10.1109/ACCESS.2021.3127960. [36] S. Dodge and L. Karam, “Understanding how image quality a�ects deep neural networks,” in Proceedings of 2016 8th International Conference on Quality of Multimedia Experience (QoMEX 2016), pp. 1–6, 2016. DOI: http://dx.doi. org/10.1109/QoMEX.2016.7498955. [37] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Understanding and eval- uating blind deconvolution algorithms,” in Proceedings of IEEE Conference on Computer Visi