Self-Supervised Fixed-Scene Adaptation for Object-Detection in Real-Time Surveillance: A Comparative Study of YOLO11 and RF-DETR

Master’s thesis in Electrical Engineering

Jacob Justad

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2026

© Jacob Justad, 2026.

Supervisor: Ass Prof Jennifer Alvén, Electrical Engineering
Advisors: Martin Ljungqvist and Tiger Moberg, Axis Communications AB
Examiner: Ass Prof Jennifer Alvén, Electrical Engineering

Master’s Thesis 2026
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Abstract

This thesis investigates self-supervised fixed-scene adaptation for real-time object detectors in an edge-computing surveillance context. While modern object detectors achieve strong results on general-purpose benchmarks, deployment in static camera scenes introduces distinct challenges: domain shift to a specific viewpoint, limited availability of scene-specific labels, and stringent on-device compute and memory budgets. At the same time, the stationary background of surveillance footage provides exploitable structure, as do the temporal dependencies between video frames.
This study conducts a comparative analysis of two state-of-the-art object detection architectures: the Transformer-dominant RF-DETR and the convolutional neural network (CNN)-dominant YOLO11. The thesis employs the Scenes100 dataset to represent a broad range of surveillance environments. Experimental results demonstrate that RF-DETR consistently achieves higher accuracy, smoother convergence, and greater robustness than YOLO11, albeit with higher hardware demands. In contrast, YOLO11 variants (with a frozen backbone) leverage the larger trainable capacity of the neck and head to enable high scene-specific adaptability. While this yields significant gains under high-quality labeling, it tends to increase sensitivity to imperfect pseudo-labels and the risk of overfitting. Furthermore, by systematically varying model scales, adaptation strategies, and environmental conditions, the experimental design yields more than 3400 distinct runs.

First, the work examines the extent to which smaller, specialized models can match the performance of substantially larger models. The experimental results show that a small specialised model can compete with larger general models. Secondly, the study evaluates a proposed on-device self-supervised labeling strategy that integrates SAHI with a bidirectional implementation of ByteTrack. The proposed strategy provided reliable performance gains across all architectures and configurations by recovering hard examples, more specifically small, occluded, and low-confidence instances. Thirdly, the study investigates background-context fusion (BF). BF consistently improved performance for RF-DETR, but gave inconsistent results for YOLO11 and failed to increase robustness against seasonality, suggesting that it induced background-dependent overfitting.
Finally, the study shows that all models trained on a summer scene exhibit a decrease in relative performance compared with the non-adapted models during a seasonal domain shift to a winter scene.

Keywords: Computer Vision, YOLO11, DETR, RF-DETR, Object detection, LW-DETR, YOLO, Self-Supervised Learning

Acknowledgements

I would like to express my sincere gratitude to Axis Communications for providing the opportunity to conduct this thesis and for giving me access to the resources to complete this work. A special thanks goes to my supervisors at Axis, Martin Ljungqvist and Tiger Moberg. I am grateful to my academic supervisor, Jennifer Alvén, for her support, academic direction, and for ensuring the rigor of this research.

Jacob Justad, Gothenburg, 2026-02-08

Contents

1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Research Questions
  1.4 Limitations
2 Preliminaries
  2.1 Traditional computer vision-based object detection
  2.2 Neural networks-based object detection
  2.3 Convolutional neural networks-based object detection
    2.3.1 Classification and two-stage object detection
    2.3.2 YOLO - One-stage paradigm shift
  2.4 Transformer-based object detection
    2.4.1 The attention mechanism and the Vision Transformer
    2.4.2 RoboFlow-DETR (RF-DETR)
  2.5 Self-supervised learning and domain adaptation in object detection
  2.6 Evaluation metrics
  2.7 Established comparisons and performances
3 Methods
  3.1 Dataset and models
    3.1.1 Dataset Scenes100
    3.1.2 Models
  3.2 Pseudo-Label Generation Strategies
    3.2.1 Baseline 1: Self-Supervised Learning Baseline (SSL-B)
    3.2.2 Heavy Real-Time Ensemble (Ensemble)
    3.2.3 Proposed Method: SAHI + ByteTrack (ST)
    3.2.4 Server-based (SAM3)
  3.3 Training details and adaptation strategies
    3.3.1 Standard Fine-Tuning (SF)
    3.3.2 Background-Context Fusion (BF)
    3.3.3 Training setup
  3.4 Seasonal Data creation
  3.5 Validation
4 Results
  4.1 Architecture and model-size
  4.2 Pseudo-labeling strategy
  4.3 Adaptation Strategy
  4.4 Seasonality changes
5 Conclusion and Discussion
  5.1 Discussion
    5.1.1 Adaptation Capabilities
    5.1.2 Smaller models compared to larger
    5.1.3 SAHI + ByteTrack Yields performance increase at top level
    5.1.4 The Background Context Fusion impact on models
    5.1.5 Seasonality changes in scenes
  5.2 Recommendation for future works
  5.3 Industry Recommendations
  5.4 Conclusions
Bibliography
A Appendix 1

1 Introduction

1.1 Background

The field of real-time object detection has evolved greatly in recent years, mainly due to rapid breakthroughs in deep learning and image classification. The Convolutional Neural Network (CNN) [1] has served as the keystone of this development, making traditional feature extraction methods, such as the Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG), obsolete as standalone techniques. Furthermore, traditional machine learning models for classification, such as Support Vector Machines (SVMs), have been replaced by the fully connected layers of perceptrons within the CNN architecture.

While these CNN architectures provide a powerful foundation for classification, they do not inherently solve the full problem of object detection. This task is more complex, requiring the model to perform both localisation (finding where an object is) and classification (identifying what that object is) for potentially many objects at once. The challenge, especially for real-time applications, is to create architectures that perform both of these tasks efficiently and accurately.

The release of the YOLO model in 2015 [2] transformed object detection into a one-stage problem while also addressing the latency bottleneck (its fastest variant, Fast YOLO, processed 155 FPS), enabling real-time surveillance and the development of intelligence directly in the camera. The YOLO lineage has since matured [3], [4], [5], [6], [7], [8], [9], [10], [11], [12] and has become the industry standard for balancing speed and accuracy on constrained hardware. Today’s industry faces a paradigm shift originating from Natural Language Processing.
Following the widespread adoption of transformers [13] in powering Large Language Models (LLMs), the architecture was also adapted for image analysis. Vision Transformers (ViTs) challenged the CNN monopoly by treating image patches as sequences, demonstrating performance superior to state-of-the-art convolutional networks in classification tasks [14]. While the original Detection Transformer (DETR) introduced a revolutionary end-to-end pipeline for object detection, its prohibitive computational cost precluded edge deployment [15]. Recent innovations have dismantled this barrier, evolving from Lightweight DETR [16] to the state-of-the-art Roboflow Detection Transformer (RF-DETR) [17]. Crucially, RF-DETR diverges from previous architectures by integrating a DINOv2 [18] backbone, effectively distilling the robust features of a self-supervised foundation model into a real-time framework. However, transformer architectures are still more hardware-demanding.

Hardware limitations are the main bottleneck in edge surveillance. Because superior model performance generally correlates with higher hardware requirements, scaling advanced analytics can quickly become cost-prohibitive. The vital business case, therefore, lies in optimisation. By validating that efficient, lightweight models can rival the accuracy of computationally intensive ones, companies can deploy premium analytics on cost-effective hardware, cutting operational costs while maintaining high performance.

The most recent progress in object detection is typically benchmarked on COCO [19], a general-purpose dataset that has driven impressive gains but does not fully reflect fixed-camera surveillance. In a static scene, the camera viewpoint is constant and the background is repeatedly observed. This makes deployment different in two important ways. First, detectors trained on broad datasets often degrade under domain shift to a specific scene [20].
However, collecting and annotating scene-specific data at scale is costly and sometimes sensitive. This motivates self-supervised scene adaptation from unlabeled footage to specialize pre-trained detectors without manual labels [21], which is a necessity for commercial use. Secondly, the static background is not just irrelevant context; it can be exploited, as shown by Zhang et al. [21] for an older model architecture called Faster R-CNN [22]. This motivates investigating whether background extraction and feature fusion can systematically increase performance in the current state-of-the-art model architectures.

1.2 Aim

The primary aim of this thesis is to evaluate the performance and trade-offs of self-supervised fixed-scene adaptation for real-time object detectors in edge-computing environments.

The thesis focuses on evaluating two state-of-the-art real-time object-detector architectures currently dominating the field: the CNN-dominant architecture represented by YOLO11 and the transformer-dominant architecture represented by RF-DETR. I examine how effectively the models adapt to a specific scene in order to better understand the influence of architectural choices.

In an edge-computing environment, hardware constraints are inevitable. Thus, the adaptability of the models is explored further by experimenting with different model sizes.

Further, manually labeling potentially thousands of specific scenes is infeasible, making self-supervised learning a cost-effective choice. This thesis investigates how well a pseudo-labeling strategy that could run on-device performs relative to existing labeling methods.
Previous architectures have taken advantage of the stationary nature of surveillance cameras through background extraction and feature fusion, boosting performance [21]. Thus, I further aim to examine whether an adaptation of Zhang et al.’s method for exploiting static backgrounds [21] can be used to enhance the performance of YOLO11 and RF-DETR.

In real-world deployments, environmental and operational conditions vary over time (e.g., across seasons and weather regimes), rendering robustness and generalisation properties critical. Consequently, this thesis additionally investigates the extent of performance degradation under such distributional shifts to more rigorously characterize the risks associated with model specialisation, such as overfitting.

1.3 Research Questions

Based on these objectives, the thesis addresses the following specific research questions:

1. RQ1: How do YOLO11 (CNN) and RF-DETR (Transformer) compare in adaptation capability within static scenes?
2. RQ2: To what extent can self-supervised scene adaptation enable smaller models to compete with larger ones?
3. RQ3: Can a potential on-device strategy for self-supervised learning achieve performance on par with more resource-heavy methods?
4. RQ4: Does the integration of background extraction and feature fusion in a fixed-camera environment provide a performance improvement for the chosen models?
5. RQ5: How does the accuracy of the adapted model under seasonal shifts compare to the performance of the non-adapted base model and to its trained-domain performance?

1.4 Limitations

Each scene has been evaluated only once per model configuration, using the same seed throughout. No per-scene statistical analysis is performed beyond qualitative analysis, except where seasonality is investigated.
Hardware and latency validation is a primary limitation of this study, as I rely on reported latency figures from existing literature rather than direct on-device benchmarking. Crucially, no rigorous independent latency benchmarks were conducted in this study, although obtaining these metrics is essential to accurately evaluate the real-world computational trade-offs. The comparison between the models is further complicated by inconsistencies in prior works, such as the mixed use of FP16 for latency measurement versus FP32 for accuracy assessment. I use FP32 for both, although mixed precision is used when training RF-DETR.

I conducted preliminary latency measurements for the models used during training without applying any optimizations. However, the observed latencies deviated from the values reported in the existing literature. To enable a rigorous and fair comparison, additional work is required in terms of model optimization and exporting the models to the appropriate formats.

The experimental scope was restricted to a fixed input resolution of 640×640; generalisation of these findings to other resolutions remains unproven.

Additionally, due to time and resource limitations, I did not include the most recent state-of-the-art architectures such as D-Fine [23], but instead focused on more established architectures.

2 Preliminaries

2.1 Traditional computer vision-based object detection

Traditional computer vision (CV) approaches to object detection are defined by a multi-stage, "handcrafted" pipeline: Preprocessing → Feature Extraction → Classification. Unlike modern approaches that learn features directly from data, traditional methods rely on manual engineering to transform raw pixel data into compact representations robust to variations in lighting, scale, and rotation. A prominent example of this era is the Scale-Invariant Feature Transform (SIFT) [24], which had a major impact on the field.
SIFT was designed to match objects by identifying stable interest points ("blobs") and computing descriptors based on gradient orientations in the keypoint’s neighborhood. The Histogram of Oriented Gradients (HOG) [25] also showed promise in human detection. While effective for matching, using these descriptors for detection required a secondary step. Typically, descriptors were aggregated or analyzed using a "sliding window" approach, where a classifier scanned the image to identify object presence. Support Vector Machines (SVMs) became the state-of-the-art classifier for this task. Using the "kernel trick," SVMs find an optimal, maximum-margin hyperplane to separate object classes in the high-dimensional feature space created by descriptors like SIFT or HOG. However, the performance of these systems was fundamentally limited by the quality of the manually designed features.

2.2 Neural networks-based object detection

Neural networks represent a paradigm shift from handcrafted features to end-to-end learning. In this framework, the network learns to extract features for object detection directly from the training data.

The basic building block of deep neural networks, which are architectures of stacked layers, is the multi-layer perceptron. Artificial neurons are arranged in layers, where each unit computes a weighted sum of its inputs. Crucially, this sum is passed through a non-linear activation function (e.g., ReLU or Sigmoid). Without these non-linearities, a stack of neural layers, no matter how deep, would mathematically collapse into a single linear transformation, rendering the network incapable of modeling complex decision boundaries [26], [27], [28].

Training these networks is treated as an optimisation problem: minimizing a loss function that quantifies the error between the network’s predictions and the ground truth. Learning efficiently in such networks became possible through a significant breakthrough, backpropagation [29], [30].
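The claim that layers without non-linearities collapse into a single linear transformation can be checked numerically. The following is a minimal sketch using only the Python standard library; the matrices `W1`, `W2` and the input `x` are arbitrary values chosen for illustration, not taken from any model in this thesis.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Element-wise ReLU activation."""
    return [max(0, t) for t in v]


# Illustrative weights and input (assumptions, not thesis values).
W1 = [[1, -2], [3, 1]]
W2 = [[2, 1], [-1, 4]]
x = [1, 2]

# Two stacked linear layers without an activation ...
stacked_linear = matvec(W2, matvec(W1, x))
# ... equal one linear layer whose weight matrix is the product W2 W1.
collapsed = matvec(matmul(W2, W1), x)
# Inserting a ReLU between the layers breaks this equivalence.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked_linear, collapsed, with_relu)
```

With these values the two purely linear computations agree, while the ReLU variant gives a different output, which is what allows deeper networks to represent non-linear decision boundaries.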
Backpropagation applies the chain rule from calculus in two passes. In the forward pass, the input data is fed through the network, layer by layer, to compute the activations and the final output. This output is then used to calculate the value of the loss function. In the backward pass, the algorithm propagates the gradient of the loss backward through the network, starting from the output layer. At each layer, it efficiently computes the gradient of the loss with respect to that layer’s parameters, crucially reusing the gradients computed for the layer "above" it. This dynamic programming approach avoids redundant calculations and makes training deep networks computationally feasible. The gradients are most commonly used by an optimisation technique called gradient descent, which updates the parameters in the direction that reduces the loss.

To manage the complexity of modern deep networks, adaptive optimizers are used. AdamW [31] is currently a standard choice. While its predecessor, Adam [32], adapted learning rates using moving averages of gradients, it handled regularisation suboptimally. AdamW decouples weight decay from the gradient update, applying it directly to the parameters. This modification significantly improves generalisation and training stability for deep neural networks.

2.3 Convolutional neural networks-based object detection

This section provides a brief overview of convolutional networks used in object detection and concludes by describing YOLO11, one of the two main architectures examined in this thesis.

2.3.1 Classification and two-stage object detection

Hubel and Wiesel’s research [33] on a cat’s primary visual cortex established that specific neurons possess a distinct local receptive field, responding only to stimuli within a restricted region of the visual field. Furthermore, they demonstrated that many of these neurons are functionally selective, responding optimally to simple geometric structures such as oriented edges or bars.
They also showed that processing is hierarchical: "simple cells" respond to these local features and feed this information to other, more "complex cells," which pool from more than one cell. These biological findings laid the groundwork for the Neocognitron [34] and eventually the first practical CNN, LeNet-5, introduced by LeCun et al. in 1998 [35]. This architecture combined convolutional layers with subsampling (pooling) layers and was trainable end-to-end using backpropagation. It extracts local features through stacked layers, progressively increasing the receptive field, i.e., the area of the original image from which a feature can be derived; in a traditional CNN the receptive field grows with depth. The weight-sharing mechanism significantly reduced the model’s parameter count, enhancing parameter efficiency and generalisation, and its success in handwritten digit recognition demonstrated the viability of learned hierarchical features.

The modern era of large-scale deep convolutional models was catalyzed by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Krizhevsky, Sutskever, and Hinton’s AlexNet [36] achieved a substantial reduction in top-5 classification error compared to traditional computer vision pipelines based on hand-designed descriptors such as SIFT and HOG. This breakthrough pivoted object detection from sliding-window techniques to deep convolutional architectures. The foundational R-CNN [37] applied CNNs to region proposals generated by selective search, significantly improving accuracy but suffering from high latency due to redundant feature computations. R-CNN thereby introduced the two-stage object-detection model. Subsequent models optimized this pipeline: SPP-Net [38] and Fast R-CNN [39] introduced shared feature maps and Region of Interest (RoI) pooling, enabling end-to-end training for classification and regression.
Faster R-CNN [22] eventually eliminated the external proposal bottleneck by introducing the Region Proposal Network (RPN), achieving a fully unified, near real-time two-stage detector.

2.3.2 YOLO - One-stage paradigm shift

In 2015, Redmon et al. proposed "You Only Look Once" (YOLO), a novel architecture that reframed object detection as a single regression problem rather than a classification task applied to region proposals [2]. Unlike two-stage methods (e.g., Faster R-CNN) that first generate candidate regions, YOLO processes the entire image in a single forward pass, enabling real-time inference.

The fundamental concept involves dividing the input image into an S × S grid. If an object’s center falls within a grid cell, that cell is responsible for detecting it. Each cell simultaneously predicts B bounding boxes (coordinates x, y, w, h), a confidence score reflecting the intersection-over-union (IoU) with the ground truth, and conditional class probabilities. This idea is visualized in Figure 2.1.

Given that grid cells may produce overlapping predictions for the same object, Non-Maximum Suppression (NMS) is applied during inference. NMS filters redundant detections by discarding boxes with low confidence and suppressing those that have a high IoU overlap with the highest-scoring box for a given class.

Figure 2.1: Object detection as a regression problem. YOLO divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B · 5 + C) tensor. Panels: S × S grid on input; bounding boxes + confidence; class probability map; final detections. From the original paper [2].

Training is optimized via a multi-part Sum of Squared Errors (SSE) loss function that combines localisation and classification. This loss penalizes errors in the box center coordinates (x, y).
A dimension loss is also added based on width and height (w, h); rather than regressing the raw values, it uses the squared error of their square roots, reflecting that small deviations in large boxes matter less than in small boxes. Additionally, it regresses the objectness confidence toward the ground truth IoU, while a down-weighted "no-object" term suppresses false positives in background cells to prevent them from overwhelming the gradients. Finally, the optimisation includes a classification term that minimizes the error in class probabilities, but this is applied conditionally: it only penalizes classification errors if an object is present in the grid cell. This ensures the model effectively learns P(Class|Object), ignoring class predictions for background cells.

The YOLO series has since been incrementally updated, improving both accuracy and latency. The most significant change in YOLOv2 [40] was the adoption of anchor boxes, a concept borrowed from region-proposal-based networks like Faster R-CNN. The fully connected layers responsible for directly predicting bounding box coordinates in YOLOv1 were removed. Instead, the final convolutional layers were designed to predict offsets to a set of pre-defined prior boxes, or anchors. Instead of hand-picking the anchor box priors, which can be suboptimal, YOLOv2 employed k-means clustering on the bounding box dimensions from the training dataset to automatically find a good set of priors. YOLOv3 introduced multi-scale detection using a Feature Pyramid Network (FPN)-like structure, extracting features at three different strides to detect objects of varying sizes [4], [41]. Subsequent versions like YOLOv4 and YOLOv5 introduced Cross-Stage-Partial (CSP) backbones and Path Aggregation Networks (PANet) for better feature extraction.
They also formalized "Bag of Freebies" training techniques, replacing standard loss functions with CIoU and introducing Mosaic data augmentation [6]. Models continued evolving, and YOLOX in 2021 [42] introduced anchor-free predictions. Instead of predicting offsets to pre-defined anchor boxes, YOLOX treated detection as a per-pixel prediction problem. It also introduced a decoupled head, separating the classification and regression tasks into two parallel branches. To address the issue of assigning ground truth objects to the correct predictions for training, YOLOX incorporated an advanced dynamic label assignment strategy called SimOTA. Instead of relying on fixed IoU-based rules, SimOTA formulates label assignment as an optimal transport problem. YOLOv8 [9] adopted these ideas and made further incremental improvements to the blocks through which features are passed.

The YOLO11 family of models is the latest iteration that remains true to the convolutional architecture, representing the latest generation of YOLO-based CNN detectors as of 2024–2025 and encompassing the accumulated advancements. Although YOLO12, YOLO13, and YOLO26 (soon to be released) are newer models, they differ from the more traditional YOLO and convolutional network architecture and are not as established as YOLO11.

YOLO11’s architecture, like most detection systems, can be divided into three main modules: the backbone (feature extractor), the neck (feature fusion), and the head (predictors). YOLO11 remains true to its core by being a convolution-based architecture with grid-based cells at its center. This means that despite being anchor-free and more flexible, YOLO11 still divides the image feature maps into a dense grid, where each cell (spatial location in the feature map) is responsible for predicting object presence, bounding box offsets, and class probabilities. For YOLO11, three different spatial feature maps are used in the detector’s head, as can be seen in Figure 2.2.
YOLO11 uses a configuration similar to a CSP-based backbone [43] for multi-scale feature extraction. However, it incorporates key modules such as C3k2 to improve feature representation (replacing the older C2f), SPPF (Spatial Pyramid Pooling Fast) to extract global semantic information, and the new C2PSA module, which uses pyramid slice attention to better identify objects in complex backgrounds and locate small objects. Following the backbone, the neck employs a PAN-FPN (Path Aggregation Network - Feature Pyramid Network) structure. This design allows for bidirectional (bottom-up and top-down) information flow, effectively fusing shallow spatial features with deep semantic features to enhance localisation accuracy. Finally, the head uses a decoupled architecture, handling classification and bounding box regression as separate tasks.

A critical evolution in this architecture is the shift from static to dynamic label assignment. While the original YOLOv1 relied on a rigid geometric rule, assigning responsibility strictly to the single grid cell containing the object’s center, YOLO11 employs Task-Aligned Learning (TAL). Instead of a binary assignment based on center location, TAL dynamically selects the top-k grid cells inside an object that maximize a high-order alignment metric:

$$t = s^{\alpha} \times u^{\beta} \qquad (2.1)$$

where s is the predicted classification score, u is the Intersection over Union (IoU) with the ground truth, and the exponents α and β weight the two terms. This ensures that the assigned positive samples maximize both classification confidence and localisation accuracy simultaneously. Unlike the hard labels in YOLOv1, TAL generates "soft" supervision targets, scaling the training signal based on the alignment quality t.
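To make the alignment metric in Equation (2.1) concrete, the following minimal sketch ranks a handful of candidate grid cells and keeps the top-k as positives. The candidate scores, the IoU values, the helper names `alignment_metric` and `select_topk`, and the exponent values α = 0.5 and β = 6.0 are assumptions chosen for illustration; they are not taken from the experiments in this thesis.

```python
def alignment_metric(score, iou, alpha=0.5, beta=6.0):
    """Task-alignment metric t = s^alpha * u^beta from Equation (2.1).

    alpha and beta are illustrative defaults (assumptions).
    """
    return (score ** alpha) * (iou ** beta)


def select_topk(candidates, k, alpha=0.5, beta=6.0):
    """Return the k candidates with the highest alignment metric.

    candidates: list of (cell_id, classification_score, iou) tuples.
    Returns a list of (cell_id, t) sorted by descending t.
    """
    scored = [(cell, alignment_metric(s, u, alpha, beta)) for cell, s, u in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical grid cells inside one ground-truth box: (id, score s, IoU u).
candidates = [
    ("a", 0.9, 0.50),  # confident class score, poor localisation
    ("b", 0.6, 0.90),  # balanced
    ("c", 0.3, 0.95),  # weak class score, very good localisation
    ("d", 0.8, 0.70),
]

positives = select_topk(candidates, k=2)
for cell, t in positives:
    print(cell, round(t, 4))
```

With a large β, as assumed here, localisation quality dominates the ranking, so cell "a" is not selected despite having the highest classification score.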
The classification branch uses Binary Cross Entropy (BCE) loss and an efficient depthwise convolution layer, while the regression branch combines Distribution Focal Loss (DFL) and Complete IoU (CIoU) loss to optimize bounding box accuracy and stability [12], [44], [45].

Figure 2.2: Detailed breakdown of the YOLO11 architecture. The input image is processed through a CSP-based backbone (blue) featuring SPPF and C2PSA modules, followed by feature fusion in the neck (grey) via upsampling and concatenation, leading to the final multi-scale detection output. From Fang et al. [44].

2.4 Transformer-based object detection

This section describes the attention mechanism and its application in computer vision. Finally, I introduce RoboFlow-DETR (RF-DETR), which is one of the main architectures examined in this study.

2.4.1 The attention mechanism and the Vision Transformer

The Transformer architecture marked a paradigm shift by demonstrating that recurrence was not a prerequisite for state-of-the-art sequence modeling [46]. The central hypothesis, "Attention Is All You Need," proposed dispensing with recurrence entirely and relying solely on attention mechanisms. The core component, self-attention (or intra-attention), allows the model to compute representations for each position in a sequence by attending to all other positions within that same sequence. To compensate for the loss of sequential order information, the model ingests positional encodings along with the input embeddings. This architecture is massively parallelizable and has been one of the foundations of the progress seen today in artificial intelligence.

In 2021, Dosovitskiy et al. introduced the Vision Transformer (ViT), successfully applying this architecture to computer vision [14]. ViT processes an image by dividing it into fixed-size patches (e.g., 16×16 pixels), flattening them into linear embeddings, and treating them as a sequence of "words."
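The patch-splitting step can be sketched in a few lines. This is a simplified illustration that assumes a single-channel image stored as nested lists and omits the learned linear projection that ViT applies to each flattened patch; the function name `patchify` is hypothetical.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch blocks.

    Patches are returned in row-major order, each as a flat list of pixels.
    Assumes a single channel; a real ViT would also project each patch
    through a learned linear layer (omitted here).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Toy 4 x 4 image whose pixel values are 0..15, split into 2 x 2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), tokens[0])
```

By the same arithmetic, a 640×640 input with 16×16 patches yields 40 × 40 = 1600 patch tokens, which illustrates why the sequence length, and hence the attention cost, grows quickly with resolution.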
A learnable "classification token" is prepended to the sequence to aggregate global information for the final prediction. This approach achieved excellent results compared to state-of-the-art CNNs while being computationally efficient to train.

Figure 2.3: The image is split into fixed-size patches, each patch is converted into a vector, a positional embedding is added, and the resulting sequence of vectors is passed into a standard Transformer encoder. To perform classification, an extra learnable "classification token" is added to the sequence. The figure is retrieved from the original paper [14].

2.4.2 Roboflow-DETR (RF-DETR)

RF-DETR [47] represents the current state of the art in object detection and is the latest step in a lineage that began with DETR [15]. DETR introduced a paradigm shift by using set-based prediction via bipartite matching to eliminate the need for Non-Maximum Suppression (NMS). While subsequent iterations like Deformable DETR [48] improved convergence through sparse sampling and RT-DETR [49] optimized encoders for real-time speed, RF-DETR is structurally derived from LW-DETR [16]. It adopts a modular encoder-projector-decoder architecture that is further optimized via Neural Architecture Search (NAS) to identify Pareto-optimal configurations by varying the patch size, the number of decoder layers, the number of queries, the image resolution, and the number of windows in the windowed attention block [47].

The architecture (see Figure 2.4 for a full view) begins with the encoder, which diverges from standard ViT backbones by utilizing DINOv2 [18].
This self-supervised Vision Transformer is pre-trained on massive curated datasets to yield robust, "all-purpose" visual features, and uses efficient windowed self-attention to manage computational costs [50]. Linking the encoder and decoder, and serving as the neck, is the projector. The projector employs C2f blocks adapted from YOLOv8 to fuse multi-scale features effectively.

The decoder implements a mixed query selection strategy [51] to improve initialisation. In standard DETR models, object queries are typically static, learnable embeddings that must learn to locate objects from scratch. In contrast, this strategy extracts the top-K features from the projector's last layer, representing the regions with the highest probability of containing objects, and uses their positions to dynamically initialize the spatial queries. These positional priors are then combined with learnable content queries. Essentially, this gives the decoder a "head start" by explicitly pointing it toward relevant image regions, which acts as a spatial prior and significantly accelerates training convergence.

Figure 2.4: Overview of the RF-DETR architecture utilizing DINOv2 and NAS [47].

To stabilize and accelerate convergence, RF-DETR adopts Group-DETR [52] training dynamics. The object queries are instantiated as K parallel groups, where each group undergoes independent one-to-one bipartite matching (the linear sum assignment problem), which is solved using a variation of the Hungarian algorithm.
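The group-wise matching can be sketched as follows for a toy case with two groups, three queries per group and two ground-truth boxes. This is only an illustration: the cost values are invented, the function names are hypothetical, and a brute-force search over permutations is used as a stand-in for the Hungarian algorithm, which is only viable at this toy size.

```python
from itertools import permutations


def match_group(cost):
    """One-to-one matching for a single query group.

    cost[q][g] is the matching cost between query q and ground truth g.
    Brute force over permutations replaces the Hungarian algorithm here.
    Returns a dict {gt_index: query_index} with minimal total cost.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assign = total, queries
    return {g: q for g, q in enumerate(best_assign)}


def match_all_groups(group_costs):
    """Match every group independently and collect positives per ground truth."""
    positives = {}
    for group_id, cost in enumerate(group_costs):
        for gt, query in match_group(cost).items():
            positives.setdefault(gt, []).append((group_id, query))
    return positives


# K = 2 groups, 3 queries each, 2 ground-truth boxes (invented costs)
group_costs = [
    [[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]],
    [[0.3, 0.7], [0.6, 0.4], [0.2, 0.9]],
]
positives = match_all_groups(group_costs)
print(positives)
```

Each ground-truth index ends up with one matched query per group, so with K groups every object receives K positive predictions while the matching inside each group stays one-to-one.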
This effectively creates a global one-to-many assignment strategy where a single ground-truth object serves as a positive target for K predictions (one per group), significantly increasing the density of supervision signals per image. This auxiliary grouping is used solely for training; during inference, the extra groups are discarded to revert to a single-decoder architecture. Although there are K decoder groups during training, they all share weights, so the memory increase is minimal [52].

The model minimizes a composite loss $L_{total}$ applied to all intermediate decoder layers and encoder proposals. The classification term, $L_{cls}$, uses an IoU-aware Binary Cross-Entropy loss. Instead of a static binary label, the target for positive samples is softened to a dynamic quality score $t = p^{\alpha} \cdot \mathrm{IoU}^{1-\alpha}$, where $p$ is the predicted probability of that class, IoU is the intersection-over-union with the ground truth, and $\alpha$ is a hyperparameter that balances localisation and class confidence. This formulation aligns classification confidence with localisation accuracy, explicitly suppressing high-confidence but poorly localized predictions. Simultaneously, the box regression terms combine a standard L1 loss for absolute coordinate accuracy with a Generalized IoU (GIoU) loss [53], which ensures non-vanishing gradients even for non-overlapping boxes by penalizing the part of the smallest enclosing rectangle that is not covered by the two boxes.

2.5 Self-supervised learning and domain adaptation in object detection

Modern detectors increasingly rely on self-supervised (label-free) pretraining to improve downstream transfer under distribution shift. Contrastive pretraining (e.g., MoCo [54]) learns instance-discriminative representations without labels, while vision transformers trained with self-distillation (DINOv2) [18] or masked image modeling (MAE) further boost robustness to domain changes when fine-tuned for detection [55].
These paradigms reduce reliance on labeled source domains and provide stronger initialisation for adaptation.

Automatic adaptation to specific domains has improved significantly compared to standard source-only baselines. RoyChowdhury et al. [56] demonstrated that automatically obtaining pseudo-labels from the source (or "base") detector and refining them with single-direction tracking allows the model to self-train effectively. This process fills in missing detections (false negatives) in the new domain, leading to a promising performance increase over the original pre-trained model.

In the self-driving vehicle domain, a semi-supervised training method was integrated into YOLOv5 that improves label generation by using both high- and low-confidence predictions, rather than discarding the latter. The authors introduce a bi-directional object tracking mechanism that leverages temporal data (past and future frames) to refine bounding boxes and recover missing labels [57]. The core idea of using low-confidence detections, however, originates from ByteTrack, a simple yet robust multi-object tracking method published in 2022. By associating high-score boxes first and then leveraging similarities between low-score boxes (often occluded objects) and existing tracklets, the method recovers true objects and filters background noise [58].

Closest to surveillance deployments, self-supervised scene adaptation specializes a generic detector for a single fixed-view camera using only its unlabeled stream. Zhang & Hoai [21] propose cross-teaching between two base detectors, bidirectional tracking for pseudo-label densification, location-aware mixup that respects fixed object priors, and explicit background modeling/fusion. They introduce Scenes100, a 100-camera benchmark and evaluation protocol for per-scene adaptation. This line of work shows sizable accuracy gains without any human annotation in the target scene [21].
Finally, label fusion is managed by a graph-based refinement module that acts as a "consensus engine" to eliminate duplicates. By treating every candidate box as a graph node and drawing edges between overlapping predictions, the module constructs connected components representing single objects. From each component, it selects the node with the highest degree (the box that overlaps with the greatest number of other predictions), thereby consolidating the most consistent spatial proposal from the combined detector and tracker streams.

Independent of adaptation, environmental/context specialisation can lower intra-scene variability. By routing images to indoor versus outdoor expert models, the authors of [59] observe statistically significant mean Average Precision (mAP) gains over a single generalist network, showing that straightforward scene categorisation (e.g., a Places-based router) provides benefits that are complementary to self-supervised adaptation.

To address the challenges of detecting small objects in high-resolution imagery, such as drone or satellite surveillance where targets often lack pixel detail, Akyon et al. proposed Slicing Aided Hyper Inference (SAHI) [60]. This method employs a slicing strategy during both fine-tuning (by augmenting data with zoomed-in patches) and inference (by processing the image in overlapping slices). By recovering small details before merging the results, SAHI proves highly effective for scenarios involving small, densely packed targets.

2.6 Evaluation metrics

The most commonly used industry-standard evaluation metric is mAP, which incorporates both localisation and classification to determine a true positive. It builds on IoU, which quantifies the spatial accuracy of a predicted bounding box relative to the ground truth. IoU is calculated as the ratio of the overlapping area between the predicted box ($B_p$) and the ground truth box ($B_{gt}$) to the total area covered by their union:

$\mathrm{IoU} = \dfrac{\mathrm{Area}(B_p \cap B_{gt})}{\mathrm{Area}(B_p \cup B_{gt})}$

A prediction is classified as a True Positive (TP) only if its IoU with a ground truth object meets a specific threshold (e.g., IoU ≥ 0.5) and the class labels match. Predictions falling below this threshold, or duplicate detections of the same object, are penalized as False Positives (FP).

I compute the mean average precision, which aggregates the area under the Precision-Recall (PR) curve across all classes. Precision measures the purity of positive predictions, while Recall measures the proportion of ground truth objects that are successfully detected. Crucially, the PR curve is generated by ranking all detections by their confidence score (from highest to lowest). The final AP is derived using maximum interpolation, where the precision at a given recall level $r$ is taken as the maximum precision for any recall $r' \geq r$. This interpolation ensures that the metric rewards the model for placing correct detections at the top of the ranking and mitigates penalties for low-confidence false positives once all ground truth objects have been recalled.

mAP50: This is computed as the mean Average Precision at a single IoU threshold of 0.50.

mAP50:95: This metric, which is the main standard used in industry, is obtained by averaging the average precision over 10 IoU thresholds, ranging from 0.50 to 0.95 in increments of 0.05.

2.7 Established comparisons and performances

A study by Sapkota et al. [17] directly evaluated RF-DETR against YOLO12 for green fruit detection in complex orchard environments. The results highlighted RF-DETR's superior localisation in single-class settings (mAP@50 = 0.9464) and robust performance in multi-class occluded scenarios (mAP@50 = 0.8298).
Qualitative analysis further demonstrated that RF-DETR's global self-attention mechanism allowed it to recover heavily occluded or camouflaged objects more effectively than YOLO12, which tended to over-detect in cluttered regions. Furthermore, RF-DETR exhibited significantly faster convergence, plateauing in fewer than 10–20 epochs, validating the advantage of its pre-trained DINOv2 backbone.

Recent benchmarks reveal a critical divergence between parameter efficiency (model size) and latency efficiency (inference speed). While the YOLO11 family retains a distinct advantage in pure storage requirements (YOLO11-N, with 2.6M parameters, is nearly 12× smaller than RF-DETR-N, with 30.5M parameters), this size advantage does not translate to superior runtime performance. Instead, transformer-based models (LW-DETR and RF-DETR) perform strongly on accuracy-latency curves, although a new method is challenging the field.

For applications constrained by inference time rather than memory efficiency, RF-DETR offers significantly higher accuracy per millisecond of compute. Notably, RF-DETR-N matches the latency of YOLO11-N (≈2.3 ms vs. 2.2 ms) but delivers a +10.9 mAP improvement on COCO (48.0 vs. 37.1). This indicates that while transformers require more memory to store weights, their parallelizable architecture allows them to process information as fast as much smaller, deeper CNNs while extracting far richer features. In the high-accuracy regime (>5 ms), the RF-DETR family remains optimal, with RF-DETR-2XL achieving the highest accuracy across all tested benchmarks [47].

Table 2.1: Comparison of YOLO11, LW-DETR and RF-DETR variants on COCO and RF100-VL.
Family    Variant  Params (M)  Latency (ms)  AP COCO  AP50 COCO  AP RF100  AP50 RF100
YOLO11    N        2.6         2.2           37.1     51.6       55.5      81.3
YOLO11    S        9.4         3.2           44.1     59.3       56.4      82.5
YOLO11    M        20.1        5.1           48.3     63.6       57.0      82.5
YOLO11    L*       25.3        6.2           53.4     –          –         –
LW-DETR   N        12.1        1.9           42.9     60.7       57.1      84.7
LW-DETR   S        14.6        2.6           48.0     66.8       57.4      85.0
LW-DETR   M        28.2        4.4           52.6     72.0       59.8      86.8
RF-DETR   N        30.5        2.3           48.0     67.0       57.6      84.9
RF-DETR   S        32.1        3.5           52.9     71.9       60.7      87.0
RF-DETR   M        33.7        4.4           54.7     73.5       61.5      87.7
RF-DETR   L†       135.6       –             59.0     77.3       –         –
RF-DETR   2XL      126.9       17.2          60.1     78.5       63.3      88.9

General: COCO YOLO11-N/S/M and all LW/RF-DETR (N/S/M/2XL) metrics are retrieved from the RF-DETR paper [47].
* Metrics from official Ultralytics documentation; RF100-VL metrics were not reported.
† RF-DETR-L (preview) parameters and COCO AP / AP50 are reported by Roboflow (model zoo and GitHub); latency and RF100-VL metrics were not reported.

3 Methods

This chapter describes the methodology required to address the primary aim of evaluating self-supervised scene adaptation in edge-computing environments.

To investigate the trade-offs between YOLO11 (CNN) and RF-DETR (Transformer) models in a fixed camera scene, the dataset used is of utmost importance. Generic object detection benchmarks do not adequately capture the stationary backgrounds, specific camera angles, and environmental noise inherent to surveillance. Therefore, I will first introduce the Scenes100 dataset, chosen specifically to emulate diverse, realistic fixed-camera environments. This data foundation is essential for correctly evaluating how well different model sizes and architectures can adapt to real-world deployment.

Secondly, an objective is to determine whether on-device adaptation can compete with resource-heavy methods. Consequently, I outline four distinct label generation strategies.
These strategies are selected to establish the potential and limitations of the proposed method: a naive baseline, a heavy real-time ensemble, the proposed resource-efficient SAHI+ByteTrack, and a general auto-labeling server-side model using Segment Anything Model 3 (SAM3).

Thirdly, to determine whether the static characteristics of surveillance video can be leveraged to improve performance, I describe the implementation of two different model adaptation approaches: standard fine-tuning and a modified background-context fusion method derived from the method by Zhang et al. [21]. This section also details the strict freezing of backbones to safeguard against catastrophic forgetting.

To ensure that the findings are robust against real-world environmental shifts, I introduce a method for seasonal data creation using generative AI to understand how performance may vary as the environment shifts and what strategy should be used in a production setting.

Finally, I describe the validation metrics (mAP, IoU) used to quantify performance. A general overview of the experiments is shown in Figure 3.1.

Figure 3.1: Overview of the general approach of the experiments.

3.1 Dataset and models

This section describes the dataset and object detection models used to evaluate self-supervised adaptation. A geographically diverse fixed-camera dataset is combined with models of varying capacity to study how architecture and computational budget affect adaptation performance, particularly for edge deployment.

3.1.1 Dataset Scenes100

To evaluate the model's performance across diverse surveillance environments, I use the Scenes100 dataset [21]. This dataset serves as a benchmark for scene-adaptive object detection, consisting of 100 distinct videos captured from fixed-perspective cameras across 16 countries.
The videos capture a wide variety of environments, ranging from crowded urban centers to isolated roadways, covering different times of day, weather conditions, and object densities. The dataset targets two primary categories: person and vehicle. The vehicle category includes all vehicles with four or more wheels and thus corresponds to the COCO categories car, bus, and truck. Crucially, each scene includes manually annotated evaluation frames and a spatial validity mask. This mask is applied to filter out irrelevant regions, such as distant backgrounds. The number of validation frames per scene varies; scenes with a high object density have fewer validation frames.

For my experiments, I adopt the training frame splits provided by the official implementation [21]. However, to reduce computational overhead, I limit the training data to the last 9,000 samples of the provided sequence. These frames are extracted at a stride of 5 (every 5th frame) from the original 30 FPS videos; the 9,000 frames represent roughly 30 minutes of video. The resolution varies from 720 × 1280 to 1080 × 1920. Four examples of the scenes are shown in Figure 3.2.

Figure 3.2: Examples of diverse surveillance scenes from the Scenes100 dataset with ground-truth labels: (a) Scene 001, (b) Scene 003, (c) Scene 019, (d) Scene 146. Red boxes are "person" and blue boxes are "vehicle". The semi-transparent green overlay is the validation mask (objects in this area are excluded).

3.1.2 Models

To evaluate self-supervised adaptation across different architectural paradigms and model capacities, four distinct models were selected. The selection criteria focused on representing the current state of the art for both Convolutional Neural Networks (CNNs) and Transformer-based detectors (RF-DETR).
YOLO11 (Nano and Large) serves as the CNN representative: the Nano variant tests adaptation under extreme edge constraints, while the Large variant establishes a performance ceiling. These are contrasted against RF-DETR (Nano and Medium), representing the Transformer architecture. Notably, RF-DETR-Medium is selected over the Large variant to respect the strict hardware constraints of the edge environment. RF-DETR-Medium is also closer to YOLO11-Large in both parameters and latency than RF-DETR-Large is, which allows for a fairer comparison. A full comparison of models can be seen in Table 2.1.

3.2 Pseudo-Label Generation Strategies

To investigate whether on-device training is competitive with heavier methods, this section outlines four pseudo-label generation strategies. These strategies are:

1. SSL-B (Self-Supervised Learning Baseline)
2. Ensemble (Real-time ensemble)
3. ST (SAHI + ByteTrack)
4. SAM3 (Server-based Segment Anything Model 3)

All models are pretrained on 640 × 640 resolution, which is used unless otherwise specified. The resulting pseudo-labels act as ground truth for training in the respective strategy.

3.2.1 Baseline 1: Self-Supervised Learning Baseline (SSL-B)

This strategy represents the naive baseline where a COCO-pretrained detector generates pseudo-labels without any refinement. A strict confidence threshold of $\lambda_{det} = 0.5$ is applied to filter initial predictions. Additionally, for YOLO-based detectors, Non-Maximum Suppression (NMS) is applied with an IoU threshold of $\lambda_{nms} = 0.75$ (the Ultralytics default) to eliminate redundant detections.

3.2.2 Heavy Real-Time Ensemble (Ensemble)

Following the self-supervised scene adaptation framework proposed by Zhang et al. [21], this strategy generates high-quality pseudo-labels using a computationally intensive ensemble strategy. Because of its very high hardware requirements, it represents the upper limit of what can be counted as real-time edge deployment.
This "heavy" approach serves as a robust baseline. The pipeline aggregates predictions from two large-scale detection models, RF-DETR-Large and YOLO11-X (X-Large). These initial detections are subsequently refined via bi-directional tracking using DiMP50 (Discriminative Model Prediction) [61]. DiMP50 is a powerful single-object tracker that learns a discriminative target model to distinguish objects from the background. In the pipeline of Zhang et al. [21], the tracker is initialized using deep features extracted from the detection boxes (using the ResNet-50 backbone), and these candidates are propagated in both forward and backward temporal directions. This bi-directional strategy is crucial for recovering false negatives in adjacent frames and refining localisation accuracy via the tracker's precise IoU estimation component.

Final candidates are determined through a graph-based merging step, where boxes with identical class labels and high intersection-over-union ($\lambda_{iou}$) are combined, retaining only the most connected candidate. I adopt the well-performing hyperparameters established by Zhang et al. [21].

3.2.3 Proposed Method: SAHI + ByteTrack (ST)

I propose a resource-efficient pipeline designed to take advantage of the temporal data of videos and the pretrained model. The method leverages Slicing Aided Hyper Inference (SAHI) [60] to improve small-object performance and ByteTrack [58] to recover low-confidence detections temporally. The main idea is that if a model can be run for inference in a given edge environment, the creation of the pseudo-labels should also be possible in the same environment.

1. Global and Local Inference: Standard inference is first run on the full frame to capture global context. In parallel (if applicable), SAHI performs inference on overlapping windows (overlap ratio 0.1). Window sizes are set to 640 × 640.

2. Hierarchical Merging: To prevent object fragmentation (where a single large object is detected as multiple parts across windows), global detections are prioritized, which makes tracking and labels more stable. If a SAHI detection is contained within a high-confidence global detection, overlapping by more than a set percentage of the total area, the global box is retained and the SAHI-based box is discarded. I then perform NMS to remove potential duplicates.

3. Temporal Recovery (ByteTrack): First, I save all high-confidence labels at this stage. I then use a bi-directional implementation of the ByteTrack algorithm [58] to mitigate trajectory fragmentation caused by occlusion or motion blur. Two copies of the detections are created: one copy stays in the original temporal order while the other is reversed, giving labels for both temporal directions. ByteTrack is then used to recover labels. Unlike traditional tracking methods that strictly discard detections below a high confidence threshold (e.g., $\lambda_{high} = 0.5$), ByteTrack employs a hierarchical data association strategy.

• First Association: High-confidence detections are initially matched to existing tracklets using Kalman Filter motion predictions and Intersection-over-Union (IoU). If a high-confidence label in the frame is not matched to a tracklet from previous frames, a new tracklet is initiated.

• Second Association: Crucially, any tracklets that remain unmatched are not immediately terminated. Instead, the algorithm searches a secondary pool of low-confidence proposals (threshold $0.01 < \lambda_{low} < 0.5$) to find spatial matches.

Thus, based on the active tracklets, I can recover some potential low-confidence labels. However, because tracking is bi-directional, some labels can be recovered twice, so I run NMS across the recovered labels to remove duplicates. The whole flow of pseudo-label creation is visualised in Figure 3.3.
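The hierarchical merging step and the subsequent duplicate removal can be sketched as below. This is a minimal illustration under stated assumptions rather than the thesis implementation: containment is measured relative to the SAHI box area, only same-class boxes are compared, and the thresholds (`conf_thr`, `contain_thr`, `iou_thr`) and the dictionary layout of a detection are hypothetical. The ByteTrack recovery stage is not included.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def hierarchical_merge(global_dets, slice_dets, conf_thr=0.5, contain_thr=0.8):
    """Keep all global boxes; drop slice boxes mostly covered by a confident one."""
    kept = list(global_dets)
    anchors = [d for d in global_dets if d["score"] >= conf_thr]
    for det in slice_dets:
        covered = any(
            g["cls"] == det["cls"]
            and area(det["box"]) > 0
            and intersection(det["box"], g["box"]) / area(det["box"]) >= contain_thr
            for g in anchors
        )
        if not covered:
            kept.append(det)
    return kept


def nms(dets, iou_thr=0.75):
    """Greedy class-wise NMS: keep the highest-scoring box of each overlap cluster."""
    kept = []
    for det in sorted(dets, key=lambda d: d["score"], reverse=True):
        if all(k["cls"] != det["cls"] or iou(k["box"], det["box"]) < iou_thr for k in kept):
            kept.append(det)
    return kept


global_dets = [{"box": (100, 100, 400, 300), "cls": "vehicle", "score": 0.9}]
slice_dets = [
    {"box": (110, 120, 250, 290), "cls": "vehicle", "score": 0.6},  # fragment of the vehicle
    {"box": (500, 50, 520, 100), "cls": "person", "score": 0.4},    # small person
    {"box": (501, 51, 521, 101), "cls": "person", "score": 0.3},    # near-duplicate
]
labels = nms(hierarchical_merge(global_dets, slice_dets))
print([(d["cls"], d["score"]) for d in labels])
```

In this toy frame the vehicle fragment found in a slice is absorbed by the full-frame vehicle box, while the small person that only the sliced pass detected is kept and its near-duplicate is suppressed by NMS.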
3.2.4 Server-based (SAM3)

This strategy yields a high-quality labeled dataset and serves as a representative commercial-grade labeling approach. It leverages the SAM3 video tracker [62] as a general-purpose, server-side commercial auto-labeling tool. SAM3 is well suited to labeling video because the model integrates both a detector and a tracker, so high-quality labels can be created end-to-end.

In my implementation, SAM3 is prompted with text concepts to map outputs to our specific ontology: person uses the prompt person, while vehicle aggregates the prompts car, truck, and bus. To stabilize tracking while avoiding duplicates, frames are processed in overlapping sliding windows containing 35 saved frames plus 2 look-back context frames that initialise tracking. Finally, per-frame outputs are filtered to retain boxes with scores > $\lambda_{Sam3} = 0.5$ and aggregated via NMS ($\lambda_{nms} = 0.85$) to remove duplicates. For this strategy, I use the original resolution of the videos for full performance.

Figure 3.3: Overview of the flow for creation of pseudo-labels using SAHI + ByteTrack (ST).

3.3 Training details and adaptation strategies

This section describes the implementation of the experiments. First, I cover the two adaptation strategies for the base models: standard fine-tuning and background-context fusion. I then discuss how the general training is conducted.

3.3.1 Standard Fine-Tuning (SF)

This serves as the baseline adaptation method. It is a standard fine-tuning strategy in which I only re-initialize the head and adapt the model to the input resolution:

• Initialisation: Models are initialized with COCO pre-trained weights. For RF-DETR, positional encodings are bilinearly interpolated to 640 × 640.
• Head re-initialisation: The classification and regression heads are re-initialized to accommodate the smaller number of classes.
• Training: The backbone remains frozen.
A constant learning rate and fixed batch size are applied across all experiments to ensure comparable convergence.

3.3.2 Background-Context Fusion (BF)

To explicitly leverage the static nature of surveillance cameras, I adapt the background-fusion concept from Zhang et al. [21]; my variant is most similar to their "mid-fusion" adaptation. There are two main differences: the backbone is frozen, and I do not add a loss function to regularize the update of backbone weights. The flow of the modified architecture can be seen in Figure 3.4.

• Background generation: A dynamic background reference image $B$ is constructed by temporal aggregation. I used the background images already provided in the official Scenes100 repository [21], and therefore also adopt their method of creating them, which is as follows. For a video frame $I$ of dimension $H \times W$ associated with a set of $K$ pseudo-annotated object bounding boxes $\{(x_{1,k}, y_{1,k}, x_{2,k}, y_{2,k}) \mid k = 1, \dots, K\}$, a background mask $M$ of the same dimension as $I$ can be constructed as follows. For each pixel $(x, y)$ in $M$, set $M[x, y] = 0$ if $(x, y)$ is inside any pseudo-annotated bounding box, and 1 otherwise. Then, for a sequence of frame-mask pairs $\{(I_l, M_l) \mid l = 1, \dots, L\}$, the background image is determined as:

$B = \dfrac{\sum_{l=1}^{L} I_l \otimes M_l}{\sum_{l=1}^{L} M_l}$, (3.1)

where $\otimes$ is the pixel-wise multiplication operator. However, there might be a location $(x', y')$ that lies inside an object bounding box in every image, for example, a parked car, so that $M_l[x', y'] = 0$ for all $l$. In this case, the background at this location is never observed, and its pixel value cannot be determined via Equation 3.1. In those cases, an inpainting algorithm was used [21].

• Object mask: An object mask $M_O$ is derived via a normalised difference, $M_O = (I - B + 1) \times 0.5$, to serve as the secondary input stream.
During both training and validation, each frame is paired with the temporally nearest background image, providing the closest match for the fusion.
• Parallel streams: The original image $I$ and the object mask $M_O$ are processed through two parallel, frozen backbones.
• Feature-level fusion: Unlike prior works that use dual-branch losses to update the backbone, I strictly keep the backbone frozen. Fusion occurs element-wise: feature maps from the image stream and the mask stream are combined via element-wise averaging.
• Prediction: The fused multi-scale feature map is passed to the detector neck and head.

Figure 3.4: Overview of the Background-Context Fusion modification, from the input to the fused features that are passed on for further processing.

3.3.3 Training setup

For each generated set of pseudo-labels, the models are trained using the official repositories, which I have adapted to fit the current settings. All models have been pretrained on the COCO dataset and are provided by the models' open-source repositories [63], [64]. These are the models that I refer to as the base models.

To ensure similar training settings, I use the AdamW optimiser with a constant learning rate of $1 \cdot 10^{-4}$. The models are trained for 2 epochs. No augmentations are applied except resizing and normalisation. The resolution for training and inference is 640 × 640. A batch size of 12 is used, and the model used for validation after 2 epochs is the Exponential Moving Average (EMA) of the weights, using the default settings of the respective public repositories.

Consistent with modern literature on avoiding catastrophic forgetting in foundation models [21], [65], all backbones are strictly frozen during training. All other parameters are trained during the experiments. See Table 3.1 for the total and trainable parameter counts.

For RF-DETR, the pretrained models have not been trained at 640 × 640 resolution as YOLO11 has been; 576 × 576 was used for Medium and 384 × 384 for Nano. Thus, the positional encodings are bilinearly interpolated to fit 640 × 640 resolution. Furthermore, YOLO has been trained using letterboxing [45], which retains the aspect ratio when resizing the image, while RF-DETR has not. I have kept this behavior consistent with the pre-training procedure, since altering it led to a substantial drop in performance.

Table 3.1: Model parameter breakdown (in millions).

Model       Total (M)  Trainable (M)  Frozen (M)  % Trainable
YOLO11n     2.590      1.225          1.365       47.3%
YOLO11l     25.312     12.478         12.834      49.3%
RF-nano     30.467     6.885          23.583      22.6%
RF-medium   33.687     9.828          23.859      29.2%

3.4 Seasonal data creation

To understand how a model trained for a specific scene responds when the scene's conditions shift during real-world deployment, I chose to translate one scene filmed during the summer into a winter version. This is done using Nano-banana-Pro from Google [66], a multi-modal-to-image model, by uploading the validation frames one at a time (the summer/original image) together with a prompt. The prompt went through minor changes; however, these adjustments were not explored over an extended period of time. The final prompt was selected based on my perceived similarity among the generated images during visual inspection:

"I am using a validation set in the summer for a static camera surveillance object detector. However, I would like to create a winter version of it, Object location cant be changed no matter what. Do it for this, make sure not to ADD/REMOVE or CHANGE people or cars locations" [66].

The selected scene for the experiment is Scene 001, and a side-by-side comparison example from this scene is shown in Figure 3.5.
However, not all objects remained consistent after generating the winter version of a frame, so it was necessary to manually re-annotate certain frames. Across all validation frames, the total number of objects decreased for the person class and increased for the vehicle class when comparing the original summer images to the generated winter versions. The resulting ground-truth object counts, summed over all validation frames, are presented in Table 3.2.

Table 3.2: Change in Ground Truth Labels by Season

| Class | Summer (original) | Winter |
|---|---|---|
| Person | 485 | 344 |
| Vehicle | 221 | 229 |

Figure 3.5: A side-by-side comparison. (a) The original summer image. (b) The winter image generated with Nano-banana-Pro.

3.5 Validation

This section outlines how the evaluation is carried out and which metrics are reported. As introduced in the preliminaries, the primary metric is the industry-standard mean Average Precision (mAP). In line with the COCO evaluation protocol, I restrict evaluation to the top 100 detections per image, sorted by confidence score. This ranking procedure ensures that evaluation focuses on the model's most confident predictions. Because average precision uses maximum interpolation, low-confidence detections that exceed the number of ground-truth instances (i.e., the "tail") do not reduce the final score, as long as true positives appear at the top of the ranking. This yields a standardized evaluation setting that avoids unfairly penalizing the model for low-confidence background noise in sparse scenes. Following the same protocol, I also report AP broken down by object size: AP_small, AP_medium, and AP_large.
These metrics are computed from the ground-truth bounding-box area (w × h in pixels), measured at the original resolution of the videos:

• AP_small: objects with area < $32^2$ pixels (i.e., smaller than approximately 32 × 32 pixels).
• AP_medium: objects with area in the range [$32^2$, $96^2$) pixels.
• AP_large: objects with area ≥ $96^2$ pixels (i.e., at least approximately 96 × 96 pixels).

To align the base detectors with the surveillance context, the COCO class ontology is remapped to person and vehicle, using the classes mentioned in Section 3.1. Additionally, since the number of persons and vehicles varies across scenes in the validation sets, I compute a weighted mAP based on the ratio of the total number of persons to vehicles in each scene. This metric is reported to provide a fairer and more representative evaluation of scene-level performance. Each fine-tuned model is evaluated on the specific scene for which it was adapted. This results in a total of 100 × 4 × 2 × 4 = 3200 models to be evaluated, covering all combinations of scenes, model architectures, adaptation methods, and pseudo-labeling strategies. Furthermore, following the practice of Zhang et al. [21], I apply the non-evaluation mask: bounding boxes with at least one corner inside the mask are removed. Thus, distant parts of the frames, where objects are deemed too small and blurry, do not affect the evaluation results. The average metrics across scenes are reported. When referring to AP without qualification in the coming chapters, I mean AP50:95.

4 Results

This chapter goes through the results of the experiments, starting with the performance across the 100Scenes dataset and how the different architectures converged. Further, I look deeper into how the different pseudo-labeling strategies perform. Additionally, I cover the results of the different adaptation strategies, comparing Background-Context Fusion with Standard Fine-tuning.
Lastly, I report the impact of seasonal changes on the models' performance.

4.1 Architecture and model-size

This section addresses the impact of architecture choice and model size on detection performance and adaptation capability. Models are compared across nano and medium/large scales, using both base models (COCO-pretrained) and adapted configurations. Performance is evaluated quantitatively using weighted and raw AP metrics, complemented by qualitative cases and an analysis of convergence behavior.

The first section of Table 4.1 reports the performance of the base models (COCO-pretrained) across all scenes in the Scenes100 dataset. The RF-DETR Medium model achieved the highest overall performance of 0.4535 AP_weighted, whereas the YOLO11-Nano model attained the lowest at 0.2406 AP_weighted. The second section of Table 4.1 provides an overview of the best model configuration for each architecture and size, chosen by highest AP_weighted. SAM3 is included in all configurations except one, which instead employs the ST strategy. The best-performing configurations for the RF-DETR models both use the Background-Context Fusion strategy, whereas both YOLO11 models rely on standard fine-tuning. Further, the RF-DETR models outperform the YOLO11 variants across metrics: the best RF-DETR configuration achieves 0.4912 AP_weighted, while the best YOLO11 configuration achieves 0.4721 AP_weighted. Compared to the base performance, RF-DETR Medium still achieves the highest performance, and RF-DETR Nano has surpassed YOLO11-Large. Notably, the improvement of YOLO11-Nano is substantially greater than that of the other models, going from 0.2406 to 0.3750 AP_weighted, an increase of more than 50% relative to its base model.
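To make the size of this gain concrete, the absolute and relative improvements for YOLO11-Nano can be worked out from the AP_weighted values quoted from Table 4.1 (0.2406 for the base model, 0.3750 for the best adapted configuration). The symbols $\Delta_{\text{abs}}$ and $\Delta_{\text{rel}}$ are introduced here only for illustration, and the rounding is approximate:

```latex
\[
\Delta_{\text{abs}} = 0.3750 - 0.2406 = 0.1344 \quad (\approx 13.4 \text{ percentage points})
\]
\[
\Delta_{\text{rel}} = \frac{0.3750 - 0.2406}{0.2406} \times 100 \approx 55.9\%
\]
```

The absolute figure corresponds, up to rounding, to the delta reported for this configuration in Table 4.2, whereas the "more than 50%" statement refers to the relative figure.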
Based on qualitative analysis, certain scenes exhibit very low performance across all models. These scenes are characterized by a high density of small objects, with the camera fixed far from the objects. One example is Scene 019, shown in Figure 4.1, where the best-performing configuration, RF-DETR Medium using BF and SAM3, reached 0.22 AP_weighted, while the worst, YOLO11-Nano, could not detect any objects.

Table 4.1: Performance comparison between base models and best adapted configurations (mean over all scenes).

| Model | Model Adaptation | Labeling Strategy | AP (Weighted) | AP50 (Weighted) | AP (Raw) | AP50 (Raw) |
|---|---|---|---|---|---|---|
| Base models (COCO pretrained, no adaptation) | | | | | | |
| RF (Medium) | Base | None | 0.4535 | 0.7028 | 0.4431 | 0.6736 |
| RF (Nano) | Base | None | 0.4113 | 0.6797 | 0.4037 | 0.6502 |
| YOLO11 (Large) | Base | None | 0.4227 | 0.6181 | 0.4059 | 0.5784 |
| YOLO11 (Nano) | Base | None | 0.2406 | 0.4107 | 0.2339 | 0.3832 |
| Best adapted configurations | | | | | | |
| RF (Medium) | Background-Context Fusion | ST | 0.4912 | 0.7580 | 0.4732 | 0.7260 |
| RF (Nano) | Background-Context Fusion | SAM3 | 0.4758 | 0.7679 | 0.4518 | 0.7302 |
| YOLO11 (Large) | Standard Fine-tuning | SAM3 | 0.4721 | 0.7307 | 0.4393 | 0.6810 |
| YOLO11 (Nano) | Standard Fine-tuning | SAM3 | 0.3750 | 0.6395 | 0.3416 | 0.5748 |

Figure 4.1: Image from Scene 019.

The convergence of the models during training is illustrated in Figure 4.2. For each model–scene combination, the metric was first normalized and subsequently averaged over all scenes, providing a representative convergence curve for each model configuration. The RF-DETR models exhibit highly consistent convergence dynamics across model variants, adaptation procedures, and labeling strategies. Their convergence is rapid, with performance beginning to plateau after approximately 200–300 batch steps. In contrast, the YOLO11-based models display greater variability.
The YOLO11-Nano variants require more iterations to converge, and their curves exhibit an approximately exponential convergence profile. The YOLO11-Large model behaves more similarly to RF-DETR in terms of initial convergence speed. However, instead of reaching a clear plateau, it tends to continue improving over a longer range of training steps, indicating more prolonged learning compared to RF-DETR.

4.2 Pseudo-labeling strategy

Table 4.2 presents the performance of each method relative to its corresponding base model on the 100Scenes dataset. Several consistent patterns emerge. First, the Self-Supervised Learning baseline (SSL-B), where the labels are created by the current model itself without any augmentation or improvements, does not improve the average performance across scenes for any model configuration, although it increases performance for some individual scenes (see APPENDIX). Furthermore, it leads to a pronounced degradation of the YOLO11-based models: under Standard Fine-tuning, YOLO11-Large drops by ∼ 11.7 percentage points in AP_weighted, compared to ∼ 1.3 percentage points for RF-DETR Medium. Second, the proposed on-device strategy, SAHI-ByteTrack (ST), yielded AP_weighted gains across all methods and model families, and is part of one of the best-performing configurations shown in Table 4.1. The smaller models in both architectures gained relatively more from Ensemble and SAM3 than the larger ones. SAM3 also showed the largest and most consistent increase across model architectures and configurations. This is evident from the bold numbers, which mark the best labeling strategy in all model adaptations in Table 4.2 except for RF-DETR Medium with Background-Context Fusion. The Ensemble strategy showed a consistent increase for all metrics on YOLO11-Nano, while being more inconsistent for the other models.
Looking at the more detailed Table 4.3, a clear dichotomy emerges regarding class-specific improvements, particularly within the RF-DETR architecture. While the AP_vehicle scores remain relatively static across adaptation methods, hovering near the 0.51 baseline for the Medium model, the AP_person metric benefits significantly from scene adaptation, rising from a base of 0.3095 to over 0.37 in the best-performing configurations. Furthermore, a clear trend across both YOLO11 and RF-DETR architectures is that the most substantial relative gains are concentrated in the smaller object scales rather than the larger ones. While AP_large shows only marginal improvements, often hitting a saturation point, AP_small and AP_medium exhibit dramatic increases; for instance, in the YOLO11-Nano model, AP_small nearly triples under the best configuration, going from ∼ 0.04 to ∼ 0.12. In general, smaller objects are also harder to detect, as the models consistently score worse on the AP_small metric.

Figure 4.2: Convergence dynamics across architectures. (a) Convergence speed for YOLO11 models under Standard Fine-tuning (SF) and Background-Context Fusion (BF), by labeling strategy. (b) Convergence speed for RF-DETR models under Standard Fine-tuning (SF) and Background-Context Fusion (BF), by labeling strategy. Curves show normalized AP50:95 performance over training steps, where 0 denotes run start and 1 denotes peak performance per run.

Table 4.2: Performance delta vs. base (percentage points). Positive values indicate improvement over the corresponding base model. Best values per model block are bolded.

| Model | Model Adaptation | Labeling Strategy | ∆ AP (Weighted) | ∆ AP50 (Weighted) | ∆ AP (Raw) | ∆ AP50 (Raw) |
|---|---|---|---|---|---|---|
| RF (Medium) | Standard Fine-tuning | SSL-B | −1.33% | −3.31% | −1.84% | −3.50% |
| RF (Medium) | Standard Fine-tuning | ST | 2.44% | 3.11% | 1.61% | 2.45% |
| RF (Medium) | Standard Fine-tuning | Ensemble | 0.08% | −3.79% | −0.44% | −3.33% |
| RF (Medium) | Standard Fine-tuning | SAM3 | 2.28% | 5.43% | 0.54% | 3.61% |
| RF (Medium) | Background-Context Fusion | SSL-B | −0.50% | −1.66% | −0.89% | −1.53% |
| RF (Medium) | Background-Context Fusion | ST | 3.77% | 5.53% | 3.00% | 5.24% |
| RF (Medium) | Background-Context Fusion | Ensemble | 1.25% | −2.17% | 0.72% | −1.53% |
| RF (Medium) | Background-Context Fusion | SAM3 | 3.48% | 7.90% | 2.07% | 7.09% |
| RF (Nano) | Standard Fine-tuning | SSL-B | −3.94% | −8.84% | −4.46% | −9.01% |
| RF (Nano) | Standard Fine-tuning | ST | 2.58% | 1.28% | 1.44% | 0.25% |
| RF (Nano) | Standard Fine-tuning | Ensemble | 3.40% | −2.50% | 2.55% | −2.07% |
| RF (Nano) | Standard Fine-tuning | SAM3 | 5.42% | 6.58% | 3.50% | 5.12% |
| RF (Nano) | Background-Context Fusion | SSL-B | −3.70% | −8.23% | −4.13% | −7.99% |
| RF (Nano) | Background-Context Fusion | ST | 3.61% | 3.22% | 2.40% | 2.32% |
| RF (Nano) | Background-Context Fusion | Ensemble | 4.33% | −1.11% | 3.60% | −0.30% |
| RF (Nano) | Background-Context Fusion | SAM3 | 6.46% | 8.82% | 4.81% | 8.00% |
| YOLO11 (Large) | Standard Fine-tuning | SSL-B | −11.69% | −22.28% | −10.69% | −19.32% |
| YOLO11 (Large) | Standard Fine-tuning | ST | 2.17% | 0.64% | 1.12% | 0.65% |
| YOLO11 (Large) | Standard Fine-tuning | Ensemble | −0.78% | −4.21% | −0.51% | −2.32% |
| YOLO11 (Large) | Standard Fine-tuning | SAM3 | 4.94% | 11.26% | 3.34% | 10.26% |
| YOLO11 (Large) | Background-Context Fusion | SSL-B | −11.70% | −22.44% | −10.62% | −19.21% |
| YOLO11 (Large) | Background-Context Fusion | ST | 1.61% | −0.60% | 0.74% | −0.42% |
| YOLO11 (Large) | Background-Context Fusion | Ensemble | −1.63% | −6.15% | −1.37% | −4.19% |
| YOLO11 (Large) | Background-Context Fusion | SAM3 | 4.45% | 10.80% | 2.88% | 9.59% |
| YOLO11 (Nano) | Standard Fine-tuning | SSL-B | −7.43% | −16.51% | −7.36% | −15.04% |
| YOLO11 (Nano) | Standard Fine-tuning | ST | 5.69% | 7.20% | 4.64% | 6.12% |
| YOLO11 (Nano) | Standard Fine-tuning | Ensemble | 10.41% | 13.03% | 9.13% | 12.07% |
| YOLO11 (Nano) | Standard Fine-tuning | SAM3 | 13.45% | 22.88% | 10.77% | 19.15% |
| YOLO11 (Nano) | Background-Context Fusion | SSL-B | −6.78% | −15.75% | −6.94% | −14.63% |
| YOLO11 (Nano) | Background-Context Fusion | ST | 6.22% | 7.54% | 5.04% | 6.44% |
| YOLO11 (Nano) | Background-Context Fusion | Ensemble | 10.17% | 12.29% | 8.74% | 11.42% |
| YOLO11 (Nano) | Background-Context Fusion | SAM3 | 13.42% | 22.83% | 10.71% | 19.20% |

4.3 Adaptation Strategy

As hinted at in previous sections, the Background-Context Fusion approach has a more pronounced positive effect on the RF-DETR models than on the YOLO11 models: it is part of the best-performing configuration for both RF-DETR models, but for neither of the YOLO11 models.
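Whether such a difference between SF and BF is systematic can be checked with a paired test over per-scene scores, and Table 4.4 reports a Wilcoxon signed-rank test for this purpose. The following is a minimal sketch of how that statistic could be computed, assuming zero differences are dropped, tied absolute differences receive average ranks, and the p-value comes from a normal approximation without tie correction. The function name and the example scores are hypothetical and are not the thesis data or the author's implementation.

```python
import math


def wilcoxon_signed_rank(bf_scores, sf_scores):
    """Two-sided Wilcoxon signed-rank test on paired scores.

    Returns (w_plus, w_minus, p_value). Zero differences are dropped,
    tied absolute differences get average ranks, and the p-value uses
    a normal approximation (an assumption of this sketch).
    """
    diffs = [b - s for b, s in zip(bf_scores, sf_scores) if b != s]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of ranks for positive and negative differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation of the null distribution of W+.
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Hypothetical per-scene AP values for five scenes (not thesis data).
bf = [0.52, 0.47, 0.61, 0.39, 0.55]
sf = [0.50, 0.48, 0.58, 0.35, 0.50]
print(wilcoxon_signed_rank(bf, sf))
```

With only a handful of scenes the normal approximation is rough; with 100 scenes per comparison it is more reasonable. Statistical libraries such as SciPy (`scipy.stats.wilcoxon`) offer maintained implementations of the same test.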
An important goal is to determine whether background extraction and feature fusion improve performance. To investigate whether a statistically significant difference exists between Standard Fine-tuning (SF) and Background-Context Fusion (BF), I employed a Wilcoxon signed-rank test. This non-parametric procedure was selected because visual inspection of the performance distributions indicated deviations from normality, including skewness, rendering the assumptions of the paired t-test questionable. The Wilcoxon signed-rank test is more robust under these conditions and therefore more appropriate for the analysis. The corresponding results are reported in Table 4.4. For the RF-DETR models, the use of BF is clearly advantageous, regardless of model size or pseudo-labeling strategy. For YOLO11, the outcome depends on model size and pseudo-labeling strategy. The significant results for YOLO11-Nano indicate that BF is beneficial with the less computationally intensive pseudo-labeling strategies (SSL-B, ST), whereas the non-significant outcomes for the heavier strategies (Ensemble, SAM3) are either positive or ambivalent. In contrast, the significant findings for YOLO11-Large generally indicate a slight decrease in performance.

Nevertheless, specific scenes exhibit a systematic preference for particular methods. I highlight several such cases to illustrate that the relative performance of the approaches can depend strongly on the nature and contextual characteristics of the scene. Figure 4.3 shows scenes 156 and 058, where BF outperforms SF while also surpassing the base model. Conversely, Figure 4.4 presents scenes in which SF achieves superior performance compared to BF, again while exceeding the performance of the base model. For Scene 058, YOLO11-Nano with BF reached 0.5123 AP_weighted using SAM3 and 0.4676 using ST, while the YOLO11-Large base model reached 0.5202.
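This demonstrates that a model roughly 10× smaller in parameters was able to achieve comparable performance.

Table 4.3: Detection performance across Scenes100 for each model, including base (non-adapted) performance, Standard Fine-tuning, and Background-Context Fusion. Values are means across all scenes. AR100 is the Average Recall at a maximum of 100 detections; AP refers to AP50:95.

| Model | Model Adaptation | Labeling Strategy | AP_person | AP_vehicle | AR100 | AP_small | AP_medium | AP_large |
|---|---|---|---|---|---|---|---|---|
| RF (Medium) | Base | None | 0.3095 | 0.5150 | 0.5352 | 0.1810 | 0.5260 | 0.6651 |
| RF (Medium) | Standard Fine-tuning | SSL-B | 0.2945 | 0.4957 | 0.5142 | 0.1603 | 0.5191 | 0.6572 |
| RF (Medium) | Standard Fine-tuning | ST | 0.3383 | 0.5154 | 0.5578 | 0.2075 | 0.5500 | 0.6587 |
| RF (Medium) | Standard Fine-tuning | Ensemble | 0.3105 | 0.5091 | 0.5403 | 0.1887 | 0.5393 | 0.6611 |
| RF (Medium) | Standard Fine-tuning | SAM3 | 0.3232 | 0.5110 | 0.5392 | 0.2140 | 0.5366 | 0.6192 |
| RF (Medium) | Background-Context Fusion | SSL-B | 0.3194 | 0.4882 | 0.5252 | 0.1781 | 0.5276 | 0.6596 |
| RF (Medium) | Background-Context Fusion | ST | 0.3743 | 0.5053 | 0.5789 | 0.2339 | 0.5588 | 0.6602 |
| RF (Medium) | Background-Context Fusion | Ensemble | 0.3385 | 0.5027 | 0.5591 | 0.2056 | 0.5515 | 0.6638 |
| RF (Medium) | Background-Context Fusion | SAM3 | 0.3615 | 0.5015 | 0.5571 | 0.2429 | 0.5484 | 0.6150 |
| RF (Nano) | Base | None | 0.2687 | 0.4846 | 0.5034 | 0.1549 | 0.4770 | 0.6268 |
| RF (Nano) | Standard Fine-tuning | SSL-B | 0.2351 | 0.4331 | 0.4569 | 0.1125 | 0.4426 | 0.6223 |
| RF (Nano) | Standard Fine-tuning | ST | 0.3071 | 0.4685 | 0.5266 | 0.1839 | 0.4994 | 0.6267 |
| RF (Nano) | Standard Fine-tuning | Ensemble | 0.3005 | 0.5017 | 0.5312 | 0.1778 | 0.5280 | 0.6541 |
| RF (Nano) | Standard Fine-tuning | SAM3 | 0.3136 | 0.5025 | 0.5297 | 0.2037 | 0.5260 | 0.6166 |
| RF (Nano) | Background-Context Fusion | SSL-B | 0.2490 | 0.4252 | 0.4595 | 0.1238 | 0.4463 | 0.6226 |
| RF (Nano) | Background-Context Fusion | ST | 0.3365 | 0.4568 | 0.5390 | 0.2024 | 0.5049 | 0.6298 |
| RF (Nano) | Background-Context Fusion | Ensemble | 0.3273 | 0.4942 | 0.5472 | 0.1935 | 0.5404 | 0.6583 |
| RF (Nano) | Background-Context Fusion | SAM3 | 0.3480 | 0.4927 | 0.5462 | 0.2319 | 0.5378 | 0.6054 |
| YOLO11 (Large) | Base | None | 0.2855 | 0.4647 | 0.4926 | 0.1376 | 0.4939 | 0.6402 |
| YOLO11 (Large) | Standard Fine-tuning | SSL-B | 0.1992 | 0.3578 | 0.3199 | 0.0620 | 0.3620 | 0.5724 |
| YOLO11 (Large) | Standard Fine-tuning | ST | 0.2932 | 0.4836 | 0.4667 | 0.1748 | 0.5122 | 0.6154 |
| YOLO11 (Large) | Standard Fine-tuning | Ensemble | 0.2760 | 0.4729 | 0.4431 | 0.1485 | 0.4972 | 0.6399 |
| YOLO11 (Large) | Standard Fine-tuning | SAM3 | 0.3121 | 0.5043 | 0.5144 | 0.2121 | 0.5372 | 0.6106 |
| YOLO11 (Large) | Background-Context Fusion | SSL-B | 0.2011 | 0.3578 | 0.3202 | 0.0618 | 0.3662 | 0.5620 |
| YOLO11 (Large) | Background-Context Fusion | ST | 0.2994 | 0.4710 | 0.4617 | 0.1787 | 0.5104 | 0.6142 |
| YOLO11 (Large) | Background-Context Fusion | Ensemble | 0.2765 | 0.4569 | 0.4315 | 0.1446 | 0.4922 | 0.6263 |
| YOLO11 (Large) | Background-Context Fusion | SAM3 | 0.3183 | 0.4900 | 0.5061 | 0.2181 | 0.5343 | 0.5922 |
| YOLO11 (Nano) | Base | None | 0.1537 | 0.2763 | 0.3409 | 0.0431 | 0.2628 | 0.4672 |
| YOLO11 (Nano) | Standard Fine-tuning | SSL-B | 0.1069 | 0.1969 | 0.1875 | 0.0119 | 0.1703 | 0.3941 |
| YOLO11 (Nano) | Standard Fine-tuning | ST | 0.1781 | 0.3438 | 0.3486 | 0.0634 | 0.3475 | 0.4947 |
| YOLO11 (Nano) | Standard Fine-tuning | Ensemble | 0.1925 | 0.4141 | 0.3888 | 0.0820 | 0.3998 | 0.5615 |
| YOLO11 (Nano) | Standard Fine-tuning | SAM3 | 0.2066 | 0.4284 | 0.4337 | 0.1211 | 0.4151 | 0.5253 |
| YOLO11 (Nano) | Background-Context Fusion | SSL-B | 0.1111 | 0.2045 | 0.1934 | 0.0118 | 0.1801 | 0.4131 |
| YOLO11 (Nano) | Background-Context Fusion | ST | 0.1870 | 0.3431 | 0.3492 | 0.0728 | 0.3576 | 0.4980 |
| YOLO11 (Nano) | Background-Context Fusion | Ensemble | 0.1977 | 0.4026 | 0.3810 | 0.0852 | 0.3990 | 0.5557 |
| YOLO11 (Nano) | Background-Context Fusion | SAM3 | 0.2173 | 0.4167 | 0.4331 | 0.1212 | 0.4170 | 0.5391 |

Table 4.4: Comparing Background-Context Fusion (BF) against Standard Fine-tuning (SF) using the Wilcoxon signed-rank test on AP (Weighted), computed per scene. Positive median differences indicate BF outperformed SF.

| Variant | Labeling Strategy | N Scenes | Median Difference | p-value | Significance |
|---|---|---|---|---|---|
| RF-Medium | Ensemble | 100 | +0.0107 | $4.49 \times 10^{-8}$ | *** |
| RF-Medium | SAM3 | 100 | +0.0133 | $3.13 \times 10^{-6}$ | *** |
| RF-Medium | SSL-B | 100 | +0.0089 | $9.19 \times 10^{-6}$ | *** |
| RF-Medium | ST | 100 | +0.0133 | $1.52 \times 10^{-7}$ | *** |
| RF-Nano | Ensemble | 100 | +0.0077 | $1.15 \times 10^{-5}$ | *** |
| RF-Nano | SAM3 | 100 | +0.0130 | $4.41 \times 10^{-5}$ | *** |
| RF-Nano | SSL-B | 100 | +0.0049 | $2.85 \times 10^{-2}$ | * |
| RF-Nano | ST | 100 | +0.0106 | $1.20 \times 10^{-5}$ | *** |
| Yolo11-L | Ensemble | 100 | −0.0043 | $6.60 \times 10^{-3}$ | ** |
| Yolo11-L | SAM3 | 100 | −0.0020 | $3.98 \times 10^{-2}$ | * |
| Yolo11-L | SSL-B | 100 | +0.0001 | $8.92 \times 10^{-1}$ | ns |
| Yolo11-L | ST | 100 | −0.0019 | $4.36 \times 10^{-2}$ | * |
| Yolo11-N | Ensemble | 100 | −0.0000 | $5.57 \times 10^{-1}$ | ns |
| Yolo11-N | SAM3 | 100 | +0.0038 | $3.32 \times 10^{-1}$ | ns |
| Yolo11-N | SSL-B | 100 | +0.0019 | $1.43 \times 10^{-3}$ | ** |
| Yolo11-N | ST | 100 | +0.0071 | $5.29 \times 10^{-3}$ | ** |

Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05, ns = not significant.

4.4 Seasonality changes

To investigate how adaptation results vary under changing environmental and operational conditions, this section presents the results obtained when models trained on a summer scene are evaluated on the same scene under winter conditions. As shown in Table 4.5, the models score higher on the Winter scene than on the Summer scene, and this holds even for the base models.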
Further, I observe that all configurations that improve AP_weighted on the Summer scene fail to retain the same margin over the Base model when evaluated on the Winter scene. In other words, a substantial fraction of the gains achieved in-domain (Summer) does not transfer out-of-domain (Winter). For RF-DETR Medium, only the SF variant with SAM3 remains clearly better than the Base model in Winter, whereas all other variants lose their advantage. In contrast, RF-DETR Nano stands out: all adaptations except the SSL-B variants show positive ∆ values relative to the Base in both Summer and Winter, indicating robust generalisation across scene conditions. Among the YOLO11 models, only the YOLO11-Nano variants using SAM3 outperform their Base in Winter, and the improvements in this case are modest relative to the gains achieved in the Summer setting.

Figure 4.3: Qualitative examples of two scenes where Background-Context Fusion outperforms Standard Fine-tuning. Each row shows the original image (left) and the corresponding object mask image used by the fusion pipeline (right). (a) Scene 058, image. (b) Scene 058, object mask. (c) Scene 156, image. (d) Scene 156, object mask.

Figure 4.4: Qualitative examples of two scenes where Standard Fine-tuning performs better than Background-Context Fusion. Each row shows the original image (left) and the corresponding object mask image (right). (a) Scene 090, image. (b) Scene 090, object mask. (c) Scene 125, image. (d) Scene 125, object mask.

Given that the summer scene is "harder" according to the base models' performance, and that it also contains more objects from the person class, I use a relative performance metric,

$$\Delta\% = \frac{AP - AP_{\text{Base}}}{AP_{\text{Base}}} \times 100,$$

as a more appropriate basis for comparison.
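As a worked illustration of this metric, the sketch below recomputes the relative change for one configuration, RF (Medium) with Standard Fine-tuning and SAM3 labels, using the AP_weighted values listed in Table 4.5. The function and variable names are illustrative only; `gap` corresponds to the Winter − Summer difference given in the final column of that table.

```python
def relative_change(ap, ap_base):
    """Relative change of an adapted model versus its base model, in percent."""
    return (ap - ap_base) / ap_base * 100.0


# AP_weighted for RF (Medium), Standard Fine-tuning, SAM3 labels (Table 4.5).
summer = relative_change(0.5374, 0.5061)  # adapted vs. base on the Summer scene
winter = relative_change(0.6726, 0.6601)  # adapted vs. base on the Winter scene
gap = winter - summer                     # Winter minus Summer

print(f"{summer:+.2f}% {winter:+.2f}% {gap:+.2f}")
# prints: +6.18% +1.89% -4.29
```

A negative gap means that the adapted model keeps a smaller relative advantage over its base model in winter than it had in summer.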
Thus, given the ∆% for summer and the ∆% for winter, I subtract the summer value from the winter value; a positive result indicates that the relative gain in winter was larger than in summer. Across all models and configurations, every model that surpassed the base model in the summer setting performs comparatively worse in the winter setting (in relative terms): as reported in the final, boldfaced column, they all show negative values.

Table 4.5: Summer → Winter generalisation using AP_weighted. All models are trained on Summer data only and evaluated on both Summer and Winter scenes. For each architecture, the Base model is the unadapted reference. The relative change is reported as $\Delta\% = \frac{AP - AP_{\text{Base}}}{AP_{\text{Base}}} \times 100$ (computed per season). The final column is the Winter − Summer difference in relative change (percentage points).

| Model | Model Adaptation | Labeling Strategy | Summer AP | %∆ vs base (Summer) | Winter AP | %∆ vs base (Winter) | ∆%(W−S) |
|---|---|---|---|---|---|---|---|
| RF (Medium) | Base | Reference | 0.5061 | +0.00% | 0.6601 | +0.00% | +0.00 |
| RF (Medium) | Standard Fine-tuning | SSL-B | 0.5018 | −0.85% | 0.6389 | −3.21% | −2.36 |
| RF (Medium) | Standard Fine-tuning | ST | 0.5273 | +4.19% | 0.6599 | −0.03% | −4.22 |
| RF (Medium) | Standard Fine-tuning | Ensemble | 0.5100 | +0.77% | 0.6226 | −5.68% | −6.45 |
| RF (Medium) | Standard Fine-tuning | SAM3 | 0.5374 | +6.18% | 0.6726 | +1.89% | −4.29 |
| RF (Medium) | Background-Context Fusion | SSL-B | 0.5150 | +1.76% | 0.6183 | −6.33% | −8.09 |
| RF (Medium) | Background-Context Fusion | ST | 0.5567 | +10.00% | 0.6385 | −3.27% | −13.27 |
| RF (Medium) | Background-Context Fusion | Ensemble | 0.5274 | +4.21% | 0.6053 | −8.30% | −12.51 |
| RF (Medium) | Background-Context Fusion | SAM3 | 0.5702 | +12.67% | 0.6512 | −1.35% | −14.01 |
| RF (Nano) | Base | Reference | 0.4504 | +0.00% | 0.5754 | +0.00% | +0.00 |
| RF (Nano) | Standard Fine-tuning | SSL-B | 0.4130 | −8.30% | 0.5364 | −6.78% | +1.53 |
| RF (Nano) | Standard Fine-tuning | ST | 0.5021 | +11.48% | 0.6238 | +8.41% | −3.07 |
| RF (Nano) | Standard Fine-tuning | Ensemble | 0.5010 | +11.23% | 0.6072 | +5.53% | −5.71 |
| RF (Nano) | Standard Fine-tuning | SAM3 | 0.5291 | +17.47% | 0.6564 | +14.08% | −3.40 |
| RF (Nano) | Background-Context Fusion | SSL-B | 0.4186 | −7.06% | 0.5239 | −8.95% | −1.89 |
| RF (Nano) | Background-Context Fusion | ST | 0.5217 | +15.83% | 0.5962 | +3.61% | −12.22 |
| RF (Nano) | Background-Context Fusion | Ensemble | 0.5123 | +13.74% | 0.5885 | +2.28% | −11.47 |
| RF (Nano) | Background-Context Fusion | SAM3 | 0.5522 | +22.60% | 0.6228 | +8.24% | −14.36 |
| YOLO11 (Large) | Base | Reference | 0.5271 | +0.00% | 0.6760 | +0.00% | +0.00 |
| YOLO11 (Large) | Standard Fine-tuning | SSL-B | 0.4394 | −16.64% | 0.5919 | −12.44% | +4.20 |
| YOLO11 (Large) | Standard Fine-tuning | ST | 0.5711 | +8.35% | 0.6490 | −3.99% | −12.34 |
| YOLO11 (Large) | Standard Fine-tuning | Ensemble | 0.5192 | −1.50% | 0.6065 | −10.28% | −8.78 |
| YOLO11 (Large) | Standard Fine-tuning | SAM3 | 0.5909 | +12.10% | 0.6689 | −1.05% | −13.15 |
| YOLO11 (Large) | Background-Context Fusion | SSL-B | 0.4270 | −18.99% | 0.5572 | −17.57% | +1.42 |
| YOLO11 (Large) | Background-Context Fusion | ST | 0.5784 | +9.73% | 0.6263 | −7.35% | −17.08 |
| YOLO11 (Large) | Background-Context Fusion | Ensemble | 0.5146 | −2.37% | 0.5836 | −13.67% | −11.30 |
| YOLO11 (Large) | Background-Context Fusion | SAM3 | 0.5980 | +13.45% | 0.6412 | −5.15% | −18.60 |
| YOLO11 (Nano) | Base | Reference | 0.2993 | +0.00% | 0.4531 | +0.00% | +0.00 |
| YOLO11 (Nano) | Standard Fine-tuning | SSL-B | 0.2314 | −22.69% | 0.3330 | −26.51% | −3.82 |
| YOLO11 (Nano) | Standard Fine-tuning | ST | 0.3409 | +13.90% | 0.4360 | −3.77% | −17.67 |
| YOLO11 (Nano) | Standard Fine-tuning | Ensemble | 0.3888 | +29.90% | 0.4343 | −4.15% | −34.05 |
| YOLO11 (Nano) | Standard Fine-tuning | SAM3 | 0.4099 | +36.95% | 0.4641 | +2.43% | −34.53 |
| YOLO11 (Nano) | Background-Context Fusion | SSL-B | 0.2274 | −24.02% | 0.3134 | −30.83% | −6.81 |
| YOLO11 (Nano) | Background-Context Fusion | ST | 0.3448 | +15.20% | 0.4390 | −3.11% | −18.31 |
| YOLO11 (Nano) | Background-Context Fusion | Ensemble | 0.3836 | +28.17% | 0.4349 | −4.02% | −32.18 |
| YOLO11 (Nano) | Background-Context Fusion | SAM3 | 0.4116 | +37.52% | 0.4571 | +0.88% | −36.64 |

5 Conclusion and Discussion

5.1 Discussion

This section presents a discussion aimed at addressing the research questions. Specifically, I examine how YOLO11 (CNN) and RF-DETR (Transformer) compare in adaptation capability within static scenes, to what extent self-supervised scene adaptation can enable smaller models compared to larger ones, and whether a potential on-device strategy for self-supervised learning can achieve performance parity with more resource-heavy methods. I also assess whether the integration of background extraction and feature fusion in a fixed-camera environment provides a performance improvement for the chosen models, and quantify how large the accuracy drop is under seasonal and weather shifts for an adapted model.

5.1.1 Adaptation Capabilities

This section covers RQ1: How do YOLO11 (CNN) and RF-DETR (Transformer) compare in adaptation capability within static scenes?

The RF-DETR models consistently exhibit superior performance across the 100Scenes dataset, along with faster and more stable convergence than YOLO11. A comparable convergence behaviour for RF-DETR was reported by Sapkota et al. [17], as discussed in Section 2.6.
In their work, RF-DETR exhibited rapid and stable convergence despite employing an unfrozen backbone, in contrast to the frozen-backbone configuration adopted in the present study. This suggests that the lower share of trainable parameters in the RF-DETR architecture (22.6% and 29.2%) compared to the YOLO11 models (47.3% and 49.3%) is not the crucial factor for faster convergence.

The specific allocation and organization of parameters within the architecture likely play a pivotal role in determining its capacity for generalisation. RF-DETR relies on a "heavy backbone" design, utilizing the massive pre-trained DINOv2 backbone [18]. In this configuration, the vast majority of the model's capacity resides in the frozen backbone, which encapsulates robust general-purpose semantic representations learned from vast amounts of data. In contrast, the YOLO11 architectures examined are designed with a heavier neck and head: just under 50% of their parameters are located outside the backbone and are therefore trainable, compared to 22.6% (Nano) and 29.2% (Medium) for RF-DETR. While this allows the model to adapt more specifically to the dataset, it increases the risk of overfitting.

This expectation is reflected in the results. As presented in Table 4.2 (performance delta vs. base), the YOLO11 models are more negatively impacted by inadequate labeling strategies, resulting in a more pronounced degradation in performance. At the same time, when comparing model variants of corresponding size, the relative performance gains achieved under enhanced training conditions are more pronounced for the YOLO11 architectures than for their RF-DETR counterparts. For example, when high-quality annotations generated by SAM3 were employed, YOLO11-Large exhibited a greater relative performance improvement over its corresponding base model than RF-DETR Medium did over its own base model.
Thus, if trained with high-quality labels and appropriate optimization strategies on a given scene, YOLO11 models may offer greater potential performance gains, albeit with an increased risk of overfitting even with a frozen backbone. This is further highlighted in the seasonality results for Scene 001. The relative change in performance is generally worse for the YOLO11 configurations than for the RF-DETR configurations. For instance, RF-DETR Medium with SAM3 labels and standard fine-tuning was 4.29 percentage points worse in the winter scene, while YOLO11-Large with the same configuration showed a degradation roughly three times as large, at 13.15 percentage points. This further supports the conclusion that the RF-DETR architecture is more robust, as it achieves better results both across Scenes100 and under the seasonality change.

Convergence smoothness is further governed by the loss landscape defined by the label assignment strategy. While YOLO11's heuristic assignment induces gradient variance through local "many-to-one" ambiguity, RF-DETR uses global bipartite matching and incorporates Group-DETR [52], described in Section 2.4.2, which provides dense gradient flow through auxiliary query groups and effectively stabilizes the training dynamics. However, this comes with some minor computational cost.

The observed differences in generalization performance can also be attributed to the distinct architectural properties of CNNs and Transformer models, in particular to the mechanisms by which they extract, represent, and selectively attend to features. YOLO11 is still largely shaped by the CNN inductive bias of locality [35]: it builds features by stacking convolutional layers, where each layer aggregates information from a local neighborhood. Global context is therefore reached only indirectly, by gradually expanding the effective receptive field across many layers (Section 2.3.1).
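The gradual growth of the receptive field can be illustrated with the standard recurrence for the theoretical receptive field of stacked convolutions: each layer adds (kernel − 1) times the accumulated stride of the layers before it. The sketch below is a minimal example under that assumption. The layer list is a hypothetical stack and not YOLO11's actual backbone, and the effective receptive field of a trained network is typically smaller than this theoretical bound.

```python
def receptive_field(layers):
    """Theoretical receptive field after each layer of a plain convolution stack.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1  # one input pixel; adjacent outputs start 1 input pixel apart
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the accumulated stride
        jump *= stride
        sizes.append(rf)
    return sizes


# Hypothetical stack: four 3x3 convolutions, the second one with stride 2.
print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 1)]))
# prints: [3, 5, 9, 13]
```

Even after four layers, each output position in this toy stack covers only a 13 × 13 input window, which is why global context in a purely convolutional design requires considerable depth or downsampling.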
Although YOLO11 has evolved (Section 2.3.2) and even includes attention components, it still relies heavily on locality. The model has three more or less decoupled feature maps, on which it relies for its final predictions. This decoupling could, in one sense, make it more robust, but it may miss fine overlapping patterns.

In contrast, the RF-DETR architecture uses attention throughout and relies on it as the primary mechanism for information aggregation, effectively focusing on the more important "areas". Its design includes efficient attention variants such as Deformable Attention and Mixed-Query Selection (Section 2.4.2) [14], [16], [48]. Instead of depending on fixed, local feature extraction, Transformers prioritize information based on relative importance: they can directly focus on the most relevant regions and suppress less relevant background. Moreover, in the decoder, queries act like detection slots, which are derived from the same feature representation and operate in the same embedding space. This means that they effectively compete for representation capacity, which can encourage a more consistent and globally coordinated interpretation of the scene. In theory, this global coordination and attention focus should make the model less sensitive to local divergence and therefore more robust when conditions shift.

However, it is also important to note that RF-DETR entails higher hardware requirements, both during training and inference, in terms of memory consumption and the need for hardware capable of extensive parallelization to achieve the inference speeds reported in prior work [64]. Thus, one should not choose RF-DETR solely based on performance: YOLO11 still leads in parameter efficiency, although RF-DETR Nano is very close to YOLO11-Large in this respect.
RF-DETR's superior stability and generalization arise from the combined effects of its heavily pretrained backbone, efficient attention mechanisms, and globally consistent bipartite-matching loss. These design elements allow RF-DETR to function as a generalist detector that remains robust under label noise and domain shift, at the cost of higher hardware requirements. By contrast, YOLO11 architectures place a larger share of parameters in the neck and head, making them highly adaptable and capable of strong peak performance on well-labeled, scene-specific datasets in this configuration. However, this same flexibility also makes YOLO11 more sensitive to imperfect supervision and more prone to overfitting.

5.1.2 Smaller models compared to larger

This section focuses on RQ2, to further understand how a smaller model can compete with a larger one: To what extent can self-supervised scene adaptation enable smaller models compared to larger ones?

The results suggest a relationship between model size and the efficacy of self-supervised adaptation in the YOLO11 models, while the pattern is less clear in the RF-DETR models. As evidenced in Table 4.2, using the ST labeling strategy, YOLO11-Nano increases by as much as 6.22 percentage points over the Base in AP_weighted, while YOLO11-Large yields an increase of at most 2.17 percentage points. In the RF-DETR models there is no such pattern, as the Medium model achieves 3.77 and the Nano 3.61 percentage points. However, for the heavier methods, Ensemble and SAM3, which deploy larger models, there is a general pattern across the architectures regarding model size: the smaller variants exhibit much greater increases in performance than the larger variants. YOLO11-Nano using SAM3 increases by up to 13.45 percentage points, while YOLO11-Large in the same configuration yields an increase of 4.94 percentage points.
RF-DETR models show the same pattern: RF-DETR-Nano using SAM3 increases by up to 6.46%, while RF-DETR-Medium with the same configuration yields a 3.48% increase. This shows that the smaller models have more performance to gain, which allows YOLO11-Nano (Best: 0.3750 AP) to approach the performance tier of the unadapted YOLO11-Large (Base: 0.4227 AP). Strictly speaking, these methods are not purely self-supervised but involve knowledge distillation, as the target models are trained on labels produced by larger models.

Additionally, as reported in the scene-specific performance analysis of the adaptation strategy in Section 4.3, YOLO11-Nano with background-context fusion (BF) achieved a weighted average precision (APweighted) of 0.5123 when combined with SAM3 and 0.4676 when trained using purely self-supervised ST, whereas the YOLO11-Large base model achieved 0.5202. These results demonstrate that a model with approximately 10× fewer parameters can come within 0.01 APweighted of a non-specialized, substantially larger model. They further suggest that a compact model, when appropriately configured and trained, can be competitive with larger general-purpose models on highly specialized tasks.

Moreover, the current methodology of self-supervised learning (referring to labeling strategies SSL-B and ST) employs only a single round of adaptation: the base model first generates pseudo-labels, and the target model is then trained on these labels. Since the adapted model now substantially outperforms the base model used to produce the initial labels (in the ST setting), an additional iteration of adaptation, in which the improved model is used to regenerate labels, could plausibly yield further performance gains.
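The multi-round adaptation suggested above can be sketched as a simple loop. This is a minimal illustration rather than the procedure used in this thesis: the names `iterative_adaptation`, `generate_labels` and `train` are hypothetical, and restarting each round from the base weights (instead of continuing from the previously adapted model) is an assumption. The toy stand-ins only record which model produced which labels; they do not train anything.

```python
from typing import Callable, List, Sequence, TypeVar

M = TypeVar("M")  # model type
F = TypeVar("F")  # frame type
P = TypeVar("P")  # pseudo-label type


def iterative_adaptation(
    base_model: M,
    frames: Sequence[F],
    generate_labels: Callable[[M, Sequence[F]], P],
    train: Callable[[M, Sequence[F], P], M],
    rounds: int = 2,
) -> List[M]:
    """Run several pseudo-labeling rounds and return the adapted model of each round."""
    labeler = base_model
    adapted_models: List[M] = []
    for _ in range(rounds):
        # The current best model labels the unlabeled scene footage.
        pseudo_labels = generate_labels(labeler, frames)
        # Assumption: every round fine-tunes from the base weights.
        adapted = train(base_model, frames, pseudo_labels)
        adapted_models.append(adapted)
        # The adapted model becomes the labeler for the next round.
        labeler = adapted
    return adapted_models


# Toy stand-ins (hypothetical) that only trace the label provenance.
def toy_generate_labels(model: str, frames: Sequence[str]) -> str:
    return f"labels[{model}]"


def toy_train(init: str, frames: Sequence[str], labels: str) -> str:
    return f"{init}+{labels}"


models = iterative_adaptation(
    "base", ["frame0", "frame1"], toy_generate_labels, toy_train, rounds=2
)
```

With `rounds=1` the loop reduces to the single-round setting described above, where the base model labels the data once and the target model is trained on those labels.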
5.1.3 SAHI + ByteTrack yields performance increase at top level

This section covers RQ3:

Can a potential on-device strategy for self-supervised learning achieve performance in parity with more resource-heavy methods?

The results clearly demonstrate that the proposed SAHI + ByteTrack (ST) strategy is highly effective, often competing with or surpassing computationally heavier methods like the Ensemble approach. The impact is slightly more pronounced in the larger models compared to the heavier methods. This is likely due to their clos