Self-Supervised Fixed-Scene Adaptation for Object-Detection in Real-Time Surveillance: A Comparative Study of YOLO11 and RF-DETR
Type
Master's Thesis
Abstract
This thesis investigates self-supervised fixed-scene adaptation for real-time object detectors in an edge-computing surveillance context. While modern object detectors achieve strong results on general-purpose benchmarks, deployment in static camera scenes introduces distinct challenges: domain shift to a specific viewpoint, limited availability of scene-specific labels, and stringent on-device compute and memory budgets. At the same time, the stationary background of surveillance footage provides exploitable structure, as do the temporal dependencies between video frames. This study conducts a comparative analysis of two state-of-the-art object detection architectures: the Transformer-dominant RF-DETR and the convolutional neural network (CNN)-dominant YOLO11. The thesis employs the 100Scenes dataset to represent a broad range of surveillance environments. Experimental results demonstrate that RF-DETR consistently achieves higher accuracy, smoother convergence, and greater robustness than YOLO11, albeit with higher hardware demands. In contrast, YOLO11 variants (with a frozen backbone) leverage the larger trainable capacity of the neck and head to achieve high scene-specific adaptability. While this yields significant gains when labels are of high quality, it also increases sensitivity to imperfect pseudo-labels and the risk of overfitting.

Furthermore, by systematically varying model scales, adaptation strategies, and environmental conditions, the experimental design yields more than 3400 distinct runs. First, the work examines the extent to which smaller, specialized models can approach the performance of substantially larger models; the experimental results show that a small specialized model can compete with larger general models. Second, the study evaluates a proposed on-device self-supervised labeling strategy that integrates SAHI with a bidirectional implementation of ByteTrack. This strategy provides reliable performance gains across all architectures and configurations by recovering hard negatives, specifically small, occluded, and low-confidence instances. Third, the study investigates background-context fusion (BF). BF consistently improves performance for RF-DETR, but proves inconsistent for YOLO11 and fails to increase robustness against seasonality, suggesting that it induces background-dependent overfitting. Finally, the study shows that all models trained on a summer scene exhibit a decrease in relative performance, compared with non-adapted models, under a seasonal domain shift to a winter scene.
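As a loose illustration of the frozen-backbone configuration mentioned in the abstract, the sketch below shows how a COCO-pretrained YOLO11 model could be fine-tuned on a fixed scene with its backbone frozen, so that only the neck and head adapt, using the Ultralytics training API. The dataset path, model scale, and hyperparameter values are illustrative assumptions, not the configuration used in the thesis.

# Hypothetical sketch: scene-specific fine-tuning of YOLO11 with a frozen
# backbone, so only the neck and head adapt to the fixed viewpoint.
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # COCO-pretrained starting point (assumed scale)

model.train(
    data="scene_pseudo_labels.yaml",  # assumed scene-specific dataset config
    epochs=50,
    imgsz=640,
    freeze=11,     # first 11 modules correspond to the YOLO11 backbone in the Ultralytics model yaml
    lr0=0.001,     # lower learning rate when adapting to a narrow domain
    patience=10,   # early stopping guards against scene overfitting
)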
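The on-device self-supervised labeling strategy is summarized in the abstract only at a high level. Below is a minimal sketch of one way such a pipeline could look, assuming SAHI for sliced inference and the ByteTrack implementation from the supervision library run once forward and once backward over the frame sequence; the directory layout, thresholds, and the rule for merging the two passes are illustrative assumptions rather than the thesis implementation.

# Hypothetical sketch: SAHI sliced inference plus forward/backward ByteTrack
# to generate scene-specific pseudo-labels from a fixed camera.
from pathlib import Path

import cv2
import numpy as np
import supervision as sv
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

FRAME_DIR = Path("scene_frames")  # assumed folder of temporally ordered frames

detector = AutoDetectionModel.from_pretrained(
    model_type="ultralytics", model_path="yolo11n.pt",
    confidence_threshold=0.1,  # keep low-confidence boxes; tracking confirms them
    device="cuda:0",
)

def detect(frame: np.ndarray) -> sv.Detections:
    """Sliced (SAHI) inference so small instances survive downscaling."""
    result = get_sliced_prediction(
        frame, detector,
        slice_height=640, slice_width=640,
        overlap_height_ratio=0.2, overlap_width_ratio=0.2,
    )
    preds = result.object_prediction_list
    if not preds:
        return sv.Detections.empty()
    return sv.Detections(
        xyxy=np.array([p.bbox.to_xyxy() for p in preds], dtype=float),
        confidence=np.array([p.score.value for p in preds]),
        class_id=np.array([p.category.id for p in preds]),
    )

frames = sorted(FRAME_DIR.glob("*.jpg"))
per_frame = [detect(cv2.imread(str(f))) for f in frames]

def track(dets_seq):
    """Associate detections over time; keeps boxes confirmed by a track."""
    tracker = sv.ByteTrack()
    return [tracker.update_with_detections(d) for d in dets_seq]

forward = track(per_frame)
backward = list(reversed(track(list(reversed(per_frame)))))

# Keep a box as a pseudo-label if a track confirms it in either temporal
# direction; this is how small, occluded, or low-confidence instances that
# single-frame thresholding would drop can be recovered.
pseudo_labels = [
    sv.Detections.merge([fwd, bwd]).with_nms(threshold=0.7)
    for fwd, bwd in zip(forward, backward)
]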
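The abstract does not spell out the background-context fusion (BF) mechanism, so the following sketch only illustrates one plausible form of it: estimating a static background with a temporal median over frames from the stationary camera and stacking it with the current frame as an early-fusion input. The file names and the fusion scheme are assumptions made for illustration.

# Hypothetical sketch: pairing each frame with a static background estimate.
import glob

import cv2
import numpy as np

frame_paths = sorted(glob.glob("scene_frames/*.jpg"))

# A temporal median over a sample of frames approximates the empty background
# of a stationary camera, since moving objects are filtered out by the median.
sample = np.stack([cv2.imread(p) for p in frame_paths[::25]])
background = np.median(sample, axis=0).astype(np.uint8)

def fuse(frame: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Early fusion: stack frame and background into a 6-channel input.

    A detector consuming this tensor needs its first convolution widened
    from 3 to 6 input channels, e.g. by duplicating pretrained weights.
    """
    return np.concatenate([frame, bg], axis=-1)

fused = fuse(cv2.imread(frame_paths[0]), background)
print(fused.shape)  # (H, W, 6)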
Subject/keywords
Computer Vision, YOLO11, DETR, RF-DETR, Object detection, LWDETR, YOLO, Self-Supervised Learning
