Self-Supervised Fixed-Scene Adaptation for Object-Detection in Real-Time Surveillance: A Comparative Study of YOLO11 and RF-DETR
| dc.contributor.author | Justad, Jacob | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för elektroteknik | sv |
| dc.contributor.examiner | Alvén, Jennifer | |
| dc.contributor.supervisor | Alvén, Jennifer | |
| dc.contributor.supervisor | Ljungqvist, Martin | |
| dc.contributor.supervisor | Moberg, Tiger | |
| dc.date.accessioned | 2026-02-10T11:57:52Z | |
| dc.date.issued | 2026 | |
| dc.date.submitted | | |
| dc.description.abstract | This thesis investigates self-supervised fixed-scene adaptation for real-time object detectors in an edge-computing surveillance context. While modern object detectors achieve strong results on general-purpose benchmarks, deployment in static camera scenes introduces distinct challenges: domain shift to a specific viewpoint, limited availability of scene-specific labels, and stringent compute and memory budgets on-device. At the same time, the stationary background of surveillance footage and the temporal dependencies between video frames provide exploitable structure. This study conducts a comparative analysis of two state-of-the-art object detection architectures: the Transformer-dominant RF-DETR and the convolutional neural network (CNN)-dominant YOLO11. The thesis employs the 100Scenes dataset to represent a broad range of surveillance environments. Experimental results demonstrate that RF-DETR consistently achieves higher accuracy, smoother convergence, and greater robustness than YOLO11, albeit with higher hardware demands. In contrast, YOLO11 variants (with a frozen backbone) leverage the larger trainable capacity of the neck and head to enable high scene-specific adaptability. While this yields significant gains under high-quality labeling, it also increases sensitivity to imperfect pseudo-labels and the risk of overfitting. By systematically varying model scales, adaptation strategies, and environmental conditions, the experimental design yields more than 3400 distinct runs. First, the work examines the extent to which smaller, specialized models can match the performance of substantially larger models; the experimental results show that a small specialized model can compete with larger general models. Second, the study evaluates a proposed on-device self-supervised labeling strategy that integrates SAHI with a bidirectional implementation of ByteTrack. This strategy provides reliable performance gains across all architectures and configurations by recovering hard negatives, specifically small, occluded, and low-confidence instances. Third, the study investigates background-context fusion (BF), which consistently improves performance for RF-DETR but proves inconsistent for YOLO11 and fails to increase robustness against seasonality, suggesting it induces background-dependent overfitting. Finally, the study shows that all models trained on a summer scene exhibit a decrease in performance relative to the non-adapted models under a seasonal domain shift to a winter scene. | |
| dc.identifier.coursecode | EENX30 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310971 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Computer Vision | |
| dc.subject | YOLO11 | |
| dc.subject | DETR | |
| dc.subject | RF-DETR | |
| dc.subject | Object detection | |
| dc.subject | LWDETR | |
| dc.subject | YOLO | |
| dc.subject | Self-Supervised Learning | |
| dc.title | Self-Supervised Fixed-Scene Adaptation for Object-Detection in Real-Time Surveillance: A Comparative Study of YOLO11 and RF-DETR | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Data science and AI (MPDSC), MSc |
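
The abstract refers to an on-device self-supervised labeling strategy that combines SAHI sliced inference with a bidirectional ByteTrack pass. The sketch below illustrates that idea in Python under stated assumptions: the checkpoint name `yolo11n.pt`, the clip `fixed_scene.mp4`, the slice sizes and confidence threshold, and the use of the `sahi`, `supervision`, and OpenCV packages are illustrative choices rather than the thesis implementation, and only a single forward tracking pass is shown (the thesis additionally runs the tracker backward over the clip).

```python
# Hypothetical sketch of scene-specific pseudo-labeling: tiled (SAHI) inference
# on frames from a fixed camera, followed by a ByteTrack association pass so
# that only temporally consistent detections are kept as pseudo-labels.
import cv2
import numpy as np
import supervision as sv
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Any Ultralytics-compatible checkpoint could stand in for the deployed detector.
detector = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",        # older SAHI releases name this type "yolov8"
    model_path="yolo11n.pt",         # assumed checkpoint
    confidence_threshold=0.25,
    device="cuda:0",
)

tracker = sv.ByteTrack()             # forward pass only in this sketch
pseudo_labels = []                   # (frame_idx, track_id, xyxy, class_id)

cap = cv2.VideoCapture("fixed_scene.mp4")   # assumed clip from the static camera
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Sliced inference helps recover the small, distant objects typical of
    # fixed surveillance viewpoints.
    result = get_sliced_prediction(
        rgb, detector,
        slice_height=640, slice_width=640,
        overlap_height_ratio=0.2, overlap_width_ratio=0.2,
    )
    preds = result.object_prediction_list

    if preds:
        detections = sv.Detections(
            xyxy=np.array([p.bbox.to_xyxy() for p in preds], dtype=float),
            confidence=np.array([p.score.value for p in preds], dtype=float),
            class_id=np.array([p.category.id for p in preds], dtype=int),
        )
    else:
        detections = sv.Detections.empty()

    # Keep only detections that ByteTrack can associate into a track.
    tracked = tracker.update_with_detections(detections)
    for box, class_id, track_id in zip(tracked.xyxy, tracked.class_id, tracked.tracker_id):
        pseudo_labels.append((frame_idx, int(track_id), box.tolist(), int(class_id)))

    frame_idx += 1

cap.release()
```

In a bidirectional variant, the same clip would also be processed in reverse frame order with a second tracker instance and the two sets of track-confirmed detections merged, which is one way to recover instances that only become trackable late in their trajectory.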
