Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs

Albarham, Mohammad; Berggren, Linus

Download

Master_Thesis_chalmers_overleaf_final_version.pdf (20.83 MB)

Published

2026

Authors

Albarham, Mohammad

Berggren, Linus

Type

Master's Thesis

Program

Systems, control and mechatronics (MPSYS), MSc
Engineering mathematics and computational science (MPENM), MSc

Abstract

This thesis studies scalable semantic retrieval of multi-view autonomous driving videos recorded with synchronized multi-camera systems. Using a subset of 3,502 multi-view driving videos from NVIDIA’s PhysicalAI Autonomous Vehicles dataset, the work investigates text-to-multi-view video retrieval with natural-language queries and learned cross-modal embeddings. Because the dataset does not contain paired textual descriptions, the proposed pipeline generates pseudo ground-truth captions from sampled video frames using a pretrained vision-language model and extracts frozen text and visual embeddings with jina-clip-v2. These generated captions provide the supervision for training and evaluation without manual annotation. Lightweight trainable alignment heads then map text and video representations into a shared embedding space, while multi-view representations are constructed through view-level and temporal aggregation. The results quantify the difference between single-view (front camera) and multi-view (front and surrounding cameras) retrieval representations. Extending the representation from a single front-facing camera to six synchronized camera views increases Recall@5 from 71% to 85% and Recall@10 from 81% to 93%, indicating improved separation of ground-truth matches in the learned embedding space. In contrast, the LLM-based semantic similarity score changes only marginally, from 77 to 79, suggesting that both retrieval settings often retrieve semantically related driving scenarios. The experiments further show that temporal sampling can be reduced considerably with only minor changes in retrieval performance, indicating substantial redundancy in densely sampled driving video. Since the supervision is derived from automatically generated captions rather than human-annotated descriptions, the retrieval results should be interpreted with that limitation in mind.
Overall, the thesis demonstrates that frozen pretrained encoders combined with lightweight fusion and alignment modules provide a computationally scalable approach for semantic retrieval of large-scale autonomous driving multi-view videos.
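
As a rough illustration of the two quantities the abstract relies on, the following sketch shows one way view-level and temporal aggregation and Recall@K could be computed. It is not the thesis implementation: mean pooling over frames and views is an assumed aggregation, the trainable alignment heads are omitted, and the function names (`aggregate_multiview`, `recall_at_k`) and array layout are hypothetical. It assumes query i has exactly one ground-truth video, video i.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def aggregate_multiview(frame_emb):
    """Pool frame embeddings into one vector per video.

    frame_emb has shape (videos, views, frames, dim).
    Assumption: plain mean pooling, first over time, then over camera views.
    """
    per_view = frame_emb.mean(axis=2)   # temporal aggregation
    per_video = per_view.mean(axis=1)   # view-level aggregation
    return l2_normalize(per_video)


def recall_at_k(text_emb, video_emb, k):
    """Fraction of text queries whose paired video ranks in the top k.

    text_emb and video_emb have shape (n, dim); row i of each is a pair.
    """
    sims = l2_normalize(text_emb) @ l2_normalize(video_emb).T
    ground_truth = np.diag(sims)
    # Rank = number of videos scoring strictly higher than the paired one.
    ranks = (sims > ground_truth[:, None]).sum(axis=1)
    return float((ranks < k).mean())
```

Under these assumptions, a single-view baseline corresponds to passing an array with one view (shape `(videos, 1, frames, dim)`), and a sparser temporal sampling corresponds to slicing the frame axis, for example `frame_emb[:, :, ::4]`, before aggregation.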

Subject/keywords

Multi-view video retrieval, Text-to-multi-view video retrieval, Multicamera driving logs, Temporal video sampling

URI

https://hdl.handle.net/20.500.12380/311847

Collections

Master's Theses
