Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs

Albarham, Mohammad; Berggren, Linus

dc.contributor.author	Albarham, Mohammad
dc.contributor.author	Berggren, Linus
dc.contributor.department	Chalmers tekniska högskola / Institutionen för elektroteknik	sv
dc.contributor.examiner	Hammarstrand, Lars
dc.contributor.supervisor	Maoz, Ori
dc.contributor.supervisor	Altetmek, Altug
dc.date.accessioned	2026-07-03T13:56:57Z
dc.date.issued	2026
dc.date.submitted
dc.description.abstract	This thesis studies scalable semantic retrieval of autonomous driving multi-view videos recorded with synchronized multi-camera systems. Using a subset of 3,502 multi-view driving videos from NVIDIA’s PhysicalAI Autonomous Vehicles dataset, the work investigates text-to-multi-view video retrieval using natural-language queries and learned cross-modal embeddings. Because the dataset does not contain paired textual descriptions, the proposed pipeline generates pseudo ground-truth captions from sampled video frames using a pretrained vision-language model and extracts frozen text and visual embeddings with jina-clip-v2. These generated captions provide the supervision used for training and evaluation without requiring manual annotation. Lightweight trainable alignment heads are then used to map text and video representations into a shared embedding space, while multi-view representations are constructed through view-level and temporal aggregation. The results quantify the difference between single-view (front camera) and multi-view (front and surrounding cameras) retrieval representations. Extending the representation from a single front-facing camera to six synchronized camera views increases Recall@5 from 71% to 85% and Recall@10 from 81% to 93%, indicating improved separation of ground-truth matches in the learned embedding space. In contrast, the LLM-based semantic similarity score changes only marginally, from 77 to 79, suggesting that both retrieval settings often retrieve semantically related driving scenarios. The experiments further show that temporal sampling can be reduced considerably with only minor changes in retrieval performance, indicating substantial redundancy in densely sampled driving video. Since the supervision is derived from automatically generated captions rather than human-annotated descriptions, the retrieval results should be interpreted with that limitation in mind.
Overall, the thesis demonstrates that frozen pretrained encoders combined with lightweight fusion and alignment modules provide a computationally scalable approach for semantic retrieval of large-scale autonomous driving multi-view videos.
dc.identifier.coursecode	EENX30
dc.identifier.uri	https://hdl.handle.net/20.500.12380/311847
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Multi-view video retrieval
dc.subject	Text-to-multi-view video retrieval
dc.subject	Multicamera driving logs
dc.subject	Temporal video sampling
dc.title	Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	Systems, control and mechatronics (MPSYS), MSc
local.programme	Engineering mathematics and computational science (MPENM), MSc
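
The Recall@K figures quoted in the abstract can be illustrated with a small sketch. It is not the thesis code: mean pooling over frames and views is an assumed stand-in for the "view-level and temporal aggregation" the abstract mentions, the trainable alignment heads are omitted, and the function names and toy two-dimensional embeddings are hypothetical.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so dot products equal cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_pool(vectors):
    # Element-wise mean of equally sized vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def video_embedding(frames_by_view):
    # frames_by_view: one list of frame embeddings per camera view.
    # Assumption: mean over time within each view, then mean over views.
    view_vecs = [mean_pool(frames) for frames in frames_by_view]
    return l2_normalize(mean_pool(view_vecs))

def recall_at_k(text_embs, video_embs, k):
    # text_embs[i] is the query whose ground-truth match is video_embs[i].
    # video_embs are expected to be unit-length (as video_embedding returns).
    hits = 0
    for i, query in enumerate(text_embs):
        q = l2_normalize(query)
        scores = [sum(a * b for a, b in zip(q, v)) for v in video_embs]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)

# Toy data: three videos, each with two views and one or two frames per view.
videos = [
    video_embedding([[[1.0, 0.0], [0.9, 0.1]], [[0.8, 0.2]]]),
    video_embedding([[[0.0, 1.0]], [[0.1, 0.9]]]),
    video_embedding([[[0.7, 0.7]], [[0.6, 0.8]]]),
]
queries = [[1.0, 0.1], [0.0, 1.0], [1.0, 0.0]]

recall_1 = recall_at_k(queries, videos, 1)
recall_2 = recall_at_k(queries, videos, 2)
```

In this toy setup the third query ranks its ground-truth video second, so it counts as a miss at K=1 and a hit at K=2. Because the thesis evaluates against generated captions, a "miss" in this sense may still be a semantically related scenario, which is consistent with the small change in the LLM-based similarity score reported above.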

Original bundle

Name: Master_Thesis_chalmers_overleaf_final_version.pdf
Size: 20.83 MB
Format: Adobe Portable Document Format
