Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs

dc.contributor.author: Albarham, Mohammad
dc.contributor.author: Berggren, Linus
dc.contributor.department: Chalmers University of Technology / Department of Electrical Engineering
dc.contributor.examiner: Hammarstrand, Lars
dc.contributor.supervisor: Maoz, Ori
dc.contributor.supervisor: Altetmek, Altug
dc.date.accessioned: 2026-07-03T13:56:57Z
dc.date.issued: 2026
dc.date.submitted:
dc.description.abstract: This thesis studies scalable semantic retrieval of multi-view autonomous driving videos recorded with synchronized multi-camera systems. Using a subset of 3,502 multi-view driving videos from NVIDIA’s PhysicalAI Autonomous Vehicles dataset, the work investigates text-to-multi-view video retrieval with natural-language queries and learned cross-modal embeddings. Because the dataset contains no paired textual descriptions, the proposed pipeline generates pseudo ground-truth captions from sampled video frames using a pretrained vision-language model, and extracts frozen text and visual embeddings with jina-clip-v2. The generated captions supply the supervision for both training and evaluation, so no manual annotation is required. Lightweight trainable alignment heads then map the text and video representations into a shared embedding space, while multi-view representations are constructed through view-level and temporal aggregation. The results quantify the difference between single-view (front camera) and multi-view (front and surrounding cameras) retrieval representations. Extending the representation from a single front-facing camera to six synchronized camera views increases Recall@5 from 71% to 85% and Recall@10 from 81% to 93%, indicating improved separation of ground-truth matches in the learned embedding space. In contrast, the LLM-based semantic similarity score changes only marginally, from 77 to 79, suggesting that both settings often retrieve semantically related driving scenarios even when the exact ground-truth video is missed. The experiments further show that temporal sampling can be reduced considerably with only minor changes in retrieval performance, indicating substantial redundancy in densely sampled driving video. Since the supervision is derived from automatically generated captions rather than human-annotated descriptions, the retrieval results should be interpreted with that limitation in mind.
Overall, the thesis demonstrates that frozen pretrained encoders combined with lightweight fusion and alignment modules provide a computationally scalable approach to semantic retrieval of large-scale multi-view autonomous driving videos.
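The Recall@K figures quoted in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the function name `recall_at_k`, the toy similarity matrix, and the assumption that each text query has exactly one ground-truth video (query i matches video i) are all introduced here for illustration only.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth video is among the top-k results.

    Assumption (not from the thesis): similarity[i, j] scores text query i
    against video j, and the ground-truth match for query i is video i.
    """
    # Rank videos for each query from highest to lowest score, keep the first k.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # Column vector of ground-truth indices, broadcast against each row of top_k.
    truth = np.arange(similarity.shape[0])[:, None]
    # A query counts as a hit if its ground-truth index appears anywhere in its top-k.
    return float((top_k == truth).any(axis=1).mean())


# Hypothetical 3-query, 3-video similarity matrix (made-up values).
toy_similarity = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
])

print(recall_at_k(toy_similarity, 1))
print(recall_at_k(toy_similarity, 2))
```

In this toy case only the first query ranks its own video first, while all three queries have their own video within the top two, which is the kind of gap between small and large K that the reported Recall@5 and Recall@10 values reflect.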
dc.identifier.coursecode: EENX30
dc.identifier.uri: https://hdl.handle.net/20.500.12380/311847
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Multi-view video retrieval
dc.subject: Text-to-multi-view video retrieval
dc.subject: Multicamera driving logs
dc.subject: Temporal video sampling
dc.title: Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Systems, control and mechatronics (MPSYS), MSc
local.programme: Engineering mathematics and computational science (MPENM), MSc

Download

Original bundle

Name: Master_Thesis_chalmers_overleaf_final_version.pdf
Size: 20.83 MB
Format: Adobe Portable Document Format