Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs
| dc.contributor.author | Albarham, Mohammad | |
| dc.contributor.author | Berggren, Linus | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för elektroteknik | sv |
| dc.contributor.examiner | Hammarstrand, Lars | |
| dc.contributor.supervisor | Maoz, Ori | |
| dc.contributor.supervisor | Altetmek, Altug | |
| dc.date.accessioned | 2026-07-03T13:56:57Z | |
| dc.date.issued | 2026 | |
| dc.date.submitted | | |
| dc.description.abstract | This thesis studies scalable semantic retrieval of autonomous driving multi-view videos recorded with synchronized multi-camera systems. Using a subset of 3,502 multi-view driving videos from NVIDIA’s PhysicalAI Autonomous Vehicles dataset, the work investigates text-to-multi-view video retrieval using natural-language queries and learned cross-modal embeddings. Because the dataset does not contain paired textual descriptions, the proposed pipeline generates pseudo ground-truth captions from sampled video frames using a pretrained vision-language model and extracts frozen text and visual embeddings with jina-clip-v2. These generated captions provide the supervision used for training and evaluation without requiring manual annotation. Lightweight trainable alignment heads are then used to map text and video representations into a shared embedding space, while multi-view representations are constructed through view-level and temporal aggregation. The results quantify the difference between single-view (front camera) and multi-view (front and surrounding cameras) retrieval representations. Extending the representation from a single front-facing camera to six synchronized camera views increases Recall@5 from 71% to 85% and Recall@10 from 81% to 93%, indicating improved separation of ground-truth matches in the learned embedding space. In contrast, the LLM-based semantic similarity score changes only marginally, from 77 to 79, suggesting that both retrieval settings often retrieve semantically related driving scenarios. The experiments further show that temporal sampling can be reduced considerably with only minor changes in retrieval performance, indicating substantial redundancy in densely sampled driving video. Since the supervision is derived from automatically generated captions rather than human-annotated descriptions, the retrieval results should be interpreted with that limitation in mind. Overall, the thesis demonstrates that frozen pretrained encoders combined with lightweight fusion and alignment modules provide a computationally scalable approach for semantic retrieval of large-scale autonomous driving multi-view videos. | |
| dc.identifier.coursecode | EENX30 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12380/311847 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Multi-view video retrieval | |
| dc.subject | Text-to-multi-view video retrieval | |
| dc.subject | Multicamera driving logs | |
| dc.subject | Temporal video sampling | |
| dc.title | Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Systems, control and mechatronics (MPSYS), MSc | |
| local.programme | Engineering mathematics and computational science (MPENM), MSc |
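
The abstract reports retrieval quality as Recall@5 and Recall@10. As a minimal sketch of how such a metric is commonly computed, and not the thesis's own evaluation code, the function below assumes a square similarity matrix in which text query i has exactly one ground-truth video, video i. The function name `recall_at_k` and the toy scores are hypothetical and chosen only for illustration.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth match for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank video indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 example with made-up scores (not results from the thesis).
toy = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked 1st
    [0.8, 0.3, 0.5],  # query 1: correct video ranked 3rd
    [0.2, 0.6, 0.4],  # query 2: correct video ranked 2nd
]
r1 = recall_at_k(toy, 1)
r2 = recall_at_k(toy, 2)
```

Under this definition, a higher Recall@K means the ground-truth video appears among the top K results for more queries, which is how the reported gains from single-view to six-view representations can be read.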
