Scalable Vision–Language Machine Learning for Semantic Retrieval of Autonomous Driving Logs
Published
Authors
Type
Master's Thesis
Abstract
This thesis studies scalable semantic retrieval of autonomous driving multi-view
videos recorded with synchronized multi-camera systems. Using a subset of 3,502
multi-view driving videos from NVIDIA’s PhysicalAI Autonomous Vehicles dataset,
the work investigates text-to-multi-view video retrieval using natural-language queries
and learned cross-modal embeddings. Because the dataset does not contain paired
textual descriptions, the proposed pipeline generates pseudo ground-truth captions
from sampled video frames using a pretrained vision-language model and extracts
frozen text and visual embeddings with jina-clip-v2. These generated captions provide
the supervision used for training and evaluation without requiring manual annotation.
Lightweight trainable alignment heads are then used to map text and
video representations into a shared embedding space, while multi-view representations
are constructed through view-level and temporal aggregation. The results
quantify the difference between single-view (front camera) and multi-view (front
and surrounding cameras) retrieval representations. Extending the representation
from a single front-facing camera to six synchronized camera views increases Recall@5
from 71% to 85% and Recall@10 from 81% to 93%, indicating improved
separation of ground-truth matches in the learned embedding space. In contrast,
the LLM-based semantic similarity score changes only marginally, from 77 to 79,
suggesting that both retrieval settings often retrieve semantically related driving
scenarios. The experiments further show that temporal sampling can be reduced
considerably with only minor changes in retrieval performance, indicating substantial
redundancy in densely sampled driving video. Since the supervision is derived
from automatically generated captions rather than human-annotated descriptions,
the retrieval results should be interpreted with that limitation in mind. Overall,
the thesis demonstrates that frozen pretrained encoders combined with lightweight
fusion and alignment modules provide a computationally scalable approach for semantic
retrieval of large-scale autonomous driving multi-view videos.
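
The retrieval setup summarized above can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that temporal and view-level aggregation are both plain mean pooling, that each text query has exactly one ground-truth video at the same index, and that similarity is cosine similarity. The abstract does not confirm these choices. The function names (`aggregate_video`, `recall_at_k`) and the two-dimensional toy embeddings are hypothetical, and the trainable alignment heads are omitted.

```python
import math


def mean_pool(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def aggregate_video(frames_by_view):
    """Build one video embedding from frames_by_view[view][frame].

    Assumption: mean pooling over time within each camera view,
    then mean pooling across views.
    """
    per_view = [mean_pool(frames) for frames in frames_by_view]  # temporal step
    return l2_normalize(mean_pool(per_view))                     # view-level step


def recall_at_k(text_embs, video_embs, k):
    """Fraction of queries whose paired video (same index) is in the top k."""
    videos = [l2_normalize(v) for v in video_embs]
    hits = 0
    for i, text in enumerate(text_embs):
        query = l2_normalize(text)
        scores = [sum(a * b for a, b in zip(query, v)) for v in videos]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(text_embs)


# Toy data: 3 videos, each with 2 camera views and 2 sampled frames per view.
videos = [
    aggregate_video([[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1], [1.0, 0.0]]]),
    aggregate_video([[[0.0, 1.0], [0.1, 0.9]], [[0.2, 0.8], [0.0, 1.0]]]),
    aggregate_video([[[0.7, 0.7], [0.6, 0.8]], [[0.8, 0.6], [0.7, 0.7]]]),
]
# The third query is deliberately closer to video 0 than to its own video 2.
texts = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]

print(recall_at_k(texts, videos, 1))
print(recall_at_k(texts, videos, 2))
```

In this toy case the third query ranks its paired video second, so it counts as a miss at k = 1 and a hit at k = 2. This is the sense in which Recall@5 and Recall@10 measure how well ground-truth matches are separated from other videos in the embedding space.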
Description
Subject/keywords
Multi-view video retrieval, Text-to-multi-view video retrieval, Multicamera driving logs, Temporal video sampling
