Self-supervised pre-training with Vision Foundation Models on 3D point clouds
dc.contributor.author | Carlén, Anni | |
dc.contributor.author | Nässlander, Pauline | |
dc.contributor.department | Chalmers tekniska högskola / Institutionen för elektroteknik | sv |
dc.contributor.examiner | Hammarstrand, Lars | |
dc.contributor.supervisor | Fatemi, Maryam | |
dc.contributor.supervisor | Verbeke, Willem | |
dc.contributor.supervisor | Rafidashti, Mahan | |
dc.date.accessioned | 2025-06-19T07:25:59Z | |
dc.date.issued | 2025 | |
dc.date.submitted | ||
dc.description.abstract | The ability to accurately identify objects and their spatial properties in 3D is a critical task within the field of autonomous driving. To achieve this goal, deep neural networks identify objects and determine their corresponding 3D position, extent and orientation. For deep models to learn general properties of the environment and be able to identify objects, large amounts of labelled data are required. Since labelled data is expensive and time-consuming to obtain, it is crucial to reduce reliance on it without compromising performance. Therefore, this thesis investigates two pre-training tasks for 3D object detection in LiDAR point clouds and highlights their effectiveness in one-shot and few-shot detection. The pre-training methods in this project utilise two different web-scale foundation models, namely SAM 2 and DINOv2. As both models are trained on 2D images, we lift their outputs to 3D to generate pseudo-labels for training a 3D object detection model in a self-supervised fashion. Predicted features from DINOv2 are used in the first pre-training task, while segmentation masks from SAM 2 are used in the second. As a result, the performance of each method reflects both the choice of pseudo-labelling method and the quality of the underlying foundation models. The results show that, after pre-training on either SAM 2 or DINOv2 pseudo-labels, the model effectively learns to identify objects from a single annotated sample. When only a few samples are available, the pre-trained and fine-tuned models significantly outperform the baseline, which is trained fully supervised from scratch. This emphasises how these pre-training methods enable the model to learn and recognise less common object categories, even when they are represented by only a few instances in the dataset. | |
dc.identifier.coursecode | EENX30 | |
dc.identifier.uri | http://hdl.handle.net/20.500.12380/309559 | |
dc.language.iso | eng | |
dc.setspec.uppsok | Technology | |
dc.subject | object detection | |
dc.subject | LiDAR | |
dc.subject | foundation models | |
dc.subject | self-supervised | |
dc.subject | pre-training | |
dc.subject | point cloud | |
dc.subject | thesis | |
dc.title | Self-supervised pre-training with Vision Foundation Models on 3D point clouds | |
dc.type.degree | Examensarbete för masterexamen | sv |
dc.type.degree | Master's Thesis | en |
dc.type.uppsok | H | |
local.programme | Biomedical engineering (MPBME), MSc | |
local.programme | Systems, control and mechatronics (MPSYS), MSc |