Self-supervised pre-training with Vision Foundation Models on 3D point clouds
dc.contributor.author | Carlén, Anni | |
dc.contributor.author | Nässlander, Pauline | |
dc.contributor.department | Chalmers tekniska högskola / Institutionen för elektroteknik | sv |
dc.contributor.examiner | Hammarstrand, Lars | |
dc.contributor.supervisor | Fatemi, Maryam | |
dc.contributor.supervisor | Verbeke, Willem | |
dc.contributor.supervisor | Rafidashti, Mahan | |
dc.date.accessioned | 2025-06-19T07:25:59Z | |
dc.date.issued | 2025 | |
dc.date.submitted | ||
dc.description.abstract | The ability to accurately identify objects and their spatial properties in 3D is a critical task within the field of autonomous driving. To achieve this goal, deep neural networks identify objects and determine their corresponding 3D position, extent and orientation. For deep models to learn general properties of the environment and be able to identify objects, large amounts of labelled data are required. Since labelled data is expensive and time-consuming to obtain, it is crucial to reduce reliance on it without compromising performance. Therefore, this thesis investigates two pre-training tasks for 3D object detection in LiDAR point clouds and highlights their effectiveness in one-shot and few-shot detection. The pre-training methods in this project utilise two different web-scale foundation models, namely SAM 2 and DINOv2. As both models are trained on 2D images, we lift their outputs to 3D to generate pseudo-labels for training a 3D object detection model in a self-supervised fashion. Predicted features from DINOv2 are used in the first pre-training task, while segmentation masks from SAM 2 are used in the second. As a result, the performance of each method reflects both the choice of pseudo-labelling method and the quality of the underlying foundation models. The results show that, after pre-training on either SAM 2 or DINOv2 pseudo-labels, the model effectively learns to identify objects from a single annotated sample. When only a few samples are available, the pre-trained and fine-tuned models significantly outperform the baseline, which is trained fully supervised from scratch. This emphasises how these pre-training methods enable the model to learn and recognise less common object categories, even when they are represented by only a few instances in the dataset. | |
dc.identifier.coursecode | EENX30 | |
dc.identifier.uri | http://hdl.handle.net/20.500.12380/309559 | |
dc.language.iso | eng | |
dc.setspec.uppsok | Technology | |
dc.subject | object detection | |
dc.subject | LiDAR | |
dc.subject | foundation models | |
dc.subject | self-supervised | |
dc.subject | pre-training | |
dc.subject | point cloud | |
dc.subject | thesis | |
dc.title | Self-supervised pre-training with Vision Foundation Models on 3D point clouds | |
dc.type.degree | Examensarbete för masterexamen | sv |
dc.type.degree | Master's Thesis | en |
dc.type.uppsok | H | |
local.programme | Biomedical engineering (MPBME), MSc | |
local.programme | Systems, control and mechatronics (MPSYS), MSc |