Self-Supervised Vision Transformers for Steel Surface Defect Detection - An Empirical Investigation of Fine-Tuning Strategies and Data Efficiency

dc.contributor.author: Hemmingsson, Nora
dc.contributor.author: Olsson, Alexander
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Wedelin, Dag
dc.contributor.supervisor: Selpi
dc.date.accessioned: 2026-06-30T06:48:39Z
dc.date.issued: 2026
dc.date.submitted:
dc.description.abstract: Industrial defect classification is a critical task in quality control, where accurate detection of surface defects is essential for ensuring product reliability. However, obtaining large amounts of labeled data is often costly and time-consuming, motivating the use of self-supervised learning (SSL) to leverage unlabeled data. This thesis investigates the effectiveness of SSL for defect classification using Vision Transformer-based methods, with a focus on Masked Autoencoders (MAE) and Distillation with No Labels (DINO). The study evaluates the performance of these methods under different conditions, including fine-tuning vs. linear probing, ImageNet initialization vs. training from scratch, and varying amounts of labeled data. A comprehensive experimental setup is used to assess both overall performance and label efficiency, and the results are compared to a supervised You Only Look Once (YOLO) baseline. The results show that both MAE and DINO learn transferable representations that achieve high classification performance after fine-tuning. DINO consistently outperforms MAE, indicating that distillation-based approaches produce more discriminative features for this task. Fine-tuning significantly improves performance compared to linear probing, highlighting the importance of adapting the full model to the downstream task. Additionally, ImageNet initialization provides a strong advantage over training from scratch, demonstrating the importance of large-scale pretraining. Under limited labeled data conditions during the fine-tuning stage, both methods remain effective, achieving competitive performance even at low label fractions such as 1% or 5%. However, performance improves steadily as more labeled data becomes available. Analysis of the results reveals that most misclassifications occur when non-defective samples are classified into defect classes. Confusion between the defect classes themselves is minimal, which indicates that the key challenge is avoiding false positives, i.e., identifying non-defective samples as defective. Overall, the findings demonstrate that self-supervised learning is a viable and scalable approach for industrial defect classification, particularly in scenarios where labeled data is scarce. While fully supervised methods still achieve the highest performance when sufficient labeled data is available, SSL provides a strong alternative with reduced reliance on annotations.
dc.identifier.coursecode: DATX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/311639
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Self-supervised learning (SSL), industrial defect classification, computer vision, Vision Transformer, MAE, DINO, label efficiency, transfer learning.
dc.title: Self-Supervised Vision Transformers for Steel Surface Defect Detection - An Empirical Investigation of Fine-Tuning Strategies and Data Efficiency
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Complex adaptive systems (MPCAS), MSc
local.programme: Data science and AI (MPDSC), MSc

Download

Original bundle

Name: CSE 26-30.pdf
Size: 9.86 MB
Format: Adobe Portable Document Format
