On the Role of Attention Maps in Visual Transformers - A Clustering Perspective

dc.contributor.author: Anttila Ryderup, Erik
dc.contributor.author: Hsu, Yu-Ping
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Mercado Oropeza, Rocío
dc.contributor.supervisor: Panahi, Ashkan
dc.date.accessioned: 2025-09-11T12:18:09Z
dc.date.issued: 2024
dc.date.submitted:
dc.description.abstract: This thesis explores whether attention maps from a single-layer Vision Transformer model exhibit a clustering structure. The discovery of such a structure would imply that tokens with similar semantic information tend to cluster together. We extract an attention map from a one-layer Vision Transformer model, which uses image patches as input data. Values below a set threshold are pruned from the attention map, and a graph is created from the remaining data. Various community detection algorithms are then applied to this graph and evaluated based on modularity. We visualize the patches belonging to each cluster and compare classification performance when removing salient and non-salient clusters. The method reveals a significant clustering structure, which was discovered by the Louvain algorithm. Tokens cluster with other tokens that carry similar semantic information, effectively separating parts of the image. The classification logit values for specific images improve when tokens belonging to unimportant clusters are removed, while removing tokens from important clusters negatively impacts performance. This work suggests that a Vision Transformer's attention layer clusters tokens based on their semantic information, but further research is needed to confirm the generality of this result.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310470
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Vision Transformer
dc.subject: attention layer
dc.subject: attention map
dc.subject: clustering
dc.subject: interpretability
dc.subject: Louvain
dc.subject: visualization
dc.title: On the Role of Attention Maps in Visual Transformers - A Clustering Perspective
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Data science and AI (MPDSC), MSc
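
The pipeline outlined in the abstract (prune the attention map at a threshold, build a graph from what remains, detect communities, score them by modularity) can be illustrated with a small sketch. This is not the thesis code: the 4-token attention matrix, the threshold of 0.1, the symmetrization step, and all function names are assumptions made for illustration. To stay dependency-free, connected components stand in for the Louvain algorithm, so the sketch shows the pruning, graph-building, and modularity-scoring steps rather than Louvain itself.

```python
from collections import defaultdict


def prune_attention(attn, threshold):
    """Zero out attention weights that fall below the threshold."""
    return [[w if w >= threshold else 0.0 for w in row] for row in attn]


def build_graph(attn):
    """Symmetrize the pruned map into an undirected weighted graph.

    Assumption: edge weight = attn[i][j] + attn[j][i]; self-attention
    (the diagonal) is ignored.
    """
    n = len(attn)
    graph = defaultdict(dict)
    for i in range(n):
        for j in range(i + 1, n):
            w = attn[i][j] + attn[j][i]
            if w > 0:
                graph[i][j] = w
                graph[j][i] = w
    return graph


def connected_components(graph, n):
    """Stand-in for Louvain: group tokens still linked after pruning."""
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for nb in graph.get(node, {}):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(members))
    return clusters


def modularity(graph, clusters):
    """Newman modularity: Q = sum over clusters of L_c/m - (d_c/(2m))^2."""
    m = sum(w for nbrs in graph.values() for w in nbrs.values()) / 2.0
    if m == 0:
        return 0.0
    q = 0.0
    for members in clusters:
        inside = set(members)
        # Total weight of edges with both endpoints inside the cluster
        l_c = sum(w for u in members
                  for v, w in graph.get(u, {}).items() if v in inside) / 2.0
        # Total weighted degree of the cluster's nodes
        d_c = sum(w for u in members for w in graph.get(u, {}).values())
        q += l_c / m - (d_c / (2.0 * m)) ** 2
    return q


# Hypothetical attention map: tokens 0-1 and 2-3 attend mostly to each other.
attn = [
    [0.50, 0.40, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]
pruned = prune_attention(attn, threshold=0.1)
graph = build_graph(pruned)
clusters = connected_components(graph, len(attn))
q = modularity(graph, clusters)
print(clusters, round(q, 3))
```

On this toy input the weak cross-block weights are pruned, leaving the two token pairs as separate clusters with a modularity of about 0.5. A real run would replace the stand-in with an actual Louvain implementation, which can split a connected graph into several communities.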

Download

Original bundle

Name: CSE 24-155 EAR YH.pdf
Size: 15.54 MB
Format: Adobe Portable Document Format