On the Role of Attention Maps in Visual Transformers - A Clustering Perspective
Type
Master's Thesis
Abstract
This thesis explores a novel research question: whether attention maps from a single-layer Vision Transformer model exhibit a clustering structure. The discovery of such a structure would imply that tokens with similar semantic information tend to cluster together. We extract an attention map from a one-layer Vision Transformer model, which uses image patches as input data. Values below a set threshold are pruned from the attention map, and a graph is created from the remaining entries. Various community detection algorithms are then applied to this graph and evaluated based on modularity. We visualize the patches belonging to each cluster and compare classification performance when removing salient and non-salient clusters. The method reveals a significant clustering structure, detected by the Louvain algorithm. Tokens cluster with other tokens carrying similar semantic information, effectively separating parts of the image. For specific images, the classification logit values improve when tokens belonging to unimportant clusters are removed, whereas removing tokens from important clusters negatively impacts performance. This work suggests that a Vision Transformer's attention layer clusters tokens based on their semantic information, but further research is needed to confirm the generality of this result.
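The pipeline described in the abstract (prune the attention map, build a graph, detect communities, score by modularity) can be illustrated with a minimal sketch. This is not the thesis code. The toy 6-token attention map, the threshold of 0.1, the averaging step used to make the map symmetric, and the function names `attention_to_graph` and `cluster_tokens` are all assumptions made for illustration. The sketch relies on NumPy and on the Louvain implementation shipped with NetworkX.

```python
import numpy as np
import networkx as nx


def attention_to_graph(attn, threshold):
    """Turn a square attention map into an undirected weighted graph.

    Entries below `threshold` are pruned first, as in the abstract.
    Averaging the pruned map with its transpose and dropping self-loops
    are assumptions of this sketch, not details taken from the thesis.
    """
    attn = np.asarray(attn, dtype=float)
    pruned = np.where(attn >= threshold, attn, 0.0)
    sym = (pruned + pruned.T) / 2.0

    graph = nx.Graph()
    n_tokens = sym.shape[0]
    graph.add_nodes_from(range(n_tokens))
    for i in range(n_tokens):
        for j in range(i + 1, n_tokens):
            if sym[i, j] > 0.0:
                graph.add_edge(i, j, weight=float(sym[i, j]))
    return graph


def cluster_tokens(attn, threshold, seed=0):
    """Run Louvain on the pruned attention graph and score the partition."""
    graph = attention_to_graph(attn, threshold)
    parts = nx.community.louvain_communities(graph, weight="weight", seed=seed)
    score = nx.community.modularity(graph, parts, weight="weight")
    return [sorted(p) for p in parts], score


# Toy attention map (hypothetical data): tokens 0-2 attend mostly to each
# other, and so do tokens 3-5. Each row sums to 1, like a softmax output.
strong, weak = 0.3, 0.1 / 3
toy_attn = np.full((6, 6), weak)
toy_attn[:3, :3] = strong
toy_attn[3:, 3:] = strong

communities, score = cluster_tokens(toy_attn, threshold=0.1)
print(sorted(communities), round(score, 3))
```

On this toy input the weak cross-group entries fall below the threshold, so the graph splits into two groups of three tokens. With a real attention map, the threshold and the symmetrization step would need to be chosen with care, since both change which edges survive and therefore which communities are found.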
Subject/Keywords
Vision Transformer, attention layer, attention map, clustering, interpretability, Louvain, visualization
