On the Role of Attention Maps in Visual Transformers - A Clustering Perspective

Type

Master's Thesis

Abstract

This thesis explores whether attention maps from a single-layer Vision Transformer model exhibit a clustering structure. The discovery of such a structure would imply that tokens with similar semantic information tend to cluster together. We extract an attention map from a one-layer Vision Transformer model, which uses image patches as input data. Values below a set threshold are pruned from the attention map, and a graph is created from the remaining data. Various community detection algorithms are then applied to this graph and evaluated based on modularity. We visualize the patches belonging to each cluster and compare classification performance when removing salient and non-salient clusters. The method reveals a significant clustering structure, which was discovered by the Louvain algorithm. Tokens cluster with other tokens that carry similar semantic information, effectively separating parts of the image. The classification logit values for specific images improve when tokens belonging to unimportant clusters are removed, whereas removing tokens from important clusters negatively impacts performance. This work suggests that a Vision Transformer's attention layer clusters tokens based on their semantic information, but further research is needed to confirm the generality of this result.
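The pruning and scoring steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `prune_attention` and `modularity`, the toy 4-token attention map, the threshold of 0.1, and the choice to symmetrize the graph by averaging the two attention directions are all assumptions made for illustration. The sketch only scores candidate partitions by modularity; it does not implement Louvain itself.

```python
from itertools import combinations


def prune_attention(attn, threshold):
    """Turn a square attention map into an undirected weighted edge dict.

    Entries below `threshold` are pruned. The i->j and j->i weights are
    averaged to obtain a symmetric graph (an assumption of this sketch).
    Self-attention (the diagonal) is ignored.
    """
    edges = {}
    for i, j in combinations(range(len(attn)), 2):
        w_ij = attn[i][j] if attn[i][j] >= threshold else 0.0
        w_ji = attn[j][i] if attn[j][i] >= threshold else 0.0
        weight = (w_ij + w_ji) / 2
        if weight > 0:
            edges[(i, j)] = weight
    return edges


def modularity(edges, communities):
    """Newman modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ].

    L_c is the edge weight inside community c, d_c is the summed weighted
    degree of its nodes, and m is the total edge weight of the graph.
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    label = {node: c for c, nodes in enumerate(communities) for node in nodes}
    internal = [0.0] * len(communities)
    degree = [0.0] * len(communities)
    for (i, j), weight in edges.items():
        degree[label[i]] += weight
        degree[label[j]] += weight
        if label[i] == label[j]:
            internal[label[i]] += weight
    return sum(
        internal[c] / total - (degree[c] / (2 * total)) ** 2
        for c in range(len(communities))
    )


# Hypothetical attention map: tokens 0-1 attend to each other, as do 2-3.
toy_attention = [
    [0.50, 0.40, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]

graph = prune_attention(toy_attention, threshold=0.1)
q_matching = modularity(graph, [{0, 1}, {2, 3}])     # follows the blocks
q_mismatched = modularity(graph, [{0, 2}, {1, 3}])   # cuts across the blocks
print(q_matching, q_mismatched)
```

The partition that follows the block structure of the toy map scores higher than the one that cuts across it, which is the sense in which modularity is used to compare community detection results. In practice the candidate partitions would come from a community detection routine such as NetworkX's `louvain_communities`, rather than being written by hand.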

Keywords

Vision Transformer, attention layer, attention map, clustering, interpretability, Louvain, visualization
