Visualizing and Interpreting Transformer-based Vision Models
Keywords: Interpretability, Visualization, Shapley Values, Vision Transformer, Masked Autoencoder
Transformer-based vision models are increasingly popular, but most prior work on interpreting them has been limited to visualizing attention maps. We apply a Shapley-value-based method, FastSHAP, to Vision Transformers (ViT) and Masked Autoencoders (MAE), and compare the results to a classical ResNet. We find that choosing a ResNet as FastSHAP's surrogate model lets us interpret and visualize transformer-based vision models effectively. The estimated Shapley values of a ResNet and a ViT trained on CIFAR-10 differ qualitatively, even though the two models' predictions mostly agree.
Technologies: Python, PyTorch
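
To make the surrogate setup concrete, here is a minimal PyTorch sketch of the FastSHAP regression objective with a ResNet surrogate and patch-level masking on CIFAR-10-sized images. The network definitions, `NUM_PATCHES`, the zero baseline, and the masking scheme are illustrative assumptions standing in for the project's actual code; the surrogate is assumed to have been trained separately to match the ViT's outputs on masked inputs.

```python
# Sketch of the FastSHAP objective: an explainer network is trained to
# regress onto the surrogate's outputs under Shapley-kernel-sampled masks.
# All module shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

NUM_CLASSES = 10
GRID = 8                      # 8x8 grid of 4x4 patches on 32x32 images
NUM_PATCHES = GRID * GRID

# Surrogate: a ResNet assumed to be trained elsewhere to mimic the ViT's
# predictions on images with random patches masked out.
surrogate = resnet18(num_classes=NUM_CLASSES)

# Explainer: a small conv net mapping an image to per-patch Shapley value
# estimates, one GRID x GRID map per class.
explainer = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(GRID),
    nn.Conv2d(64, NUM_CLASSES, 1),
)  # output: (B, NUM_CLASSES, GRID, GRID)

def shapley_kernel_masks(batch_size, d, device):
    """Sample subsets S with p(|S| = k) proportional to (d-1)/(k(d-k)),
    then a uniform subset of that size (the Shapley kernel distribution)."""
    k = torch.arange(1, d)
    weights = (d - 1) / (k * (d - k)).float()
    sizes = k[torch.multinomial(weights, batch_size, replacement=True)]
    masks = torch.zeros(batch_size, d, device=device)
    for i, s in enumerate(sizes):
        idx = torch.randperm(d, device=device)[:s]
        masks[i, idx] = 1.0
    return masks

def masked_input(x, mask):
    """Zero out image patches where mask == 0 (zero-baseline assumption)."""
    m = mask.view(-1, 1, GRID, GRID)
    m = F.interpolate(m, size=x.shape[-2:], mode="nearest")
    return x * m

def fastshap_loss(x, y):
    """One step of the FastSHAP regression objective, with the additive
    normalization that enforces the Shapley efficiency constraint."""
    b, d = x.size(0), NUM_PATCHES
    phi = explainer(x).flatten(2)                    # (B, C, d)
    phi_y = phi[torch.arange(b), y]                  # (B, d), target class

    with torch.no_grad():                            # surrogate stays fixed
        v_full = surrogate(x).softmax(-1)[torch.arange(b), y]
        v_empty = surrogate(torch.zeros_like(x)).softmax(-1)[torch.arange(b), y]

    # Efficiency: per-sample values must sum to v(full) - v(empty).
    gap = (v_full - v_empty - phi_y.sum(-1)) / d
    phi_y = phi_y + gap.unsqueeze(-1)

    S = shapley_kernel_masks(b, d, x.device)
    with torch.no_grad():
        v_S = surrogate(masked_input(x, S)).softmax(-1)[torch.arange(b), y]

    pred = v_empty + (S * phi_y).sum(-1)             # predicted v(S)
    return F.mse_loss(pred, v_S)

# Usage: optimize `explainer` on CIFAR-10 batches (dummy data shown here).
x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, NUM_CLASSES, (4,))
loss = fastshap_loss(x, y)
loss.backward()
print(float(loss))
```

Once trained, a single forward pass of `explainer` yields per-patch Shapley estimates for every class, which is what makes the approach fast enough to visualize attributions for ViT, MAE, and ResNet side by side.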