ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Properties
authors Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun

Abstract

TODO:
- [ ] Read paper
- [ ] Add main text summary

From Early Convolutions Help Transformers See Better, where [9] refers to this paper:

We did not observe evidence that the hard locality constraint in early layers hampers the representational capacity of the network, as might be feared [9].
[...]
This perspective resonates with the findings of [9], who observe that early transformer blocks prefer to learn more local attention patterns than later blocks.
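The "soft" inductive bias referenced in the title is ConViT's gated positional self-attention (GPSA): each head blends content-based attention with a locality-favoring positional attention map via a learnable sigmoid gate, so the convolutional prior can be unlearned rather than being a hard constraint. A minimal single-head sketch (the function names, the toy distance-based positional scores, and the scalar gate are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gpsa_attention(q, k, rel_pos_scores, gate_logit):
    """Gated positional self-attention, ConViT-style (simplified sketch).

    Blends content-based attention with a positional attention map
    via a learnable gate, making locality a soft prior the model can
    discard during training, not a hard architectural constraint.
    """
    content = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # standard dot-product attention
    positional = softmax(rel_pos_scores)               # favors nearby patches
    gate = 1.0 / (1.0 + np.exp(-gate_logit))           # sigmoid gate in [0, 1]
    return (1.0 - gate) * content + gate * positional

# Toy example: 4 patches, embedding dim 8.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
# Positional scores decay with patch distance (locality prior).
dist = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :])
attn = gpsa_attention(q, k, -dist.astype(float), gate_logit=2.0)
print(attn.shape)  # (4, 4); each row is a valid attention distribution
```

With a large positive gate logit the head behaves almost like a (soft) convolution; as the gate logit goes negative it recovers plain content attention, which is how the paper lets early layers start local without hampering capacity.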

This appears to contradict How Do Vision Transformers Work?, which claims that a locality constraint is beneficial to ViTs.

I haven't fully read this paper, so the contradiction noted above might be incorrect.