ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
| Properties | |
| --- | --- |
| authors | Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun |
Abstract
TODO:
- [ ] Read paper
- [ ] Add main text summary
From *Early Convolutions Help Transformers See Better*, where [9] is this paper:

> We did not observe evidence that the hard locality constraint in early layers hampers the representational capacity of the network, as might be feared [9].
>
> [...]
>
> This perspective resonates with the findings of [9], who observe that early transformer blocks prefer to learn more local attention patterns than later blocks.
This appears to contradict *How Do Vision Transformers Work?*, which claims that a locality constraint is beneficial to ViTs.
I haven't fully read this paper, so the above contradiction might be incorrect.
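For context on the "soft" part of the title: ConViT replaces the hard locality constraint of convolutions with gated positional self-attention (GPSA), where a learned per-head gate blends standard content-based attention with a positional (locality-favoring) attention map. A minimal NumPy sketch of that gating idea, not the paper's exact implementation (function names and the shape of `pos_scores` are my assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gpsa_attention(q, k, pos_scores, lam):
    """Simplified gated positional self-attention (GPSA).

    q, k:        (n_patches, d) query/key matrices
    pos_scores:  (n_patches, n_patches) positional logits, e.g. larger
                 for nearby patches (this encodes the locality prior)
    lam:         scalar gating parameter; sigmoid(lam) is the weight
                 given to the positional map

    Locality is a *soft* bias: the gate can learn to ignore it
    (gate -> 0) rather than being hard-wired as in a convolution.
    """
    d = q.shape[-1]
    content = softmax(q @ k.T / np.sqrt(d))   # standard self-attention map
    positional = softmax(pos_scores)          # locality-biased map
    gate = 1.0 / (1.0 + np.exp(-lam))         # sigmoid gate in (0, 1)
    # convex combination: each row still sums to 1
    return (1.0 - gate) * content + gate * positional
```

In the paper the gates are initialized so the positional term dominates early in training, mimicking a convolutional stage, and the network can later "escape" locality by driving the gate toward zero.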