Early Convolutions Help Transformers See Better
| Properties | |
|---|---|
| authors | Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick |
Hypothesis
ViT's patchify stem (a single 16×16, stride-16 convolution) runs contrary to the standard early layers of CNNs (small, overlapping kernels with gradual downsampling). Maybe that's the cause of ViT's optimization brittleness (sensitivity to optimizer choice, lr/wd hyperparameters, and training length)?
Main idea
Replace the patchify convolution with a small stack of stride-2 3×3 convolutions (a "convolutional stem"), and drop one transformer block so total FLOPs/params stay roughly matched, keeping the comparison fair.
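Quick sanity check of the stem swap (a sketch in plain Python; the helper name `conv_out` is mine, not from the paper): a stack of four stride-2 3×3 convs downsamples 224×224 input to the same 14×14 token grid as the single 16×16, stride-16 patchify conv, so the transformer body sees an identically shaped sequence.

```python
def conv_out(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output spatial size of a conv layer on an n×n input."""
    return (n + 2 * padding - kernel) // stride + 1

# Patchify stem: one 16×16 conv, stride 16, no padding.
patchify = conv_out(224, kernel=16, stride=16, padding=0)

# Convolutional stem: four 3×3 convs, stride 2, padding 1.
conv_stem = 224
for _ in range(4):
    conv_stem = conv_out(conv_stem, kernel=3, stride=2, padding=1)

print(patchify, conv_stem)  # both should give a 14×14 grid
```

Both stems produce 14×14 = 196 tokens, so only the stem changes while the rest of the ViT is untouched.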

Notes for myself:
- Interesting experimentation regarding #optimizability , maybe take it into account for the Hessian analysis