Learning with Unmasked Tokens Drives Stronger Vision Learners

Properties
authors Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han
year 2024
url https://arxiv.org/abs/2310.13593

Abstract

Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIMs such as Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens for the encoder to process, with the decoder reconstructing the masked tokens to the input. However, MIM pre-trained encoders often exhibit a limited attention span, attributed to MIM's sole focus on regressing masked tokens only, which may impede the encoder's broader context learning. To tackle the limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method enables the encoder to learn from broader context supervision, allowing unmasked tokens to experience broader contexts while the decoder reconstructs masked tokens. Thus, the encoded unmasked tokens are equipped with extensive contextual information, empowering masked tokens to leverage the enhanced unmasked tokens for MIM. As a result, our simple remedy trains more discriminative representations revealed by achieving 84.2% top-1 accuracy with ViT-B on ImageNet-1K with 0.6%p gain. We attribute the success to the enhanced pre-training method, as evidenced by the singular value spectrum and attention analyses. Finally, our models achieve significant performance gains at the downstream semantic segmentation and fine-grained visual classification tasks; and on diverse robust evaluation metrics. Code is available at this https URL

Notes

Some notes on MIM as a training objective are in Masked Image Modelling. A rough sketch of the standard MAE-style objective, and of where this paper's extra supervision fits, is below.
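A minimal sketch (not the authors' code) of an MAE-style MIM setup: the encoder only sees the visible patches, the decoder reconstructs pixels, and the loss is computed on the masked patches only. Module sizes, the fixed-ratio masking, and the omission of positional embeddings are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMIM(nn.Module):
    """Toy MAE-style model: encode visible patches, decode all positions,
    regress the pixels of the masked patches only (positional embeddings omitted)."""
    def __init__(self, dim=64, patch_dim=48):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.to_pixels = nn.Linear(dim, patch_dim)

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = masked
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        # Encoder processes only the visible (unmasked) tokens, as in MAE.
        n_vis = int((~mask[0]).sum())              # assumes the same mask ratio per sample
        vis = tokens[~mask].view(B, n_vis, -1)
        enc = self.encoder(vis)
        # Re-insert mask tokens at the masked positions for the decoder.
        full = self.mask_token.expand(B, N, -1).clone()
        full[~mask] = enc.reshape(-1, enc.size(-1))
        pred = self.to_pixels(self.decoder(full))
        # Standard MIM loss: only masked positions receive supervision,
        # so the encoded unmasked tokens get no direct training signal.
        return F.mse_loss(pred[mask], patches[mask])

model = TinyMIM()
patches = torch.randn(2, 16, 48)                   # 2 images, 16 patches each
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, :12] = True                                # 75% masking, same positions for simplicity
loss = model(patches, mask)
loss.backward()
```

The paper's remedy, as I read the abstract, is to add an extra objective on the encoded unmasked tokens so they carry broader contextual information before the decoder uses them for reconstruction; the exact target the authors use is not reproduced here.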

However, MIM strategies often suffer from overly local attention, which limits how much of the image's overall context the encoder captures. For example, Liu et al. [36] showed that MAE [22], a state-of-the-art MIM method, exhibits shorter average attention distances, and the authors' attention-map analysis (Fig. 1) confirms this extremely local behavior. In other words, MAE-trained attention integrates little information across the whole image and tends to focus on specific input regions. This is presumably because MIM pre-training is devoted to predicting low-level pixel details (e.g., color or texture) rather than building an understanding of less-regional information (e.g., the input's structure or shape).
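For reference, a minimal sketch of the "average attention distance" diagnostic mentioned above: for each query patch, average the spatial distance to every key patch, weighted by attention. The 14x14 grid, 12 heads, and random attention weights are placeholders; extracting real attention maps from a trained ViT is not shown, and the CLS token is ignored.

```python
import torch

def mean_attention_distance(attn, grid_size):
    """attn: (heads, N, N) attention probabilities over N = grid_size**2 patches.
    Returns the attention-weighted mean patch distance per head (in patch units)."""
    n = grid_size
    ys, xs = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2)
    dist = torch.cdist(coords, coords)                                   # (N, N) pairwise distances
    # Expected distance per query (rows of attn sum to 1), then average over queries.
    return (attn * dist.unsqueeze(0)).sum(dim=-1).mean(dim=-1)           # (heads,)

# Example with random "attention" for a 14x14 patch grid (e.g., ViT-B/16 at 224px).
# Lower values = more local attention, which is the behavior reported for MAE.
attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)
print(mean_attention_distance(attn, grid_size=14))
```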

This may not really be an issue: How do vision transformers work? explicitly constrains ViTs to use only local attention and finds that performance improves. So maybe local attention is actually an advantage? See Are less inductive biases better or worse?.

Pasted image 20240702135103.png