Masked Image Modelling

Masked image modelling (MIM) seems to be emerging as a strong pretraining objective for vision foundation models. Right now it looks like the closest thing to an answer to the question: do vision foundation models exist?

However, intuitively it seems like a fairly weak signal, since it focuses on individual patches/pixels without much regard for semantic information. This is echoed in Learning with Unmasked Tokens Drives Stronger Vision Learners:

However, MIM strategies often encounter challenges, such as local dependency on attention to understand entire context of an image. For example, liu et al. [36] revealed that MAE [22], a state-of-the-art MIM method, exhibits shorter average attention distances. Furthermore, we observe that attention map patterns by MAE substantiate extremely local behavior (See Fig. 1) indeed. In other words, the MAE-trained attention mechanism less integrates information across the entire image pixels and tends to focus on specific input regions. This is presumably attributed to MIM-pretraining, primarily dedicated to predicting low-level pixel details (e.g., color or texture) without a comprehensive understanding of less-regional information (e.g., the input structure or shape).
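To make the "average attention distance" mentioned in the quote concrete, here is a minimal sketch of one common way to compute it: for each query patch, take the attention-weighted mean of the pixel distance to every key patch, then average over queries. This is an illustration under assumptions, not the exact procedure used in the cited papers. The function names, the row-major patch layout, the exclusion of the CLS token, and the default `patch_size=16` are all hypothetical choices made for this note.

```python
import math


def patch_coords(grid_size):
    """(row, col) of each patch in a grid_size x grid_size grid, row-major."""
    return [(r, c) for r in range(grid_size) for c in range(grid_size)]


def mean_attention_distance(attn, grid_size, patch_size=16):
    """Attention-weighted mean pixel distance between query and key patches.

    attn: N x N nested list for a single head, where N = grid_size ** 2 and
          each row holds non-negative weights summing to 1 (patch tokens
          only; the CLS token is assumed to have been removed already).
    """
    coords = patch_coords(grid_size)
    n = len(coords)
    assert len(attn) == n and all(len(row) == n for row in attn)

    total = 0.0
    for q, (qr, qc) in enumerate(coords):
        for k, (kr, kc) in enumerate(coords):
            # Euclidean distance between patch positions, converted to pixels.
            dist = math.hypot(qr - kr, qc - kc) * patch_size
            total += attn[q][k] * dist
    return total / n  # average over query patches


# Two extreme cases on a 2 x 2 patch grid (4 tokens):
n_tokens = 4
local_attn = [[1.0 if q == k else 0.0 for k in range(n_tokens)]
              for q in range(n_tokens)]          # each patch attends to itself
uniform_attn = [[1.0 / n_tokens] * n_tokens for _ in range(n_tokens)]

local_dist = mean_attention_distance(local_attn, grid_size=2, patch_size=1)
uniform_dist = mean_attention_distance(uniform_attn, grid_size=2, patch_size=1)
```

Purely local attention gives a distance of zero, while uniform attention gives the average distance between all patch pairs. A head with "shorter average attention distance" sits closer to the first case.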

Register tokens?

Vision Transformers Need Registers observes that there are no high-norm artifacts that would justify adding registers, and claims this is because the model only uses local information.
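As a rough way to check for high-norm artifacts in one's own model, the sketch below flags output tokens whose L2 norm is far above the median norm. It is only an illustration: the function names and the `ratio=5.0` threshold are hypothetical, and the outlier criterion here is an assumption of this note rather than the one used in the paper.

```python
import math
import statistics


def token_norms(tokens):
    """L2 norm of each token embedding (tokens: list of equal-length vectors)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def high_norm_tokens(tokens, ratio=5.0):
    """Indices of tokens whose norm exceeds `ratio` times the median norm.

    `ratio` is an arbitrary illustrative threshold, not a published value.
    """
    norms = token_norms(tokens)
    median_norm = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median_norm]


# Toy example: three ordinary tokens and one outlier with a much larger norm.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [30.0, 40.0]]
outliers = high_norm_tokens(toy_tokens)

# A set of tokens with identical norms has no outliers under this criterion.
flat_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
no_outliers = high_norm_tokens(flat_tokens)
```

In practice the tokens would be the patch embeddings from the last layer for a single image. An empty result across many images would be consistent with the "no high-norm artifacts" observation.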