Intuitive physics understanding emerges from self-supervised pretraining on natural videos
| Properties | |
| --- | --- |
| authors | Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun |
| year | 2025 |
| url | http://arxiv.org/abs/2502.11831 |
Abstract
We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.
Notes
The mechanism of predicting in representation space is congruent with the predictive coding hypothesis of cognitive neuroscience
The violation-of-expectation framework is used to probe for intuitive physics understanding without requiring any task-specific training or adaptation
Understanding emerges directly from the input, without hard-wiring a cascade of intermediate representations like object contours or pose estimation
By measuring the prediction error, i.e. the distance between the predicted video representations and the actual encoded video representations, at each time-step, we obtain a temporally aligned quantitative measure of the model's surprise throughout the video
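A minimal sketch of this per-time-step surprise signal (my own illustration; the `encoder`/`predictor` callables, their signatures, and the L2 distance are assumptions, not the paper's released code):

```python
import torch

def surprise_per_timestep(video, encoder, predictor, num_context):
    """Surprise = distance between predicted and actual representations, per time-step.

    video: (T, C, H, W) tensor; encoder maps it to (T, D) representations;
    predictor rolls the context representations forward to (T - num_context, D).
    Both modules are hypothetical stand-ins for the pretrained V-JEPA components.
    """
    with torch.no_grad():
        targets = encoder(video)                 # (T, D) actual encoded representations
        context = targets[:num_context]          # representations of the observed prefix
        preds = predictor(context, steps=targets.shape[0] - num_context)
        # One surprise value per predicted frame, temporally aligned with the video.
        surprise = torch.linalg.vector_norm(preds - targets[num_context:], dim=-1)
    return surprise
```

Peaks in this surprise trace should, in principle, line up with the moment the physically impossible event occurs.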
Evaluated on the dev set of IntPhys; properties probed across benchmarks:
- object permanence
- continuity
- shape consistency
- color constancy
- gravity
- support
- solidity
- inertia
- collision
VideoMAEv2 can be evaluated in the same way as V-JEPA, by predicting the future and measuring surprise via prediction error
I wonder if both models are trained with the same masking strategy
we give the model a pair of videos, asking which one of the two is impossible
The V-JEPA score uses the maximum surprise from each video, which generalizes better for single-video classification
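A sketch of the pairwise decision rule, reusing the hypothetical `surprise_per_timestep` helper above; the max-over-time pooling follows the note, everything else (names, `num_context=8`) is assumed:

```python
def classify_impossible(video_a, video_b, encoder, predictor, num_context=8):
    """Return which of the two videos is judged physically impossible.

    Each video is scored by its maximum surprise over time; the video with the
    larger score is flagged as the impossible one. For single-video classification,
    the same max-surprise score could instead be compared against a threshold.
    """
    score_a = surprise_per_timestep(video_a, encoder, predictor, num_context).max()
    score_b = surprise_per_timestep(video_b, encoder, predictor, num_context).max()
    return "a" if score_a > score_b else "b"
```

Max pooling over time keeps the score sensitive to a single, well-localized violation instead of diluting it across the whole clip.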
V-JEPA excels at properties related to the scene's content (e.g., object permanence), but struggles with categories that require knowledge of a contextualizing event (gravity and solidity in InfLevel-lab) or the modeling of precise object interactions such as collisions, possibly due to the model's framerate constraints.
Figure 2 shows that although the accuracies are high for physical violations that involve properties intrinsic to objects (except for the color property), violations implicating interactions between objects, like solidity or collision, are close to chance
While the corruption used at training time is the removal of spatio-temporal blocks, if we instead use the first C frames to predict the rest of the video, this objective turns into a measure of error for the prediction of the future.
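A sketch of that evaluation-time masking: keep the first C frames as context and mask every later time-step, so the block-masking objective becomes a future-prediction probe (the tensor layout and helper name are my assumptions):

```python
import torch

def future_prediction_mask(num_frames, height, width, num_context):
    """Boolean visibility mask over a video: True = visible context, False = to predict."""
    mask = torch.zeros(num_frames, height, width, dtype=torch.bool)
    mask[:num_context] = True   # the first C frames stay visible as context
    return mask                 # all later time-steps must be predicted in representation space
```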