
How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning

Properties
authors Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X. Pham, Chang D. Yoo, In So Kweon
year 2022
url http://arxiv.org/abs/2203.16262

Abstract

To avoid collapse in self-supervised learning (SSL), a contrastive loss is widely used but often requires a large number of negative samples. Without negative samples yet achieving competitive performance, a recent work (Chen & He, 2021) has attracted significant attention for providing a minimalist simple Siamese (SimSiam) method to avoid collapse. However, the reason for how it avoids collapse without negative samples remains not fully clear and our investigation starts by revisiting the explanatory claims in the original SimSiam. After refuting their claims, we introduce vector decomposition for analyzing the collapse based on the gradient analysis of the l2-normalized representation vector. This yields a unified perspective on how negative samples and SimSiam alleviate collapse. Such a unified perspective comes timely for understanding the recent progress in SSL.

Notes

Zotero Link

Cool blog: https://www.nowozin.net/sebastian/blog/thoughts-on-trace-estimation-in-deep-learning.html

Now I understand it better.


Here’s a concise recap of the unified theoretical framework for covariance-regularized SSL methods:

  1. Core Objective
    Every method optimizes

    $$\mathcal{L} \;=\; \underbrace{\mathcal{L}_{\rm align}(z_1, z_2)}_{\text{enforce view-invariance}} \;+\; \lambda\,\underbrace{\Omega\bigl(\Sigma\bigr)}_{\text{prevent collapse by “expanding” features}},$$

    where $\Sigma$ is the batch covariance of the $d$-dimensional embeddings.

  2. Alignment Term

    • Matches two augmented views of the same image (e.g. MSE or cosine loss).

    • Drives invariance but by itself admits the trivial solution $z \equiv \text{const}$.

  3. Regularizer $\Omega(\Sigma)$
    Encodes two essential second-moment effects (per Zhang et al., 2022):

    • De-centering: enforces non-zero variance in every dimension (so features can’t collapse to a constant).

    • De-correlation: penalizes linear correlations between dimensions (so features can’t collapse onto a lower‐dimensional subspace).

  4. VICReg vs. Coding-Rate (SimDINO)

    • Coding-rate uses the exact log-det term
      $R_\varepsilon(\Sigma) = \tfrac{1}{2}\log\det\bigl(I + \alpha\Sigma\bigr)$,
      which simultaneously maximizes variance (all eigenvalues > 0) and enforces isotropy (decorrelation).

    • VICReg approximates $-R_\varepsilon$ via its second-order Taylor expansion around $\Sigma = 0$:

      $$-\log\det(I + \alpha\Sigma) \;\approx\; -\alpha\,\mathrm{tr}\,\Sigma \;+\; \tfrac{\alpha^2}{2}\,\mathrm{tr}\,\Sigma^2,$$

      yielding two simple surrogates:

      • A variance hinge $\sum_j \max(0, 1-\sigma_j)^2$ enforcing $\sigma_j > 1$,

      • A covariance penalty $\sum_{i \neq j} \Sigma_{ij}^2$ pushing off-diagonals to zero (both regularizer families are sketched in code after this recap).

  5. Information-Theoretic Interpretation

    • $\log\det\Sigma$ (up to constants) is the differential entropy of a Gaussian with covariance $\Sigma$.

    • Maximizing $R_\varepsilon$ under the alignment constraint is an instance of InfoMax (maximize representation entropy while matching positives).

    • Comparing contrastive losses (InfoNCE) with these non-contrastive criteria reveals that they are dual formulations of the same underlying entropy-alignment trade-off.

  6. Practical Consequences

    • Stability & robustness: an explicit $\Omega$ removes the need for large numbers of negatives, momentum encoders, or architectural tricks, and works with moderate batch sizes.

    • Interpretability: VICReg’s variance/covariance terms correspond directly to whitening; coding-rate gives a principled entropy measure.

    • Hyperparameters: the surrogates (VICReg) need two weights ($\lambda_{\rm var}, \lambda_{\rm cov}$); coding-rate needs just one strength parameter $\gamma$ (plus $\varepsilon$).

Take-home: All successful negative-free SSL methods boil down to “align your positives” and “keep your covariance full-rank and (ideally) isotropic.” VICReg-style losses do this by simple variance + covariance penalties; coding-rate methods do it by directly maximizing a log-det entropy objective.
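
To make items 3 and 4 concrete, here is a minimal PyTorch sketch of an MSE alignment term combined with either the VICReg-style surrogates or the log-det coding-rate regularizer. This is illustrative only, not the official VICReg or SimDINO code; the weights (`lam_var`, `lam_cov`) and `eps` are placeholder values.

```python
# Minimal sketch (assumed shapes and weights) of the two regularizer families.
import torch
import torch.nn.functional as F


def alignment_loss(z1, z2):
    # Invariance term: mean-squared error between the two views' embeddings.
    return F.mse_loss(z1, z2)


def vicreg_regularizer(z, lam_var=25.0, lam_cov=1.0):
    # VICReg-style surrogate: a hinge keeping each dimension's std above 1,
    # plus a penalty on the off-diagonal entries of the covariance.
    z = z - z.mean(dim=0)                      # center the batch
    std = torch.sqrt(z.var(dim=0) + 1e-4)      # per-dimension standard deviation
    var_term = torch.mean(F.relu(1.0 - std))   # de-centering / variance term
    cov = (z.T @ z) / (z.shape[0] - 1)         # empirical covariance, d x d
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_term = (off_diag ** 2).sum() / z.shape[1]   # de-correlation term
    return lam_var * var_term + lam_cov * cov_term


def coding_rate_regularizer(z, eps=0.5):
    # Coding-rate term R_eps(Sigma) = 1/2 * logdet(I + d/eps^2 * Sigma),
    # returned with a minus sign so that minimizing the loss maximizes it.
    B, d = z.shape
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / B
    eye = torch.eye(d, dtype=z.dtype, device=z.device)
    return -0.5 * torch.logdet(eye + (d / eps ** 2) * cov)


# Usage on random stand-ins for two augmented views' embeddings:
z1, z2 = torch.randn(256, 128), torch.randn(256, 128)
loss_vicreg = alignment_loss(z1, z2) + vicreg_regularizer(z1) + vicreg_regularizer(z2)
loss_logdet = alignment_loss(z1, z2) + coding_rate_regularizer(z1)
print(loss_vicreg.item(), loss_logdet.item())
```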


Here’s a concise explanation:

Decorrelation in SSL means removing linear correlations between different feature dimensions so that each coordinate carries unique information. Concretely, if $Z \in \mathbb{R}^{B \times d}$ are the batch embeddings, their empirical covariance

$$\Sigma = \frac{1}{B}\,(Z - \bar{Z})^\top (Z - \bar{Z})$$

has off-diagonal entries $\Sigma_{ij}$ measuring the linear correlation between dimensions $i$ and $j$. Decorrelation drives $\Sigma_{ij} \to 0$ for $i \neq j$, ensuring the learned features span the full $d$-dimensional space rather than collapsing onto a lower-dimensional manifold (Wikipedia).

  • In VICReg, this is done explicitly by adding a term $\sum_{i \neq j} \Sigma_{ij}^2$ to the loss, which directly penalizes off-diagonal covariance (arXiv).

  • More generally, feature decorrelation has been shown to prevent dimensional collapse—where all information piles into a few axes—by standardizing the covariance matrix toward an identity form (arXiv).
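
A minimal NumPy illustration of the definition above (the batch size, dimensionality, and injected correlation are fabricated for the example): build embeddings with two deliberately correlated dimensions, form the empirical covariance $\Sigma$, and evaluate the off-diagonal penalty that VICReg adds to the loss.

```python
# Sketch of the empirical covariance and the explicit decorrelation penalty.
import numpy as np

B, d = 512, 8
Z = np.random.randn(B, d)
Z[:, 1] = 0.9 * Z[:, 0] + 0.1 * Z[:, 1]     # make dims 0 and 1 strongly correlated

Zc = Z - Z.mean(axis=0)                     # center the batch
Sigma = Zc.T @ Zc / B                       # empirical covariance, d x d

# VICReg-style decorrelation penalty: sum of squared off-diagonal entries.
off_diag_penalty = (Sigma ** 2).sum() - (np.diag(Sigma) ** 2).sum()
print(f"Sigma[0, 1] = {Sigma[0, 1]:.3f}, off-diagonal penalty = {off_diag_penalty:.3f}")
```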


Why does the coding rate regularizer also decorrelate?

The coding-rate regularizer used in SimDINO (and related Maximal Coding Rate Reduction (MCR²) methods) is

$$R_\varepsilon(\Sigma) = \tfrac{1}{2}\log\det\bigl(I + \tfrac{d}{\varepsilon^2}\,\Sigma\bigr),$$

and the SSL loss includes $-\gamma\,R_\varepsilon(\Sigma)$ to maximize this quantity (arXiv).

  1. Determinant ∝ volume: $\det(\Sigma)$ is proportional to the squared volume of the ellipsoid defined by $\Sigma$; the term $\log\det(I + \alpha\Sigma)$ thus measures the log-volume of the representation cloud (Wikipedia).

  2. Maximizing volume enforces spread: to maximize $\det$, the model must push all eigenvalues $\lambda_i(\Sigma)$ away from zero, i.e. preserve variance in every direction.

  3. Isotropy → decorrelation: for a fixed total variance $\sum_i \lambda_i$, the product $\prod_i \lambda_i$ (and hence $\det$) is maximized precisely when all $\lambda_i$ are equal; that is, $\Sigma$ becomes proportional to the identity, which implies zero off-diagonal entries, i.e. perfect decorrelation (Mathematics Stack Exchange; see the numeric check after this list).

  4. Global surrogate: unlike VICReg's separate variance and off-diagonal penalties, $\log\det$ is a single global measure of entropy (volume) that inherently couples variance preservation and decorrelation; it is a principled InfoMax criterion under Gaussian assumptions (arXiv).

Put simply, by maximizing the coding rate (log-det of the covariance), the model is driven to occupy an isotropic, full-rank region in feature space, which automatically decorrelates the dimensions without needing an explicit off-diagonal penalty (Cross Validated, ICML).
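
As a small numeric check of points 1 to 3 (the dimension, $\varepsilon$, and the example covariance matrices below are arbitrary choices): among covariances with the same trace, $R_\varepsilon$ is largest for the isotropic, decorrelated one.

```python
# Numeric check: with total variance (trace) fixed, logdet (and hence the
# coding rate) is maximized by an isotropic covariance; values are illustrative.
import numpy as np

d, eps = 4, 0.5

def coding_rate(Sigma):
    # R_eps(Sigma) = 1/2 * logdet(I + d/eps^2 * Sigma)
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / eps ** 2) * Sigma)[1]

iso = np.eye(d)                                  # trace 4, no correlations
aniso = np.diag([3.7, 0.1, 0.1, 0.1])            # trace 4, variance piled on one axis
corr = 0.9 * np.ones((d, d)) + 0.1 * np.eye(d)   # trace 4, strongly correlated dims

for name, S in [("isotropic", iso), ("anisotropic", aniso), ("correlated", corr)]:
    print(f"{name:12s} trace = {np.trace(S):.1f}  R_eps = {coding_rate(S):.3f}")
```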


References

  • Decorrelation definition: Wikipedia “Decorrelation” (Wikipedia)

  • Dimensional collapse & need for decorrelation: Hua et al. (2021) (arXiv)

  • VICReg’s explicit covariance penalty: Bardes et al. (2021) (arXiv)

  • Determinant as volume: Wikipedia “Determinant” (Wikipedia)

  • Geometric intuition of det⁡(Σ)\det(\Sigma): Math.SE (Mathematics Stack Exchange)

  • Log-det in InfoMax SSL: Statistics.SE on log-det (Jacobian, log-likelihood) (Cross Validated)

  • SimDINO coding rate reg.: Wu et al. (2025) “Simplifying DINO…” (arXiv)

  • MCR² principle: Yu et al. (2020) (arXiv)

  • CorInfoMax’s second-order MI: Ozsoy et al. (2022) (NeurIPS Proceedings)

  • Matrix Info Theory unifying view: Zhang et al. (ICML 2024) (ICML)