We've explored several self-supervised and generative learning approaches, each offering a different way to model or understand data geometry. Here's a structured summary of what we've done.
Goal: Learn an embedding space where similar points (e.g., perturbed versions of the same sample) are close, and dissimilar points are far apart.
- Sampled pairs \((x, x')\), where \(x'\) is a perturbed version of \(x\).
- Trained a 2-layer MLP encoder.
- Applied a contrastive loss based on cosine similarity (a code sketch follows this list):
  - Pull \(x\) and \(x'\) together.
  - Push \(x\) and unrelated \(x''\) apart.
- Encourages local structure awareness.
- Effective at separating clusters.
- Visualization: energy landscape = -cosine similarity.
- Learned well-separated embeddings for different clusters.
- Strong energy wells around data points.
- Contrastive loss required careful batch construction.
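A minimal PyTorch sketch of this setup, assuming a 2-layer MLP encoder and an NT-Xent-style cosine-similarity loss in which the other samples in the batch act as negatives; the class and function names, layer sizes, and temperature are illustrative, not the exact configuration used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """2-layer MLP encoder producing unit-norm embeddings (illustrative sizes)."""
    def __init__(self, in_dim=2, hidden=64, out_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z, z_pos, temperature=0.1):
    """NT-Xent-style loss: pull (x, x') together, push x away from the
    other samples in the batch, which are treated as negatives."""
    sim = z @ z_pos.t() / temperature                 # (B, B) cosine similarities
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(sim, targets)              # diagonal entries are positives

# usage sketch
enc = Encoder()
x = torch.randn(128, 2)
x_prime = x + 0.05 * torch.randn_like(x)              # perturbed views
loss = contrastive_loss(enc(x), enc(x_prime))
loss.backward()
```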
Goal: Estimate the gradient of the log density (a.k.a. the score function):
\[ \nabla_x \log p(x) \]
- Added Gaussian noise: \(\tilde{x} = x + \epsilon\), with \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\).
- Trained a network to predict the target \(-\epsilon / \sigma^2\) (see the sketch after this list).
- Used MSE between predicted vector and true noise direction.
- Trains without contrastive or positive/negative pairs.
- Approximates score function of the data distribution.
- Enables sampling via Langevin dynamics.
- Model learned to point toward data clusters.
- The implied energy is low near the data and high away from it (or the reverse, depending on the chosen sign convention).
- Captures local structure and partially long-range geometry.
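A sketch of the denoising score matching objective and Langevin sampling described above, under the assumption that \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\); the network architecture, noise level \(\sigma\), step size, and step count are placeholder choices.

```python
import torch
import torch.nn as nn

score_net = nn.Sequential(                     # illustrative score network
    nn.Linear(2, 128), nn.SiLU(),
    nn.Linear(128, 128), nn.SiLU(),
    nn.Linear(128, 2),
)

def dsm_loss(x, sigma=0.1):
    """Denoising score matching: predict -eps / sigma^2 from x + eps."""
    eps = torch.randn_like(x) * sigma
    target = -eps / sigma**2
    pred = score_net(x + eps)
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def langevin_sample(n=256, steps=200, step_size=1e-3):
    """Sample via Langevin dynamics using the learned score."""
    x = torch.randn(n, 2)
    for _ in range(steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * step_size * score_net(x) + (step_size ** 0.5) * noise
    return x
```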
Goal: Predict embeddings of perturbed samples without using contrastive loss.
- Input: \((x, x')\), where \(x' = x + \text{noise}\).
- Learn embeddings \(f(x)\) and \(f(x')\).
- Minimize \(\| f(x) - f(x') \|^2\).
- Compare every \(x_i\) to all \(x'_j\) in the batch:
\[ \mathcal{L} = \sum_{i,j} \| f(x_i) - f(x'_j) \|^2 \]
- Learns a non-local similarity structure.
- Randomly mask one coordinate (x[0] or x[1]).
- Predict the embedding of the full point from the masked input.
- Alternated the masked coordinate each epoch.
- Paper
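A sketch of the masked variant, assuming 2-D inputs where the masked coordinate is zeroed out and the embedding of the unmasked point serves as the prediction target; the zero-fill and the stop-gradient on the target are assumptions rather than details stated above.

```python
import torch

def mask_coordinate(x, coord):
    """Zero out one coordinate (x[:, 0] or x[:, 1]) to form the masked view."""
    x_masked = x.clone()
    x_masked[:, coord] = 0.0
    return x_masked

def masked_jepa_loss(f, x, epoch):
    """Predict the embedding of the full point from the masked input;
    the masked coordinate alternates each epoch."""
    coord = epoch % 2
    target = f(x).detach()                     # stop-gradient target (assumption)
    pred = f(mask_coordinate(x, coord))
    return ((pred - target) ** 2).sum(dim=-1).mean()
```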
- Online encoder \(f(x)\) (trainable).
- Target encoder \(f'(x')\), updated via an exponential moving average (EMA) of the online weights.
- Loss: match the two embeddings.
- Paper
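A sketch of the online/target pair with an EMA update, in the spirit of BYOL / I-JEPA-style target networks; the encoder architecture and decay rate are illustrative.

```python
import copy
import torch
import torch.nn as nn

def make_encoder():
    return nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 16))

online = make_encoder()                  # f(x): trainable
target = copy.deepcopy(online)           # f'(x'): updated only via EMA
for p in target.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(decay=0.99):
    """target <- decay * target + (1 - decay) * online."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(decay).add_(p_o, alpha=1 - decay)

def jepa_ema_loss(x, x_prime):
    """Match the online embedding of x to the EMA-target embedding of x'."""
    z_online = online(x)
    with torch.no_grad():
        z_target = target(x_prime)
    return ((z_online - z_target) ** 2).sum(dim=-1).mean()
```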
- No contrastive terms needed.
- EMA improves stability and avoids collapse.
- Masked JEPA encourages conditional modeling.
- Pairwise JEPA promotes global geometry learning.
- Learned smooth energy structures, local or global depending on the variant.
- Less collapse, especially with EMA.
- Masked variant captured conditional dependencies.
| Method | Uses Contrastive? | Learns Score? | Sampling? | Learns Global Structure? | Masked Prediction? |
|---|---|---|---|---|---|
| Contrastive JEA | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No |
| DSM | ❌ No | ✅ Yes | ✅ Yes | ❌ Local only | ❌ No |
| Global Pairwise JEPA | ❌ No | ❌ No | ❌ No | ✅ Yes | ❌ No |
| Masked JEPA | ❌ No | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| JEPA + EMA | ❌ No | ❌ No | ❌ No | ✅ Stable + non-local | Optional |
- Goal: Apply various transformations (affine, erosion, dilation, inversion, and noise) to perturb MNIST images.
- Method: Uses random transformations like rotations, scaling, shearing, and random noise to alter the images.
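A sketch of such a perturbation pipeline, assuming batched MNIST tensors of shape (B, 1, 28, 28) with values in [0, 1]; the kernel sizes, noise level, and the use of torchvision's `RandomAffine` for the affine part are illustrative choices.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Affine part: random rotation, scaling, shearing (parameters are illustrative).
random_affine = transforms.RandomAffine(degrees=15, scale=(0.8, 1.2), shear=10)

def perturb(img, noise_std=0.2):
    """Randomly dilate, erode, invert, or add noise to MNIST images (B, 1, 28, 28)."""
    img = random_affine(img)
    choice = torch.randint(0, 4, (1,)).item()
    if choice == 0:
        img = F.max_pool2d(img, 3, stride=1, padding=1)        # dilation (3x3 max filter)
    elif choice == 1:
        img = -F.max_pool2d(-img, 3, stride=1, padding=1)      # erosion (3x3 min filter)
    elif choice == 2:
        img = 1.0 - img                                        # inversion
    else:
        img = img + noise_std * torch.randn_like(img)          # additive noise
    return img.clamp(0.0, 1.0)
```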
- Goal: Train a model to generate robust latent representations (embeddings) using perturbation-based self-supervised learning.
- Method: The model learns to predict representations of perturbed images using cosine similarity loss.
- Evaluation: Embeddings visualized using PCA and evaluated with k-NN accuracy.
- Accuracy: 0.9901
- Precision: 0.9901
- Recall: 0.9901
- F1 Score: 0.9901
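A sketch of the evaluation step described above, run on frozen embeddings with scikit-learn's k-NN classifier and PCA; the value of k, the macro averaging, and the plotting details are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_embeddings(z_train, y_train, z_test, y_test, k=5):
    """k-NN classification on frozen embeddings plus a 2-D PCA scatter plot."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(z_train, y_train)
    y_pred = knn.predict(z_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="macro")

    z2d = PCA(n_components=2).fit_transform(z_test)
    plt.scatter(z2d[:, 0], z2d[:, 1], c=y_test, s=4, cmap="tab10")
    plt.title(f"PCA of embeddings (k-NN acc={acc:.4f}, F1={f1:.4f})")
    plt.show()
    return acc, f1
```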
- Goal: Fine-tune the model for image reconstruction using a decoder.
- Method: The model, after training embeddings, reconstructs perturbed images and optimizes using MSE loss.
- Evaluation: Visual comparison of original, perturbed, and reconstructed images.
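A sketch of the reconstruction stage, assuming the pretrained encoder is kept frozen and a small MLP decoder maps embeddings of perturbed views back to 28x28 images, trained with MSE against the clean originals; the latent size, layer widths, and the frozen-encoder choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Map a latent embedding back to a 28x28 image (illustrative sizes)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

def reconstruction_step(encoder, decoder, x, x_perturbed, optimizer):
    """One MSE training step: reconstruct the clean image from the
    embedding of its perturbed view (encoder frozen here by assumption)."""
    with torch.no_grad():
        z = encoder(x_perturbed)
    x_hat = decoder(z)
    loss = F.mse_loss(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```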