This repository presents a series of experiments conducted on a baseline Vision Transformer (ViT) to analyze the effects of different attention mechanisms, positional encoding strategies, and information-theoretic enhancements.
We have completed image classification on ImageNet-100 with these variants; extensions to object detection and semantic segmentation are currently in progress.
We evaluate multiple architectural and attention-level modifications on a fixed ViT backbone. These include:
1. Shannon entropy–based enhancements
2. Positional encoding modifications
3. Spatial suppression inspired by neuroscience
4. RoPE extensions
5. Hybrid RoPE + Absolute Positional Encoding
All experiments are trained under identical settings to allow fair comparison.
| No. | Experiment | Acc@1 (%) | Acc@5 (%) |
|---|---|---|---|
| 1 | Positional Encoding + Dropouts | 44.970 | 75.240 |
| 2 | Attention Based on Difference-Similarity | 46.360 | 75.540 |
| 3 | Baseline ViT | 48.109 | 75.617 |
| 4 | Spatial Suppression Attention | 49.383 | 77.673 |
| 5 | RoPE Extension | 53.331 | 80.407 |
| 6 | Differential Attention + Shannon Entropy | 53.803 | 80.530 |
| 7 | RoPE Mixed + Absolute Positional Encoding (Best) | 56.805 | 83.326 |
The following baseline configuration is kept constant across all experiments (unless explicitly noted):
1. Dataset: ImageNet-100
2. Epochs: 90
3. Batch Size: 256
4. Transformer Layers: 12
5. Attention Heads: 3
6. Hidden Dimension: 192
7. MLP Dimension: 768
8. Patch Size: 16×16
9. Baseline Positional Encoding: Learnable
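These hyperparameters correspond to a DeiT-Tiny-scale encoder (192-dim tokens, 3 heads, 12 layers). A minimal sketch of how they might be gathered into a single configuration object; the dataclass and field names below are illustrative, not the repository's actual code:

```python
from dataclasses import dataclass

@dataclass
class ViTConfig:
    """Shared training / architecture settings used across all experiments."""
    dataset: str = "ImageNet-100"
    epochs: int = 90
    batch_size: int = 256
    depth: int = 12            # transformer encoder layers
    num_heads: int = 3         # attention heads
    hidden_dim: int = 192      # token embedding dimension
    mlp_dim: int = 768         # feed-forward (MLP) dimension
    patch_size: int = 16       # 16x16 patches
    pos_encoding: str = "learnable"  # baseline positional encoding

BASE_CONFIG = ViTConfig()  # kept fixed unless an experiment notes otherwise
```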
The Baseline ViT is a standard ViT with learnable positional embeddings and vanilla multi-head attention; it serves as the primary reference for all comparisons.
Positional Encoding + Dropouts applies dropout to the positional embeddings to test robustness against spatial perturbations.
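A minimal sketch of this variant, assuming a standard learnable positional-embedding table; the module name, attribute names, and dropout rate are illustrative:

```python
import torch
import torch.nn as nn

class PosEmbedWithDropout(nn.Module):
    """Learnable positional embeddings with dropout applied to the embeddings
    themselves, so spatial information is randomly perturbed during training."""
    def __init__(self, num_patches: int, dim: int = 192, pos_drop: float = 0.1):
        super().__init__()
        # +1 slot for the [CLS] token, as in a standard ViT.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.pos_dropout = nn.Dropout(pos_drop)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches + 1, dim)
        return tokens + self.pos_dropout(self.pos_embed)
```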
Attention Based on Difference-Similarity uses the similarity of feature differences (ΔQ, ΔK) rather than the raw features.

Key Idea:

$$\Delta Q_i = Q_i - Q_{i-1}, \qquad \Delta K_j = K_j - K_{j-1}$$

Attention scores are then computed from the similarity of these differences instead of the raw Q–K products (see the sketch below).
Enhances local relational modeling but loses some global context.
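A single-head sketch of this attention variant under the definitions above; the treatment of the first token (zero difference) and the 1/√d scaling are assumptions rather than the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def difference_similarity_attention(q, k, v):
    """q, k, v: (batch, seq, dim). Attention scores come from token-to-token
    differences dQ_i = Q_i - Q_{i-1}, dK_j = K_j - K_{j-1} instead of raw Q, K."""
    # Shift by one token so the first difference is zero (Q_{-1} := Q_0).
    prev_q = torch.cat([q[:, :1], q[:, :-1]], dim=1)
    prev_k = torch.cat([k[:, :1], k[:, :-1]], dim=1)
    dq, dk = q - prev_q, k - prev_k
    scores = (dq @ dk.transpose(-2, -1)) * q.size(-1) ** -0.5  # similarity of differences
    return F.softmax(scores, dim=-1) @ v
```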
Differential Attention + Shannon Entropy replaces the learnable scalar λ in differential attention with a patch-dependent Shannon entropy weight:

$$H(x) = -\sum p(x)\,\log p(x)$$
High-entropy patches receive stronger modulation → better handling of texture-rich or complex regions.
Significant gain over the baseline.
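A sketch of the idea, assuming a DIFF-Transformer-style two-branch attention whose subtraction weight is replaced by a per-patch entropy estimate; the histogram-based entropy and the max-normalization below are illustrative choices, not the repository's exact formulation:

```python
import torch
import torch.nn.functional as F

def patch_entropy(x, num_bins: int = 16, eps: float = 1e-8):
    """Shannon entropy H = -sum p log p of each patch's feature values,
    estimated from a histogram. x: (batch, seq, dim) -> (batch, seq)."""
    x = (x - x.amin(-1, keepdim=True)) / (x.amax(-1, keepdim=True) - x.amin(-1, keepdim=True) + eps)
    bins = (x * (num_bins - 1)).long()
    hist = torch.zeros(*x.shape[:-1], num_bins, device=x.device).scatter_add_(
        -1, bins, torch.ones_like(x))
    p = hist / hist.sum(-1, keepdim=True)
    return -(p * (p + eps).log()).sum(-1)

def entropy_differential_attention(q1, k1, q2, k2, v, patch_tokens):
    """Differential attention a1 - lambda * a2 where lambda is the normalized
    per-patch entropy of `patch_tokens` instead of a learnable scalar."""
    scale = q1.size(-1) ** -0.5
    a1 = F.softmax((q1 @ k1.transpose(-2, -1)) * scale, dim=-1)
    a2 = F.softmax((q2 @ k2.transpose(-2, -1)) * scale, dim=-1)
    lam = patch_entropy(patch_tokens)                     # (batch, seq)
    lam = lam / (lam.amax(-1, keepdim=True) + 1e-8)       # high entropy -> stronger modulation
    return (a1 - lam.unsqueeze(-1) * a2) @ v              # per-query weighting
```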
Spatial Suppression Attention is inspired by surround suppression in biological vision.
A depthwise convolution learns a suppression kernel per head and subtracts its output from the raw attention scores: `suppressed = scores - self.suppression_conv(scores)`. Removes noisy neighborhood interactions → improves attention quality.
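A minimal sketch of this mechanism built around the line above; the kernel size and the placement of the softmax are illustrative:

```python
import torch
import torch.nn as nn

class SpatialSuppressionAttention(nn.Module):
    """Subtracts a learned, per-head 'surround' signal (a depthwise convolution
    over the attention-score map) from the raw scores before the softmax."""
    def __init__(self, num_heads: int = 3, kernel_size: int = 3):
        super().__init__()
        # groups=num_heads -> one independent suppression kernel per head.
        self.suppression_conv = nn.Conv2d(
            num_heads, num_heads, kernel_size,
            padding=kernel_size // 2, groups=num_heads, bias=False)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, num_queries, num_keys) raw attention logits
        suppressed = scores - self.suppression_conv(scores)
        return suppressed.softmax(dim=-1)
```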
Unlike standard RoPE, which rotates Q and K independently, the RoPE Extension:

- Rotates Q–K pairs relative to each other
- Makes the rotation angle depend on the query–key positional relationship
- Enables richer relative geometry modeling

Shows strong improvement over the baseline.
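One way to realize such a relative rotation, sketched for a single head: the score between query i and key j uses a rotation angle proportional to the offset i − j, computed per 2-D feature pair via qᵀR(φ)k = cos(φ)(q₁k₁ + q₂k₂) + sin(φ)(q₂k₁ − q₁k₂). The frequency schedule and function name are assumptions, not the repository's exact implementation:

```python
import math
import torch
import torch.nn.functional as F

def relative_rope_attention(q, k, v):
    """q, k, v: (batch, seq, dim) with even dim.
    score(i, j) = q_i^T R((i - j) * theta_p) k_j: the rotation angle depends on
    the query-key positional offset rather than on absolute positions."""
    b, n, d = q.shape
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, device=q.device).float() / d))  # (d/2,)
    rel = torch.arange(n, device=q.device).float()
    phi = (rel[:, None] - rel[None, :])[:, :, None] * freqs          # (n, n, d/2) angles

    q1, q2 = q[..., 0::2], q[..., 1::2]                              # feature pairs
    k1, k2 = k[..., 0::2], k[..., 1::2]
    dot = torch.einsum("bip,bjp->bijp", q1, k1) + torch.einsum("bip,bjp->bijp", q2, k2)
    cross = torch.einsum("bip,bjp->bijp", q2, k1) - torch.einsum("bip,bjp->bijp", q1, k2)
    scores = (dot * phi.cos() + cross * phi.sin()).sum(-1) / math.sqrt(d)  # (b, n, n)
    return F.softmax(scores, dim=-1) @ v
```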
RoPE Mixed + Absolute Positional Encoding is the best-performing configuration. It combines:

- Absolute Positional Encoding (APE) for global positioning
- Modified RoPE for fine-grained relative geometric cues
Balances global + local spatial understanding → highest overall accuracy.
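A sketch of how the two encodings might be combined: the learnable absolute embedding is added to the tokens once at the input, while the relative rotation (the `relative_rope_attention` sketch above) acts inside attention. Module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class HybridPositionAttention(nn.Module):
    """Absolute positional encoding for global placement plus a relative
    RoPE-style rotation inside attention for fine-grained geometry."""
    def __init__(self, num_patches: int, dim: int = 192):
        super().__init__()
        self.abs_pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # APE
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim)
        x = tokens + self.abs_pos                 # global positioning (APE)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        return relative_rope_attention(q, k, v)   # relative geometric cues (RoPE)
```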
Completed:
- Classification module on ImageNet-100
- Multiple positional and attention mechanism experiments

In progress:
- Object Detection (ViT backbone)
- Semantic Segmentation (ViT encoder with segmentation head)

Results and implementations will be added soon.