Toy-Vision-Transformer

A simple Vision Transformer (ViT) implementation from scratch, applied to MNIST digit classification. Heavily inspired by the vision models in timm and by the Hugging Face implementations.
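As a rough illustration of the first step a ViT performs, the sketch below splits a 28x28 grayscale image (the MNIST image size) into non-overlapping patches that a model would then embed as tokens. It is not taken from this repository: the `patchify` helper, the patch size of 7, and the dummy image are all assumptions chosen for the example, and it uses plain Python lists rather than tensors to stay dependency-free.

```python
from typing import List

Image = List[List[float]]  # row-major grayscale image


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an image into non-overlapping square patches.

    Each patch is flattened row by row, and patches are returned in
    left-to-right, top-to-bottom order. `patchify` is a hypothetical
    helper for illustration, not a function from this repository.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Dummy 28x28 image standing in for an MNIST digit (assumed values).
image = [[float(row * 28 + col) for col in range(28)] for row in range(28)]

# Assumed patch size of 7, giving a 4x4 grid of patches.
patches = patchify(image, patch_size=7)
print(len(patches), len(patches[0]))  # 16 49
```

With these assumed settings, each image becomes a sequence of 16 tokens of 49 values each. A ViT would then project every patch linearly, add position information, and pass the sequence through Transformer encoder blocks.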