Hi, I really like the model. It has been trained well and generates good results given its size and its slightly unusual architecture. I know the dataset used here comes from the TinyStories paper, but is there also literature backing the model architecture, or are you planning to publish a paper?