
Conversation

@JorgeVanco
Contributor

I reduced the number of layers to 12 and doubled the batch size to 256. I had to adjust the number of steps to 18,200.
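For reference, a rough sketch of what that change looks like in a training config (the field names here are illustrative, not this PR's actual code):

```python
from dataclasses import dataclass

# Illustrative only; the real names and surrounding code live in the PR.
@dataclass
class TrainConfig:
    n_layer: int = 12        # reduced depth
    batch_size: int = 256    # doubled batch size
    max_steps: int = 18_200  # step count adjusted for the new batch size

cfg = TrainConfig()
```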

@JorgeVanco
Contributor Author

I added learned value embeddings, just like in the modded-nanogpt repo. I have also further reduced the number of layers to 10.
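Roughly the shape of the idea, as a sketch (module and parameter names below are mine, not modded-nanogpt's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueEmbedAttention(nn.Module):
    """Causal self-attention with a learned value embedding: a per-token
    embedding is mixed into the value vectors with a learnable scalar,
    in the spirit of modded-nanogpt. Shapes and names are illustrative."""

    def __init__(self, d_model: int, n_head: int, vocab_size: int):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.value_embed = nn.Embedding(vocab_size, d_model)  # learned value embedding
        self.mix = nn.Parameter(torch.tensor(0.5))            # learnable mixing scalar

    def forward(self, x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        v = v + self.mix * self.value_embed(idx)  # token-indexed values join the stream
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```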

I have also added the Muon momentum warmup. It barely gives any improvement, but since the run is already very hard to improve further, I might as well include it, as it does not affect speed.
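The warmup itself is tiny; something like this (the endpoints and horizon here are my guesses at the usual modded-nanogpt values, not necessarily what this PR uses):

```python
def muon_momentum(step: int, warmup_steps: int = 300,
                  start: float = 0.85, end: float = 0.95) -> float:
    """Linearly ramp Muon's momentum from `start` to `end` over the first
    `warmup_steps` steps, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end

# In the training loop (assuming a Muon optimizer whose param groups
# expose a `momentum` hyperparameter):
#   for group in optimizer.param_groups:
#       group["momentum"] = muon_momentum(step)
```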

@whiteOsky

good!

@JorgeVanco
Contributor Author

Edit 10/29/25

  • Slightly increased the overall learning rate, as well as the learning rate for the embeddings.
  • Implemented YaRN to progressively increase the sequence length from 256 to 1792 (see the schedule sketch after this list).
  • Added the NorMuon update.
  • Increased weight decay to 0.01.
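For the sequence-length ramp, a minimal sketch of the kind of schedule I mean (the exact ramp shape and rounding in my run may differ; YaRN additionally rescales the RoPE frequencies so the model tolerates lengths beyond the current training context):

```python
def seq_len_schedule(step: int, total_steps: int,
                     start_len: int = 256, end_len: int = 1792) -> int:
    """Linearly grow the training context from start_len to end_len,
    rounded down to a multiple of 128 to keep batch shapes friendly.
    Purely illustrative."""
    frac = min(step / total_steps, 1.0)
    length = start_len + frac * (end_len - start_len)
    return max(start_len, int(length // 128) * 128)
```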

@vskogstad
Contributor

That's an amazing improvement! 👍

When evaluating on the validation set, I assume you do so with the new maximum context length (1792)?
I noticed for my model that validation loss decreases just from doubling the context length and decreasing the batch size during validation, even with no training at the extended context. This is/was also done in the NanoGPT speedrun at some point. It seems we get some gains during validation simply from reducing the number of occurrences where the model has very little available context and has to guess.
(Just to be clear: I think validating your model with a large context length is correct, since you've actually trained the model on that context length.)
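To make the effect concrete, this is roughly the comparison I ran, as a sketch (it assumes a model that maps a [B, T] tensor of token ids to [B, T, vocab] logits):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def val_loss_at_context(model, tokens: torch.Tensor, ctx_len: int,
                        batch_size: int, device: str = "cuda") -> float:
    """Evaluate one flat validation token stream chunked at ctx_len.
    Longer chunks mean fewer positions sit right after a chunk boundary
    with almost no history, which alone can lower the average loss."""
    n = (tokens.numel() - 1) // ctx_len
    x = tokens[: n * ctx_len].view(n, ctx_len)
    y = tokens[1 : n * ctx_len + 1].view(n, ctx_len)
    total, count = 0.0, 0
    for i in range(0, n, batch_size):
        xb = x[i : i + batch_size].to(device)
        yb = y[i : i + batch_size].to(device)
        logits = model(xb)  # assumed: raw logits out
        total += F.cross_entropy(logits.view(-1, logits.size(-1)),
                                 yb.view(-1), reduction="sum").item()
        count += yb.numel()
    return total / count

# e.g. compare val_loss_at_context(model, val_tokens, 512, 64)
#      with    val_loss_at_context(model, val_tokens, 1024, 32)
```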

@JorgeVanco
Contributor Author

Thank you!
Yes exactly, I am running validation with the maximum context length.

@marcelroed
Member

Ah, I should have made this clear in this repo earlier (I'll add it to the readme), but verification will happen at context length 512. I think this might have been something we communicated over our class Slack that should have been added to the writeup.

Could you eval using these settings? Sorry for the confusion!

@marcelroed
Member

Great work!

@JorgeVanco
Contributor Author

Sure! I'll update it in the next couple of days. Thanks for the clarification! It was too good to be true haha

@vskogstad
Contributor

> Ah, I should have made this clear in this repo earlier (I'll add it to the readme), but verification will happen at context length 512. I think this might have been something we communicated over our class Slack that should have been added to the writeup.
>
> Could you eval using these settings? Sorry for the confusion!

Sorry about hijacking the thread, Jorge, but could you clarify a bit about what is and isn't allowed in leaderboard submissions?
For assignment 1 we are supposed to build everything from scratch, and as such there are limitations on using torch.nn. Is this requirement waived for the leaderboard? (Basically, can we use torch.nn.functional.scaled_dot_product_attention, flex_attention, etc. to get systems speedups?)
In the readme it says:

> The code must clearly be your own work, and you can't use external implementations for systems-critical aspects of your model.

If we can't use torch.nn, can we write our own kernels? And what level of abstraction still qualifies as our own work? (CUDA only, or can we use higher-level abstractions like Triton/ThunderKittens?)
