
Conversation

@vskogstad (Contributor)

Made some changes. The run barely fits within 90 minutes, so it could go over on a different H100. Five of the 90 minutes are spent validating against the entire validation set.

Improvements
-Changed the data loader from random sampling to randomized strided sampling without replacement (see the sketch after this list). More unique training samples increased the training loss (no repetitions) but brought the validation loss down to 3.126.
-Implemented gated attention (sort of like in Qwen-Next, but I use full gates instead of per-head gates and SiLU instead of sigmoid; a sketch follows this list). Also changed to just 6 attention heads, which is slightly worse but gives higher MFU. In sum this brought validation loss down to 3.1035. https://arxiv.org/pdf/2505.06708
-QK-norm -> 3.094
-Adjusting the AdamW betas from [0.90, 0.95] to [0.9, 0.999] gave a surprisingly large boost -> 3.0839
-U-net architecture with learnable params (sketched below). -> 3.0762
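
For reference, a minimal sketch of what the strided sampling could look like, assuming a flat token array and a fixed sequence length; the function name, arguments and non-overlapping stride are illustrative guesses rather than the actual loader:

```python
import numpy as np
import torch

def strided_batches(tokens: np.ndarray, seq_len: int, batch_size: int, seed: int = 0):
    """Yield (inputs, targets) batches covering the token stream without replacement.

    Start positions are laid out on a fixed stride (non-overlapping windows)
    and then shuffled, so every window is visited at most once per epoch in a
    randomized order -- no repeated training samples.
    """
    n_seqs = (len(tokens) - 1) // seq_len          # how many full windows fit
    starts = np.arange(n_seqs) * seq_len           # strided start offsets
    rng = np.random.default_rng(seed)
    rng.shuffle(starts)                            # randomize the visiting order
    for i in range(0, n_seqs - batch_size + 1, batch_size):
        rows = [tokens[s : s + seq_len + 1] for s in starts[i : i + batch_size]]
        batch = torch.tensor(np.stack(rows), dtype=torch.long)
        yield batch[:, :-1], batch[:, 1:]          # next-token prediction pairs
```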
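
And a rough PyTorch sketch of the gated attention plus QK-norm: a single SiLU gate computed from the block input and applied over the full attention output (instead of the paper's per-head sigmoid gates), with per-head RMSNorm on queries and keys. Rotary embeddings are omitted, the class name and dimension defaults are illustrative, and `nn.RMSNorm` needs a recent PyTorch (>= 2.4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCausalAttention(nn.Module):
    """Causal self-attention with QK-norm and a SiLU gate over the full output."""

    def __init__(self, d_model: int = 1536, n_heads: int = 6):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)   # one gate for the whole output, not per head
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)                # QK-norm: per-head RMSNorm on q and k
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2))
        k = self.k_norm(k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2))
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        y = F.silu(self.gate(x)) * y                           # SiLU instead of the paper's sigmoid
        return self.proj(y)
```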
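
The U-net item is sketched below under the assumption that "learnable params" means a learnable scalar on each mirrored skip connection between the first and second half of the blocks; the wrapper class and zero-init are illustrative:

```python
import torch
import torch.nn as nn

class UNetBlockStack(nn.Module):
    """Transformer block stack with U-net style skips and learnable skip scalars.

    The first half of the blocks store their outputs; each block in the second
    half adds back the mirrored activation, scaled by its own learnable weight.
    """

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.n_enc = len(blocks) // 2
        self.skip_weights = nn.Parameter(torch.zeros(len(blocks) - self.n_enc))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i, block in enumerate(self.blocks):
            if i < self.n_enc:
                x = block(x)
                skips.append(x)              # "encoder" half: remember activations
            else:
                if skips:                    # "decoder" half: mix the mirrored activation back in
                    x = x + self.skip_weights[i - self.n_enc] * skips.pop()
                x = block(x)
        return x
```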

Other attempts
-Lowering the LR for the LM-head layer, as recommended in the Dion paper (NanoGPT seems to be doing it the other way around?). Gave very minimal improvement; a param-group sketch follows this list. https://arxiv.org/pdf/2504.05295
-Document masking (see the mask sketch after this list). Lower MFU and a very slight performance decrease. I really expected this to work a lot better.
-Sliding window attention. A hybrid with 3 sliding-window layers per 1 full-attention layer is only slightly worse, but gives no speedup.
-Grouped query attention. Worse, as expected.
-Scaling the output of each block like in Ernie. Worse; I think this might be interfering with my layernorm scaling.
-Mixing extra embedding values into the later value matrices like in the NanoGPT speedrun (looks good two-thirds of the way in but ends up worse in the end). General idea: https://arxiv.org/pdf/2410.17897
-Decreasing/increasing warmup steps or increasing learning rate after adding QK-norm gave no benefit.
-QK-clip. I was not able to get this to work. In theory it should help a bit with the MFU compared to QK-norm.
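
The LM-head LR attempt boils down to giving the head its own optimizer param group. A minimal sketch, assuming the output projection is reachable as `model.lm_head` and with a purely illustrative 0.5 multiplier (the betas line reuses the [0.9, 0.999] setting from the improvements list):

```python
import torch

def build_optimizer(model, base_lr=3e-4, head_lr_mult=0.5):
    """AdamW with a lower learning rate for the LM head (multiplier is illustrative)."""
    head_params = list(model.lm_head.parameters())    # assumes the head is model.lm_head
    head_ids = {id(p) for p in head_params}
    body_params = [p for p in model.parameters() if id(p) not in head_ids]
    return torch.optim.AdamW(
        [
            {"params": body_params, "lr": base_lr},
            {"params": head_params, "lr": base_lr * head_lr_mult},
        ],
        betas=(0.9, 0.999),   # the beta change from the improvements list
        weight_decay=0.0,
    )
```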
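
Document masking here means restricting attention to tokens from the same source document. A minimal dense-mask sketch is below; a FlexAttention block mask would be the faster route in practice, and `EOS_TOKEN_ID` is a placeholder for whatever separator the tokenizer uses:

```python
import torch
import torch.nn.functional as F

def document_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Causal attention mask that also blocks attention across document boundaries.

    doc_ids: (T,) tensor assigning each position to a document, e.g.
    (tokens == EOS_TOKEN_ID).cumsum(0) for some placeholder EOS_TOKEN_ID.
    Returns a (T, T) boolean mask (True = may attend) usable as attn_mask
    in F.scaled_dot_product_attention.
    """
    T = doc_ids.shape[0]
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=doc_ids.device))
    return same_doc & causal

# usage sketch (per sequence):
#   mask = document_mask(doc_ids)                            # (T, T) bool
#   y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```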

@vskogstad (Contributor Author)

Made some more adjustments:
-Scaled down d_model from 1536 to 1024 and increased steps -> 3.059
-Changed from Muon to NorMuon -> 3.0396.
-Finally got extra value embeddings mixed into the V-matrix to work. Learnable scalar mixing of the embeddings in only the final two layers gave identical results to mixing in all layers, so I ended up going with that implementation (see the sketch below). -> 3.032
-Increased the number of training steps to the edge of the time limit. -> 3.0305.
I now measure validation loss against a small subset of the validation set during training, and compute the loss on the entire validation set after training, which is not included in the runtime. The run shown in the graph actually reached a validation loss of 3.0285, but to leave some margin for worse GPU performance I have decreased the number of steps by 200 for my submission, which gives a validation loss of 3.0305.
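
A rough sketch of what the learnable scalar mixing could look like, in the spirit of the NanoGPT speedrun value embeddings (general idea: https://arxiv.org/pdf/2410.17897): each of the final two attention layers blends its V projection with a per-token embedding via its own learnable scalar. The wiring, the 0.5 init and the per-layer embedding table are illustrative, not the submitted code:

```python
import torch
import torch.nn as nn

class ValueEmbedMixer(nn.Module):
    """Blend a per-token 'value embedding' into one attention layer's values.

    Intended to sit in the final two attention layers only; each layer gets its
    own learnable mixing scalar, while the embedding table could also be shared.
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.value_embed = nn.Embedding(vocab_size, d_model)
        self.mix = nn.Parameter(torch.tensor(0.5))   # learnable scalar, 0.5 init is illustrative

    def forward(self, v: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # v: (B, T, d_model) output of the layer's V projection
        # token_ids: (B, T) raw input token ids
        lam = self.mix
        return (1.0 - lam) * v + lam * self.value_embed(token_ids)
```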
