Improved leaderboard submission Vegard #95
Made some changes. The run finishes barely within 90 minutes, so it could go over on a different H100. Five of the 90 minutes are spent validating against the entire validation set.
Improvements
- Changed the data loader from random sampling to randomized strided sampling without replacement. More unique training samples raised the training loss (no repetitions) but brought the validation loss down to 3.126. A sketch of the sampler follows this list.
- Implemented gated attention (sort of like in Qwen-Next, but with full gates instead of per-head gates, and SiLU instead of sigmoid; see the attention sketch after this list). Also changed to just 6 attention heads, which is a bit worse but gives higher MFU. In sum this brought the validation loss down to 3.1035. https://arxiv.org/pdf/2505.06708
- QK-norm (normalizing queries and keys per head before the dot product; included in the attention sketch below) -> 3.094
- Adjusting the AdamW betas from [0.90, 0.95] to [0.9, 0.999] gave a surprisingly large boost -> 3.0839
- U-net architecture with learnable parameters (sketched below) -> 3.0762
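
Here is a minimal sketch of one way to read "randomized strided sampling without replacement": chop the token stream into disjoint seq_len windows and visit them in a shuffled order, so no sample repeats within an epoch. All names (`strided_batches`, `seq_len`, etc.) are illustrative, not the submitted code.

```python
import numpy as np

def strided_batches(tokens: np.ndarray, seq_len: int, batch_size: int, seed: int = 0):
    """Yield (x, y) batches covering the data without replacement:
    disjoint seq_len windows, visited in a randomly permuted order."""
    rng = np.random.default_rng(seed)
    num_windows = (len(tokens) - 1) // seq_len    # -1 leaves room for the shifted targets
    starts = rng.permutation(num_windows) * seq_len
    for i in range(0, num_windows - batch_size + 1, batch_size):
        batch = starts[i:i + batch_size]
        x = np.stack([tokens[j:j + seq_len] for j in batch])
        y = np.stack([tokens[j + 1:j + 1 + seq_len] for j in batch])  # next-token targets
        yield x, y
```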
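
The gated-attention change plus QK-norm, sketched in PyTorch. This is my reading of the description (a full per-channel SiLU gate on the attention output, RMSNorm on queries and keys), not the exact submitted code; `nn.RMSNorm` needs PyTorch >= 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 6):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.gate = nn.Linear(dim, dim, bias=False)  # full gate: one value per channel, not per head
        self.proj = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)      # QK-norm: normalize q and k per head
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, T, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(B, T, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, C)
        out = out * F.silu(self.gate(x))             # SiLU gate instead of sigmoid
        return self.proj(out)
```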
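
The betas change is just the optimizer constructor; `model`, `lr`, and `wd` are whatever the existing training setup uses:

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999), weight_decay=wd)
```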
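
And a sketch of the U-net idea, assuming it follows the common speedrun variant: activations from the first half of the blocks are saved and mixed back into the second half through learnable scalar weights. The pairing order and zero init here are assumptions.

```python
import torch
import torch.nn as nn

class UNetBlocks(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        assert len(blocks) % 2 == 0
        self.blocks = blocks
        self.skip_w = nn.Parameter(torch.zeros(len(blocks) // 2))  # one learnable weight per skip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = len(self.blocks) // 2
        skips = []
        for block in self.blocks[:half]:                # "encoder" half: save activations
            x = block(x)
            skips.append(x)
        for i, block in enumerate(self.blocks[half:]):  # "decoder" half: mix them back in
            x = x + self.skip_w[i] * skips.pop()        # first decoder pairs with last encoder
            x = block(x)
        return x
```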
Other attempts
- Lowering the LR for the LM-head layer, as recommended in the Dion paper. NanoGPT seems to do it the other way around? Gave very minimal improvement. https://arxiv.org/pdf/2504.05295
- Document masking (see the sketch after this list). Lower MFU and a very slight performance decrease. I really expected this to work a lot better.
- Sliding-window attention. A hybrid of 3 sliding-window layers per 1 full-attention layer is only slightly worse, but gave no speedup.
- Grouped-query attention. Worse, as expected.
- Scaling the output of each block like in ERNIE. Worse; I think this might be interfering with my layernorm scaling.
- Mixing extra embedding values into the value matrices of later layers, like in the NanoGPT speedrun (looks good two-thirds of the way in but ends up worse in the end; a sketch follows this list). General idea: https://arxiv.org/pdf/2410.17897
- Decreasing or increasing warmup steps, or increasing the learning rate, after adding QK-norm gave no benefit.
- QK-clip. I was not able to get this to work. In theory it should help a bit with MFU compared to QK-norm.
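
For reference, document masking amounts to forbidding attention across document boundaries on top of the causal mask. A sketch with PyTorch's FlexAttention, assuming a `document_id` tensor mapping each token position to its document (names are illustrative):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def doc_causal_mask(document_id: torch.Tensor, seq_len: int):
    """Causal mask that additionally blocks attention across document boundaries."""
    def mask_mod(b, h, q_idx, kv_idx):
        causal = q_idx >= kv_idx
        same_doc = document_id[q_idx] == document_id[kv_idx]
        return causal & same_doc
    return create_block_mask(mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len)

# inside the attention forward, in place of scaled_dot_product_attention:
# out = flex_attention(q, k, v, block_mask=block_mask)
```

In practice the block mask would be built once per batch and shared across layers, not recomputed in every forward.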
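
The value-embedding attempt, roughly as described: a separate token embedding is mixed into the attention values of later layers with a learnable coefficient. This shows the general idea only; which layers get it and the exact mixing scheme are assumptions.

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Mixes a per-token embedding into a layer's attention values."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable mixing coefficient

    def forward(self, v: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # v: (B, T, C) values from the usual linear projection; input_ids: (B, T) token ids
        return self.lam * v + (1.0 - self.lam) * self.embed(input_ids)
```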