
Conversation

@vskogstad (Contributor)

Made some changes. The run barely fits within 90 minutes, so it could go over on a different H100. Five of the 90 minutes are spent validating against the entire validation set.

Improvements
-Changed the data loader from random sampling to randomized strided sampling without replacement (see the sketch after this list). More unique training samples increased the training loss (no repetitions) but brought the validation loss down to 3.126.
-Implemented gated attention (sort of like in Qwen-Next, but I use full gates instead of per-head gates and SiLU instead of sigmoid; a sketch follows this list). Also changed to just 6 attention heads, which is slightly worse but gives higher MFU. In sum this brought validation loss down to 3.1035. https://arxiv.org/pdf/2505.06708
-QK-norm -> 3.094
-Adjusting the AdamW betas from [0.90, 0.95] to [0.9, 0.999] gave a surprisingly large boost -> 3.0839
-U-net architecture with learnable params (sketched below). -> 3.0762
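
For reference, a minimal sketch of what the strided sampling could look like, assuming a flat token array and a fixed sequence length; the function name, arguments and non-overlapping stride are illustrative guesses rather than the actual loader:

```python
import numpy as np
import torch

def strided_batches(tokens: np.ndarray, seq_len: int, batch_size: int, seed: int = 0):
    """Yield (inputs, targets) batches covering the token stream without replacement.

    Start positions are laid out on a fixed stride (non-overlapping windows)
    and then shuffled, so every window is visited at most once per epoch in a
    randomized order -- no repeated training samples.
    """
    n_seqs = (len(tokens) - 1) // seq_len          # how many full windows fit
    starts = np.arange(n_seqs) * seq_len           # strided start offsets
    rng = np.random.default_rng(seed)
    rng.shuffle(starts)                            # randomize the visiting order
    for i in range(0, n_seqs - batch_size + 1, batch_size):
        rows = [tokens[s : s + seq_len + 1] for s in starts[i : i + batch_size]]
        batch = torch.tensor(np.stack(rows), dtype=torch.long)
        yield batch[:, :-1], batch[:, 1:]          # next-token prediction pairs
```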
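
And a rough PyTorch sketch of the gated attention plus QK-norm: a single SiLU gate computed from the block input and applied over the full attention output (instead of the paper's per-head sigmoid gates), with per-head RMSNorm on queries and keys. Rotary embeddings are omitted, the class name and dimension defaults are illustrative, and `nn.RMSNorm` needs a recent PyTorch (>= 2.4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCausalAttention(nn.Module):
    """Causal self-attention with QK-norm and a SiLU gate over the full output."""

    def __init__(self, d_model: int = 1536, n_heads: int = 6):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)   # one gate for the whole output, not per head
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)                # QK-norm: per-head RMSNorm on q and k
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2))
        k = self.k_norm(k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2))
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        y = F.silu(self.gate(x)) * y                           # SiLU instead of the paper's sigmoid
        return self.proj(y)
```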
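
The U-net item is sketched below under the assumption that "learnable params" means a learnable scalar on each mirrored skip connection between the first and second half of the blocks; the wrapper class and zero-init are illustrative:

```python
import torch
import torch.nn as nn

class UNetBlockStack(nn.Module):
    """Transformer block stack with U-net style skips and learnable skip scalars.

    The first half of the blocks store their outputs; each block in the second
    half adds back the mirrored activation, scaled by its own learnable weight.
    """

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.n_enc = len(blocks) // 2
        self.skip_weights = nn.Parameter(torch.zeros(len(blocks) - self.n_enc))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i, block in enumerate(self.blocks):
            if i < self.n_enc:
                x = block(x)
                skips.append(x)              # "encoder" half: remember activations
            else:
                if skips:                    # "decoder" half: mix the mirrored activation back in
                    x = x + self.skip_weights[i - self.n_enc] * skips.pop()
                x = block(x)
        return x
```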

Other attempts
-Lowering the LR for the LM-head layer, as recommended in the Dion paper (NanoGPT seems to be doing it the other way around?). Gave very minimal improvement; a param-group sketch follows this list. https://arxiv.org/pdf/2504.05295
-Document masking (see the mask sketch after this list). Lower MFU and a very slight performance decrease. I really expected this to work a lot better.
-Sliding window attention. A hybrid with 3 sliding-window layers per 1 full-attention layer is only slightly worse, but gives no speedup.
-Grouped query attention. Worse, as expected.
-Scaling the output of each block like in Ernie. Worse; I think this might be interfering with my layernorm scaling.
-Mixing extra embedding values into the later value matrices like in the NanoGPT speedrun (looks good two-thirds of the way in but ends up worse in the end). General idea: https://arxiv.org/pdf/2410.17897
-Decreasing/increasing warmup steps or increasing learning rate after adding QK-norm gave no benefit.
-QK-clip. I was not able to get this to work. In theory it should help a bit with the MFU compared to QK-norm.
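
The LM-head LR attempt boils down to giving the head its own optimizer param group. A minimal sketch, assuming the output projection is reachable as `model.lm_head` and with a purely illustrative 0.5 multiplier (the betas line reuses the [0.9, 0.999] setting from the improvements list):

```python
import torch

def build_optimizer(model, base_lr=3e-4, head_lr_mult=0.5):
    """AdamW with a lower learning rate for the LM head (multiplier is illustrative)."""
    head_params = list(model.lm_head.parameters())    # assumes the head is model.lm_head
    head_ids = {id(p) for p in head_params}
    body_params = [p for p in model.parameters() if id(p) not in head_ids]
    return torch.optim.AdamW(
        [
            {"params": body_params, "lr": base_lr},
            {"params": head_params, "lr": base_lr * head_lr_mult},
        ],
        betas=(0.9, 0.999),   # the beta change from the improvements list
        weight_decay=0.0,
    )
```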
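
Document masking here means restricting attention to tokens from the same source document. A minimal dense-mask sketch is below; a FlexAttention block mask would be the faster route in practice, and `EOS_TOKEN_ID` is a placeholder for whatever separator the tokenizer uses:

```python
import torch
import torch.nn.functional as F

def document_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Causal attention mask that also blocks attention across document boundaries.

    doc_ids: (T,) tensor assigning each position to a document, e.g.
    (tokens == EOS_TOKEN_ID).cumsum(0) for some placeholder EOS_TOKEN_ID.
    Returns a (T, T) boolean mask (True = may attend) usable as attn_mask
    in F.scaled_dot_product_attention.
    """
    T = doc_ids.shape[0]
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=doc_ids.device))
    return same_doc & causal

# usage sketch (per sequence):
#   mask = document_mask(doc_ids)                            # (T, T) bool
#   y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```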

@vskogstad (Contributor Author)

Made some more adjustments:
-Scaled down d_model from 1536 to 1024 and increased steps -> 3.059
-Changed from Muon to NorMuon -> 3.0396.
-Finally got extra value embeddings mixed into the V-matrix to work. Learnable scalar mixing of the embeddings in only the final two layers gave identical results to mixing in all layers, so I ended up going with that implementation (see the sketch below). -> 3.032
-Increased the number of training steps to the edge of the time limit. -> 3.0305.
I now measure validation loss against a small subset of the validation set during training, and compute the loss on the entire validation set after training, which is not included in the runtime. The run shown in the graph actually reached a validation loss of 3.0285, but to leave some margin for worse GPU performance I have decreased the number of steps by 200 for my submission, which gives a validation loss of 3.0305.
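
A rough sketch of what the learnable scalar mixing could look like, in the spirit of the NanoGPT speedrun value embeddings (general idea: https://arxiv.org/pdf/2410.17897): each of the final two attention layers blends its V projection with a per-token embedding via its own learnable scalar. The wiring, the 0.5 init and the per-layer embedding table are illustrative, not the submitted code:

```python
import torch
import torch.nn as nn

class ValueEmbedMixer(nn.Module):
    """Blend a per-token 'value embedding' into one attention layer's values.

    Intended to sit in the final two attention layers only; each layer gets its
    own learnable mixing scalar, while the embedding table could also be shared.
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.value_embed = nn.Embedding(vocab_size, d_model)
        self.mix = nn.Parameter(torch.tensor(0.5))   # learnable scalar, 0.5 init is illustrative

    def forward(self, v: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # v: (B, T, d_model) output of the layer's V projection
        # token_ids: (B, T) raw input token ids
        lam = self.mix
        return (1.0 - lam) * v + lam * self.value_embed(token_ids)
```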
