[Prototype] Option to configure layers independently #168
base: main
Conversation
Hey @jlamypoirier. This introduces a lot of complexity, and based on your own comments, it is not yet user-friendly or fully supported. Given that this feature is not urgent, I'd prefer we leave it unmerged until the conversion and usability concerns are properly addressed. The immediate priority is LoRA. Thanks.
Agreed this is not entirely ready, but the feature is relatively small and locked behind an experimental flag, so there wouldn't be any harm in merging it so we can play with it until we have something better (we already have a need for it).
✨ Description
Fixes: #154, #155.
This PR proposes a simple way to obtain layer-dependent configuration by leveraging Fast-LLM's existing config update mechanism. It works by providing a "default" layer configuration (same as before), and optional overrides for specified layer ranges.
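For a rough idea of what this looks like, here is a minimal sketch of such a config. The `layers/default` split is from this PR, but the `overrides`, `range`, and `window_size` key names and the range semantics are assumptions for illustration, not necessarily the exact schema:

```yaml
layers:
  # Default layer configuration, applied to every layer (previously under `transformer`).
  default:
    num_layers: 32
    hidden_size: 4096
    window_size: null        # full attention everywhere by default
  # Assumed override list: each entry patches the default config for a range of layers.
  overrides:
    - range: [0, 16]         # assumed half-open range covering layers 0-15
      window_size: 4096      # sliding-window attention only on the early layers
```

This is the kind of use case that makes `max_window_layers` redundant, as noted below.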
See `tests/test_transformer.py` for examples. The thing works, but is admittedly far from perfect, and I do have some concerns about user-friendliness:

- `"normalization/epsilon": 1` overrides only the normalization epsilon, while `"normalization": {"epsilon": 1}` overrides the entire dict, i.e., everything other than `epsilon` reverts to its default value. This could be confusing and needs to be well documented (see the sketch after this list).
- `transformer` moves to `layers/default`, which adds a small amount of complexity when not using the feature. (We could probably revert that change though.)
- Overriding some fields (`num_layers`, `hidden_size`, `full_precision_residual`) doesn't really make sense. I left them as-is and added assertions, but we may want to think about moving them away from the layer config.
- `TensorSpace` wasn't designed for this kind of thing. I made a quick fix using a hierarchy of tensor spaces, but I'm not sure about its long-term viability.
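To make the first point above concrete, here is a sketch of the two override styles. The two `normalization` forms are taken from the description; the surrounding `overrides`/`range` structure and the `type` field are assumptions for illustration:

```yaml
layers:
  default:
    normalization:
      type: rms_norm          # assumed field, for illustration only
      epsilon: 1.0e-5
  overrides:
    # Path-style update: only `epsilon` changes; `type` keeps its value.
    - range: [0, 8]
      normalization/epsilon: 1
    # Dict-style update: the whole `normalization` dict is overridden, so everything
    # other than `epsilon` reverts to its default value.
    - range: [8, 16]
      normalization: {epsilon: 1}
```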
This feature removes the need for `max_window_layers`, but I kept it for now because of the likely conversion issues. @bigximik I also added back support for backup windowed attention and fixed the layer range by shifting the layer index (see comments in #157).

🔍 Type of change
Select all that apply: