Make the specified config parameters update the pretrained config #211
Conversation
else:
    expected_config["base_model"] = base_model_update

check_equal_nested(serialized_config, expected_config)
So I want to make sure I understand what this test validates. This is difficult because I don't understand what the different load-config values mean, and I suspect I'm not alone there.
As far as I can tell, we want to make sure that when we initialize a model from a saved config and some ad hoc updates, and also somehow specify how much of the saved config to load, then the final config reflects that choice correctly and precisely.
So, now drilling into the different load_config cases:

- architecture: is this supposed to ignore everything else, like training parameters?
- model: I guess this is architecture++, but may not include everything?
- fast_llm: pulls in everything?
- none: ignore everything?
Can you please explain all this? Otherwise I can't understand what the expected behaviour here is and if that all makes sense. Thanks
We're making a FastLLMModelConfig from a loaded one. Remember its structure:
class FastLLMModelConfig(Config):
base_model: BaseModelConfig
multi_stage: MultiStageConfig
distributed: DistributedConfig
Here, BaseModelConfig subclasses BaseModelArchitectureConfig, which defines the core model parameters, i.e. those that it doesn't make sense to override on an existing model.
So:

- fast_llm: load the whole thing and use it as the default.
- none: load nothing and use Fast-LLM's defaults.
- model: load the base model config, but use Fast-LLM defaults for multi_stage and distributed.
- architecture: load the architecture part of the base model config, but use Fast-LLM defaults for the rest.
Note that training parameters outside the model config are irrelevant; they aren't even in the checkpoint. Also, for Hugging Face there is no multi_stage or distributed either, so the Fast-LLM defaults are used either way (same for all non-architecture parameters at the moment).
Thanks @jlamypoirier!
This framing confirms the problem: we are encoding internal class structure into the external loading API. I don't think that's a good idea, and I don't think anybody cares to know what BaseModelArchitectureConfig is. We should just be able to say: "load the full thing, and let me override some fields."
Thanks for the contribution. I'm a bit at a loss here. Do we actually care what part of the config to load from disk? And can't we just say that overrides should always take precedence? I hope they do, because that is what I would expect.
You're saying there's no breaking change here, but there is, even if it's just a subtle one. Actually I don't mind breaking changes, and I think what's done here now is not going far enough.
So can we be more bold and just kill load_config? Let's always load full configs, and then apply an override if it exists. Very breaking change, but also much clearer behaviour.
Maybe I'm way off here, but instead of treating config loading almost like an interpreter pass, can we treat it like a pure transformation pipeline:
Each step produces a complete dict: no state, no mutation. Also, we can easily track where each field value came from (default, import, override).
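A toy version of that idea, with made-up field names, could look like this: each step is a pure function from dicts to a new dict, and the final precedence (defaults, then pretrained import, then explicit overrides) is just the order of composition.

from copy import deepcopy

def merge(lower: dict, higher: dict) -> dict:
    """Pure merge step: values in `higher` win; neither input is mutated."""
    merged = deepcopy(lower)
    merged.update(deepcopy(higher))
    return merged

defaults = {"hidden_size": 1024, "num_layers": 2}
pretrained = {"hidden_size": 2048, "num_layers": 24}   # imported from a checkpoint
user_overrides = {"num_layers": 12}                    # explicitly set by the user

# defaults -> pretrained import -> explicit overrides, each step a complete dict
resolved = merge(merge(defaults, pretrained), user_overrides)
assert resolved == {"hidden_size": 2048, "num_layers": 12}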
We cared a lot more before this PR. The architecture/base_model split was there in large part to make it possible to override some base model parameters when loading a pretrained model, but that's no longer relevant, so there isn't much reason for using it anymore.
This is approximately what I'm trying to do here, but remembering the actual transformation pipeline would be a bad idea:
Since we only care about reproducing the config or any of its sub-configs, and not about where parameters came from, the natural alternative is to compress the transformation pipeline into a smaller piece of information that is sufficient to do so. We were already somewhat doing that by keeping only non-default values, but here (in #205) I'm proposing to keep track of explicitly set parameters instead, which I think is a big improvement.
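As a rough illustration of the "explicitly set parameters" idea (not the actual #205 implementation; the dataclass and its fields are hypothetical): only the fields the user actually provided are recorded, and re-applying them on top of the defaults reproduces the config.

import dataclasses

@dataclasses.dataclass
class BaseModel:
    hidden_size: int = 1024
    num_layers: int = 12

def build(explicit: dict) -> tuple[BaseModel, dict]:
    """Build the config and remember exactly which fields were explicitly set."""
    return BaseModel(**explicit), dict(explicit)

config, explicitly_set = build({"num_layers": 24})
# Serializing only `explicitly_set` is enough to reproduce the config later,
# without storing every default or remembering how it was constructed.
assert BaseModel(**explicitly_set) == config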
Thanks for the great response, @jlamypoirier!
If we agree that …
It is bad, and I'm not arguing for that. I'm actually arguing for the opposite. All dependencies on config should be downstream from config resolution. Reproducing a subconfig should never require knowledge of global context. And that is ensured by treating config resolution as a pure, layered transformation: deterministic (same input leads to same output), side-effect free (no in-place mutation during validation or derivation), and context-independent (doesn't depend on being invoked from some special place).

That's the core idea I'm arguing for: make config resolution a clean, deterministic pipeline. After that, the resolved config is immutable and complete. No special knowledge or context needed. You can pass any sub-tree around with confidence.

Right now, the system interleaves validation, mutation, and derivation. It's hard to tell what happened when or why. Again, we should treat config resolution explicitly as:
Each layer is self-contained. No mutation. No inference of user intent. No implicit context dependencies. Once resolved, the config is just immutable data.

You mentioned tracking explicitly set fields. Sure, that's useful for cleaning up serialized configs. (And maybe this feature is overrated, because people only look at configs in Wandb, where the config is complete and things are manageable and easily discoverable thanks to search.) But cleaning up serialized configs doesn't address the real issue: users and developers need to understand how a config was built, not just what fields were manually set. Debugging is about causality, not just surface state.
That only holds if you trust the system to always get it right, with no surprises. Right now, though, we have confusion around defaults, derivations, and implicit mutations. A clean layering model eliminates that ambiguity and makes things easier to test, reason about, and serialize correctly. I'm not saying we need provenance tracking or transformation logs, just a clear separation between inputs, transformation, and outputs. Derived fields shouldn't be serialized. They should be recomputed from inputs every time, just like computed properties.

Clarity and compactness are not at odds. But compactness without clarity leads to brittle complexity. I think we are at that point now. Let's treat it as an opportunity to simplify.
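A small sketch of what "derived fields are recomputed, never serialized" could look like; the class and field names are hypothetical, not Fast-LLM code.

import dataclasses

@dataclasses.dataclass(frozen=True)  # the resolved config is immutable data
class AttentionConfig:
    hidden_size: int = 2048
    num_heads: int = 16

    @property
    def head_dim(self) -> int:
        # Derived from the inputs on every access, never stored.
        return self.hidden_size // self.num_heads

config = AttentionConfig(hidden_size=4096, num_heads=32)
assert config.head_dim == 128
assert "head_dim" not in dataclasses.asdict(config)  # only inputs get serialized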
To answer your comments and those in #205, I think we want the same thing, we just have a different view on how to get there.
The first steps make sense from the …

Core config system

The "compute derived fields" and "final resolution" parts are a lot more complicated than they look, and hide quite a few things that any good config system needs to address (here roughly in the order it's done in Fast-LLM):
So Fast-LLM's config system is really standard and follows the same basic principles as dataclasses, pydantic and omegaconf, with one major exception: cross-config dependencies are resolved in instantiated configs, before their final resolutions. There are good reasons for doing so:
The main drawback is that we're calling the method …
I'm not sure about this. I tried to keep things simple and follow the same idea as dataclasses, etc., which cram everything in … Maybe one initial step would be to convert derived fields to …

Main program, aka runnable

Now we can come back to making a runnable program from the core config system, i.e. turning user inputs into an actual config dict ready to be fed to the config system. I kept it separate in Fast-LLM, and I don't think it's controversial since it's the exact same thing as the omegaconf/hydra split. (Dataclasses and pydantic have no such equivalent, so we're really just comparing to hydra here.) What we have in Fast-LLM is a minimalistic config merging system, roughly similar to hydra, with resolution order:
That's it (and we usually skip the cli part). The defaults are implicitly first in the resolution order, but they are handled in the config instantiation, so they don't show up here. The yaml part is basically identical to hydra (with a lot fewer features), but the cli part differs a bit:
This runnable resolution system seems to be working fine, but I'm open to revisiting it if there is a better and/or simpler way.

Pretrained config

The core config and runnable described above are enough to make a generic and robust config system, but they don't really have a natural place for dealing with the pretrained model config. We could do it in the runnable part as you're suggesting, but there are good reasons not to. We need to be able to deal with pretrained models and their configs as self-contained objects. So we want pretrained configs to behave just like any normal config, i.e. like Hugging Face model configs, so we can play around freely with pretrained configs and models, e.g. for testing, debugging, interactive inference, making models on the fly, dealing with multiple pretrained models at once (e.g. a reference model for distillation), etc.

The need for a self-contained object complicates things quite a bit. What we've been doing so far (and are not planning on changing) is to define a composite config made of a model config and a checkpoint loading config, and to construct the combined model config (provided config + loaded config) during the cross-config dependencies step. This is enough to obtain a predictable config resolution order, though it needs to be documented somewhere.

The problem this PR attempts to fix is that the resulting resolution order, though predictable, was a bit too complicated and unintuitive because of limitations of the config system:
But now that we removed the limitations, we can do the following which makes much more sense:
All this PR is about is the order-reversing part. I agree we need to deal with the arbitrary-subset part, but we'll need to make sure it doesn't break existing work, so let's keep it for the next step.
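The "order reversing" can be summarized with two dict merges (illustrative field names, not the real config keys): before, within the loaded part of the config, the user-specified values were ignored in favour of the loaded ones; now the explicitly specified parameters update the pretrained config.

pretrained = {"hidden_size": 2048, "window_size": 512}   # loaded from the checkpoint
specified = {"window_size": 1024}                        # explicitly set by the user

old_order = {**specified, **pretrained}  # loaded config applied last: the override is silently lost
new_order = {**pretrained, **specified}  # specified parameters applied last: the override wins

assert old_order["window_size"] == 512
assert new_order["window_size"] == 1024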
✨ Description
Fixes #170
The specified model config now overrides the parameters in the pretrained config, instead of the load_config field determining whether they are used or discarded. This allows loading the full config from the pretrained model while also overriding some of the fields.

This is a big step towards #166, with some caveats:

- load_config remains architecture by default. It needs to be set explicitly to model or fast_llm for the full config to be loaded. Changing the default would be difficult because of backward compatibility issues.
- This is technically a breaking change, in the sense that fields that are no longer ignored may lead to a change in behaviour, but defining them was a bad idea in the first place and I don't expect any problems in practice.
🔍 Type of change
Select all that apply: