Refactor tiny-model generation scripts #5637
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 730c87629a
```python
inputs = processor.apply_chat_template(
    conversation=messages,
    add_generation_prompt=True,
```
Use non-chat fallback for VLM smoke inputs
`smoke_test` unconditionally calls `processor.apply_chat_template(...)` for every processor with an `image_processor`, but some VLM checkpoints in this refactor are not chat models (for example `google/paligemma-3b-pt-224`, whose script calls `smoke_test(model, processor)` in `paligemma_for_conditional_generation.py`). In that case `apply_chat_template` can fail due to missing/unsupported chat templates, so the script aborts before push even though the model itself is valid with regular `processor(text=..., images=...)` inputs.
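A minimal sketch of such a fallback, assuming `smoke_test`'s VLM branch already has `messages`, `image`, and `device` in scope (all names here are illustrative, not the PR's actual code):

```python
# Hypothetical fallback inside smoke_test's VLM branch: only use the chat
# template when the processor actually defines one.
if getattr(processor, "chat_template", None) is not None:
    inputs = processor.apply_chat_template(
        conversation=messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    ).to(device)
else:
    # Plain text+image path for non-chat checkpoints such as
    # google/paligemma-3b-pt-224.
    inputs = processor(text="a photo of", images=image, return_tensors="pt").to(device)
```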
I think we could address this in the future; `google/paligemma-3b-pt-224` is the only VLM that doesn't have a chat template.
```python
    return_dict=True,
    return_tensors="pt",
    padding=True,
).to(device)
```
`smoke_test` crashes for PaliGemma lacking chat template
Medium Severity
The `smoke_test` VLM branch unconditionally calls `processor.apply_chat_template(...)` for any `ProcessorMixin`. The PaliGemma script passes its processor to `smoke_test`, but `google/paligemma-3b-pt-224` has no chat template defined, so `apply_chat_template` will raise at runtime, making the PaliGemma generation script unusable. The PR discussion confirms PaliGemma is the only VLM without a chat template.
Reviewed by Cursor Bugbot for commit 5cc7fc8.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 4730fec.
```python
config = AutoConfig.from_pretrained(MODEL_ID, text_config=text_config, vision_config=vision_config)
model = PaliGemmaForConditionalGeneration(config).to(dtype=torch.float32)
smoke_test(model, processor)
```
PaliGemma `smoke_test` crashes due to missing chat template
Medium Severity
`smoke_test(model, processor)` passes a `ProcessorMixin` to `smoke_test`, which takes the VLM branch and calls `processor.apply_chat_template(...)`. PaliGemma (`google/paligemma-3b-pt-224`) has no chat template, so this call will crash at runtime. The PR discussion confirms this: "google/paligemma-3b-pt-224 is the only VLM that doesn't have a chat template." The `smoke_test` function needs a fallback path for VLM processors that lack a chat template.
Reviewed by Cursor Bugbot for commit 4730fec.
albertvillanova left a comment
Thanks.
You replaced 450 code lines with 2,765. Are you sure this is the right direction?
On the other hand, what if we want to regenerate all models? Or some specific family of models? After this PR there is no simple way other than running each model individually.
The `transformers` pin to the lowest supported version is repeated in every script. Why not default to the lowest supported version when no explicit version is set for a specific model?
Also, I think `create_pr` should be the default.
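A sketch of what that default might look like in `_common.py` (hypothetical; the helper name `check_transformers_version` comes from the PR description, the rest is assumed):

```python
import transformers

# Lowest transformers version supported by TRL CI (the "trl_floor" from the
# PR description).
TRL_TRANSFORMERS_FLOOR = "4.56.2"

def check_transformers_version(pinned: str | None = None) -> None:
    # Fall back to the floor when a script does not pin an explicit version.
    expected = pinned or TRL_TRANSFORMERS_FLOOR
    if transformers.__version__ != expected:
        raise RuntimeError(
            f"This script must run with transformers=={expected}, "
            f"found {transformers.__version__}."
        )
```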


Split tiny-model generation into per-model scripts
Replace the monolithic `scripts/generate_tiny_models.py` (one 437-line file, a big tuple-driven loop plus a growing `if issubclass(...)` ladder for VLMs) with a per-model layout:
- One script per tiny model. Each script is fully self-contained.
- Shared logic (push, smoke test, diff, weight init) stays in `_common.py`.

Why

The old ladder accumulated model-specific quirks (`out_hidden_size`, Qwen3-VL `layer_types` deletion, Qwen3.5 linear-attn fp32 cast, Gemma4 in-place mutation, llava-v1.6 dtype hotfix, …). With one file per model, each script reads top-to-bottom in 20–50 lines. Model-specific quirks stay scoped to the model that needs them.
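For illustration, a minimal sketch of the shape of one per-model script (the helper signatures and shrink kwargs are assumptions based on this description, not the PR's actual code):

```python
import torch
from transformers import AutoConfig, AutoTokenizer, Qwen3ForCausalLM

from _common import (
    check_dtype_pattern,
    check_transformers_version,
    print_config_diff,
    push,
    smoke_test,
)

MODEL_ID = "Qwen/Qwen3-0.6B"     # reference checkpoint (illustrative)
TRANSFORMERS_VERSION = "4.56.2"  # exact pin: max(introduction_version, trl_floor)

check_transformers_version(TRANSFORMERS_VERSION)

# Same architecture as the reference, tiny dimensions.
config = AutoConfig.from_pretrained(
    MODEL_ID,
    hidden_size=16,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    intermediate_size=32,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = Qwen3ForCausalLM(config).to(dtype=torch.float32)

smoke_test(model, tokenizer)          # minimal forward/generate pass
print_config_diff(MODEL_ID, model)    # flat-key diff vs. the reference config
check_dtype_pattern(MODEL_ID, model)  # dtype parity vs. reference safetensors
push(model, tokenizer)                # skips existing repos unless --create-pr
```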
New features added to every script

- `print_config_diff(MODEL_ID, model)` prints every flat-key difference between the reference Hub config and the tiny model's config before push. Makes it obvious when a shrink kwarg was silently ignored or when an unexpected field drifted.
- `check_dtype_pattern(MODEL_ID, model)` reads the reference safetensors header via the Hub API (no weight download) and flags any tensor whose dtype diverges from the reference. This catches cases like models with mixed-precision weights (e.g. fp32 norms inside a bf16 checkpoint); see the sketch after this list.
- Each script pins `TRANSFORMERS_VERSION = "X.Y.Z"` and calls `check_transformers_version(...)`, which raises unless the installed version matches exactly. The pinned value is `max(introduction_version, trl_floor=4.56.2)`. Rationale: transformers is backward-compatible (a checkpoint saved by X loads on any ≥ X) but not forward-compatible; TRL CI runs against the floor, so tiny models must be saved with the oldest version that supports them to avoid config-field drift. Exact match prevents accidental regenerations with a newer transformers from silently breaking min-version CI.
- `--create-pr` flag. When a tiny model already exists on the Hub, the default is to skip the push. Passing `--create-pr` opens a single PR against the existing repo instead (all artifacts bundled into one commit via `HfApi.create_commit`), so updates can be reviewed before landing on `main`.
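A sketch of how such a header-only dtype check can work, assuming `huggingface_hub.get_safetensors_metadata` (the PR's actual implementation may differ):

```python
from huggingface_hub import get_safetensors_metadata

# Safetensors dtype strings vs. torch dtype strings.
_ST_TO_TORCH = {"F32": "torch.float32", "F16": "torch.float16", "BF16": "torch.bfloat16"}

def check_dtype_pattern(model_id: str, model) -> None:
    # Fetch only the safetensors headers from the Hub; no weights downloaded.
    meta = get_safetensors_metadata(model_id)
    ref_dtypes = {
        name: info.dtype
        for file_meta in meta.files_metadata.values()
        for name, info in file_meta.tensors.items()
    }
    for name, param in model.named_parameters():
        ref = ref_dtypes.get(name)
        if ref is not None and _ST_TO_TORCH.get(ref) != str(param.dtype):
            print(f"dtype mismatch for {name}: reference={ref}, tiny={param.dtype}")
```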
How to run

See `scripts/generate_tiny_models/README.md` for full documentation.

Scope
This PR is refactor-only — the Python logic for each tiny model is preserved exactly. No tiny model on the Hub is regenerated by this PR; the existing Hub repos remain the source of truth for CI.
Follow-up PRs will use the new scripts to regenerate and push individual tiny models where the existing Hub checkpoint drifts from the reference (e.g. non-size config fields defaulting to wrong values, missing upstream-added fields, quantization parity, etc.). Each regeneration is one PR per tiny model, with a `refs/pr/N` override in `tests/conftest.py` until merged on the Hub.
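For reference, loading a checkpoint from an unmerged Hub PR revision looks like this (the repo id below is made up):

```python
from transformers import AutoModelForCausalLM

# Pin the test to the Hub PR revision until it is merged on the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "trl-internal-testing/tiny-Qwen3ForCausalLM",  # hypothetical tiny-model repo
    revision="refs/pr/1",
)
```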
Note

Low Risk

Primarily a tooling refactor under `scripts/` that doesn't affect library runtime, but it changes the Hub upload path and adds new validation steps that could alter how future tiny models are regenerated.

Overview
- Replaces the monolithic `scripts/generate_tiny_models.py` with a `scripts/generate_tiny_models/` package containing one script per tiny model grouped by task (`for_causal_lm`, `for_sequence_classification`, `for_conditional_generation`).
- Adds shared utilities in `_common.py` to standardize generation: exact `transformers` version pin checks, a minimal `smoke_test` forward pass, `check_dtype_pattern` against the reference safetensors metadata, and `print_config_diff` against the reference config.
- Updates Hub publishing to support `--create-pr` and to upload all artifacts (model, tokenizer/processor, generation config, model card) in a single commit via `HfApi.create_commit` (sketched below), with new documentation in `scripts/generate_tiny_models/README.md`.
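A sketch of that single-commit upload, assuming the artifacts have been saved to a local directory first (helper name and structure are assumptions):

```python
from pathlib import Path

from huggingface_hub import CommitOperationAdd, HfApi

def push_artifacts(repo_id: str, local_dir: str, create_pr: bool = False) -> None:
    # Bundle every saved artifact (weights, tokenizer/processor files,
    # generation config, model card) into one commit.
    operations = [
        CommitOperationAdd(path_in_repo=path.name, path_or_fileobj=str(path))
        for path in sorted(Path(local_dir).iterdir())
        if path.is_file()
    ]
    HfApi().create_commit(
        repo_id=repo_id,
        operations=operations,
        commit_message="Regenerate tiny model",
        create_pr=create_pr,  # open a reviewable Hub PR instead of pushing to main
    )
```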