feat(model): add bailing v2.6 model#46713

Open
kemuxiaozi000 wants to merge 5 commits into huggingface:main from kemuxiaozi000:add-bailing-hybrid

@kemuxiaozi000 kemuxiaozi000 commented Jun 17, 2026

What does this PR do?

This PR adds BailingHybrid (bailing v2.6 by InclusionAI), a hybrid linear-attention Mixture-of-Experts model. The
architecture combines:

  • Hybrid attention — a 1:7 layer ratio of full Multi-head Latent Attention (MLA) to Lightning Linear Attention, giving near-linear complexity. The pattern is driven by
    layer_group_size (every layer_group_size-th layer is full attention).
  • Multi-head Latent Attention (MLA) — DeepSeek-V3-style, with compressed KV cache via LoRA projections.
  • Lightning Linear Attention — based on SimpleGLA from flash-linear-attention, with a pure-PyTorch fallback when fla is not installed.
  • Mixture of Experts — 256 routed experts (8 active/token) with shared experts and group-limited greedy (noaux_tc) routing.
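
The hybrid layer pattern described above can be sketched in a few lines of plain Python. This is only an illustration, not the PR's code: the helper name `build_layer_types` and the string labels are assumptions, and rejecting non-positive group sizes is a choice made for this sketch. The rule itself, that layer `i` is full attention when `(i + 1) % layer_group_size == 0`, matches the conversion snippet quoted later in this review.

```python
from collections import Counter


def build_layer_types(num_hidden_layers: int, layer_group_size: int) -> list[str]:
    """Label each layer; every `layer_group_size`-th layer (1-indexed) is full attention."""
    # Assumption for this sketch: non-positive group sizes are rejected outright.
    if layer_group_size <= 0:
        raise ValueError("layer_group_size must be positive")
    return [
        "full_attention" if (i + 1) % layer_group_size == 0 else "linear_attention"
        for i in range(num_hidden_layers)
    ]


# Hypothetical sizes chosen for illustration; not taken from any checkpoint.
layer_types = build_layer_types(num_hidden_layers=16, layer_group_size=8)
counts = Counter(layer_types)
print(counts["full_attention"], counts["linear_attention"])  # 2 14, i.e. a 1:7 ratio
```

With `layer_group_size=8`, layers 7 and 15 (0-indexed) come out as full attention and the rest as linear attention.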

The model is implemented with the modular mechanism (modular_bailing_hybrid.py), inheriting from deepseek_v3, bamba, llama, and mixtral, so
modeling_bailing_hybrid.py is generated and stays in sync with those parents.

It exposes BailingHybridModel, BailingHybridForCausalLM, BailingHybridForSequenceClassification, and BailingHybridForTokenClassification, registered in the auto classes,
plus a checkpoint conversion mapping in conversion_mapping.py for loading the original BailingMoeV2_5 checkpoints.

Fixes # (issue)

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read
CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline and the
    Pull Request checks?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes according to the guidelines?
  • Did you write any new necessary tests?

Tests run

# Model tests
pytest tests/models/bailing_hybrid/test_modeling_bailing_hybrid.py

# Repo consistency
python utils/check_modular_conversion.py --files src/transformers/models/bailing_hybrid/modular_bailing_hybrid.py
python utils/check_auto.py
python utils/check_config_attributes.py
make style

▎ Note: AI assistance was used to help port and regenerate the modular/auto code; all changed lines were reviewed by me, and I can defend the change end-to-end.

Who can review?

text models: @ArthurZucker @Cyrilvallez

A few things to flag before you submit, given the strict policy in `CONTRIBUTING.md`:

1. **`Fixes # (issue)`** — you need to fill in the real issue number, or remove the line. The policy requires the model addition to be coordinated via an issue first. If there's
no issue/approval yet, open one before the PR.
2. **The "discussed/approved" checkbox** — left unchecked since I don't have a coordination link; add it once you have one.
3. **"Not a pure code agent PR"** — I checked it on the assumption that *you* reviewed every line and will run the tests yourself. If you haven't yet, run the model tests in a
torch-equipped environment first (the local `.venv` has no PyTorch, so I couldn't execute them — only static checks passed).
4. The doc example in `bailing_hybrid.md` shows a sample generation output that should be verified against the real checkpoint before claiming it.

Want me to write this into a file (e.g. `PR_DESCRIPTION.md`), or adjust the title/reviewers?

kemuxiaozi000 and others added 2 commits June 17, 2026 16:46
- _init_weights: replace in-place module.slope.copy_(...) with the
  init.copy_ primitive to satisfy the modeling-structure linter (TRF012),
  matching the pattern used by Bamba for buffer re-init. Edited the modular
  source and regenerated the modeling file.
- Add the contribution date stamp to the bailing_hybrid model card so the
  repository-consistency add_dates check passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46713&sha=6ffaff

Update the auto_docstring checkpoint, from_pretrained examples, and model
card to the 2.6-generation checkpoint inclusionAI/Ling-2.6-flash-base, and
drop the inaccurate trillion-parameter description (the referenced model is
the flash variant).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kemuxiaozi000 kemuxiaozi000 changed the title from "feat(model): add bailing v3 model" to "feat(model): add bailing v2.6 model" Jun 22, 2026
… HF checkpoint

Align the in-library naming with InclusionAI's published modeling file
(modeling_bailing_moe_v2_5.py): rename the module/dir to bailing2_5_moe,
all classes from BailingHybrid* to BailingMoeV2_5* (e.g.
BailingMoeV2_5ForCausalLM, BailingMoeV2_5Model, BailingMoeV2_5Config), and
set config.model_type = "bailing2_5_moe", following the qwen3_moe layout.

Updates the auto mappings, conversion mapping, toctree, model card, tests,
and check_config_attributes accordingly; the modeling file is regenerated
from the modular source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Rocketknight1

Member

cc @vasqu I think but let me know if you want me to assign someone else!

@vasqu

vasqu commented Jun 22, 2026

Collaborator

Nope will take it; first review probably tomorrow or the day after 🤗

The dir rename makes docs/source/en/model_doc/bailing2_5_moe.md a new path on
main, so the add_dates consistency check computes today's date; update the
stamp to match and pass CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, bailing2_5_moe

@vasqu vasqu left a comment

Did some initial comments but there are many moving parts re linear attention refactors so asking to be a bit patient 🙏

@@ -0,0 +1,72 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Suggested change
- <!--Copyright 2025 The HuggingFace Team. All rights reserved.
+ <!--Copyright 2026 The HuggingFace Team. All rights reserved.

for others as well please

model = AutoModelForCausalLM.from_pretrained(
    "inclusionAI/Ling-2.6-flash-base",
    device_map="auto",
    dtype=torch.bfloat16,

Suggested change
- dtype=torch.bfloat16,

shouldn't be needed, we use auto as default

Just to be sure: The tokenizers backend is used for this model so we don't need an entry to tokenization auto?


@auto_docstring(checkpoint="inclusionAI/Ling-2.6-flash-base")
@strict
class BailingMoeV2_5Config(PreTrainedConfig):

Imo we can move this to modular and inherit from something like deepseek v2/3?

@strict
class BailingMoeV2_5Config(PreTrainedConfig):
    r"""
    layer_group_size (`int`, *optional*, defaults to 8):

this should be set via layer types instead, we could handle this in the post init
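
One possible reading of this suggestion is to treat `layer_types` as the source of truth and only derive it from `layer_group_size` when it is not given. The sketch below uses a plain dataclass with hypothetical names (`HybridConfigSketch` and the `"full_attention"` / `"linear_attention"` labels); it is not the `PreTrainedConfig` API and not the code in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HybridConfigSketch:
    """Hypothetical stand-in for a model config; not a transformers class."""

    num_hidden_layers: int = 16
    layer_types: Optional[list[str]] = None
    layer_group_size: int = 8  # legacy knob, only consulted when layer_types is missing

    def __post_init__(self):
        # Derive the per-layer pattern only if the caller did not provide one.
        if self.layer_types is None:
            self.layer_types = [
                "full_attention" if (i + 1) % self.layer_group_size == 0 else "linear_attention"
                for i in range(self.num_hidden_layers)
            ]
        if len(self.layer_types) != self.num_hidden_layers:
            raise ValueError("layer_types must have one entry per hidden layer")


derived = HybridConfigSketch()
explicit = HybridConfigSketch(
    num_hidden_layers=2,
    layer_types=["linear_attention", "full_attention"],
)
```

In this arrangement downstream code would only ever read `layer_types`, so the grouping parameter stays a checkpoint-compatibility detail.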

Comment on lines +534 to +535
self.rotary_emb = BailingMoeV2_5RotaryEmbedding(config=config)
self.rotary_emb_linear = BailingMoeV2_5LinearRotaryEmbedding(config=config)

Yea that's why we should do one class instead tbh per layer type

past_key_values=past_key_values,
)

def _update_linear_attn_mask(self, attention_mask, past_key_values):

masking will also change, sorry many moving things #46738

if layer_group_size > 0:
    full_attn_layers = [i for i in range(num_hidden_layers) if (i + 1) % layer_group_size == 0]
    self_attn_renames = [
        WeightRenaming(rf"layers\.{i}\.attention\.", f"layers.{i}.self_attn.") for i in full_attn_layers

Hmm, imo I also don't mind to have the same naming internally. this is very awkward so would like to avoid this
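
For readers unfamiliar with the per-layer rename under discussion, here is a rough standalone illustration using only the standard `re` module. It does not use the `WeightRenaming` class from the PR; the helper `rename_full_attention_keys` and the example keys are made up for this sketch.

```python
import re


def rename_full_attention_keys(keys, num_hidden_layers, layer_group_size):
    """Rename `layers.<i>.attention.` to `layers.<i>.self_attn.` on full-attention layers only."""
    full_attn_layers = {i for i in range(num_hidden_layers) if (i + 1) % layer_group_size == 0}
    pattern = re.compile(r"layers\.(\d+)\.attention\.")

    def _swap(match):
        idx = int(match.group(1))
        if idx in full_attn_layers:
            return f"layers.{idx}.self_attn."
        return match.group(0)  # linear-attention layers keep their original name

    return [pattern.sub(_swap, key) for key in keys]


# Hypothetical checkpoint keys, for illustration only.
keys = [
    "model.layers.6.attention.query_key_value.weight",
    "model.layers.7.attention.q_a_proj.weight",
]
renamed = rename_full_attention_keys(keys, num_hidden_layers=8, layer_group_size=8)
print(renamed)
```

Only the key on layer 7 changes here, because `(7 + 1) % 8 == 0` marks it as a full-attention layer while layer 6 is linear attention.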

)


class BailingMoeV2_5ModelTester:

Please use the causal lm tester, qwen next should be a good pointer

"BambaConfig": ["attn_layer_indices"],
# layer_group_size builds `layer_types` in __post_init__ (and drives weight conversion); scoring_func/topk_method
# describe the router behavior the model hardcodes (sigmoid + noaux_tc), kept for checkpoint config compatibility.
"BailingMoeV2_5Config": ["layer_group_size", "scoring_func", "topk_method"],

Imo we can ignore these in transformers if not used at all - we still save those via kwargs but for a sole transformers model we likely don't need it then


3 participants