Releases: modelscope/ms-swift
Patch release v4.1.2
Full Changelog: v4.1.1...v4.1.2
Patch release v4.1.1
Full Changelog: v4.1.0...v4.1.1
v4.1.0
English Version
New Features
- Megatron-SWIFT
a. `mcore-bridge` has been split from ms-swift into an independent repository, providing megatron-core model definitions for state-of-the-art models: https://github.com/modelscope/mcore-bridge
b. Support for GRPO Router Replay via the `--router_replay_mode` parameter. Thanks to @XianlongLi from the CMB Tech team for the contribution.
c. Qwen3.5 removes the TP size restriction imposed by `num_query_groups`, with added support for CP, sequence packing, and multimodal MTP. Refer to the Qwen3.5 best practices: https://swift.readthedocs.io/zh-cn/latest/BestPractices/Qwen3_5-Best-Practice.html
d. New model support: GLM-5, DeepSeek-V3.2, and MiniMax-M2.5.
e. Support for the `muon` and `dist_muon` optimizers. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/muon.sh
f. Support for `--tuner_type lora_llm`, enabling LoRA training on the LLM component and full-parameter training on the ViT/Aligner. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/multimodal/lora_llm_vit_full
- RL
a. Support for the OPSD algorithm, with the ability to use the model being trained as its own teacher and to configure `teacher_prompt`. Refer to: https://swift.readthedocs.io/zh-cn/latest/Instruction/GKD.html#opsd-on-policy-self-distillation
b. Support for the REAL algorithm via the `--loss_type real` parameter. Thanks to @li2zhi from the CMB Tech team for the contribution.
c. Support for QLoRA GRPO. Refer to: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/qlora.sh
d. Added a clamp operation to the GRPO K3-KL computation for training stability.
e. Changed the default value of `top-k` from 50 to -1, and `top-p` from 0.95 to 1.
- Training
a. Improved support for YAML-based launch configurations. Refer to: https://github.com/modelscope/ms-swift/tree/main/examples/yaml
b. Added architecture documentation: https://swift.readthedocs.io/zh-cn/latest/Customization/Architecture.html
c. Added Metax support best practices: https://swift.readthedocs.io/zh-cn/latest/BestPractices/Metax-support.html
d. Added support for installing ms-swift via `uv`.
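The YAML launch support mentioned under Training above lets a full CLI invocation live in a config file instead of a long shell command. A hypothetical minimal config is sketched below; the keys mirror `swift sft` command-line flags, the model and dataset names are placeholders, and the exact launch command is documented in the linked examples directory:

```yaml
# train.yaml -- illustrative only; see examples/yaml for real configs
model: Qwen/Qwen3.5-9B      # placeholder model id
train_type: lora            # LoRA fine-tuning
dataset: my-org/my-dataset  # placeholder dataset id
num_train_epochs: 1
output_dir: output
```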
New Models
- Text-Only Models
a. MiniMax/MiniMax-M2.5
b. deepseek-ai/DeepSeek-V3.2
c. Alibaba-AAIG/YuFeng-XGuard-Reason-0.6B series (Thanks to @ciaoyizhen for the contribution)
- Multimodal Models
a. google/gemma-4-E2B-it series. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4/train.sh
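The clamp added to the GRPO K3-KL computation (item d under RL above) guards the K3 estimator, `exp(r) - r - 1` with `r` the reference-to-policy log-ratio, which explodes when a few outlier tokens produce a large ratio. A minimal sketch, where the function name and the clamp bound are illustrative assumptions rather than ms-swift's actual values:

```python
import math

def k3_kl(logp: float, ref_logp: float, clamp_max: float = 10.0) -> float:
    """K3 KL estimator: exp(r) - r - 1 with r = ref_logp - logp.

    The log-ratio is clamped before exponentiation so that a handful of
    outlier tokens cannot blow up the loss. The bound of 10.0 is an
    illustrative choice, not the value used by ms-swift.
    """
    log_ratio = max(min(ref_logp - logp, clamp_max), -clamp_max)
    return math.exp(log_ratio) - log_ratio - 1.0
```

The estimator is non-negative and zero exactly when the two log-probabilities agree, which is what makes it a convenient per-token KL penalty.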
What's Changed
- [docs] update arch docs by @Jintao-Huang in #8185
- [docs] update qwen3.5 best practice by @Jintao-Huang in #8189
- [docs] fix docs by @Jintao-Huang in #8191
- feat(megatron): add on_save callback to MegatronCallback by @inzamam-iqbal in #8187
- [model] support qwen3.5 mtp by @Jintao-Huang in #8194
- [bugfix] fix minimax 2.1 enable_tp by @Jintao-Huang in #8199
- [megatron] comcat mcore 016 by @Jintao-Huang in #8204
- feat: Add YuFeng XGuard template support for training by @ciaoyizhen in #8179
- [bugfix] fix num_query_groups by @Jintao-Huang in #8206
- [bugfix] fix max_shard_size transformers 5.x by @Jintao-Huang in #8209
- Fix for load_dataset function to restore ability to use custom loader by @gusario in #8184
- [megatron] support GLM-5 megatron by @Jintao-Huang in #8085
- [bugfix] fix kimi k2 by @Jintao-Huang in #8229
- [megatron] support deepseek-v3.2 by @Jintao-Huang in #8226
- [model] support minimax 2.5 by @Jintao-Huang in #8235
- [docs] support uv by @Jintao-Huang in #8190
- [bugfix] fix megatron kimi_k2 by @Jintao-Huang in #8238
- [docs] add swift 4.0 image by @Jintao-Huang in #8242
- [docs] compat npu megatron by @Jintao-Huang in #8244
- [bugfix] fix eval-generation-config json parse by @hjh0119 in #8246
- [bugfix] fix megatron grpo log completion-length by @hjh0119 in #8247
- [bugfix] fix callbacks by @Jintao-Huang in #8250
- [compat] compat transformers 5.3.0 by @Jintao-Huang in #8249
- [fix] update ascend communication and fix megatron issue by @jiaqiw09 in #8243
- [megatron] Qwen3.5 supports larger num_query_groups (mcore 0.16) by @Jintao-Huang in #8253
- [docs] update docs modelscope.ai by @Jintao-Huang in #8258
- [doc] update qwen3.5 best practice doc by @hjh0119 in #8255
- [bugfix] fix accelerator by @Jintao-Huang in #8261
- add metax best practices by @qq1243196045 in #8251
- [docs] fix metax docs index by @Jintao-Huang in #8264
- [bugfix] fix gkd load teacher by @hjh0119 in #8265
- [bugfix] Fix qwen3 omni image_patch_size by @Jintao-Huang in #8236
- [docs] fix metax docs by @Jintao-Huang in #8270
- Perf: avoid intermediate tensor allocs via in-place div & optimized top-k flow by @hjh0119 in #8268
- [megatonr] update padding_free check by @Jintao-Huang in #8274
- [bugfix] fix weight sync with vllm_enable_lora and resume_from_checkpoint by @hjh0119 in #8275
- [bugfix] update sync method for different backend by @jiaqiw09 in #8273
- [megatron] _get_param_groups compat mcore016 by @Jintao-Huang in #8278
- [bugfix] fix trl import vllm_ascend by @hjh0119 in #8280
- [bugfix] ignore max_length error by @Jintao-Huang in #8279
- fix npu hccl timeout by @addsubmuldiv in #8281
- [bugfix] fix tuner_type by @Jintao-Huang in #8283
- [megatron] qwen3.5 use megatron-core GDN by @Jintao-Huang in #8282
- [docs] update docs by @Jintao-Huang in #8292
- [doc] qwen3.5 moe grpo examples by @hjh0119 in #8302
- [bugfix] fix tie_word_embeddings seq_cls by @Jintao-Huang in #8297
- [bugfix] fix megatron mcore 015 qwen3_5 by @Jintao-Huang in #8311
- update npu fsdp example by @addsubmuldiv in #8308
- [bugfix] fix mtp rope by @Jintao-Huang in #8316
- [megatron] support qwen3_5 packing by @Jintao-Huang in #8313
- [bugfix] fix megatron grpo ris by @hjh0119 in #8321
- [bugfix] fix megatron gkd tp top-k by @hjh0119 in https://...
Patch release v4.0.4
Full Changelog: v4.0.3...v4.0.4
Patch release v4.0.3
Full Changelog: v4.0.2...v4.0.3
Patch release v4.0.2
Full Changelog: v4.0.1...v4.0.2
Patch release v4.0.1
Full Changelog: v4.0.0...v4.0.1
v4.0.0
English Version
New Features
- Architecture Optimization
a. Directory structure refactoring and dependency optimization with modular design to enhance architecture scalability and customizability.
b. Decoupling of `model_type` and `template` to simplify support for models with multiple templates under the same model_type.
c. Rewrote the Megatron-SWIFT training loop using megatron-core in place of the megatron-lm dependency (compatible with Ascend NPU).
- Megatron-SWIFT
a. New model support: Qwen3.5 series, GLM4.7-Flash, MiniMax-M2.1, OLMoE.
b. Embedding task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
c. Reranker task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
d. Added the `save_total_limit` parameter to automatically clean up older checkpoints while retaining the best-performing and most recent weights.
e. Added the `apply_wd_to_qk_layernorm` parameter for Qwen3-Next/Qwen3.5 to support applying weight decay to the QK layernorm.
f. LoRA for multimodal MoE models supports the `--target_modules all-router` configuration.
- RL
a. Support for the GDPO algorithm for advantage computation via the `--scale_rewards gdpo` parameter. (Thanks to @Auraithm)
b. GKD supports computing the KL on top-k logits to save memory, via the `--gkd_topk_logits` parameter.
c. GKD supports using a teacher server, avoiding the need to load the teacher model explicitly.
- Training
a. Added Muon-CLIP optimizer support. Training example: https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh (Thanks to @vx120)
b. Dependency updates: compatible with the latest dependencies, including Python 3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0, etc.
c. Optimized generative reranker lm_head computation to reduce memory usage.
d. FSDP2 supports enabling CPU offload; added DeepSpeed elastic support. (Thanks to @meichangsu1 from the CMB Tech team)
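Item (b) under RL above, computing the GKD KL on only the teacher's top-k logits, trades a little fidelity for a much smaller per-token logits tensor. A pure-Python sketch, under the assumption that both distributions are renormalized over the teacher's top-k support (the function name and renormalization scheme are illustrative; ms-swift's exact implementation may differ):

```python
import math

def topk_kl(teacher_logits, student_logits, k=4):
    """Approximate KL(teacher || student) using only the teacher's
    top-k logits. Both distributions are renormalized over the same
    top-k support, so only k logits per token need to be kept."""
    # indices of the teacher's k largest logits
    idx = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]

    def softmax_over(logits, indices):
        m = max(logits[i] for i in indices)  # for numerical stability
        exps = {i: math.exp(logits[i] - m) for i in indices}
        z = sum(exps.values())
        return {i: e / z for i, e in exps.items()}

    p = softmax_over(teacher_logits, idx)   # teacher, renormalized
    q = softmax_over(student_logits, idx)   # student, same support
    return sum(p[i] * math.log(p[i] / q[i]) for i in idx)
```

Because both sides share the teacher's top-k support, the result is zero when the distributions agree on that support and strictly positive otherwise.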
New Models
- Text-only Models
a. Qwen/Qwen3-Coder-Next
b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
c. MiniMaxAI/MiniMax-M2.1
d. Tencent-YouTu-Research/Youtu-LLM-2B
e. IQuestLab/IQuest-Coder-V1-40B-Instruct
f. allenai/OLMoE-1B-7B-0924-Instruct series (Thanks to @qianhao0713)
- Multi-modal Models
a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
b. Qwen3-VL-Embedding, Qwen3-VL-Reranker. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
c. deepseek-ai/DeepSeek-OCR-2
d. ZhipuAI/GLM-OCR
e. PaddlePaddle/PaddleOCR-VL-1.5
f. OpenBMB/MiniCPM-o-4_5
g. stepfun-ai/Step3-VL-10B
h. google/medgemma-4b-it series
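The `save_total_limit` behavior described under Megatron-SWIFT above amounts to checkpoint pruning along these lines; this is an illustrative sketch of the idea (function name, directory layout, and the best-checkpoint handling are assumptions), not ms-swift's code:

```python
import os
import re
import shutil

def prune_checkpoints(output_dir, save_total_limit, best_step=None):
    """Keep at most `save_total_limit` of the most recent checkpoints,
    always retaining the best-metric checkpoint (if known) in addition.
    Assumes checkpoint-<step> directory names and save_total_limit >= 1."""
    ckpts = sorted(
        (d for d in os.listdir(output_dir)
         if re.fullmatch(r"checkpoint-\d+", d)),
        key=lambda d: int(d.split("-")[1]),
    )
    keep = set(ckpts[-save_total_limit:])          # newest N checkpoints
    if best_step is not None:
        keep.add(f"checkpoint-{best_step}")        # never delete the best
    for d in ckpts:
        if d not in keep:
            shutil.rmtree(os.path.join(output_dir, d))
```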
What's Changed
- [misc] update swift patch_conv3d by @Jintao-Huang in #7320
- add npu megatron multi-node example by @addsubmuldiv in #7321
- [bugfix] fix megatron convert by @Jintao-Huang in #7323
- [model] Support Qwen3-VL-Embedding/Qwen3-VL-Reranker by @Jintao-Huang in #7329
- [reranker] refactor reranker by @Jintao-Huang in #7334
- [bugfix] fix video base64 torchcodec by @Jintao-Huang in #7338
- [bugfix] fix modelopt by @Jintao-Huang in #7339
- [docs] Update swift image 3.12 by @Jintao-Huang in #7332
- [bugfix] fix get_chunked_inputs slice by @hjh0119 in #7346
- fix find node ip by @tastelikefeet in #7350
- Fix multi-modal reranker doc by @tastelikefeet in #7354
- [bugfix] fix app_args by @Jintao-Huang in #7367
- [bugfix] fix qwen2_vl video by @Jintao-Huang in #7376
- [bugfix] fix vllm moe model load_weights by @hjh0119 in #7362
- [v4] refactor ms-swift v4 by @Jintao-Huang in #7238
- feat: support scale rewards "gdpo" by @Auraithm in #7348
- [infer] infer backend pt -> transformers by @Jintao-Huang in #7379
- [docs] update docs & update Copyright by @Jintao-Huang in #7384
- Fix device mismatch in _forward_qwen3_vl_or_qwen3_omni when computing visual_pos_masks by @yaqiangsun in #7372
- add npu qwen3-next example and warning of ep size by @addsubmuldiv in #7390
- [bugfix] fix deepseek_v3_1 thinking template by @Jintao-Huang in #7388
- [docs] update docs & update dataset 'loss' by @Jintao-Huang in #7402
- [bugfix] Fix ref adapters trainable params 0 by @Jintao-Huang in #7403
- [readme] update error timeline of news by @shizhengLi in #7404
- [bugfix] fix sp reranker by @Jintao-Huang in #7405
- [v4] fix ci by @Jintao-Huang in #7559
- [refactor] reorganize reward and rollout modules into dedicated direct… by @hjh0119 in #7397
- [grpo] speedup grpo train stage encode with concurrent by @Cccei000 in #7391
- Update the NPU-supported features table by @addsubmuldiv in #7562
- [bugfix] fix attn_impl by @Jintao-Huang in #7564
- [v4] refactor ms-swift v4 (pipelines/arguments/swiftmixin/callback/tuner_plugin) by @Jintao-Huang in #7385
- [bugfix] fix minimax tp by @Jintao-Huang in #7788
- fix inputs_embeds for hunyuanOCR by @slin000111 in #7803
- [bugfix] fix deepspeed distributed weight offload code by @Silas-11 in #7802
- [generative_reranker] generative reranker logits memory optimization by @Jintao-Huang in #7816
- update requirements by @Jintao-Huang in #7819
- [misc] update issue template by @Jintao-Huang in #7818
- [bugfix] fix dpo by @Jintao-Huang in #7824
- update wechat by @tastelikefeet in #7827
- [bugfix] fix deepspeed optimizer offload code by @Silas-11 in #7821
- [model] support glm4_moe_lite by @Jintao-Huang in #7829
- [bugfix] fix hunyuan ocr by @Jintao-Huang in #7831
- [megatron] support glm_moe_lite by @Jintao-Huang in #7833
- chore: epochs -> epoch by @zzc0430 in #7825
- [optimizer] Set loss mask to compute the loss for multi-turn reasoning by @Simon-ss7 in #7838
- [bugfix] fix recompute_granularity none by @Jintao-Huang in https://gi...
Patch release v3.12.6
Full Changelog: v3.12.5...v3.12.6
Patch release v3.12.5
Full Changelog: v3.12.4...v3.12.5