Releases: modelscope/ms-swift
Patch release v4.1.2
Full Changelog: v4.1.1...v4.1.2
Patch release v4.1.1
Full Changelog: v4.1.0...v4.1.1
v4.1.0
English Version
New Features
- Megatron-SWIFT
a. `mcore-bridge` has been split from ms-swift into an independent repository, providing megatron-core model definitions for state-of-the-art models: https://github.com/modelscope/mcore-bridge
b. Support for GRPO Router Replay via the `--router_replay_mode` parameter. Thanks to @XianlongLi from the CMB Tech team for the contribution.
c. Qwen3.5 removes the TP size restriction imposed by `num_query_groups`, with added support for CP, sequence packing, and multimodal MTP. Refer to the Qwen3.5 best practices: https://swift.readthedocs.io/zh-cn/latest/BestPractices/Qwen3_5-Best-Practice.html
d. New model support: GLM-5, DeepSeek-V3.2, and MiniMax-M2.5.
e. Support for the `muon` and `dist_muon` optimizers. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/muon.sh
f. Support for `--tuner_type lora_llm`, enabling LoRA training on the LLM component and full-parameter training on the ViT/Aligner. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/multimodal/lora_llm_vit_full
- RL
a. Support for the OPSD algorithm, with the ability to use the model being trained as its own teacher and to configure `teacher_prompt`. Refer to: https://swift.readthedocs.io/zh-cn/latest/Instruction/GKD.html#opsd-on-policy-self-distillation
b. Support for the REAL algorithm via the `--loss_type real` parameter. Thanks to @li2zhi from the CMB Tech team for the contribution.
c. Support for QLoRA GRPO. Refer to: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/qlora.sh
d. Added a clamp operation to the GRPO K3-KL computation for training stability.
e. Changed the default value of `top-k` from 50 to -1, and `top-p` from 0.95 to 1.
- Training
a. Improved support for YAML-based launch configurations. Refer to: https://github.com/modelscope/ms-swift/tree/main/examples/yaml
b. Added architecture documentation: https://swift.readthedocs.io/zh-cn/latest/Customization/Architecture.html
c. Added Metax support best practices: https://swift.readthedocs.io/zh-cn/latest/BestPractices/Metax-support.html
d. Added support for installing ms-swift via `uv`.
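The YAML launch support mentioned under Training above lets a full CLI invocation live in a config file instead of a long shell command. A hypothetical minimal config is sketched below; the keys mirror `swift sft` command-line flags, the model and dataset names are placeholders, and the exact launch command is documented in the linked examples directory:

```yaml
# train.yaml -- illustrative only; see examples/yaml for real configs
model: Qwen/Qwen3.5-9B      # placeholder model id
train_type: lora            # LoRA fine-tuning
dataset: my-org/my-dataset  # placeholder dataset id
num_train_epochs: 1
output_dir: output
```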
New Models
- Text-Only Models
a. MiniMax/MiniMax-M2.5
b. deepseek-ai/DeepSeek-V3.2
c. Alibaba-AAIG/YuFeng-XGuard-Reason-0.6B series (Thanks to @ciaoyizhen for the contribution)
- Multimodal Models
a. google/gemma-4-E2B-it series. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4/train.sh
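The clamp added to the GRPO K3-KL computation (item d under RL above) guards the K3 estimator, `exp(r) - r - 1` with `r` the reference-to-policy log-ratio, which explodes when a few outlier tokens produce a large ratio. A minimal sketch, where the function name and the clamp bound are illustrative assumptions rather than ms-swift's actual values:

```python
import math

def k3_kl(logp: float, ref_logp: float, clamp_max: float = 10.0) -> float:
    """K3 KL estimator: exp(r) - r - 1 with r = ref_logp - logp.

    The log-ratio is clamped before exponentiation so that a handful of
    outlier tokens cannot blow up the loss. The bound of 10.0 is an
    illustrative choice, not the value used by ms-swift.
    """
    log_ratio = max(min(ref_logp - logp, clamp_max), -clamp_max)
    return math.exp(log_ratio) - log_ratio - 1.0
```

The estimator is non-negative and zero exactly when the two log-probabilities agree, which is what makes it a convenient per-token KL penalty.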
What's Changed
- [docs] update arch docs by @Jintao-Huang in #8185
- [docs] update qwen3.5 best practice by @Jintao-Huang in #8189
- [docs] fix docs by @Jintao-Huang in #8191
- feat(megatron): add on_save callback to MegatronCallback by @inzamam-iqbal in #8187
- [model] support qwen3.5 mtp by @Jintao-Huang in #8194
- [bugfix] fix minimax 2.1 enable_tp by @Jintao-Huang in #8199
- [megatron] comcat mcore 016 by @Jintao-Huang in #8204
- feat: Add YuFeng XGuard template support for training by @ciaoyizhen in #8179
- [bugfix] fix num_query_groups by @Jintao-Huang in #8206
- [bugfix] fix max_shard_size transformers 5.x by @Jintao-Huang in #8209
- Fix for load_dataset function to restore ability to use custom loader by @gusario in #8184
- [megatron] support GLM-5 megatron by @Jintao-Huang in #8085
- [bugfix] fix kimi k2 by @Jintao-Huang in #8229
- [megatron] support deepseek-v3.2 by @Jintao-Huang in #8226
- [model] support minimax 2.5 by @Jintao-Huang in #8235
- [docs] support uv by @Jintao-Huang in #8190
- [bugfix] fix megatron kimi_k2 by @Jintao-Huang in #8238
- [docs] add swift 4.0 image by @Jintao-Huang in #8242
- [docs] compat npu megatron by @Jintao-Huang in #8244
- [bugfix] fix eval-generation-config json parse by @hjh0119 in #8246
- [bugfix] fix megatron grpo log completion-length by @hjh0119 in #8247
- [bugfix] fix callbacks by @Jintao-Huang in #8250
- [compat] compat transformers 5.3.0 by @Jintao-Huang in #8249
- [fix] update ascend communication and fix megatron issue by @jiaqiw09 in #8243
- [megatron] Qwen3.5 supports larger num_query_groups (mcore 0.16) by @Jintao-Huang in #8253
- [docs] update docs modelscope.ai by @Jintao-Huang in #8258
- [doc] update qwen3.5 best practice doc by @hjh0119 in #8255
- [bugfix] fix accelerator by @Jintao-Huang in #8261
- add metax best practices by @qq1243196045 in #8251
- [docs] fix metax docs index by @Jintao-Huang in #8264
- [bugfix] fix gkd load teacher by @hjh0119 in #8265
- [bugfix] Fix qwen3 omni image_patch_size by @Jintao-Huang in #8236
- [docs] fix metax docs by @Jintao-Huang in #8270
- Perf: avoid intermediate tensor allocs via in-place div & optimized top-k flow by @hjh0119 in #8268
- [megatonr] update padding_free check by @Jintao-Huang in #8274
- [bugfix] fix weight sync with vllm_enable_lora and resume_from_checkpoint by @hjh0119 in #8275
- [bugfix] update sync method for different backend by @jiaqiw09 in #8273
- [megatron] _get_param_groups compat mcore016 by @Jintao-Huang in #8278
- [bugfix] fix trl import vllm_ascend by @hjh0119 in #8280
- [bugfix] ignore max_length error by @Jintao-Huang in #8279
- fix npu hccl timeout by @addsubmuldiv in #8281
- [bugfix] fix tuner_type by @Jintao-Huang in #8283
- [megatron] qwen3.5 use megatron-core GDN by @Jintao-Huang in #8282
- [docs] update docs by @Jintao-Huang in #8292
- [doc] qwen3.5 moe grpo examples by @hjh0119 in #8302
- [bugfix] fix tie_word_embeddings seq_cls by @Jintao-Huang in #8297
- [bugfix] fix megatron mcore 015 qwen3_5 by @Jintao-Huang in #8311
- update npu fsdp example by @addsubmuldiv in #8308
- [bugfix] fix mtp rope by @Jintao-Huang in #8316
- [megatron] support qwen3_5 packing by @Jintao-Huang in #8313
- [bugfix] fix megatron grpo ris by @hjh0119 in #8321
- [bugfix] fix megatron gkd tp top-k by @hjh0119 in https://...
Patch release v4.0.4
Full Changelog: v4.0.3...v4.0.4
Patch release v4.0.3
Full Changelog: v4.0.2...v4.0.3
Patch release v4.0.2
Full Changelog: v4.0.1...v4.0.2
Patch release v4.0.1
Full Changelog: v4.0.0...v4.0.1
v4.0.0
English Version
New Features
- Architecture Optimization
a. Directory structure refactoring and dependency optimization with modular design to enhance architecture scalability and customizability.
b. Decoupling of `model_type` and `template` to simplify support for models with multiple templates under the same model_type.
c. Rewrote the Megatron-SWIFT training loop using megatron-core in place of the megatron-lm dependency (compatible with Ascend NPU).
- Megatron-SWIFT
a. New model support: Qwen3.5 series, GLM4.7-Flash, MiniMax-M2.1, OLMoE.
b. Embedding task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
c. Reranker task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
d. Added the `save_total_limit` parameter to automatically clean up older checkpoints while retaining the best-performing and most recent weights.
e. Added the `apply_wd_to_qk_layernorm` parameter for Qwen3-Next/Qwen3.5 to support applying weight decay to the QK layernorm.
f. LoRA for multimodal MoE models supports the `--target_modules all-router` configuration.
- RL
a. Support for the GDPO algorithm for advantage computation via the `--scale_rewards gdpo` parameter. (Thanks to @Auraithm)
b. GKD supports computing the KL on top-k logits to save memory, via the `--gkd_topk_logits` parameter.
c. GKD supports using a teacher server, avoiding the need to load the teacher model explicitly.
- Training
a. Added Muon-CLIP optimizer support. Training example: https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh (Thanks to @vx120)
b. Dependency updates: compatible with the latest dependencies, including Python 3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0, etc.
c. Optimized generative reranker lm_head computation to reduce memory usage.
d. FSDP2 supports enabling CPU offload; added DeepSpeed elastic support. (Thanks to @meichangsu1 from the CMB Tech team)
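Item (b) under RL above, computing the GKD KL on only the teacher's top-k logits, trades a little fidelity for a much smaller per-token logits tensor. A pure-Python sketch, under the assumption that both distributions are renormalized over the teacher's top-k support (the function name and renormalization scheme are illustrative; ms-swift's exact implementation may differ):

```python
import math

def topk_kl(teacher_logits, student_logits, k=4):
    """Approximate KL(teacher || student) using only the teacher's
    top-k logits. Both distributions are renormalized over the same
    top-k support, so only k logits per token need to be kept."""
    # indices of the teacher's k largest logits
    idx = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]

    def softmax_over(logits, indices):
        m = max(logits[i] for i in indices)  # for numerical stability
        exps = {i: math.exp(logits[i] - m) for i in indices}
        z = sum(exps.values())
        return {i: e / z for i, e in exps.items()}

    p = softmax_over(teacher_logits, idx)   # teacher, renormalized
    q = softmax_over(student_logits, idx)   # student, same support
    return sum(p[i] * math.log(p[i] / q[i]) for i in idx)
```

Because both sides share the teacher's top-k support, the result is zero when the distributions agree on that support and strictly positive otherwise.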
New Models
- Text-only Models
a. Qwen/Qwen3-Coder-Next
b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
c. MiniMaxAI/MiniMax-M2.1
d. Tencent-YouTu-Research/Youtu-LLM-2B
e. IQuestLab/IQuest-Coder-V1-40B-Instruct
f. allenai/OLMoE-1B-7B-0924-Instruct series (Thanks to @qianhao0713)
- Multi-modal Models
a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
b. Qwen3-VL-Embedding, Qwen3-VL-Reranker. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
c. deepseek-ai/DeepSeek-OCR-2
d. ZhipuAI/GLM-OCR
e. PaddlePaddle/PaddleOCR-VL-1.5
f. OpenBMB/MiniCPM-o-4_5
g. stepfun-ai/Step3-VL-10B
h. google/medgemma-4b-it series
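The `save_total_limit` behavior described under Megatron-SWIFT above amounts to checkpoint pruning along these lines; this is an illustrative sketch of the idea (function name, directory layout, and the best-checkpoint handling are assumptions), not ms-swift's code:

```python
import os
import re
import shutil

def prune_checkpoints(output_dir, save_total_limit, best_step=None):
    """Keep at most `save_total_limit` of the most recent checkpoints,
    always retaining the best-metric checkpoint (if known) in addition.
    Assumes checkpoint-<step> directory names and save_total_limit >= 1."""
    ckpts = sorted(
        (d for d in os.listdir(output_dir)
         if re.fullmatch(r"checkpoint-\d+", d)),
        key=lambda d: int(d.split("-")[1]),
    )
    keep = set(ckpts[-save_total_limit:])          # newest N checkpoints
    if best_step is not None:
        keep.add(f"checkpoint-{best_step}")        # never delete the best
    for d in ckpts:
        if d not in keep:
            shutil.rmtree(os.path.join(output_dir, d))
```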
What's Changed
- [misc] update swift patch_conv3d by @Jintao-Huang in #7320
- add npu megatron multi-node example by @addsubmuldiv in #7321
- [bugfix] fix megatron convert by @Jintao-Huang in #7323
- [model] Support Qwen3-VL-Embedding/Qwen3-VL-Reranker by @Jintao-Huang in #7329
- [reranker] refactor reranker by @Jintao-Huang in #7334
- [bugfix] fix video base64 torchcodec by @Jintao-Huang in #7338
- [bugfix] fix modelopt by @Jintao-Huang in #7339
- [docs] Update swift image 3.12 by @Jintao-Huang in #7332
- [bugfix] fix get_chunked_inputs slice by @hjh0119 in #7346
- fix find node ip by @tastelikefeet in #7350
- Fix multi-modal reranker doc by @tastelikefeet in #7354
- [bugfix] fix app_args by @Jintao-Huang in #7367
- [bugfix] fix qwen2_vl video by @Jintao-Huang in #7376
- [bugfix] fix vllm moe model load_weights by @hjh0119 in #7362
- [v4] refactor ms-swift v4 by @Jintao-Huang in #7238
- feat: support scale rewards "gdpo" by @Auraithm in #7348
- [infer] infer backend pt -> transformers by @Jintao-Huang in #7379
- [docs] update docs & update Copyright by @Jintao-Huang in #7384
- Fix device mismatch in _forward_qwen3_vl_or_qwen3_omni when computing visual_pos_masks by @yaqiangsun in #7372
- add npu qwen3-next example and warning of ep size by @addsubmuldiv in #7390
- [bugfix] fix deepseek_v3_1 thinking template by @Jintao-Huang in #7388
- [docs] update docs & update dataset 'loss' by @Jintao-Huang in #7402
- [bugfix] Fix ref adapters trainable params 0 by @Jintao-Huang in #7403
- [readme] update error timeline of news by @shizhengLi in #7404
- [bugfix] fix sp reranker by @Jintao-Huang in #7405
- [v4] fix ci by @Jintao-Huang in #7559
- [refactor] reorganize reward and rollout modules into dedicated direct… by @hjh0119 in #7397
- [grpo] speedup grpo train stage encode with concurrent by @Cccei000 in #7391
- Update the NPU-supported features table by @addsubmuldiv in #7562
- [bugfix] fix attn_impl by @Jintao-Huang in #7564
- [v4] refactor ms-swift v4 (pipelines/arguments/swiftmixin/callback/tuner_plugin) by @Jintao-Huang in #7385
- [bugfix] fix minimax tp by @Jintao-Huang in #7788
- fix inputs_embeds for hunyuanOCR by @slin000111 in #7803
- [bugfix] fix deepspeed distributed weight offload code by @Silas-11 in #7802
- [generative_reranker] generative reranker logits memory optimization by @Jintao-Huang in #7816
- update requirements by @Jintao-Huang in #7819
- [misc] update issue template by @Jintao-Huang in #7818
- [bugfix] fix dpo by @Jintao-Huang in #7824
- update wechat by @tastelikefeet in #7827
- [bugfix] fix deepspeed optimizer offload code by @Silas-11 in #7821
- [model] support glm4_moe_lite by @Jintao-Huang in #7829
- [bugfix] fix hunyuan ocr by @Jintao-Huang in #7831
- [megatron] support glm_moe_lite by @Jintao-Huang in #7833
- chore: epochs -> epoch by @zzc0430 in #7825
- [optimizer] Set loss mask to compute the loss for multi-turn reasoning by @Simon-ss7 in #7838
- [bugfix] fix recompute_granularity none by @Jintao-Huang in https://gi...
Patch release v3.12.6
Full Changelog: v3.12.5...v3.12.6
Patch release v3.12.5
Full Changelog: v3.12.4...v3.12.5