
Conversation


@suncade suncade commented Dec 10, 2025

📌 PR Description

  • Pause/resume for fine-tuning jobs: new /v1/finetuneTasks/{job_id}:pause and /v1/finetuneTasks/{job_id}:resume API endpoints
  • Checkpoint resumption: new resume-from-checkpoint logic, supporting continued training from a specified checkpoint
  • Training progress improvements: added and improved fine-tuning progress tracking and state management
  • Supported fine-tuning stages: sft, pt, dpo; supported fine-tuning types: lora, qlora, full
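
For illustration, the new pause/resume endpoints could be driven by a small client like the sketch below. Only the `:pause`/`:resume` paths come from this PR; the base URL, the use of plain POST requests, and the helper names are assumptions.

```python
# Sketch of a client for the pause/resume endpoints listed above.
# Only the ":pause" / ":resume" path suffixes come from this PR; the base
# URL, HTTP method, and helper names are illustrative assumptions.
import urllib.request

BASE = "http://localhost:8000/v1/finetuneTasks"  # host/port assumed

def action_url(job_id: str, action: str) -> str:
    # Custom-method style path: /v1/finetuneTasks/{job_id}:pause or :resume
    return f"{BASE}/{job_id}:{action}"

def send_action(job_id: str, action: str) -> int:
    assert action in ("pause", "resume")
    req = urllib.request.Request(action_url(job_id, action), method="POST")
    with urllib.request.urlopen(req) as resp:  # raises on HTTP error status
        return resp.status
```

A paused job would then be resumed with `send_action(job_id, "resume")`, picking training back up from the last checkpoint per the description above.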

✅ Type of Change

  • [ ] Bug fix (non-breaking change that fixes an issue)
  • [x] New feature (non-breaking change that adds functionality)
  • [ ] Refactor (no functionality change, code structure optimized)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation update (changes to docs only)
  • [ ] Performance optimization


suncade commented Dec 10, 2025

Supported stages: sft, pt, dpo
Supported finetuning_type values: lora, qlora, full
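
The supported values above can be captured in a small guard. The sets below mirror the comment exactly; the `validate_finetune_args` helper itself is hypothetical, not the PR's actual code.

```python
# Guard for the stage / finetuning_type values listed above.
# The sets mirror the comment; the validate helper itself is hypothetical.
SUPPORTED_STAGES = {"sft", "pt", "dpo"}
SUPPORTED_FINETUNING_TYPES = {"lora", "qlora", "full"}

def validate_finetune_args(stage: str, finetuning_type: str) -> None:
    if stage not in SUPPORTED_STAGES:
        raise ValueError(f"unsupported stage: {stage!r}")
    if finetuning_type not in SUPPORTED_FINETUNING_TYPES:
        raise ValueError(f"unsupported finetuning_type: {finetuning_type!r}")

validate_finetune_args("sft", "lora")  # OK: both values are supported
```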

@JingofXin (Collaborator) commented:

The PR description needs to be filled in completely.


JingofXin commented Dec 18, 2025

Verified two llamafactory tests locally, one pure language model and one multimodal model; training and inference both work.
Multimodal model works: (WeCom screenshot)
Pure language model works: (WeCom screenshot)


lazyllm.config.add('trainable_module_config_map_path', str, '', 'TRAINABLE_MODULE_CONFIG_MAP_PATH',
description='The default path for trainable module config map.')
lazyllm.config.add(
Contributor:

Don't make unrelated lint changes.

Contributor (Author):

Restored the original state and formatting.

Contributor (Author):

The line breaks were auto-formatted during the merge; they have now been restored to the original.


kw = kw or self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
# Get default args and merge with user-provided kw, with kw taking precedence
default_args = self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
Contributor:

This changes the original logic. Previously, if the user passed kw it was used as-is, and the defaults were used only when kw was empty; now the two are merged, with user-provided values overriding the defaults.
This kw is mainly used by _async_finetune and unused in other scenarios, so please confirm that the merge behavior matches expectations for the _async_finetune case.

Contributor (Author):

During self-testing I found that in different fine-tuning scenarios users may pass only some of the parameters, while certain other parameters are still required. So I changed to the current logic: user-passed parameters take priority, and any required parameters the user did not pass fall back to their defaults.
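
The behavioral difference under discussion can be shown with toy dictionaries; the keys and values below are illustrative stand-ins for what _get_train_or_deploy_args returns, not the actual defaults.

```python
# Old vs. new kw handling, with toy stand-ins for the defaults returned
# by _get_train_or_deploy_args. Keys and values here are illustrative only.
def old_behavior(kw, defaults):
    # Original: a non-empty user kw replaces the defaults entirely.
    return kw or dict(defaults)

def new_behavior(kw, defaults):
    # Changed: defaults are filled in; user-passed keys take precedence.
    return {**defaults, **kw} if kw else dict(defaults)

defaults = {"learning_rate": 1e-4, "num_epochs": 3}
user_kw = {"learning_rate": 5e-5}

old = old_behavior(user_kw, defaults)  # num_epochs is silently dropped
new = new_behavior(user_kw, defaults)  # num_epochs kept, learning_rate overridden
```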

kw = kw or self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
# Get default args and merge with user-provided kw, with kw taking precedence
default_args = self._get_train_or_deploy_args(mode, disable=['base_model', 'target_path'])
if kw:
Contributor:

kw = {**default_args, **kw} if kw else default_args

Contributor (Author):

Accepted; updated to this more concise one-liner.

@JingofXin (Collaborator) commented:

Anything not covered by this PR should be removed from the PR description.

@suncade suncade changed the title Add finetune 1stop 2pause 3resume 4finetune_cost 5finetune_progress and 6multi-gpu training or fine-tuning logic Add finetune stop pause resume finetune_cost finetune_progress multi-gpu training and fine-tuning logic etc. Dec 30, 2025
@suncade suncade changed the title Add finetune stop pause resume finetune_cost finetune_progress multi-gpu training and fine-tuning logic etc. Add finetune stop pause resume finetune_cost finetune_progress multi-gpu training and enhance model infer logic etc. Dec 30, 2025