provide the ability to unload the LM prior to the DiT execution phase #1090
5kft wants to merge 5 commits into ace-step:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review; this behavior can be changed in the CodeRabbit settings.
📝 Walkthrough
Adds guarded LM-unload helper and conditional unload-before-DiT logic in inference; extends LLMHandler to persist the last successful init config, add reload_last_configuration(), and harden unload/accelerator-cache cleanup with safer exception handling and extra CUDA calls.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Inference as Inference Engine
    participant LLM as LLMHandler
    participant CUDA as CUDA Runtime
    participant DiT as DiT Generator
    Inference->>LLM: check request_needs_lm
    alt LM required but not initialized
        Inference->>LLM: reload_last_configuration()
        LLM-->>Inference: (status, success)
    end
    alt ACESTEP_UNLOAD_LM_BEFORE_DIT enabled and backend in {pt,vllm}
        Inference->>LLM: _unload_lm_before_dit()
        LLM->>LLM: log pre-unload memory
        LLM->>LLM: unload() (guarded)
        LLM->>CUDA: torch.cuda.synchronize() (guarded)
        LLM->>CUDA: torch.cuda.empty_cache() (guarded)
        LLM->>CUDA: torch.cuda.ipc_collect() (guarded)
        LLM->>LLM: gc.collect()
        LLM->>LLM: log post-unload memory
    else
        Inference->>LLM: skip unload (unsupported backend or disabled)
    end
    Inference->>DiT: run DiT generation (phase 2)
    DiT-->>Inference: generated output
```
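For illustration, a minimal Python sketch of an unload helper following this flow. The name `_unload_lm_before_dit` and the `unload()` / `llm_initialized` members come from the walkthrough above; the logging format and guard details are assumptions, not the PR's exact code.

```python
import gc

import torch
from loguru import logger


def _unload_lm_before_dit(llm_handler) -> None:
    """Sketch: release the 5Hz LM so the DiT phase has the VRAM to itself."""
    if llm_handler is None or not getattr(llm_handler, "llm_initialized", False):
        return
    if torch.cuda.is_available():
        logger.info(
            "[LM handoff unload] before: allocated={:.2f} GiB reserved={:.2f} GiB",
            torch.cuda.memory_allocated() / 2**30,
            torch.cuda.memory_reserved() / 2**30,
        )
    try:
        llm_handler.unload()
    except Exception as exc:
        logger.exception("[LM handoff unload] unload() failed: {}", exc)
        return
    gc.collect()
    if torch.cuda.is_available():
        # Run each CUDA cleanup step individually so a failure is attributable.
        for step in (torch.cuda.synchronize, torch.cuda.empty_cache, torch.cuda.ipc_collect):
            try:
                step()
            except Exception as exc:
                logger.error("[LM handoff unload] {} failed: {}", step.__name__, exc)
        logger.info(
            "[LM handoff unload] after: allocated={:.2f} GiB reserved={:.2f} GiB",
            torch.cuda.memory_allocated() / 2**30,
            torch.cuda.memory_reserved() / 2**30,
        )
```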
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Around line 355-368: The LM cleanup block currently swallows all exceptions;
instead catch and log failures for each CUDA call so cleanup errors are visible:
wrap torch.cuda.synchronize(), torch.cuda.empty_cache(), and
torch.cuda.ipc_collect() so any Exception is caught and reported with
loguru.logger.error/exception including which step failed (e.g., "synchronize
failed", "empty_cache failed", "ipc_collect failed") and the exception details;
also replace the outer bare except: pass with a logged error using
loguru.logger.exception and do not silently ignore the failure. Use the existing
symbols torch.cuda.synchronize, torch.cuda.empty_cache, torch.cuda.ipc_collect
and loguru.logger (or the module's logger alias) when adding these logs.
- Around line 663-665: The unconditional unload-before-DiT can trigger MLX
reload failures; restrict the call to only supported backends. Modify the block
that calls _unload_lm_before_dit(llm_handler) so it first checks the LM backend
(e.g., llm_handler.backend or llm_handler.backend_name) and only calls
_unload_lm_before_dit when the backend is one of {"vllm", "pt"} and
ACESTEP_UNLOAD_LM_BEFORE_DIT is truthy; reference LLMHandler.initialize behavior
in comments if needed to justify the guard.
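A small sketch of the guarded call site both inference.py comments above describe; the flag parsing, the `backend` attribute name, and the helper names are assumptions (the logged cleanup itself is sketched after the sequence diagram).

```python
import os

from loguru import logger

SAFE_UNLOAD_BACKENDS = {"vllm", "pt"}  # backends assumed safe to unload and later reload


def _unload_before_dit_enabled() -> bool:
    # Treat only explicit opt-in values of the env flag as truthy.
    return os.environ.get("ACESTEP_UNLOAD_LM_BEFORE_DIT", "").strip().lower() in {"1", "true", "yes"}


def maybe_unload_lm_before_dit(llm_handler) -> None:
    if llm_handler is None or not _unload_before_dit_enabled():
        return
    backend = getattr(llm_handler, "backend", None)  # attribute name assumed
    if backend in SAFE_UNLOAD_BACKENDS:
        _unload_lm_before_dit(llm_handler)  # helper sketched earlier
    else:
        logger.info("[generate_music] Skipping LM unload before DiT for unsupported backend={}", backend)
```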
In `@acestep/llm_inference.py`:
- Around line 149-160: The current try/except blocks around
torch.cuda.synchronize(), torch.cuda.empty_cache(), and torch.cuda.ipc_collect()
swallow errors; replace them by calling the existing _clear_accelerator_cache()
helper (which centralizes cleanup and logging) or at minimum catch exceptions
and log them via loguru.logger.error including the exception info and context
(e.g., which cuda call failed) so failures during unload are visible; ensure you
reference and use _clear_accelerator_cache(), torch.cuda.synchronize(),
torch.cuda.empty_cache(), torch.cuda.ipc_collect(), and loguru.logger in the
change.
- Around line 740-747: The current code sets self._last_init_config before the
backend is actually initialized, causing MLX early returns and vLLM→PyTorch
fallbacks to record incorrect reload state; remove or defer setting
self._last_init_config here and instead call the helper
_remember_successful_init() immediately before each successful return path
(specifically before the MLX early return points and before the vLLM and PyTorch
success returns) so the persisted config reflects the backend that actually
initialized; ensure reload_last_configuration() will then replay the correct
backend/state.
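A rough sketch of the two llm_inference.py suggestions above. The method and attribute names (`_clear_accelerator_cache`, `_remember_successful_init`, `_last_init_config`) are taken from the comments; the bodies and backend strings are assumptions.

```python
from loguru import logger


class LLMHandler:  # illustrative fragment, not the real class
    def unload(self) -> None:
        # ... drop model/tokenizer references here ...
        # Delegate to the device-aware helper instead of CUDA-first try/except blocks.
        try:
            self._clear_accelerator_cache()
        except Exception as exc:
            logger.error("[LLM] accelerator cache cleanup failed during unload: {}", exc)

    def _remember_successful_init(self, backend: str, **config) -> None:
        # Persist the config only for the backend that actually came up.
        self._last_init_config = {"backend": backend, **config}

    def initialize(self, backend: str, **config) -> None:
        if backend == "mlx":
            # ... MLX setup ...
            self._remember_successful_init("mlx", **config)  # record at the early-return success path
            return
        if backend == "vllm":
            try:
                # ... vLLM setup ...
                self._remember_successful_init("vllm", **config)
                return
            except Exception as exc:
                logger.warning("[LLM] vLLM init failed, falling back to pt: {}", exc)
        # ... PyTorch setup ...
        self._remember_successful_init("pt", **config)
```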
It looks like this issue is related: #1085
Actionable comments posted: 1
♻️ Duplicate comments (2)
acestep/llm_inference.py (1)
164-186: ⚠️ Potential issue | 🟠 Major: Use the device-aware cache helper here, not a CUDA-first cleanup block.
This path now bypasses `_clear_accelerator_cache()`. On mixed CUDA+XPU hosts, `torch.cuda.is_available()` wins even when the LM was on XPU, so the XPU cache never gets cleared. The nested `except Exception: pass` also hides the failures you need when unload does not actually free memory.
🛠️ One way to tighten this:
```diff
         gc.collect()
-
-        if torch.cuda.is_available():
-            try:
-                torch.cuda.synchronize()
-            except Exception:
-                pass
-            try:
-                torch.cuda.empty_cache()
-            except Exception:
-                pass
-            try:
-                torch.cuda.ipc_collect()
-            except Exception:
-                pass
-        elif hasattr(torch, "mps") and torch.backends.mps.is_available():
-            if hasattr(torch.mps, "synchronize"):
-                torch.mps.synchronize()
-            if hasattr(torch.mps, "empty_cache"):
-                torch.mps.empty_cache()
-        elif hasattr(torch, "xpu") and torch.xpu.is_available():
-            torch.xpu.empty_cache()
-            torch.xpu.synchronize()
+        self._clear_accelerator_cache()
+        if (
+            str(getattr(self, "device", "")).split(":")[0] == "cuda"
+            and torch.cuda.is_available()
+            and hasattr(torch.cuda, "ipc_collect")
+        ):
+            try:
+                torch.cuda.ipc_collect()
+            except RuntimeError as exc:
+                logger.warning("[LLM] torch.cuda.ipc_collect() failed during unload: {}", exc)
```
As per coding guidelines, do not alter non-target hardware/runtime paths unless explicitly required, and handle errors explicitly with `loguru.logger`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/llm_inference.py` around lines 164 - 186, Current cleanup bypasses _clear_accelerator_cache() and prefers CUDA even when the model ran on XPU; replace the explicit CUDA-first cleanup block in llm_inference.py with a single call to _clear_accelerator_cache() (or delegate to that helper) so device-aware cache clearing is used, and remove the nested bare excepts; if you must catch exceptions around device-specific calls (inside _clear_accelerator_cache() or here), log them using loguru.logger.error with context instead of using except Exception: pass so failures surface during unload; reference the existing symbols _clear_accelerator_cache(), torch.cuda.is_available(), torch.mps, torch.xpu, and loguru.logger when making the change.
acestep/inference.py (1)
355-368: ⚠️ Potential issue | 🟡 Minor: Don't swallow post-unload CUDA cleanup failures.
If `synchronize()`, `empty_cache()`, or `ipc_collect()` fails here, the helper still looks like a successful VRAM release and later DiT OOMs become much harder to diagnose.
As per coding guidelines, handle errors explicitly and log errors with `loguru.logger`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/inference.py` around lines 355 - 368, The CUDA cleanup block currently swallows all exceptions, masking failures in torch.cuda.synchronize(), torch.cuda.empty_cache(), torch.cuda.ipc_collect(), and subsequent memory queries; replace the broad "except Exception: pass" behavior with explicit error handling that logs any thrown exception using loguru.logger (e.g., logger.exception or logger.error with exception info) so failures are visible, and only suppress non-critical errors after logging; keep the existing memory reporting (alloc/reserved) but ensure any exceptions from memory_allocated()/memory_reserved() are also caught and logged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Around line 663-672: The guard currently only checks use_lm before unloading,
but use_lm reflects whether this request ran the LM phase and can miss a
resident model from a prior request; change the condition to check the actual
resident state on llm_handler instead of use_lm (e.g., replace "if use_lm and
unload_enabled:" with a check like "if unload_enabled and llm_handler is not
None and getattr(llm_handler, 'is_resident', getattr(llm_handler,
'model_loaded', False)):"), then keep the existing safe_backend check and call
to _unload_lm_before_dit(llm_handler); also consider adding a note or code path
to lazily load models when needed to minimize VRAM.
---
Duplicate comments:
In `@acestep/inference.py`:
- Around line 355-368: The CUDA cleanup block currently swallows all exceptions,
masking failures in torch.cuda.synchronize(), torch.cuda.empty_cache(),
torch.cuda.ipc_collect(), and subsequent memory queries; replace the broad
"except Exception: pass" behavior with explicit error handling that logs any
thrown exception using loguru.logger (e.g., logger.exception or logger.error
with exception info) so failures are visible, and only suppress non-critical
errors after logging; keep the existing memory reporting (alloc/reserved) but
ensure any exceptions from memory_allocated()/memory_reserved() are also caught
and logged.
In `@acestep/llm_inference.py`:
- Around line 164-186: Current cleanup bypasses _clear_accelerator_cache() and
prefers CUDA even when the model ran on XPU; replace the explicit CUDA-first
cleanup block in llm_inference.py with a single call to
_clear_accelerator_cache() (or delegate to that helper) so device-aware cache
clearing is used, and remove the nested bare excepts — if you must catch
exceptions around device-specific calls (inside _clear_accelerator_cache() or
here), log them using loguru.logger.error with context instead of using except
Exception: pass so failures surface during unload; reference the existing
symbols _clear_accelerator_cache(), torch.cuda.is_available(), torch.mps,
torch.xpu, and loguru.logger when making the change.
…check unload based upon resident LM state
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Line 335: Add a short descriptive docstring to the newly introduced helper
function _unload_lm_before_dit(llm_handler) describing its purpose, the
parameter (llm_handler) and any important behavior or side effects (e.g., that
it unloads or cleans up the language model before DIT operations). Keep the
style consistent with other functions in the module (brief one- or two-line
summary plus a param description if used) to satisfy the project docstring
guideline.
- Around line 468-475: In generate_music, when request_needs_lm is true and
llm_handler.reload_last_configuration() returns reload_ok == False, fail fast
instead of silently proceeding without LM: replace the current log-only behavior
by aborting the request (e.g., raise a RuntimeError or return an error response)
that includes reload_status so callers know LM reload failed; make this change
in the block that checks llm_handler.llm_initialized and calls
llm_handler.reload_last_configuration() (reference symbols: generate_music,
request_needs_lm, llm_handler, reload_last_configuration, reload_status).
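A docstring sketch per the first comment above; the wording is a placeholder to be adapted to the module's docstring style.

```python
def _unload_lm_before_dit(llm_handler) -> None:
    """Unload the 5Hz LM and clear accelerator caches before the DiT phase runs.

    Args:
        llm_handler: Shared LLMHandler instance; the call is a no-op when it is
            None or the LM is not currently initialized.

    Side effects:
        Releases the LM weights and frees GPU memory, so a later request that
        needs the LM must reload it (e.g. via reload_last_configuration()).
    """
```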
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Around line 469-482: The current logic attempts a reload whenever
request_needs_lm and llm_handler exists but llm_initialized is false; change
this to only try reload (and fail-fast) when the unload-before-DIT feature is
explicitly enabled or when the handler can confirm a saved config exists: add a
guard that checks the ACESTEP_UNLOAD_LM_BEFORE_DIT flag (or a new
llm_handler.has_saved_configuration() method) before calling
llm_handler.reload_last_configuration(), and only treat reload failure as a
request-level abort in that guarded path; leave the existing default path (no
reload attempt) when the flag is false and no saved config is reported.
- Around line 675-679: The unload path in generate_music leaves the shared
llm_handler unloaded while understand_music, create_sample, and format_sample
still short-circuit on not llm_handler.llm_initialized; change those functions
to reuse the same "ensure LM ready" logic (or extract a shared helper, e.g.,
ensure_llm_ready()) that checks llm_handler.llm_initialized and calls
reload_last_configuration() when needed instead of returning early, and update
generate_music to call that same helper after _unload_lm_before_dit so all
LM-only APIs lazily reload the model when required (reference: generate_music,
understand_music, create_sample, format_sample, _unload_lm_before_dit,
reload_last_configuration, llm_handler.llm_initialized).
```python
    if unload_enabled and llm_handler is not None and llm_handler.llm_initialized:
        if current_backend in safe_unload_backends:
            _unload_lm_before_dit(llm_handler)
        else:
            logger.info("[generate_music] Skipping LM unload before DiT for unsupported backend={}", current_backend)
```
This unload path strands the other LM inference APIs.
After Line 677 unloads the shared handler, understand_music(), create_sample(), and format_sample() in this same module still short-circuit on not llm_handler.llm_initialized and never call reload_last_configuration(). With the flag enabled, a generate_music() call can therefore make the next LM-only request fail until something else reinitializes the model. Please reuse the same “ensure LM ready” path there, or centralize it in a shared helper.
Based on learnings, acestep/inference.py should lazy load models when needed to minimize memory allocations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/inference.py` around lines 675 - 679, The unload path in
generate_music leaves the shared llm_handler unloaded while understand_music,
create_sample, and format_sample still short-circuit on not
llm_handler.llm_initialized; change those functions to reuse the same "ensure LM
ready" logic (or extract a shared helper, e.g., ensure_llm_ready()) that checks
llm_handler.llm_initialized and calls reload_last_configuration() when needed
instead of returning early, and update generate_music to call that same helper
after _unload_lm_before_dit so all LM-only APIs lazily reload the model when
required (reference: generate_music, understand_music, create_sample,
format_sample, _unload_lm_before_dit, reload_last_configuration,
llm_handler.llm_initialized).
Technically, yes, this is a lifecycle inconsistency. However, it is not clear that these other APIs would even be hit from the normal UI flow here. I don't know enough at this point to determine if this is a real problem or not...
🧩 Analysis chain
Let me check how those functions are wired into the Gradio UI to give you a concrete answer.
🏁 Script executed:
```bash
#!/bin/bash
echo "=== understand_music usages ==="
rg -n "understand_music" --type py -C2
echo ""
echo "=== create_sample usages ==="
rg -n "create_sample" --type py -C2
echo ""
echo "=== format_sample usages ==="
rg -n "format_sample" --type py -C2
```
Length of output: 66675
@5kft, the search results give a concrete answer: yes, this is a real (though narrow) issue in the normal Gradio UI flow.
Here's what was found:
| Function | Called from UI | How |
|---|---|---|
| `understand_music` | ✅ Yes | llm_analysis_actions.py ("Analyze Audio" button), metadata_loading.py (load metadata event) |
| `create_sample` | ✅ Yes | llm_sample_actions.py → wired to "Create Sample" button click in generation_mode_wiring.py |
| `format_sample` | ✅ Yes | llm_format_actions.py → "Format Caption/Lyrics" buttons |
All three are separate clickable buttons in the same Gradio session sharing the same llm_handler. So the failure scenario is:
- User clicks Generate (with `ACESTEP_UNLOAD_LM_BEFORE_DIT=1`) → LM unloaded after DiT handoff ✅
- User then clicks Create Sample, Format, or Analyze in the same session → `llm_handler.llm_initialized` is `False` → silent failure or error ❌
Notably, metadata_loading.py line 257 just silently skips with `if llm_handler.llm_initialized:` rather than surfacing any error.
Severity is bounded because:
- The flag `ACESTEP_UNLOAD_LM_BEFORE_DIT=1` must be explicitly opted into
- The `generate_music` reload path already works correctly
- These auxiliary actions are optional helpers
The fix would be adding the same reload guard (check _last_init_config, call reload_last_configuration()) in the three functions before the not llm_handler.llm_initialized early-return, or extracting a shared helper. Would you like me to plan those changes?
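For reference, a sketch of the shared-helper route discussed above; `ensure_llm_ready` is the name floated in the review, and the call-site shape is an assumption.

```python
from loguru import logger


def ensure_llm_ready(llm_handler) -> bool:
    """Sketch: lazily reload the LM if it was unloaded after a DiT handoff."""
    if llm_handler is None:
        return False
    if llm_handler.llm_initialized:
        return True
    reload_status, reload_ok = llm_handler.reload_last_configuration()
    if not reload_ok:
        logger.error("[LM] reload before LM-only request failed: {}", reload_status)
    return reload_ok


# understand_music/create_sample/format_sample would then start with something like:
#     if not ensure_llm_ready(llm_handler):
#         return error_payload("5Hz LM is not available")  # error_payload is hypothetical
```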
🧠 Learnings used
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-20T01:31:33.304Z
Learning: Applies to acestep/handler.py,acestep/inference.py : Use batch operations supported up to 8 songs for audio generation
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-24T04:42:41.285Z
Learning: Applies to acestep/{handler,inference,llm_inference}.py : Lazy load models when needed to minimize memory allocations
Learnt from: chigkim
Repo: ace-step/ACE-Step-1.5 PR: 362
File: cli.py:982-982
Timestamp: 2026-02-08T22:50:13.134Z
Learning: In cli.py, the --backend argument defaults to "vllm" on non-MPS systems. However, the actual fallback from "vllm" to "pt" on non-CUDA devices is handled in LLMHandler.initialize() (acestep/llm_inference.py lines 524-528), not in the CLI. This separation allows the CLI to express user preference while LLMHandler performs capability detection.
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-20T01:31:25.142Z
Learning: Applies to acestep/{handler,inference}.py : Batch operations are supported up to 8 songs
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 811
File: docs/en/GRADIO_GUIDE.md:0-0
Timestamp: 2026-03-11T12:03:00.308Z
Learning: In ace-step/ACE-Step-1.5 GRADIO_GUIDE.md, the "How It Integrates With 5Hz LM" section explicitly notes that when external mode is active, some LM-assisted workflows (notably Create Sample) still check `llm_handler.llm_initialized` in both the Gradio handler path and `create_sample()`, and therefore may still require local 5Hz LM initialization until the remaining routing is completed. The doc describes the external-mode paths as "intended behavior" rather than currently fully functional. Do not flag this remaining initialization requirement as a new bug in documentation reviews.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 774
File: acestep/ui/gradio/interfaces/training_dataset_tab_label_preview.py:70-95
Timestamp: 2026-03-05T23:12:28.095Z
Learning: In ace-step/ACE-Step-1.5, the hardcoded English placeholder strings in `acestep/ui/gradio/interfaces/training_dataset_tab_label_preview.py` (e.g., `"Music description..."` for `edit_caption`, `"pop, electronic, dance..."` for `edit_genre`, `Dropdown` choices `["Use Global Ratio", "Caption", "Genre"]` and its default value, and the lyrics placeholder `"[Verse 1]\nLyrics here...\n\n[Chorus]\n..."`) are pre-existing localization gaps carried forward from before the decomposition refactor. They are deferred to a follow-up PR and should not be flagged as new issues introduced by decomposition PRs.
Learnt from: tonyjohnvan
Repo: ace-step/ACE-Step-1.5 PR: 858
File: acestep/customized_vllm/__init__.py:125-128
Timestamp: 2026-03-17T21:18:29.789Z
Learning: In `acestep/customized_vllm/__init__.py` and `acestep/llm_inference.py`, the ACE-Step vLLM path always runs with `batch_size=1` (one prompt per `LLM.generate()` call). Because of this, `[sampling_params] * len(prompts)` only ever produces a single-element list, and sharing one `MetadataConstrainedLogitsProcessor` / `logits_processor_update_state` instance across all slots is safe by design. Do not flag the shared logits-processor reference as a mutable-state / batch-corruption bug.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 769
File: acestep/api/train_api_dataset_auto_label_async_route.py:158-169
Timestamp: 2026-03-04T22:40:44.930Z
Learning: In ace-step/ACE-Step-1.5, the `builder.label_all_samples(...)` call in both `train_api_dataset_auto_label_async_route.py` and `train_api_dataset_auto_label_sync_route.py` passes `chunk_size`, `batch_size`, and `sample_labeled_callback` kwargs that are not declared in the current `label_all_samples` signature in `acestep/training/dataset_builder_modules/label_all.py`. This is a pre-existing issue carried forward unchanged from the pre-decomposition `train_api_dataset_service.py` and is deferred to a follow-up PR.
Learnt from: tonyjohnvan
Repo: ace-step/ACE-Step-1.5 PR: 858
File: acestep/customized_vllm/__init__.py:1-406
Timestamp: 2026-03-17T21:14:13.573Z
Learning: In `acestep/customized_vllm/__init__.py`, the author (tonyjohnvan) intentionally keeps all inference engine internals (ForwardState, thread-local context, GenerationSlot, CachePool, _EngineConfig, LLM, generation loop) in a single ~400 LOC file with clear section comments. They consider splitting into sub-modules to add import complexity for no functional benefit. Do not raise module-size refactor suggestions for this file.
Learnt from: tonyjohnvan
Repo: ace-step/ACE-Step-1.5 PR: 858
File: acestep/customized_vllm/pipeline.py:1-423
Timestamp: 2026-03-17T21:17:04.611Z
Learning: In `acestep/customized_vllm/pipeline.py`, the author (tonyjohnvan) intentionally keeps all inference pipeline internals (sampling helpers _filter_by_top_k/_filter_by_nucleus/sample_tokens, model init, KV cache provisioning _provision_kv_storage, transfer buffers _init_transfer_buffers, prefill/decode execution _execute_prefill/_execute_autoregressive/_forward_pass, and CUDA graph capture _compile_execution_graphs) in a single ~420 LOC file with clear section comments. They consider splitting into sub-modules to add import complexity for no functional benefit. Do not raise module-size refactor suggestions for this file.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 837
File: acestep/core/generation/handler/init_service_loader_components.py:44-47
Timestamp: 2026-03-14T10:35:13.950Z
Learning: In `acestep/core/generation/handler/init_service_loader.py` (origin/main, pre-extraction), `_load_text_encoder_and_tokenizer()` already used `self.text_encoder.to("cpu").to(self.dtype)` in the CPU-offload branch — using `self.dtype` (which can be bfloat16) instead of `torch.float32`. This is a pre-existing behavior, not introduced by PR `#837`. PR `#837` moves it unchanged into `init_service_loader_components.py` and intends to fix it there. Do not flag the original `init_service_loader.py` pattern as a new regression in future reviews.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 808
File: acestep/inference.py:25-26
Timestamp: 2026-03-11T10:56:30.827Z
Learning: In ace-step/ACE-Step-1.5, acestep/inference.py’s lyric-density guard now uses a Unicode-aware tokenizer (_extract_lyric_word_tokens with str.isalnum()) and the counting helpers (_count_lyric_words/_count_lyric_syllables) ignore tag lines. A non‑Latin regression test exists, so do not re‑flag ASCII-only tokenization for this path in future reviews.
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-28T14:14:55.765Z
Learning: AI-Agent Workflow: Add/update focused tests. Self-review only changed hunks for regressions and scope creep. Summarize risk, validation, and non-target impact in PR notes.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 0
File: :0-0
Timestamp: 2026-03-20T01:04:19.217Z
Learning: In `acestep/text_tasks/external_lm_captioning.py`, `apply_user_metadata_overrides()` intentionally writes both the canonical field name and its runtime alias in a single pass (e.g., `plan.keyscale` + `plan.key_scale`, `plan.timesignature` + `plan.time_signature`, `plan.language` + `plan.vocal_language`). Invalid `bpm`/`duration` conversion errors are swallowed and logged via `loguru.logger.debug()`. Do not flag either of these patterns as bugs in future reviews.
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-24T14:09:24.530Z
Learning: Applies to acestep/{handler,inference,audio_utils}.py : Target 4GB VRAM as the memory constraint and minimize memory allocations accordingly
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 508
File: acestep/core/generation/handler/memory_utils.py:1-18
Timestamp: 2026-02-13T11:03:36.432Z
Learning: Enforce a module size guideline for Python files: only raise module-size concerns when a file exceeds 200 lines of code (LOC). Do not raise for files in the 150–200 LOC range, and accept files below 200 LOC without justification to split. Apply this across all Python files (pattern **/*.py) to maintain consistency in reviews; when a file exceeds 200 LOC, suggest splitting into smaller modules to improve readability and maintainability.
🧹 Nitpick comments (1)
acestep/inference.py (1)
356-360: Redundant None check. The guard at line 356 is unnecessary since the function already returns early at lines 337-338 when `llm_handler` is None.
♻️ Suggested simplification

```diff
     gc.collect()
-    if llm_handler is not None:
-        try:
-            llm_handler._clear_accelerator_cache("[LM handoff unload]")
-        except Exception as exc:
-            logger.warning("[LM handoff unload] accelerator cleanup failed: {}", exc)
+    try:
+        llm_handler._clear_accelerator_cache("[LM handoff unload]")
+    except Exception as exc:
+        logger.warning("[LM handoff unload] accelerator cleanup failed: {}", exc)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/inference.py` around lines 356 - 360, The if llm_handler is not None guard is redundant because earlier code already returns when llm_handler is None; remove that outer check and simply call llm_handler._clear_accelerator_cache("[LM handoff unload]") inside the try/except block, catching Exception as exc and logging with logger.warning — ensure you reference llm_handler and the method _clear_accelerator_cache and keep the existing except handler using logger.warning (fix the log call to format the exception correctly, e.g., include exc in the message rather than passing it as a separate format arg).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@acestep/inference.py`:
- Around line 356-360: The if llm_handler is not None guard is redundant because
earlier code already returns when llm_handler is None; remove that outer check
and simply call llm_handler._clear_accelerator_cache("[LM handoff unload]")
inside the try/except block, catching Exception as exc and logging with
logger.warning — ensure you reference llm_handler and the method
_clear_accelerator_cache and keep the existing except handler using
logger.warning (fix the log call to format the exception correctly, e.g.,
include exc in the message rather than passing it as a separate format arg).
At present in the Gradio UI implementation, the 5Hz LM remains resident in VRAM after it has been used. For the typical use of ACE Step, this behavior is correct because it allows for rapid iteration on music generation. However, there are situations where this behavior limits the ability to use larger ACE models together - e.g., the 4B LM model with the XL SFT DiT model on a 24GB VRAM device when utilizing the "vllm" backend.
This PR adds the ability to optionally enable the complete unloading of the 5Hz LM for both the "vllm" and "pt" backends, after the LM phase is complete and before the DiT phase begins. When the LM is unloaded, the next time it is required, it will be loaded again automatically, and LM processing will proceed - this way the music generation workflow of the Gradio UI works completely as normal.
I made these changes as I need to be able to use the 4B LM model with the XL SFT model on my 24GB 4090. When using the "vllm" backend with the 4B LM, nanovllm holds on to too much VRAM, and it isn't possible to load the XL SFT model and run the DiT phase. A workaround for this in ACE today is to switch to using the "pt" backend, but when doing this the LM speed hit is huge, making song generation take multiple times longer.
With these changes and LM unloading enabled, I can run generations perfectly with both the 4B LM + XL SFT models using the "vllm" backend, and the only speed hit is ~20 seconds to reload the 4B LM after each batch of generations. This is much faster than using the "pt" backend for 4B LM inference.
As provided, these changes leave the default operation of ACE Step unchanged. To enable unloading of the LM prior to the DiT phase, simply set `ACESTEP_UNLOAD_LM_BEFORE_DIT=1` before launching the Gradio UI.
I hope this is helpful to others as it has been to me!
Summary by CodeRabbit
New Features
Bug Fixes