provide the ability to unload the LM prior to the DiT execution phase #1090
5kft wants to merge 5 commits into ace-step:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review; this behavior can be changed in the CodeRabbit settings.
📝 Walkthrough
Adds guarded LM-unload helper and conditional unload-before-DiT logic in inference; extends LLMHandler to persist the last successful init config, add reload_last_configuration(), and harden unload/accelerator-cache cleanup with safer exception handling and extra CUDA calls.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Inference as Inference Engine
    participant LLM as LLMHandler
    participant CUDA as CUDA Runtime
    participant DiT as DiT Generator
    Inference->>LLM: check request_needs_lm
    alt LM required but not initialized
        Inference->>LLM: reload_last_configuration()
        LLM-->>Inference: (status, success)
    end
    alt ACESTEP_UNLOAD_LM_BEFORE_DIT enabled and backend in {pt,vllm}
        Inference->>LLM: _unload_lm_before_dit()
        LLM->>LLM: log pre-unload memory
        LLM->>LLM: unload() (guarded)
        LLM->>CUDA: torch.cuda.synchronize() (guarded)
        LLM->>CUDA: torch.cuda.empty_cache() (guarded)
        LLM->>CUDA: torch.cuda.ipc_collect() (guarded)
        LLM->>LLM: gc.collect()
        LLM->>LLM: log post-unload memory
    else
        Inference->>LLM: skip unload (unsupported backend or disabled)
    end
    Inference->>DiT: run DiT generation (phase 2)
    DiT-->>Inference: generated output
```
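For illustration, a minimal Python sketch of an unload helper following this flow. The name `_unload_lm_before_dit` and the `unload()` / `llm_initialized` members come from the walkthrough above; the logging format and guard details are assumptions, not the PR's exact code.

```python
import gc

import torch
from loguru import logger


def _unload_lm_before_dit(llm_handler) -> None:
    """Sketch: release the 5Hz LM so the DiT phase has the VRAM to itself."""
    if llm_handler is None or not getattr(llm_handler, "llm_initialized", False):
        return
    if torch.cuda.is_available():
        logger.info(
            "[LM handoff unload] before: allocated={:.2f} GiB reserved={:.2f} GiB",
            torch.cuda.memory_allocated() / 2**30,
            torch.cuda.memory_reserved() / 2**30,
        )
    try:
        llm_handler.unload()
    except Exception as exc:
        logger.exception("[LM handoff unload] unload() failed: {}", exc)
        return
    gc.collect()
    if torch.cuda.is_available():
        # Run each CUDA cleanup step individually so a failure is attributable.
        for step in (torch.cuda.synchronize, torch.cuda.empty_cache, torch.cuda.ipc_collect):
            try:
                step()
            except Exception as exc:
                logger.error("[LM handoff unload] {} failed: {}", step.__name__, exc)
        logger.info(
            "[LM handoff unload] after: allocated={:.2f} GiB reserved={:.2f} GiB",
            torch.cuda.memory_allocated() / 2**30,
            torch.cuda.memory_reserved() / 2**30,
        )
```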
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Around line 355-368: The LM cleanup block currently swallows all exceptions;
instead catch and log failures for each CUDA call so cleanup errors are visible:
wrap torch.cuda.synchronize(), torch.cuda.empty_cache(), and
torch.cuda.ipc_collect() so any Exception is caught and reported with
loguru.logger.error/exception including which step failed (e.g., "synchronize
failed", "empty_cache failed", "ipc_collect failed") and the exception details;
also replace the outer bare except: pass with a logged error using
loguru.logger.exception and do not silently ignore the failure. Use the existing
symbols torch.cuda.synchronize, torch.cuda.empty_cache, torch.cuda.ipc_collect
and loguru.logger (or the module's logger alias) when adding these logs.
- Around line 663-665: The unconditional unload-before-DiT can trigger MLX
reload failures; restrict the call to only supported backends. Modify the block
that calls _unload_lm_before_dit(llm_handler) so it first checks the LM backend
(e.g., llm_handler.backend or llm_handler.backend_name) and only calls
_unload_lm_before_dit when the backend is one of {"vllm", "pt"} and
ACESTEP_UNLOAD_LM_BEFORE_DIT is truthy; reference LLMHandler.initialize behavior
in comments if needed to justify the guard.
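A small sketch of the guarded call site both inference.py comments above describe; the flag parsing, the `backend` attribute name, and the helper names are assumptions (the logged cleanup itself is sketched after the sequence diagram).

```python
import os

from loguru import logger

SAFE_UNLOAD_BACKENDS = {"vllm", "pt"}  # backends assumed safe to unload and later reload


def _unload_before_dit_enabled() -> bool:
    # Treat only explicit opt-in values of the env flag as truthy.
    return os.environ.get("ACESTEP_UNLOAD_LM_BEFORE_DIT", "").strip().lower() in {"1", "true", "yes"}


def maybe_unload_lm_before_dit(llm_handler) -> None:
    if llm_handler is None or not _unload_before_dit_enabled():
        return
    backend = getattr(llm_handler, "backend", None)  # attribute name assumed
    if backend in SAFE_UNLOAD_BACKENDS:
        _unload_lm_before_dit(llm_handler)  # helper sketched earlier
    else:
        logger.info("[generate_music] Skipping LM unload before DiT for unsupported backend={}", backend)
```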
In `@acestep/llm_inference.py`:
- Around line 149-160: The current try/except blocks around
torch.cuda.synchronize(), torch.cuda.empty_cache(), and torch.cuda.ipc_collect()
swallow errors; replace them by calling the existing _clear_accelerator_cache()
helper (which centralizes cleanup and logging) or at minimum catch exceptions
and log them via loguru.logger.error including the exception info and context
(e.g., which cuda call failed) so failures during unload are visible; ensure you
reference and use _clear_accelerator_cache(), torch.cuda.synchronize(),
torch.cuda.empty_cache(), torch.cuda.ipc_collect(), and loguru.logger in the
change.
- Around line 740-747: The current code sets self._last_init_config before the
backend is actually initialized, causing MLX early returns and vLLM→PyTorch
fallbacks to record incorrect reload state; remove or defer setting
self._last_init_config here and instead call the helper
_remember_successful_init() immediately before each successful return path
(specifically before the MLX early return points and before the vLLM and PyTorch
success returns) so the persisted config reflects the backend that actually
initialized; ensure reload_last_configuration() will then replay the correct
backend/state.
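A rough sketch of the two llm_inference.py suggestions above. The method and attribute names (`_clear_accelerator_cache`, `_remember_successful_init`, `_last_init_config`) are taken from the comments; the bodies and backend strings are assumptions.

```python
from loguru import logger


class LLMHandler:  # illustrative fragment, not the real class
    def unload(self) -> None:
        # ... drop model/tokenizer references here ...
        # Delegate to the device-aware helper instead of CUDA-first try/except blocks.
        try:
            self._clear_accelerator_cache()
        except Exception as exc:
            logger.error("[LLM] accelerator cache cleanup failed during unload: {}", exc)

    def _remember_successful_init(self, backend: str, **config) -> None:
        # Persist the config only for the backend that actually came up.
        self._last_init_config = {"backend": backend, **config}

    def initialize(self, backend: str, **config) -> None:
        if backend == "mlx":
            # ... MLX setup ...
            self._remember_successful_init("mlx", **config)  # record at the early-return success path
            return
        if backend == "vllm":
            try:
                # ... vLLM setup ...
                self._remember_successful_init("vllm", **config)
                return
            except Exception as exc:
                logger.warning("[LLM] vLLM init failed, falling back to pt: {}", exc)
        # ... PyTorch setup ...
        self._remember_successful_init("pt", **config)
```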
It looks like this issue is related: #1085
Actionable comments posted: 1
♻️ Duplicate comments (2)
acestep/llm_inference.py (1)
164-186: ⚠️ Potential issue | 🟠 Major: Use the device-aware cache helper here, not a CUDA-first cleanup block.
This path now bypasses `_clear_accelerator_cache()`. On mixed CUDA+XPU hosts, `torch.cuda.is_available()` wins even when the LM was on XPU, so the XPU cache never gets cleared. The nested `except Exception: pass` also hides the failures you need when unload does not actually free memory.
🛠️ One way to tighten this:
```diff
         gc.collect()
-
-        if torch.cuda.is_available():
-            try:
-                torch.cuda.synchronize()
-            except Exception:
-                pass
-            try:
-                torch.cuda.empty_cache()
-            except Exception:
-                pass
-            try:
-                torch.cuda.ipc_collect()
-            except Exception:
-                pass
-        elif hasattr(torch, "mps") and torch.backends.mps.is_available():
-            if hasattr(torch.mps, "synchronize"):
-                torch.mps.synchronize()
-            if hasattr(torch.mps, "empty_cache"):
-                torch.mps.empty_cache()
-        elif hasattr(torch, "xpu") and torch.xpu.is_available():
-            torch.xpu.empty_cache()
-            torch.xpu.synchronize()
+        self._clear_accelerator_cache()
+        if (
+            str(getattr(self, "device", "")).split(":")[0] == "cuda"
+            and torch.cuda.is_available()
+            and hasattr(torch.cuda, "ipc_collect")
+        ):
+            try:
+                torch.cuda.ipc_collect()
+            except RuntimeError as exc:
+                logger.warning("[LLM] torch.cuda.ipc_collect() failed during unload: {}", exc)
```
As per coding guidelines, do not alter non-target hardware/runtime paths unless explicitly required, and handle errors explicitly with `loguru.logger`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/llm_inference.py` around lines 164 - 186, Current cleanup bypasses _clear_accelerator_cache() and prefers CUDA even when the model ran on XPU; replace the explicit CUDA-first cleanup block in llm_inference.py with a single call to _clear_accelerator_cache() (or delegate to that helper) so device-aware cache clearing is used, and remove the nested bare excepts; if you must catch exceptions around device-specific calls (inside _clear_accelerator_cache() or here), log them using loguru.logger.error with context instead of using except Exception: pass so failures surface during unload; reference the existing symbols _clear_accelerator_cache(), torch.cuda.is_available(), torch.mps, torch.xpu, and loguru.logger when making the change.
acestep/inference.py (1)
355-368: ⚠️ Potential issue | 🟡 Minor: Don't swallow post-unload CUDA cleanup failures.
If `synchronize()`, `empty_cache()`, or `ipc_collect()` fails here, the helper still looks like a successful VRAM release and later DiT OOMs become much harder to diagnose.
As per coding guidelines, handle errors explicitly and log errors with `loguru.logger`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/inference.py` around lines 355 - 368, The CUDA cleanup block currently swallows all exceptions, masking failures in torch.cuda.synchronize(), torch.cuda.empty_cache(), torch.cuda.ipc_collect(), and subsequent memory queries; replace the broad "except Exception: pass" behavior with explicit error handling that logs any thrown exception using loguru.logger (e.g., logger.exception or logger.error with exception info) so failures are visible, and only suppress non-critical errors after logging; keep the existing memory reporting (alloc/reserved) but ensure any exceptions from memory_allocated()/memory_reserved() are also caught and logged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Around line 663-672: The guard currently only checks use_lm before unloading,
but use_lm reflects whether this request ran the LM phase and can miss a
resident model from a prior request; change the condition to check the actual
resident state on llm_handler instead of use_lm (e.g., replace "if use_lm and
unload_enabled:" with a check like "if unload_enabled and llm_handler is not
None and getattr(llm_handler, 'is_resident', getattr(llm_handler,
'model_loaded', False)):"), then keep the existing safe_backend check and call
to _unload_lm_before_dit(llm_handler); also consider adding a note or code path
to lazily load models when needed to minimize VRAM.
---
Duplicate comments:
In `@acestep/inference.py`:
- Around line 355-368: The CUDA cleanup block currently swallows all exceptions,
masking failures in torch.cuda.synchronize(), torch.cuda.empty_cache(),
torch.cuda.ipc_collect(), and subsequent memory queries; replace the broad
"except Exception: pass" behavior with explicit error handling that logs any
thrown exception using loguru.logger (e.g., logger.exception or logger.error
with exception info) so failures are visible, and only suppress non-critical
errors after logging; keep the existing memory reporting (alloc/reserved) but
ensure any exceptions from memory_allocated()/memory_reserved() are also caught
and logged.
In `@acestep/llm_inference.py`:
- Around line 164-186: Current cleanup bypasses _clear_accelerator_cache() and
prefers CUDA even when the model ran on XPU; replace the explicit CUDA-first
cleanup block in llm_inference.py with a single call to
_clear_accelerator_cache() (or delegate to that helper) so device-aware cache
clearing is used, and remove the nested bare excepts — if you must catch
exceptions around device-specific calls (inside _clear_accelerator_cache() or
here), log them using loguru.logger.error with context instead of using except
Exception: pass so failures surface during unload; reference the existing
symbols _clear_accelerator_cache(), torch.cuda.is_available(), torch.mps,
torch.xpu, and loguru.logger when making the change.
…check unload based upon resident LM state
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Line 335: Add a short descriptive docstring to the newly introduced helper
function _unload_lm_before_dit(llm_handler) describing its purpose, the
parameter (llm_handler) and any important behavior or side effects (e.g., that
it unloads or cleans up the language model before DIT operations). Keep the
style consistent with other functions in the module (brief one- or two-line
summary plus a param description if used) to satisfy the project docstring
guideline.
- Around line 468-475: In generate_music, when request_needs_lm is true and
llm_handler.reload_last_configuration() returns reload_ok == False, fail fast
instead of silently proceeding without LM: replace the current log-only behavior
by aborting the request (e.g., raise a RuntimeError or return an error response)
that includes reload_status so callers know LM reload failed; make this change
in the block that checks llm_handler.llm_initialized and calls
llm_handler.reload_last_configuration() (reference symbols: generate_music,
request_needs_lm, llm_handler, reload_last_configuration, reload_status).
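A docstring sketch per the first comment above; the wording is a placeholder to be adapted to the module's docstring style.

```python
def _unload_lm_before_dit(llm_handler) -> None:
    """Unload the 5Hz LM and clear accelerator caches before the DiT phase runs.

    Args:
        llm_handler: Shared LLMHandler instance; the call is a no-op when it is
            None or the LM is not currently initialized.

    Side effects:
        Releases the LM weights and frees GPU memory, so a later request that
        needs the LM must reload it (e.g. via reload_last_configuration()).
    """
```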
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/inference.py`:
- Around line 469-482: The current logic attempts a reload whenever
request_needs_lm and llm_handler exists but llm_initialized is false; change
this to only try reload (and fail-fast) when the unload-before-DIT feature is
explicitly enabled or when the handler can confirm a saved config exists: add a
guard that checks the ACESTEP_UNLOAD_LM_BEFORE_DIT flag (or a new
llm_handler.has_saved_configuration() method) before calling
llm_handler.reload_last_configuration(), and only treat reload failure as a
request-level abort in that guarded path; leave the existing default path (no
reload attempt) when the flag is false and no saved config is reported.
- Around line 675-679: The unload path in generate_music leaves the shared
llm_handler unloaded while understand_music, create_sample, and format_sample
still short-circuit on not llm_handler.llm_initialized; change those functions
to reuse the same "ensure LM ready" logic (or extract a shared helper, e.g.,
ensure_llm_ready()) that checks llm_handler.llm_initialized and calls
reload_last_configuration() when needed instead of returning early, and update
generate_music to call that same helper after _unload_lm_before_dit so all
LM-only APIs lazily reload the model when required (reference: generate_music,
understand_music, create_sample, format_sample, _unload_lm_before_dit,
reload_last_configuration, llm_handler.llm_initialized).
```python
    if unload_enabled and llm_handler is not None and llm_handler.llm_initialized:
        if current_backend in safe_unload_backends:
            _unload_lm_before_dit(llm_handler)
        else:
            logger.info("[generate_music] Skipping LM unload before DiT for unsupported backend={}", current_backend)
```
This unload path strands the other LM inference APIs.
After Line 677 unloads the shared handler, understand_music(), create_sample(), and format_sample() in this same module still short-circuit on not llm_handler.llm_initialized and never call reload_last_configuration(). With the flag enabled, a generate_music() call can therefore make the next LM-only request fail until something else reinitializes the model. Please reuse the same “ensure LM ready” path there, or centralize it in a shared helper.
Based on learnings, acestep/inference.py should lazy load models when needed to minimize memory allocations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/inference.py` around lines 675 - 679, The unload path in
generate_music leaves the shared llm_handler unloaded while understand_music,
create_sample, and format_sample still short-circuit on not
llm_handler.llm_initialized; change those functions to reuse the same "ensure LM
ready" logic (or extract a shared helper, e.g., ensure_llm_ready()) that checks
llm_handler.llm_initialized and calls reload_last_configuration() when needed
instead of returning early, and update generate_music to call that same helper
after _unload_lm_before_dit so all LM-only APIs lazily reload the model when
required (reference: generate_music, understand_music, create_sample,
format_sample, _unload_lm_before_dit, reload_last_configuration,
llm_handler.llm_initialized).
Technically, yes, this is a lifecycle inconsistency. However, it is not clear that these other APIs would even be hit from the normal UI flow here. I don't know enough at this point to determine if this is a real problem or not...
🧩 Analysis chain
Let me check how those functions are wired into the Gradio UI to give you a concrete answer.
🏁 Script executed:
```bash
#!/bin/bash
echo "=== understand_music usages ==="
rg -n "understand_music" --type py -C2
echo ""
echo "=== create_sample usages ==="
rg -n "create_sample" --type py -C2
echo ""
echo "=== format_sample usages ==="
rg -n "format_sample" --type py -C2
```
Length of output: 66675
@5kft, the search results give a concrete answer: yes, this is a real (though narrow) issue in the normal Gradio UI flow.
Here's what was found:
| Function | Called from UI | How |
|---|---|---|
| `understand_music` | ✅ Yes | llm_analysis_actions.py ("Analyze Audio" button), metadata_loading.py (load metadata event) |
| `create_sample` | ✅ Yes | llm_sample_actions.py → wired to "Create Sample" button click in generation_mode_wiring.py |
| `format_sample` | ✅ Yes | llm_format_actions.py → "Format Caption/Lyrics" buttons |
All three are separate clickable buttons in the same Gradio session sharing the same llm_handler. So the failure scenario is:
- User clicks Generate (with `ACESTEP_UNLOAD_LM_BEFORE_DIT=1`) → LM unloaded after DiT handoff ✅
- User then clicks Create Sample, Format, or Analyze in the same session → `llm_handler.llm_initialized` is `False` → silent failure or error ❌
Notably, metadata_loading.py line 257 just silently skips with `if llm_handler.llm_initialized:` rather than surfacing any error.
Severity is bounded because:
- The flag `ACESTEP_UNLOAD_LM_BEFORE_DIT=1` must be explicitly opted into
- The `generate_music` reload path already works correctly
- These auxiliary actions are optional helpers
The fix would be adding the same reload guard (check _last_init_config, call reload_last_configuration()) in the three functions before the not llm_handler.llm_initialized early-return, or extracting a shared helper. Would you like me to plan those changes?
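For reference, a sketch of the shared-helper route discussed above; `ensure_llm_ready` is the name floated in the review, and the call-site shape is an assumption.

```python
from loguru import logger


def ensure_llm_ready(llm_handler) -> bool:
    """Sketch: lazily reload the LM if it was unloaded after a DiT handoff."""
    if llm_handler is None:
        return False
    if llm_handler.llm_initialized:
        return True
    reload_status, reload_ok = llm_handler.reload_last_configuration()
    if not reload_ok:
        logger.error("[LM] reload before LM-only request failed: {}", reload_status)
    return reload_ok


# understand_music/create_sample/format_sample would then start with something like:
#     if not ensure_llm_ready(llm_handler):
#         return error_payload("5Hz LM is not available")  # error_payload is hypothetical
```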
🧠 Learnings used
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-20T01:31:33.304Z
Learning: Applies to acestep/handler.py,acestep/inference.py : Use batch operations supported up to 8 songs for audio generation
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-24T04:42:41.285Z
Learning: Applies to acestep/{handler,inference,llm_inference}.py : Lazy load models when needed to minimize memory allocations
Learnt from: chigkim
Repo: ace-step/ACE-Step-1.5 PR: 362
File: cli.py:982-982
Timestamp: 2026-02-08T22:50:13.134Z
Learning: In cli.py, the --backend argument defaults to "vllm" on non-MPS systems. However, the actual fallback from "vllm" to "pt" on non-CUDA devices is handled in LLMHandler.initialize() (acestep/llm_inference.py lines 524-528), not in the CLI. This separation allows the CLI to express user preference while LLMHandler performs capability detection.
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-20T01:31:25.142Z
Learning: Applies to acestep/{handler,inference}.py : Batch operations are supported up to 8 songs
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 811
File: docs/en/GRADIO_GUIDE.md:0-0
Timestamp: 2026-03-11T12:03:00.308Z
Learning: In ace-step/ACE-Step-1.5 GRADIO_GUIDE.md, the "How It Integrates With 5Hz LM" section explicitly notes that when external mode is active, some LM-assisted workflows (notably Create Sample) still check `llm_handler.llm_initialized` in both the Gradio handler path and `create_sample()`, and therefore may still require local 5Hz LM initialization until the remaining routing is completed. The doc describes the external-mode paths as "intended behavior" rather than currently fully functional. Do not flag this remaining initialization requirement as a new bug in documentation reviews.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 774
File: acestep/ui/gradio/interfaces/training_dataset_tab_label_preview.py:70-95
Timestamp: 2026-03-05T23:12:28.095Z
Learning: In ace-step/ACE-Step-1.5, the hardcoded English placeholder strings in `acestep/ui/gradio/interfaces/training_dataset_tab_label_preview.py` (e.g., `"Music description..."` for `edit_caption`, `"pop, electronic, dance..."` for `edit_genre`, `Dropdown` choices `["Use Global Ratio", "Caption", "Genre"]` and its default value, and the lyrics placeholder `"[Verse 1]\nLyrics here...\n\n[Chorus]\n..."`) are pre-existing localization gaps carried forward from before the decomposition refactor. They are deferred to a follow-up PR and should not be flagged as new issues introduced by decomposition PRs.
Learnt from: tonyjohnvan
Repo: ace-step/ACE-Step-1.5 PR: 858
File: acestep/customized_vllm/__init__.py:125-128
Timestamp: 2026-03-17T21:18:29.789Z
Learning: In `acestep/customized_vllm/__init__.py` and `acestep/llm_inference.py`, the ACE-Step vLLM path always runs with `batch_size=1` (one prompt per `LLM.generate()` call). Because of this, `[sampling_params] * len(prompts)` only ever produces a single-element list, and sharing one `MetadataConstrainedLogitsProcessor` / `logits_processor_update_state` instance across all slots is safe by design. Do not flag the shared logits-processor reference as a mutable-state / batch-corruption bug.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 769
File: acestep/api/train_api_dataset_auto_label_async_route.py:158-169
Timestamp: 2026-03-04T22:40:44.930Z
Learning: In ace-step/ACE-Step-1.5, the `builder.label_all_samples(...)` call in both `train_api_dataset_auto_label_async_route.py` and `train_api_dataset_auto_label_sync_route.py` passes `chunk_size`, `batch_size`, and `sample_labeled_callback` kwargs that are not declared in the current `label_all_samples` signature in `acestep/training/dataset_builder_modules/label_all.py`. This is a pre-existing issue carried forward unchanged from the pre-decomposition `train_api_dataset_service.py` and is deferred to a follow-up PR.
Learnt from: tonyjohnvan
Repo: ace-step/ACE-Step-1.5 PR: 858
File: acestep/customized_vllm/__init__.py:1-406
Timestamp: 2026-03-17T21:14:13.573Z
Learning: In `acestep/customized_vllm/__init__.py`, the author (tonyjohnvan) intentionally keeps all inference engine internals (ForwardState, thread-local context, GenerationSlot, CachePool, _EngineConfig, LLM, generation loop) in a single ~400 LOC file with clear section comments. They consider splitting into sub-modules to add import complexity for no functional benefit. Do not raise module-size refactor suggestions for this file.
Learnt from: tonyjohnvan
Repo: ace-step/ACE-Step-1.5 PR: 858
File: acestep/customized_vllm/pipeline.py:1-423
Timestamp: 2026-03-17T21:17:04.611Z
Learning: In `acestep/customized_vllm/pipeline.py`, the author (tonyjohnvan) intentionally keeps all inference pipeline internals (sampling helpers _filter_by_top_k/_filter_by_nucleus/sample_tokens, model init, KV cache provisioning _provision_kv_storage, transfer buffers _init_transfer_buffers, prefill/decode execution _execute_prefill/_execute_autoregressive/_forward_pass, and CUDA graph capture _compile_execution_graphs) in a single ~420 LOC file with clear section comments. They consider splitting into sub-modules to add import complexity for no functional benefit. Do not raise module-size refactor suggestions for this file.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 837
File: acestep/core/generation/handler/init_service_loader_components.py:44-47
Timestamp: 2026-03-14T10:35:13.950Z
Learning: In `acestep/core/generation/handler/init_service_loader.py` (origin/main, pre-extraction), `_load_text_encoder_and_tokenizer()` already used `self.text_encoder.to("cpu").to(self.dtype)` in the CPU-offload branch — using `self.dtype` (which can be bfloat16) instead of `torch.float32`. This is a pre-existing behavior, not introduced by PR `#837`. PR `#837` moves it unchanged into `init_service_loader_components.py` and intends to fix it there. Do not flag the original `init_service_loader.py` pattern as a new regression in future reviews.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 808
File: acestep/inference.py:25-26
Timestamp: 2026-03-11T10:56:30.827Z
Learning: In ace-step/ACE-Step-1.5, acestep/inference.py’s lyric-density guard now uses a Unicode-aware tokenizer (_extract_lyric_word_tokens with str.isalnum()) and the counting helpers (_count_lyric_words/_count_lyric_syllables) ignore tag lines. A non‑Latin regression test exists, so do not re‑flag ASCII-only tokenization for this path in future reviews.
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-28T14:14:55.765Z
Learning: AI-Agent Workflow: Add/update focused tests. Self-review only changed hunks for regressions and scope creep. Summarize risk, validation, and non-target impact in PR notes.
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 0
File: :0-0
Timestamp: 2026-03-20T01:04:19.217Z
Learning: In `acestep/text_tasks/external_lm_captioning.py`, `apply_user_metadata_overrides()` intentionally writes both the canonical field name and its runtime alias in a single pass (e.g., `plan.keyscale` + `plan.key_scale`, `plan.timesignature` + `plan.time_signature`, `plan.language` + `plan.vocal_language`). Invalid `bpm`/`duration` conversion errors are swallowed and logged via `loguru.logger.debug()`. Do not flag either of these patterns as bugs in future reviews.
Learnt from: CR
Repo: ace-step/ACE-Step-1.5 PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-24T14:09:24.530Z
Learning: Applies to acestep/{handler,inference,audio_utils}.py : Target 4GB VRAM as the memory constraint and minimize memory allocations accordingly
Learnt from: 1larity
Repo: ace-step/ACE-Step-1.5 PR: 508
File: acestep/core/generation/handler/memory_utils.py:1-18
Timestamp: 2026-02-13T11:03:36.432Z
Learning: Enforce a module size guideline for Python files: only raise module-size concerns when a file exceeds 200 lines of code (LOC). Do not raise for files in the 150–200 LOC range, and accept files below 200 LOC without justification to split. Apply this across all Python files (pattern **/*.py) to maintain consistency in reviews; when a file exceeds 200 LOC, suggest splitting into smaller modules to improve readability and maintainability.
🧹 Nitpick comments (1)
acestep/inference.py (1)
356-360: Redundant None check. The guard at line 356 is unnecessary since the function already returns early at lines 337-338 when `llm_handler` is None.
♻️ Suggested simplification

```diff
     gc.collect()
-    if llm_handler is not None:
-        try:
-            llm_handler._clear_accelerator_cache("[LM handoff unload]")
-        except Exception as exc:
-            logger.warning("[LM handoff unload] accelerator cleanup failed: {}", exc)
+    try:
+        llm_handler._clear_accelerator_cache("[LM handoff unload]")
+    except Exception as exc:
+        logger.warning("[LM handoff unload] accelerator cleanup failed: {}", exc)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/inference.py` around lines 356 - 360, The if llm_handler is not None guard is redundant because earlier code already returns when llm_handler is None; remove that outer check and simply call llm_handler._clear_accelerator_cache("[LM handoff unload]") inside the try/except block, catching Exception as exc and logging with logger.warning — ensure you reference llm_handler and the method _clear_accelerator_cache and keep the existing except handler using logger.warning (fix the log call to format the exception correctly, e.g., include exc in the message rather than passing it as a separate format arg).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@acestep/inference.py`:
- Around line 356-360: The if llm_handler is not None guard is redundant because
earlier code already returns when llm_handler is None; remove that outer check
and simply call llm_handler._clear_accelerator_cache("[LM handoff unload]")
inside the try/except block, catching Exception as exc and logging with
logger.warning — ensure you reference llm_handler and the method
_clear_accelerator_cache and keep the existing except handler using
logger.warning (fix the log call to format the exception correctly, e.g.,
include exc in the message rather than passing it as a separate format arg).
At present in the Gradio UI implementation, the 5Hz LM remains resident in VRAM after it has been used. For the typical use of ACE Step, this behavior is correct because it allows for rapid iteration on music generation. However, there are situations where this behavior limits the ability to use larger ACE models together - e.g., the 4B LM model with the XL SFT DiT model on a 24GB VRAM device when utilizing the "vllm" backend.
This PR adds the ability to optionally enable the complete unloading of the 5Hz LM for both the "vllm" and "pt" backends, after the LM phase is complete and before the DiT phase begins. When the LM is unloaded, the next time it is required, it will be loaded again automatically, and LM processing will proceed - this way the music generation workflow of the Gradio UI works completely as normal.
I made these changes as I need to be able to use the 4B LM model with the XL SFT model on my 24GB 4090. When using the "vllm" backend with the 4B LM, nanovllm holds on to too much VRAM, and it isn't possible to load the XL SFT model and run the DiT phase. A workaround for this in ACE today is to switch to using the "pt" backend, but when doing this the LM speed hit is huge, making song generation take multiple times longer.
With these changes and LM unloading enabled, I can run generations perfectly with both the 4B LM + XL SFT models using the "vllm" backend, and the only speed hit is ~20 seconds to reload the 4B LM after each batch of generations. This is much faster than using the "pt" backend for 4B LM inference.
As provided, these changes leave the default operation of ACE Step unchanged. To enable unloading of the LM prior to the DiT phase, simply set `ACESTEP_UNLOAD_LM_BEFORE_DIT=1` before launching the Gradio UI.
I hope this is helpful to others as it has been to me!
Summary by CodeRabbit
New Features
Bug Fixes