
fix(scoring): stop Autoscore unified-memory leak on MLX (issue #1081) #1097

Open

ChuxiJ wants to merge 5 commits into main from claude/analyze-issue-1081-jJiqU

Conversation

ChuxiJ (Contributor) commented Apr 13, 2026

Fixes the catastrophic unified-memory growth described in #1081, where enabling Autoscore on macOS with the MLX backend (acestep-v15-xl-turbo + acestep-5Hz-lm-4B loaded, 32 GB Mac) would drain RAM and crash the system after only a few generations.

Root cause

On the MLX backend, Autoscore triggers three compounding memory paths:

  1. A second full copy of the LM is loaded as a HuggingFace PyTorch model. The MLX weights cannot be used for PyTorch teacher-forcing scoring, so get_hf_model_for_scoring() (acestep/llm_inference.py:4163-4194) calls AutoModelForCausalLM.from_pretrained(...) and caches the result on self._hf_model_for_scoring. For the 5Hz-lm-4B model that is roughly 8 GB of CPU-resident memory that never gets freed.

  2. That cached model was migrated CPU ↔ MPS on every forward pass. _load_scoring_model_context in acestep/core/scoring/lm_score.py moved the model to the accelerator and back inside every call. One Autoscore pass performs dozens of forward passes per sample (per-metadata recall + caption PMI + lyrics PMI, each with a conditional and an unconditional prompt), so the MPS allocator accumulated fragmented multi-GB allocations on every pass.

  3. PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 removes the last guardrail. That env var is required to even start the app with both models loaded on a 32 GB Mac, but it disables the MPS allocator's high-watermark check, letting the fragmentation above push the process past unified-memory limits.

Disabling Autoscore makes the leak disappear entirely, because generation_progress.py:280 gates the whole scoring block behind if auto_score: — that matches the reporter's observation.

Fix

Three localised edits in acestep/core/scoring/lm_score.py:

1. _load_scoring_model_context is now reentrant

Nested entries are tracked via a depth counter (llm_handler._scoring_ctx_depth) and become no-ops. Only the outermost entry performs the CPU → accelerator migration; only the outermost exit performs the offload. This mirrors the device-based reentrancy guard that already exists for the pt backend in llm_inference.py:4075-4082.
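
A minimal sketch of the depth-counter pattern (illustrative: apart from _scoring_ctx_depth and the handler attributes this PR already names, identifiers are assumptions, and the follow-up commits move this state into thread-local storage):

import contextlib

@contextlib.contextmanager
def _load_scoring_model_context(llm_handler):
    """Reentrant: only the outermost entry migrates; only the outermost exit offloads."""
    depth = getattr(llm_handler, "_scoring_ctx_depth", 0)
    llm_handler._scoring_ctx_depth = depth + 1
    try:
        if depth == 0:
            # Outermost entry: move the cached HF scoring copy to the accelerator.
            llm_handler.get_hf_model_for_scoring().to(llm_handler.device)
        yield
    finally:
        llm_handler._scoring_ctx_depth = depth
        if depth == 0 and llm_handler.offload_to_cpu:
            # Outermost exit: offload back to CPU; nested exits are no-ops.
            llm_handler.get_hf_model_for_scoring().to("cpu")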

2. calculate_pmi_score_per_condition batches all forward passes under one outer context

The whole scoring body is now wrapped in a single with _load_scoring_model_context(llm_handler):. All the inner _get_logits_and_target_for_scoring calls still enter the context, but the reentrancy guard turns them into no-ops. Result: exactly one CPU ↔ MPS migration per Autoscore pass, instead of one per forward pass.

3. MLX-specific: drop the cached HF copy on outermost exit

When the backend is MLX and offload_to_cpu is enabled, the outermost exit additionally:

  • sets llm_handler._hf_model_for_scoring = None
  • runs gc.collect()
  • empties the MPS cache

This returns the ~8 GB duplicate PyTorch copy to the OS between scoring passes. The next scoring call re-materialises the model via AutoModelForCausalLM.from_pretrained; HF hub weight caching makes this a fast rematerialisation from disk, not a redownload. A 32 GB Mac pays a few-second reload cost on each Autoscore run in exchange for not crashing.
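
Sketched, the MLX branch of the outermost exit does roughly the following (a simplification; the helper name is an illustration, not the PR's exact code):

import gc
import torch

def _release_mlx_scoring_copy(llm_handler):
    """Return the duplicate PyTorch scoring copy's unified memory to the OS."""
    llm_handler._hf_model_for_scoring = None  # drop the ~8 GB cached HF copy
    gc.collect()  # reclaim promptly instead of waiting for the cyclic GC
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()  # release cached MPS allocator blocks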

Other backends are unaffected:

  • pt: still delegates to llm_handler._load_model_context() which already has its own reentrancy guard.
  • vllm: same hoisted outer context, but the HF scoring model is kept cached (the CUDA allocator handles this cleanly and VRAM is not unified with system RAM).

Files changed

  • acestep/core/scoring/lm_score.py — +110 / -40

Test plan

  • macOS unit memory test: on a 32 GB Apple Silicon Mac with MLX backend, load acestep-v15-xl-turbo + acestep-5Hz-lm-4B, enable Autoscore, run ≥5 consecutive generations. Expected: peak RSS plateaus between generations instead of growing monotonically; no crash.
  • Check that the terminal logs show [scoring] Released cached HF scoring model on MLX backend after each Autoscore pass.
  • Verify PMI scores are numerically identical before/after (the reentrant guard should be behaviour-preserving).
  • Smoke-test PyTorch backend: run Autoscore on CUDA and confirm no behavioural change (delegates to _load_model_context() as before).
  • Smoke-test vllm backend (if available): confirm Autoscore still works with the hoisted outer context.

Follow-ups (not in this PR)

  • DiT migration in get_lyric_score (acestep/core/generation/handler/lyric_score.py:93) still happens once per sample. A future PR could hoist a single _load_model_context("model") around the whole Autoscore loop in generation_progress.py:280.
  • UI-side warning when MLX + ≤32 GB unified memory + both models loaded + Autoscore enabled (extension of the acestep/gpu_config.py:261 heuristic).

Refs: #1081

https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U

Summary by CodeRabbit

  • Performance
    • Reduced redundant model transfers during scoring: model migration to/from accelerator now happens once per autoscore pass, lowering runtime and accelerator memory usage and avoiding races under concurrency; cached accelerator memory is explicitly cleared when offloaded.
  • Tests
    • Added regression tests validating scoring context lifecycle, nesting, cross-thread behavior, and correct load/offload semantics.

On the MLX backend, Autoscore was the cause of a catastrophic unified-memory
leak on Apple Silicon (issue #1081):

1. `get_hf_model_for_scoring()` loads a *separate* HuggingFace PyTorch copy
   of the LM (the MLX weights cannot be used for torch teacher-forcing
   scoring). For the 5Hz-lm-4B that is roughly 8 GB of CPU-resident memory
   cached on the handler and never freed.

2. `_load_scoring_model_context` moved that cached model CPU <-> MPS on
   *every* forward pass. A single Autoscore pass performs dozens of forward
   passes per sample (metadata recall, caption PMI, lyrics PMI, each with a
   conditional and unconditional prompt), so the MPS allocator accumulated
   fragmented multi-GB allocations.

3. With `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` (required to run at all on
   32 GB Macs with both XL-Turbo and 5Hz-lm-4B loaded), the MPS allocator
   has no high-watermark guardrail, and the fragmentation above pushed the
   system past its physical memory limit after only a few generations.

This change fixes the accumulating leak with three localised edits in
`acestep/core/scoring/lm_score.py`:

- `_load_scoring_model_context` is now **reentrant**. Nested entries are
  tracked via a depth counter on the handler and become no-ops. Only the
  outermost entry performs the CPU->accelerator migration and only the
  outermost exit performs the offload.

- `calculate_pmi_score_per_condition` wraps its entire scoring body in a
  single outer `_load_scoring_model_context`. All the inner
  `_get_logits_and_target_for_scoring` calls now see an active outer
  context and skip their own migrations. Result: exactly one CPU<->MPS
  migration per Autoscore pass instead of one per forward pass.

- On MLX with `offload_to_cpu` enabled, the outermost exit additionally
  drops the cached `_hf_model_for_scoring`, runs `gc.collect()` and empties
  the MPS cache so the ~8 GB duplicate PyTorch copy is returned to the OS
  between scoring passes. The next scoring call re-materialises it via
  `AutoModelForCausalLM.from_pretrained`; HF hub weight caching makes this
  a fast rematerialisation rather than a redownload.

Refs: #1081

https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a thread-local, reentrant scoring-model context with per-handler outermost serialization, explicit accelerator-cache eviction, MLX offload cleanup (model deletion + GC), and wraps the full PMI scoring pass in one outer context to avoid repeated CPU↔accelerator migrations.

Changes

  • Scoring model context & memory helpers (acestep/core/scoring/lm_score.py): added _empty_accelerator_cache(backend); introduced gc and threading; implemented a thread-local reentrancy depth per handler, a lazily created per-handler threading.RLock, and outermost-only migration/offload semantics.
  • Scoring flow update (acestep/core/scoring/lm_score.py): wrapped the entire calculate_pmi_score_per_condition(...) pass in a single _load_scoring_model_context to ensure one model migration/offload per Autoscore pass.
  • Tests & housekeeping (acestep/core/scoring/lm_score_context_test.py, acestep/core/scoring/scoring_test.py): added lm_score_context_test.py with unit tests for lifecycle, reentrancy, per-thread outermost detection, backend-specific offload/cleanup, and error handling; scoring_test.py updated with a comment noting the moved regression tests (no runtime changes).

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Thread as Thread
    participant Handler as LLM Handler
    participant Lock as RLock (per-handler)
    participant HFModel as HF Scoring Model
    participant Accelerator as Accelerator (CUDA/MLX)
    participant GC as GC

    Thread->>Handler: enter _load_scoring_model_context()
    Handler->>Lock: acquire (outermost serialized)
    alt cached model not on accelerator
        Handler->>HFModel: get_hf_model_for_scoring()
        HFModel->>Accelerator: .to("cuda") / move to accel
    end
    Note right of Thread: scoring work (PMI passes) executes
    Thread->>HFModel: perform conditional/unconditional forwards
    Thread->>Handler: exit _load_scoring_model_context()
    Lock->>Handler: release (outermost exit)
    alt backend == "mlx" and offload_to_cpu == true
        Handler->>HFModel: .to("cpu") / offload
        Handler->>Handler: clear cached _hf_model_for_scoring
        Handler->>GC: gc.collect()
        Handler->>Accelerator: empty accelerator cache
    else
        Handler->>Accelerator: maybe empty accelerator cache
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped with locks and depth to keep,
One outer move, no needless leap.
Offload, delete, and sweep the cache clean,
Nested hops stay quiet, calm, and lean.
Scoring snug — a happy rabbit scene! 🥕

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and specifically describes the main change: fixing a unified-memory leak on MLX in Autoscore, with a reference to the issue number for context.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 86.36%, which is above the required 80.00% threshold.



coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
acestep/core/scoring/lm_score.py (1)

129-155: ⚠️ Potential issue | 🟠 Major

Make the reentrancy state request-scoped, not handler-scoped.

_scoring_ctx_depth lives on llm_handler, so two concurrent Autoscore calls on the same handler are treated as “nested”. One outer exit can then reset the depth to 0 and clear _hf_model_for_scoring while the other call is still inside model(...), which makes this a race rather than true reentrancy. Please move the nesting state to thread/task-local storage and protect the shared load/offload transition with a lock or refcount.

Also applies to: 168-176
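
A sketch of the suggested request-scoped bookkeeping, assuming threading.local depth tracking plus a lazily installed per-handler lock (names are illustrative; a later commit on this PR implements essentially this scheme):

import threading

_scoring_tls = threading.local()  # per-thread {handler id: nesting depth}
_lock_creation_lock = threading.Lock()  # guards lazy RLock installation

def _enter_depth(handler):
    """Bump this thread's nesting depth for handler; return the previous depth."""
    depths = getattr(_scoring_tls, "depths", None)
    if depths is None:
        depths = _scoring_tls.depths = {}
    prev = depths.get(id(handler), 0)
    depths[id(handler)] = prev + 1
    return prev  # 0 means this thread is entering outermost

def _exit_depth(handler):
    """Decrement this thread's nesting depth for handler; return the new depth."""
    _scoring_tls.depths[id(handler)] -= 1
    return _scoring_tls.depths[id(handler)]

def _handler_lock(handler):
    """Lazily install one shared RLock per handler so racing first callers agree."""
    with _lock_creation_lock:
        lock = getattr(handler, "_scoring_ctx_lock", None)
        if lock is None:
            lock = handler._scoring_ctx_lock = threading.RLock()
        return lock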

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/lm_score.py` around lines 129 - 155, The current
reentrancy counter _scoring_ctx_depth on llm_handler causes races for concurrent
Autoscore calls; change it to be request-scoped by moving the nesting state into
task-local/thread-local storage (e.g., use contextvars or threading.local)
instead of storing _scoring_ctx_depth on the handler, and protect the shared
model load/offload/change with a global lock or a handler-level refcount
(increment/decrement in the task-local context) so calls can nest per-request
without one call resetting the handler state for others; update the context
manager logic around get_hf_model_for_scoring, offload_to_cpu, device and
model.to to consult the task-local depth and use the lock/refcount when
performing model.to(...) and when clearing _hf_model_for_scoring.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@acestep/core/scoring/lm_score.py`:
- Around line 71-83: The helper _empty_accelerator_cache currently empties
CUDA/XPU/MPS caches and is invoked on every scoring offload, which affects
non-target runtimes; change _empty_accelerator_cache to only perform
MPS-specific cleanup (check torch.backends.mps.is_available() and
torch.mps.empty_cache()) or add a backend parameter (e.g., backend="mps") and
only empty the cache when backend == "mps"; leave CUDA/XPU paths untouched and
ensure the offload call site that triggers the MLX/MPS cleanup calls this
MPS-scoped helper instead of a blanket cache eviction.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3b905821-7395-4783-9889-e4973ea4db2c

📥 Commits

Reviewing files that changed from the base of the PR and between 82252c2 and 764f587.

📒 Files selected for processing (1)
  • acestep/core/scoring/lm_score.py

Comment thread acestep/core/scoring/lm_score.py Outdated
Comment on lines +478 to 519
# Batch *all* scoring forward passes under a single load/offload
# cycle. The inner ``_get_logits_and_target_for_scoring`` calls
# still enter ``_load_scoring_model_context``, but the reentrancy
# guard turns those nested entries into no-ops -- so the cached HF
# scoring model is moved CPU↔accelerator exactly once per Autoscore
# pass instead of once per condition (issue #1081).
with _load_scoring_model_context(llm_handler):
    # 1. Calculate Recall for Metadata Fields
    if metadata and isinstance(metadata, dict):
        scores = {}
        # Define which fields use which metric
        metadata_recall_keys = ['bpm', 'duration', 'genres', 'keyscale', 'language', 'timesignature']
        metadata_pmi_keys = ['caption']
        for key in metadata_recall_keys:
            if key in metadata and metadata[key] is not None:
                recall_metadata = {key: metadata[key]}
                field_scores = _calculate_metadata_recall(llm_handler, formatted_prompt, recall_metadata, topk=topk)
                scores.update(field_scores)

        # 2. Calculate PMI for Caption
        for key in metadata_pmi_keys:
            if key in metadata and metadata[key] is not None:
                cot_yaml = yaml.dump({key: metadata[key]}, allow_unicode=True, sort_keys=True).strip()
                target_text = f"<think>\n{cot_yaml}\n</think>\n"

                log_prob_cond = _calculate_log_prob(llm_handler, formatted_prompt, target_text)
                log_prob_uncond = _calculate_log_prob(llm_handler, prompt_uncond, target_text)

                pmi_normalized = pmi_to_normalized_score(log_prob_cond - log_prob_uncond, scale=score_scale)
                scores[key] = pmi_normalized

    # 3. Calculate PMI for Lyrics
    if lyrics:
        target_text = f"<think>\n</think>\n# Lyric\n{lyrics}\n"

        log_prob_cond = _calculate_log_prob(llm_handler, formatted_prompt, target_text)

        prompt_uncond = llm_handler.build_formatted_prompt_for_understanding(audio_codes="NO USER INPUT", is_negative_prompt=False)
        log_prob_uncond = _calculate_log_prob(llm_handler, prompt_uncond, target_text)

        scores['lyrics'] = pmi_to_normalized_score(log_prob_cond - log_prob_uncond, scale=score_scale)


⚠️ Potential issue | 🟡 Minor

Add a focused regression test for the new lifecycle contract.

This now depends on three invariants that are easy to regress silently: one outer load/offload per scoring pass, nested _load_scoring_model_context() entries staying no-op, and MLX outermost exit clearing _hf_model_for_scoring. I’d add a fake-handler test around this block before merging.

Based on learnings: "AI-Agent Workflow: Add/update focused tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/lm_score.py` around lines 478 - 519, Add a focused
regression test that constructs a fake llm_handler and exercises the Autoscore
scoring path that wraps the block using _load_scoring_model_context, asserting
three invariants: (1) the model load/offload cycle is invoked exactly once for
the outer scoring pass (mock the load/offload side-effects on the fake handler),
(2) nested re-entries into _load_scoring_model_context (triggered by calls like
_get_logits_and_target_for_scoring or _calculate_log_prob) are no-ops (verify
they do not increment load/unload counters), and (3) after the outer context
exits the module-level _hf_model_for_scoring is cleared/None. Use the same call
sites as in the diff (invoke the scoring code path via llm_handler and calls to
_calculate_log_prob/_calculate_metadata_recall) so the test exercises the exact
lifecycle contract.

Addresses CodeRabbit review feedback on #1097:

1. Thread safety (was major concurrency race).
   ``_scoring_ctx_depth`` lived on ``llm_handler``, so two concurrent
   Autoscore calls on the same handler saw each other's depth counter
   and one outer exit could clear ``_hf_model_for_scoring`` while the
   other call was still inside a forward pass.  Nesting state is now
   kept in ``threading.local`` keyed by ``id(llm_handler)``, and the
   outermost entry on each thread acquires a per-handler
   ``threading.RLock`` so the shared load / offload / drop transition
   is atomic across threads.  A module-level creation lock guards lazy
   RLock installation so racing first callers can't install two
   different locks.

2. Backend-scoped accelerator cache eviction.
   ``_empty_accelerator_cache`` previously called
   ``torch.cuda.empty_cache()`` on the MLX cleanup path, adding
   allocator churn to a runtime that is not even involved in the fix.
   The helper now takes a ``backend`` argument: ``mlx`` -> MPS only,
   ``vllm`` -> CUDA / XPU only.  The ``pt`` backend never reaches this
   helper (it delegates to the handler's own ``_load_model_context``);
   a sketch of this helper follows below.

3. Regression tests for the lifecycle contract.
   New ``LoadScoringModelContextTests`` in
   ``acestep/core/scoring/scoring_test.py`` asserts the three
   invariants the fix depends on, using a fake handler + fake model
   that record ``.to()`` calls:

   - exactly one load + one offload per outermost entry
   - nested re-entries are no-ops (no extra ``.to()``, no extra
     ``get_hf_model_for_scoring`` calls)
   - MLX + offload_to_cpu drops ``_hf_model_for_scoring`` on outer exit
   - vllm keeps the cached HF scoring model (CUDA is fine)
   - MLX without offload keeps the cached model and performs no
     migrations at all

Refs: #1081

https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
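
A sketch of the backend-scoped eviction helper described in point 2 above (illustrative; the real _empty_accelerator_cache lives in acestep/core/scoring/lm_score.py and may differ in detail):

import torch

def _empty_accelerator_cache(backend):
    """Evict allocator caches only for the runtime the active backend touches."""
    if backend == "mlx":
        # The duplicate HF scoring copy runs on MPS.
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
    elif backend == "vllm":
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.empty_cache()
    # "pt" never reaches this helper: it delegates to _load_model_context().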
coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
acestep/core/scoring/scoring_test.py (1)

142-249: Split this test file or add a follow-up split plan in PR notes.

With these additions, acestep/core/scoring/scoring_test.py is now 252 LOC, which is above the 200 LOC hard cap. Consider moving LoadScoringModelContextTests into a dedicated test module.

Based on learnings, "only raise module-size concerns when a file exceeds 200 lines of code (LOC)," and this file now exceeds that threshold.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/scoring_test.py` around lines 142 - 249, The test file
exceeds the 200 LOC limit—split LoadScoringModelContextTests into its own test
module: create a new test file containing the LoadScoringModelContextTests class
(including its inner _FakeModel and _FakeHandler helpers) and any references to
_load_scoring_model_context, import that helper from the original module or move
it alongside the class, then update imports in the original scoring_test.py or
test suite so tests discover correctly; alternatively, if you prefer a follow-up
split, add a short PR note describing this planned extraction and which symbols
(_load_scoring_model_context, LoadScoringModelContextTests, _FakeModel,
_FakeHandler) will be moved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a11c3ac3-aae2-4ef9-87d1-815c25705c08

📥 Commits

Reviewing files that changed from the base of the PR and between 764f587 and 07b033f.

📒 Files selected for processing (2)
  • acestep/core/scoring/lm_score.py
  • acestep/core/scoring/scoring_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • acestep/core/scoring/lm_score.py

Comment thread acestep/core/scoring/scoring_test.py Outdated
Comment on lines +158 to +177
class _FakeModel:
    def __init__(self):
        self.device_calls = []

    def to(self, device):
        self.device_calls.append(str(device))
        return self

class _FakeHandler:
    def __init__(self, backend, offload=True):
        self.llm_backend = backend
        self.llm_initialized = True
        self.offload_to_cpu = offload
        self.device = "cuda"  # synthetic; _FakeModel.to is a recorder
        self._hf_model_for_scoring = LoadScoringModelContextTests._FakeModel()
        self.get_calls = 0

    def get_hf_model_for_scoring(self):
        self.get_calls += 1
        return self._hf_model_for_scoring

⚠️ Potential issue | 🟡 Minor

Add docstrings to new helper classes and methods.

Line 158 and Line 166 add new classes, and Line 162/Line 175 add new methods, but they currently have no docstrings.

✍️ Suggested patch
     class _FakeModel:
+        """Minimal model test double that records device transfer calls."""
+
         def __init__(self):
+            """Initialize transfer-call recorder."""
             self.device_calls = []
 
         def to(self, device):
+            """Record requested device transfer and return self."""
             self.device_calls.append(str(device))
             return self
 
     class _FakeHandler:
+        """Minimal handler test double used by scoring-model context tests."""
+
         def __init__(self, backend, offload=True):
+            """Set backend flags and initialize cached fake scoring model."""
             self.llm_backend = backend
             self.llm_initialized = True
             self.offload_to_cpu = offload
             self.device = "cuda"  # synthetic; _FakeModel.to is a recorder
             self._hf_model_for_scoring = LoadScoringModelContextTests._FakeModel()
             self.get_calls = 0
 
         def get_hf_model_for_scoring(self):
+            """Return cached scoring model and count retrievals."""
             self.get_calls += 1
             return self._hf_model_for_scoring

As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/scoring_test.py` around lines 158 - 177, Add concise
docstrings to the newly introduced helper classes and their methods: add
class-level docstrings to _FakeModel and _FakeHandler describing their
test-purpose (a recorder fake model and a handler stub for scoring tests), and
add method docstrings for _FakeModel.__init__, _FakeModel.to,
_FakeHandler.__init__, and _FakeHandler.get_hf_model_for_scoring that briefly
state parameters, side-effects (e.g., records device in device_calls, toggles
llm_initialized/offload flags) and return values (e.g., returns self or the fake
model). Ensure the docstrings follow project style (one-line summary plus short
param/return details where useful) and are placed directly under each
class/method definition.

Addresses CodeRabbit review feedback on #1097:

1. 200 LOC cap (nitpick).
   Previous commit pushed ``scoring_test.py`` to 252 LOC, above the
   project's 200 LOC hard cap.  Move ``LoadScoringModelContextTests``
   (plus its ``_FakeModel`` and ``_FakeHandler`` helpers) into a
   dedicated module ``lm_score_context_test.py``.  Post-split:
   - ``scoring_test.py``:              147 LOC
   - ``lm_score_context_test.py``:     150 LOC

2. Missing docstrings on helper classes/methods.
   ``_FakeModel``, ``_FakeHandler`` and their methods were
   originally inner classes of the test case and lacked docstrings.
   Pull them out to module level in the new file and document each
   class and method per project style (purpose, params, side-effects,
   return value).

Both test modules run cleanly together (23 tests total, all green).

Refs: #1081

https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
coderabbitai Bot left a comment


🧹 Nitpick comments (2)
acestep/core/scoring/lm_score_context_test.py (2)

108-117: Make the cache-release assertion resilient to delattr cleanup.

assertIsNone(handler._hf_model_for_scoring) is brittle if implementation clears by deleting the attribute instead of setting it to None. Prefer getattr(..., None) so both valid cleanup styles pass.

Suggested test hardening
-        self.assertIsNone(handler._hf_model_for_scoring)
+        self.assertIsNone(getattr(handler, "_hf_model_for_scoring", None))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/lm_score_context_test.py` around lines 108 - 117, The
test test_mlx_outermost_exit_drops_cached_model is brittle because it accesses
handler._hf_model_for_scoring directly; update the assertions to use
getattr(handler, "_hf_model_for_scoring", None) so deletion vs. setting to None
both pass — e.g., replace self.assertIsNotNone(handler._hf_model_for_scoring)
and self.assertIsNone(handler._hf_model_for_scoring) with
self.assertIsNotNone(getattr(handler, "_hf_model_for_scoring", None)) and
self.assertIsNone(getattr(handler, "_hf_model_for_scoring", None)), locating the
checks in test_mlx_outermost_exit_drops_cached_model for _FakeHandler and
_load_scoring_model_context usage.

138-147: Add one thread-isolation test for the reentrant guard.

Current coverage is single-threaded. Since the behavior depends on thread-local depth tracking, add a two-thread regression test to ensure each thread performs its own outermost entry accounting.

Suggested additional test
+import threading
 import unittest
@@
 class LoadScoringModelContextTests(unittest.TestCase):
@@
     def test_get_hf_called_only_by_outermost_entry(self):
         """Nested entries must not re-query ``get_hf_model_for_scoring``."""
         handler = _FakeHandler("mlx")
         with _load_scoring_model_context(handler):
             with _load_scoring_model_context(handler):
                 with _load_scoring_model_context(handler):
                     pass
         # Outer entry queries once; nested entries are pure no-ops.
         self.assertEqual(handler.get_calls, 1)
+
+    def test_outermost_detection_is_thread_local(self):
+        """Each thread should perform its own outermost model fetch."""
+        handler = _FakeHandler("mlx")
+        barrier = threading.Barrier(2)
+
+        def worker():
+            barrier.wait()
+            with _load_scoring_model_context(handler):
+                with _load_scoring_model_context(handler):
+                    pass
+
+        t1 = threading.Thread(target=worker)
+        t2 = threading.Thread(target=worker)
+        t1.start()
+        t2.start()
+        t1.join()
+        t2.join()
+
+        self.assertEqual(handler.get_calls, 2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/lm_score_context_test.py` around lines 138 - 147, Add a
multithread regression test alongside test_get_hf_called_only_by_outermost_entry
that verifies the reentrant guard is thread-local: create a shared _FakeHandler
instance, start two threads where each thread enters multiple nested
_load_scoring_model_context(handler) calls (e.g., three nested withs) and then
exits; join both threads and assert handler.get_calls == 2 to ensure each thread
performed its own outermost entry accounting; use threading.Thread and any
needed synchronization (join) so the test is deterministic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5baffc84-ba39-4847-9cb4-38919f009b8a

📥 Commits

Reviewing files that changed from the base of the PR and between 07b033f and e4e440d.

📒 Files selected for processing (2)
  • acestep/core/scoring/lm_score_context_test.py
  • acestep/core/scoring/scoring_test.py
✅ Files skipped from review due to trivial changes (1)
  • acestep/core/scoring/scoring_test.py

claude and others added 2 commits April 13, 2026 03:22
Addresses CodeRabbit review feedback on #1097:

1. Resilient cached-model drop assertions.
   ``test_mlx_outermost_exit_drops_cached_model`` now uses
   ``getattr(handler, "_hf_model_for_scoring", None)`` so the test
   still passes if the implementation ever switches from
   ``setattr(handler, "_hf_model_for_scoring", None)`` to deleting
   the attribute.  Both cleanup styles are valid.

2. Thread-isolation regression for the reentrancy depth.
   New ``test_outermost_detection_is_thread_local`` starts two worker
   threads racing into the same handler via a ``threading.Barrier``
   and asserts that ``get_hf_model_for_scoring`` fires twice (once
   per thread's outermost entry).  If the reentrancy depth ever
   regressed from ``threading.local`` to a process-global counter,
   one thread would see ``depth > 0`` at outer entry, mis-identify
   itself as "nested", and skip the model fetch -- making the assert
   fail fast.  Uses ``offload=False`` so the MLX release path does
   not drop the cached model between threads and confuse the count.

Total scoring tests: 24 (23 previously + 1 new).  All green.
``lm_score_context_test.py`` at 190 LOC (under the 200 cap).

Refs: #1081

https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
Wrap the CPU offload call in a nested try/finally so that the MLX-specific
release path (_hf_model_for_scoring = None, gc.collect, empty MPS cache) runs
regardless of whether model.to("cpu") succeeds.  Without this, a failing
offload skips the cleanup and leaves the ~8 GB cached HF scoring model
resident — moving the unified-memory leak from the happy path to the
exception path.

Adds a regression test that simulates an offload failure and asserts the
cached model is still released.

Finding: empirically reproduced on macOS (model.to("cpu") raises →
_hf_model_for_scoring remains non-None without the nested try/finally).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
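
The shape of that fix, as a sketch (simplified; the real exit path also handles the non-MLX branches, and the helper name here is illustrative):

import contextlib
import gc
import torch

@contextlib.contextmanager
def _offload_with_guaranteed_release(llm_handler, model, outermost):
    """Ensure the MLX release path runs even when the CPU offload raises."""
    try:
        yield
    finally:
        if outermost and llm_handler.offload_to_cpu:
            try:
                model.to("cpu")  # may raise under memory pressure
            finally:
                if llm_handler.llm_backend == "mlx":
                    # Release the cached duplicate regardless of the offload outcome.
                    llm_handler._hf_model_for_scoring = None
                    gc.collect()
                    if torch.backends.mps.is_available():
                        torch.mps.empty_cache()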
ChuxiJ (Author) commented Apr 16, 2026

Review: verified on macOS, one issue found and fixed

I read the full diff against main, traced the call graph, and ran empirical tests on a Mac. Here's what I found:

✅ Cross-thread stale-model race — NOT a real bug

I wrote a test with two threads racing into _load_scoring_model_context on a shared handler. Event-log output confirmed:

A: entering → A: inside_ctx → B: entering (BLOCKED) → A: exiting → B: inside_ctx

Thread B is blocked at the RLock acquisition until Thread A fully exits. The with lock: at L202 wraps the entire try: yield block, so the lock is held across the yield. This makes the race condition described in a prior code-review pass unreachable — the RLock serializes the shared _hf_model_for_scoring load/offload transition correctly.
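
A sketch of this kind of two-thread event-log probe (assuming the module-level test doubles from lm_score_context_test.py; exact names are illustrative):

import threading

from acestep.core.scoring.lm_score import _load_scoring_model_context
from acestep.core.scoring.lm_score_context_test import _FakeHandler

events = []  # list.append is atomic under the GIL, good enough for this probe
barrier = threading.Barrier(2)
shared_handler = _FakeHandler("mlx", offload=False)

def worker(name):
    barrier.wait()  # line both threads up before racing into the context
    events.append(f"{name}: entering")
    with _load_scoring_model_context(shared_handler):  # second arrival blocks on the RLock
        events.append(f"{name}: inside_ctx")
    events.append(f"{name}: exiting")

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(" -> ".join(events))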

The existing test_outermost_detection_is_thread_local test also passed on macOS with both threads correctly identified as outermost (get_calls == 2).

❌ Exception-path cleanup — was a real bug, now fixed

I injected a RuntimeError into model.to("cpu") and observed that _hf_model_for_scoring was not cleared afterwards: the MLX release path at L234+ was skipped because the exception propagated out of the enclosing finally block before the release code ran.

Reproduction:

from acestep.core.scoring.lm_score import _load_scoring_model_context

class FailingModel:
    def to(self, device):
        if device == "cpu":
            raise RuntimeError("simulated offload failure")
        return self

handler = Handler("mlx", offload=True)  # the suite's fake handler (_FakeHandler in lm_score_context_test.py)
handler._hf_model_for_scoring = FailingModel()
with _load_scoring_model_context(handler):
    pass  # the simulated RuntimeError propagates out of the context manager
# → _hf_model_for_scoring is still cached (should be None)

Fix pushed in 1f03d25: wrapped the CPU offload in a nested try/finally so the MLX-specific release (_hf_model_for_scoring = None, gc.collect(), torch.mps.empty_cache()) runs regardless of whether the offload succeeds. Added test_mlx_cleanup_runs_when_offload_raises as a regression test — all 8 tests pass on macOS.

Test results

8 passed in 1.11s

All original 7 tests + the new exception-path test pass on macOS (Python 3.11.4, Apple Silicon).

Other verification notes

  • _empty_accelerator_cache(backend) correctly scopes cache eviction to the active backend (mlx → MPS, vllm → CUDA/XPU). No cross-backend interference.
  • The only external entry point is scoring.py:97 (calculate_pmi_score_per_condition), which always wraps all inner calls in one outer _load_scoring_model_context. No call site bypasses the outer wrapper.
  • Serializing the vllm backend under the same RLock is intentional and acceptable per the PR description.
  • Thread-local depth tracking via _scoring_tls keyed by id(llm_handler) is correct for this use case (synchronous torch.no_grad forward pass, no async suspension points).

coderabbitai Bot left a comment


🧹 Nitpick comments (2)
acestep/core/scoring/lm_score.py (2)

1-7: Optional: Consider splitting this module in a follow-up.

At ~613 LOC, this module exceeds the 200 LOC guideline. The scoring logic could potentially be split into:

  • lm_score_context.py: Context manager and thread-safety primitives (~150 LOC)
  • lm_score.py: Scoring calculations and public API

This is not urgent for this PR since the focus is the memory leak fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/lm_score.py` around lines 1 - 7, The lm_score.py module
is too large (~613 LOC) and should be split to improve maintainability: extract
context/thread-safety logic (e.g., the context manager class and any locking
primitives and functions referenced by names like LMScoreContext, _thread_local,
or similar) into a new lm_score_context.py, and keep scoring calculations and
public API functions (e.g., compute_perplexity, compute_pmi, top_k_recall,
metadata_recall, and get_reward_score) in lm_score.py; update imports so
lm_score.py imports the context/primitive symbols from lm_score_context.py and
run tests to ensure no behavioral changes.

587-587: Minor: prompt_uncond is redundantly reassigned.

prompt_uncond is already assigned at line 548 with the same value. This reassignment inside the lyrics PMI block is unnecessary.

♻️ Suggested fix
             if lyrics:
                 target_text = f"<think>\n</think>\n# Lyric\n{lyrics}\n"

                 log_prob_cond = _calculate_log_prob(llm_handler, formatted_prompt, target_text)

-                prompt_uncond = llm_handler.build_formatted_prompt_for_understanding(audio_codes="NO USER INPUT", is_negative_prompt=False)
                 log_prob_uncond = _calculate_log_prob(llm_handler, prompt_uncond, target_text)

                 scores['lyrics'] = pmi_to_normalized_score(log_prob_cond - log_prob_uncond, scale=score_scale)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/core/scoring/lm_score.py` at line 587, The variable prompt_uncond is
redundantly reassigned inside the lyrics PMI block using
llm_handler.build_formatted_prompt_for_understanding("NO USER INPUT",
is_negative_prompt=False) even though it was already set earlier in the
function; remove this duplicate assignment from the lyrics PMI block and rely on
the original prompt_uncond value (ensure no other code in the block depends on a
different value or side-effects from rebuild).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4ed13971-072a-4e10-a960-c5b0cce4bee8

📥 Commits

Reviewing files that changed from the base of the PR and between d2426f4 and 1f03d25.

📒 Files selected for processing (2)
  • acestep/core/scoring/lm_score.py
  • acestep/core/scoring/lm_score_context_test.py

ChuxiJ (Author) commented Apr 30, 2026

PR #1170 now includes a disk-backed Gradio feature cache for the Auto Score / Auto LRC tensor-retention path: per-sample alignment tensors are persisted under the generated source session (or feature_sessions) and batch history keeps cache paths instead of large tensors. I referenced this PR from #1170 rather than using a closing keyword because #1097 also contains a separate MLX scoring-model lifecycle optimization.

