fix(scoring): stop Autoscore unified-memory leak on MLX (issue #1081)#1097
Conversation
On the MLX backend, Autoscore was the cause of a catastrophic unified-memory leak on Apple Silicon (issue #1081):

1. `get_hf_model_for_scoring()` loads a *separate* HuggingFace PyTorch copy of the LM (the MLX weights cannot be used for torch teacher-forcing scoring). For the 5Hz-lm-4B that is roughly ~8 GB of CPU-resident memory cached on the handler and never freed.
2. `_load_scoring_model_context` moved that cached model CPU <-> MPS on *every* forward pass. A single Autoscore pass performs dozens of forward passes per sample (metadata recall, caption PMI, lyrics PMI, each with a conditional and unconditional prompt), so the MPS allocator accumulated fragmented multi-GB allocations.
3. With `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` (required to run at all on 32 GB Macs with both XL-Turbo and 5Hz-lm-4B loaded), the MPS allocator has no high-watermark guardrail, and the fragmentation above pushed the system past its physical memory limit after only a few generations.

This change fixes the accumulating leak with two localised edits in `acestep/core/scoring/lm_score.py`:

- `_load_scoring_model_context` is now **reentrant**. Nested entries are tracked via a depth counter on the handler and become no-ops. Only the outermost entry performs the CPU->accelerator migration and only the outermost exit performs the offload.
- `calculate_pmi_score_per_condition` wraps its entire scoring body in a single outer `_load_scoring_model_context`. All the inner `_get_logits_and_target_for_scoring` calls now see an active outer context and skip their own migrations. Result: exactly one CPU<->MPS migration per Autoscore pass instead of one per forward pass.
- On MLX with `offload_to_cpu` enabled, the outermost exit additionally drops the cached `_hf_model_for_scoring`, runs `gc.collect()` and empties the MPS cache so the ~8 GB duplicate PyTorch copy is returned to the OS between scoring passes. The next scoring call re-materialises it via `AutoModelForCausalLM.from_pretrained`; HF hub weight caching makes this a fast rematerialisation rather than a redownload.

Refs: #1081
https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
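For illustration, a minimal sketch of the reentrant shape described above — not the repository's actual implementation (locking, the `pt`/`vllm` branches, and error handling are omitted, and the body is inferred only from the behaviour listed in this description):

```python
import gc
from contextlib import contextmanager

import torch


@contextmanager
def _load_scoring_model_context(llm_handler):
    """Only the outermost entry migrates the scoring model; nested entries are no-ops."""
    depth = getattr(llm_handler, "_scoring_ctx_depth", 0)
    llm_handler._scoring_ctx_depth = depth + 1
    model = llm_handler.get_hf_model_for_scoring()
    if depth == 0 and llm_handler.offload_to_cpu:
        model.to(llm_handler.device)  # single CPU -> accelerator move per Autoscore pass
    try:
        yield model
    finally:
        llm_handler._scoring_ctx_depth = depth
        if depth == 0 and llm_handler.offload_to_cpu:
            model.to("cpu")  # single accelerator -> CPU move per Autoscore pass
            if llm_handler.llm_backend == "mlx":
                llm_handler._hf_model_for_scoring = None  # drop the ~8 GB duplicate copy
                gc.collect()
                if torch.backends.mps.is_available():
                    torch.mps.empty_cache()
```

Nested calls only bump and restore the depth counter, which is what lets the single outer `with` in `calculate_pmi_score_per_condition` collapse dozens of per-forward migrations into one.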
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a thread-local, reentrant scoring-model context with per-handler outermost serialization, explicit accelerator-cache eviction, MLX offload cleanup (model deletion + GC), and wraps the full PMI scoring pass in one outer context to avoid repeated CPU↔accelerator migrations.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Thread as Thread
participant Handler as LLM Handler
participant Lock as RLock (per-handler)
participant HFModel as HF Scoring Model
participant Accelerator as Accelerator (CUDA/MLX)
participant GC as GC
Thread->>Handler: enter _load_scoring_model_context()
Handler->>Lock: acquire (outermost serialized)
alt cached model not on accelerator
Handler->>HFModel: get_hf_model_for_scoring()
HFModel->>Accelerator: .to("cuda") / move to accel
end
Note right of Thread: scoring work (PMI passes) executes
Thread->>HFModel: perform conditional/unconditional forwards
Thread->>Handler: exit _load_scoring_model_context()
Lock->>Handler: release (outermost exit)
alt backend == "mlx" and offload_to_cpu == true
Handler->>HFModel: .to("cpu") / offload
Handler->>Handler: clear cached _hf_model_for_scoring
Handler->>GC: gc.collect()
Handler->>Accelerator: empty accelerator cache
else
Handler->>Accelerator: maybe empty accelerator cache
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
acestep/core/scoring/lm_score.py (1)
129-155: ⚠️ Potential issue | 🟠 Major — Make the reentrancy state request-scoped, not handler-scoped.
`_scoring_ctx_depth` lives on `llm_handler`, so two concurrent Autoscore calls on the same handler are treated as "nested". One outer exit can then reset the depth to `0` and clear `_hf_model_for_scoring` while the other call is still inside `model(...)`, which makes this a race rather than true reentrancy. Please move the nesting state to thread/task-local storage and protect the shared load/offload transition with a lock or refcount.

Also applies to: 168-176
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/scoring/lm_score.py` around lines 129 - 155, The current reentrancy counter _scoring_ctx_depth on llm_handler causes races for concurrent Autoscore calls; change it to be request-scoped by moving the nesting state into task-local/thread-local storage (e.g., use contextvars or threading.local) instead of storing _scoring_ctx_depth on the handler, and protect the shared model load/offload/change with a global lock or a handler-level refcount (increment/decrement in the task-local context) so calls can nest per-request without one call resetting the handler state for others; update the context manager logic around get_hf_model_for_scoring, offload_to_cpu, device and model.to to consult the task-local depth and use the lock/refcount when performing model.to(...) and when clearing _hf_model_for_scoring.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/core/scoring/lm_score.py`:
- Around line 478-519: Add a focused regression test that constructs a fake
llm_handler and exercises the Autoscore scoring path that wraps the block using
_load_scoring_model_context, asserting three invariants: (1) the model
load/offload cycle is invoked exactly once for the outer scoring pass (mock the
load/offload side-effects on the fake handler), (2) nested re-entries into
_load_scoring_model_context (triggered by calls like
_get_logits_and_target_for_scoring or _calculate_log_prob) are no-ops (verify
they do not increment load/unload counters), and (3) after the outer context
exits the module-level _hf_model_for_scoring is cleared/None. Use the same call
sites as in the diff (invoke the scoring code path via llm_handler and calls to
_calculate_log_prob/_calculate_metadata_recall) so the test exercises the exact
lifecycle contract.
- Around line 71-83: The helper _empty_accelerator_cache currently empties
CUDA/XPU/MPS caches and is invoked on every scoring offload, which affects
non-target runtimes; change _empty_accelerator_cache to only perform
MPS-specific cleanup (check torch.backends.mps.is_available() and
torch.mps.empty_cache()) or add a backend parameter (e.g., backend="mps") and
only empty the cache when backend == "mps"; leave CUDA/XPU paths untouched and
ensure the offload call site that triggers the MLX/MPS cleanup calls this
MPS-scoped helper instead of a blanket cache eviction.
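For illustration, a backend-scoped version of the helper could look like the sketch below, using the backend names the later commit describes (`mlx` → MPS only, `vllm` → CUDA/XPU only). This is a hypothetical simplification, not the repository's code:

```python
import torch


def _empty_accelerator_cache(backend: str) -> None:
    """Release cached allocator blocks only for the runtime actually in play."""
    if backend == "mlx":
        # The MLX path runs the HF scoring copy on MPS; leave CUDA/XPU untouched.
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
    elif backend == "vllm":
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.empty_cache()
```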
---
Outside diff comments:
In `@acestep/core/scoring/lm_score.py`:
- Around line 129-155: The current reentrancy counter _scoring_ctx_depth on
llm_handler causes races for concurrent Autoscore calls; change it to be
request-scoped by moving the nesting state into task-local/thread-local storage
(e.g., use contextvars or threading.local) instead of storing _scoring_ctx_depth
on the handler, and protect the shared model load/offload/change with a global
lock or a handler-level refcount (increment/decrement in the task-local context)
so calls can nest per-request without one call resetting the handler state for
others; update the context manager logic around get_hf_model_for_scoring,
offload_to_cpu, device and model.to to consult the task-local depth and use the
lock/refcount when performing model.to(...) and when clearing
_hf_model_for_scoring.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3b905821-7395-4783-9889-e4973ea4db2c
📒 Files selected for processing (1)
acestep/core/scoring/lm_score.py
    # Batch *all* scoring forward passes under a single load/offload
    # cycle. The inner ``_get_logits_and_target_for_scoring`` calls
    # still enter ``_load_scoring_model_context``, but the reentrancy
    # guard turns those nested entries into no-ops -- so the cached HF
    # scoring model is moved CPU↔accelerator exactly once per Autoscore
    # pass instead of once per condition (issue #1081).
    with _load_scoring_model_context(llm_handler):
        # 1. Calculate Recall for Metadata Fields
        if metadata and isinstance(metadata, dict):
            scores = {}
            # Define which fields use which metric
            metadata_recall_keys = ['bpm', 'duration', 'genres', 'keyscale', 'language', 'timesignature']
            metadata_pmi_keys = ['caption']
            for key in metadata_recall_keys:
                if key in metadata and metadata[key] is not None:
                    recall_metadata = {key: metadata[key]}
                    field_scores = _calculate_metadata_recall(llm_handler, formatted_prompt, recall_metadata, topk=topk)
                    scores.update(field_scores)

            # 2. Calculate PMI for Caption
            for key in metadata_pmi_keys:
                if key in metadata and metadata[key] is not None:
                    cot_yaml = yaml.dump({key: metadata[key]}, allow_unicode=True, sort_keys=True).strip()
                    target_text = f"<think>\n{cot_yaml}\n</think>\n"

                    log_prob_cond = _calculate_log_prob(llm_handler, formatted_prompt, target_text)
                    log_prob_uncond = _calculate_log_prob(llm_handler, prompt_uncond, target_text)

                    pmi_normalized = pmi_to_normalized_score(log_prob_cond - log_prob_uncond, scale=score_scale)
                    scores[key] = pmi_normalized

        # 3. Calculate PMI for Lyrics
        if lyrics:
            target_text = f"<think>\n</think>\n# Lyric\n{lyrics}\n"

            log_prob_cond = _calculate_log_prob(llm_handler, formatted_prompt, target_text)

            prompt_uncond = llm_handler.build_formatted_prompt_for_understanding(audio_codes="NO USER INPUT", is_negative_prompt=False)
            log_prob_uncond = _calculate_log_prob(llm_handler, prompt_uncond, target_text)

            scores['lyrics'] = pmi_to_normalized_score(log_prob_cond - log_prob_uncond, scale=score_scale)
Add a focused regression test for the new lifecycle contract.
This now depends on three invariants that are easy to regress silently: one outer load/offload per scoring pass, nested _load_scoring_model_context() entries staying no-op, and MLX outermost exit clearing _hf_model_for_scoring. I’d add a fake-handler test around this block before merging.
Based on learnings: "AI-Agent Workflow: Add/update focused tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/core/scoring/lm_score.py` around lines 478 - 519, Add a focused
regression test that constructs a fake llm_handler and exercises the Autoscore
scoring path that wraps the block using _load_scoring_model_context, asserting
three invariants: (1) the model load/offload cycle is invoked exactly once for
the outer scoring pass (mock the load/offload side-effects on the fake handler),
(2) nested re-entries into _load_scoring_model_context (triggered by calls like
_get_logits_and_target_for_scoring or _calculate_log_prob) are no-ops (verify
they do not increment load/unload counters), and (3) after the outer context
exits the module-level _hf_model_for_scoring is cleared/None. Use the same call
sites as in the diff (invoke the scoring code path via llm_handler and calls to
_calculate_log_prob/_calculate_metadata_recall) so the test exercises the exact
lifecycle contract.
Addresses CodeRabbit review feedback on #1097:

1. Thread safety (was major concurrency race). ``_scoring_ctx_depth`` lived on ``llm_handler``, so two concurrent Autoscore calls on the same handler saw each other's depth counter and one outer exit could clear ``_hf_model_for_scoring`` while the other call was still inside a forward pass. Nesting state is now kept in ``threading.local`` keyed by ``id(llm_handler)``, and the outermost entry on each thread acquires a per-handler ``threading.RLock`` so the shared load / offload / drop transition is atomic across threads. A module-level creation lock guards lazy RLock installation so racing first callers can't install two different locks.

2. Backend-scoped accelerator cache eviction. ``_empty_accelerator_cache`` previously called ``torch.cuda.empty_cache()`` on the MLX cleanup path, adding allocator churn to a runtime that is not even involved in the fix. The helper now takes a ``backend`` argument: ``mlx`` -> MPS only, ``vllm`` -> CUDA / XPU only. The ``pt`` backend never reaches this helper (it delegates to the handler's own ``_load_model_context``).

3. Regression tests for the lifecycle contract. New ``LoadScoringModelContextTests`` in ``acestep/core/scoring/scoring_test.py`` asserts the three invariants the fix depends on, using a fake handler + fake model that record ``.to()`` calls:

   - exactly one load + one offload per outermost entry
   - nested re-entries are no-ops (no extra ``.to()``, no extra ``get_hf_model_for_scoring`` calls)
   - MLX + offload_to_cpu drops ``_hf_model_for_scoring`` on outer exit
   - vllm keeps the cached HF scoring model (CUDA is fine)
   - MLX without offload keeps the cached model and performs no migrations at all

Refs: #1081
https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
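A rough sketch of the locking scheme point 1 describes — thread-local depths keyed by `id(llm_handler)` plus a lazily installed per-handler `RLock`. The helper names `_handler_lock` and `_enter_depth` are hypothetical; the real module keeps more state:

```python
import threading

_nesting = threading.local()            # per-thread {handler_id: depth}
_lock_creation_lock = threading.Lock()  # guards lazy RLock installation


def _handler_lock(llm_handler):
    """Return the handler's RLock, installing it exactly once across racing threads."""
    lock = getattr(llm_handler, "_scoring_ctx_lock", None)
    if lock is None:
        with _lock_creation_lock:
            lock = getattr(llm_handler, "_scoring_ctx_lock", None)
            if lock is None:
                lock = threading.RLock()
                llm_handler._scoring_ctx_lock = lock
    return lock


def _enter_depth(llm_handler):
    """Bump this thread's nesting depth for the handler and return the prior depth."""
    depths = getattr(_nesting, "depths", None)
    if depths is None:
        depths = _nesting.depths = {}
    key = id(llm_handler)
    prior = depths.get(key, 0)
    depths[key] = prior + 1
    return prior
```

An outermost entry (prior depth 0) would take `_handler_lock(llm_handler)` before loading, so another thread's outer exit cannot clear `_hf_model_for_scoring` mid-forward.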
Actionable comments posted: 1
🧹 Nitpick comments (1)
acestep/core/scoring/scoring_test.py (1)
142-249: Split this test file or add a follow-up split plan in PR notes.

With these additions, `acestep/core/scoring/scoring_test.py` is now 252 LOC, which is above the 200 LOC hard cap. Consider moving `LoadScoringModelContextTests` into a dedicated test module. Based on learnings, "only raise module-size concerns when a file exceeds 200 lines of code (LOC)," and this file now exceeds that threshold.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/scoring/scoring_test.py` around lines 142 - 249, The test file exceeds the 200 LOC limit—split LoadScoringModelContextTests into its own test module: create a new test file containing the LoadScoringModelContextTests class (including its inner _FakeModel and _FakeHandler helpers) and any references to _load_scoring_model_context, import that helper from the original module or move it alongside the class, then update imports in the original scoring_test.py or test suite so tests discover correctly; alternatively, if you prefer a follow-up split, add a short PR note describing this planned extraction and which symbols (_load_scoring_model_context, LoadScoringModelContextTests, _FakeModel, _FakeHandler) will be moved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/core/scoring/scoring_test.py`:
- Around line 158-177: Add concise docstrings to the newly introduced helper
classes and their methods: add class-level docstrings to _FakeModel and
_FakeHandler describing their test-purpose (a recorder fake model and a handler
stub for scoring tests), and add method docstrings for _FakeModel.__init__,
_FakeModel.to, _FakeHandler.__init__, and _FakeHandler.get_hf_model_for_scoring
that briefly state parameters, side-effects (e.g., records device in
device_calls, toggles llm_initialized/offload flags) and return values (e.g.,
returns self or the fake model). Ensure the docstrings follow project style
(one-line summary plus short param/return details where useful) and are placed
directly under each class/method definition.
---
Nitpick comments:
In `@acestep/core/scoring/scoring_test.py`:
- Around line 142-249: The test file exceeds the 200 LOC limit—split
LoadScoringModelContextTests into its own test module: create a new test file
containing the LoadScoringModelContextTests class (including its inner
_FakeModel and _FakeHandler helpers) and any references to
_load_scoring_model_context, import that helper from the original module or move
it alongside the class, then update imports in the original scoring_test.py or
test suite so tests discover correctly; alternatively, if you prefer a follow-up
split, add a short PR note describing this planned extraction and which symbols
(_load_scoring_model_context, LoadScoringModelContextTests, _FakeModel,
_FakeHandler) will be moved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a11c3ac3-aae2-4ef9-87d1-815c25705c08
📒 Files selected for processing (2)
acestep/core/scoring/lm_score.py
acestep/core/scoring/scoring_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
- acestep/core/scoring/lm_score.py
    class _FakeModel:
        def __init__(self):
            self.device_calls = []

        def to(self, device):
            self.device_calls.append(str(device))
            return self


    class _FakeHandler:
        def __init__(self, backend, offload=True):
            self.llm_backend = backend
            self.llm_initialized = True
            self.offload_to_cpu = offload
            self.device = "cuda"  # synthetic; _FakeModel.to is a recorder
            self._hf_model_for_scoring = LoadScoringModelContextTests._FakeModel()
            self.get_calls = 0

        def get_hf_model_for_scoring(self):
            self.get_calls += 1
            return self._hf_model_for_scoring
Add docstrings to new helper classes and methods.
Line 158 and Line 166 add new classes, and Line 162/Line 175 add new methods, but they currently have no docstrings.
✍️ Suggested patch
     class _FakeModel:
    +    """Minimal model test double that records device transfer calls."""
    +
         def __init__(self):
    +        """Initialize transfer-call recorder."""
             self.device_calls = []

         def to(self, device):
    +        """Record requested device transfer and return self."""
             self.device_calls.append(str(device))
             return self

     class _FakeHandler:
    +    """Minimal handler test double used by scoring-model context tests."""
    +
         def __init__(self, backend, offload=True):
    +        """Set backend flags and initialize cached fake scoring model."""
             self.llm_backend = backend
             self.llm_initialized = True
             self.offload_to_cpu = offload
             self.device = "cuda"  # synthetic; _FakeModel.to is a recorder
             self._hf_model_for_scoring = LoadScoringModelContextTests._FakeModel()
             self.get_calls = 0

         def get_hf_model_for_scoring(self):
    +        """Return cached scoring model and count retrievals."""
             self.get_calls += 1
             return self._hf_model_for_scoring

As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/core/scoring/scoring_test.py` around lines 158 - 177, Add concise
docstrings to the newly introduced helper classes and their methods: add
class-level docstrings to _FakeModel and _FakeHandler describing their
test-purpose (a recorder fake model and a handler stub for scoring tests), and
add method docstrings for _FakeModel.__init__, _FakeModel.to,
_FakeHandler.__init__, and _FakeHandler.get_hf_model_for_scoring that briefly
state parameters, side-effects (e.g., records device in device_calls, toggles
llm_initialized/offload flags) and return values (e.g., returns self or the fake
model). Ensure the docstrings follow project style (one-line summary plus short
param/return details where useful) and are placed directly under each
class/method definition.
Addresses CodeRabbit review feedback on #1097:

1. 200 LOC cap (nitpick). Previous commit pushed ``scoring_test.py`` to 252 LOC, above the project's 200 LOC hard cap. Move ``LoadScoringModelContextTests`` (plus its ``_FakeModel`` and ``_FakeHandler`` helpers) into a dedicated module ``lm_score_context_test.py``. Post-split:

   - ``scoring_test.py``: 147 LOC
   - ``lm_score_context_test.py``: 150 LOC

2. Missing docstrings on helper classes/methods. ``_FakeModel``, ``_FakeHandler`` and their methods were originally inner classes of the test case and lacked docstrings. Pull them out to module level in the new file and document each class and method per project style (purpose, params, side-effects, return value).

Both test modules run cleanly together (23 tests total, all green).

Refs: #1081
https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
🧹 Nitpick comments (2)
acestep/core/scoring/lm_score_context_test.py (2)
108-117: Make the cache-release assertion resilient to `delattr` cleanup.

`assertIsNone(handler._hf_model_for_scoring)` is brittle if the implementation clears the cache by deleting the attribute instead of setting it to `None`. Prefer `getattr(..., None)` so both valid cleanup styles pass.

Suggested test hardening

    - self.assertIsNone(handler._hf_model_for_scoring)
    + self.assertIsNone(getattr(handler, "_hf_model_for_scoring", None))

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/scoring/lm_score_context_test.py` around lines 108 - 117, The test test_mlx_outermost_exit_drops_cached_model is brittle because it accesses handler._hf_model_for_scoring directly; update the assertions to use getattr(handler, "_hf_model_for_scoring", None) so deletion vs. setting to None both pass — e.g., replace self.assertIsNotNone(handler._hf_model_for_scoring) and self.assertIsNone(handler._hf_model_for_scoring) with self.assertIsNotNone(getattr(handler, "_hf_model_for_scoring", None)) and self.assertIsNone(getattr(handler, "_hf_model_for_scoring", None)), locating the checks in test_mlx_outermost_exit_drops_cached_model for _FakeHandler and _load_scoring_model_context usage.
138-147: Add one thread-isolation test for the reentrant guard.

Current coverage is single-threaded. Since the behavior depends on thread-local depth tracking, add a two-thread regression test to ensure each thread performs its own outermost entry accounting.
Suggested additional test
Suggested additional test

    +import threading
     import unittest
    @@
     class LoadScoringModelContextTests(unittest.TestCase):
    @@
         def test_get_hf_called_only_by_outermost_entry(self):
             """Nested entries must not re-query ``get_hf_model_for_scoring``."""
             handler = _FakeHandler("mlx")
             with _load_scoring_model_context(handler):
                 with _load_scoring_model_context(handler):
                     with _load_scoring_model_context(handler):
                         pass
             # Outer entry queries once; nested entries are pure no-ops.
             self.assertEqual(handler.get_calls, 1)
    +
    +    def test_outermost_detection_is_thread_local(self):
    +        """Each thread should perform its own outermost model fetch."""
    +        handler = _FakeHandler("mlx")
    +        barrier = threading.Barrier(2)
    +
    +        def worker():
    +            barrier.wait()
    +            with _load_scoring_model_context(handler):
    +                with _load_scoring_model_context(handler):
    +                    pass
    +
    +        t1 = threading.Thread(target=worker)
    +        t2 = threading.Thread(target=worker)
    +        t1.start()
    +        t2.start()
    +        t1.join()
    +        t2.join()
    +
    +        self.assertEqual(handler.get_calls, 2)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/scoring/lm_score_context_test.py` around lines 138 - 147, Add a multithread regression test alongside test_get_hf_called_only_by_outermost_entry that verifies the reentrant guard is thread-local: create a shared _FakeHandler instance, start two threads where each thread enters multiple nested _load_scoring_model_context(handler) calls (e.g., three nested withs) and then exits; join both threads and assert handler.get_calls == 2 to ensure each thread performed its own outermost entry accounting; use threading.Thread and any needed synchronization (join) so the test is deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@acestep/core/scoring/lm_score_context_test.py`:
- Around line 108-117: The test test_mlx_outermost_exit_drops_cached_model is
brittle because it accesses handler._hf_model_for_scoring directly; update the
assertions to use getattr(handler, "_hf_model_for_scoring", None) so deletion
vs. setting to None both pass — e.g., replace
self.assertIsNotNone(handler._hf_model_for_scoring) and
self.assertIsNone(handler._hf_model_for_scoring) with
self.assertIsNotNone(getattr(handler, "_hf_model_for_scoring", None)) and
self.assertIsNone(getattr(handler, "_hf_model_for_scoring", None)), locating the
checks in test_mlx_outermost_exit_drops_cached_model for _FakeHandler and
_load_scoring_model_context usage.
- Around line 138-147: Add a multithread regression test alongside
test_get_hf_called_only_by_outermost_entry that verifies the reentrant guard is
thread-local: create a shared _FakeHandler instance, start two threads where
each thread enters multiple nested _load_scoring_model_context(handler) calls
(e.g., three nested withs) and then exits; join both threads and assert
handler.get_calls == 2 to ensure each thread performed its own outermost entry
accounting; use threading.Thread and any needed synchronization (join) so the
test is deterministic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5baffc84-ba39-4847-9cb4-38919f009b8a
📒 Files selected for processing (2)
acestep/core/scoring/lm_score_context_test.py
acestep/core/scoring/scoring_test.py
✅ Files skipped from review due to trivial changes (1)
- acestep/core/scoring/scoring_test.py
Addresses CodeRabbit review feedback on #1097:

1. Resilient cached-model drop assertions. ``test_mlx_outermost_exit_drops_cached_model`` now uses ``getattr(handler, "_hf_model_for_scoring", None)`` so the test still passes if the implementation ever switches from ``setattr(handler, "_hf_model_for_scoring", None)`` to deleting the attribute. Both cleanup styles are valid.

2. Thread-isolation regression for the reentrancy depth. New ``test_outermost_detection_is_thread_local`` starts two worker threads racing into the same handler via a ``threading.Barrier`` and asserts that ``get_hf_model_for_scoring`` fires twice (once per thread's outermost entry). If the reentrancy depth ever regressed from ``threading.local`` to a process-global counter, one thread would see ``depth > 0`` at outer entry, mis-identify itself as "nested", and skip the model fetch -- making the assert fail fast. Uses ``offload=False`` so the MLX release path does not drop the cached model between threads and confuse the count.

Total scoring tests: 24 (23 previously + 1 new). All green. ``lm_score_context_test.py`` at 190 LOC (under the 200 cap).

Refs: #1081
https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
Wrap the CPU offload call in a nested try/finally so that the MLX-specific
release path (_hf_model_for_scoring = None, gc.collect, empty MPS cache) runs
regardless of whether model.to("cpu") succeeds. Without this, a failing
offload skips the cleanup and leaves the ~8 GB cached HF scoring model
resident — moving the unified-memory leak from the happy path to the
exception path.
Adds a regression test that simulates an offload failure and asserts the
cached model is still released.
Finding: empirically reproduced on macOS (model.to("cpu") raises →
_hf_model_for_scoring remains non-None without the nested try/finally).
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
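The nested try/finally shape this commit describes might look roughly like the sketch below, under the assumption that the cleanup lives in the context manager's exit path; `_offload_after_scoring` is a hypothetical name, not the repository's:

```python
import gc
from contextlib import contextmanager

import torch


@contextmanager
def _offload_after_scoring(llm_handler, model, outermost: bool):
    """MLX cleanup must run even if the CPU offload itself raises."""
    try:
        yield model
    finally:
        if outermost and llm_handler.offload_to_cpu:
            try:
                model.to("cpu")  # may raise, e.g. on a wedged MPS allocator
            finally:
                if llm_handler.llm_backend == "mlx":
                    llm_handler._hf_model_for_scoring = None  # still drop the ~8 GB copy
                    gc.collect()
                    if torch.backends.mps.is_available():
                        torch.mps.empty_cache()
```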
Review: verified on macOS, one issue found and fixed

I read the full diff against the base branch.

✅ Cross-thread stale-model race — NOT a real bug

I wrote a test with two threads racing into the same handler's scoring context. Thread B is blocked at the per-handler lock until thread A's outermost exit completes, so one thread cannot clear the cached model out from under the other; the existing locking already covers this case.

❌ Exception-path cleanup — was a real bug, now fixed

I injected a failure into the CPU offload and the MLX release path was skipped, leaving the cached HF scoring model resident.

Reproduction:

    class FailingModel:
        def to(self, device):
            if device == "cpu":
                raise RuntimeError("simulated offload failure")
            return self

    handler = Handler("mlx", offload=True)
    handler._hf_model_for_scoring = FailingModel()
    with _load_scoring_model_context(handler):
        pass
    # → _hf_model_for_scoring is still cached (should be None)

Fix pushed in the commit above.

Test results

All original 7 tests + the new exception-path test pass on macOS (Python 3.11.4, Apple Silicon).
🧹 Nitpick comments (2)
acestep/core/scoring/lm_score.py (2)
1-7: Optional: Consider splitting this module in a follow-up.

At ~613 LOC, this module exceeds the 200 LOC guideline. The scoring logic could potentially be split into:

- lm_score_context.py: Context manager and thread-safety primitives (~150 LOC)
- lm_score.py: Scoring calculations and public API

This is not urgent for this PR since the focus is the memory leak fix.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/scoring/lm_score.py` around lines 1 - 7, The lm_score.py module is too large (~613 LOC) and should be split to improve maintainability: extract context/thread-safety logic (e.g., the context manager class and any locking primitives and functions referenced by names like LMScoreContext, _thread_local, or similar) into a new lm_score_context.py, and keep scoring calculations and public API functions (e.g., compute_perplexity, compute_pmi, top_k_recall, metadata_recall, and get_reward_score) in lm_score.py; update imports so lm_score.py imports the context/primitive symbols from lm_score_context.py and run tests to ensure no behavioral changes.
587: Minor: `prompt_uncond` is redundantly reassigned.

`prompt_uncond` is already assigned at line 548 with the same value. This reassignment inside the lyrics PMI block is unnecessary.

♻️ Suggested fix

     if lyrics:
         target_text = f"<think>\n</think>\n# Lyric\n{lyrics}\n"

         log_prob_cond = _calculate_log_prob(llm_handler, formatted_prompt, target_text)

    -    prompt_uncond = llm_handler.build_formatted_prompt_for_understanding(audio_codes="NO USER INPUT", is_negative_prompt=False)
         log_prob_uncond = _calculate_log_prob(llm_handler, prompt_uncond, target_text)

         scores['lyrics'] = pmi_to_normalized_score(log_prob_cond - log_prob_uncond, scale=score_scale)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/scoring/lm_score.py` at line 587, The variable prompt_uncond is redundantly reassigned inside the lyrics PMI block using llm_handler.build_formatted_prompt_for_understanding("NO USER INPUT", is_negative_prompt=False) even though it was already set earlier in the function; remove this duplicate assignment from the lyrics PMI block and rely on the original prompt_uncond value (ensure no other code in the block depends on a different value or side-effects from rebuild).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@acestep/core/scoring/lm_score.py`:
- Around line 1-7: The lm_score.py module is too large (~613 LOC) and should be
split to improve maintainability: extract context/thread-safety logic (e.g., the
context manager class and any locking primitives and functions referenced by
names like LMScoreContext, _thread_local, or similar) into a new
lm_score_context.py, and keep scoring calculations and public API functions
(e.g., compute_perplexity, compute_pmi, top_k_recall, metadata_recall, and
get_reward_score) in lm_score.py; update imports so lm_score.py imports the
context/primitive symbols from lm_score_context.py and run tests to ensure no
behavioral changes.
- Line 587: The variable prompt_uncond is redundantly reassigned inside the
lyrics PMI block using llm_handler.build_formatted_prompt_for_understanding("NO
USER INPUT", is_negative_prompt=False) even though it was already set earlier in
the function; remove this duplicate assignment from the lyrics PMI block and
rely on the original prompt_uncond value (ensure no other code in the block
depends on a different value or side-effects from rebuild).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4ed13971-072a-4e10-a960-c5b0cce4bee8
📒 Files selected for processing (2)
acestep/core/scoring/lm_score.py
acestep/core/scoring/lm_score_context_test.py
PR #1170 now includes a disk-backed Gradio feature cache for the Auto Score / Auto LRC tensor-retention path: per-sample alignment tensors are persisted under the generated source session (or feature_sessions) and batch history keeps cache paths instead of large tensors. I referenced this PR from #1170 rather than using a closing keyword because #1097 also contains a separate MLX scoring-model lifecycle optimization.
Fixes the catastrophic unified-memory growth described in #1081, where enabling Autoscore on macOS with the MLX backend (`acestep-v15-xl-turbo` + `acestep-5Hz-lm-4B` loaded, 32 GB Mac) would drain RAM and crash the system after only a few generations.

Root cause
On the MLX backend, Autoscore triggers three compounding memory paths:
1. A second full copy of the LM is loaded as a HuggingFace PyTorch model. The MLX weights cannot be used for PyTorch teacher-forcing scoring, so `get_hf_model_for_scoring()` (`acestep/llm_inference.py:4163-4194`) calls `AutoModelForCausalLM.from_pretrained(...)` and caches the result on `self._hf_model_for_scoring`. For the 5Hz-lm-4B model that is roughly ~8 GB of CPU-resident memory that never gets freed.

2. That cached model was migrated CPU ↔ MPS on every forward pass. `_load_scoring_model_context` in `acestep/core/scoring/lm_score.py` moved the model to the accelerator and back inside every call. One Autoscore pass performs dozens of forward passes per sample (per-metadata recall + caption PMI + lyrics PMI, each with a conditional and unconditional prompt), so the MPS allocator accumulated fragmented multi-GB allocations every time.

3. `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` removes the last guardrail. That env var is required to even start the app with both models loaded on a 32 GB Mac, but it disables the MPS allocator's high-watermark check, letting the fragmentation above push the process past unified-memory limits.

Disabling Autoscore makes the leak disappear entirely, because `generation_progress.py:280` gates the whole scoring block behind `if auto_score:` — that matches the reporter's observation.

Fix
Two localised edits in `acestep/core/scoring/lm_score.py`:

1. `_load_scoring_model_context` is now reentrant

   Nested entries are tracked via a depth counter (`llm_handler._scoring_ctx_depth`) and become no-ops. Only the outermost entry performs the CPU → accelerator migration; only the outermost exit performs the offload. This mirrors the device-based reentrancy guard that already exists for the `pt` backend in `llm_inference.py:4075-4082`.

2. `calculate_pmi_score_per_condition` batches all forward passes under one outer context

   The whole scoring body is now wrapped in a single `with _load_scoring_model_context(llm_handler):`. All the inner `_get_logits_and_target_for_scoring` calls still enter the context, but the reentrancy guard turns them into no-ops. Result: exactly one CPU ↔ MPS migration per Autoscore pass, instead of one per forward pass.

3. MLX-specific: drop the cached HF copy on outermost exit
   When the backend is MLX and `offload_to_cpu` is enabled, the outermost exit additionally:

   - sets `llm_handler._hf_model_for_scoring = None`
   - runs `gc.collect()`
   - empties the MPS cache

   This returns the ~8 GB duplicate PyTorch copy to the OS between scoring passes. The next scoring call re-materialises the model via `AutoModelForCausalLM.from_pretrained`; HF hub weight caching makes this a fast rematerialisation from disk, not a redownload. A 32 GB Mac pays a one-time few-seconds reload cost on each Autoscore run in exchange for not crashing.
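For context, the cached loader that gets dropped and re-created probably looks roughly like the sketch below; the attribute name `lm_model_path` and the `from_pretrained` arguments are assumptions, only the lazy caching on `_hf_model_for_scoring` is taken from the description above:

```python
from transformers import AutoModelForCausalLM


def get_hf_model_for_scoring(self):
    """Lazily build and cache the PyTorch copy used for teacher-forcing scoring."""
    if getattr(self, "_hf_model_for_scoring", None) is None:
        # HF hub keeps the downloaded weights on disk, so re-materialising after
        # the MLX cleanup path is a local load from the cache, not a re-download.
        model = AutoModelForCausalLM.from_pretrained(self.lm_model_path, torch_dtype="auto")
        model.eval()
        self._hf_model_for_scoring = model
    return self._hf_model_for_scoring
```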
Other backends are unaffected:

- `pt`: still delegates to `llm_handler._load_model_context()`, which already has its own reentrancy guard.
- `vllm`: same hoisted outer context, but the HF scoring model is kept cached (the CUDA allocator handles this cleanly and VRAM is not unified with system RAM).

Files changed

- `acestep/core/scoring/lm_score.py` — +110 / -40

Test plan
- `acestep-v15-xl-turbo` + `acestep-5Hz-lm-4B`, enable Autoscore, run ≥5 consecutive generations. Expected: peak RSS plateaus between generations instead of growing monotonically; no crash.
- `[scoring] Released cached HF scoring model on MLX backend` after each Autoscore pass.
- `pt` backend unchanged (still uses `_load_model_context()` as before).

Follow-ups (not in this PR)
- `get_lyric_score` (`acestep/core/generation/handler/lyric_score.py:93`) is still once-per-sample. A future PR could hoist a single `_load_model_context("model")` around the whole Autoscore loop in `generation_progress.py:280`.
- Revisit the `acestep/gpu_config.py:261` heuristic.

Refs: #1081
https://claude.ai/code/session_01Wr7kjcUxgdkf5oUATzHY9U
Summary by CodeRabbit