5 changes: 5 additions & 0 deletions backend/utils/hf_offline_patch.py
@@ -12,6 +12,11 @@

logger = logging.getLogger(__name__)

# Module-level flag: if True, we've already tried with network and failed for
# this model. Prevents each individual config file from triggering its own
# 5-retry cycle.
_offline_retry_done = {}
Comment on lines +15 to +18

⚠️ Potential issue | 🔴 Critical

Fix is incomplete — _offline_retry_done is declared but never read or written.

The PR description states that this dict should be consulted to skip retries after one attempt per model, but no code in this file (or, based on the provided context, elsewhere) reads from or writes to _offline_retry_done. As it stands, this change is purely dead state and does not address the infinite retry loop described in issue #434. The except Exception block at lines 46–59 still just re-raises on offline errors — there's no gating on this dict, no per-model bookkeeping, and no short-circuit path.

Additionally, the comment describes the variable as a boolean ("if True, we've already tried…"), but it is declared as a dict. Please reconcile the intent (presumably Dict[str, bool] keyed by model_label) and wire it into the retry/exception flow.

A sketch of the likely intended wiring:

```diff
-_offline_retry_done = {}
+# Maps model_label -> True once an online retry has been attempted, so
+# subsequent per-file failures for the same model don't each trigger their
+# own retry cycle.
+_offline_retry_done: dict[str, bool] = {}
```

And inside force_offline_if_cached:

```diff
     try:
         yield
     except Exception as exc:
         if "offline" in str(exc).lower():
+            if _offline_retry_done.get(model_label):
+                logger.warning(
+                    "[offline-guard] Offline load failed for %s; retry already "
+                    "attempted previously, propagating.", model_label or "model",
+                )
+                raise
+            _offline_retry_done[model_label] = True
             logger.warning(...)
             ...
```

Note also that force_offline_if_cached is invoked from _load_model_sync via asyncio.to_thread in both pytorch_backend.py and mlx_backend.py, so concurrent access to this module-level dict is possible. Consider guarding mutations with a threading.Lock, or document that only one load runs at a time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/utils/hf_offline_patch.py` around lines 15-18: declare `_offline_retry_done` as a `Dict[str, bool]` and use it in `force_offline_if_cached` to record and check per-model attempts. Accept a `model_label` parameter in `force_offline_if_cached` (or derive it from existing args); check `_offline_retry_done.get(model_label)` under a module-level `threading.Lock` before attempting network retries, and if `True`, short-circuit to forced offline behavior. When a network attempt runs and fails, set `_offline_retry_done[model_label] = True` (under the same lock) so subsequent calls from `_load_model_sync` (via `asyncio.to_thread`) skip the retry loop. Also update the top-of-file comment to match the dict semantics (per-model boolean) and ensure concurrent access is protected with the `threading.Lock` as suggested.



@contextmanager
def force_offline_if_cached(is_cached: bool, model_label: str = ""):