5 changes: 5 additions & 0 deletions backend/utils/hf_offline_patch.py
@@ -12,6 +12,11 @@

logger = logging.getLogger(__name__)

# Module-level flag: if True, we've already tried with network and failed for
# this model. Prevents each individual config file from triggering its own
# 5-retry cycle.
_offline_retry_done = {}
Comment on lines +15 to +18

⚠️ Potential issue | 🔴 Critical

Fix is incomplete — _offline_retry_done is declared but never read or written.

The PR description states that this dict should be consulted to skip retries after one attempt per model, but no code in this file (or, based on the provided context, elsewhere) reads from or writes to _offline_retry_done. As it stands, this change is purely dead state and does not address the infinite retry loop described in issue #434. The except Exception block at lines 46–59 still just re-raises on offline errors — there's no gating on this dict, no per-model bookkeeping, and no short-circuit path.

Additionally, the comment describes the variable as a boolean ("if True, we've already tried…"), but it is declared as a dict. Please reconcile the intent (presumably Dict[str, bool] keyed by model_label) and wire it into the retry/exception flow.

A sketch of the likely intended wiring:

```diff
-_offline_retry_done = {}
+# Maps model_label -> True once an online retry has been attempted, so
+# subsequent per-file failures for the same model don't each trigger their
+# own retry cycle.
+_offline_retry_done: dict[str, bool] = {}
```

And inside force_offline_if_cached:

```diff
     try:
         yield
     except Exception as exc:
         if "offline" in str(exc).lower():
+            if _offline_retry_done.get(model_label):
+                logger.warning(
+                    "[offline-guard] Offline load failed for %s; retry already "
+                    "attempted previously, propagating.", model_label or "model",
+                )
+                raise
+            _offline_retry_done[model_label] = True
             logger.warning(...)
             ...
```

Note also that force_offline_if_cached is invoked from _load_model_sync via asyncio.to_thread in both pytorch_backend.py and mlx_backend.py, so concurrent access to this module-level dict is possible. Consider guarding mutations with a threading.Lock, or document that only one load runs at a time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/utils/hf_offline_patch.py` around lines 15-18: declare `_offline_retry_done` as a `Dict[str, bool]` and use it in `force_offline_if_cached` to record and check per-model attempts. Accept a `model_label` parameter in `force_offline_if_cached` (or derive it from existing args); check `_offline_retry_done.get(model_label)` under a module-level `threading.Lock` before attempting network retries, and if `True`, short-circuit to forced offline behavior. When a network attempt runs and fails, set `_offline_retry_done[model_label] = True` (under the same lock) so subsequent calls from `_load_model_sync` (via `asyncio.to_thread`) skip the retry loop. Also update the top-of-file comment to match the dict semantics (per-model boolean) and ensure concurrent access is protected with the `threading.Lock` as suggested.



@contextmanager
def force_offline_if_cached(is_cached: bool, model_label: str = ""):