
feat: make async engine the default execution path#592

Merged
andreatgretel merged 15 commits into main from andreatgretel/chore/async-default-prereqs
May 4, 2026

Conversation

Contributor

@andreatgretel andreatgretel commented Apr 30, 2026

Summary

Makes the async execution engine the default path and addresses the prerequisites that surface alongside the flip: Python 3.10 compatibility for the async stack, two hardcoded 300s timeouts that broke slow self-hosted endpoints, and root-cause surfacing for runs that produce 0 records. Closes #551.

Changes

Added

  • New Async Engine section in architecture-and-performance.md covering per-model timeouts, run outcomes (partial completion + DataDesignerEarlyShutdownError), and the transitional opt-out.
  • Thread-safety callout and MagicMock(spec=ModelFacade) mocking note in custom_columns.md.
  • ModelFacade.request_timeout property exposing the effective per-request HTTP timeout.
  • AsyncTaskScheduler.first_non_retryable_error — captures the first task error so the interface can surface the original cause when a run produces 0 records.
  • DataDesigner._resolve_client_concurrency_mode — picks the model-client mode that matches the engine the run will actually use, so an allow_resize sync-fallback doesn't end up with async-only clients.
  • Short Update section on the existing async-all-the-way-down dev note pointing at the new arch-and-perf section.

Changed

  • DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites. Set to "0" to fall back for one transitional release. allow_resize=True still triggers an automatic sync-engine fallback with a DeprecationWarning.
  • ThrottleManager.acquire_sync/async default to timeout=None (no queue-wait deadline). Per-request HTTP timeout is the only deadline that bounds actual work; queue waits scale with provider speed and AIMD. Existing tests that pass explicit timeout= floats (0.0, 0.5) are unaffected.
  • _AsyncBridgedModelFacade derives the sync→async bridge timeout via the new _compute_bridge_timeout helper as (1 + max_conversation_restarts) × (1 + max_correction_steps) × request_timeout × 1.5, floored at 60s. Honors per-call timeout= overrides. One knob — the per-model inference_parameters.timeout — drives both the HTTP deadline and the bridge deadline. No new user-facing config surface.
  • _run_from_scratch for non-FromScratchColumnGenerator generators dispatched as seeds (no upstream columns) now passes an rg_size-row buffer snapshot instead of an empty DataFrame. Fixes a routing footgun for plugin authors writing ColumnGeneratorFullColumn with no required_columns.
  • Interface threads first_non_retryable_error into DataDesignerGenerationError when 0 records are produced without early-shutdown, so deterministic seed/generator failures surface the original message instead of a wrapped parquet FileNotFoundError.
  • Interface client-mode resolution moved to DataDesigner._resolve_client_concurrency_mode, threaded into create_resource_provider via a new explicit client_concurrency_mode kwarg. Direct callers of the factory still get the env-var default.
  • Three stale "async engine is landing soon" callouts in arch-and-perf updated to reflect the flip.
  • README: replaced the stale 2026-03-24 LiteLLM supply-chain notice (litellm has been a non-dependency since v0.5.4) with a short async-default heads-up linking into the docs.

Fixed

  • Async stack now runs on Python 3.10. The single asyncio.TaskGroup use in async_concurrency.py:_run_all is replaced with asyncio.gather plus an explicit cancel-on-BaseException loop. _run_task already swallows its own exceptions and uses _shutdown_event for sibling cancellation, so the rewrite is behavior-equivalent. Removed the sys.version_info < (3, 11) runtime guard and the matching pytestmark skipif.

Attention Areas

Reviewers: Please pay special attention to the following:

  • async_concurrency.py — the gather-with-cancel rewrite preserves TaskGroup semantics for our use only because _run_task doesn't raise into the parent. The explicit cancel loop in the except BaseException: branch is defensive; it kicks in only if a future change causes children to raise into gather, mirroring what TaskGroup would have done.
  • throttle_manager.py — timeout=None semantics: when no deadline is passed, the loop waits indefinitely for permits. The HTTP timeout is the meaningful deadline.
  • custom.py — bridge timeout derivation honors kwargs.get("timeout") over facade.request_timeout, and max_conversation_restarts is in the formula. Sync customs that call model.generate(timeout=600) get the override.
  • data_designer.py — root-cause surfacing fires only when actual_num_records == 0 AND not early_shutdown. Partial-salvage runs that fail to load for unrelated reasons still surface the original load error. _resolve_client_concurrency_mode is the single point where the engine-vs-client mismatch is prevented.

Verification

  • make test passes on Python 3.10 and 3.12 (3117/3117 across all three packages).
  • make test-e2e passes locally (6 passed, 2 skipped).
  • Live smoke tests against integrate.api.nvidia.com (Build) and inference-api.nvidia.com (Inference) with no env var set: 13/13 across single-model, multi-model, mixed-provider, multi-row-group, preview, embeddings, native async customs, sync→async bridge, allow_resize fallback, and non-retryable error path.

Follow-ups (out of scope)

  • architecture-and-performance.md still describes the sync engine's column-at-a-time execution model in its main Execution Model section. Worth a separate refresh PR alongside the new Async Engine section added here.
  • Python 3.10 EOL is October 2026; consider filing a separate issue to drop 3.10 support after that, which would let us go back to asyncio.TaskGroup.
  • LoggingConfig.default() still swallows scheduler WARNs (called out as a follow-up in #585, "fix(async): pack of fixes for async engine under degraded providers"). Orthogonal to the flip; should ship soon to make the new degraded-provider WARN visible by default.

Description updated with AI

The async engine has been hardening as opt-in for several releases. Make it
the default and address the prerequisites flagged for the flip.

Default flip
- DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites
- Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out
- allow_resize=True still falls back to sync with a DeprecationWarning

Python 3.10 support
- Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with
  gather-with-explicit-cancel; semantics preserved because _run_task already
  swallows its own exceptions and uses _shutdown_event for sibling cancellation
- Remove the sys.version_info < (3, 11) runtime guard
- Remove the matching pytest skipif so the executor tests run on 3.10 too

Derived timeouts (replaces two hardcoded 300s constants)
- ThrottleManager.acquire_sync/async default to timeout=None (no deadline)
  instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds
  actual work, queue waits scale with provider speed and AIMD
- _AsyncBridgedModelFacade derives the sync->async bridge timeout from the
  model's inference_parameters.timeout and the call's max_correction_steps;
  one knob (per-model timeout) drives both deadlines, no new config surface
- Add ModelFacade.request_timeout property so the bridge can read the
  effective timeout the client is configured with

Root-cause surfacing
- AsyncTaskScheduler captures the first non-retryable error and exposes it
  via first_non_retryable_error
- Interface threads it through DataDesignerGenerationError when 0 records
  are produced without early-shutdown, so deterministic failures (e.g. bad
  seed sources) surface their original message instead of a wrapped
  FileNotFoundError on the parquet path

Tests
- New: throttle no-deadline default behavior (sync+async), parametrized
  derived bridge timeout, restored async_concurrency tests on 3.10
- Updated: test_dataset_builder.py uses an autouse fixture to pin its
  Mock-based tests to the sync engine they cover; existing bridge tests
  set facade.request_timeout for the new derivation

Docs
- Replace the stale LiteLLM security notice in README with a short
  async-default heads-up and link to the migration guide
- Add docs/migration-async-default.md covering per-model timeouts,
  custom-column thread safety, mocking model calls, run outcomes, and
  the opt-out
- Append a short Update section to the async-all-the-way-down dev note
@andreatgretel andreatgretel requested a review from a team as a code owner April 30, 2026 21:27
Contributor

github-actions Bot commented Apr 30, 2026

Docs preview: https://2a16db98.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@github-actions
Contributor

Review: PR #592 — feat: make async engine the default execution path

Summary

Flips DATA_DESIGNER_ASYNC_ENGINE default from "0" to "1" at both consumption
sites, and resolves the three prerequisites the description calls out for the
flip:

  1. Python 3.10 compatibility — replaces the single asyncio.TaskGroup
    (3.11+) in _run_all with asyncio.gather + an explicit cancel-on-raise
    block, and drops the sys.version_info < (3, 11) runtime guard and matching
    pytestmark.skipif.
  2. Timeout tuning for slow endpoints — removes the two hardcoded 300s
    deadlines (ThrottleManager.acquire_sync/async, _AsyncBridgedModelFacade)
    that bit slow self-hosted endpoints. Throttle waits now default to
    timeout=None (HTTP layer is the only deadline that bounds real work); the
    sync→async bridge timeout is now derived per-call as
    max(60, (1+max_correction_steps) × request_timeout × 1.5).
  3. Root-cause surfacing for 0-record runs — async scheduler captures the
    first non-retryable task error; the interface threads it into
    DataDesignerGenerationError when actual_num_records == 0 and no early
    shutdown fired, replacing the previous wrapped FileNotFoundError.

Also adds docs/migration-async-default.md, a README heads-up replacing the
stale LiteLLM supply-chain notice, and a devnote update pointing at the
migration guide.

Findings

Correctness

  • async_concurrency.py gather rewrite (lines 183-196) is sound. _run_task
    already catches Exception into error_trap (see
    async_concurrency.py:217-219), so children don't raise into gather under
    normal operation. The except BaseException block is only exercised on
    cancellation propagation, and it mirrors TaskGroup's "cancel siblings, await
    cleanup, re-raise" discipline. Comment on line 184 calls this out clearly.
    The second await asyncio.gather(*tasks, return_exceptions=True) correctly
    awaits cleanup after cancellation.

  • Bridge timeout derivation (custom.py:73-79) looks correct:
    int(kwargs.get("max_correction_steps", 0) or 0) defends against a caller
    passing max_correction_steps=None. Worth confirming no call site passes a
    negative value — a negative would make (1 + correction_steps) * ... fall
    below the floor, which the max(...) then catches; safe by construction.

  • first_non_retryable_error plumbing: captured in
    async_scheduler.py:_execute_task_inner_impl only on the first non-retryable
    branch (if self._first_non_retryable_error is None), reset in
    _reset_run_state, and propagated in both the preview and full-run code
    paths in dataset_builder.py. Complete and symmetric.

  • Throttle timeout=None semantics: correctly guarded in both acquire_sync
    and acquire_async. The deadline = ... if timeout is not None else None
    pattern plus the in-loop if deadline is not None: check is consistent
    between the two methods. The unit tests cover both the new default-None
    waits-for-release branch and preserve the explicit-float deadline branch.

Code quality / consistency

  • Duplication in interface/data_designer.py: the root-cause surfacing
    block

    root_cause = builder.first_non_retryable_error
    if root_cause is not None and builder.actual_num_records == 0:
        raise DataDesignerGenerationError(f"🛑 {root_cause}") from root_cause

    is repeated three times (lines 255-257, 276-278, 350-352). A tiny helper
    (_raise_root_cause_if_any(builder)) would remove the repetition. Minor —
    three inline copies are readable, but the next addition of a similar guard
    will push this toward a helper anyway.

  • Root-cause message is bare str(exc): f"🛑 {root_cause}" stringifies
    the exception. For some exception types (e.g. bare RuntimeError("") or
    library exceptions that embed context in attrs rather than __str__), the
    user sees just the emoji. Consider
    f"🛑 {type(root_cause).__name__}: {root_cause}" to always surface the
    exception class — keeps the root cause self-describing.

  • Magic number 60.0 in facade.request_timeout (facade.py:161): "Mirrors
    the fallback in clients/factory.py" — factory uses
    float(raw if raw is not None else 60). The numbers match today, but the
    duplication is load-bearing: drift would mean the bridge computes a timeout
    that doesn't match the HTTP deadline. A shared constant (e.g.
    DEFAULT_INFERENCE_TIMEOUT_S = 60.0 in one of the models modules) would be
    cleaner and self-enforcing.

  • Unbounded throttle waits trade: timeout=None by default removes a
    safety timer. If a permit accountant bug ever leaks permits, the former
    behavior raised TimeoutError after 300s; the new behavior hangs forever.
    The PR's rationale (per-request HTTP timeout plus AIMD already bound real
    work) is sound for the intended failure mode (slow-but-healthy endpoints),
    but this is worth noting — if a future leak regression surfaces, it will
    manifest as a hang rather than a loud error. Not blocking; the trade is
    the right one for the symptoms reported in #551 ("throttle acquire timeout
    is hardcoded at 300s, causes early shutdown on slow inference endpoints").

  • test_dataset_builder.py autouse fixture: pinning every test in the
    file to DATA_DESIGNER_ASYNC_ENGINE=False is pragmatic (the Mock-based
    resource providers don't satisfy the async scheduler's contracts), and the
    docstring explains both the why and where coverage lives for the other
    path. Good.

Tests

  • Parametrized test_async_bridge_timeout_derives_from_request_timeout
    covers all three branches of the max/formula (above-floor default, above-
    floor with corrections, below-floor clamped). Uses
    patch.object(concurrent.futures.Future, "result", ...) to capture the
    timeout — intrusive but scoped to the with block.
  • Both sync and async counterparts of the new "no-deadline waits for release"
    test, with tight 0.04 < elapsed < 2.0 bound on the sync version.
  • Existing _AsyncBridgedModelFacade tests updated: facade.request_timeout
    now set explicitly (e.g. test_custom.py:543, :589), and the floor-patch
    test migrates from SYNC_BRIDGE_TIMEOUT to _BRIDGE_TIMEOUT_FLOOR_S.
  • StubScheduler in the builder test file correctly extends with
    first_non_retryable_error = None.
  • One gap worth noting: the PR description says "E2E plugin tests not run
    locally (Mock-only flag)." That's a consistent caveat — the live-smoke
    matrix in the description compensates.

Docs

  • migration-async-default.md is tight and practical (timeouts, thread-safety,
    mocking, run outcomes, opt-out). The mkdocs.yml nav entry places it under
    Getting Started — reasonable for its transitional prominence. Worth revisiting
    once the opt-out is removed.
  • README swap (LiteLLM notice → async heads-up) is accurate — the LiteLLM
    notice had been stale for multiple releases. The async heads-up correctly
    points at inference_parameters.timeout as the knob that matters and keeps
    the opt-out escape hatch visible.
  • Follow-ups list is honest: the architecture-and-performance.md refresh and
    LoggingConfig default WARN visibility (the #585 thread, "fix(async): pack
    of fixes for async engine under degraded providers") are both orthogonal
    and appropriate to defer.

Security / Risk

  • No secrets, no new network endpoints, no new dependencies. Default flip is
    the behavior change with blast radius — but the one-release opt-out is in
    place and documented, and the Python 3.10 fix means the async path is
    actually available to everyone (the previous guard would have raised on 3.10
    even with the opt-in set).
  • SYNC_BRIDGE_TIMEOUT = 300 still lives in column_generators/generators/base.py
    used by _run_coroutine_sync. That path is separate from the async bridge
    touched here — it's fine, just a note that the name is now overloaded
    (the old constant still exists for the unrelated sync-coroutine helper).

Verdict

Looks good to merge. The scope is cohesive (flip + the three pre-flip
prerequisites), the risky change is well-reasoned and well-commented, and
test coverage for new behavior is solid. Suggested nits:

  1. Extract the repeated root-cause surfacing block into a helper
    (interface/data_designer.py, 3 sites).
  2. Include the exception type name in the root-cause error message so
    typeless exceptions still surface something useful.
  3. Hoist the 60.0 / 60 fallback into a named constant shared between
    facade.request_timeout and clients/factory.py so the two can't drift.

None are blocking.

Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR promotes the async engine to default (DATA_DESIGNER_ASYNC_ENGINE defaults to "1"), and clears the four prerequisites that blocked the flip: Python 3.10 compatibility via asyncio.gather + cancel loop, per-model bridge timeouts derived from inference_parameters.timeout, indefinite throttle-queue waits (no 300 s deadline), and root-cause surfacing when 0 records are produced. The opt-out (DATA_DESIGNER_ASYNC_ENGINE=0) emits a DeprecationWarning; allow_resize=True columns auto-fall-back to sync clients via the new _resolve_client_concurrency_mode.

Confidence Score: 5/5

Safe to merge — no logic errors or correctness bugs found across all changed files.

All changes are well-reasoned and internally consistent. The TaskGroup→gather rewrite is semantically correct because _run_task catches Exception (not BaseException), preserving CancelledError propagation. The timeout=None change is intentional and the HTTP deadline still bounds actual work. The bridge timeout derivation, root-cause surfacing, and allow_resize client-mode alignment all have targeted test coverage. The full test suite (3117/3117) and e2e smoke tests passed.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/async_concurrency.py Replaces asyncio.TaskGroup with gather+cancel for Python 3.10 compat; semantics preserved because _run_task catches Exception (not BaseException), so CancelledError still propagates correctly through the except-BaseException cancel path
packages/data-designer-engine/src/data_designer/engine/models/clients/throttle_manager.py Removes 300 s hard deadline on acquire_sync/acquire_async; timeout=None now waits indefinitely. Intentional design: per-request HTTP timeout is the only live deadline; test coverage added for both sync and async no-deadline paths
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py Bridge timeout now derived via _compute_bridge_timeout from facade.request_timeout * attempts * 1.5, floored at 60 s; honors per-call timeout= override; removes hardcoded 300 s SYNC_BRIDGE_TIMEOUT
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py Adds first_non_retryable_error capture (first non-retryable exception only); fixes non-FromScratch seed dispatch to pass rg_size-row buffer snapshot instead of empty DataFrame
packages/data-designer/src/data_designer/interface/data_designer.py Adds _resolve_client_concurrency_mode to align client mode with actual engine path (allow_resize forces SYNC); threads first_non_retryable_error into DataDesignerGenerationError when 0 records produced without early_shutdown
packages/data-designer-engine/src/data_designer/engine/models/facade.py New request_timeout property mirrors 60 s factory.py fallback; used by bridge timeout derivation
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Defaults DATA_DESIGNER_ASYNC_ENGINE to "1"; removes Python 3.11 version guard; threads first_non_retryable_error from scheduler into builder; resets state on each run
packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py Adds client_concurrency_mode kwarg; defaults env-var check from "0" to "1" for direct callers; interface passes explicit value computed by _resolve_client_concurrency_mode
tests_e2e/src/data_designer_e2e_tests/plugins/regex_filter/impl.py Moves regex filter from process_before_batch to process_after_generation to comply with async engine's row-count invariance requirement on pre/post-batch stages

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DataDesigner.create / preview] --> B[_resolve_client_concurrency_mode]
    B --> C{DATA_DESIGNER_ASYNC_ENGINE?}
    C -- "=0 env var" --> D[ClientConcurrencyMode.SYNC\n+ DeprecationWarning]
    C -- "=1 default" --> E{any allow_resize columns?}
    E -- yes --> F[ClientConcurrencyMode.SYNC\n builder-layer warning]
    E -- no --> G[ClientConcurrencyMode.ASYNC]
    D & F & G --> H[create_resource_provider\nexplicit client_concurrency_mode]
    H --> I[DatasetBuilder.build]
    I --> J{DATA_DESIGNER_ASYNC_ENGINE?}
    J -- sync --> K[Sync engine\ncolumn-at-a-time]
    J -- async default --> L[AsyncTaskScheduler]
    L --> M[asyncio.gather + cancel loop\nPython 3.10 compat]
    L --> Q[_handle_task_failure]
    Q --> R{retryable?}
    R -- yes --> S[defer/retry]
    R -- no --> T[capture first_non_retryable_error\ndrop row]
    I --> U{0 records produced?}
    U -- early_shutdown=True --> V[DataDesignerEarlyShutdownError]
    U -- root_cause captured --> W[DataDesignerGenerationError\nwith original cause]
    U -- no cause --> X[generic DataDesignerGenerationError]

Reviews (12): Last reviewed commit: "docs: breadcrumb explaining why SYNC_BRI..."

The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result``
to capture the timeout the bridge passed in. That reaches into stdlib internals
(DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument
on the parametrize was missing.

Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test
now calls the helper directly with no mocking, and the parametrize gets readable ids.
Behavior is unchanged.
The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two
of them were written against sync-engine semantics that the async engine
restricts:

- DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no
  required_columns. The async engine routes ``no-upstream`` columns
  through the from-scratch path, which passes an empty DataFrame to
  generators that aren't FromScratchColumnGenerator subclasses. The
  generator then produces 0 rows and the scheduler raises
  ``update_batch received 0 values``. Switching the plugin to
  FromScratchColumnGenerator with generate_from_scratch(num_records)
  matches what the plugin actually does (produces a constant column
  without input) and works on both engines.

- RegexFilterProcessor implemented process_before_batch with row-count
  changes. The async engine enforces row-count invariance in pre- and
  post-batch processor stages by design. Moving the filter to
  process_after_generation preserves the plugin's purpose (regex-based
  row filtering) at a stage that supports row-count changes on both
  engines. Test assertions check the final dataset, so the stage shift
  is transparent.

Both changes are demo-plugin updates only; no production code change.
Three bugs and two test-quality concerns surfaced by an independent review of
the prior commits. Each was real and worth fixing in the flip PR.

Bug fixes
- Sync-fallback path was creating async-only model clients. The default flip
  meant ``client_concurrency_mode = ASYNC`` for every default run, but the
  ``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()``
  calls then hit ``SyncClientUnavailableError``. The resolution decision now
  lives at the DataDesigner interface level via
  ``_resolve_client_concurrency_mode``: it considers both the env var and the
  config (allow_resize forces sync clients) and is passed explicitly to
  ``create_resource_provider``. Direct callers of the factory still get the
  env-var default.

- Sync→async bridge timeout ignored the per-call ``timeout=`` override. A
  custom column calling ``model.generate(timeout=600)`` against a slow endpoint
  was being cancelled at the model-config default, not 600s. The bridge now
  prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``.

- Bridge timeout formula missed ``max_conversation_restarts``. One logical
  generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)``
  HTTP requests; the formula now multiplies both, matching the worst-case
  attempt budget.
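The resolution decision described in the first bug fix reduces to a small truth table. A sketch with an assumed free-function signature (the real method is DataDesigner._resolve_client_concurrency_mode and reads the env var and config itself):

```python
from enum import Enum


class ClientConcurrencyMode(Enum):
    SYNC = "sync"
    ASYNC = "async"


def resolve_client_concurrency_mode(
    async_engine_enabled: bool, any_allow_resize: bool
) -> ClientConcurrencyMode:
    """Clients must match the engine the run will actually use.

    Either the env-var opt-out or an allow_resize column forces the
    sync-engine fallback, so both force sync clients.
    """
    if not async_engine_enabled or any_allow_resize:
        return ClientConcurrencyMode.SYNC
    return ClientConcurrencyMode.ASYNC
```

Only the (env on, no allow_resize) cell yields ASYNC; the other three combinations must produce SYNC clients or sync `model.generate()` calls would hit `SyncClientUnavailableError`.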

Engine routing fix (also surfaced by failing e2e plugin tests)
- ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch
  generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn``
  with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now
  passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the
  sync engine's FULL_COLUMN contract.
- The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``)
  is reverted; the engine fix subsumes it. The processor-plugin fix
  (``process_after_generation`` for the regex filter) stays — pre-batch
  row-count change is intentionally rejected by the async engine.

Test improvements
- Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)``
  and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct
  from any finite default. Sync and async counterparts mirror.
- New regression tests for ``first_non_retryable_error`` surfacing covering
  both load-raises and load-returns-empty paths, asserting the original
  exception is chained via ``__cause__`` and that the typed
  ``DataDesignerEarlyShutdownError`` doesn't fire in this branch.
- New parametrized regression test for ``_resolve_client_concurrency_mode``
  covering all four (env × allow_resize) combinations.
- New parametrized test for the per-call ``timeout=`` override flowing into
  the bridge timeout calculation.
- Bridge formula tests extended with ``max_conversation_restarts`` cases.
…sync-default-prereqs

# Conflicts:
#	packages/data-designer/src/data_designer/interface/data_designer.py
Three parametrize cases were duplicating coverage already provided by
existing standalone tests:

- ``test_acquire_*_timeout_branches`` parametrized over ``(0.0, raises)``
  and ``(None, waits)``. The ``raises`` half duplicates
  ``test_acquire_*_raises_timeout_when_at_capacity``. Replaced with two
  focused ``..._default_no_deadline_waits_for_release`` tests covering
  only the no-deadline branch.

- ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four
  cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the
  env var alone forces it; the allow_resize input is moot. Dropped.

- ``test_async_bridge_honors_per_call_timeout`` had three cases. The
  "override below floor" case cross-products the per-call override flow
  with the floor-clamping behavior already covered by
  ``test_compute_bridge_timeout``. Dropped.

Net: -25 lines of test code with no loss of essential coverage.
The standalone ``Migrating to the async default`` page didn't fit the
existing docs style — present tense, behavior over comparisons, content
in the natural concept home. Folding it in:

- ``architecture-and-performance.md`` gets a new ``Async Engine`` section
  covering per-model timeouts, run outcomes (partial completion +
  ``DataDesignerEarlyShutdownError``), and the transitional opt-out.
  Three stale ``async engine is landing soon`` callouts updated to
  reflect the flip.
- ``custom_columns.md`` gets two short notes: a thread-safety callout
  near Generation Strategies, and a mocking-with-spec note in
  Development Testing.
- ``async-all-the-way-down.md`` Update section now points at the new
  arch-and-perf section.
- README heads-up links to the same anchor.
- ``migration-async-default.md`` removed; mkdocs.yml entry dropped.
Small targeted edits to make the user-facing concept docs consistent
with the post-flip state. No restructuring.

- ``architecture-and-performance.md``: the ``Execution Model`` callout
  now opens with two engines, links to the new ``Async Engine`` section,
  and frames the existing column-at-a-time description as sync-engine
  semantics. The ``Step 2: Process columns sequentially`` paragraph notes
  the async engine relaxes this. The ``Key Concepts`` table differentiates
  per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells``
  is the same on both.
- ``processors.md``: added a warning callout about the async engine's
  row-count invariance in pre- and post-batch stages, with the guidance
  to use ``process_after_generation()`` for row-filtering or expansion.
nabinchha previously approved these changes May 1, 2026
Contributor

@nabinchha nabinchha left a comment


Thanks for putting this together, @andreatgretel — this is a thoughtful, well-scoped flip with the prerequisites carefully addressed and the test coverage to back it up.

Summary

Makes the async engine the default execution path and lands the Python 3.10 compat, derived-timeout, and root-cause-surfacing prerequisites alongside it. Implementation matches the stated intent across all four "Attention Areas" — the bridge timeout derivation, the throttle no-deadline default, the _resolve_client_concurrency_mode alignment, and the first_non_retryable_error chaining all behave as documented.

Findings

Warnings — Worth addressing

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py:19 — SYNC_BRIDGE_TIMEOUT = 300 still hardcoded in _run_coroutine_sync

  • What: The PR replaces the SYNC_BRIDGE_TIMEOUT import in custom.py with the derived _compute_bridge_timeout, but the original constant is still defined in base.py and still used by the default ColumnGenerator.generate() fallback (_run_coroutine_sync). Any subclass that only implements agenerate() and is then called from sync context (e.g. user code outside the engine) hits the same 300s ceiling the PR explicitly set out to remove.
  • Why: The PR description calls out "two hardcoded 300s timeouts that broke slow self-hosted endpoints." The throttle one and the custom-column bridge one are addressed; this one is structurally identical and would bite the same slow-endpoint scenarios for users who call generator.generate(...) directly. Since async-first generators are about to become the norm, this code path will likely see more, not less, traffic.
  • Suggestion: Either route this bridge through the same derivation (if a request_timeout is reachable from the generator's resource provider) or, at minimum, hoist the floor up to match _BRIDGE_TIMEOUT_FLOOR_S and accept the 60s minimum here too. If neither fits in this PR, an explicit follow-up issue would be enough — happy to leave it for the next pass, but the PR description should probably acknowledge the third 300s hidden in base.py.

packages/data-designer-engine/tests/engine/models/clients/test_throttle_manager.py:524 — asyncio.create_task reference dropped

  • What: asyncio.create_task(release_after(0.05)) doesn't keep a reference to the resulting task. Python's asyncio docs note that the loop only holds weak references to tasks, so a GC cycle between scheduling and the inner await can collect the task before it runs.
  • Why: In practice the test almost certainly passes because the asyncio.sleep(0.05) keeps the task alive long enough for acquire_async to acquire the lock, but it's a known flake source on slower CI runners or under heavy load. The new sync counterpart correctly retains its Thread via the daemon thread reference.
  • Suggestion:
    release_task = asyncio.create_task(release_after(0.05))
    try:
        await asyncio.wait_for(
            tm.acquire_async(provider_name=PROVIDER, model_id=MODEL, domain=DOMAIN),
            timeout=2.0,
        )
    finally:
        await release_task

Suggestions — Take it or leave it

packages/data-designer/src/data_designer/interface/data_designer.py:267 — Error formatting may obscure exception type

  • What: raise DataDesignerGenerationError(f"🛑 {root_cause}") from root_cause calls __str__ on the exception, which drops the type name. A ValueError("invalid seed source: no rows after hydration") becomes "🛑 invalid seed source: no rows after hydration" — the chained __cause__ still has full context, but the surface message is less informative than it could be for users reading the top frame.
  • Why: This is the user-facing message we just went out of our way to surface. Including the exception type makes it clearer that the underlying failure came from outside the framework.
  • Suggestion: Consider f"🛑 {type(root_cause).__name__}: {root_cause}" for a small readability win. Take it or leave it — chaining via __cause__ is the contract anyway.
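To make the suggestion concrete, here is a minimal, self-contained sketch of the type-prefixed surfacing pattern. The class and helper names are hypothetical stand-ins; only the f-string format and the `from root_cause` chaining come from the review.

```python
class DataDesignerGenerationError(Exception):
    """Hypothetical stand-in for the framework's user-facing error type."""

def surface_root_cause(root_cause: BaseException) -> None:
    # Include the exception type so the top frame is self-describing,
    # and chain via `from` so __cause__ preserves the full context.
    raise DataDesignerGenerationError(
        f"🛑 {type(root_cause).__name__}: {root_cause}"
    ) from root_cause

try:
    surface_root_cause(ValueError("invalid seed source: no rows after hydration"))
except DataDesignerGenerationError as exc:
    assert str(exc).startswith("🛑 ValueError:")
    assert isinstance(exc.__cause__, ValueError)
```

The chained `__cause__` keeps the full traceback either way; the prefix only changes what a user sees in the top frame.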

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py:47-48 — Magic numbers in _compute_bridge_timeout

  • What: The 1.5x buffer factor is a literal in the function body. The formula is well-explained in the docstring, but the constant itself doesn't have a name.
  • Why: A reader scanning the function for the buffer multiplier has to read the docstring; named constants would make the formula self-documenting and easier to tune.
  • Suggestion: Optional —
    _BRIDGE_TIMEOUT_OVERHEAD_FACTOR: float = 1.5  # HTTP connect/teardown + serialization buffer
    
    def _compute_bridge_timeout(...):
        attempts = (1 + max_conversation_restarts) * (1 + max_correction_steps)
        return max(_BRIDGE_TIMEOUT_FLOOR_S, attempts * request_timeout * _BRIDGE_TIMEOUT_OVERHEAD_FACTOR)

packages/data-designer/src/data_designer/interface/data_designer.py:580 — Defensive getattr likely unnecessary

  • What: any(getattr(c, "allow_resize", False) for c in config_builder.get_column_configs()) uses getattr with a fallback, but every member of ColumnConfigT (which is what the builder accepts) inherits allow_resize: bool = False from SingleColumnConfig. The builder rejects anything that isn't in ColumnConfigT.__args__ at add_column time.
  • Why: Mirroring the engine's check style would make it obvious that this catches every user-reachable column config; the current form reads as if some configs might not have the field.
  • Suggestion: Optional — any(c.allow_resize for c in config_builder.get_column_configs()). If you want to keep the defensive form for symmetry with _resolve_async_compatibility, that's fine too.

packages/data-designer/src/data_designer/interface/data_designer.py:265-267 — Sentinel-driven branch is implicit

  • What: if root_cause is not None and builder.actual_num_records == 0: works correctly because the sync engine never sets _first_non_retryable_error and leaves actual_num_records = -1. The interface code reads as if it might also fire on the sync path, but it doesn't.
  • Why: A future contributor adding sync-engine error capture (e.g. for symmetry) might wire actual_num_records = 0 for the sync path and accidentally double-trigger this branch.
  • Suggestion: A one-line comment near each call site would make the async-only assumption explicit:
    # actual_num_records is only set on the async path; sync runs leave it at -1
    # so this branch never fires for the sync engine.

packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py:886-887 — First-error capture loses subsequent context

  • What: Only the first non-retryable error is captured; later ones are dropped (the warning log line is the only trace). For a run where the first failure is misleading (e.g. a single corrupt seed row that fails with KeyError) and the systemic failure (e.g. wrong provider auth) shows up later, the surfaced root cause may not be the most actionable one.
  • Why: "First wins" is a reasonable heuristic, especially for the deterministic-failure case the PR targets, but worth being explicit about the trade-off.
  • Suggestion: Take it or leave it — for a follow-up, you might track per-column counts of distinct error types and surface the most frequent. For now the docstring on the property already communicates "first" clearly enough.

What Looks Good

  • Test coverage matches the surface area. The parametrized _compute_bridge_timeout cases (with the corrections × restarts compound and small-clamped-to-floor ids) make the formula concrete; the regression tests for first_non_retryable_error cover both the load-raises and load-returns-empty paths and explicitly assert the typed early-shutdown error doesn't fire spuriously. The test_resolve_client_concurrency_mode_matches_engine_choice matrix is exactly the right shape.
  • The gather-with-cancel rewrite in async_concurrency.py is conservative and well-commented. The defensive cancel block calls out that _run_task swallows its own exceptions today, so the explicit cancel only matters under future change. That's the right framing.
  • Smart capture of root cause through the scheduler. The mechanism — capture the first non-retryable error in the scheduler, expose it via property, thread it through the interface only when the run produced 0 records — is a clean way to recover the context the sync engine got "for free" via raises. The triple guard on the interface (builder.first_non_retryable_error is not None and builder.actual_num_records == 0 plus the early-shutdown check first) correctly prevents partial-salvage runs from masking unrelated load failures.
  • Documentation lands in the right places. Folding the migration guidance into architecture-and-performance.md and processors.md (rather than a standalone migration page) matches the existing docs style. The thread-safety callout in custom_columns.md and the MagicMock(spec=ModelFacade) note are the kind of "you'll wish you knew this before debugging" footnotes that pay back fast.
  • Backward compatibility preserved at the factory boundary. create_resource_provider accepting an explicit client_concurrency_mode while keeping the env-var default for direct callers is the right shape — the interface owns the policy, the factory stays mechanical.

Verdict

Ship it (with nits) — the only Warning-level item (the lingering SYNC_BRIDGE_TIMEOUT in base.py) is arguably scope-adjacent and fine as a follow-up issue if you'd rather keep this PR focused. Everything else is suggestions.


This review was generated by an AI assistant.

Four targeted fixes from the review.

Worth-addressing (warning):
- ``test_acquire_async_default_no_deadline_waits_for_release`` was
  spawning the release task without holding a strong reference. The
  loop's weak-ref bookkeeping could GC it before the inner ``await``
  observes the release, producing a CI flake. Hold the task and
  ``await`` it in ``finally``.

Take-it-or-leave-it (applied):
- Root-cause error surfacing now includes the exception type name:
  ``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see
  ``ValueError: ...`` instead of just the message string. The
  ``__cause__`` chain is preserved either way.
- Drop the defensive ``getattr(c, "allow_resize", False)`` in
  ``_resolve_client_concurrency_mode`` — every member of
  ``ColumnConfigT`` inherits ``allow_resize: bool = False`` from
  ``SingleColumnConfig``.
- One-line comment near the root-cause surfacing branch noting that
  ``actual_num_records == 0`` is async-only (sync runs leave it at
  ``-1``), so the branch is async-only by construction.

Not addressed in this PR (filing as follow-ups):
- ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in
  ``column_generators/generators/base.py:_run_coroutine_sync``. That
  bridge has no model-facade context to derive a timeout from, so the
  fix is a structural refactor outside this PR's scope.
- First-error capture loses subsequent-error context. The "first wins"
  heuristic is documented; richer aggregation is a follow-up.
This was the third hardcoded 300s timeout (Nabin flagged it on PR #592).
The path is the generic sync→async bridge in ``ColumnGenerator.generate()``:
when a subclass overrides only ``agenerate()``, the sync entry point runs
the coroutine in a background thread.

Same philosophy we applied to the throttle queue wait elsewhere in the
PR: a defensive deadline on top of work that's already bounded by the
HTTP timeout doesn't add safety, it just produces spurious failures on
slow self-hosted endpoints. Drop the constant, the timeout exception
handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)``
becomes the simple cleanup.

Tests in ``test_async_generators.py`` exercise the happy path only and
don't depend on the timeout firing.
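The post-change bridge shape described above can be sketched as follows. This is a simplified illustration, not the actual _run_coroutine_sync from base.py; the helper name and structure are assumptions, and the real code carries additional bookkeeping:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_coroutine_sync(coro):
    """Run an async coroutine from sync code via a background thread.

    No defensive deadline: the work inside the coroutine is assumed to be
    bounded by the per-request HTTP timeout, so a wrapper timeout would
    only add spurious failures on slow endpoints.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        # asyncio.run in a fresh thread avoids clashing with any event
        # loop already running in the caller (e.g. Jupyter).
        return pool.submit(asyncio.run, coro).result()
    finally:
        pool.shutdown(wait=True)  # the simple cleanup

async def agenerate():
    await asyncio.sleep(0)
    return "row"

assert run_coroutine_sync(agenerate()) == "row"
```

With the timeout handling gone, `pool.shutdown(wait=True)` is the only teardown the bridge needs.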
@andreatgretel
Contributor Author

Thanks for the careful review. Pushed bf36d256 addressing the nits.

Applied:

  • Test flake fix - held the release_task ref and awaited it in finally, exactly as you suggested. Good catch on the weak-ref behavior, that's the kind of thing that only bites on slow CI.
  • Exception type now in the surfaced error message: f"🛑 {type(root_cause).__name__}: {root_cause}". Both create and preview paths.
  • Dropped the defensive getattr in _resolve_client_concurrency_mode. You're right that every ColumnConfigT member inherits allow_resize.
  • One-line sentinel comment near the surfacing branch noting it's async-only by construction.

Deferred to a follow-up PR:

  • SYNC_BRIDGE_TIMEOUT = 300 in base.py:_run_coroutine_sync. I tried just dropping it (consistent with how we handled the throttle queue wait), but on reflection it's a different risk profile - that bridge has no model-facade context, so removing the deadline would leave Jupyter/asyncio-internal hangs with no escape hatch. The clean fix is to give _run_coroutine_sync an optional timeout param and let calling generators pass one when they have model context. Worth its own focused PR rather than bolting it onto this one.
  • First-error capture losing later context. Agreed it's a "first wins" heuristic; happy to track in a follow-up if it shows up in practice.

Skipped (pure style):

  • _BRIDGE_TIMEOUT_OVERHEAD_FACTOR = 1.5 named constant. The docstring already explains the formula, leaving as-is for now, wdyt?


### Opting out

The legacy sync engine is available as a transitional opt-out:
Contributor


Should we add "deprecation warning" language here?

Contributor Author


Did both. The doc now uses a !!! warning "Deprecated" admonition naming the env var as a deprecated escape hatch (43f71130), and the runtime emits a DeprecationWarning plus logger.warning when DATA_DESIGNER_ASYNC_ENGINE=0 is set, mirroring the existing allow_resize shape (4dfa200e). Both signals now match the precedent in _resolve_async_compatibility.

Nabin asked whether the docs should adopt explicit "deprecation" language
on the opt-out path. Doing both:

- Doc: ``architecture-and-performance.md``'s ``Opting out`` section now
  uses an ``!!! warning "Deprecated"`` admonition that names the env var
  as a deprecated escape hatch and notes the run-time warning.
- Code: ``DataDesigner._resolve_client_concurrency_mode`` emits a
  ``DeprecationWarning`` when ``DATA_DESIGNER_ASYNC_ENGINE=0`` is detected.
  Same precedent as the existing ``allow_resize=True`` warning. Auto-fallback
  via ``allow_resize`` does not double-warn here; the builder layer emits
  its own warning later.
- Test: parametrized regression now asserts ``pytest.warns(DeprecationWarning)``
  on the opt-out branch and treats any warning on the async-on branches as
  a failure (``simplefilter("error")`` inside the ``catch_warnings`` block).
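The warning-emission side of this can be sketched as a small helper. The function name is hypothetical (the real check lives in the interface layer), and the message text is paraphrased from the docs; only the env var and the DeprecationWarning category come from the PR:

```python
import os
import warnings

def check_async_engine_opt_out() -> bool:
    """Return True if the async engine is active; warn on explicit opt-out."""
    if os.environ.get("DATA_DESIGNER_ASYNC_ENGINE") == "0":
        warnings.warn(
            "DATA_DESIGNER_ASYNC_ENGINE=0 forces the legacy sync engine; "
            "this transitional switch will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
        return False
    return True

# Opt-out branch: the warning must fire.
os.environ["DATA_DESIGNER_ASYNC_ENGINE"] = "0"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert check_async_engine_opt_out() is False
assert any(issubclass(w.category, DeprecationWarning) for w in caught)

# Async-on branch: any warning is a failure.
os.environ["DATA_DESIGNER_ASYNC_ENGINE"] = "1"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert check_async_engine_opt_out() is True
assert not caught
```

In a pytest regression the same two branches would be asserted with `pytest.warns(DeprecationWarning)` and `simplefilter("error")` respectively, as the commit message describes.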
nabinchha
nabinchha previously approved these changes May 4, 2026
Contributor

@nabinchha nabinchha left a comment


Thanks for putting this together, @andreatgretel — I re-walked the PR with the latest commits in hand, and the prior round of feedback has been thoroughly addressed (with the trade-offs you couldn't apply spelled out cleanly in bf36d256's body). At this point the change reads as a careful, well-scoped flip: async by default, prerequisite gaps filled in, and follow-ups explicitly punted with rationale.

Summary

Flips DATA_DESIGNER_ASYNC_ENGINE to "1" at both consumption sites and lands the prerequisites that surface alongside the flip — Python 3.10 compatibility for the async stack (gather-with-cancel in place of TaskGroup), removal of two hardcoded 300s timeouts that broke slow self-hosted endpoints (sync bridge and throttle queue-wait), and root-cause surfacing for runs that produce 0 records. Implementation matches the stated intent.

Findings

Suggestions — Take it or leave it

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:72 — Silent opt-out leaves DATA_DESIGNER_ASYNC_ENGINE=0 users in the dark

  • What: The README and architecture-and-performance.md both call the env var a "transitional opt-out" that "will be removed in a future release", but at runtime there's no DeprecationWarning emitted when the user sets DATA_DESIGNER_ASYNC_ENGINE=0. The only deprecation signal in the engine flip is the allow_resize=True auto-fallback path in _resolve_async_compatibility (dataset_builder.py:330). Users who explicitly opt out via the env var get the legacy behavior silently and won't have a programmatic signal to migrate before the switch is removed.
  • Why: This was the spirit of the inline comment I left on architecture-and-performance.md:304 last round. The docs now lean harder on "transitional"/"will be removed" wording, which makes the runtime silence more noticeable.
  • Suggestion: Emit a one-time DeprecationWarning (and a logger.warning, mirroring the allow_resize shape) at the env-var read site when DATA_DESIGNER_ASYNC_ENGINE is explicitly "0". Something like:
    if os.environ.get("DATA_DESIGNER_ASYNC_ENGINE") == "0":
        msg = (
            "DATA_DESIGNER_ASYNC_ENGINE=0 forces the legacy sync engine. "
            "This switch is transitional and will be removed in a future release; "
            "please run on the async engine and report any regressions."
        )
        logger.warning(f"⚠️ {msg}")
        warnings.warn(msg, DeprecationWarning, stacklevel=2)
    DATA_DESIGNER_ASYNC_ENGINE = os.environ.get("DATA_DESIGNER_ASYNC_ENGINE", "1") == "1"
    The same logic could live once in a small helper consumed by both dataset_builder.py and resource_provider.py. Happy to defer to a follow-up issue if you'd rather keep this PR tight.

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py:19 — SYNC_BRIDGE_TIMEOUT = 300 survives without a follow-up breadcrumb

  • What: The constant and the timeout-handling block in _run_coroutine_sync are unchanged on this branch (7a0b77d4 dropped them, cf0826d1 reverted with no body). The rationale for keeping it lives only in bf36d256's commit message (no facade context to derive a timeout). That's a fine call for this PR — the path is the generic ColumnGenerator.generate() fallback, and there's no model-facade reachable from there to derive a bounded deadline.

  • Why: A future maintainer reading base.py will see the same 300 s pattern that the rest of the PR removed and not know it was deliberately preserved. The revert commit body is also empty, so git blame won't lead them to the rationale either.

  • Suggestion: Two small breadcrumbs would close the loop:

    • A short comment above SYNC_BRIDGE_TIMEOUT = 300 (or above _run_coroutine_sync) noting "Preserved deliberately: this is the generic sync→async bridge for ColumnGenerator.generate() overrides; it has no ModelFacade context to derive a per-call deadline. Tracked for structural follow-up in #."
    • File the follow-up issue (or link an existing one) so the breadcrumb has a destination.

    Same nit applies to the empty revert message in cf0826d1 — even just "Reverting; preserved as-is for follow-up — see bf36d25 for context" in the body would help future archaeologists.

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py:30,47 — Bridge-timeout magic numbers could be explained inline

  • What: _BRIDGE_TIMEOUT_FLOOR_S = 60.0 has a docstring; the 1.5 multiplier in attempts * request_timeout * 1.5 does not. The PR description explains it ("absorbs HTTP connect/teardown and serialization overhead") and the function docstring partially mirrors it.
  • Why: Future tuners (or anyone debugging a flaky bridge timeout) will reach for the multiplier first. A named constant or an inline comment makes the intent self-documenting.
  • Suggestion: Extract the buffer to a named constant alongside the floor:
    _BRIDGE_TIMEOUT_FLOOR_S: float = 60.0
    _BRIDGE_TIMEOUT_OVERHEAD_FACTOR: float = 1.5  # HTTP connect/teardown + serialization buffer
    Then attempts * request_timeout * _BRIDGE_TIMEOUT_OVERHEAD_FACTOR. Tiny, but makes intent obvious and keeps test fixtures explicit if the buffer ever needs adjusting.

What Looks Good

  • bf36d256 is exemplary "address review feedback" hygiene. Each item is labelled by severity, applied changes are described, and the items not applied are filed as follow-ups with concrete rationale ("no facade context to derive a timeout from", "richer aggregation is a follow-up"). That's exactly the level of context that makes a re-review fast.
  • Root-cause surfacing is well-thought-through. The async-only-by-construction comment at data_designer.py:265-267 makes the load/empty branches readable, and f"🛑 {type(root_cause).__name__}: {root_cause}" plus from root_cause gives users both a useful one-liner and the full chain for debugging. The triple-site pattern (create() load failure, create() empty dataset, preview() empty dataset) is consistent.
  • Test fix at test_throttle_manager.py:524-534 reads cleanly. Strong reference + await … finally defends against the asyncio loop's weak-ref bookkeeping, and the inline comment explains the why for anyone reading it later — the same defense maintainers should reach for elsewhere.
  • Clean Python 3.10 fallback. The gather-with-cancel pattern in async_concurrency.py:_run_all is genuinely behavior-equivalent to TaskGroup for this use (since _run_task swallows its own exceptions), and the comment block explicitly says so. Removing the runtime guard and the matching pytestmark.skipif keeps this minimal.
  • Engine/client-mode alignment via _resolve_client_concurrency_mode is the right shape. Single point of truth, doesn't leak the env var into the interface, and correctly accounts for allow_resize forcing sync. Dropping the defensive getattr was the right call once we confirmed every ColumnConfigT member has the field.

Verdict

Ship it (with nits) — The Suggestions above are all optional polish; nothing is blocking. The deprecation-warning gap is the most user-facing of them and would close a small loop between docs and runtime, but it's reasonable to file as a follow-up if you'd rather not expand the scope of this PR. Linter and formatter both pass on the changed files (ruff check clean, ruff format --check clean).


This review was generated by an AI assistant.

Parity fix from Nabin's re-review of PR #592. The ``allow_resize=True``
auto-fallback path in ``_resolve_async_compatibility`` emits both a
``logger.warning("⚠️ ...")`` and a ``DeprecationWarning``. The new
``DATA_DESIGNER_ASYNC_ENGINE=0`` opt-out path was only emitting the
``DeprecationWarning``, leaving users who run with default warning
filters silenced and inconsistent with the established precedent.

Match the pattern: same message body, both signals, same stacklevel.
Nabin's re-review pointed out that ``base.py`` is the lone place where
the 300s pattern survives, while ``custom.py`` and ``throttle_manager.py``
both retired theirs. Without a comment, a future reader (or a lint sweep)
could mistake this for an oversight and "consistency-fix" it the wrong way.

Add a short note at the constant naming the two retired siblings, the
reason this one stayed (no ``ModelFacade`` context to derive from), and
the fact that it's tracked for a structural follow-up.
@andreatgretel
Contributor Author

Thanks for the re-review, Nabin. Pushed two more small commits addressing the first two suggestions:

  • 4dfa200e — logger.warning next to the DeprecationWarning in _resolve_client_concurrency_mode. Same shape as the allow_resize precedent in _resolve_async_compatibility. You're right that it's the more user-facing of the three; it was free.
  • b57d35dd — breadcrumb comment above SYNC_BRIDGE_TIMEOUT = 300 in base.py naming the two siblings that were retired, why this one stayed (no ModelFacade context to derive from), and that it's tracked for a structural follow-up.

Skipping the _BRIDGE_TIMEOUT_OVERHEAD_FACTOR = 1.5 extraction for now — pure style and the docstring already covers it. Happy to take it in a follow-up cleanup if you'd rather see it land sooner.

Re-requesting your review since the push will dismiss the prior approval. Sorry for the dance.

@andreatgretel andreatgretel requested a review from nabinchha May 4, 2026 19:08
@andreatgretel andreatgretel merged commit 61cdeef into main May 4, 2026
50 checks passed


Development

Successfully merging this pull request may close these issues.

bug: throttle acquire timeout is hardcoded at 300s, causes early shutdown on slow inference endpoints
