
feat: integrate Bet-Optimal Drafting (BOD) for dynamic block-size optimization #27

Open

0xClandestine wants to merge 1 commit into bstnxbt:main from 0xClandestine:feat/bet-optimal-drafting

Conversation

@0xClandestine
Contributor

Summary

Integrates Bet-Optimal Drafting (BOD) — a unified bet-size optimizer for chain (vanilla DFlash) and tree (DDTree) speculative decoding — throughout the codebase. The draft model is the gambler; the target model is the house. BOD finds the optimal bet size (γ for chain / B for tree) to maximize throughput using a unified mathematical framework.

Core Algorithm

Both modes reduce to the same optimization problem:

T(x) = (E[tokens | x] + 1) / (c_fixed + c_scale · x)

where x is the bet size, E[tokens | x] is a concave, increasing function of x, and the denominator is linear in x. The optimal bet is the x that maximizes this ratio.
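As an illustration, the ratio can be maximized by a plain integer sweep. The geometric acceptance model and the cost constants below are assumptions for the sketch, not the PR's actual estimator:

```python
def expected_tokens(x, alpha=0.7):
    # Illustrative concave acceptance model (assumption): with per-token
    # acceptance rate alpha, a chain bet of x drafted tokens yields
    # E[tokens | x] = (alpha - alpha**(x + 1)) / (1 - alpha),
    # which is increasing and concave in x.
    return (alpha - alpha ** (x + 1)) / (1 - alpha)

def throughput(x, c_fixed=1.0, c_scale=0.2):
    # T(x) = (E[tokens | x] + 1) / (c_fixed + c_scale * x)
    return (expected_tokens(x) + 1) / (c_fixed + c_scale * x)

# Sweep integer bet sizes and keep the maximizer.
best = max(range(1, 65), key=throughput)
print(best)  # 3 under these illustrative constants
```

Because the numerator saturates while the denominator keeps growing, the sweep finds an interior optimum rather than always betting the maximum; the tiered solvers below exist to find that optimum more cheaply than a full sweep.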

Chain mode (γ optimization) has three tiers:

  1. Verify-dominated — max γ immediately (zero math)
  2. ρ = 0 — closed-form Lambert W (one log + one Lambert W call)
  3. ρ > 0 — fused Metal kernel sweep (GPU dispatch)

Tree mode (B optimization) has three tiers:

  1. Draft-dominated — max B immediately
  2. Enough observations — Lambert W on log-acceptance model
  3. Cold start — Metal kernel sweep
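Tier 2 in both modes calls for a Lambert W evaluation. A minimal Newton-iteration sketch of the principal branch for positive arguments (the PR's actual solver is not shown here, and may use a library routine instead):

```python
import math

def lambert_w(z, tol=1e-12):
    # Principal branch of the Lambert W function for z > 0:
    # the w satisfying w * exp(w) == z, found by Newton's method.
    # log(1 + z) is an adequate initial guess on the positive axis.
    w = math.log(1.0 + z)
    for _ in range(50):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

print(lambert_w(1.0))  # ~0.567143 (the omega constant)
```

This is why tier 2 is cheap: one logarithm plus a handful of scalar Newton steps, versus a full GPU sweep in tier 3.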

Integration Points

| File | Change |
| --- | --- |
| `dflash_mlx/bet_optimal_drafting.py` | New file — `BODConfig`, `BODController`, `BODObservation`, `bod_optimal_bet()` convenience API, Metal kernels, analytical solvers |
| `runtime_profiles.py` | 6 new `bod_*` fields on `RuntimeProfile` / `EffectiveRuntimeConfig` |
| `runtime_context.py` | `runtime_config_from_profile()` and `build_offline_runtime_context()` thread BOD params |
| `runtime.py` | `stream_dflash_generate()` auto-creates `BODController` when enabled |
| `engine/spec_epoch.py` | Accepts `bod_controller`; queries per-cycle for block size; records observations after each cycle |
| `server/config.py` | 6 CLI flags: `--bod-enabled`, `--bod-mode`, `--bod-min-bet`, `--bod-max-bet`, `--bod-default-scale-cost`, `--bod-default-fixed-cost` |
| `generate.py` | Same 6 CLI flags + kwargs on `run_generate()` |
| `__init__.py` | Exports `BODConfig`, `BODController`, `BODObservation` |
| `doctor.py` | Registers BOD fields in config/CLI registries |
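The config surface can be pictured as a small dataclass mirroring the six CLI flags. This is a hypothetical sketch inferred from the flag names; the real `BODConfig` in `dflash_mlx/bet_optimal_drafting.py` may name or default its fields differently:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BODConfigSketch:
    # Hypothetical mirror of the six CLI flags; field names and
    # defaults are assumptions, not the PR's actual BODConfig.
    enabled: bool = False            # --bod-enabled (off => zero behavioral change)
    mode: str = "chain"              # --bod-mode: "chain" (gamma) or "tree" (B)
    min_bet: int = 1                 # --bod-min-bet
    max_bet: int = 64                # --bod-max-bet
    default_scale_cost: float = 1.0  # --bod-default-scale-cost (c_scale prior)
    default_fixed_cost: float = 1.0  # --bod-default-fixed-cost (c_fixed prior)

cfg = BODConfigSketch(enabled=True, mode="tree", min_bet=16, max_bet=256)
```

The default-cost fields matter only before any observations arrive; once the controller has real cycle timings, they are superseded.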

Usage

```shell
# Serve with BOD enabled (chain mode — default)
dflash --model Qwen3.5-27B --bod-enabled

# Tree mode with custom bet range
dflash --model Qwen3.5-27B --bod-enabled --bod-mode tree --bod-min-bet 16 --bod-max-bet 256

# Generate with custom cost estimates
dflash generate --model Qwen3.5-27B --prompt "Hello" --bod-enabled --bod-default-scale-cost 5.0

# Default (BOD disabled) — zero behavioral change
dflash --model Qwen3.5-27B
```

Testing

  • All 343 existing tests pass (3 pre-existing skipped)
  • Added BOD default kwargs to test_generate_cli.py expected dict
  • Verified: import chain, data model completeness, CLI parse/normalize, BOD controller lifecycle, dynamic bet adaptation through 20 simulation cycles
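The controller lifecycle the tests exercise (query a bet per cycle, record an observation afterward) can be sketched with a toy stand-in. The class name, update rule, and constants here are assumptions for illustration; the real `BODController` uses the tiered solvers described above, not this simple smoothed sweep:

```python
import random

class ToyBetController:
    # Illustrative stand-in for the suggest/observe lifecycle.
    def __init__(self, min_bet=1, max_bet=64, alpha_guess=0.5):
        self.min_bet, self.max_bet = min_bet, max_bet
        self.alpha = alpha_guess               # smoothed acceptance-rate estimate
        self.c_fixed, self.c_scale = 1.0, 0.2  # smoothed cost estimates

    def suggest(self):
        # Argmax of T(x) = (E[tokens | x] + 1) / (c_fixed + c_scale * x)
        # under a geometric acceptance model with rate self.alpha.
        def t(x):
            e = (self.alpha - self.alpha ** (x + 1)) / (1 - self.alpha)
            return (e + 1) / (self.c_fixed + self.c_scale * x)
        return max(range(self.min_bet, self.max_bet + 1), key=t)

    def observe(self, bet, accepted, draft_time, verify_time):
        # Exponential smoothing of acceptance rate and per-cycle costs.
        rate = min(max(accepted / bet, 0.01), 0.99)
        self.alpha = 0.8 * self.alpha + 0.2 * rate
        self.c_fixed = 0.8 * self.c_fixed + 0.2 * verify_time
        self.c_scale = 0.8 * self.c_scale + 0.2 * (draft_time / bet)

random.seed(0)
ctl = ToyBetController()
for _ in range(20):  # mirror the 20 simulation cycles mentioned above
    bet = ctl.suggest()
    accepted = sum(random.random() < 0.7 for _ in range(bet))
    ctl.observe(bet, accepted, draft_time=0.05 * bet, verify_time=1.0)
assert ctl.min_bet <= ctl.suggest() <= ctl.max_bet
```

As the cost and acceptance estimates converge, the suggested bet drifts away from its prior toward the observed optimum while staying clamped to the configured range.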

Add BOD as an opt-in dynamic block-size optimizer for speculative
decoding. The draft model is the gambler; the target model is the house.
BOD finds the optimal bet size (γ for chain / B for tree) to maximize
throughput using a unified mathematical framework.

Integration points:
- EffectiveRuntimeConfig / RuntimeProfile: 6 new bod_* fields (disabled
  by default, zero behavioral change).
- CLI (serve + generate): --bod-enabled, --bod-mode, --bod-min-bet,
  --bod-max-bet, --bod-default-scale-cost, --bod-default-fixed-cost.
- spec_epoch.py: accepts optional bod_controller; queries it per-cycle
  for dynamic block sizing and records observations (bet, accepted,
  cycle_time, draft_time, verify_time).
- runtime.py: auto-creates BODController when bod_enabled=True.
- __init__.py: exports BODConfig, BODController, BODObservation.
- doctor.py: registers BOD fields in config/CLI registries.

343 tests pass, 3 skipped (pre-existing).