Dev/tw3/box3d 5 #1

Open
nvtw wants to merge 661 commits into
dev/tw3/box3d_6 from
dev/tw3/box3d_5

Conversation

@nvtw
Owner

@nvtw nvtw commented Apr 27, 2026

Description

Checklist

  • New or existing tests cover these changes
  • The documentation is up to date with these changes
  • CHANGELOG.md has been updated (if user-facing change)

Test plan

Bug fix

Steps to reproduce:

Minimal reproduction:

import newton

# Code that demonstrates the bug

New feature / API change

import newton

# Code that demonstrates the new capability

StafaH and others added 30 commits May 4, 2026 22:57
Co-authored-by: Philipp Reist <preist@nvidia.com>
Signed-off-by: JC <jumyungc@nvidia.com>
…#2694)

Co-authored-by: Philipp Reist <66367163+preist-nvidia@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: JC <jumyungc@nvidia.com>
…-physics#2733)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
newton-physics#2732)

Co-authored-by: Philipp Reist <66367163+preist-nvidia@users.noreply.github.com>
…ysics#2735)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes in `SolverPhoenX.update_contacts()` that together let
`SensorContact` work with the phoenx solver:

1. Negate the assembled wrench. Phoenx's normal points shape0->shape1
   so `lam_n*n + lam_t*t` is the impulse on shape1, but Newton's
   `Contacts.force` stores the force on shape0 (matching MuJoCo's
   `_convert_mjw_contacts_to_newton_kernel`). Without the negate the
   sensor reported -m*g on a settled body.

2. Honor the body-pair-grouping sort permutation. With compound bodies
   the ContactContainer is keyed by `sorted_k` while
   `Contacts.rigid_contact_shape0/1/normal` stay in newton order;
   without the perm reroute, `force[k]` is misaligned with the topology
   arrays and the sensor attributes forces to the wrong contact slots.
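
The two fixes can be sketched in plain Python. This is an illustration only, not the Warp kernel: the function name `impulses_to_shape0_forces`, the argument layout, the single tangent direction, and the direction of the permutation (`sort_perm[sorted_k]` yields the Newton-order slot) are all assumptions made for the example.

```python
def impulses_to_shape0_forces(lam_n, normals, lam_t, tangents, idt, sort_perm=None):
    """Turn per-contact impulses (solver order) into forces on shape0 (Newton order)."""
    count = len(lam_n)
    forces = [(0.0, 0.0, 0.0)] * count
    for sorted_k in range(count):
        n = normals[sorted_k]
        t = tangents[sorted_k]
        # lam_n*n + lam_t*t is the impulse on shape1 (normal points shape0 -> shape1).
        on_shape1 = [lam_n[sorted_k] * n[i] + lam_t[sorted_k] * t[i] for i in range(3)]
        # Fix 1: negate to get the wrench on shape0; idt = 1/dt turns impulse into force.
        on_shape0 = tuple(-idt * c for c in on_shape1)
        # Fix 2: write into the Newton-order slot, not the sorted slot.
        k = sorted_k if sort_perm is None else sort_perm[sorted_k]
        forces[k] = on_shape0
    return forces


forces = impulses_to_shape0_forces(
    lam_n=[1.0, 2.0],
    normals=[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
    lam_t=[0.0, 0.0],
    tangents=[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    idt=2.0,
    sort_perm=[1, 0],  # hypothetical non-identity permutation
)
```

With the permutation applied, the impulse stored in sorted slot 0 ends up in Newton slot 1; dropping the reroute would leave it in slot 0, which is the misalignment described in item 2.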

Add CUDA + graph-capture-only tests in
`phoenx/tests/test_sensor_contact.py`:

- `TestContactImpulseToForceKernel` directly drives the readback
  kernel with a synthetic ContactContainer and a known non-identity
  sort_perm; the four cases pin sort_perm wiring, the has_perm=0
  identity path, t2 reconstruction with non-unit idt, and tail-slot
  count gating. This is what catches the sort_perm regression --
  integration tests can't, since real-world same-body-pair contacts
  give a sort_perm that's order-undefined within the group and the
  swapped values are physically symmetric.

- `TestSensorContactPhoenX` runs the full
  collide -> step -> update_contacts -> sensor.update pipeline inside
  a CUDA graph and asserts m*g, stack-weight propagation,
  per-counterpart force-matrix split with sum reconciliation, and
  normal/friction decomposition consistency, all at <=2% rel err.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eps)

Three new integration tests in `TestSensorContactPhoenX`:

- `test_tilted_normal_corner_force_balance`: sphere wedged in a
  corner (ground + vertical wall) with tilted gravity. Pins the
  total-force balance at -gravity, the per-counterpart split into
  ground and wall contributions, and -- critically --
  `force_matrix_friction` against an analytical
  `f - (f.n) n` reference. A vertical-only scene can't distinguish
  cc.normal from rigid_contact_normal because both are +Z; this one
  fails immediately if those two normals disagree, since the
  friction projection picks up an O(weight) phantom tangential
  component.

- `test_articulated_robot_per_link_weight`: 2-link revolute robot
  flat on the ground. Each link must report exactly its own weight
  in vertical contact force. Catches any phoenx contact + ADBS
  joint-constraint coupling regression that would leak vertical
  impulse into the contact rows.

- `test_substep_steady_state_invariant`: same scene at substeps=1
  vs substeps=8, settled. The readback divides by the substep dt;
  in steady state every substep carries the same impulse so the
  per-frame readout must equal m*g regardless of substep count. A
  regression in the dt scaling would surface as a per-substep
  multiplier on Fz.
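
Two of the references used by these tests are simple enough to write down. The sketch below is plain Python with hypothetical helper names (`friction_part`, `steady_state_readout`); it assumes a unit normal and a settled body whose contact impulse per substep is exactly `m * g * substep_dt`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def friction_part(f, n):
    """Tangential reference f - (f.n) n, for a unit normal n."""
    fn = dot(f, n)
    return tuple(fi - fn * ni for fi, ni in zip(f, n))


def steady_state_readout(mass, g, frame_dt, substeps):
    """Force readout when every substep carries the same impulse."""
    substep_dt = frame_dt / substeps
    impulse = mass * g * substep_dt  # impulse that holds the body for one substep
    return impulse / substep_dt      # the readback divides by the substep dt


weight = (0.0, 0.0, 10.0)
# Correct normal: no tangential component.
print(friction_part(weight, (0.0, 0.0, 1.0)))
# Mismatched (tilted) normal: a phantom tangential component of order the weight.
print(friction_part(weight, (0.6, 0.0, 0.8)))
# The readout is independent of the substep count.
print(steady_state_readout(2.0, 9.81, 1.0 / 60.0, 1))
print(steady_state_readout(2.0, 9.81, 1.0 / 60.0, 8))
```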

Test count goes 9 -> 12; runtime stays at ~2.1 s on a warm cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: JC <jumyungc@nvidia.com>
adenzler-nvidia and others added 30 commits May 18, 2026 07:43
Signed-off-by: Alain Denzler <adenzler@nvidia.com>
Signed-off-by: adenzler-nvidia <adenzler@nvidia.com>
Co-authored-by: JC-nvidia <116605903+jumyungc@users.noreply.github.com>
Wires the existing UnionFindIslandBuilder into SolverPhoenX behind a
new `sleeping_velocity_threshold` ctor arg. Zero (default) skips every
allocation and per-step kernel; > 0 enables the full pipeline.

Per step (when enabled):
- Union each attached shape's world-frame AABB into a per-body AABB via
  2D atomic min/max; finalize the diagonal length on the body.
- Build islands over the active interaction set (rigid bodies as nodes).
- Per-body score `|v| + 0.5 * diag * |omega|` atomic-maxed into the
  body's island slot, gated by an `if score > island_max[island]` early
  exit so the float atomic only fires on improving threads.
- Mark islands sleeping when their max score is below threshold, then
  propagate to a per-body `is_sleeping` flag on BodyContainer.
- Apply-forces-and-gravity kernel early-returns on sleeping bodies and
  marks them STATIC for the constraint solve.
- Element view rewritten: any slot pointing to a sleeping body becomes
  -1, so the partitioner adjacency count adds nothing and the
  constraint drops out of every colour bucket.
- Broad-phase filter (shared with the cloth share-vertex filter) gains
  a sleeping-aware drop for rigid-rigid pairs where both bodies sleep;
  sleeping-vs-awake pairs pass through so contacts can wake the island
  the next step.
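
The scoring and marking steps can be illustrated on the host. This is a serial sketch under assumptions, not the GPU pipeline: `sleep_score` and `mark_sleeping` are hypothetical names, island ids are taken as given, and the atomic max is replaced by a plain comparison.

```python
import math


def sleep_score(v, omega, diag):
    """Per-body score |v| + 0.5 * diag * |omega|."""
    speed = math.sqrt(sum(c * c for c in v))
    spin = math.sqrt(sum(c * c for c in omega))
    return speed + 0.5 * diag * spin


def mark_sleeping(island_of_body, scores, threshold):
    """Return a per-body sleeping flag from per-island max scores."""
    if threshold <= 0.0:
        # Zero threshold: sleeping is disabled, nothing is ever marked.
        return [False] * len(island_of_body)
    island_max = {}
    for body, island in enumerate(island_of_body):
        # Mirrors the gate: only update when this body improves the max.
        if scores[body] > island_max.get(island, 0.0):
            island_max[island] = scores[body]
    return [island_max.get(island, 0.0) < threshold for island in island_of_body]


flags = mark_sleeping(island_of_body=[0, 0, 1], scores=[0.01, 0.5, 0.02], threshold=0.1)
```

Body 0 is slow but shares an island with body 1, so it stays awake; only the island consisting of body 2 is marked sleeping.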

All scratch is pre-allocated; per-step kernels read device-side bounds
(`num_active_constraints[0]`, `num_sets[0]`), no `.numpy()` reads.
Graph capture works end-to-end. Sentinel arrays keep the Warp ABI bound
when sleeping is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six disjoint 64-body clusters, each emitting ~470 mixed-arity
interactions (pair / triplet / quad / 6-body) plus 2 high-degree
hubs at 24 partners. Total ~2640 elements; in-test assertion guards
the 2k minimum if the band sizes are ever tuned down.

Builds the workload six times and byte-compares set_nr, set_sizes,
set_sizes_compact, and num_islands against snapshot 0 -- catches any
regression in the post-sort min-index path that would otherwise leak
through smaller fixtures. Also asserts island ids stay strictly
ordered by their min body id, the invariant the deterministic
post-sort exists to enforce.
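
The ordering invariant can be shown with a small serial union-find. This is a sketch, not the `UnionFindIslandBuilder`: `build_islands` is a hypothetical helper, and unlike the real builder it also assigns ids to bodies that have no interactions.

```python
def build_islands(num_bodies, interactions):
    """Label bodies with island ids ordered by each island's minimum body id."""
    parent = list(range(num_bodies))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for group in interactions:  # pairs, triplets, quads, ... all handled alike
        for other in group[1:]:
            a, b = find(group[0]), find(other)
            if a != b:
                # Always keep the smaller id as root, so a root is its island's min body id.
                parent[max(a, b)] = min(a, b)

    roots = sorted({find(b) for b in range(num_bodies)})
    island_id = {root: i for i, root in enumerate(roots)}
    return [island_id[find(b)] for b in range(num_bodies)]


first = build_islands(6, [(4, 5), (0, 2, 3)])
second = build_islands(6, [(3, 2, 0), (5, 4)])  # same graph, different input order
```

Because ids are derived from the sorted minimum body ids rather than from traversal order, both calls produce the same labelling.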

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bodies now require their island's max-velocity score to stay below
sleeping_velocity_threshold for sleeping_frames_required consecutive
frames (default 30, ~0.5 s at 60 Hz) before is_sleeping flips. Wake
stays single-frame and island-wide: the moment any body in the
island lifts the max above threshold, every body in the island
resets its counter to 0 in the same kernel pass.

Trade-off vs. the prior single-frame logic: barely-settled stacks
no longer thrash sleep/wake every frame, which kept the partitioner
rebuilding the colour layout. Cost is one int32 array (counter) and
one extra saturating-increment branch in the propagate kernel.

sleeping_frames_required=0 recovers the previous single-frame
behavior; the saturating cap collapses to zero and the >= check
fires on the first below-threshold frame.
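
A minimal sketch of the counter logic, assuming one counter per body and an island-wide below-threshold flag; `update_sleep_counter` is a hypothetical name, not the kernel in this commit.

```python
def update_sleep_counter(counter, island_below_threshold, frames_required):
    """Return (new_counter, is_sleeping) for one body after one frame."""
    if not island_below_threshold:
        # Wake is single-frame: the counter resets as soon as the island max rises.
        return 0, False
    counter = min(counter + 1, frames_required)  # saturating increment
    return counter, counter >= frames_required


counter, history = 0, []
for below in [True, True, True, True, False]:
    counter, sleeping = update_sleep_counter(counter, below, frames_required=3)
    history.append((counter, sleeping))
```

With `frames_required=0` the cap is zero, so the counter stays at 0 and `0 >= 0` puts the body to sleep on the first below-threshold frame.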

Tests cover all four code paths through the new kernel:
- counter ticks up + saturates at the required count;
- counter blocks premature sleep while still climbing;
- counter resets to 0 in the same step where the island wakes
  (via a high state.body_qd injection);
- sleeping_frames_required=0 still sleeps on the first below-
  threshold frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several issues conspired to make settled stacks explode the moment a
body transitioned to ``is_sleeping``, and to ignore user-applied
forces on sleeping islands.

* The island builder unioned dynamic bodies through the world anchor
  via shared ground contacts -- one big island covered every stack
  resting on the same plane. Filter out non-dynamic endpoints in
  ``_phoenx_copy_elements_to_int2d_kernel`` so the anchor cannot
  bridge separate dynamic islands.

* The broad-phase sleeping filter only dropped pairs where both
  shapes' bodies had ``is_sleeping``. Pairs involving the world
  anchor (or any static / kinematic body) and a sleeping body
  passed through. Extend ``phoenx_cloth_share_vertex_filter`` to
  treat any non-dynamic body as frozen, and add ``body_motion_type``
  to the filter data.

* ``contact_iterate`` / ``contact_iterate_multi`` ran the
  positional-bias solve even when ``access_mode == STATIC`` -- the
  ``inverse_mass`` of a sleeping body is non-zero, so the bias
  impulse compounded over 8 substeps x 6 iterations into ~18 m/s
  in one frame. Add a backup early-out for frozen-vs-frozen pairs
  to cover the sleep-transition frame, where the broad-phase
  contact had already been ingested before ``is_sleeping`` flipped.

* ``_integrate_velocities_kernel`` advanced position for every
  dynamic body regardless of ``is_sleeping``, sliding the sleeping
  island at the residual sub-threshold velocity. Skip sleeping
  bodies in the position integration.

* External forces (e.g. picking) hit the early-return inside
  ``_phoenx_apply_forces_and_gravity_kernel`` and were silently
  discarded for sleeping bodies. Fold ``F * inv_mass * step_dt``
  (and the torque analogue) into the per-island sleep score so any
  meaningful user-applied wrench wakes the whole island the next
  step. Gravity is applied separately and is *not* counted, so
  awake stacks stay awake while settled stacks under gravity alone
  still sleep.

Adds ``test_external_force_wakes_sleeping_island``; the existing
nine sleeping tests continue to pass.

Adds a 40-layer Kapla-style square tower demo (single tower or
grid_side x grid_side tiles in one shared world) that opts into
PhoenX island sleeping. Two per-shape colour buffers (active +
30%-dim) plus a per-frame GPU kernel pick which colour to write
into ``model.shape_color`` based on each body's ``is_sleeping``
flag; the existing viewer pipeline repacks and uploads on its
own, so no shader changes are needed and zero Python work runs
in the hot loop.

The example wires the PhoenX sleeping broad-phase filter into
the bare ``newton.CollisionPipeline`` it builds; without that
the ground-vs-sleeping pairs would survive the broad phase, and
the substep solve would re-inject energy into the freshly
sleeping island.

The sleeping-aware broad-phase filter runs inside ``model.collide()``,
which the host calls *before* ``solver.step()``. By the time the
per-step sleeping pass inside ``world.step()`` lifts ``is_sleeping``
on a body that picking just pushed, the filter has already dropped
every plank-vs-plank and plank-vs-ground pair on the wake frame --
the picked plank free-accelerates against an empty stack and the
column folds. ``set_nr`` from previous frames is no help either: as
soon as the stack falls asleep the broad-phase filter collapses
every interaction, the island builder sees no edges and ``set_nr``
decays to per-body singletons, so a naive "has external force?
wake my island" pass only wakes the picked body.

* Snapshot ``set_nr`` per body at the end of every
  ``_run_sleeping_pass`` for bodies that are still awake. Bodies
  that just transitioned to sleeping skip the write, so their entry
  retains the previous frame's awake-state id -- the one built
  while the stack still had its full contact graph.

* Add ``PhoenXWorld.wake_on_external_input``: fan-in per body that
  carries non-zero ``bodies.force`` or ``bodies.torque`` into a
  per-island flag (indexed by the last-awake snapshot), then clear
  ``is_sleeping`` for every body in any flagged island. Cheap, fully
  graph-capture safe, no-op when sleeping is disabled.

* Add ``SolverPhoenX.wake_on_external_input(state)``: imports
  ``state.body_f`` into PhoenX's force accumulators first, then
  drives the world pass. Host loop becomes::

      state.body_f.assign(...)              # picking / wrenches
      solver.wake_on_external_input(state)  # propagate wake
      model.collide(state, contacts)        # filter keeps pairs
      solver.step(state, state_out, ...)

* Wire the kapla example through ``world.wake_on_external_input()``
  between ``picking.apply_force()`` and ``model.collide()``.
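
The fan-in can be pictured as a two-pass host loop. This is a sketch under assumptions: `wake_islands_with_input` is a hypothetical function, the arrays are plain lists, and `last_awake_island` stands for the per-body snapshot of ``set_nr`` described above.

```python
def wake_islands_with_input(force, torque, is_sleeping, last_awake_island):
    """Clear is_sleeping for every body whose last-awake island received a wrench."""
    flagged = set()
    for body in range(len(force)):
        # Pass 1 (fan-in): any non-zero force or torque flags the body's island.
        if any(force[body]) or any(torque[body]):
            flagged.add(last_awake_island[body])
    # Pass 2: wake every body that belongs to a flagged island.
    return [
        sleeping and last_awake_island[body] not in flagged
        for body, sleeping in enumerate(is_sleeping)
    ]


zero = (0.0, 0.0, 0.0)
result = wake_islands_with_input(
    force=[zero, (1.0, 0.0, 0.0), zero],
    torque=[zero, zero, zero],
    is_sleeping=[True, True, True],
    last_awake_island=[7, 7, 3],
)
```

Bodies 0 and 1 share snapshot island 7, so a force on body 1 wakes both; body 2 stays asleep.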

With the pre-collide pass installed, picking the bottom plank of a
40-layer settled tower preserves every contact on the wake frame
(1920 vs 0 before), |v|max stays at ~0.02 m/s, z_max doesn't drift
through 30 frames of dragging. Adds
``test_external_force_wakes_full_stack_via_pre_collide_pass``; all
eleven sleeping tests pass.

Replace bodies.is_sleeping + last_awake_set_nr with one int per body:
bodies.island_root. -1 = awake; otherwise the persistent root body id
(lowest body id in the island when it slept). One field, half the
memory, and the wake mechanism keys off a stable body id instead of
the volatile per-frame compact set_nr.

Per step, the sleeping pass now:

  1. self-wake fan + apply for sleeping bodies whose velocity / force
     got externally injected since the last step (host-side state
     mutation that bypasses wake_on_external_input);
  2. detect_active_islands -- a sleeping island is "active" iff one
     of its members shares a constraint with an awake dynamic body;
  3. inject_chain_edges -- for every body in an active island, emit
     an artificial (body, island_root) edge into a second union-find
     interaction array. Inactive sleeping islands stay filtered out
     for free;
  4. union-find runs over (real elements + chain edges) in one pass;
  5. propagate: a sleeping body in an awake compact island clears
     island_root (the chain edges + awake bridge merged them into
     one live island, so propagate wakes the whole group atomically
     via the natural island_is_sleeping = 0 path -- no separate
     wake-on-contact fanin needed);
  6. awake bodies whose island has been below threshold for
     sleeping_frames_required frames stamp island_root with the
     compact island's lowest awake body id.
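
Steps 2 to 5 can be sketched serially. This is a simplified illustration, not the code in sleeping_kernels.py: `wake_via_chain_edges` is a hypothetical name, all bodies are assumed dynamic, and an island counts as awake whenever it contains an awake body (the velocity score and frame counter are left out).

```python
def wake_via_chain_edges(constraints, island_root):
    """Return island_root after one wake pass; -1 means awake."""
    num_bodies = len(island_root)
    # Step 2: a sleeping island is active if a member shares a constraint with an awake body.
    active_roots = set()
    for a, b in constraints:
        if island_root[a] != -1 and island_root[b] == -1:
            active_roots.add(island_root[a])
        if island_root[b] != -1 and island_root[a] == -1:
            active_roots.add(island_root[b])
    # Step 3: artificial (body, island_root) edges for every body of an active island.
    chain_edges = [(body, root) for body, root in enumerate(island_root) if root in active_roots]
    # Step 4: one union-find pass over real constraints plus chain edges.
    parent = list(range(num_bodies))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in list(constraints) + chain_edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    # Step 5: a sleeping body that landed in an island with an awake body wakes up.
    awake_islands = {find(b) for b in range(num_bodies) if island_root[b] == -1}
    return [-1 if find(b) in awake_islands else island_root[b] for b in range(num_bodies)]


# Body 0 is awake and touches body 3; bodies 1..3 sleep with root 1; body 4 sleeps alone.
woken = wake_via_chain_edges(constraints=[(0, 3)], island_root=[-1, 1, 1, 1, 4])
```

A single contact with body 3 wakes bodies 1, 2 and 3 in the same pass, while the untouched island rooted at body 4 stays asleep.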

Fixes the flood-fill where waking one body of a sleeping group
propagated body-by-body through fresh contacts over multiple frames:
the chain edges pull the entire sleeping island into the live compact
island the moment any external body touches it, so all members wake
on the same step.

All sleep-aware kernels live in a new newton/_src/solvers/phoenx/
sleeping_kernels.py module, kept separate from the generic
UnionFindIslandBuilder (which gains a single optional extra_edges
parameter for the second unite pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Graph-coloring warm-start kept the same constraint locked in the
same colour every frame. PGS with finite iterations converges to a
biased fixed point under any fixed coloring; over hundreds of frames
the bias accumulates and on tall stacks like the Kapla tower causes
catastrophic drift (max brick displacement ~3.3 m after 1000 frames).

Cold-start coloring (warm-start disabled) varies frame to frame -
contact counts feed the high bits of the MIS priority, and they
shift slightly as bodies settle, so the coloring drifts with them
and the bias averages out (max drift 0.15 m). The fix preserves the
warm-start fast path while periodically breaking the lock-in:

* set_warm_start_invalidate_period(N): zero the cache every Nth
  build_csr so the next seed_warm_start_kernel finds an empty cache
  and greedy MIS re-derives the coloring from scratch.
* set_warm_start_rotate_skip(True): each step, pick one cached
  colour round-robin and skip seeding its entries, forcing MIS to
  re-pick those constraints. Over num_colors steps every colour
  cycles through a re-MIS pass at ~1/num_colors of the cold-start
  cost.
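
The effect of the two knobs on the seed step can be sketched as a filter over the cache. This is an illustration with assumed details: `seed_from_cache` is a hypothetical helper, the cache is a plain dict from constraint id to colour, and the invalidation is assumed to land on steps where `step % period == 0`.

```python
def seed_from_cache(cache, step, num_colors, invalidate_period=0, rotate_skip=False):
    """Return the cached colours reused this step; everything else is re-coloured."""
    if invalidate_period > 0 and step % invalidate_period == 0:
        return {}  # cache zeroed: the colouring is re-derived from scratch
    if rotate_skip and num_colors > 0:
        skipped = step % num_colors  # round-robin over the cached colours
        return {cid: col for cid, col in cache.items() if col != skipped}
    return dict(cache)  # both knobs off: plain warm start


cache = {10: 0, 11: 1, 12: 2, 13: 1}
full_rebuild = seed_from_cache(cache, step=4, num_colors=3, invalidate_period=4, rotate_skip=True)
partial = seed_from_cache(cache, step=1, num_colors=3, invalidate_period=4, rotate_skip=True)
```

On step 1 only the constraints cached at colour 1 are handed back to the colouring pass, so the extra work is a fraction of a cold start.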

Combined as period=4 + rotate_skip on the Kapla tower probe: max
drift 0.19 m at 1000 frames vs 3.33 m baseline, FPS unchanged at
110 (pure cold-start drops to 99). Both knobs are exposed via
PhoenXWorld kwargs (warm_start_invalidate_period,
warm_start_rotate_skip_color) and default to off so existing scenes
see no behavioural change.

Also adds a sweep_direction device flag and a symmetric_color_sweep
opt-in that alternates forward/reverse colour-iteration order each
sweep (the user's original hypothesis). It only swaps edge colours -
middle colours stay near the middle of the sweep - so it's not a
sufficient drift fix on its own, but it's kept as a building block
for future PGS scheduling experiments. The plumbing (sweep_direction
array, colour-range helpers returning the mapped colour index, the
extra kernel parameter on the single-world head + fused-tail
kernels) is a no-op when the flag is off and graph-capture safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``capture_while_greedy_coloring`` to ``PhoenXWorld``. When set,
the greedy MIS build replaces its fixed ``MAX_GREEDY_OUTER_ITERS``
host-side loop with ``wp.capture_while(num_remaining, body)``. The
captured graph exits as soon as ``num_remaining`` hits 0 instead of
running the body to the cap and relying on per-thread early-exit.
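
The difference in launch count can be seen in a plain-Python analogy. This does not use Warp; `body` stands in for one round of the colouring kernels, and the cap of 16 is a made-up value for illustration, not the real ``MAX_GREEDY_OUTER_ITERS``.

```python
def run_fixed_loop(num_remaining, body, max_iters):
    """Always run the body max_iters times; late iterations are no-ops."""
    launches = 0
    for _ in range(max_iters):
        num_remaining = body(num_remaining)
        launches += 1
    return num_remaining, launches


def run_while_loop(num_remaining, body):
    """Stop as soon as nothing remains, as a conditional graph loop would."""
    launches = 0
    while num_remaining > 0:
        num_remaining = body(num_remaining)
        launches += 1
    return num_remaining, launches


def halve(remaining):
    # Toy round: colours roughly half of what is left; does nothing at zero.
    return remaining // 2


cold_fixed = run_fixed_loop(5, halve, max_iters=16)
cold_while = run_while_loop(5, halve)
warm_fixed = run_fixed_loop(0, halve, max_iters=16)  # cache already covers everything
warm_while = run_while_loop(0, halve)
```

In the warm case the fixed loop still issues every launch while the conditional loop issues none, which is the regime where the commit reports the largest saving.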

Profile data on Kapla (10620 bodies, 67k constraint columns,
single-world layout, --num-frames=100):

* Warm-start mode, fixed loop: 12,800 partitioner kernel
  launches/100 frames = 0.21 ms/frame (all are post-validation
  no-ops since the cache covers the full constraint set).
* Warm-start mode, capture_while: 880 launches = 0.04 ms/frame --
  **18x fewer launches, ~2% step-time saving**. The watcher adds
  ~210 ``set_conditional_if_handle_kernel`` launches/100 frames,
  noise-level.

Cold-start mode (no warm-start cache available) sees a smaller
benefit: ~36% fewer launches (12.8k -> 8.2k), but those launches
were already doing real MIS work so the time savings are smaller
(~76 us/frame).

The flag is opt-in for compatibility with the existing fixed-loop
performance contract (PERF_NOTES documents the original switch from
``capture_while`` to fixed loop as a step-time win on contended
scenes). The kapla drift probe enables it by default and gains
~3 fps end-to-end on top of the warm-start lock-in fix.

The legacy fixed-loop path is preserved unchanged for scenes where
the watcher overhead would dominate (e.g. extremely small graphs
that converge in 1 iteration, or contexts where capture_while's
per-iteration conditional graph node is expensive relative to the
saved per-thread no-op launches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PhoenX is experimental; we don't need backward-compat for these
solver defaults. Promote the three knobs that have no measurable
drawback and clear upside on the contact-heavy rigid scenes we
benchmark:

* ``warm_start_invalidate_period`` default 0 -> 4. Triggers a full
  re-color every 4 steps so the warm-start cache can't lock in a
  PGS-biasing coloring forever.
* ``warm_start_rotate_skip_color`` default False -> True. Each
  step skips re-seeding one cached colour (round-robin) so MIS
  re-derives ~1/num_colors of the assignments while the rest stay
  warm-started. Continuous low-cost stir on top of the periodic
  rebuild above.
* ``capture_while_greedy_coloring`` default False -> True. Greedy
  MIS outer loop uses ``wp.capture_while(num_remaining, ...)``
  instead of the fixed ``MAX_GREEDY_OUTER_ITERS`` host unroll. On
  the warm-start fast path the captured graph exits after ~1
  body iteration instead of running to the cap; 18x fewer MIS
  launches per build on Kapla.

Kapla 1000-frame drift probe with new defaults: max drift 0.175 m
(vs 3.33 m with all three off), full warm-start FPS retained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Çatalyürek-style speculative coloring algorithm alongside the
existing JP-MIS path. Three phases per round:

1. **Pick** -- every uncoloured constraint picks the smallest colour
   not used by its already-coloured neighbours (full forbidden-mask
   scan, no priority gating).
2. **Validate** -- each uncoloured constraint scans its uncoloured
   neighbours; aborts if any with strictly higher priority picked the
   same tentative colour. Reads-only on ``color_tags`` so no race
   with concurrent commits.
3. **Commit** -- non-aborted constraints stamp ``color_tags`` and
   decrement ``num_remaining``. Run as a separate launch to keep
   phase 2 race-free.
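
The three phases can be written as a serial reference. This is a sketch, not the GPU kernels: `speculative_coloring` is a hypothetical function, nodes stand for constraints, `neighbours` lists the constraints that share a body, `priority` is assumed to be a permutation with no ties, and the 64-colour bitmask limit is ignored.

```python
def speculative_coloring(neighbours, priority):
    """Colour a graph with pick / validate / commit rounds; returns (colours, rounds)."""
    n = len(neighbours)
    color = [-1] * n
    rounds = 0
    while any(c < 0 for c in color):
        rounds += 1
        # Phase 1 (pick): smallest colour not used by already-coloured neighbours.
        tentative = [-1] * n
        for v in range(n):
            if color[v] < 0:
                used = {color[u] for u in neighbours[v] if color[u] >= 0}
                c = 0
                while c in used:
                    c += 1
                tentative[v] = c
        # Phase 2 (validate): abort if a higher-priority uncoloured neighbour picked the same.
        aborted = [False] * n
        for v in range(n):
            if color[v] < 0:
                aborted[v] = any(
                    color[u] < 0 and priority[u] > priority[v] and tentative[u] == tentative[v]
                    for u in neighbours[v]
                )
        # Phase 3 (commit): survivors stamp their colour.
        for v in range(n):
            if color[v] < 0 and not aborted[v]:
                color[v] = tentative[v]
    return color, rounds


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = [[1, 2], [0, 2], [0, 1, 3], [2]]
colors, rounds = speculative_coloring(graph, priority=[3, 1, 2, 0])
```

The highest-priority uncoloured node can never abort, so every round commits at least one node and the loop terminates. In round 2 of this example nodes 2 and 3 commit different colours together, which is the multi-colour-per-round behaviour the message describes.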

Determinism: same fixed priority permutation as JP-MIS, plus the
3-phase split that eliminates the read/write race a 2-phase
pick+commit would have. Same inputs -> same colouring across runs
and hardware.

Motivation: JP-MIS commits at most one colour per round
(local-maxima only at the smallest free colour), so a 67k-constraint
graph drains in ~80 inner launches. Speculative commits at multiple
colours per round (any uncoloured constraint whose tentative colour
isn't claimed by a higher-priority neighbour), bringing the round
count down to ~32. The per-round cost is higher (3 kernels with
neighbour scans on both ``color_tags`` and ``tentative_color``), so
net wall-clock on Kapla is comparable to MIS+capture_while
(~1.37 vs 1.65 ms per 100 frames cold-start in nsys).

Where speculative wins:
* Graphs that exceed the JP-MIS outer-iter cap on round count.
* Future tuning of the validate kernel (shared-mem caching of
  ``tentative_color``, better short-circuit on small forbidden
  masks).

Default OFF because it doesn't out-pace the existing greedy MIS on
the scenes we benchmark. Wire-up via
``PhoenXWorld(speculative_coloring=True)`` or
``partitioner.set_speculative_coloring(True)``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Record what we learned in the warm-start-coloring drift investigation
so future work doesn't re-litigate the dead ends:

* Warm-start cache stir (period + rotate-skip) as the actual fix,
  with measured drift / cost trade-offs.
* ``wp.capture_while`` win on the warm-start fast path (18x fewer
  MIS launches).
* Speculative coloring is implemented and deterministic but doesn't
  out-pace JP-MIS+capture_while on the scenes we benchmark; kept as
  an opt-in building block.

Also captures the negative results (symmetric / cyclic colour sweep,
per-thread early-exit, lower MAX_GREEDY_OUTER_ITERS) so future tuners
don't repeat them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes:

1. **Speculative coloring overflow fix.** When the int64 forbidden
   mask saturates (>= GREEDY_MAX_COLORS = 64 distinct neighbour
   colours), pick falls back to colour 63. If that colour is
   already taken by a coloured neighbour, validate aborts on the
   coloured-neighbour conflict check and ``num_remaining`` never
   decreases -- ``capture_while`` spins forever.

   Fix: a 1-thread ``speculative_overflow_exit_kernel`` zeroes
   ``num_remaining`` when ``overflow_flag`` is set, surfacing the
   saturation to the JP fallback wrapper. Also extends the validate
   kernel to check coloured-neighbour colour conflicts (the missing
   case that let two saturated constraints sharing a body both
   commit to colour 63).

   Speculative coloring now produces valid colourings on dense
   synthetic graphs that exceed 64 colours, same JP fallback
   contract as JP-MIS.

2. **CUDA + graph capture regression tests.** New file
   ``test_graph_coloring_speculative.py`` exercises the new code
   paths via ``wp.ScopedCapture`` + ``wp.capture_launch`` (matches
   the project convention -- the production code path is captured
   graphs, not eager launches). Tests:

   * ``test_speculative_valid_under_capture`` -- captured
     speculative build produces a valid colouring across replays.
   * ``test_speculative_deterministic_under_capture`` -- same
     inputs, same captured colouring across runs.
   * ``test_speculative_warm_start_under_capture`` -- speculative
     + warm-start cache produces valid colourings on cached replays.
   * ``test_rotate_skip_valid_under_capture`` -- rotate-skip
     re-MIS cycle stays valid across 8 replays.
   * ``test_invalidate_period_valid_under_capture`` -- periodic
     full-cache invalidate stays valid across 7 replays straddling
     two invalidation boundaries.

   Full suite: 5 tests, 6 s with warm Warp cache, 60 s on first
   compile -- within the 1-minute test policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PhoenX is experimental and the user wants the best-performing
coloring as the default. Speculative beats JP-MIS on the cold-start
path measured by nsys (1.37 vs 1.65 ms per 100 frames on Kapla,
~17 % faster) and matches it on the warm-start fast path. Coloring
quality (number of colours) is within 1-2 of JP-MIS so PGS sweep
cost is unaffected.

Determinism and correctness are equivalent: same fixed priority
permutation, same JP fallback on bitmask overflow, plus the
overflow-exit kernel that prevents capture_while from spinning when
the graph's chromatic number exceeds GREEDY_MAX_COLORS.

End-to-end FPS on Kapla is unchanged (110 fps either way) because
cold-start coloring is only 1/4 of frames under the default cache
stir. Scenes with more cold-start frames (e.g. high contact churn)
will see proportionally larger wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalises ``set_warm_start_rotate_skip`` from a single cached
colour per step to a configurable range of consecutive colours.
``width=1`` matches the previous single-colour skip (default);
``width=N`` skips ``N`` consecutive colours per step, rotating
round-robin via the step counter, so all ``num_colors`` colours
cycle through a re-MIS pass every ``num_colors / width`` steps.

Cost scales linearly with ``width`` (each skipped colour is re-MIS
work this step). On the Kapla drift probe ``width=4`` + ``period=4``
matches ``width=1`` + ``period=4`` within drift noise (0.20 m vs
0.23 m at 1000 frames, both ~111 fps). The wider stir is kept as a
knob for scenes where ``width=1`` rotates too slowly through the
colour set.

Adds ``warm_start_rotate_skip_width`` to ``PhoenXWorld`` (default
1). Replaces the single-int ``skip_color_plus_one`` cache entry
with start/end inclusive range arrays in the partitioner; the seed
kernel falls back to a no-op when ``start > end``.
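
One way to derive the inclusive range from the step counter is sketched below. The windowing and the clamping at the last colour are assumptions for illustration; `rotate_skip_range` is a hypothetical helper and the partitioner may wrap differently.

```python
def rotate_skip_range(step, num_colors, width=1):
    """Inclusive (start, end) range of cached colours to skip on this step."""
    if num_colors <= 0 or width <= 0:
        return 0, -1  # start > end: the seed kernel treats this as a no-op
    num_windows = -(-num_colors // width)  # ceil(num_colors / width)
    start = (step % num_windows) * width
    end = min(start + width, num_colors) - 1  # clamp the last, possibly short, window
    return start, end


windows = [rotate_skip_range(step, num_colors=5, width=2) for step in range(4)]
```

With five colours and `width=2` the ranges cycle through (0, 1), (2, 3), (4, 4) and back to (0, 1), so every colour is revisited every three steps.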

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3-phase pipeline (pick + validate + commit) was conservatively
split to give validate a clean ``color_tags`` snapshot, but the
race it was protecting against is actually benign:

When two constraints A, B share a body and both pick colour ``c``:
priorities are a permutation (no ties), so exactly one wins. The
loser aborts regardless of whether it reads the winner as
"uncoloured but higher priority" (pre-commit) or "coloured at my
tentative" (post-commit) -- the abort outcome is identical, and
the winner only writes ``color_tags`` AFTER all of its own
neighbour checks pass.

Fusing validate and commit cuts one launch per speculative round
(~32 launches/frame on Kapla cold-start, ~14 launches/frame on
warm-start mode). Per-round wall-clock now: pick (18us) +
validate_commit (25us) + overflow_exit (0.6us) = ~44us, vs the
3-phase ~46us.

Tests in ``test_graph_coloring_speculative`` still pass under
``wp.ScopedCapture`` -- the fused kernel keeps the same
deterministic priority-tiebreak semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lower the default per-substep PGS schedule (substeps 8 -> 6,
solver_iterations 10 -> 8) -- with the new warm-start cache stir +
speculative coloring the scene settles at this shorter schedule and
runs ~25 % faster per frame without visible quality loss.

Bump default ``--grid-side`` 5 -> 20 so the benchmark scene exercises
the multi-instance contact-heavy regime that the recent partitioner
work targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The body-locality sort that runs once per ``build_csr`` used to do
two stable radix sorts back-to-back:

  1. sort by ``eid`` (deterministic tie-break for the next sort)
  2. stable sort by ``(colour << 32) | body_min``

The two-pass design was for tie-breaking determinism: the stable
sort in pass 2 preserved pass 1's eid order whenever ``(colour,
body_min)`` tied. The same final ordering can be produced with a
single sort by packing all three fields into one int64 key:

  bits 0..24  : eid  (25 bits -> 32M constraints)
  bits 25..48 : body_min  (24 bits -> 16M bodies)
  bits 49..63 : colour  (15 bits -> 32K colours)

Lexicographic sort on this packed key is equivalent to
``(colour, body_min, eid)`` ordering, which matches what the
two-pass produced.
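
The equivalence is easy to check on the host. The sketch below follows the bit layout listed above; `pack_key` is a hypothetical helper, and since Python integers are unbounded it does not model the int64 sign bit.

```python
EID_BITS = 25      # bits 0..24
BODY_BITS = 24     # bits 25..48
COLOUR_BITS = 15   # bits 49..63


def pack_key(colour, body_min, eid):
    """Pack (colour, body_min, eid) into one integer sort key."""
    if not 0 <= eid < (1 << EID_BITS):
        raise ValueError("eid out of range")
    if not 0 <= body_min < (1 << BODY_BITS):
        raise ValueError("body_min out of range")
    if not 0 <= colour < (1 << COLOUR_BITS):
        raise ValueError("colour out of range")
    return (colour << (EID_BITS + BODY_BITS)) | (body_min << EID_BITS) | eid


items = [(2, 5, 1), (1, 9, 3), (1, 9, 2), (0, 7, 8)]  # (colour, body_min, eid)
by_key = sorted(items, key=lambda item: pack_key(*item))
by_tuple = sorted(items)  # lexicographic (colour, body_min, eid)
```

As long as each field fits in its bit range, comparing packed keys is the same as comparing the tuples, so one sort replaces the eid pre-sort plus the stable second pass.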

Profile on ``example_kapla_square_tower --grid-side=20`` (100
frames):

  DeviceRadixSortOnesweep  380 -> 282 M ns (-26 %)
  Locality sort kernels    2 writeback + 2 key-build -> 1 + 1

Net step-time saving ~1 ms/frame on this scene (~4 % of total
GPU time at this scale). Tests pass under ``wp.ScopedCapture`` --
the merged sort produces the same colouring as the two-pass on
the existing stress workload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanup pass over the graph-coloring + solver modules. Net -534
lines, no behaviour change (48-test sweep passes).

* Drop the legacy 2-pass body-locality kernels
  (``_locality_eid_keys_kernel``, ``_locality_compute_keys_kernel``)
  -- subsumed by ``_locality_combined_keys_kernel`` in commit
  ``d8ad6a64``. ``luby_fixed`` now also uses the single-pass sort.
* Drop the 3-phase speculative kernels
  (``speculative_validate_kernel``, ``speculative_commit_kernel``)
  -- subsumed by ``speculative_validate_commit_kernel`` in commit
  ``660944ab``.
* Shorten docstrings and inline comments in
  ``graph_coloring_common.py``, ``graph_coloring_incremental.py``,
  ``warm_start.py``, ``luby_fixed.py``, ``solver_phoenx.py``,
  ``solver_phoenx_kernels.py``. Kept the *why* (correctness
  rationale, capture safety, race analysis) and dropped the
  *narrative* (background prose, restated invariants, version
  history).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Net -114 lines. No behaviour change. Dispatcher modules and the
``SolverPhoenX`` constructor had multi-paragraph docstrings that
restated invariants already covered in the protocol / PhoenXWorld
docs. Kept the load-bearing rationale (race analysis, fallback
contract, capture safety) and dropped the narrative.

Also exports ``warm_start_mark_boundaries_kernel`` from the
warm_start module ``__all__`` (it's used by the partitioner but was
missing from the public surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file was a personal drift probe kept untracked on purpose
(the user uses it for git-bisect tests across the partitioner
work). It was swept into a cleanup commit by ``git add <dir>``;
this commit returns it to its untracked working-tree state.

Fix the kapla_tower2 camera-collider regression where a kinematic
body (e.g. a fly-through camera collider) sweeps through a sleeping
island without waking anything. The wake required four overlapping
fixes -- each on its own left the island re-marked as sleeping every
frame:

 - cloth_collision: treat KINEMATIC bodies as non-frozen in the
   sleep-aware broad-phase filter so kinematic-vs-sleeping pairs
   survive the SAP pass and produce contacts.
 - solver_phoenx_kernels: stop collapsing KINEMATIC bodies to -1 in
   _constraints_to_elements_kernel (inverse_mass == 0 but
   motion_type != STATIC) so the kinematic body appears as an
   adjacent node in the contact's interaction element.
 - sleeping_kernels: keep KINEMATIC bodies in the union-find input
   (_phoenx_copy_elements_to_int2d_kernel) so the kinematic mover
   lands in the same compact island as the sleeping bricks it's
   pressing against; have _phoenx_island_max_velocity_kernel read
   the kinematic's pose-derivative velocity so the score actually
   lifts above threshold.
 - solver_phoenx.step: run _kinematic_prepare_step BEFORE the
   sleeping pass so bodies.velocity reflects this step's target
   delta by the time the score kernel reads it.

Also extends the kapla_tower2 example with optional per-island
sleeping (ENABLE_SLEEPING + the share-vertex filter wiring) so the
end-to-end path is exercised in tree.

Regression coverage:
  test_sleeping.TestSleepingKinematicWake.test_moving_kinematic_wakes_sleeping_stack
Verified to fail before the fix and pass after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the share-vertex filter + per-shape AABB wiring into
``PhoenXWorld.attach_collision_pipeline``: one call replaces the
manual ``build_phoenx_share_vertex_filter_data`` + per-frame
``shape_aabb_lower / shape_aabb_upper`` arguments to ``step()``.

Adds ``PhoenXWorld.broad_phase_filter()`` returning the
(filter_func, filter_data_type) tuple to pass at CollisionPipeline
construction time -- the only sleeping detail that can't be hidden
behind the world (the pipeline must reserve a filter slot
up-front).

``step()`` now falls back to the cached AABB arrays when no
explicit ones are passed. ``wake_on_external_input()`` was already
a no-op when sleeping is disabled, so the kapla_tower2 example
collapses the entire per-frame ``if ENABLE_SLEEPING`` branch into a
uniform "apply pick force, wake, collide, step" sequence.

Example diff: -55 / +20 lines around the sleeping path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the kapla_tower2 cleanup: replaces the manual
``build_phoenx_share_vertex_filter_data`` + per-step
``shape_aabb_*`` args with a single ``world.attach_collision_pipeline``
call, and pulls the (filter_func, filter_data_type) tuple from
``PhoenXWorld.broad_phase_filter()`` at CollisionPipeline construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump SLEEP_COLOR_GAIN from 0.30 to 0.55 in both kapla_tower2 and
kapla_square_tower. The old 0.30 value pushed settled bricks
near-black, washing out the wood texture. 0.55 keeps the sleep
state clearly distinguishable from active without making it look
like a different material.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

None yet

Projects

None yet
