Dev/tw3/box3d 5#1
Open
nvtw wants to merge 661 commits into
Co-authored-by: Philipp Reist <preist@nvidia.com>
Signed-off-by: JC <jumyungc@nvidia.com>
…#2694) Co-authored-by: Philipp Reist <66367163+preist-nvidia@users.noreply.github.com>
…oft_body to improve visibility (newton-physics#2729)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: JC <jumyungc@nvidia.com>
…-physics#2733) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
newton-physics#2732) Co-authored-by: Philipp Reist <66367163+preist-nvidia@users.noreply.github.com>
…ysics#2735) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes in `SolverPhoenX.update_contacts()` that together let `SensorContact` work with the phoenx solver:

1. Negate the assembled wrench. Phoenx's normal points shape0->shape1 so `lam_n*n + lam_t*t` is the impulse on shape1, but Newton's `Contacts.force` stores the force on shape0 (matching MuJoCo's `_convert_mjw_contacts_to_newton_kernel`). Without the negate the sensor reported -m*g on a settled body.
2. Honor the body-pair-grouping sort permutation. With compound bodies the ContactContainer is keyed by `sorted_k` while `Contacts.rigid_contact_shape0/1/normal` stay in newton order; without the perm reroute, `force[k]` is misaligned with the topology arrays and the sensor attributes forces to the wrong contact slots.

Add CUDA + graph-capture-only tests in `phoenx/tests/test_sensor_contact.py`:

- `TestContactImpulseToForceKernel` directly drives the readback kernel with a synthetic ContactContainer and a known non-identity sort_perm; the four cases pin sort_perm wiring, the has_perm=0 identity path, t2 reconstruction with non-unit idt, and tail-slot count gating. This is what catches the sort_perm regression -- integration tests can't, since real-world same-body-pair contacts give a sort_perm that's order-undefined within the group and the swapped values are physically symmetric.
- `TestSensorContactPhoenX` runs the full collide -> step -> update_contacts -> sensor.update pipeline inside a CUDA graph and asserts m*g, stack-weight propagation, per-counterpart force-matrix split with sum reconciliation, and normal/friction decomposition consistency, all at <=2% rel err.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
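The two fixes can be illustrated with a small host-side sketch. This is not the readback kernel: the function name, the argument layout, and the direction of `sort_perm` (sorted slot `k` maps to Newton index `sort_perm[k]`) are assumptions made for illustration only.

```python
def impulses_to_forces(lam_n, lam_t, normals, tangents, idt, sort_perm=None):
    """Convert per-contact impulses (sorted order) to forces on shape0 (Newton order).

    Hypothetical layout: slot k of the container holds the impulse magnitudes and
    the normal/tangent of one contact; sort_perm[k] is assumed to give the Newton
    index of that slot. sort_perm=None is the identity path.
    """
    forces = [(0.0, 0.0, 0.0)] * len(lam_n)
    for k in range(len(lam_n)):
        n, t = normals[k], tangents[k]
        # The normal points shape0 -> shape1, so this is the impulse on shape1.
        impulse_on_shape1 = tuple(lam_n[k] * n[i] + lam_t[k] * t[i] for i in range(3))
        # Reroute through the permutation so the result lines up with the
        # Newton-order topology arrays.
        dst = k if sort_perm is None else sort_perm[k]
        # Negate to get the wrench on shape0; idt (1/dt) turns impulse into force.
        forces[dst] = tuple(-idt * c for c in impulse_on_shape1)
    return forces
```

With a non-identity permutation the value computed in slot 0 lands at Newton index `sort_perm[0]`, which is the wiring the synthetic kernel test pins down.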
…eps)

Three new integration tests in `TestSensorContactPhoenX`:

- `test_tilted_normal_corner_force_balance`: sphere wedged in a corner (ground + vertical wall) with tilted gravity. Pins the total-force balance at -gravity, the per-counterpart split into ground and wall contributions, and -- critically -- `force_matrix_friction` against an analytical `f - (f.n) n` reference. A vertical-only scene can't distinguish cc.normal from rigid_contact_normal because both are +Z; this one fails immediately if those two normals disagree, since the friction projection picks up an O(weight) phantom tangential component.
- `test_articulated_robot_per_link_weight`: 2-link revolute robot flat on the ground. Each link must report exactly its own weight in vertical contact force. Catches any phoenx contact + ADBS joint-constraint coupling regression that would leak vertical impulse into the contact rows.
- `test_substep_steady_state_invariant`: same scene at substeps=1 vs substeps=8, settled. The readback divides by the substep dt; in steady state every substep carries the same impulse so the per-frame readout must equal m*g regardless of substep count. A regression in the dt scaling would surface as a per-substep multiplier on Fz.

Test count goes 9 -> 12; runtime stays at ~2.1 s on a warm cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
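The analytical reference `f - (f.n) n` used by the corner test is a plain vector projection. A minimal sketch follows; the helper name is hypothetical and the real test presumably works on Warp arrays rather than tuples.

```python
def split_normal_friction(f, n):
    """Split force f into its component along unit normal n and the tangential rest."""
    f_dot_n = sum(a * b for a, b in zip(f, n))
    normal_part = tuple(f_dot_n * c for c in n)
    # Friction reference: f - (f.n) n
    friction_part = tuple(a - b for a, b in zip(f, normal_part))
    return normal_part, friction_part
```

Projecting against the wrong normal leaves part of the normal load in `friction_part`, which is the phantom tangential component the tilted scene is designed to expose.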
Signed-off-by: JC <jumyungc@nvidia.com>
Signed-off-by: Alain Denzler <adenzler@nvidia.com> Signed-off-by: adenzler-nvidia <adenzler@nvidia.com>
Co-authored-by: JC-nvidia <116605903+jumyungc@users.noreply.github.com>
…ball` to improve visibility (newton-physics#2855)
Wires the existing UnionFindIslandBuilder into SolverPhoenX behind a new `sleeping_velocity_threshold` ctor arg. Zero (default) skips every allocation and per-step kernel; > 0 enables the full pipeline.

Per step (when enabled):

- Union each attached shape's world-frame AABB into a per-body AABB via 2D atomic min/max; finalize the diagonal length on the body.
- Build islands over the active interaction set (rigid bodies as nodes).
- Per-body score `|v| + 0.5 * diag * |omega|` atomic-maxed into the body's island slot, gated by an `if score > island_max[island]` early exit so the float atomic only fires on improving threads.
- Mark islands sleeping when their max score is below threshold, then propagate to a per-body `is_sleeping` flag on BodyContainer.
- Apply-forces-and-gravity kernel early-returns on sleeping bodies and marks them STATIC for the constraint solve.
- Element view rewritten: any slot pointing to a sleeping body becomes -1, so the partitioner adjacency count adds nothing and the constraint drops out of every colour bucket.
- Broad-phase filter (shared with the cloth share-vertex filter) gains a sleeping-aware drop for rigid-rigid pairs where both bodies sleep; sleeping-vs-awake pairs pass through so contacts can wake the island the next step.

All scratch is pre-allocated; per-step kernels read device-side bounds (`num_active_constraints[0]`, `num_sets[0]`), no `.numpy()` reads. Graph capture works end-to-end. Sentinel arrays keep the Warp ABI bound when sleeping is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
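As a rough serial illustration of the score and island-max steps (the actual pipeline runs as Warp kernels with float atomics; the function names and list-based layout below are assumptions):

```python
import math


def body_score(v, omega, diag):
    """Per-body sleep score |v| + 0.5 * diag * |omega| (diag = body AABB diagonal)."""
    speed = math.sqrt(sum(c * c for c in v))
    spin = math.sqrt(sum(c * c for c in omega))
    return speed + 0.5 * diag * spin


def island_max_scores(island_of_body, scores, num_islands):
    """Serial stand-in for the atomic max into each body's island slot."""
    island_max = [0.0] * num_islands
    for island, score in zip(island_of_body, scores):
        # Mirrors the early exit: only improving scores touch the slot.
        if score > island_max[island]:
            island_max[island] = score
    return island_max


def sleeping_flags(island_of_body, island_max, threshold):
    """A body sleeps when its island's max score is below the threshold."""
    return [island_max[island] < threshold for island in island_of_body]
```

The angular term is scaled by half the AABB diagonal so that a slowly spinning but large body still registers surface motion comparable to a linear velocity.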
Six disjoint 64-body clusters, each emitting ~470 mixed-arity interactions (pair / triplet / quad / 6-body) plus 2 high-degree hubs at 24 partners. Total ~2640 elements; in-test assertion guards the 2k minimum if the band sizes are ever tuned down. Builds the workload six times and byte-compares set_nr, set_sizes, set_sizes_compact, and num_islands against snapshot 0 -- catches any regression in the post-sort min-index path that would otherwise leak through smaller fixtures. Also asserts island ids stay strictly ordered by their min body id, the invariant the deterministic post-sort exists to enforce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bodies now require their island's max-velocity score to stay below sleeping_velocity_threshold for sleeping_frames_required consecutive frames (default 30, ~0.5 s at 60 Hz) before is_sleeping flips. Wake stays single-frame and island-wide: the moment any body in the island lifts the max above threshold, every body in the island resets its counter to 0 in the same kernel pass.

Trade-off vs. the prior single-frame logic: barely-settled stacks no longer thrash sleep/wake every frame, which kept the partitioner rebuilding the colour layout. Cost is one int32 array (counter) and one extra saturating-increment branch in the propagate kernel.

sleeping_frames_required=0 recovers the previous single-frame behavior; the saturating cap collapses to zero and the >= check fires on the first below-threshold frame.

Tests cover all four code paths through the new kernel:

- counter ticks up + saturates at the required count;
- counter blocks premature sleep while still climbing;
- counter resets to 0 in the same step where the island wakes (via a high state.body_qd injection);
- sleeping_frames_required=0 still sleeps on the first below-threshold frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
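The counter logic described above can be sketched on the host as follows. This is a serial approximation under assumed names, not the propagate kernel itself; in particular, treating a score exactly at the threshold as awake is an assumption.

```python
def update_sleep_counters(counters, island_of_body, island_max, threshold, frames_required):
    """Advance per-body sleep counters in place and return per-body sleeping flags."""
    sleeping = []
    for body, island in enumerate(island_of_body):
        below = island_max[island] < threshold
        if below:
            # Saturating increment: the counter never exceeds frames_required.
            counters[body] = min(counters[body] + 1, frames_required)
        else:
            # Island-wide wake: every body in the island resets in the same pass.
            counters[body] = 0
        sleeping.append(below and counters[body] >= frames_required)
    return sleeping
```

With `frames_required=0` the cap collapses to zero, so the `>=` check passes on the first below-threshold frame, matching the single-frame behaviour.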
Several issues conspired to make settled stacks explode the moment a body transitioned to ``is_sleeping``, and to ignore user-applied forces on sleeping islands.

* The island builder unioned dynamic bodies through the world anchor via shared ground contacts -- one big island covered every stack resting on the same plane. Filter out non-dynamic endpoints in ``_phoenx_copy_elements_to_int2d_kernel`` so the anchor cannot bridge separate dynamic islands.
* The broad-phase sleeping filter only dropped pairs where both shapes' bodies had ``is_sleeping``. Pairs involving the world anchor (or any static / kinematic body) and a sleeping body passed through. Extend ``phoenx_cloth_share_vertex_filter`` to treat any non-dynamic body as frozen, and add ``body_motion_type`` to the filter data.
* ``contact_iterate`` / ``contact_iterate_multi`` ran the positional-bias solve even when ``access_mode == STATIC`` -- the ``inverse_mass`` of a sleeping body is non-zero, so the bias impulse compounded over 8 substeps x 6 iterations into ~18 m/s in one frame. Add a backup early-out for frozen-vs-frozen pairs to cover the sleep-transition frame, where the broad-phase contact had already been ingested before ``is_sleeping`` flipped.
* ``_integrate_velocities_kernel`` advanced position for every dynamic body regardless of ``is_sleeping``, sliding the sleeping island at the residual sub-threshold velocity. Skip sleeping bodies in the position integration.
* External forces (e.g. picking) hit the early-return inside ``_phoenx_apply_forces_and_gravity_kernel`` and were silently discarded for sleeping bodies. Fold ``F * inv_mass * step_dt`` (and the torque analogue) into the per-island sleep score so any meaningful user-applied wrench wakes the whole island the next step. Gravity is applied separately and is *not* counted, so awake stacks stay awake while settled stacks under gravity alone still sleep.
Adds ``test_external_force_wakes_sleeping_island``; the existing nine sleeping tests continue to pass.
Adds a 40-layer Kapla-style square tower demo (single tower or grid_side x grid_side tiles in one shared world) that opts into PhoenX island sleeping. Two per-shape colour buffers (active + 30%-dim) plus a per-frame GPU kernel pick which colour to write into ``model.shape_color`` based on each body's ``is_sleeping`` flag; the existing viewer pipeline repacks and uploads on its own, so no shader changes are needed and zero Python work runs in the hot loop. The example wires the PhoenX sleeping broad-phase filter into the bare ``newton.CollisionPipeline`` it builds; without that the ground-vs-sleeping pairs would survive the broad phase, and the substep solve would re-inject energy into the freshly sleeping island.
The sleeping-aware broad-phase filter runs inside ``model.collide()``,
which the host calls *before* ``solver.step()``. By the time the
per-step sleeping pass inside ``world.step()`` lifts ``is_sleeping``
on a body that picking just pushed, the filter has already dropped
every plank-vs-plank and plank-vs-ground pair on the wake frame --
the picked plank free-accelerates against an empty stack and the
column folds. ``set_nr`` from previous frames is no help either: as
soon as the stack falls asleep the broad-phase filter collapses
every interaction, the island builder sees no edges and ``set_nr``
decays to per-body singletons, so a naive "has external force?
wake my island" pass only wakes the picked body.
* Snapshot ``set_nr`` per body at the end of every
``_run_sleeping_pass`` for bodies that are still awake. Bodies
that just transitioned to sleeping skip the write, so their entry
retains the previous frame's awake-state id -- the one built
while the stack still had its full contact graph.
* Add ``PhoenXWorld.wake_on_external_input``: fan-in per body that
carries non-zero ``bodies.force`` or ``bodies.torque`` into a
per-island flag (indexed by the last-awake snapshot), then clear
``is_sleeping`` for every body in any flagged island. Cheap, fully
graph-capture safe, no-op when sleeping is disabled.
* Add ``SolverPhoenX.wake_on_external_input(state)``: imports
``state.body_f`` into PhoenX's force accumulators first, then
drives the world pass. Host loop becomes::
      state.body_f.assign(...)              # picking / wrenches
      solver.wake_on_external_input(state)  # propagate wake
      model.collide(state, contacts)        # filter keeps pairs
      solver.step(state, state_out, ...)
* Wire the kapla example through ``world.wake_on_external_input()``
between ``picking.apply_force()`` and ``model.collide()``.
With the pre-collide pass installed, picking the bottom plank of a
40-layer settled tower preserves every contact on the wake frame
(1920 vs 0 before), |v|max stays at ~0.02 m/s, z_max doesn't drift
through 30 frames of dragging. Adds
``test_external_force_wakes_full_stack_via_pre_collide_pass``; all
eleven sleeping tests pass.
Replace bodies.is_sleeping + last_awake_set_nr with one int per body:
bodies.island_root. -1 = awake; otherwise the persistent root body id
(lowest body id in the island when it slept). One field, half the
memory, and the wake mechanism keys off a stable body id instead of
the volatile per-frame compact set_nr.
Per step, the sleeping pass now:
1. self-wake fan + apply for sleeping bodies whose velocity / force
got externally injected since the last step (host-side state
mutation that bypasses wake_on_external_input);
2. detect_active_islands -- a sleeping island is "active" iff one
of its members shares a constraint with an awake dynamic body;
3. inject_chain_edges -- for every body in an active island, emit
an artificial (body, island_root) edge into a second union-find
interaction array. Inactive sleeping islands stay filtered out
for free;
4. union-find runs over (real elements + chain edges) in one pass;
5. propagate: a sleeping body in an awake compact island clears
island_root (the chain edges + awake bridge merged them into
one live island, so propagate wakes the whole group atomically
via the natural island_is_sleeping = 0 path -- no separate
wake-on-contact fanin needed);
6. awake bodies whose island has been below threshold for
sleeping_frames_required frames stamp island_root with the
compact island's lowest awake body id.
Fixes the flood-fill where waking one body of a sleeping group
propagated body-by-body through fresh contacts over multiple frames:
the chain edges pull the entire sleeping island into the live compact
island the moment any external body touches it, so all members wake
on the same step.
All sleep-aware kernels live in a new newton/_src/solvers/phoenx/
sleeping_kernels.py module, kept separate from the generic
UnionFindIslandBuilder (which gains a single optional extra_edges
parameter for the second unite pass).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
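The chain-edge idea in steps 2 to 5 can be sketched with a serial union-find. This is a simplified illustration under stated assumptions: all bodies are dynamic, any compact island that contains an awake body counts as awake (the real pass also consults the velocity score), and the function names are hypothetical.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge two sets, keeping the lowest body id as the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def wake_via_chain_edges(island_root, real_edges):
    """island_root[b] == -1 means awake; otherwise it is the persistent root body id."""
    n = len(island_root)
    # Step 2: a sleeping island is active if a member shares an edge with an awake body.
    active = set()
    for a, b in real_edges:
        for sleeper, other in ((a, b), (b, a)):
            if island_root[sleeper] != -1 and island_root[other] == -1:
                active.add(island_root[sleeper])
    # Step 3: every body of an active island gets an artificial (body, root) edge.
    chain_edges = [(b, island_root[b]) for b in range(n) if island_root[b] in active]
    # Step 4: one union-find pass over real elements plus chain edges.
    parent = list(range(n))
    for a, b in list(real_edges) + chain_edges:
        union(parent, a, b)
    # Step 5: sleeping bodies that ended up in an awake compact island clear island_root.
    awake_sets = {find(parent, b) for b in range(n) if island_root[b] == -1}
    return [-1 if find(parent, b) in awake_sets else island_root[b] for b in range(n)]
```

Because sleeping bodies have no real edges among themselves once the broad-phase filter drops their pairs, the chain edges are what reconnect the island so that a single touching body wakes every member in the same step.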
Graph-coloring warm-start kept the same constraint locked in the same colour every frame. PGS with finite iterations converges to a biased fixed point under any fixed coloring; over hundreds of frames the bias accumulates and on tall stacks like the Kapla tower causes catastrophic drift (max brick displacement ~3.3 m after 1000 frames). Cold-start coloring (warm-start disabled) varies frame to frame - contact counts feed the high bits of the MIS priority, and they shift slightly as bodies settle, so the coloring drifts with them and the bias averages out (max drift 0.15 m).

The fix preserves the warm-start fast path while periodically breaking the lock-in:

* set_warm_start_invalidate_period(N): zero the cache every Nth build_csr so the next seed_warm_start_kernel finds an empty cache and greedy MIS re-derives the coloring from scratch.
* set_warm_start_rotate_skip(True): each step round-robin one cached colour and skip seeding its entries, forcing MIS to re-pick those constraints. Over num_colors steps every colour cycles through a re-MIS pass at ~1/num_colors of the cold-start cost.

Combined as period=4 + rotate_skip on the Kapla tower probe: max drift 0.19 m at 1000 frames vs 3.33 m baseline, FPS unchanged at 110 (pure cold-start drops to 99). Both knobs are exposed via PhoenXWorld kwargs (warm_start_invalidate_period, warm_start_rotate_skip_color) and default to off so existing scenes see no behavioural change.

Also adds a sweep_direction device flag and a symmetric_color_sweep opt-in that alternates forward/reverse colour-iteration order each sweep (the user's original hypothesis). It only swaps edge colours - middle colours stay near the middle of the sweep - so it's not a sufficient drift fix on its own, but it's kept as a building block for future PGS scheduling experiments. The plumbing (sweep_direction array, colour-range helpers returning the mapped colour index, the extra kernel parameter on the single-world head + fused-tail kernels) is a no-op when the flag is off and graph-capture safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``capture_while_greedy_coloring`` to ``PhoenXWorld``. When set, the greedy MIS build replaces its fixed ``MAX_GREEDY_OUTER_ITERS`` host-side loop with ``wp.capture_while(num_remaining, body)``. The captured graph exits as soon as ``num_remaining`` hits 0 instead of running the body to the cap and relying on per-thread early-exit.

Profile data on Kapla (10620 bodies, 67k constraint columns, single-world layout, --num-frames=100):

* Warm-start mode, fixed loop: 12,800 partitioner kernel launches/100 frames = 0.21 ms/frame (all are post-validation no-ops since the cache covers the full constraint set).
* Warm-start mode, capture_while: 880 launches = 0.04 ms/frame -- **18x fewer launches, ~2% step-time saving**. The watcher adds ~210 ``set_conditional_if_handle_kernel`` launches/100 frames, noise-level.

Cold-start mode (no warm-start cache available) sees a smaller benefit: ~36% fewer launches (12.8k -> 8.2k), but those launches were already doing real MIS work so the time savings are smaller (~76 us/frame).

The flag is opt-in for compatibility with the existing fixed-loop performance contract (PERF_NOTES documents the original switch from ``capture_while`` to fixed loop as a step-time win on contended scenes). The kapla drift probe enables it by default and gains ~3 fps end-to-end on top of the warm-start lock-in fix.

The legacy fixed-loop path is preserved unchanged for scenes where the watcher overhead would dominate (e.g. extremely small graphs that converge in 1 iteration, or contexts where capture_while's per-iteration conditional graph node is expensive relative to the saved per-thread no-op launches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PhoenX is experimental; we don't need backward-compat for these solver defaults. Promote the three knobs that have no measurable drawback and clear upside on the contact-heavy rigid scenes we benchmark:

* ``warm_start_invalidate_period`` default 0 -> 4. Triggers a full re-color every 4 steps so the warm-start cache can't lock in a PGS-biasing coloring forever.
* ``warm_start_rotate_skip_color`` default False -> True. Each step skips re-seeding one cached colour (round-robin) so MIS re-derives ~1/num_colors of the assignments while the rest stay warm-started. Continuous low-cost stir on top of the periodic rebuild above.
* ``capture_while_greedy_coloring`` default False -> True. Greedy MIS outer loop uses ``wp.capture_while(num_remaining, ...)`` instead of the fixed ``MAX_GREEDY_OUTER_ITERS`` host unroll. On the warm-start fast path the captured graph exits after ~1 body iteration instead of running to the cap; 18x fewer MIS launches per build on Kapla.

Kapla 1000-frame drift probe with new defaults: max drift 0.175 m (vs 3.33 m with all three off), full warm-start FPS retained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Çatalyürek-style speculative coloring algorithm alongside the existing JP-MIS path. Three phases per round:

1. **Pick** -- every uncoloured constraint picks the smallest colour not used by its already-coloured neighbours (full forbidden-mask scan, no priority gating).
2. **Validate** -- each uncoloured constraint scans its uncoloured neighbours; aborts if any with strictly higher priority picked the same tentative colour. Reads-only on ``color_tags`` so no race with concurrent commits.
3. **Commit** -- non-aborted constraints stamp ``color_tags`` and decrement ``num_remaining``. Run as a separate launch to keep phase 2 race-free.

Determinism: same fixed priority permutation as JP-MIS, plus the 3-phase split that eliminates the read/write race a 2-phase pick+commit would have. Same inputs -> same colouring across runs and hardware.

Motivation: JP-MIS commits at most one colour per round (local-maxima only at the smallest free colour), so a 67k-constraint graph drains in ~80 inner launches. Speculative commits at multiple colours per round (any uncoloured constraint whose tentative colour isn't claimed by a higher-priority neighbour), bringing the round count down to ~32. The per-round cost is higher (3 kernels with neighbour scans on both ``color_tags`` and ``tentative_color``), so net wall-clock on Kapla is comparable to MIS+capture_while (~1.37 vs 1.65 ms per 100 frames cold-start in nsys).

Where speculative wins:

* Graphs that exceed the JP-MIS outer-iter cap on round count.
* Future tuning of the validate kernel (shared-mem caching of ``tentative_color``, better short-circuit on small forbidden masks).

Default OFF because it doesn't out-pace the existing greedy MIS on the scenes we benchmark. Wire-up via ``PhoenXWorld(speculative_coloring=True)`` or ``partitioner.set_speculative_coloring(True)``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
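The pick / validate / commit rounds can be simulated serially to see why the result is deterministic. The sketch below is a plain-Python model under assumed inputs (an adjacency list per constraint and a tie-free priority permutation); it omits the 64-colour forbidden mask and the JP fallback.

```python
def speculative_coloring(adjacency, priority):
    """Return (colour per constraint, number of rounds) for a small constraint graph."""
    n = len(adjacency)
    color = [-1] * n
    remaining = n
    rounds = 0
    while remaining > 0:
        rounds += 1
        # Pick: smallest colour not used by already-coloured neighbours.
        tentative = [-1] * n
        for c in range(n):
            if color[c] == -1:
                forbidden = {color[nb] for nb in adjacency[c] if color[nb] != -1}
                pick = 0
                while pick in forbidden:
                    pick += 1
                tentative[c] = pick
        # Validate: abort if a higher-priority uncoloured neighbour picked the same colour.
        aborted = [False] * n
        for c in range(n):
            if color[c] == -1:
                for nb in adjacency[c]:
                    if (color[nb] == -1 and priority[nb] > priority[c]
                            and tentative[nb] == tentative[c]):
                        aborted[c] = True
                        break
        # Commit: survivors stamp their colour and decrement the remaining count.
        for c in range(n):
            if color[c] == -1 and not aborted[c]:
                color[c] = tentative[c]
                remaining -= 1
    return color, rounds
```

Each round makes progress because the highest-priority uncoloured constraint can never be aborted, and two neighbours cannot commit the same colour in one round because the lower-priority one always aborts.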
Record what we learned in the warm-start-coloring drift investigation so future work doesn't re-litigate the dead ends: * Warm-start cache stir (period + rotate-skip) as the actual fix, with measured drift / cost trade-offs. * ``wp.capture_while`` win on the warm-start fast path (18x fewer MIS launches). * Speculative coloring is implemented and deterministic but doesn't out-pace JP-MIS+capture_while on the scenes we benchmark; kept as an opt-in building block. Also captures the negative results (symmetric / cyclic colour sweep, per-thread early-exit, lower MAX_GREEDY_OUTER_ITERS) so future tuners don't repeat them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes:
1. **Speculative coloring overflow fix.** When the int64 forbidden
mask saturates (>= GREEDY_MAX_COLORS = 64 distinct neighbour
colours), pick falls back to colour 63. If that colour is
already taken by a coloured neighbour, validate aborts on the
coloured-neighbour conflict check and ``num_remaining`` never
decreases -- ``capture_while`` spins forever.
Fix: a 1-thread ``speculative_overflow_exit_kernel`` zeroes
``num_remaining`` when ``overflow_flag`` is set, surfacing the
saturation to the JP fallback wrapper. Also extends the validate
kernel to check coloured-neighbour colour conflicts (the missing
case that let two saturated constraints sharing a body both
commit to colour 63).
Speculative coloring now produces valid colourings on dense
synthetic graphs that exceed 64 colours, same JP fallback
contract as JP-MIS.
2. **CUDA + graph capture regression tests.** New file
``test_graph_coloring_speculative.py`` exercises the new code
paths via ``wp.ScopedCapture`` + ``wp.capture_launch`` (matches
the project convention -- the production code path is captured
graphs, not eager launches). Tests:
* ``test_speculative_valid_under_capture`` -- captured
speculative build produces a valid colouring across replays.
* ``test_speculative_deterministic_under_capture`` -- same
inputs, same captured colouring across runs.
* ``test_speculative_warm_start_under_capture`` -- speculative
+ warm-start cache produces valid colourings on cached replays.
* ``test_rotate_skip_valid_under_capture`` -- rotate-skip
re-MIS cycle stays valid across 8 replays.
* ``test_invalidate_period_valid_under_capture`` -- periodic
full-cache invalidate stays valid across 7 replays straddling
two invalidation boundaries.
Full suite: 5 tests, 6 s with warm Warp cache, 60 s on first
compile -- within the 1-minute test policy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PhoenX is experimental and the user wants the best-performing coloring as the default. Speculative beats JP-MIS on the cold-start path measured by nsys (1.37 vs 1.65 ms per 100 frames on Kapla, ~17 % faster) and matches it on the warm-start fast path. Coloring quality (number of colours) is within 1-2 of JP-MIS so PGS sweep cost is unaffected. Determinism and correctness are equivalent: same fixed priority permutation, same JP fallback on bitmask overflow, plus the overflow-exit kernel that prevents capture_while from spinning when the graph's chromatic number exceeds GREEDY_MAX_COLORS. End-to-end FPS on Kapla is unchanged (110 fps either way) because cold-start coloring is only 1/4 of frames under the default cache stir. Scenes with more cold-start frames (e.g. high contact churn) will see proportionally larger wins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalises ``set_warm_start_rotate_skip`` from a single cached colour per step to a configurable range of consecutive colours. ``width=1`` matches the previous single-colour skip (default); ``width=N`` skips ``N`` consecutive colours per step, rotating round-robin via the step counter, so all ``num_colors`` colours cycle through a re-MIS pass every ``num_colors / width`` steps. Cost scales linearly with ``width`` (each skipped colour is re-MIS work this step). On the Kapla drift probe ``width=4`` + ``period=4`` matches ``width=1`` + ``period=4`` within drift noise (0.20 m vs 0.23 m at 1000 frames, both ~111 fps). The wider stir is kept as a knob for scenes where ``width=1`` rotates too slowly through the colour set. Adds ``warm_start_rotate_skip_width`` to ``PhoenXWorld`` (default 1). Replaces the single-int ``skip_color_plus_one`` cache entry with start/end inclusive range arrays in the partitioner; the seed kernel falls back to a no-op when ``start > end``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
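One way to picture the rotating window is the sketch below. The block-wise rotation, the function name, and the `(1, 0)` empty-range sentinel are assumptions; the commit only states that the range is inclusive, rotates with the step counter, and is a no-op when `start > end`.

```python
def rotate_skip_range(step, num_colors, width):
    """Return the inclusive (start, end) colour range to skip seeding at this step."""
    if width <= 0 or num_colors <= 0:
        return 1, 0  # empty range (start > end): the seed kernel skips nothing
    # Number of windows needed to cover every colour once (ceiling division).
    num_windows = (num_colors + width - 1) // width
    window = step % num_windows
    start = window * width
    end = min(start + width - 1, num_colors - 1)
    return start, end
```

Under this scheme `width=1` reproduces the single-colour round-robin, and a full cycle over all colours takes about `num_colors / width` steps.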
The 3-phase pipeline (pick + validate + commit) was conservatively split to give validate a clean ``color_tags`` snapshot, but the race it was protecting against is actually benign: When two constraints A, B share a body and both pick colour ``c``: priorities are a permutation (no ties), so exactly one wins. The loser aborts regardless of whether it reads the winner as "uncoloured but higher priority" (pre-commit) or "coloured at my tentative" (post-commit) -- the abort outcome is identical, and the winner only writes ``color_tags`` AFTER all of its own neighbour checks pass. Fusing validate and commit cuts one launch per speculative round (~32 launches/frame on Kapla cold-start, ~14 launches/frame on warm-start mode). Per-round wall-clock now: pick (18us) + validate_commit (25us) + overflow_exit (0.6us) = ~44us, vs the 3-phase ~46us. Tests in ``test_graph_coloring_speculative`` still pass under ``wp.ScopedCapture`` -- the fused kernel keeps the same deterministic priority-tiebreak semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lower the default per-substep PGS schedule (substeps 8 -> 6, solver_iterations 10 -> 8) -- with the new warm-start cache stir + speculative coloring the scene settles at this shorter schedule and runs ~25 % faster per frame without visible quality loss. Bump default ``--grid-side`` 5 -> 20 so the benchmark scene exercises the multi-instance contact-heavy regime that the recent partitioner work targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The body-locality sort that runs once per ``build_csr`` used to do two stable radix sorts back-to-back:

1. sort by ``eid`` (deterministic tie-break for the next sort)
2. stable sort by ``(colour << 32) | body_min``

The two-pass design was for tie-breaking determinism: the stable sort in pass 2 preserved pass 1's eid order whenever ``(colour, body_min)`` tied. The same final ordering can be produced with a single sort by packing all three fields into one int64 key:

    bits 0..24  : eid      (25 bits -> 32M constraints)
    bits 25..48 : body_min (24 bits -> 16M bodies)
    bits 49..63 : colour   (15 bits -> 32K colours)

Lexicographic sort on this packed key is equivalent to ``(colour, body_min, eid)`` ordering, which matches what the two-pass produced.

Profile on ``example_kapla_square_tower --grid-side=20`` (100 frames):

    DeviceRadixSortOnesweep  380 -> 282 M ns (-26 %)
    Locality sort kernels    2 writeback + 2 key-build -> 1 + 1

Net step-time saving ~1 ms/frame on this scene (~4 % of total GPU time at this scale).

Tests pass under ``wp.ScopedCapture`` -- the merged sort produces the same colouring as the two-pass on the existing stress workload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
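The equivalence between the two-pass sort and the packed single-pass key can be checked on the host with ordinary Python sorting. The helper names are illustrative; the bit widths follow the layout in the commit message.

```python
EID_BITS = 25   # bits 0..24
BODY_BITS = 24  # bits 25..48
COLOUR_BITS = 15  # bits 49..63


def pack_key(colour, body_min, eid):
    """Pack (colour, body_min, eid) so integer order equals lexicographic order."""
    assert 0 <= eid < (1 << EID_BITS)
    assert 0 <= body_min < (1 << BODY_BITS)
    assert 0 <= colour < (1 << COLOUR_BITS)
    return (colour << (EID_BITS + BODY_BITS)) | (body_min << EID_BITS) | eid


def two_pass_order(records):
    """Old scheme: sort by eid, then stable sort by (colour << 32) | body_min."""
    by_eid = sorted(records, key=lambda r: r[2])
    return sorted(by_eid, key=lambda r: (r[0] << 32) | r[1])


def single_pass_order(records):
    """New scheme: one sort on the packed key."""
    return sorted(records, key=lambda r: pack_key(*r))
```

Python's `sorted` is stable, so `two_pass_order` models the stable radix sort pair, and comparing it with `single_pass_order` on records that tie on `(colour, body_min)` exercises the eid tie-break.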
Cleanup pass over the graph-coloring + solver modules. Net -534 lines, no behaviour change (48-test sweep passes).

* Drop the legacy 2-pass body-locality kernels (``_locality_eid_keys_kernel``, ``_locality_compute_keys_kernel``) -- subsumed by ``_locality_combined_keys_kernel`` in commit ``d8ad6a64``. ``luby_fixed`` now also uses the single-pass sort.
* Drop the 3-phase speculative kernels (``speculative_validate_kernel``, ``speculative_commit_kernel``) -- subsumed by ``speculative_validate_commit_kernel`` in commit ``660944ab``.
* Shorten docstrings and inline comments in ``graph_coloring_common.py``, ``graph_coloring_incremental.py``, ``warm_start.py``, ``luby_fixed.py``, ``solver_phoenx.py``, ``solver_phoenx_kernels.py``. Kept the *why* (correctness rationale, capture safety, race analysis) and dropped the *narrative* (background prose, restated invariants, version history).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Net -114 lines. No behaviour change. Dispatcher modules and the ``SolverPhoenX`` constructor had multi-paragraph docstrings that restated invariants already covered in the protocol / PhoenXWorld docs. Kept the load-bearing rationale (race analysis, fallback contract, capture safety) and dropped the narrative. Also exports ``warm_start_mark_boundaries_kernel`` from the warm_start module ``__all__`` (it's used by the partitioner but was missing from the public surface). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file was a personal drift probe kept untracked on purpose (the user uses it for git-bisect tests across the partitioner work). It was swept into a cleanup commit by ``git add <dir>``; removing it back to its untracked working-tree state.
Fix the kapla_tower2 camera-collider regression where a kinematic body (e.g. a fly-through camera collider) sweeps through a sleeping island without waking anything. The wake required four overlapping fixes -- each on its own left the island re-marked as sleeping every frame:

- cloth_collision: treat KINEMATIC bodies as non-frozen in the sleep-aware broad-phase filter so kinematic-vs-sleeping pairs survive the SAP pass and produce contacts.
- solver_phoenx_kernels: stop collapsing KINEMATIC bodies to -1 in _constraints_to_elements_kernel (inverse_mass == 0 but motion_type != STATIC) so the kinematic body appears as an adjacent node in the contact's interaction element.
- sleeping_kernels: keep KINEMATIC bodies in the union-find input (_phoenx_copy_elements_to_int2d_kernel) so the kinematic mover lands in the same compact island as the sleeping bricks it's pressing against; have _phoenx_island_max_velocity_kernel read the kinematic's pose-derivative velocity so the score actually lifts above threshold.
- solver_phoenx.step: run _kinematic_prepare_step BEFORE the sleeping pass so bodies.velocity reflects this step's target delta by the time the score kernel reads it.

Also extends the kapla_tower2 example with optional per-island sleeping (ENABLE_SLEEPING + the share-vertex filter wiring) so the end-to-end path is exercised in tree.

Regression coverage: test_sleeping.TestSleepingKinematicWake.test_moving_kinematic_wakes_sleeping_stack

Verified to fail before the fix and pass after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the share-vertex filter + per-shape AABB wiring into ``PhoenXWorld.attach_collision_pipeline``: one call replaces the manual ``build_phoenx_share_vertex_filter_data`` + per-frame ``shape_aabb_lower / shape_aabb_upper`` arguments to ``step()``. Adds ``PhoenXWorld.broad_phase_filter()`` returning the (filter_func, filter_data_type) tuple to pass at CollisionPipeline construction time -- the only sleeping detail that can't be hidden behind the world (the pipeline must reserve a filter slot up-front). ``step()`` now falls back to the cached AABB arrays when no explicit ones are passed. ``wake_on_external_input()`` was already a no-op when sleeping is disabled, so the kapla_tower2 example collapses the entire per-frame ``if ENABLE_SLEEPING`` branch into a uniform "apply pick force, wake, collide, step" sequence. Example diff: -55 / +20 lines around the sleeping path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the kapla_tower2 cleanup: replaces the manual ``build_phoenx_share_vertex_filter_data`` + per-step ``shape_aabb_*`` args with a single ``world.attach_collision_pipeline`` call, and pulls the (filter_func, filter_data_type) tuple from ``PhoenXWorld.broad_phase_filter()`` at CollisionPipeline construction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump SLEEP_COLOR_GAIN from 0.30 to 0.55 in both kapla_tower2 and kapla_square_tower. The old 0.30 value pushed settled bricks near-black, washing out the wood texture. 0.55 keeps the sleep state clearly distinguishable from active without making it look like a different material. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Description
Checklist
CHANGELOG.md has been updated (if user-facing change)
Test plan
Bug fix
Steps to reproduce:
Minimal reproduction:
New feature / API change