Skip to content

Conversation

blarson-b10
Copy link
Contributor

@blarson-b10 blarson-b10 commented Oct 2, 2025

Overview:

In order to snapshot the radix tree, we must convert the nodes back into RouterEvent structs which are the same structs that we process to build the tree.

To do this we must find the external hash used in the lookup table for the block; but this is done using a linear scan over all blocks for the worker. The result is that the algorithm is worst-case quadratic in the number of blocks.

To fix, I have changed how workers are stored in the radix block; the type is now a map from worker id to external sequence hash.

When testing with 2 workers on real traffic loads, snapshot times dropped from 16s -> 600ms.

Details:

Adds new fields, tests, and some additional logging related to snapshotting.

Where should the reviewer start?

indexer.rs has most of the changes.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • Performance

    • Significantly faster snapshotting and state dumps, reducing overhead and improving throughput.
  • Observability

    • Added detailed timing for purge-and-snapshot operations.
    • Logs now include downloaded byte counts for object fetches.
    • Clearer logs when initial state initialization is skipped.
  • Refactor

    • Streamlined initialization path to only download initial state when configured, improving startup clarity and predictability.

@blarson-b10 blarson-b10 requested a review from a team as a code owner October 2, 2025 16:08
Copy link

copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link

github-actions bot commented Oct 2, 2025

👋 Hi blarson-b10! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Oct 2, 2025
Copy link
Contributor

coderabbitai bot commented Oct 2, 2025

Walkthrough

Introduces a reverse-lookup table in RadixTree for block-to-external-hash mapping, updates construction, event application, cleanup, and snapshot dumping to use it. Adjusts subscriber snapshot initialization logic and adds timing/logs to purge-then-snapshot. Adds a download-size debug log in NATS transport after copying object data, before deserialization.

Changes

Cohort / File(s) Summary
KV Router — Indexer (RadixTree reverse lookup, snapshotting, cleanup)
lib/llm/src/kv_router/indexer.rs
Adds reverse_block_hash_lookup to RadixTree; initializes in new_with_frequency; populates on block creation in apply_event; cleans on worker removal and clear; dump_tree_as_events uses reverse map for external hashes; adds tracing; extends tests for snapshot and cleanup consistency.
KV Router — Subscriber (init path and snapshot logging)
lib/llm/src/kv_router/subscriber.rs
Refines initial state download condition to depend on router_snapshot_threshold; logs when initialization is skipped; in purge_then_snapshot, logs start, records start_time, and logs elapsed ms on success.
Runtime — NATS transport (download-size logging)
lib/runtime/src/transports/nats.rs
Logs the number of bytes copied from bucket/key before bincode deserialization; no functional control-flow change.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Sub as Subscriber
  participant RT as RadixTree (Indexer)
  note over Sub: Startup / Operation
  Sub->>Sub: Check router_snapshot_threshold
  alt threshold is Some
    Sub->>Sub: purge_then_snapshot() start_time=now
    Sub->>RT: dump_tree_as_events()
    RT->>RT: Use reverse_block_hash_lookup for external hashes
    RT-->>Sub: Events snapshot
    Sub->>Sub: Log success with elapsed ms
  else threshold is None
    Sub->>Sub: Log radix state init skipped
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I mapped each block to hashes bright,
A burrowed index, neat and tight.
The subscriber times the hop—how swift!
NATS counts bytes, a tidy gift.
Snapshots bloom, then bunnies cheer—
Logs like clover, crisp and clear. 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description adheres to the template by including the Overview, Details, and Where should the reviewer start sections, but the Related Issues section is present without any issue references or an explicit “None” statement, which leaves it incomplete. The template requires using an action keyword with an issue number or explicitly stating that there are no related issues. Because this required information is missing, the description does not fully satisfy the repository’s template. Please update the Related Issues section to either reference the appropriate GitHub issue(s) using the recommended action keywords or explicitly state “None” if there are no related issues.
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title Check ✅ Passed The title clearly and concisely summarizes the primary change by stating that snapshot performance is improved through a reverse lookup from block to external hash, which directly reflects the main implementation in the PR. It uses a common “perf:” prefix and specific wording to make the intent immediately clear to reviewers.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@blarson-b10 blarson-b10 changed the title feat(dynamo) Improve performance of snapshot using a reverse lookup from block -> external hash perf: Improve performance of snapshot using a reverse lookup from block -> external hash Oct 3, 2025
@github-actions github-actions bot added the perf label Oct 3, 2025
@PeaBrane
Copy link
Contributor

PeaBrane commented Oct 6, 2025

@blarson-b10 Maybe an easier and safer way to do this is to directly make the field
workers: HashMap<WorkerId, ExternalSequenceBlockHash>, in RadixBlock.
(this was actually the initial python router impl we had but somehow got lost in translation)

@blarson-b10
Copy link
Contributor Author

@blarson-b10 Maybe an easier and safer way to do this is to directly make the field workers: HashMap<WorkerId, ExternalSequenceBlockHash>, in RadixBlock. (this was actually the initial python router impl we had but somehow got lost in translation)

done.

@blarson-b10 blarson-b10 requested a review from PeaBrane October 6, 2025 22:46
@PeaBrane
Copy link
Contributor

PeaBrane commented Oct 7, 2025

@blarson-b10 Looks like the DCO check still needs to be resolved — you may need to sign your commits in this PR. Hopefully it’s a painless fix!

Signed-off-by: Brian Larson <[email protected]>
@blarson-b10 blarson-b10 force-pushed the blarson/improve_snapshot_perf branch from 2c80846 to 16d6cf6 Compare October 7, 2025 04:16
@blarson-b10
Copy link
Contributor Author

@blarson-b10 Looks like the DCO check still needs to be resolved — you may need to sign your commits in this PR. Hopefully it’s a painless fix!

hopefully fixed now

@PeaBrane
Copy link
Contributor

PeaBrane commented Oct 7, 2025

/ok to test 16d6cf6

@PeaBrane
Copy link
Contributor

PeaBrane commented Oct 7, 2025

/ok to test 7b4bd02

@PeaBrane PeaBrane enabled auto-merge (squash) October 7, 2025 19:46
@PeaBrane PeaBrane merged commit 1169427 into ai-dynamo:main Oct 7, 2025
25 of 26 checks passed
ptarasiewiczNV pushed a commit that referenced this pull request Oct 8, 2025
…ck -> external hash (#3370)

Signed-off-by: Brian Larson <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Co-authored-by: PeaBrane <[email protected]>
Signed-off-by: Piotr Tarasiewicz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contribution Pull request is from an external contributor perf size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants