Contributor

@calvinrzachman calvinrzachman commented Jan 2, 2026

Change Description

This PR builds on the underlying storage primitives provided by #10049 to make the SendOnion RPC fully idempotent, allowing clients to safely retry requests after network failures.

To achieve this, the SendOnion handler now follows a "Register -> Validate -> Act -> Roll-back" pattern:

  1. Register: The handler's first action is to call InitAttempt. This "write ahead" style approach creates a durable record of intent to dispatch that serves as the idempotence anchor, ensuring any retries for the same attempt_id are correctly identified as duplicates. NOTE: This is analogous to the ControlTower.InitPayment method used by routerrpc which ensures the router doesn't launch two parallel payment lifecycle managers for the same payment hash.

  2. Validate & Act: Only after the intent to dispatch is durably written does the handler proceed with request validation and the actual HTLC dispatch via the Switch's familiar SendHTLC method.

  3. Transition on Failure (rollback): The process of registering and then acting is not naturally atomic. To close this "atomicity gap," we perform cleanup synchronously as part of error handling to prevent attempts from being "orphaned" in an initialized but not actually dispatched state. If any validation or dispatch error occurs after InitAttempt, we transition the attempt's state from PENDING to FAILED by calling FailPendingAttempt. This ensures the operation is atomic from the client's perspective and helps prevent indefinite hangs on subsequent TrackOnion calls.
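
As a rough illustration of the three steps above (not the actual lnd code), the flow can be sketched with an in-memory store. The `attemptStore` type, its fields, the `sendOnion` helper, and the `dispatch` callback are hypothetical stand-ins; only the names InitAttempt and FailPendingAttempt come from this PR, and the real store is durable rather than a map. The sketch also assumes that a retry of an attempt that was rolled back to FAILED is still reported as a duplicate.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// attemptState is a hypothetical stand-in for the attempt states named above.
type attemptState int

const (
	statePending attemptState = iota
	stateFailed
)

var errDuplicate = errors.New("duplicate attempt")

// attemptStore is a hypothetical in-memory stand-in for the durable store.
type attemptStore struct {
	mu       sync.Mutex
	attempts map[uint64]attemptState
}

func newAttemptStore() *attemptStore {
	return &attemptStore{attempts: make(map[uint64]attemptState)}
}

// InitAttempt registers the attempt as PENDING, or reports a duplicate if the
// ID has been seen before.
func (s *attemptStore) InitAttempt(id uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.attempts[id]; ok {
		return errDuplicate
	}
	s.attempts[id] = statePending
	return nil
}

// FailPendingAttempt moves a PENDING attempt to FAILED.
func (s *attemptStore) FailPendingAttempt(id uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if st, ok := s.attempts[id]; ok && st == statePending {
		s.attempts[id] = stateFailed
	}
}

// State returns the recorded state of an attempt, if any.
func (s *attemptStore) State(id uint64) (attemptState, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.attempts[id]
	return st, ok
}

// sendOnion follows the Register -> Validate -> Act -> Roll-back order.
func sendOnion(s *attemptStore, id uint64, onion []byte,
	dispatch func([]byte) error) error {

	// 1. Register: record the intent before doing anything else.
	if err := s.InitAttempt(id); err != nil {
		return err
	}

	// 2. Validate: any failure from here on must roll the attempt back.
	if len(onion) == 0 {
		s.FailPendingAttempt(id)
		return errors.New("empty onion blob")
	}

	// 3. Act, rolling back synchronously if the dispatch fails.
	if err := dispatch(onion); err != nil {
		s.FailPendingAttempt(id)
		return fmt.Errorf("dispatch failed: %w", err)
	}
	return nil
}

func main() {
	store := newAttemptStore()
	ok := func([]byte) error { return nil }

	fmt.Println(sendOnion(store, 1, []byte{0x01}, ok)) // <nil>
	fmt.Println(sendOnion(store, 1, []byte{0x01}, ok)) // duplicate attempt
	fmt.Println(sendOnion(store, 2, nil, ok))          // empty onion blob
}
```

Because the rollback happens in the same call that detected the error, attempt 2 never stays in the PENDING state after the handler returns.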

Client Error Contract

This new implementation establishes an error contract for clients, who must handle four distinct categories of outcomes:

  1. SUCCESS:
    • Signal: Success=true in the response body.
    • Meaning: The HTLC was successfully dispatched and is definitively in-flight.
    • Action: Proceed to track the payment's final result via TrackOnion.
  2. DUPLICATE:
    • Signal: ErrorCode_DUPLICATE_HTLC in the response body.
    • Meaning: Not a failure. A definitive acknowledgment that a request with the same attempt_id was already successfully processed.
    • Action: Treat as a success for the attempt and proceed directly to tracking.
  3. AMBIGUOUS FAILURE:
    • Signal: A gRPC status error of codes.Unavailable.
    • Meaning: The state of the HTLC is unknown because the server failed during the critical InitAttempt write responsible for detecting duplicates.
    • Action: The client MUST retry the exact same request to resolve the ambiguity. Moving on to a new attempt_id risks a duplicate payment.
  4. DEFINITIVE FAILURE:
    • Signal: Any other error (e.g., codes.InvalidArgument, codes.FailedPrecondition, validation failures).
    • Meaning: A guarantee that the HTLC was not and will not be dispatched.
    • Action: Fail the attempt and potentially retry the payment via a new route and a new attempt_id.
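
A client might map these four categories onto a single classification step, sketched below. To keep the example self-contained, `rpcCode`, `sendResult`, `outcome`, and `classify` are hypothetical names: `rpcCode` stands in for the gRPC status code of the call and `sendResult.DuplicateHTLC` stands in for ErrorCode_DUPLICATE_HTLC in the response body. A real client would inspect the actual gRPC status and response message instead.

```go
package main

import "fmt"

// rpcCode is a hypothetical stand-in for the gRPC status code of the call.
type rpcCode int

const (
	codeOK rpcCode = iota
	codeUnavailable
	codeInvalidArgument
	codeFailedPrecondition
)

// sendResult is a hypothetical stand-in for the SendOnion response body.
type sendResult struct {
	Success       bool
	DuplicateHTLC bool
}

// outcome is one of the four categories of the client error contract.
type outcome int

const (
	outcomeSuccess outcome = iota
	outcomeDuplicate
	outcomeAmbiguous
	outcomeDefinitiveFailure
)

func (o outcome) String() string {
	switch o {
	case outcomeSuccess:
		return "SUCCESS"
	case outcomeDuplicate:
		return "DUPLICATE"
	case outcomeAmbiguous:
		return "AMBIGUOUS FAILURE"
	default:
		return "DEFINITIVE FAILURE"
	}
}

// classify maps a call result onto the contract described above.
func classify(code rpcCode, res sendResult) outcome {
	switch {
	case code == codeUnavailable:
		// State unknown: retry the exact same request.
		return outcomeAmbiguous
	case code != codeOK:
		// Any other status error: the HTLC was not dispatched.
		return outcomeDefinitiveFailure
	case res.DuplicateHTLC:
		// Already processed: treat as success and start tracking.
		return outcomeDuplicate
	case res.Success:
		return outcomeSuccess
	default:
		// The response body carried some other error.
		return outcomeDefinitiveFailure
	}
}

func main() {
	fmt.Println(classify(codeOK, sendResult{Success: true}))       // SUCCESS
	fmt.Println(classify(codeOK, sendResult{DuplicateHTLC: true})) // DUPLICATE
	fmt.Println(classify(codeUnavailable, sendResult{}))           // AMBIGUOUS FAILURE
	fmt.Println(classify(codeInvalidArgument, sendResult{}))       // DEFINITIVE FAILURE
}
```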

Overall, the updated implementation provides several critical benefits for SendOnion clients:

  1. Guaranteed At-Most-Once Processing: The InitAttempt gate ensures that the server will never process the same attempt_id more than once. If a request for an attempt_id is received while that ID is already in-flight or completed, it will be rejected as a duplicate. This provides strong assurance against accidental double-spending due to server-side logic failures.
  2. Safe Retries of the Same Attempt: A client never has to guess if a SendOnion request was processed after network failures (e.g., timeouts, disconnections, server unavailability). They can always resolve the ambiguity by retrying the request until definitive acknowledgement is received from the switchrpc server.
  3. Stable and Trustworthy Error Contract: Once a request passes the InitAttempt gate, its outcome is effectively "frozen" from the perspective of a retrying client. By placing the durable write before any dynamic validation (e.g., checking peer connectivity or liquidity), the API guarantees that a request R1 accepted at time t1 cannot have its retry R2 rejected for a new, transient reason at time t2. The retry will always receive a DUPLICATE_HTLC acknowledgment, which is important for preventing client-side misinterpretations that can lead to leaked attempts.
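
The "retry the same attempt until acknowledged" behavior in point 2 can be sketched as a small client-side loop. Everything here is illustrative: `sendWithRetry`, `errUnavailable`, and `errDuplicate` are hypothetical names standing in for a real client's handling of codes.Unavailable and DUPLICATE_HTLC, and the fake `send` function simulates a server that processes the first request but whose reply is lost.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for codes.Unavailable and DUPLICATE_HTLC.
var (
	errUnavailable = errors.New("unavailable")
	errDuplicate   = errors.New("duplicate htlc")
)

// sendWithRetry resends the SAME attempt ID until the reply is no longer
// ambiguous. It returns true if the attempt is known to be in flight. If it
// returns errUnavailable, the outcome is still unresolved and the caller must
// keep retrying the same ID rather than move on to a new one.
func sendWithRetry(attemptID uint64, maxTries int,
	send func(uint64) error) (bool, error) {

	for i := 0; i < maxTries; i++ {
		err := send(attemptID)
		switch {
		case err == nil, errors.Is(err, errDuplicate):
			// Success or duplicate: the attempt was accepted.
			return true, nil
		case errors.Is(err, errUnavailable):
			// Ambiguous: retry the exact same request.
			continue
		default:
			// Definitive failure: the HTLC was not dispatched.
			return false, err
		}
	}
	return false, errUnavailable
}

func main() {
	processed := map[uint64]bool{}
	calls := 0

	// Fake server: the first call is processed but its reply is lost.
	send := func(id uint64) error {
		calls++
		if processed[id] {
			return errDuplicate
		}
		processed[id] = true
		if calls == 1 {
			return errUnavailable
		}
		return nil
	}

	inFlight, err := sendWithRetry(7, 5, send)
	fmt.Println(inFlight, err, calls) // true <nil> 2
}
```

The retry resolves to a duplicate acknowledgment rather than a second dispatch, which is the property the InitAttempt gate is meant to provide.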

Steps to Test

The integration test suite (itest) has been updated to verify the new idempotent behavior. We now explicitly assert that concurrent or repeated calls to SendOnion with the same attempt_id are correctly rejected as duplicates, both for the entire life-cycle of the attempt and during a client retry-storm scenario.

  • make itest icase=send_onion_concurrency
  • make itest icase=send_onion
  • go test -v -timeout 30s -tags switchrpc github.com/lightningnetwork/lnd/lnrpc/switchrpc
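
For intuition about what the concurrency case asserts, here is a minimal standalone sketch, not the itest itself. The `store` type, `initAttempt`, and `raceSameAttempt` are hypothetical; the sketch only shows that when many goroutines race to register one attempt ID behind a single gate, exactly one is accepted and the rest are treated as duplicates.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for the attempt store.
type store struct {
	mu   sync.Mutex
	seen map[uint64]bool
}

// initAttempt reports true only for the first caller to register the ID.
func (s *store) initAttempt(id uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[id] {
		return false
	}
	s.seen[id] = true
	return true
}

// raceSameAttempt launches n concurrent registrations of the same ID and
// counts how many were accepted and how many were rejected as duplicates.
func raceSameAttempt(n int, id uint64) (accepted, duplicates int) {
	s := &store{seen: make(map[uint64]bool)}

	var (
		wg sync.WaitGroup
		mu sync.Mutex // guards the two counters
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ok := s.initAttempt(id)

			mu.Lock()
			defer mu.Unlock()
			if ok {
				accepted++
			} else {
				duplicates++
			}
		}()
	}
	wg.Wait()
	return accepted, duplicates
}

func main() {
	accepted, duplicates := raceSameAttempt(50, 42)
	fmt.Println(accepted, duplicates) // 1 49
}
```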

@gemini-code-assist

Summary of Changes

Hello @calvinrzachman, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability of the SendOnion RPC by introducing idempotency. It ensures that payment attempts are processed at most once, even in the face of network interruptions or client retries. The core mechanism involves a write-ahead approach to record payment intent, coupled with a clear error handling strategy that guides clients on how to interpret and respond to various outcomes, ultimately leading to more robust payment-sending logic for external orchestrators.

Highlights

  • Idempotent SendOnion RPC: The SendOnion RPC is now fully idempotent, allowing clients to safely retry requests after network failures without risking duplicate payments. This is achieved through a "Register -> Validate -> Act -> Roll-back" pattern.
  • New Client Error Contract: A new error contract has been established for SendOnion clients, defining four distinct outcomes: SUCCESS, DUPLICATE (for retries of already processed attempts), AMBIGUOUS FAILURE (requiring exact retries), and DEFINITIVE FAILURE (guaranteeing no dispatch).
  • Guaranteed At-Most-Once Processing: The implementation ensures that the server will never process the same attempt_id more than once, providing strong assurance against accidental double-spending.
  • Enhanced Reliability: Clients can now safely retry SendOnion requests after timeouts or disconnections, and the API guarantees a stable and trustworthy error contract, preventing misleading transient errors on retries.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an important idempotency mechanism to the SendOnion RPC, which is a significant reliability improvement for clients. The implementation follows a clean "Register -> Validate -> Act -> Roll-back" pattern, which is well-documented and appears robust. The addition of a comprehensive integration test for concurrent requests provides strong confidence in the correctness of the solution. My feedback is focused on improving the new test code for better maintainability and robustness.

Collaborator

@bitromortac bitromortac left a comment

Very nice 🎉, concept ACK. It looks basically good to go.


// SendOnion handles the incoming request to send a payment using a
// preconstructed onion blob provided by the caller.
// SendOnion provides an idempotent API for dispatching a pre-formed onion
Collaborator

Great docs! We should put some of that info into switchrpc.proto as well (only the info that concerns the api user, they don't need to know implementation details).

Contributor Author

Updated the docs in switchrpc.proto. Let me know if it looks good!

Collaborator

I think there are still some deviations, see other comments

@ziggie1984 ziggie1984 force-pushed the elle-base-branch-payment-service branch from 307e665 to 3e0967d Compare January 5, 2026 16:09
@ziggie1984 ziggie1984 force-pushed the switchrpc-idempotent branch from 4b1ac93 to 8d3cb70 Compare January 5, 2026 16:12
This permits the Switch RPC server to manage the state
of attempts.
@calvinrzachman calvinrzachman marked this pull request as ready for review January 5, 2026 17:24
Collaborator

@bitromortac bitromortac left a comment

Thanks for the updates @calvinrzachman. I think we should still clarify how we transport errors. IMO we should send back gRPC error codes when hitting errors within the scope of SendOnion, and return a oneof of structured CleartextFailure or Success when the result comes from SendHTLC.


@saubyk saubyk added this to v0.21 Jan 6, 2026
@saubyk saubyk moved this to In progress in v0.21 Jan 6, 2026
@saubyk saubyk moved this from In progress to In review in v0.21 Jan 6, 2026
Update the SendOnion RPC implementation to make use of our new
Switch attempt store primitives for idempotence. This will
allow SendOnion to reject processing the same HTLC attempt ID
more than once, providing strong assurances to callers that
the endpoint is safe to retry when they encounter gRPC client
and network related failures (e.g., timeouts, service unavailable
errors, etc.).

Additionally, to guarantee a definitive outcome and prevent
orphaned attempts, the handler synchronously transitions an
initialized (PENDING) attempt to FAILED if subsequent validation
or dispatch fails. This prevents indefinite hangs for TrackOnion
queries.
Clarify usage of API for RPC clients.
We can now assert that making multiple calls to SendOnion for
the same attempt ID is prevented.
Collaborator

@bitromortac bitromortac left a comment

LGTM, nice work 🎉

been successfully processed. A retrying client should interpret this
as a success and proceed to tracking the payment's result.
3. AMBIGUOUS FAILURE (gRPC code Unavailable or DeadlineExceeded): An
Collaborator

perhaps we should add that context cancelled is also such an error and that this should be the fallback if an error can't be classified in terms of the other cases (we could make this the last point), since it's always safe to retry this RPC.

@bitromortac bitromortac requested a review from ziggie1984 January 8, 2026 16:40