Skip to content

Conversation

@alexluong
Copy link
Collaborator

@alexluong alexluong commented Jan 22, 2026

Moves event fetching from the delivery flow to the retry flow.

What changed

  • Retry scheduler now fetches full event data from logstore before publishing to delivery queue
  • Delivery handler no longer needs logstore dependency or retry detection logic

Why this design

Scheduled retry tasks only store event IDs (not full payloads) to keep queue messages small. Previously, the delivery handler had to detect retries via event.Time.IsZero() and conditionally fetch event data.

Moving this to the retry scheduler is cleaner because:

  1. The retry flow inherently knows it's handling retries - no heuristic detection needed
  2. Fails earlier - missing events are caught before publishing to delivery queue
  3. Removes logstore dependency from delivery handler - delivery handler always receives complete data and no longer needs to know about or access logstore. It simply processes what it's given, regardless of message origin.

Both entry points now consistently provide full event data to the delivery queue:

Flow Event data fetched at
Initial delivery Already has full event
Scheduled retry Retry scheduler (before publish)
Manual retry API handler (before publish)

@vercel
Copy link

vercel bot commented Jan 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
outpost-docs Ready Ready Preview, Comment Jan 26, 2026 5:39pm
outpost-website Ready Ready Preview, Comment Jan 26, 2026 5:39pm

Request Review

Copy link
Contributor

@alexbouchardd alexbouchardd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good refactor 👍

@alexluong alexluong force-pushed the model-delivery-event branch from 04c3ebd to 9277a3f Compare January 26, 2026 13:11
alexluong and others added 10 commits January 27, 2026 00:37
Extends TestRetryDelivery to verify that the manual retry API publishes
a DeliveryTask with complete event data to deliveryMQ. This ensures
the deliverymq handler receives full event data for manual retries,
consistent with the scheduled retry flow which fetches event data
in the retry scheduler.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add unit and e2e tests verifying that retries are not lost when the
retry scheduler queries logstore before the event has been persisted.

Test scenario:
1. Initial delivery fails, retry is scheduled
2. Retry scheduler queries logstore for event data
3. Event is not yet persisted (logmq batching delay)
4. Retry should remain in queue and be reprocessed later

Tests added:
- TestRetryScheduler_RaceCondition_EventNotYetPersisted (unit test)
- TestE2E_Regression_RetryRaceCondition (e2e test)

Also adds:
- RetryVisibilityTimeoutSeconds config option
- WithRetryVisibilityTimeout scheduler option
- mockDelayedEventGetter for simulating delayed persistence

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Return error instead of nil so the retry message stays in queue and
will be reprocessed after the visibility timeout.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@alexluong alexluong force-pushed the deliverymq-logstore-deps branch from eb51fdb to b701223 Compare January 26, 2026 17:38
@alexluong alexluong merged commit da4bd25 into model-delivery-event Jan 26, 2026
2 of 4 checks passed
@alexluong alexluong deleted the deliverymq-logstore-deps branch January 26, 2026 17:38
alexluong added a commit that referenced this pull request Jan 30, 2026
* refactor: normalize logstore driver interface

- Rename interface methods: InsertManyDeliveryEvent -> InsertMany,
  ListDeliveryEvent -> ListDelivery, RetrieveDeliveryEvent -> RetrieveDelivery
- Rename request types: ListDeliveryEventRequest -> ListDeliveryRequest, etc.
- Add DeliveryRecord type for query results with Event and Delivery
- Update memlogstore, pglogstore, chlogstore implementations
- Update all API handlers and tests to use new interface
- Remove DeliveryEventID field from Delivery struct

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: introduce LogEntry message type for logmq

* refactor: move batch processor from builder to logmq package

* refactor: rename RetryMessage to RetryTask

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: introduce DeliveryTask and update message queue flow

- Add DeliveryTask struct with IdempotencyKey() and RetryID() methods
- Update deliverymq to publish/consume DeliveryTask instead of DeliveryEvent
- Update publishmq to create and enqueue DeliveryTask
- Update RetryTask to convert to DeliveryTask
- Update API handlers, eventtracer, alert, emetrics to use DeliveryTask
- Fix Delivery fields (TenantID, Attempt, Manual) not being set before logging
- Add :manual suffix to idempotency key for manual retries

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: remove DeliveryEvent type and legacy API handlers

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* docs: update README and comments to reflect DeliveryEvent removal

- Update chlogstore/README.md method names and SQL examples
- Update pglogstore/README.md method names
- Update tracer_test.go comment to reference DeliveryTask

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: remove delivery_event_id column and legacy API docs

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* docs: generate config

* refactor: change InsertMany to accept []*LogEntry

Preserves Event-Delivery pairing through the insert flow, eliminating
the need for eventMap reconstruction in ClickHouse implementation.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: move retry event data query into retry flow (#654)

* refactor: add event fetching to retry scheduler

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: align mock eventGetter with logstore behavior

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: remove logStore from messagehandler

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* chore: remove dead eventGetter code from messagehandler tests

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* test: verify manual retry publishes full event data

Extends TestRetryDelivery to verify that the manual retry API publishes
a DeliveryTask with complete event data to deliveryMQ. This ensures
the deliverymq handler receives full event data for manual retries,
consistent with the scheduled retry flow which fetches event data
in the retry scheduler.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* test: add failing tests for retry race condition

Add unit and e2e tests verifying that retries are not lost when the
retry scheduler queries logstore before the event has been persisted.

Test scenario:
1. Initial delivery fails, retry is scheduled
2. Retry scheduler queries logstore for event data
3. Event is not yet persisted (logmq batching delay)
4. Retry should remain in queue and be reprocessed later

Tests added:
- TestRetryScheduler_RaceCondition_EventNotYetPersisted (unit test)
- TestE2E_Regression_RetryRaceCondition (e2e test)

Also adds:
- RetryVisibilityTimeoutSeconds config option
- WithRetryVisibilityTimeout scheduler option
- mockDelayedEventGetter for simulating delayed persistence

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: return error when event not found in logstore during retry

Return error instead of nil so the retry message stays in queue and
will be reprocessed after the visibility timeout.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* test: improve flaky tests

* chore: dev yaml

* chore: `make test` skip cmd/e2e by default

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>

* test: redis testcontainer flakiness

* refactor: rename Delivery to Attempt in core + API layer

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* test: rename Delivery → Attempt in all test files

Update test files to use the new Attempt naming:
- logstore/drivertest/*.go: AttemptFactory, ListAttempt, RetrieveAttempt
- deliverymq/*_test.go: AttemptStatus*, entry.Attempt
- apirouter/*_test.go: AttemptFactory, /attempts API paths
- logmq/batchprocessor_test.go: AttemptFactory, Attempt struct fields

All unit tests pass (1665 tests).

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* chore: add database migrations for Delivery → Attempt rename

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* docs: rename Delivery to Attempt in OpenAPI spec

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* feat: rename Delivery to Attempt in UI components

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* chore: fix Delivery → Attempt rename inconsistencies

- Revert attempt_metadata → delivery_metadata in OpenAPI (matches Go code)
- Fix config: delivery_prefix → attempt_prefix, example att → atm
- Fix variable names: deliveryID → attemptID, attErr → atmErr
- Fix test names: TestListDeliveries → TestListAttempts, etc.
- Update doc links for renamed API endpoints
- Update comments and test descriptions

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* chore: remove redundant inline comments

Remove comments that simply restate what the next line of code does,
such as "// Create client" before NewClient() or "// Check if exists"
before .Exists(). Preserves all meaningful comments including godoc,
section headers, WHY explanations, and unit clarifications.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* chore: add destination-scoped attempts routes, rename attempt_number, update UI name

* chore: opanapi.yaml

* test: e2e tests for attempt_number

* chore: rename portal attempt route

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants