
Conversation

tamirms
Contributor

@tamirms tamirms commented Aug 18, 2025

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've reviewed the changes in this PR and if I consider them worthwhile for being mentioned in release notes then I have updated the relevant CHANGELOG.md within the component folder structure. For example, if I changed horizon, then I updated services/horizon/CHANGELOG.md, adding a new line item describing the change with a reference to this PR. If I don't update a CHANGELOG, I acknowledge this PR's change may not be mentioned in future release notes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Add horizon flags to run ingestion load tests. The following conditions must be maintained during an ingestion load test:

  • Only a single horizon instance can run the ingestion load test. Other horizon nodes connected to the same postgres DB should not participate in ingestion.
  • Once the load test completes horizon should restore the DB back to the previous state.
  • It should not be possible for horizon to resume normal ingestion until after the DB is restored. Even if horizon crashes in the middle of the load test, ingestion will remain blocked until the DB is restored (a rough sketch of how such a persisted marker could work follows this list).
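
To illustrate the last point, here is a minimal sketch (not the PR's actual code) of how a marker persisted in the same Horizon DB could keep normal ingestion blocked even across a crash. The interface, key name, and function below are assumptions made purely for this example.

```go
package loadtest

import (
	"context"
	"errors"
)

// KeyValueStore is a stand-in for Horizon's key-value table access
// (services/horizon/internal/db2/history); the real interface differs.
type KeyValueStore interface {
	Get(ctx context.Context, key string) (string, bool, error)
	Set(ctx context.Context, key, value string) error
}

// loadTestLedgerKey is a hypothetical key marking an in-progress load test.
const loadTestLedgerKey = "loadtest_starting_ledger"

// beginLoadTest records the ledger at which the load test started. Because the
// marker lives in the same DB that ingestion uses, a crash mid-test leaves the
// marker in place and normal ingestion stays blocked until the DB is restored
// and the marker cleared.
func beginLoadTest(ctx context.Context, kv KeyValueStore, startLedger string) error {
	_, exists, err := kv.Get(ctx, loadTestLedgerKey)
	if err != nil {
		return err
	}
	if exists {
		return errors.New("a load test is already in progress")
	}
	return kv.Set(ctx, loadTestLedgerKey, startLedger)
}
```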

Why

#5481

Known limitations

[N/A]

@tamirms tamirms force-pushed the horizon-loadtest-flags branch from 386aa72 to be2c456 Compare August 25, 2025 23:19
@tamirms tamirms force-pushed the horizon-loadtest-flags branch from be2c456 to a0ac69e Compare August 25, 2025 23:30
@tamirms tamirms marked this pull request as ready for review August 26, 2025 13:11
@Copilot Copilot AI review requested due to automatic review settings August 26, 2025 13:11

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds Horizon configuration flags and implementation to enable ingestion load testing using a special ledger backend that can replay synthetic ledgers while maintaining database consistency. The implementation ensures only one Horizon instance can run load tests and includes mechanisms to restore the database to its previous state after testing.

Key changes:

  • Added new configuration flags for load test file paths and timing
  • Implemented load test snapshot functionality for database state management
  • Integrated load test backend with ingestion system configuration
  • Added validation checks throughout the ingestion state machine

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Summary per file:

  • services/horizon/internal/flags.go: Added three new configuration flags for load test parameters
  • services/horizon/internal/config.go: Added load test configuration fields to the Config struct
  • services/horizon/internal/init.go: Passed load test configuration to the ingestion system
  • services/horizon/cmd/db.go: Propagated load test configuration to DB reingest operations
  • services/horizon/internal/ingest/main.go: Integrated load test backend creation and snapshot management
  • services/horizon/internal/ingest/loadtest.go: New file implementing database snapshot management for load tests
  • services/horizon/internal/db2/history/main.go: Added load test state management interface methods
  • services/horizon/internal/db2/history/key_value.go: Implemented load test state persistence in the key-value store
  • services/horizon/internal/ingest/fsm.go: Added load test validation checks in ingestion state transitions
  • ingest/loadtest/ledger_backend.go: Enhanced load test backend with snapshot integration and concurrency safety
  • Test files: Updated test mocks and added load test snapshot initialization
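
For orientation, the following is a rough, hypothetical sketch of the kind of load test settings described above (file paths plus a timing knob). The type and field names are assumptions, not the actual fields added in config.go and flags.go.

```go
package horizon

import "time"

// LoadTestConfig is an illustrative grouping of the three load test settings
// mentioned in the summary above; it is not the PR's actual Config layout.
type LoadTestConfig struct {
	// LedgersFilePath points at a file of pre-generated synthetic ledgers to replay.
	LedgersFilePath string
	// LedgerEntriesFilePath points at a file of seed ledger entries, if any.
	LedgerEntriesFilePath string
	// CloseDuration simulates the time between ledger closes during the replay.
	CloseDuration time.Duration
}
```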


@sreuland
Contributor

sreuland commented Sep 3, 2025

What is driving the need to run a load test on top of an existing horizon db and therefore the Snapshot and Save/Restore functionality?
Can load tests be performed on a separate/empty db instance and be as effective?

some aspects to consider:

  • reduce complexity by avoiding the restoration layer, with a smaller code footprint.
  • avoid downtime on prod dbs.
  • avoid co-mingling test data in production dbs.
  • an empty db as a baseline may lend itself to more deterministic results.

I think we discussed similar patterns related to load testing on datastores

Contributor

@Shaptic Shaptic left a comment


lil' drive-by review

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

@sreuland

Can load tests be performed on a separate/empty db instance and be as effective?

Unfortunately not, because the performance of ingesting into empty postgres tables is significantly faster than ingesting into postgres tables which already have 1 year or more of data. When ingesting into an empty table, postgres can probably keep the new data cached in-memory. In contrast, if the tables we are ingesting into already occupy 100s of gigabytes or even terabytes, it is more likely that we will be exercising the disk more.

This does not apply to Galexie because the number of existing objects in a GCS / S3 bucket does not impact the performance of subsequent PUT requests to insert new objects in the bucket. So, we should observe the same export performance in Galexie with a new bucket vs an existing bucket. The performance of listing files is impacted by the number of files that exist in the bucket, but we are not using that operation in the ingestion load tests.

avoid downtime on prod db's
avoid co-mingling test data on production db's.

The intention is to run these load tests in a staging environment which has the same data as prod. I definitely don't recommend running this in production.

@sreuland
Contributor

sreuland commented Sep 4, 2025

Unfortunately not, because the performance of ingesting into empty postgres tables is significantly faster than ingesting into postgres tables which already have 1 year or more of data. When ingesting into an empty table, postgres can probably keep the new data cached in-memory. In contrast, if the tables we are ingesting into already occupy 100s of gigabytes or even terabytes, it is more likely that we will be exercising the disk more.

Ok, given that the intended use case of loadtest is to target non-prod, populated db instances such as staging, it seems the Snapshot and Save/Restore functionality proposed in this PR could still be considered for removal, since for this use case we could provide a simple operator guide that states how to use loadtest:

  • step 1: operator uses a staging db loaded with data.
  • step 2: operator can optionally snapshot/dump the db first.
  • step 3: operator runs horizon configured with loadtest against the staging db.
  • step 4: operator captures loadtest results from logs.
  • step 5: operator drops the db or restores it.

The Snapshot and Save/Restore functionality doesn't really add value in this process?

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

The Snapshot and Save/Restore functionality doesn't really add value in this process?

Initially, there was no snapshot and restore process and I did it manually instead. However, the manual process was very error prone and after running the ingestion load test a few times I felt that automating the process saved a lot of time.

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

Also, the way to restore the db is not as straightforward as you would think. For example, the proposal you suggested of snapshotting the db and then restoring it from a snapshot would not be ideal. When you restore from a snapshot there is a cold-start problem: it can take several days for the db cache to be populated properly.

The best approach I found for restoring the db is doing a state rebuild and running a reingest command on the ledger range corresponding to the load test run. That is basically the workflow implemented in the PR.
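
To make that workflow concrete, here is a hedged sketch of the two steps just described; the Ingestor interface and function names are hypothetical and not the API added by this PR.

```go
package loadtest

import "context"

// Ingestor abstracts the two operations the restore flow needs; this is an
// assumption made for the sketch, not Horizon's real interface.
type Ingestor interface {
	// RebuildState re-derives Horizon's state tables from a history archive checkpoint.
	RebuildState(ctx context.Context) error
	// ReingestRange re-applies real network history for a ledger range.
	ReingestRange(ctx context.Context, from, to uint32) error
}

// restoreAfterLoadTest restores the DB without a snapshot restore (and its
// cache cold-start): rebuild state, then reingest the ledger range that the
// load test covered with synthetic ledgers.
func restoreAfterLoadTest(ctx context.Context, ing Ingestor, from, to uint32) error {
	if err := ing.RebuildState(ctx); err != nil {
		return err
	}
	return ing.ReingestRange(ctx, from, to)
}
```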

@sreuland
Contributor

sreuland commented Sep 4, 2025

The best approach I found for restoring the db is doing a state rebuild and running a reingest command on the ledger range corresponding to the load test run. That is basically the workflow implemented in the PR

Yes, the convenience gained by encapsulating that db management aspect is nice, but the approach comes with a trade-off: horizon and its internal ingestion state machine become coupled to load testing concerns not related to ingestion functionality. For example, the horizon ingestion state machine is now invoking loadtest state checks from several states, regardless of whether loadtest is configured or not, which results in a db round trip.

The question comes down to: 'is the convenience of db management for load test worth the added complexity in the ingestion engine, or can it be offloaded to user guide best practices?'

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

For example, the horizon ingestion state machine is now invoking loadtest state checks from several states, regardless of whether loadtest is configured or not, which results in a db round trip.

That coupling is there for safety reasons and does not actually have to do with the logic to restore the horizon db back to the previous state.

The code in the ingestion state machine prevents normal ingestion from running while a loadtest is ongoing. Even if we remove the restore functionality, those checks would have to remain if we want to ensure safety. In our case we have multiple ingesting instances connected to the same horizon db, so it's very easy to make a mistake while running the load test. Of course it's possible to come up with a manual process, but it will still be error-prone.

One thing I could do is refactor the code so that we have one common function used by multiple states to check if the db is in a safe state to continue ingestion (see the sketch after the list below). Within that function we can assert:

  • ingestion version is up to date
  • no load test is ongoing
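
A rough sketch of what that shared check could look like, assuming hypothetical method and constant names rather than the code in this PR:

```go
package ingest

import (
	"context"
	"errors"
	"fmt"
)

// historyQ is a stand-in for the history query interface the state machine
// uses; both methods below are assumptions made for this sketch.
type historyQ interface {
	GetIngestVersion(ctx context.Context) (int, error)
	GetLoadTestLedger(ctx context.Context) (uint32, bool, error)
}

// currentVersion is a placeholder for Horizon's ingestion schema version.
const currentVersion = 1

// checkSafeToIngest is the kind of shared helper proposed above: any state
// that is about to run normal ingestion first asserts that the DB is usable.
func checkSafeToIngest(ctx context.Context, q historyQ) error {
	v, err := q.GetIngestVersion(ctx)
	if err != nil {
		return err
	}
	if v != currentVersion {
		return fmt.Errorf("ingestion version %d does not match this binary (%d)", v, currentVersion)
	}
	_, loadTestActive, err := q.GetLoadTestLedger(ctx)
	if err != nil {
		return err
	}
	if loadTestActive {
		return errors.New("a load test is in progress; restore the DB before resuming ingestion")
	}
	return nil
}
```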

@sreuland
Contributor

sreuland commented Sep 4, 2025

One thing I could do is refactor the code so that we have one common function used by multiple states to check if the db is in a safe state to continue ingestion. Within that function we can assert:

I don't think the refactor is necessary as there are just a few lines there and it's clear what they're doing. I'm ready to approve.

I'm bumping more on the overall design approach for integrating loadtest and the footprint of using it going forward in other application code paths, and this motivated some thoughts on a different 'blackbox' design approach, which I spun up as a spike - #5810

@tamirms tamirms requested review from Shaptic and sreuland September 11, 2025 09:55
Contributor

@Shaptic Shaptic left a comment


Super cool feature!

@tamirms tamirms merged commit b61bfee into stellar:master Sep 12, 2025
35 of 43 checks passed
@tamirms tamirms deleted the horizon-loadtest-flags branch September 12, 2025 09:32