
Conversation

tamirms
Contributor

@tamirms tamirms commented Aug 18, 2025

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've reviewed the changes in this PR and if I consider them worthwhile for being mentioned in release notes then I have updated the relevant CHANGELOG.md within the component folder structure. For example, if I changed horizon, then I updated services/horizon/CHANGELOG.md, adding a new line item describing the change with a reference to this PR. If I don't update a CHANGELOG, I acknowledge this PR's change may not be mentioned in future release notes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Add horizon flags to run ingestion load tests. The following conditions must be maintained during an ingestion load test:

  • Only a single horizon instance can run the ingestion load test. Other horizon nodes connected to the same postgres DB should not participate in ingestion.
  • Once the load test completes horizon should restore the DB back to the previous state.
  • It should not be possible for horizon to resume normal ingestion until after the DB is restored. Even if horizon crashes in the middle of the load test, ingestion will remain blocked until the DB is restored (a rough sketch of how such a persisted marker could work follows this list).
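
To illustrate the last point, here is a minimal sketch (not the PR's actual code) of how a marker persisted in the same Horizon DB could keep normal ingestion blocked even across a crash. The interface, key name, and function below are assumptions made purely for this example.

```go
package loadtest

import (
	"context"
	"errors"
)

// KeyValueStore is a stand-in for Horizon's key-value table access
// (services/horizon/internal/db2/history); the real interface differs.
type KeyValueStore interface {
	Get(ctx context.Context, key string) (string, bool, error)
	Set(ctx context.Context, key, value string) error
}

// loadTestLedgerKey is a hypothetical key marking an in-progress load test.
const loadTestLedgerKey = "loadtest_starting_ledger"

// beginLoadTest records the ledger at which the load test started. Because the
// marker lives in the same DB that ingestion uses, a crash mid-test leaves the
// marker in place and normal ingestion stays blocked until the DB is restored
// and the marker cleared.
func beginLoadTest(ctx context.Context, kv KeyValueStore, startLedger string) error {
	_, exists, err := kv.Get(ctx, loadTestLedgerKey)
	if err != nil {
		return err
	}
	if exists {
		return errors.New("a load test is already in progress")
	}
	return kv.Set(ctx, loadTestLedgerKey, startLedger)
}
```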

Why

#5481

Known limitations

[N/A]

@tamirms tamirms force-pushed the horizon-loadtest-flags branch from 386aa72 to be2c456 Compare August 25, 2025 23:19
@tamirms tamirms force-pushed the horizon-loadtest-flags branch from be2c456 to a0ac69e Compare August 25, 2025 23:30
@tamirms tamirms marked this pull request as ready for review August 26, 2025 13:11
@Copilot Copilot AI review requested due to automatic review settings August 26, 2025 13:11

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds Horizon configuration flags and implementation to enable ingestion load testing using a special ledger backend that can replay synthetic ledgers while maintaining database consistency. The implementation ensures only one Horizon instance can run load tests and includes mechanisms to restore the database to its previous state after testing.

Key changes:

  • Added new configuration flags for load test file paths and timing
  • Implemented load test snapshot functionality for database state management
  • Integrated load test backend with ingestion system configuration
  • Added validation checks throughout the ingestion state machine

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Summary per file:

  • services/horizon/internal/flags.go: Added three new configuration flags for load test parameters
  • services/horizon/internal/config.go: Added load test configuration fields to the Config struct
  • services/horizon/internal/init.go: Passed load test configuration to the ingestion system
  • services/horizon/cmd/db.go: Propagated load test configuration to DB reingest operations
  • services/horizon/internal/ingest/main.go: Integrated load test backend creation and snapshot management
  • services/horizon/internal/ingest/loadtest.go: New file implementing database snapshot management for load tests
  • services/horizon/internal/db2/history/main.go: Added load test state management interface methods
  • services/horizon/internal/db2/history/key_value.go: Implemented load test state persistence in the key-value store
  • services/horizon/internal/ingest/fsm.go: Added load test validation checks in ingestion state transitions
  • ingest/loadtest/ledger_backend.go: Enhanced load test backend with snapshot integration and concurrency safety
  • Test files: Updated test mocks and added load test snapshot initialization
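
For orientation, the following is a rough, hypothetical sketch of the kind of load test settings described above (file paths plus a timing knob). The type and field names are assumptions, not the actual fields added in config.go and flags.go.

```go
package horizon

import "time"

// LoadTestConfig is an illustrative grouping of the three load test settings
// mentioned in the summary above; it is not the PR's actual Config layout.
type LoadTestConfig struct {
	// LedgersFilePath points at a file of pre-generated synthetic ledgers to replay.
	LedgersFilePath string
	// LedgerEntriesFilePath points at a file of seed ledger entries, if any.
	LedgerEntriesFilePath string
	// CloseDuration simulates the time between ledger closes during the replay.
	CloseDuration time.Duration
}
```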


@sreuland
Contributor

sreuland commented Sep 3, 2025

What is driving the need to run a load test on top of an existing horizon db and therefore the Snapshot and Save/Restore functionality?
Can load tests be performed on a separate/empty db instance and be as effective?

some aspects to consider:

  • reduce complexity by avoiding the restoration layer, with a smaller code footprint.
  • avoid downtime on prod dbs.
  • avoid co-mingling test data in production dbs.
  • an empty db as a baseline may lend itself to more deterministic results.

I think we discussed similar patterns related to load testing on datastores

Contributor

@Shaptic Shaptic left a comment


lil' drive-by review

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

@sreuland

Can load tests be performed on a separate/empty db instance and be as effective?

Unfortunately not, because the performance of ingesting into empty postgres tables is significantly faster than ingesting into postgres tables which already have 1 year or more of data. When ingesting into an empty table, postgres can probably keep the new data cached in-memory. In contrast, if the tables we are ingesting into already occupy 100s of gigabytes or even terabytes, it is more likely that we will be exercising the disk more.

This does not apply to Galexie because the number of existing objects in a GCS / S3 bucket does not impact the performance of subsequent PUT requests to insert new objects in the bucket. So, we should observe the same export performance in Galexie with a new bucket vs an existing bucket. The performance of listing files is impacted by the number of files that exist in the bucket, but we are not using that operation in the ingestion load tests.

avoid downtime on prod db's
avoid co-mingling test data on production db's.

The intention is to run these load tests in a staging environment which has the same data as prod. I definitely don't recommend running this in production.

@sreuland
Contributor

sreuland commented Sep 4, 2025

Unfortunately not, because the performance of ingesting into empty postgres tables is significantly faster than ingesting into postgres tables which already have 1 year or more of data. When ingesting into an empty table, postgres can probably keep the new data cached in-memory. In contrast, if the tables we are ingesting into already occupy 100s of gigabytes or even terabytes, it is more likely that we will be exercising the disk more.

Ok, given that the intended use case of loadtest is to target non-prod, populated db instances such as staging, it seems the Snapshot and Save/Restore functionality proposed in this PR could still be considered for removal, since for this use case we could provide a simple operator guide that states how to use loadtest:

  • step 1: operator uses a staging db loaded with data.
  • step 2: operator can optionally snapshot/dump the db first.
  • step 3: operator runs horizon configured with loadtest against the staging db.
  • step 4: operator captures loadtest results from logs.
  • step 5: operator drops the db or restores it.

The Snapshot and Save/Restore functionality doesn't really add value in this process?

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

The Snapshot and Save/Restore functionality doesn't really add value in this process?

Initially, there was no snapshot and restore process and I did it manually instead. However, the manual process was very error prone and after running the ingestion load test a few times I felt that automating the process saved a lot of time.

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

Also, the way to restore the db is not as straightforward as you would think. For example, the proposal you suggested of snapshotting the db and then restoring it from a snapshot would not be ideal. When you restore from a snapshot there is a cold-start problem: it can take several days for the db cache to be populated properly.

The best approach I found for restoring the db is doing a state rebuild and running a reingest command on the ledger range corresponding to the load test run. That is basically the workflow implemented in the PR.
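
To make that workflow concrete, here is a hedged sketch of the two steps just described; the Ingestor interface and function names are hypothetical and not the API added by this PR.

```go
package loadtest

import "context"

// Ingestor abstracts the two operations the restore flow needs; this is an
// assumption made for the sketch, not Horizon's real interface.
type Ingestor interface {
	// RebuildState re-derives Horizon's state tables from a history archive checkpoint.
	RebuildState(ctx context.Context) error
	// ReingestRange re-applies real network history for a ledger range.
	ReingestRange(ctx context.Context, from, to uint32) error
}

// restoreAfterLoadTest restores the DB without a snapshot restore (and its
// cache cold-start): rebuild state, then reingest the ledger range that the
// load test covered with synthetic ledgers.
func restoreAfterLoadTest(ctx context.Context, ing Ingestor, from, to uint32) error {
	if err := ing.RebuildState(ctx); err != nil {
		return err
	}
	return ing.ReingestRange(ctx, from, to)
}
```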

@sreuland
Contributor

sreuland commented Sep 4, 2025

The best approach I found for restoring the db is doing a state rebuild and running a reingest command on the ledger range corresponding to the load test run. That is basically the workflow implemented in the PR

Yes, the convenience gained by encapsulating that db management aspect is nice, but the approach comes with a trade-off: horizon and its internal ingestion state machine become coupled to load testing concerns not related to ingestion functionality. For example, the horizon ingestion state machine is now invoking loadtest state checks from several states, regardless of whether loadtest is configured or not, which results in a db round trip.

The question comes down to: 'is the convenience of db management for load test worth the added complexity in the ingestion engine, or can it be offloaded to user guide best practices?'

@tamirms
Contributor Author

tamirms commented Sep 4, 2025

For example, the horizon ingestion state machine is now invoking loadtest state checks from several states, regardless of whether loadtest is configured or not, which results in a db round trip.

That coupling is there for safety reasons and does not actually have to do with the logic to restore the horizon db back to the previous state.

The code in the ingestion state machine prevents normal ingestion from running while a loadtest is ongoing. Even if we remove the restore functionality, those checks would have to remain if we want to ensure safety. In our case we have multiple ingesting instances connected to the same horizon db, so it's very easy to make a mistake while running the load test. Of course it's possible to come up with a manual process, but it will still be error-prone.

One thing I could do is refactor the code so that we have one common function used by multiple states to check if the db is in a safe state to continue ingestion (see the sketch after the list below). Within that function we can assert:

  • ingestion version is up to date
  • no load test is ongoing
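
A rough sketch of what that shared check could look like, assuming hypothetical method and constant names rather than the code in this PR:

```go
package ingest

import (
	"context"
	"errors"
	"fmt"
)

// historyQ is a stand-in for the history query interface the state machine
// uses; both methods below are assumptions made for this sketch.
type historyQ interface {
	GetIngestVersion(ctx context.Context) (int, error)
	GetLoadTestLedger(ctx context.Context) (uint32, bool, error)
}

// currentVersion is a placeholder for Horizon's ingestion schema version.
const currentVersion = 1

// checkSafeToIngest is the kind of shared helper proposed above: any state
// that is about to run normal ingestion first asserts that the DB is usable.
func checkSafeToIngest(ctx context.Context, q historyQ) error {
	v, err := q.GetIngestVersion(ctx)
	if err != nil {
		return err
	}
	if v != currentVersion {
		return fmt.Errorf("ingestion version %d does not match this binary (%d)", v, currentVersion)
	}
	_, loadTestActive, err := q.GetLoadTestLedger(ctx)
	if err != nil {
		return err
	}
	if loadTestActive {
		return errors.New("a load test is in progress; restore the DB before resuming ingestion")
	}
	return nil
}
```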

@sreuland
Contributor

sreuland commented Sep 4, 2025

One thing I could do is refactor the code so that we have one common function used by multiple states to check if the db is in a safe state to continue ingestion. Within that function we can assert:

I don't think the refactor is necessary as there are just a few lines there and it's clear what they're doing. I'm ready to approve.

I'm bumping more on the overall design approach for integrating loadtest and the footprint of using it going forward in other application code paths, and this motivated some thoughts on a different 'blackbox' design approach, which I spun up as a spike - #5810

@tamirms tamirms requested review from Shaptic and sreuland September 11, 2025 09:55
Contributor

@Shaptic Shaptic left a comment


Super cool feature!

@tamirms tamirms merged commit b61bfee into stellar:master Sep 12, 2025
35 of 43 checks passed
@tamirms tamirms deleted the horizon-loadtest-flags branch September 12, 2025 09:32