-
Notifications
You must be signed in to change notification settings - Fork 531
services/horizon: Add horizon flags to enable ingestion load test ledger backend #5794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
386aa72
to
be2c456
Compare
be2c456
to
a0ac69e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds Horizon configuration flags and implementation to enable ingestion load testing using a special ledger backend that can replay synthetic ledgers while maintaining database consistency. The implementation ensures only one Horizon instance can run load tests and includes mechanisms to restore the database to its previous state after testing.
Key changes:
- Added new configuration flags for load test file paths and timing
- Implemented load test snapshot functionality for database state management
- Integrated load test backend with ingestion system configuration
- Added validation checks throughout the ingestion state machine
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
services/horizon/internal/flags.go | Added three new configuration flags for load test parameters |
services/horizon/internal/config.go | Added load test configuration fields to Config struct |
services/horizon/internal/init.go | Passed load test configuration to ingestion system |
services/horizon/cmd/db.go | Propagated load test configuration to DB reingest operations |
services/horizon/internal/ingest/main.go | Integrated load test backend creation and snapshot management |
services/horizon/internal/ingest/loadtest.go | New file implementing database snapshot management for load tests |
services/horizon/internal/db2/history/main.go | Added load test state management interface methods |
services/horizon/internal/db2/history/key_value.go | Implemented load test state persistence in key-value store |
services/horizon/internal/ingest/fsm.go | Added load test validation checks in ingestion state transitions |
ingest/loadtest/ledger_backend.go | Enhanced load test backend with snapshot integration and concurrency safety |
Test files | Updated test mocks and added load test snapshot initialization |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
What is driving the need to run load test on top of an existing horizon db and therefore the Snapshot and Save/Restore functionality? some aspects to consider:
I think we discussed similar patterns related to load testing on datastores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lil' drive-by review
Unfortunately not because the performance of ingesting into empty postgres tables is significantly faster than ingesting into a postgres tables which already have 1 year or more of data. When ingesting into a empty table, postgres can probably keep the new data cached in-memory. In contrast, if the tables we are ingesting into already occupy 100s of gigabytes or even terrabytes, it is more likely that we will be exercising the disk more. This does not apply to galexie because the number of existing objects in a GCS / S3 bucket does not impact the performance of subsequent PUT requests to insert new objects in the bucket. So, we should observe the same export performance in Galexie with a new bucket vs an existing bucket. The performance of listing files is impacted by the number of files that exist in the bucket but we are not using that operation in the ingestion load tests.
The intention is to run these load tests in a staging environment which has the same data as prod. I definitely don't recommend running this in production. |
Ok, given the intended use case of loadtest is to target non-prod, populated db instances such as staging,etc. it seems the Snapshot and Save/Restore functionality proposed in this PR could still be considered for removal since the use case sounds like we provide a simple operator guide that states how to use loadtest:
The Snapshot and Save/Restore functionality doesn't really add value in this process? |
Initially, there was no snapshot and restore process and I did it manually instead. However, the manual process was very error prone and after running the ingestion load test a few times I felt that automating the process saved a lot of time. |
Also, the way to restore the db is not as straightforward as you would think. For example the proposal you suggested of snapshotting the db and then restoring it from a snapshot would not be ideal. When you restore from a snapshot there is a cold start problem. It can take several days for the db cache to be populated properly. The best approach I found for restoring the db is doing a state rebuild and running a reingest command on the ledger range corresponding to the load test run. That is basically the workflow implemented in the PR |
Yes, the convenience gained by encapsulating that db management aspect is nice but the approach comes with a trade-off where now horizon and its internal ingestion state machine become coupled to load testing concerns not related to ingestion functionality. For example the horizon ingestion state machine is now invoking loadtest state checks from several states regardless of whether loadtest is configured or not which results in a db round trip. The question comes down to 'is the convenience of db management for load test worth the added complexity in the ingestion engine or can it be offloaded to user guide best practices?' |
That coupling is there for safety reasons and does not actually have to do with the logic to restore the horizon db back to the previous state. The code in the ingestion state machine prevents normal ingestion from running while a loadtest is ongoing. Even if we remove the restore functionality those checks would have to remain if we want to ensure safety. In our case we have multiple ingesting instances connected to the same horizon db, so it's very easy to make a mistake while running the load test. Of course it's possible to come up with a manual process but it will still be error prone. One thing I could do is refactor the code so that we have one common function used by multiple states to check if the db is in a safe state to continue ingestion. Within that function we can assert:
|
I don't think the refactor is necessary as there's just a few lines there and its clear what its doing. I'm ready to approve. I'm bumping more into the overall design approach for integrating loadtest and the footprint for using it going forward in other application code paths and this motivated some thoughts on a different 'blackbox' design approach which I spun up on a Spike - #5810 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super cool feature!
PR Checklist
PR Structure
otherwise).
services/friendbot
, orall
ordoc
if the changes are broad or impact manypackages.
Thoroughness
.md
files, etc... affected by this change). Take a look in the
docs
folder for a given service,like this one.
Release planning
CHANGELOG.md
within the component folder structure. For example, if I changed horizon, then I updated (services/horizon/CHANGELOG.md. I add a new line item describing the change and reference to this PR. If I don't update a CHANGELOG, I acknowledge this PR's change may not be mentioned in future release notes.semver, or if it's mainly a patch change. The PR is targeted at the next
release branch if it's not a patch change.
What
Add horizon flags to run ingestion load tests. The following conditions must be maintained during an ingestion loadt test:
Why
#5481
Known limitations
[N/A]