Event Log Rotation #1581

Open
Quantumplation opened this issue Aug 21, 2024 · 8 comments · May be fixed by #1997
Assignees
Labels
amber ⚠️ Medium complexity or partly unclear feature
Milestone

Comments

@Quantumplation
Contributor

Why

While working on the hydra-doom project, we noticed that both the on-disk state and the in-memory state grew without bound (see #1572).

This meant that, under the sustained load the hydra-doom demo was producing, nodes became inoperable after just a few hours. The hack in #1572 helped, but the on-disk state still needed to be rotated regularly, by hand.

Rotation consisted of stopping the nodes, renaming the data directory, bringing the nodes back up, and then shipping the old data directory off to archival storage. This only worked because we were using offline nodes and didn't mind interrupting the head.

What

I'd like to propose that the hydra head implement checkpointing for the event log.

How

This is just a proposed implementation; feel free to adapt it to better fit the intricacies of the hydra codebase.

  • The hydra node default file sync will be updated to write into event log files or directories that are named by a starting sequence number, such as data/seq-0/state or data/seq-12345/state
  • The first message in the event log will be a "checkpoint" event, which contains any state needed to recover from that point in time without regard to any messages that came before
  • After a certain number of messages written (or time interval, or bytes, etc.), the hydra node will close the previous files, create a new file/directory, write the checkpoint event to the file, drop any previous events from memory, and then emit this checkpoint event to the websocket API
  • On startup, the default file source would identify the "latest" event log directory and begin consuming events from that log; the initial checkpoint event would allow it to recover any state, such as the current UTXO, etc.
  • A new websocket message, "trigger checkpoint", would allow external orchestration to request a checkpoint when it's required, e.g. for maintenance windows, file backpressure, etc.
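To make the proposed flow concrete, here is an illustrative sketch only — Hydra itself is written in Haskell, and the names (`RotatingEventLog`), the JSON event shapes (`tag`, `delta`), and the `rotate_after` threshold are all invented for this example:

```python
import json
import os


class RotatingEventLog:
    """Sketch of the proposal: events go into seq-<N>/state files, where N is
    the starting sequence number, and each new segment begins with a
    self-contained checkpoint event."""

    def __init__(self, root, rotate_after=1000):
        self.root = root
        self.rotate_after = rotate_after  # rotate after this many events
        self.next_seq = 0                 # global event sequence number
        self.segment_start = 0            # first sequence number in the segment
        self.state = {}                   # aggregate state, e.g. the UTxO set
        self.fh = self._open_segment(checkpoint=False)

    def _open_segment(self, checkpoint):
        d = os.path.join(self.root, f"seq-{self.segment_start}")
        os.makedirs(d, exist_ok=True)
        fh = open(os.path.join(d, "state"), "a")
        if checkpoint:
            # First entry lets a reader recover without any earlier segments.
            fh.write(json.dumps({"tag": "Checkpoint", "state": self.state}) + "\n")
            fh.flush()
        return fh

    def append(self, event):
        self.fh.write(json.dumps(event) + "\n")
        self.fh.flush()
        # "delta" is a made-up stand-in for whatever state an event changes.
        self.state.update(event.get("delta", {}))
        self.next_seq += 1
        if self.next_seq - self.segment_start >= self.rotate_after:
            self.rotate()

    def rotate(self):
        """Close the current segment and start a new one led by a checkpoint."""
        self.fh.close()
        self.segment_start = self.next_seq
        self.fh = self._open_segment(checkpoint=True)
```

On startup, a reader would pick the directory with the highest starting sequence number and replay it, using the leading checkpoint to restore state.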

This would allow a 3rd-party agent to detect the checkpoint and trigger any appropriate archival, backup, or cleanup without interrupting the hydra head. Heads would also recover faster after a failure, and memory usage would be kept within a bounded limit.

Again, I'm super unfamiliar with the hydra codebase, so there might be more subtleties that are needed, but I just wanted to get the ball rolling on a discussion :)

@Quantumplation Quantumplation added the 💭 idea An idea or feature request label Aug 21, 2024
@github-project-automation github-project-automation bot moved this to In Progress 🕐 in ☕ Hydra Team Work Aug 21, 2024
@ch1bo ch1bo mentioned this issue Aug 21, 2024
@ch1bo
Member

ch1bo commented Aug 22, 2024

As it was only mentioned in passing in this item, we might want to scope separate item(s) about the memory growth in:

  • The API server keeps an ever growing history of output events. This could be addressed by projecting it from the base event stream (stored in state) and re-reading the persisted events on demand. The proposed checkpointing from this item here would truncate the API history too.

  • The network reliability component, which keeps an ever growing outbound buffer of sent messages. The algorithm must be changed to have only bounded resilience against network faults and, consequently, a bounded buffer of messages it can resend.
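For the second point, one bounded-buffer shape (an illustrative sketch, not the actual reliability layer — `BoundedResendBuffer` and its methods are invented here) is a fixed-capacity deque that drops the oldest sent message once full:

```python
from collections import deque


class BoundedResendBuffer:
    """Keeps at most `capacity` sent messages for resending. Older messages
    are dropped, bounding memory at the cost of only healing network faults
    that fall within the retained window."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # (seq, message) pairs
        self.next_seq = 0

    def record(self, message):
        # deque with maxlen silently evicts the oldest entry when full
        self.buf.append((self.next_seq, message))
        self.next_seq += 1

    def resend_from(self, seq):
        """Return still-retained messages with sequence number >= seq."""
        return [m for (s, m) in self.buf if s >= seq]
```

A peer asking for a sequence number older than the window would then need some other recovery path, which is exactly the bounded-resilience trade-off described above.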

@ch1bo
Member

ch1bo commented Sep 10, 2024

Created #1618 to cover the API server part of tackling memory growth.

@noonio
Contributor

noonio commented Feb 4, 2025

What remains to do here is the state file that contains all our events; this can be turned into an issue once the other items are addressed.

@noonio noonio added the amber ⚠️ Medium complexity or partly unclear feature label Feb 4, 2025
@noonio noonio added this to the 0.x.x milestone Mar 27, 2025
@ch1bo ch1bo removed this from the 0.22.0 milestone Mar 31, 2025
@noonio noonio moved this to Triage 🏥 in ☕ Hydra Team Work Apr 15, 2025
@ffakenz ffakenz self-assigned this Apr 23, 2025
@ffakenz ffakenz linked a pull request Apr 26, 2025 that will close this issue
@ch1bo ch1bo changed the title from "Event Log Rotation and Memory Growth" to "Event Log Rotation" Apr 27, 2025
@ch1bo
Member

ch1bo commented Apr 27, 2025

While the midnight glacier drop was originally a use case that could benefit from this, the architect of that solution was actually happy to hear that we keep the full log of all transactions (and don't squash things into checkpoints). Disk usage after millions of transactions was also no longer a concern for them.

That leaves restart times as the only remaining motivation for this issue. Checkpoints are not the only possible solution there either: the storage format, the encoding, and the code in general still have lots of room for optimization. See also #1585.

@ffakenz
Contributor

ffakenz commented Apr 29, 2025

We need to understand the impact of this request on the ?history API.

As mentioned in #1581 (comment)

Maybe it's enough to make it configurable somehow.

@Quantumplation
Contributor Author

While startup times are probably the lowest-hanging fruit to motivate this issue, my thinking when I wrote this had very little to do with startup times. I also wasn't envisioning that this would (by default) discard the transaction history, but rather segment it, so that operators can build processes around those segments to meet whatever their organizational needs are.

This comes up a lot in large enterprise systems, particularly around things like logs or event streams. Large scale enterprise deployments of Hydra are likely going to need to do some combination of the following:

  • Back up the event log progressively; if the file is constantly being written to, this is difficult, because you have no knowledge of the file's format and may end up copying it mid-write, leaving a malformed backup;
  • Move portions of the event log to different storage; a large enterprise deployment might keep the last 24 hours of logs on an expensive SSD, move the last 2 weeks of activity to a cheap HDD with disk-level compression enabled, and move anything older to extremely cheap glacier storage. These files are still available if needed (sometimes even surfaced via the API), just accessed less frequently than the "recent" history;
  • Decide, based on enterprise policy, when to truncate; for example, deleting records older than one month is an easy way to stay compliant with GDPR "right to be forgotten" requests: any data is expunged after one month (the time limit for an organization to comply with such a request anyway), so no additional action is required on the organization's part when someone makes one.
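With segment directories named by starting sequence number, an external agent covering all three needs could be as simple as the following sketch (everything here — the function name, the `seq-` naming assumption, the retention policy — is illustrative, not part of the proposal's spec):

```python
import os
import shutil
import time


def archive_closed_segments(data_dir, archive_dir, keep_days=14):
    """Illustrative archival agent: every segment except the one with the
    highest starting sequence number is closed (no longer written to), so it
    can be safely copied to cheaper storage and, past a retention window,
    deleted from the live disk."""
    segments = sorted(
        (d for d in os.listdir(data_dir) if d.startswith("seq-")),
        key=lambda d: int(d.split("-")[1]),
    )
    closed = segments[:-1]  # the highest-numbered segment is still live
    archived = []
    for seg in closed:
        src = os.path.join(data_dir, seg)
        dst = os.path.join(archive_dir, seg)
        if not os.path.exists(dst):
            shutil.copytree(src, dst)  # safe: the node no longer writes here
            archived.append(seg)
        # enforce a retention policy on the live copy, e.g. for GDPR
        if time.time() - os.path.getmtime(src) > keep_days * 86400:
            shutil.rmtree(src)
    return archived
```

Because only closed segments are touched, there is no risk of copying a file mid-write, which is exactly the problem described in the first bullet.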

@ch1bo
Member

ch1bo commented Apr 29, 2025

Thanks for adding more context @Quantumplation! Would you be fine with this initial scope?

  • hydra-node does not rotate by default
  • a new --persistence-rotate-after <number of events> option allows configuring event log rotation
  • a new PersistenceRotated server output indicates to clients when this happens
  • the feature will work with whatever EventSink/EventSource pair is used for event persistence (= EventStore) by hydra-node
  • the websocket API will only resend the current logs history when ?history=true is set
  • no external command to trigger event log rotation
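Assuming the scope above, a client on the websocket API could react to the proposed PersistenceRotated output roughly like this — a sketch only, since the issue does not specify the output's JSON shape (the "segment" field in particular is invented here):

```python
import json


def handle_server_output(raw, on_rotated):
    """Dispatch one websocket message; invoke `on_rotated` when the proposed
    PersistenceRotated output announces that a log segment was closed."""
    msg = json.loads(raw)
    if msg.get("tag") == "PersistenceRotated":
        # A 3rd-party agent could now archive/back up the closed segment.
        # NB: the "segment" payload is a guess at what such an output carries.
        on_rotated(msg)
    return msg["tag"]
```

This is the hook that would let the archival workflows discussed earlier run without any external "trigger checkpoint" command.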

@Quantumplation
Contributor Author

Yes, that seems totally reasonable, though --persistence-rotate-after <time interval> might be more useful than a number of events.

@ffakenz ffakenz moved this from Triage 🏥 to In progress 🕐 in ☕ Hydra Team Work Apr 29, 2025
@ffakenz ffakenz moved this from In progress 🕐 to Triage 🏥 in ☕ Hydra Team Work Apr 29, 2025
@ffakenz ffakenz moved this from Triage 🏥 to In progress 🕐 in ☕ Hydra Team Work Apr 30, 2025
@ch1bo ch1bo added 💬 feature A feature on our roadmap and removed 💭 idea An idea or feature request 💬 feature A feature on our roadmap labels May 5, 2025
@ffakenz ffakenz linked a pull request May 8, 2025 that will close this issue
@noonio noonio added this to the 0.22.0 milestone May 13, 2025
@ffakenz ffakenz moved this from In progress 🕐 to In review 👀 in ☕ Hydra Team Work May 16, 2025