Event Log Rotation #1581

Open
Quantumplation opened this issue Aug 21, 2024 · 8 comments · May be fixed by #1997
Assignees
Labels
amber ⚠️ Medium complexity or partly unclear feature
Milestone

Comments

@Quantumplation
Contributor

Why

While working on the hydra-doom project, we noticed that both the on-disk state and the in-memory state grew without bound (see #1572).

This meant that, under the sustained load the hydra-doom demo was producing, nodes became inoperable after just a few hours. The hack in #1572 helped, but the on-disk state still needed to be rotated regularly, by hand.

Rotation consisted of stopping the nodes, renaming the data directory, bringing the nodes back up, and then shipping the old data directory off to archival storage. This only worked because we were using offline nodes and didn't mind interrupting the head.

What

I'd like to propose that the hydra head implement checkpointing for the event log.

How

This is just a proposed implementation; feel free to adapt it to better fit the intricacies of the hydra codebase.

  • The hydra node default file sync will be updated to write into event log files or directories that are named by a starting sequence number, such as data/seq-0/state or data/seq-12345/state
  • The first message in the event log will be a "checkpoint" event, which contains any state needed to recover from that point in time without regard to any messages that came before
  • After a certain number of messages written (or time interval, or bytes, etc.), the hydra node will close the previous files, create a new file/directory, write the checkpoint event to the file, drop any previous events from memory, and then emit this checkpoint event to the websocket API
  • On startup, the default file source would identify the "latest" event log directory and begin consuming events from that log; the initial checkpoint event would allow it to recover any state, such as the current UTXO, etc.
  • A new websocket message, "trigger checkpoint", would allow external orchestration to request a checkpoint when it's required, e.g. for maintenance windows, file backpressure, etc.
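To make the proposed flow concrete, here is an illustrative sketch only — Hydra itself is written in Haskell, and the names (`RotatingEventLog`), the JSON event shapes (`tag`, `delta`), and the `rotate_after` threshold are all invented for this example:

```python
import json
import os


class RotatingEventLog:
    """Sketch of the proposal: events go into seq-<N>/state files, where N is
    the starting sequence number, and each new segment begins with a
    self-contained checkpoint event."""

    def __init__(self, root, rotate_after=1000):
        self.root = root
        self.rotate_after = rotate_after  # rotate after this many events
        self.next_seq = 0                 # global event sequence number
        self.segment_start = 0            # first sequence number in the segment
        self.state = {}                   # aggregate state, e.g. the UTxO set
        self.fh = self._open_segment(checkpoint=False)

    def _open_segment(self, checkpoint):
        d = os.path.join(self.root, f"seq-{self.segment_start}")
        os.makedirs(d, exist_ok=True)
        fh = open(os.path.join(d, "state"), "a")
        if checkpoint:
            # First entry lets a reader recover without any earlier segments.
            fh.write(json.dumps({"tag": "Checkpoint", "state": self.state}) + "\n")
            fh.flush()
        return fh

    def append(self, event):
        self.fh.write(json.dumps(event) + "\n")
        self.fh.flush()
        # "delta" is a made-up stand-in for whatever state an event changes.
        self.state.update(event.get("delta", {}))
        self.next_seq += 1
        if self.next_seq - self.segment_start >= self.rotate_after:
            self.rotate()

    def rotate(self):
        """Close the current segment and start a new one led by a checkpoint."""
        self.fh.close()
        self.segment_start = self.next_seq
        self.fh = self._open_segment(checkpoint=True)
```

On startup, a reader would pick the directory with the highest starting sequence number and replay it, using the leading checkpoint to restore state.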

This would allow a 3rd-party agent to detect the checkpoint and trigger any appropriate archival, backup, or cleanup without interrupting the hydra head. Heads would also recover faster after a failure, and memory usage would be kept within a bounded limit.

Again, I'm super unfamiliar with the hydra codebase, so there might be more subtleties that are needed, but I just wanted to get the ball rolling on a discussion :)

@Quantumplation Quantumplation added the 💭 idea An idea or feature request label Aug 21, 2024
@github-project-automation github-project-automation bot moved this to In Progress 🕐 in ☕ Hydra Team Work Aug 21, 2024
@ch1bo ch1bo mentioned this issue Aug 21, 2024
@ch1bo
Member

ch1bo commented Aug 22, 2024

As it was only mentioned in passing in this item, we might want to scope separate item(s) about the memory growth in:

  • The API server keeps an ever growing history of output events. This could be addressed by projecting it from the base event stream (stored in state) and re-reading the persisted events on demand. The proposed checkpointing from this item here would truncate the API history too.

  • The network reliability component, which keeps an ever growing outbound buffer of sent messages. The algorithm must be changed to have only bounded resilience against network faults and, consequently, a bounded buffer of messages it can resend.
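For the second point, one bounded-buffer shape (an illustrative sketch, not the actual reliability layer — `BoundedResendBuffer` and its methods are invented here) is a fixed-capacity deque that drops the oldest sent message once full:

```python
from collections import deque


class BoundedResendBuffer:
    """Keeps at most `capacity` sent messages for resending. Older messages
    are dropped, bounding memory at the cost of only healing network faults
    that fall within the retained window."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # (seq, message) pairs
        self.next_seq = 0

    def record(self, message):
        # deque with maxlen silently evicts the oldest entry when full
        self.buf.append((self.next_seq, message))
        self.next_seq += 1

    def resend_from(self, seq):
        """Return still-retained messages with sequence number >= seq."""
        return [m for (s, m) in self.buf if s >= seq]
```

A peer asking for a sequence number older than the window would then need some other recovery path, which is exactly the bounded-resilience trade-off described above.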

@ch1bo
Member

ch1bo commented Sep 10, 2024

Created #1618 to cover the API server part of tackling memory growth.

@noonio
Contributor

noonio commented Feb 4, 2025

What remains to do here is the state file that contains all our events; this can be turned into an issue once the other items are addressed.

@noonio noonio added the amber ⚠️ Medium complexity or partly unclear feature label Feb 4, 2025
@noonio noonio added this to the 0.x.x milestone Mar 27, 2025
@ch1bo ch1bo removed this from the 0.22.0 milestone Mar 31, 2025
@noonio noonio moved this to Triage 🏥 in ☕ Hydra Team Work Apr 15, 2025
@ffakenz ffakenz self-assigned this Apr 23, 2025
@ffakenz ffakenz linked a pull request Apr 26, 2025 that will close this issue
@ch1bo ch1bo changed the title from "Event Log Rotation and Memory Growth" to "Event Log Rotation" Apr 27, 2025
@ch1bo
Member

ch1bo commented Apr 27, 2025

While the midnight glacier drop was originally a use case that could benefit from this, the architect of that solution was actually happy to hear that we keep the full log of all transactions (and don't squash things into checkpoints). Disk usage after millions of transactions was also no longer a concern for them.

That leaves restart times as the only remaining motivation for this issue. Checkpoints are not the only possible solution there either: the storage format, the encoding, and the code in general still have lots of room for optimization. See also #1585.

@ffakenz
Contributor

ffakenz commented Apr 29, 2025

We need to understand the impact of this request on the ?history API.

As mentioned in #1581 (comment)

Maybe it's enough to make it configurable somehow.

@Quantumplation
Contributor Author

While startup times are probably the lowest-hanging fruit to motivate this issue, my thinking when I wrote this had very little to do with startup times. I also wasn't envisioning that this would (by default) discard the transaction history, but rather segment it, so that operators can build processes around those segments to meet whatever their organizational needs are.

This comes up a lot in large enterprise systems, particularly around things like logs or event streams. Large scale enterprise deployments of Hydra are likely going to need to do some combination of the following:

  • Back up the event log progressively; if the file is constantly being written to, this is difficult, because you have no knowledge of the file's format and may end up copying it mid-write, leaving a malformed backup;
  • Move portions of the event log to different storage; a large enterprise deployment might keep the last 24 hours of logs on an expensive SSD, move the last 2 weeks of activity to a cheap HDD with disk-level compression enabled, and move anything older to extremely cheap glacier storage. These files are still available if needed (sometimes even surfaced via the API), just accessed less frequently than the "recent" history;
  • Decide, based on enterprise policy, when to truncate; for example, deleting records older than one month is an easy way to stay compliant with GDPR "right to be forgotten" requests: any data is expunged after one month (the time limit for an organization to comply with such a request anyway), so no additional action is required on the organization's part when someone makes one.
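With segment directories named by starting sequence number, an external agent covering all three needs could be as simple as the following sketch (everything here — the function name, the `seq-` naming assumption, the retention policy — is illustrative, not part of the proposal's spec):

```python
import os
import shutil
import time


def archive_closed_segments(data_dir, archive_dir, keep_days=14):
    """Illustrative archival agent: every segment except the one with the
    highest starting sequence number is closed (no longer written to), so it
    can be safely copied to cheaper storage and, past a retention window,
    deleted from the live disk."""
    segments = sorted(
        (d for d in os.listdir(data_dir) if d.startswith("seq-")),
        key=lambda d: int(d.split("-")[1]),
    )
    closed = segments[:-1]  # the highest-numbered segment is still live
    archived = []
    for seg in closed:
        src = os.path.join(data_dir, seg)
        dst = os.path.join(archive_dir, seg)
        if not os.path.exists(dst):
            shutil.copytree(src, dst)  # safe: the node no longer writes here
            archived.append(seg)
        # enforce a retention policy on the live copy, e.g. for GDPR
        if time.time() - os.path.getmtime(src) > keep_days * 86400:
            shutil.rmtree(src)
    return archived
```

Because only closed segments are touched, there is no risk of copying a file mid-write, which is exactly the problem described in the first bullet.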

@ch1bo
Member

ch1bo commented Apr 29, 2025

Thanks for adding more context @Quantumplation! Would you be fine with this initial scope?

  • hydra-node does not rotate by default
  • a new --persistence-rotate-after <number of events> option allows configuring event log rotation
  • a new PersistenceRotated server output indicates to clients when this happens
  • the feature will work with whatever EventSink/EventSource pair is used for event persistence (= EventStore) by hydra-node
  • the websocket API will only resend the current logs history when ?history=true is set
  • no external command to trigger event log rotation
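Assuming the scope above, a client on the websocket API could react to the proposed PersistenceRotated output roughly like this — a sketch only, since the issue does not specify the output's JSON shape (the "segment" field in particular is invented here):

```python
import json


def handle_server_output(raw, on_rotated):
    """Dispatch one websocket message; invoke `on_rotated` when the proposed
    PersistenceRotated output announces that a log segment was closed."""
    msg = json.loads(raw)
    if msg.get("tag") == "PersistenceRotated":
        # A 3rd-party agent could now archive/back up the closed segment.
        # NB: the "segment" payload is a guess at what such an output carries.
        on_rotated(msg)
    return msg["tag"]
```

This is the hook that would let the archival workflows discussed earlier run without any external "trigger checkpoint" command.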

@Quantumplation
Contributor Author

Yes, that seems totally reasonable, though --persistence-rotate-after <time interval> might be more useful than a number of events.

@ffakenz ffakenz moved this from Triage 🏥 to In progress 🕐 in ☕ Hydra Team Work Apr 29, 2025
@ffakenz ffakenz moved this from In progress 🕐 to Triage 🏥 in ☕ Hydra Team Work Apr 29, 2025
@ffakenz ffakenz moved this from Triage 🏥 to In progress 🕐 in ☕ Hydra Team Work Apr 30, 2025
@ch1bo ch1bo added 💬 feature A feature on our roadmap and removed 💭 idea An idea or feature request 💬 feature A feature on our roadmap labels May 5, 2025
@ffakenz ffakenz linked a pull request May 8, 2025 that will close this issue
@noonio noonio added this to the 0.22.0 milestone May 13, 2025
@ffakenz ffakenz moved this from In progress 🕐 to In review 👀 in ☕ Hydra Team Work May 16, 2025