fix(postgres): bump backup deadlines, expand repo1 PVC, tighten repo2 retention #140
Open
… retention

Apr 21 incident: repo1 Longhorn PVC (live 500 Gi vs Git 300 Gi) filled -> archive-push to repo2 MinIO broke -> primary pg_wal accumulated. Applied live: retention cuts (~129 GB freed), Tier 1 Longhorn cleanup (~716 GB freed). This codifies sustainable config so drift doesn't recur.

- CronJob activeDeadlineSeconds: full 3600->7200, diff 1800->5400. Healthy runs observed 28-45 min since Apr 14 cluster upgrade; prior tight deadlines silently killed jobs via DeadlineExceeded.
- repo1 PVC: 300 Gi -> 1000 Gi (live is 500 Gi via manual expand). Accommodates ~40 GB/day WAL + retention + margin.
- repo2 retention: full 8->4, diff 14->7, archive 4->2. Matches what was applied live via pgbackrest expire. PITR: 4 weekly restore points + ~14 days continuous.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
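The repo1 sizing argument can be sanity-checked with rough arithmetic. The sketch below takes the ~40 GB/day WAL figure from the commit message and asks how many days of WAL each PVC size could hold if it stored nothing else. That is a deliberate simplification: it ignores the full and diff backups that share the volume, and the helper name `wal_days` is hypothetical, not something from the repository.

```python
GIB = 2**30  # bytes per GiB (the Kubernetes "Gi" unit)
GB = 10**9   # bytes per decimal GB

WAL_GB_PER_DAY = 40  # approximate daily WAL volume quoted in the commit message


def wal_days(pvc_gib: float, wal_gb_per_day: float = WAL_GB_PER_DAY) -> float:
    """Days of WAL a PVC could hold if it stored nothing but WAL (assumption)."""
    return pvc_gib * GIB / GB / wal_gb_per_day


# Declared size in Git, live size after manual expand, and the size this PR sets.
for size_gib in (300, 500, 1000):
    print(f"{size_gib:>4} Gi -> {wal_days(size_gib):.1f} days of WAL")
```

Under these assumptions the three sizes come out at roughly 8, 13, and 27 days of WAL, which gives a feel for how little margin the 300 Gi declaration left once backups are added on top.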
Context
Apr 21 incident: repo1 Longhorn PVC filled (live 500 Gi vs declared 300 Gi) → WAL archive-push to repo2 MinIO broke → primary pg_wal accumulated. Applied live: retention cuts (~129 GB freed), Tier 1 Longhorn cleanup (~716 GB freed). This PR codifies the sustainable settings so drift doesn't recur.

Changes
- activeDeadlineSeconds: full 3600 → 7200, diff 1800 → 5400. Healthy runs observed at 28–45 min since the Apr 14 cluster upgrade; the old tight deadlines were silently killing jobs via DeadlineExceeded.
- retention-full 8 → 4, retention-diff 14 → 7, retention-archive 4 → 2. Matches what was applied live via pgbackrest expire. PITR window: 4 weekly restore points + ~14 days continuous.

Verify post-merge
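One quick check on the deadline change, offered as a sketch rather than one of the PR's own verification steps: compare each old and new activeDeadlineSeconds value against the slowest healthy run quoted above (45 min). The PR text does not say which job type produced the 28–45 min range, so applying it to both full and diff is an assumption, and `headroom_min` is a hypothetical helper.

```python
OBSERVED_MIN = (28, 45)  # healthy run durations in minutes, from the PR text

# Old and new activeDeadlineSeconds values, as listed under Changes.
DEADLINES_S = {
    "full": {"old": 3600, "new": 7200},
    "diff": {"old": 1800, "new": 5400},
}


def headroom_min(deadline_s: int, slowest_min: int = OBSERVED_MIN[1]) -> float:
    """Minutes between the slowest observed run and the deadline (negative = job gets killed)."""
    return deadline_s / 60 - slowest_min


for job, values in DEADLINES_S.items():
    for label in ("old", "new"):
        margin = headroom_min(values[label])
        status = "ok" if margin > 0 else "DeadlineExceeded risk"
        print(f"{job:<4} {label}: {values[label]:>4}s, headroom {margin:+.0f} min ({status})")
```

With these numbers the old diff deadline (30 min) sits 15 min below the slowest observed run, which is consistent with jobs being killed, while the new values leave 75 min (full) and 45 min (diff) of margin.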
Out of scope (separate tracks)
- … cxs-pg-repo-host-0 pod restart).
- rvfc-0 replica reinit (TL 24 vs 28 divergence from Apr 14 upgrade).
- … mc rm --recursive for empty-shell prefixes, mc admin info for capacity visibility), for Gissur.

🤖 Generated with Claude Code