fix(postgres): bump backup deadlines, expand repo1 PVC, tighten repo2 retention#140

Open
JoiDaggir wants to merge 1 commit into main from
fix/pg-backup-deadlines-capacity-retention

@JoiDaggir
Collaborator

Context

Apr 21 incident: repo1 Longhorn PVC filled (live 500 Gi vs declared 300 Gi) → WAL archive-push to repo2 MinIO broke → primary pg_wal accumulated. Applied live: retention cuts (~129 GB freed), Tier 1 Longhorn cleanup (~716 GB freed). This PR codifies the sustainable settings so drift doesn't recur.

Changes

  • CronJob activeDeadlineSeconds: full 3600 → 7200, diff 1800 → 5400. Healthy runs observed at 28–45 min since the Apr 14 cluster upgrade; the old tight deadlines were silently killing jobs via DeadlineExceeded.
  • repo1 PVC: 300 Gi → 1000 Gi (live was 500 Gi via manual expansion; Git was out of sync). Accommodates ~40 GB/day WAL + retention + margin.
  • repo2 retention: retention-full 8 → 4, retention-diff 14 → 7, retention-archive 4 → 2. Matches what was applied live via pgbackrest expire. PITR window: 4 weekly restore points + ~14 days continuous.
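The deadline and capacity figures above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not part of the change itself. It assumes "Gi" means GiB (2^30 bytes) and "GB" means 10^9 bytes, and it counts only WAL volume, ignoring the space taken by full and differential backups. The helper names `headroom_ratio` and `wal_days` are hypothetical.

```python
# Sanity-check arithmetic for the numbers quoted in this PR.
# Assumption: "Gi" means GiB (2**30 bytes) and "GB" means 10**9 bytes.

GIB = 2**30
GB = 10**9


def headroom_ratio(deadline_s: int, observed_min: float) -> float:
    """Return the deadline divided by the observed run time (both in seconds)."""
    return deadline_s / (observed_min * 60)


def wal_days(pvc_gi: int, wal_gb_per_day: float = 40.0) -> float:
    """Days of WAL alone that fit in a PVC, ignoring backups and overhead."""
    return pvc_gi * GIB / (wal_gb_per_day * GB)


if __name__ == "__main__":
    slowest_min = 45  # upper end of the observed 28-45 min range
    for name, old, new in [("full", 3600, 7200), ("diff", 1800, 5400)]:
        print(
            f"{name}: old x{headroom_ratio(old, slowest_min):.2f}, "
            f"new x{headroom_ratio(new, slowest_min):.2f}"
        )
    for size in (300, 500, 1000):
        print(f"{size} Gi holds ~{wal_days(size):.1f} days of WAL at 40 GB/day")
```

A ratio below 1 means a run of that length would hit the deadline. A 45-minute run exceeds the old 1800 s diff deadline but fits inside the old 3600 s full deadline; the description does not say which job type the 28–45 min range was measured on, so treat the per-job comparison as illustrative. The WAL-only figure is an upper bound on retention days, since backups share the same volume.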

Verify post-merge

  1. ArgoCD syncs within ~3 min.
  2. Longhorn online-expands repo1 PVC 500 → 1000 Gi (no downtime, no pod restart).
  3. Next scheduled CronJob run completes inside the new deadlines.
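Steps 2 and 3 can be checked from the JSON that `kubectl get pvc <name> -o json` and `kubectl get job <name> -o json` return. The sketch below only inspects already-parsed JSON; the resource names and namespace are not given in this PR, so fetching the objects is left to the reader. The function names are hypothetical, and the sample documents are hand-written, not real cluster output.

```python
import json


def pvc_capacity(pvc: dict) -> str:
    """Return the capacity reported in the PVC status, e.g. '1000Gi'."""
    return pvc["status"]["capacity"]["storage"]


def job_deadline_exceeded(job: dict) -> bool:
    """True if the Job has a Failed condition with reason DeadlineExceeded."""
    conditions = job.get("status", {}).get("conditions") or []
    for cond in conditions:
        if (
            cond.get("type") == "Failed"
            and cond.get("status") == "True"
            and cond.get("reason") == "DeadlineExceeded"
        ):
            return True
    return False


if __name__ == "__main__":
    # Hand-written sample documents standing in for kubectl output.
    sample_pvc = json.loads('{"status": {"capacity": {"storage": "1000Gi"}}}')
    sample_job = json.loads(
        '{"status": {"conditions": [{"type": "Complete", "status": "True"}]}}'
    )
    print("PVC expanded:", pvc_capacity(sample_pvc) == "1000Gi")
    print("Job hit deadline:", job_deadline_exceeded(sample_job))
```

Comparing against `status.capacity` rather than `spec.resources.requests` matters here: the spec changes as soon as ArgoCD syncs, while the status only updates once the expansion has actually completed.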

Out of scope (separate tracks)

  • pgBackRest upgrade — not needed (99.9% pipe_w hang fixed operationally by yesterday's cxs-pg-repo-host-0 pod restart).
  • rvfc-0 replica reinit (TL 24 vs 28 divergence from Apr 14 upgrade).
  • Grafana metrics backend (dead Prometheus datasource since Rancher Monitoring removal).
  • MinIO admin-side cleanup (mc rm --recursive for empty-shell prefixes, mc admin info for capacity visibility) — for Gissur.

🤖 Generated with Claude Code

… retention

Apr 21 incident: repo1 Longhorn PVC (live 500 Gi vs Git 300 Gi) filled ->
archive-push to repo2 MinIO broke -> primary pg_wal accumulated.
Applied live: retention cuts (~129 GB freed), Tier 1 Longhorn cleanup
(~716 GB freed). This codifies sustainable config so drift doesn't recur.

- CronJob activeDeadlineSeconds: full 3600->7200, diff 1800->5400.
  Healthy runs observed 28-45 min since Apr 14 cluster upgrade;
  prior tight deadlines silently killed jobs via DeadlineExceeded.
- repo1 PVC: 300 Gi -> 1000 Gi (live is 500 Gi via manual expand).
  Accommodates ~40 GB/day WAL + retention + margin.
- repo2 retention: full 8->4, diff 14->7, archive 4->2.
  Matches what was applied live via pgbackrest expire.
  PITR: 4 weekly restore points + ~14 days continuous.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>