fix(postgres): bump backup deadlines, expand repo1 PVC, tighten repo2 retention by JoiDaggir · Pull Request #140 · smartdataHQ/cxs

JoiDaggir · 2026-04-21T15:26:35Z

Context

Apr 21 incident: repo1 Longhorn PVC filled (live 500 Gi vs declared 300 Gi) → WAL archive-push to repo2 MinIO broke → primary pg_wal accumulated. Applied live: retention cuts (~129 GB freed), Tier 1 Longhorn cleanup (~716 GB freed). This PR codifies the sustainable settings so drift doesn't recur.

Changes

CronJob activeDeadlineSeconds: full 3600 → 7200, diff 1800 → 5400. Healthy runs observed at 28–45 min since the Apr 14 cluster upgrade; the old tight deadlines were silently killing jobs via DeadlineExceeded.
repo1 PVC: 300 Gi → 1000 Gi (live was 500 Gi via manual expansion; Git was out of sync). Accommodates ~40 GB/day WAL + retention + margin.
repo2 retention: retention-full 8 → 4, retention-diff 14 → 7, retention-archive 4 → 2. Matches what was applied live via pgbackrest expire. PITR window: 4 weekly restore points + ~14 days continuous.
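
The deadline and sizing numbers above can be sanity-checked with some quick arithmetic. The sketch below uses only figures quoted in this PR (45 min worst observed run, ~40 GB/day WAL, the 300/500/1000 Gi PVC sizes). The function names are illustrative. Applying the 45 min figure to both job types is an assumption, since the PR does not break run times out by full vs diff. The fill-time result ignores space taken by the backups themselves, so it is an upper bound, not a capacity plan.

```python
GIB = 1024 ** 3  # bytes in one Gi (binary unit used for Kubernetes PVC sizes)
GB = 1000 ** 3   # bytes in one GB (decimal unit used for the WAL rate)


def deadline_headroom(deadline_seconds: int, worst_run_minutes: float) -> float:
    """Return how many times the worst observed run fits into the deadline."""
    return deadline_seconds / (worst_run_minutes * 60)


def days_of_wal(pvc_gi: int, wal_gb_per_day: float) -> float:
    """Upper bound on days of WAL the PVC could hold if nothing else used it."""
    return (pvc_gi * GIB) / (wal_gb_per_day * GB)


if __name__ == "__main__":
    worst_minutes = 45  # upper end of the 28-45 min range quoted in the PR
    for job, old, new in [("full", 3600, 7200), ("diff", 1800, 5400)]:
        print(
            f"{job}: old deadline {deadline_headroom(old, worst_minutes):.2f}x, "
            f"new deadline {deadline_headroom(new, worst_minutes):.2f}x the worst run"
        )
    for size_gi in (300, 500, 1000):
        print(f"{size_gi} Gi holds at most {days_of_wal(size_gi, 40):.1f} days of WAL")
```

A headroom below 1.0 means the worst observed run would not finish before the deadline, which is what the old 1800 s diff deadline yields under these assumptions.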

Verify post-merge

ArgoCD syncs within ~3 min.
Longhorn online-expands repo1 PVC 500 → 1000 Gi (no downtime, no pod restart).
Next scheduled CronJob run completes inside the new deadlines.
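
The last check can be done by comparing a Job's reported start and completion times against the new deadline. A minimal sketch follows, assuming the timestamps are copied by hand from the Job's `status.startTime` and `status.completionTime` fields. The helper names and the sample timestamps are hypothetical, not taken from a real run.

```python
from datetime import datetime


def parse_k8s_time(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp such as '2026-04-22T02:00:00Z'."""
    # Replace the trailing 'Z' so this also works on Python versions
    # whose fromisoformat() does not accept it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def ran_within_deadline(start: str, completion: str, deadline_seconds: int) -> bool:
    """True if the job completed no later than deadline_seconds after it started."""
    elapsed = (parse_k8s_time(completion) - parse_k8s_time(start)).total_seconds()
    return 0 <= elapsed <= deadline_seconds


if __name__ == "__main__":
    # Hypothetical 45-minute run checked against the new full-backup deadline.
    print(ran_within_deadline("2026-04-22T02:00:00Z", "2026-04-22T02:45:00Z", 7200))
```

A Job killed by DeadlineExceeded has no completion time at all, so a missing `status.completionTime` is itself a sign that the run did not finish.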

Out of scope (separate tracks)

pgBackRest upgrade — not needed (99.9% pipe_w hang fixed operationally by yesterday's cxs-pg-repo-host-0 pod restart).
rvfc-0 replica reinit (TL 24 vs 28 divergence from Apr 14 upgrade).
Grafana metrics backend (dead Prometheus datasource since Rancher Monitoring removal).
MinIO admin-side cleanup (mc rm --recursive for empty-shell prefixes, mc admin info for capacity visibility) — for Gissur.

🤖 Generated with Claude Code

… retention

Apr 21 incident: repo1 Longhorn PVC (live 500 Gi vs Git 300 Gi) filled -> archive-push to repo2 MinIO broke -> primary pg_wal accumulated. Applied live: retention cuts (~129 GB freed), Tier 1 Longhorn cleanup (~716 GB freed). This codifies sustainable config so drift doesn't recur.

- CronJob activeDeadlineSeconds: full 3600->7200, diff 1800->5400. Healthy runs observed 28-45 min since Apr 14 cluster upgrade; prior tight deadlines silently killed jobs via DeadlineExceeded.
- repo1 PVC: 300 Gi -> 1000 Gi (live is 500 Gi via manual expand). Accommodates ~40 GB/day WAL + retention + margin.
- repo2 retention: full 8->4, diff 14->7, archive 4->2. Matches what was applied live via pgbackrest expire. PITR: 4 weekly restore points + ~14 days continuous.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

JoiDaggir wants to merge 1 commit into main from fix/pg-backup-deadlines-capacity-retention
