Skip to content

Conversation

domdomegg
Copy link
Member

@domdomegg domdomegg commented Aug 22, 2025

Implements automated database backup functionality with Google Cloud Storage integration, including retention policies and MinIO support for local development.

Motivation and Context

This change adds critical database backup capabilities to ensure data durability and disaster recovery. The solution provides automated backups with configurable retention policies and supports both production (GCS) and development (MinIO) environments.

How Has This Been Tested?

  • Tested backup creation and restoration with MinIO in local development
  • Verified GCS bucket lifecycle policies for automatic deletion after 60 days
  • Tested backup retention and cleanup logic

Breaking Changes

No breaking changes. This is an additive feature that doesn't affect existing functionality.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • Implements 60-day retention period for backups as a safety net
  • Uses GCS lifecycle rules for automatic cleanup
  • Includes MinIO setup instructions for local development testing
  • Port-forwarding commands updated for consistency across documentation
  • Fixes Backup cadence implemented for database #184

domdomegg and others added 5 commits August 22, 2025 14:02
🏠 Remote-Dev: homespace
🏠 Remote-Dev: homespace
@domdomegg
Copy link
Member Author

@claude can you review?

Copy link
Contributor

claude bot commented Aug 22, 2025

Claude finished @domdomegg's task —— View job


PR Review: Database Backup Functionality

Comprehensive review of the database backup implementation with GCS integration and K8up operator.

✅ Overall Assessment

This is a well-implemented backup solution that adds critical data durability to the MCP Registry. The architecture is sound, using industry-standard tools (K8up/Restic) with proper cloud integration.

Key Strengths:

  • Clean provider pattern supporting both GCP production and local development
  • Proper retention policies with safety nets
  • Good separation of concerns in the codebase
  • Comprehensive documentation updates
  • Environment-specific configurations
🔧 Technical Implementation

Backup Architecture

  • K8up operator: Kubernetes-native backup solution using Restic
  • Dual storage: GCS for production, MinIO for local development
  • S3-compatible interface: Unified approach across environments
  • Scheduled operations: Daily backups at 4:46 AM, pruning at 5:46 AM

Code Quality

  • Well-structured provider pattern in deploy/pkg/providers/
  • Proper dependency management with Pulumi resources
  • Good error handling and resource cleanup
  • Environment-specific configurations (frequent backups for testing)
⚠️ Security & Reliability Concerns

Issues Found:

  1. Hardcoded Credentials in Local Provider (deploy/pkg/providers/local/provider.go:63):

    StringData: pulumi.StringMap{
        "password": pulumi.String("password"), // In production we use GCS, which is already encrypted
    • Uses weak password "password" for repository encryption
    • Comment acknowledges GCS encryption but local backups still need protection
  2. Default MinIO Credentials (deploy/pkg/providers/local/provider.go:69-70, 260-261):

    "accesskey": pulumi.String("minioadmin"),
    "secretkey": pulumi.String("minioadmin"),
    • Uses default credentials that are well-known
  3. Root Container Execution (deploy/pkg/k8s/backup.go:115):

    podSecurityContext: map[string]any{
        "runAsUser": 0, // Run as root to access all files
    • While necessary for backup access, consider if more restrictive permissions are possible

Security Recommendations:

  • Generate random password for local Restic repository
  • Use configurable MinIO credentials with secure defaults
  • Document security implications of root execution requirement
🏗️ Design & Architecture

Well-Designed Aspects:

  • Provider abstraction: Clean interface allowing easy addition of new cloud providers
  • Dependency management: Proper resource dependencies ensure correct deployment order
  • Configuration flexibility: Environment-specific settings for development vs production
  • Retention strategy: 28-day K8up pruning + 60-day GCS safety net

Areas for Enhancement:

  • Consider making backup schedules configurable via Pulumi config
  • Add monitoring/alerting configuration for backup failures
  • Document disaster recovery procedures beyond the current restore steps
📚 Documentation

Excellent Documentation Updates:

  • Clear setup instructions for both local and production environments
  • Step-by-step backup access and restoration procedures
  • Troubleshooting section with helpful kubectl commands
  • Updated port-forwarding commands for consistency

Minor Suggestions:

  • Add example of monitoring backup job status
  • Consider adding a backup verification procedure
  • Document retention policy rationale (why 60 vs 28 days)
🔍 Code-Specific Issues

File: deploy/pkg/k8s/backup.go

  • Line 26: Hardcoded K8up version 4.8.4 - consider making configurable
  • Line 71-73: Schedule times use specific minutes (:46) - document rationale
  • Lines 75-79: Test schedules very aggressive (every minute) - might overwhelm test environments

File: deploy/pkg/providers/gcp/provider.go

  • Line 208: 60-day retention hardcoded - should align with configuration management approach
  • Line 268-269: HMAC credentials stored as StringMap - verify this is secure for production

File: deploy/pkg/providers/local/provider.go

  • Line 269: MinIO endpoint uses internal DNS name - ensure this works in all local environments
  • Lines 230-236: Bucket initialization script could benefit from better error handling
✨ Recommendations

Priority 1 (Security):

  1. Generate secure random password for Restic repository in local environments
  2. Make MinIO credentials configurable rather than using defaults
  3. Add warning documentation about security implications

Priority 2 (Robustness):

  1. Make K8up version and schedules configurable
  2. Add backup monitoring/alerting configuration
  3. Improve error handling in MinIO initialization

Priority 3 (Enhancement):

  1. Add backup verification procedures
  2. Consider backup encryption at rest documentation
  3. Add example disaster recovery runbook

Recommendation: Approve with minor security improvements

This is a solid implementation that significantly improves the registry's data durability. The security concerns are primarily related to local development environments and don't affect production deployment security. The architecture is well-designed and the documentation is comprehensive.


@domdomegg
Copy link
Member Author

Re Claude's review:

  1. Using default passwords in local dev environment seems fine, and simplifies setup + contributor flow for almost no security benefit IMO.
  2. This is configuration as code, it is already configurable! Monitoring+alerting maybe a good idea, but going to punt that as out of scope for now (can use kubectl in mean time). Minio spins up fine locally with current code, Pulumi already handles errors.
  3. Think these are out of scope for getting something basic working now.

🏠 Remote-Dev: homespace
@domdomegg
Copy link
Member Author

@rdimitrov would you be up for reviewing this?

@rdimitrov
Copy link
Contributor

@domdomegg - I'll be happy to 👍 I'll have a look at it later today/tomorrow if it's not too late for you?

@domdomegg
Copy link
Member Author

Works great for me, thanks 🙌

Copy link
Contributor

@rdimitrov rdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t find anything concerning, so from my side I think it’s ready for merging 👍

The only thought that I have is more on the operational side. Backups usually only get used when things have already gone wrong which is the worst possible time to discover they’re not working 😄 So having some simple monitoring, playbooks for common scenarios, and maybe a periodic validation job could be nice future improvements.

Comment on lines +71 to +73
backupSchedule := "46 4 * * *" // Daily at 4:46 AM
pruneSchedule := "46 5 * * *" // Daily at 5:46 AM
keepDaily := 28 // Keep daily backups for 28 days
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we have these and others like it as constants?

Also I wasn't sure if the diff in the retention days is a good practice (28 vs 60) but turned out it is so nice 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to keep them inline for now, given we don't use them anywhere else and backup.go is fairly short. But if we end up wanting to configure them/change them often I can pull them out.

@domdomegg domdomegg merged commit 03ac9c9 into main Aug 25, 2025
5 checks passed
@domdomegg domdomegg deleted the adamj/backups branch August 25, 2025 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Backup cadence implemented for database
2 participants