
Backups #24

Open
10 tasks
kabilar opened this issue Jan 8, 2024 · 5 comments
kabilar commented Jan 8, 2024

Requirements

  • Routine and manual backups
  • Snapshots at X intervals
  • Preservation of data when the system goes down or comes back up
  • Data integrity
  • DANDI strategy?

Postgres database

  • Options
    • Heroku CLI
    • Postgres command pg_dump
      • Determine optimal strategy for pg-upload
    • GitHub Actions running the Postgres command pg_dump and writing to an S3 bucket

Data

  • S3 buckets
    • Remove delete permissions for all roles
    • S3 versioning pricing?
    • MFA delete?
  • Investigate datalad and git-annex
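Removing delete permissions could look roughly like the bucket policy below. The bucket name is a stand-in, and in practice the Deny would likely be scoped to specific roles rather than `*`:

```shell
# Hypothetical: deny object deletion on a stand-in bucket named linc-data
aws s3api put-bucket-policy --bucket linc-data --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyObjectDeletion",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
    "Resource": "arn:aws:s3:::linc-data/*"
  }]
}'
```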

Example Heroku CLI command

heroku pg:backups:capture --app linc-staging-terraform
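For completeness, a captured backup can also be pulled down and test-restored with the same tooling (the scratch database name below is made up):

```shell
# Download the most recent backup for the app (writes ./latest.dump)
heroku pg:backups:download --app linc-staging-terraform

# Restore it into a local scratch database (name is a placeholder)
createdb linc_restore_test
pg_restore --no-owner --clean --if-exists -d linc_restore_test latest.dump
```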

Example Postgres pg_dump script

#!/bin/bash
set -euo pipefail

# Database credentials (assumes the password is supplied via ~/.pgpass or PGPASSWORD)
HOST="ec2-35-168-130-158.compute-1.amazonaws.com"
PORT="5432"
USER="u8cfndphbguhq8"
DBNAME="dcq75eotjue787"

# Get all tables in the public schema, one name per line (-t -A: tuples only, unaligned)
TABLES=$(psql -h "$HOST" -p "$PORT" -U "$USER" -d "$DBNAME" -t -A -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public';")

# Start a fresh backup file, then append the dump of each table
: > backup.sql
for TABLE in $TABLES; do
    pg_dump -h "$HOST" -p "$PORT" -U "$USER" -t "$TABLE" "$DBNAME" >> backup.sql
done
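Since the script above appends plain SQL to backup.sql, a restore is just replaying that file with psql. The target host and database here are placeholders:

```shell
#!/bin/bash
# Sketch: replay the plain-SQL dump into a scratch database
set -euo pipefail

HOST="localhost"       # placeholder target
PORT="5432"
USER="postgres"        # placeholder user
DBNAME="restore_test"  # placeholder database

createdb -h "$HOST" -p "$PORT" -U "$USER" "$DBNAME"
psql -h "$HOST" -p "$PORT" -U "$USER" -d "$DBNAME" -f backup.sql
```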

kabilar commented Jan 8, 2024

Let's see if there are possible alternatives within AWS. Possibly duplicate the bucket into cold storage?

@kabilar changed the title from "Investigate datalad and git-annex engineering for LINC project" to "Backups" on Jan 19, 2024

aaronkanzer commented Jan 24, 2024

@kabilar -- leaving some notes / research here, feel free to edit, append, etc.

Postgres:

Roni mentioned that we get continuous backups (purely via the Postgres instance type we are using): https://devcenter.heroku.com/articles/heroku-postgres-data-safety-and-continuous-protection. (For reference, we are using the standard-0 Postgres instance type; the code reference is in dandi-infrastructure.)

All Heroku Postgres databases are protected through continuous physical backups. These backups are stored in the same region as the database and retrieved through [Heroku Postgres Rollbacks](https://devcenter.heroku.com/articles/heroku-postgres-rollback) on Standard-tier or higher databases

We could go one step further and perform Heroku "logical" backups: https://devcenter.heroku.com/articles/heroku-postgres-logical-backups

This is pretty much Heroku's CLI tool making pg_dump commands more user-friendly. Depending on our preferences, we could invoke this command if desired -- it doesn't seem too costly, just another thing to maintain. I'm not sure there is much to gain here other than being vendor-agnostic with the pg_dump output -- e.g. we could apply the backup files to AWS-version of Postgres if one day we wanted to move away from Heroku, etc.
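A minimal version of that vendor-agnostic flow, assuming the Heroku CLI is authenticated (app name reused from the example above; custom-format output can later be pg_restore'd into RDS or any other Postgres):

```shell
# Pull the connection string Heroku manages for us
DATABASE_URL=$(heroku config:get DATABASE_URL --app linc-staging-terraform)

# Custom-format logical dump: restorable into any Postgres, not just Heroku
pg_dump --format=custom --no-owner "$DATABASE_URL" > logical-backup.dump
```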

S3 Data:

S3 has a few mechanisms that we should use for data integrity & prevention of data loss.

Enable S3 Versioning -- this is simply a property that we switch "on" for the bucket. Version IDs let us go back in time to earlier object versions. I still need to check on the long-term pricing implications...
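Enabling versioning and later recovering an earlier version is a two-step affair (bucket, key, and version ID below are all stand-ins):

```shell
# Switch versioning "on" for a stand-in bucket
aws s3api put-bucket-versioning \
  --bucket linc-data \
  --versioning-configuration Status=Enabled

# Later: list the historical version IDs for an object...
aws s3api list-object-versions --bucket linc-data --prefix path/to/asset.nii.gz

# ...and fetch a specific one back ("abc123" is a placeholder version ID)
aws s3api get-object --bucket linc-data --key path/to/asset.nii.gz \
  --version-id abc123 restored-asset.nii.gz
```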

Object Lock -- this creates a potentially indefinite time in which an S3 object cannot be deleted or overwritten: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html
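With Object Lock enabled on a bucket, a retention window can be stamped onto individual objects like so (bucket, key, and date are illustrative):

```shell
# Protect one object from deletion/overwrite until the given date
aws s3api put-object-retention \
  --bucket linc-data \
  --key path/to/asset.nii.gz \
  --retention '{"Mode": "GOVERNANCE", "RetainUntilDate": "2025-01-01T00:00:00Z"}'
```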

MFA Deletion -- we should definitely use this on top of proper IAM permissions (especially when it comes to any mechanism that could delete the entire bucket -- spooky!)

I'm going to evaluate the Terraform code we have to configure our S3 bucket. I need to read more of the codebase, but ideally these things are already present/obtainable. Worst case scenario, we go into the AWS management console directly (e.g. circumvent any infrastructure-as-code) and edit in the short-term as a safeguard.

Assets / datalad / git-annex

We could implement this as well via https://github.com/dandi/backups2datalad -- started some initial conversations -- Yarik, John, Dartmouth seem to be the point persons: https://dandiarchive.slack.com/archives/GMRLT5RQ8/p1706046120749959

I think we should evaluate what utility we gain from datalad on top of S3. I'm assuming the utility of datalad/git-annex is the ability to save the portions of assets that have changed rather than the entire asset; however, I'm curious whether the requirements of LINC make S3 storage sufficient on its own (especially as datalad would be another "service" to manage technically, though perhaps it is a set-it-and-forget-it type of service we can integrate) -- can discuss further
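For reference, the version-what-changed workflow we'd be buying into looks roughly like this (dataset and file names are made up):

```shell
# Create a dataset; large files get managed by git-annex under the hood
datalad create linc-assets
cd linc-assets

# Add an asset and record it as a version
cp /tmp/asset.nii.gz .
datalad save -m "Add asset"

# Later edits are recorded as new versions in history, not wholesale copies
datalad save -m "Update asset"
```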

@aaronkanzer
@kabilar (ignore this for now if you see it, enjoy the wedding! 😄 ) -- just noting so I don't forget here...

I did a proof-of-concept yesterday for setting up a cron job that performs a manual pg_dump of our Heroku Postgres DB into an S3 bucket...and it works well!

See here:

https://github.com/lincbrain/linc-archive/blob/master/.github/workflows/pg-backup.yml

Can refine a bit more moving forward of course -- GPT made this a quick win for sure.


aaronkanzer commented Feb 7, 2024

@kabilar

A bit more research/experimentation on the Object Lock and 2FA Deletion features in AWS:

Object Lock

While we were able to successfully use Terraform in our sandbox example to enable Object Lock (https://github.com/lincbrain/linc-sandbox/blob/main/aws/bucket.tf), providing a similar flag in the pre-existing dandi-infrastructure failed somewhat silently (it seems Terraform tries to re-create the bucket). However, all buckets already have the lifecycle {prevent_destroy = true} rule, which essentially does what Object Lock would, but at the parent bucket level.

(Just as an FYI, the main use case of lifecycle {prevent_destroy = true} is to block terraform destroy -- i.e. what is invoked when you remove a resource from a TF file -- for a given resource.)

Nevertheless, we can still go into the AWS Management Console for the bucket and turn the feature on. The conclusion here is that this is a Terraform issue, not an AWS issue. My recommendation would be to document the manual steps and turn the feature on for the relevant buckets as a good safeguard.

There are two retention modes, GOVERNANCE and COMPLIANCE. My suggestion would be to use GOVERNANCE for the LINC project: COMPLIANCE removes any ability to delete data until the retention period expires, whereas GOVERNANCE limits deletion to users granted a special permission.
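Setting the suggested mode as the bucket default would look something like this (bucket name and the 30-day retention are arbitrary examples):

```shell
# Default GOVERNANCE retention for new objects in a stand-in bucket
aws s3api put-object-lock-configuration \
  --bucket linc-data \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 30}}
  }'
```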

2FA Deletion

AWS is like a fortress setting this up 😂 -- only the root user can enable it (not even users with AdministratorAccess in AWS IAM can). I messaged Satra about getting root user access.

MFA Delete can only be enabled via the AWS CLI, with a command like:

aws s3api put-bucket-versioning --bucket <bucket-name> --versioning-configuration Status=Enabled,MFADelete=Enabled --mfa "arn:aws:iam::151312473579:mfa/<user>-iphone <code-from-DUO>"

The gain here is protection in case any of our credentials are ever leaked, which I think is reason enough to set it up. This should be straightforward once we get the root user account info.

As an aside, we should explore the S3 Intelligent-Tiering feature in terms of Hot Storage -> Glacier -> Deep Glacier. I'm not sure how one-to-one DANDI's garbage collection is with the LINC project, so Intelligent-Tiering could be quite useful for us.
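A sketch of opting a bucket into Intelligent-Tiering's archive tiers (bucket name, configuration ID, and day thresholds are placeholders):

```shell
# Move objects untouched for 90/180 days into Glacier-class archive tiers
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket linc-data \
  --id archive-cold-data \
  --intelligent-tiering-configuration '{
    "Id": "archive-cold-data",
    "Status": "Enabled",
    "Tierings": [
      {"Days": 90,  "AccessTier": "ARCHIVE_ACCESS"},
      {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
    ]
  }'
```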

Let me know if you have any questions/concerns in the meantime.

@aaronkanzer

Cc @satra -- just wanted to notify you of this GitHub issue, as it relates to some of the questions you've had regarding the resiliency of the DANDI/LINC infrastructure, especially data integrity and preservation in worst-case scenarios, such as Terraform accidentally de-provisioning some infrastructure, a malicious actor obtaining credentials, overwrites, etc.

@kabilar and I will soon consolidate these thoughts into an organized design doc (that can perhaps extend some of these proof-of-concepts into DANDI), but just wanted you to be aware in the meantime.
