The GsasSync script manages the transfer of dissertations from GSAS to the library's collections.
It runs as a monthly cron job to synchronize and verify the transfer of GSAS dissertations. It retrieves the files from an AWS Transfer Server via SFTP, verifies that the correct files have been downloaded successfully and in the expected format, and then removes them from the remote server. This Transfer Server is managed by GSAS, who upload items each month and maintain backup copies. An email is sent to the relevant parties when the process succeeds or fails (with a log file attached).
git clone [email protected]:cul/gsas_sync.git
cd gsas_sync
rbenv install # Install correct ruby version
bundle install # Install dependencies
# Create an SSH Tunnel
sshuttle -r [email protected] 0.0.0.0/0
ruby gsas_sync_main.rb # Run the script
The gsas sync script supports two command line arguments: one sets the standard out logging level and the other runs the script in dry-run mode. Use the -h flag to see the usage message:
% ruby gsas_sync_main.rb --help
Usage:
ruby gsas_sync_main.rb [options]
-l, --log-level [LLVL] Specify the runtime log level (debug, info, warn, error, fatal - default is 'debug')
--dry-run [DRYRUN] Run as dry-run
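As an illustration of how these two options could be parsed, here is a minimal sketch using Ruby's standard OptionParser. The helper name parse_options and the options hash are hypothetical and not taken from gsas_sync itself, and --dry-run is treated as a plain boolean flag for simplicity.

```ruby
require 'optparse'

LOG_LEVELS = %w[debug info warn error fatal].freeze

# Hypothetical helper: parse command-line arguments into a plain options hash.
# The real script's internals are not shown in this README.
def parse_options(argv)
  options = { log_level: 'debug', dry_run: false }

  parser = OptionParser.new do |opts|
    opts.banner = "Usage:\n  ruby gsas_sync_main.rb [options]"

    opts.on('-l', '--log-level [LLVL]',
            "Specify the runtime log level (#{LOG_LEVELS.join(', ')} - default is 'debug')") do |level|
      next if level.nil? # flag given without a value: keep the default
      raise OptionParser::InvalidArgument, level unless LOG_LEVELS.include?(level)

      options[:log_level] = level
    end

    # Assumption: treated as a simple on/off switch in this sketch.
    opts.on('--dry-run', 'Run as dry-run') do
      options[:dry_run] = true
    end
  end

  parser.parse(argv) # does not modify argv
  options
end
```

For example, `parse_options(%w[--log-level warn --dry-run])` would yield a hash with the log level set to 'warn' and dry-run enabled, while an unknown level raises OptionParser::InvalidArgument.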
The gsas sync script supports a dry-run option to download and validate transfer directories without making any permanent changes to either the remote server or the local host.
ruby gsas_sync_main.rb --dry-run
In detail:
- files will be downloaded to a .temp directory in the configurable storage location
- downloaded .temp directories will be validated (see Validation Rules section) and then deleted
- no files will be deleted from the remote transfer server
- a progress log file will be created and saved under the configurable logs location
- no email notifications will be sent
- skipped operations (e.g., removing the files from the remote transfer server, moving the .temp directory to a permanent one, sending email notifications) will be logged
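One way to picture the "skip and log" behavior is a small wrapper around each permanent operation. This is only a sketch: the OperationRunner class and its method names are hypothetical and are not part of gsas_sync.

```ruby
require 'logger'

# Hypothetical wrapper: runs a permanent (destructive) step only when not in
# dry-run mode, and logs the step as skipped otherwise.
class OperationRunner
  attr_reader :skipped

  def initialize(dry_run:, logger: Logger.new($stdout))
    @dry_run = dry_run
    @logger = logger
    @skipped = []
  end

  # Yields to the block unless dry-run is enabled. Returns the block's value,
  # or nil when the operation was skipped.
  def perform(description)
    if @dry_run
      @skipped << description
      @logger.info("[dry-run] skipped: #{description}")
      return nil
    end

    @logger.info("running: #{description}")
    yield
  end
end
```

A caller might wrap each permanent step, for example `runner.perform('remove files from remote transfer server') { ... }`, so that downloading and validation still run in dry-run mode while deletions, moves, and emails are only logged.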
The script expects a configuration file located in <project_root>/config/config.yml. It should have the following structure:
config:
  sftp_server:
    host: TRANSFER_SERVER_IP_OR_HOSTNAME
    user: TRANSFER_SERVER_USER
    key: PATH_TO_SSH_KEY
  mail_server:
    host: MAIL_SERVER_IP_OR_HOSTNAME
    port: PORT
    sender_address: "[email protected]" # CAN BE ANY EMAIL YOU WOULD LIKE
    success_recipients:
      - EMAIL_RECIPIENT_ADDRESS
      - ...
    failure_recipients:
      - EMAIL_RECIPIENT_ADDRESS
      - ...
  logs:
    directory: LOGS_DIRECTORY_NAME
  storage:
    directory: ABSOLUTE_PATH_TO_FINAL_STORAGE_DIRECTORY
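A configuration with this structure could be loaded and sanity-checked along the following lines. This is a sketch under stated assumptions: the sftp_server, mail_server, logs, and storage sections are assumed to sit under the top-level config key, the recipient lists are assumed to belong to mail_server, and the load_config helper is hypothetical rather than the script's actual loader.

```ruby
require 'yaml'

# Required keys per section. The nesting is an assumption about the
# indentation of config.yml, not something confirmed by the script itself.
REQUIRED_KEYS = {
  'sftp_server' => %w[host user key],
  'mail_server' => %w[host port sender_address success_recipients failure_recipients],
  'logs' => %w[directory],
  'storage' => %w[directory]
}.freeze

# Hypothetical helper: parse YAML text and return the `config` hash,
# raising ArgumentError with a list of any missing keys.
def load_config(yaml_text)
  config = (YAML.safe_load(yaml_text) || {}).fetch('config', {})

  missing = REQUIRED_KEYS.flat_map do |section, keys|
    values = config[section].is_a?(Hash) ? config[section] : {}
    keys.reject { |key| values.key?(key) }.map { |key| "#{section}.#{key}" }
  end
  raise ArgumentError, "Missing config keys: #{missing.join(', ')}" unless missing.empty?

  config
end
```

It could be called as `load_config(File.read('config/config.yml'))`. Failing early on a missing key gives a clearer error than a nil lookup partway through a transfer.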
While developing locally, we connect to the test transfer server as the special transfer user. You should obtain a copy of that user's private SSH key and put it in your dev machine's ~/.ssh directory. Additionally, create a local config/config.yml and populate it with the proper credentials. Refer to spec/fixtures/config.yml for reference.
The test and production transfer servers only allow connections from connect.cul.columbia.edu, so you should SSH tunnel to that server while developing locally. You can do so with sshuttle (on Mac, you can install sshuttle with Homebrew). Then use the following command to forward your traffic to the connect server while developing:
sshuttle -r [email protected] 0.0.0.0/0
Alternatively, you can run your own server to use as the test transfer server. This is nice because you can put whatever you want in the server you spin up, without worrying about access rights or muddying the test transfer server that is maintained by LIT. Here is a brief guide to setting this up:
- Install virtual machine software. On Mac, we recommend VMware Fusion (this will require account creation in order to install). On Windows, WSL2 would likely be a great option.
- Download an ISO appropriate for our needs and spin up a VM. I use an Ubuntu Live Server (for ARM) image. Make sure to download the ARM architecture ISO if using a Mac with Apple Silicon. Boot and set up the new VM.
  - To find the IP address of your new machine, you can use the ip a command.
- Create an SSH keypair on your parent machine. Add your public key to the authorized_keys file on the remote host.
  - Method 1 (recommended): Add your key to the remote with the ssh-copy-id utility. Run ssh-copy-id -i ~/.ssh/vm-id_ed25519.pub ROOT_USER@REMOTE_IP_ADDR from your 'parent' machine.
  - Method 2: You can also use the secure copy command to copy the public key you created to the remote machine. Run scp ~/.ssh/your_ssh_key.pub ROOT_USER@IP_OF_VM:~/.ssh/your_ssh_key.pub to copy the file to the remote host.
    - Then add it to the authorized_keys file, using any means you like. Example, in an SSH session: ROOT_USR@REMOTE: ~/.ssh$ cat your_ssh_key.pub >> authorized_keys
- Confirm that you can reach the VM without a password: ssh ROOT_USER@REMOTE_IP_ADDR -i ~/.ssh/your_ssh_key
- On the remote host, create an uploads/ directory in the home directory. gsas_sync expects this directory to exist and will download dissertations from there. Populate it with any data you'd like while developing. See "expected transfer server directory structure" for more information.
- Add your VM host name, root user name, and private key location (on parent machine) to config/config.yml. These values will be used when gsas_sync makes connections to the transfer server.
1. All required files present
1.1. The manifest file exists and has an accepted algorithm in it
1.2. A yyyy_mm_items.csv file with a matching prefix exists
1.3. A yyyy_mm_assets.csv file with a matching prefix exists
2. No undesirable characters are present in any of the file/directory names
3. All files listed in the manifest file are accounted for
3.1. No metadata files (e.g., the assets and items csv files) are listed in the manifest file
3.2. All files listed in the manifest exist in the downloaded data/ temp directory
3.3. All files in the downloaded temp directory (besides metadata files) are listed in the manifest
4. The manifest file is valid
4.1. The checksums listed for each file in the manifest match the checksums for what was downloaded
4.2. Each checksum listed in the manifest file is unique
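To make rules 3.2, 3.3, 4.1, and 4.2 concrete, here is a minimal sketch of a manifest check. It assumes BagIt-style manifest lines of the form "<hex digest> <relative path>" with paths relative to the transfer directory (for example data/bowie_david/who_is_ziggy_stardust.pdf), and it assumes SHA-256. The helper names are hypothetical and this is not the script's actual validator.

```ruby
require 'digest'

# Parse a manifest into { relative_path => checksum }.
# Assumption: each line is "<hex digest> <relative path>".
def parse_manifest(manifest_path)
  File.readlines(manifest_path, chomp: true).reject(&:empty?).to_h do |line|
    checksum, path = line.split(/\s+/, 2)
    [path, checksum]
  end
end

# Hypothetical validator covering rules 3.2, 3.3, 4.1 and 4.2.
# Returns an array of error strings; an empty array means the checks passed.
def validate_manifest(transfer_dir, manifest_name: 'manifest-sha256.txt')
  manifest = parse_manifest(File.join(transfer_dir, manifest_name))
  data_dir = File.join(transfer_dir, 'data')
  errors = []

  # Files on disk under data/, expressed relative to the transfer directory.
  on_disk = Dir.glob('**/*', base: data_dir)
               .select { |rel| File.file?(File.join(data_dir, rel)) }
               .map { |rel| File.join('data', rel) }

  (manifest.keys - on_disk).each { |path| errors << "listed but missing: #{path}" }
  (on_disk - manifest.keys).each { |path| errors << "present but unlisted: #{path}" }

  # Compare the recorded checksum with one computed from the downloaded file.
  manifest.each do |path, expected|
    full_path = File.join(transfer_dir, path)
    next unless File.file?(full_path)

    actual = Digest::SHA256.file(full_path).hexdigest
    errors << "checksum mismatch: #{path}" unless actual == expected.downcase
  end

  # Rule 4.2: every checksum should appear only once.
  manifest.values.tally.each do |checksum, count|
    errors << "duplicate checksum: #{checksum}" if count > 1
  end

  errors
end
```

Returning a list of errors rather than raising on the first problem lets a single log entry report everything wrong with a transfer directory.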
uploads/
├─ 2025_04_dissertations/
│  ├─ data
│  │  ├─ bowie_david
│  │  │  ├─ who_is_ziggy_stardust.pdf
│  │  │  ├─ ziggy_stardust.pptx
│  │  │  ├─ ziggy_stardust_explained.mp4
│  │  ├─ haines_emily
│  │  │  ├─ haines_dissertation.pdf
│  │  ├─ daltrey_roger
│  │  │  ├─ daltrey_dissertation.pdf
│  ├─ manifest-sha256.txt
│  ├─ 2025_04_items.csv
│  ├─ 2025_04_assets.csv
├─ 2025_05_dissertations/
│  ├─ data
│  ├─ manifest-sha256.txt
│  ├─ 2025_05_items.csv
│  ├─ 2025_05_assets.csv
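Given this layout, a check that the required files are present in one transfer directory might look like the sketch below. It assumes the yyyy_mm prefix is taken from the directory name, that the "accepted algorithm" is read from the manifest file name, and that sha256 is the only accepted value (the only one that appears in this README). The helper name is hypothetical.

```ruby
# Assumption: only sha256 is accepted; the real list may differ.
ACCEPTED_ALGORITHMS = %w[sha256].freeze

# Hypothetical check: returns the names of required entries that are missing
# from a transfer directory such as uploads/2025_04_dissertations.
def missing_required_files(transfer_dir)
  dir_name = File.basename(transfer_dir)
  prefix = dir_name[/\A(\d{4}_\d{2})_dissertations\z/, 1]
  return ["unexpected directory name: #{dir_name}"] if prefix.nil?

  missing = []
  missing << 'data' unless File.directory?(File.join(transfer_dir, 'data'))

  # Look for manifest-<algorithm>.txt with an accepted algorithm in its name.
  has_manifest = Dir.glob('manifest-*.txt', base: transfer_dir).any? do |name|
    ACCEPTED_ALGORITHMS.include?(name[/\Amanifest-(.+)\.txt\z/, 1])
  end
  missing << 'manifest-<algorithm>.txt' unless has_manifest

  # The items and assets CSV files must share the directory's yyyy_mm prefix.
  %w[items assets].each do |kind|
    name = "#{prefix}_#{kind}.csv"
    missing << name unless File.file?(File.join(transfer_dir, name))
  end

  missing
end
```

For the 2025_04_dissertations directory shown above, this would return an empty array; removing 2025_04_assets.csv would make it return that file name.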
bundle exec rspec spec