diff --git a/docs/platforms/aws/aws-instructions.md b/docs/platforms/aws/aws-instructions.md new file mode 100644 index 000000000..494a3ea95 --- /dev/null +++ b/docs/platforms/aws/aws-instructions.md @@ -0,0 +1,175 @@ +# Swell on AWS (`smce-gmao`) + +## General platform description + +### Compute + +We are running an instance of AWS ParallelCluster on AWS. +This is just a `slurm` cluster very similar to Discover, with a login node and a compute node pool. + +The login node pool (3 nodes, assigned randomly; similar to Discover) is always on and costs a fixed amount whether we use it or not. +Each login node has 2 CPUs and 8 GB of RAM. + +Compute nodes only cost money when they are running jobs and are destroyed when not in use. +We have several compute queues available, which can be viewed via `sinfo`. +Example output might look like this: + +``` +PARTITION AVAIL TIMELIMIT NODES STATE NODELIST +demand-8cpu* up infinite 19 idle~ demand-8cpu-dy-demand-8cpu-nodes-[2-20] +demand-8cpu* up infinite 1 mix demand-8cpu-dy-demand-8cpu-nodes-1 +demand-16cpu up infinite 20 idle~ demand-16cpu-dy-demand-16cpu-nodes-[1-20] +spot-8cpu up infinite 20 idle~ spot-8cpu-dy-spot-8cpu-nodes-[1-20] +spot-16cpu up infinite 20 idle~ spot-16cpu-dy-spot-16cpu-nodes-[1-20] +``` + +The `demand` nodes are reserved and guaranteed to be available for the entire length of the job, but are also somewhat more expensive. +The `spot` nodes are 3-4x cheaper per CPU-hour, but may fail unexpectedly (when other AWS users outbid us for them). +For smaller, lower-priority, and failure-tolerant jobs, I recommend using the spot partitions (e.g., `sbatch -p spot-8cpu`) to save some costs...but if you need the `demand` nodes, use them! + +All nodes are in the [`c7i-flex` class](https://aws.amazon.com/ec2/instance-types/c7i/) --- x86_64, custom 4th Generation Intel Xeon Scalable processors ("Sapphire Rapids") + +Nodes take around 7 minutes to launch from idle (shut down) state to active. 
+After a job completes or fails, nodes will persist for 10 minutes before shutting down. +Jobs submitted within that 10 minute window should start almost instantly. +This should make it _much_ easier to run multi-job workflows (like Swell) and to debug issues. + +### Networking + +There are **no restrictions on network access**. +Unlike Discover, both login and compute nodes can download from the open internet at no cost. +Some network instability may be possible due to cost-saving network configurations. + +### Storage + +All storage locations should be available at the same file paths to both login and compute nodes. +None of the storage locations have inode limits, but some storage types do have (small) costs per read/write operation, so pay some attention to this. + +* Home directories are mounted on AWS Elastic File System (EFS). This is a pay-for-what-you-use storage device with no practical upper limit. The price is $0.30/GB-month ($300/TB-month), plus a $0.03/GB charge for reads and a $0.06/GB charge for writes. Performance characteristics are average; decent, but not amazing and possibly inconsistent, throughput and latency. +* There is also a shared `/efs` folder with the same pricing and performance characteristics as above. +* `/fast1` is pre-allocated SSD storage, fixed (for now) to 200 GB. We pay for 200 GB whether this is 0% or 100% full. Note that although this is SSD, it is mounted to compute nodes via NFS (over the network), so the performance will likely be (significantly) worse than advertised. There is no cost to read or write operations. +* `/slow1` is pre-allocated HDD (spinning disk) storage, fixed to 1 TB (as above, but much cheaper and with worse performance). Since it's pre-allocated, this is a great place to store infrequently accessed data. Again, no cost to read or write operations. +* `/s3` is a mounted S3 bucket. 
Like EFS, this is pay-for-what-you-use and infinitely expandable, but significantly cheaper -- $0.023/GB-month ($23/TB-month). However, this is not quite a true POSIX file system: it works well for reading and writing whole files, but not for editing files in place. There is also a small fee for each read, write, and delete operation (~$0.005 per 1000 ops...but it can add up for certain access patterns).
+
+### Final thoughts
+
+This AWS resource is here to be used; do not be intimidated by the storage or compute costs.
+As long as you are reasonably prudent about storage (e.g., don't dump 100s of TB of output on EFS and leave them there for long stretches of time) and compute (e.g., don't accidentally leave compute nodes cycling a failed task for days at a time), cost shouldn't be a problem.
+We also get cost alerts when spending spikes or drifts well above the average.
+
+Finally, remember that **labor costs money too**.
+Your time is worth roughly $20-50/hour (depending on seniority), so if you spend an hour trying to save cloud costs but save less than ~$50 (roughly 170 GB-months of storage, or 150 hours of compute node runtime), _you are wasting money_.
+
+## Cylc
+
+The Swell AWS installation comes with a global installation of `cylc`.
+You should be able to use it with no additional configuration (assuming `/usr/local/bin` is on your `PATH`).
+
+The `cylc` configuration on AWS is basically identical to Discover's.
+Ensure that your `~/.cylc/flow/global.cylc` file contains the following:
+
+```
+[scheduler]
+    UTC mode = True
+    process pool timeout = PT10M
+    process pool size = 4
+
+[platforms]
+    [[aws]]
+        job runner = slurm
+        install target = localhost
+        hosts = localhost
+```
+
+### (Optional) Install your own version of cylc
+
+If you would like to install your own `cylc`, read on.
+
+A very easy and convenient way to install `cylc` is with the [pixi package manager](https://pixi.sh/latest/):
+
+1. Install `pixi` itself (per its instructions).
+Note that this is a user-level install; you do not need sudo permissions. +Then, restart your shell (or log out and back in). + +2. Install cylc with `pixi global install cylc-flow --expose cylc`. +This will make `cylc` available as a global standalone executable available everywhere (including Swell). + +## Installing swell + +1. Clone Swell: `git clone https://github.com/geos-esm/swell` + +2. Enter the `swell` directory. + +3. Activate Swell modules: `source /shared/swell-bundle` + +4. Create a virtual environment: + + ```sh + python -m venv .venv + # ...or with uv: + uv venv + ``` + +5. Activate the virtual environment: + + ```sh + source .venv/bin/activate + ``` + +6. Install Swell dependencies: + + ```sh + pip install -r requirements.txt -r requirements-aws.txt + # ...or with uv: + uv pip install -r requirements.txt -r requirements-aws.txt + ``` + +7. Install Swell itself (note: `-e` means "editable" mode, so changes to the code will automatically be detected as Swell runs.): + + ```sh + pip install -e . + # ...or with uv + uv pip install -e . + ``` + +## Using Swell installations + +1. Source swell modules: `source /shared/swell-bundle`. + +2. Activate your Python virtual environment (from inside the Swell directory): `source .venv/bin/activate` +(If you are not in the Swell directory, just pass an absolute path: `source /path/to/your/swell/.venv/bin/activate`). + +Note: Optionally, you can skip step 1 here by manually editing the `.venv/bin/activate` script to include the line from step 1 (`source /shared/swell-bundle`). +Then, all you have to do is run step 2. + +## Known issues + +### Issues with `uv` and `git-lfs` (e.g., for `eva`) + +There is a known issue with `uv pip` and repositories that use git LFS (like eva). +See this for more details: https://github.com/astral-sh/uv/issues/3312 + +One solution is to configure LFS to force skipping smudge checks (though this may have the side effect of not downloading any LFS files at all). 
+
+```
+git lfs install --force --skip-smudge
+```
+
+A better solution may be to skip smudge checks only for the `uv` cache:
+
+1. Create a file called `~/.gitconfig-nolfs` with the following contents:
+
+    ```
+    [filter "lfs"]
+        clean = git-lfs clean -- %f
+        smudge = git-lfs smudge --skip -- %f
+        process = git-lfs filter-process --skip
+        required = true
+    ```
+
+2. Add the following to your `~/.gitconfig` (the `path` must match the file created in the previous step):
+
+    ```
+    [includeIf "gitdir:~/.cache/uv/**"]
+        path = ~/.gitconfig-nolfs
+    ```
diff --git a/docs/platforms/aws/aws-setup.md b/docs/platforms/aws/aws-setup.md
new file mode 100644
index 000000000..48c2cac11
--- /dev/null
+++ b/docs/platforms/aws/aws-setup.md
@@ -0,0 +1,472 @@
+# (Advanced) Setting up AWS for Swell
+
+## Build spack-stack
+
+Building spack-stack (at least, the unified-dev environment --- we may be able to get away with a subset of what's in there) takes a long time (~6 hours) and produces ~11 GB of binaries.
+Before doing this, check whether an existing spack-stack install for the relevant operating system already exists.
+For example, for the install below, a pre-existing spack-stack installation is stored in `/fast1/spack-envs/`.
+
+### Install spack-stack dependencies
+
+For this cluster, these dependencies are installed as part of the AMI (virtual machine image) used by the cluster.
+For the latest version of that configuration, see the scripts in https://github.com/ashiklom/smce-gmao-tf/tree/main/deployments/pcluster/image (note: this is a private repository, for security reasons).
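+
+Since the dependencies should already be baked into the cluster image, it may be worth a quick sanity check on a node before starting a multi-hour build. A minimal sketch (the tool names are a small sample from the excerpt below):
+
+```sh
+# Spot-check a few key build tools from the dependency list
+for tool in gcc-13 gfortran-13 cmake git git-lfs; do
+    command -v "$tool" >/dev/null || echo "missing: $tool"
+done
+```
+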
+ +An excerpt of the dependencies (for Ubuntu 24.04) is listed below for reference: + +```sh +sudo apt-get update +sudo apt-get upgrade -y +sudo apt-get install -y \ + build-essential \ + g++-11 \ + g++-12 \ + g++-13 \ + gcc-11 \ + gcc-12 \ + gcc-13 \ + gfortran-11 \ + gfortran-12 \ + gfortran-13 \ + make \ + apt-utils \ + autoconf \ + automake \ + autopoint \ + bc \ + bzip2 \ + cmake \ + cpp-11 \ + curl \ + file \ + flex \ + gettext \ + gh \ + git \ + git-lfs \ + golang \ + gnupg2 \ + iproute2 \ + less \ + libcurl4-openssl-dev \ + libgomp1 \ + liblua5.3-dev \ + liblua5.3.0 \ + libmysqlclient-dev \ + libqt5svg5-dev \ + libtcl8.6 \ + libtool \ + libtree \ + locales \ + lua-bit32 \ + lua-posix \ + lua-posix-dev \ + lua5.3 \ + make \ + mysql-server \ + pkg-config \ + python3 \ + python3-pip \ + python3-setuptools \ + qt5-qmake \ + qt5dxcb-plugin \ + qtbase5-dev \ + tcl \ + tcl-dev \ + tcl8.6 \ + tcl8.6-dev\ + unzip \ + wget + +# Install lmod manually +( + LMOD_TMP=$(mktemp -d) + cd "$LMOD_TMP" + wget https://github.com/TACC/Lmod/archive/refs/tags/8.7.60.tar.gz + tar -xf 8.7.60.tar.gz + cd Lmod-8.7.60 + sudo mkdir -p /opt + ./configure --prefix=/opt/ --with-lmodConfigDir=/opt/lmod/8.7/config + sudo make install +) +sudo ln -sf /opt/lmod/lmod/init/profile /etc/profile.d/z00_lmod.sh +sudo ln -sf /opt/lmod/lmod/init/cshrc /etc/profile.d/z00_lmod.csh +sudo ln -sf /opt/lmod/lmod/init/profile.fish /etc/profile.d/z00_lmod.fish +``` + +### Install spack-stack + +NOTE: This uses the `unified-dev` environment, which installs _everything_ --- GEOS, Skylab, NEPTUNE, GSI. +Therefore, it is very large (final install is ~12 GB) and takes a long time (~4 hours on a `c7i.xlarge`). 
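+
+The script below aborts unless `SPACK_STACK_VERSION`, `COMPILER`, and `ENVNAME` are set in the environment. An illustrative invocation (the specific ref and names here are examples, not pinned values; the environment name matches the `/fast1/spack-envs/unified-env-gcc` paths used later in this document):
+
+```sh
+# Illustrative values only: choose the spack-stack ref, compiler, and env name you need
+export SPACK_STACK_VERSION="spack-stack-1.9.0"
+export COMPILER="gcc"
+export ENVNAME="unified-env-gcc"
+```
+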
+
+```sh
+#!/usr/bin/env bash
+
+set -uo pipefail
+# set -euxo pipefail
+
+umask 022
+
+# NOTE: The ${VAR:-} expansions keep these checks from tripping `set -u`
+if [[ -z "${SPACK_STACK_VERSION:-}" ]]; then
+    echo "SPACK_STACK_VERSION is unset"
+    exit 1
+fi
+
+if [[ -z "${COMPILER:-}" ]]; then
+    echo "COMPILER is unset"
+    exit 1
+fi
+
+if [[ -z "${ENVNAME:-}" ]]; then
+    echo "ENVNAME is unset"
+    exit 1
+fi
+
+ROOTDIR="/opt/spack"
+SRCDIR="$ROOTDIR/spack-stack"
+ENVDIR="$ROOTDIR/envs"
+
+SCRIPT_USER=$(whoami)
+
+sudo mkdir -p "$ROOTDIR"
+sudo chown "$SCRIPT_USER:$SCRIPT_USER" "$ROOTDIR"
+chmod 755 "$ROOTDIR"
+
+git clone --recurse-submodules "https://github.com/jcsda/spack-stack" "$SRCDIR"
+
+cd "$SRCDIR"
+git checkout "$SPACK_STACK_VERSION"
+git submodule update
+
+source setup.sh
+
+# Change tcl to lmod
+sed -i 's/tcl/lmod/g' configs/sites/tier2/linux.default/modules.yaml
+
+spack stack create env \
+    --site linux.default \
+    --template unified-dev \
+    --dir "$ENVDIR" \
+    --name "$ENVNAME" \
+    --compiler "$COMPILER"
+
+cd "$ENVDIR/$ENVNAME"
+spack env activate -p .
+
+export SPACK_SYSTEM_CONFIG_PATH="$PWD/site"
+
+spack external find --scope system \
+    --exclude python \
+    --exclude openssl \
+    --exclude cmake
+
+spack external find --scope system wget
+spack external find --scope system mysql
+spack external find --scope system grep
+spack external find --scope system go
+
+# Manually add gh
+if [[ ! -f site/packages.yaml.bak ]]; then
+    cp site/packages.yaml{,.bak}
+fi
+cat <<-EOF >> site/packages.yaml
+  gh:
+    externals:
+    - spec: gh@2.45
+      prefix: /usr
+EOF
+
+# spack compiler find --scope system "$COMPILER"
+
+GCC13_VERSION=$("$COMPILER-13" --version | head -n1 | grep -oP ' \d+\.\d+\.\d+ *$' | xargs)
+
+# NOTE: QT_VERSION is detected here for reference but not used below
+QT_VERSION=$(apt-cache show qtbase5-dev | grep 'Version: ' | grep -oP '\d+\.\d+\.\d+')
+
+unset SPACK_SYSTEM_CONFIG_PATH
+
+spack config add "packages:all:compiler:[gcc@$GCC13_VERSION]"
+spack config add "packages:all:providers:mpi:[openmpi@5.0.5]"
+spack config add "packages:fontconfig:variants:+pic"
+spack config add "packages:pixman:variants:+pic"
+spack config add "packages:cairo:variants:+pic"
+spack config add "packages:ewok-env:variants:+mysql"
+
+# Concretize and install
+spack concretize 2>&1 | tee log.concretize
+# cat log.concretize | ${SPACK_STACK_DIR}/util/show_duplicate_packages.py
+spack install --fail-fast 2>&1 | tee log.install
+
+# Install lmod modules
+spack module lmod refresh
+spack stack setup-meta-modules
+```
+
+### (Optional, but recommended) Set up shortcuts for swell modules
+
+So that users do not need to remember which modules to load for Swell, create a file with the following contents that can be `source`d to load everything at once.
+
+Be sure to adjust the path in the first `module use` statement to wherever you installed spack-stack in the previous step.
+
+```sh
+# Adapted from:
+# /discover/nobackup/projects/gmao/advda/swell/jedi_modules/spackstack_1.9_intel
+
+module purge
+
+# NOTE: Change this path to match the spack-stack installation above.
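+# For the install script above, this is likely /opt/spack/envs/<ENVNAME>/install/modulefiles/Core.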
+module use /fast1/spack-envs/unified-env-gcc/install/modulefiles/Core
+
+module load stack-gcc/13.3.0
+module load stack-openmpi/5.0.5
+module load stack-python/3.11.7
+
+# JEDI
+module load jedi-fv3-env
+module load soca-env
+module load gmao-swell-env
+
+# Extras
+module load git-lfs/3.4.1
+module load py-pip/23.1.2
+
+# vim: set filetype=sh :
+```
+
+## Building JEDI
+
+Swell uses [`jedi_bundle`](https://github.com/geos-esm/jedi_bundle) to build JEDI.
+This will clone, configure, and build specific versions of all JEDI components needed for Swell.
+Note that some of these components are in private repositories, so you will need to follow the [instructions in the `jedi_bundle` documentation](https://geos-esm.github.io/jedi_bundle/#/git_credentials) to set up your Git credentials.
+
+An install script like the following should work.
+
+**NOTE**: The `cat` heredoc step below writes a new AWS platform configuration file into the `jedi_bundle` _source_ repository. In the future, this configuration will be included in the main `jedi_bundle` repo and this step will not be necessary.
+
+```sh
+#!/usr/bin/env bash
+#SBATCH --partition demand-16cpu
+
+# ^^ SBATCH directive here is for building this directly on the cluster.
+
+JEDI_ROOT="/efs/jedi/"
+JEDI_BUNDLE_SRC="$JEDI_ROOT/jedi_bundle"
+SPACK_ROOT="/fast1/spack-envs/unified-env-gcc/"
+S3DIR="/s3"
+
+VERSION="latest"
+GCCVER="13.3.0"
+SKYLAB_VERSION="2.4.1_skylab_4.0"
+
+N_AVAILABLE_CORES=$(nproc)
+
+mkdir -p "$JEDI_ROOT"
+
+if [[ ! -f "$SPACK_ROOT/spack.lock" ]]; then
+    echo "$SPACK_ROOT not found or improperly configured"
+    exit 1
+fi
+
+if [[ ! -d "$S3DIR/SwellStaticFiles" ]]; then
+    echo "Couldn't find $S3DIR/SwellStaticFiles"
+    exit 1
+fi
+
+JEDI_BUILD="$JEDI_ROOT/builds/jedi-build-gcc_$GCCVER"
+if [[ -d "$JEDI_BUILD" ]]; then
+    echo "Existing JEDI build found at $JEDI_BUILD. Exiting..."
+    exit 1
+fi
+
+if [[ ! -d "$JEDI_BUNDLE_SRC" ]]; then
+    git clone https://github.com/geos-esm/jedi_bundle "$JEDI_BUNDLE_SRC"
+fi
+
+cd "$JEDI_BUNDLE_SRC"
+
+## Using geos-esm/jedi_bundle
+module use -a "$SPACK_ROOT/install/modulefiles/Core"
+
+module purge
+module load stack-gcc/13.3.0
+module load stack-openmpi/5.0.5
+module load stack-python/3.11.7
+module load git-lfs/3.4.1
+module load py-pip/23.1.2
+
+mkdir -p "$JEDI_BUILD"
+cd "$JEDI_BUILD"
+python -m venv ".venv"
+source .venv/bin/activate
+
+# Before we install JEDI, we need to write an AWS configuration into the source tree
+cat > "$JEDI_BUNDLE_SRC/src/jedi_bundle/config/platforms/aws.yaml" <<EOF
+platform_name: aws
+
+is_it_me:
+  - command: 'echo \$SLURM_CLUSTER_NAME'
+    contains: 'gmao-pcluster'
+crtm_coeffs_path: "$S3DIR/SwellStaticFiles/jedi/crtm_coefficients/"
+crtm_coeffs_version: "$SKYLAB_VERSION"
+modules:
+  default_modules: gnu
+  gnu:
+    init:
+      - source /opt/lmod/lmod/init/bash
+    load:
+      - module purge
+      - module use $SPACK_ROOT/install/modulefiles/Core
+      - module load stack-gcc/13.3.0
+      - module load stack-openmpi/5.0.5
+      - module load stack-python/3.11.7
+      - module load jedi-fv3-env
+      - module load soca-env
+      - module load gmao-swell-env
+    configure: '-DCMAKE_Fortran_FLAGS="-ffree-line-length-none"'
+    # configure: -DMPIEXEC_EXECUTABLE="/usr/bin/srun" -DMPIEXEC_NUMPROC_FLAG="-n"
+  gnu-geos:
+    init:
+      - source /opt/lmod/lmod/init/bash
+    load:
+      - module purge
+      - module use $SPACK_ROOT/install/modulefiles/Core
+      - module load stack-gcc/13.3.0
+      - module load stack-openmpi/5.0.5
+      - module load stack-python/3.11.7
+      - module load jedi-fv3-env
+      - module load soca-env
+      - module load gmao-swell-env
+      - module load esmf python py-pyyaml py-numpy pflogger fargparse zlib-ng cmake
+    configure: '-DCMAKE_Fortran_FLAGS="-ffree-line-length-none"'
+    # configure: -DMPIEXEC_EXECUTABLE="/usr/bin/srun" -DMPIEXEC_NUMPROC_FLAG="-n"
+EOF
+
+pip install "$JEDI_BUNDLE_SRC"
+
+echo "JEDI bundle path:"
+which jedi_bundle
+
+# Generate config file
+jedi_bundle --pinned_versions
+
+# 
Tweak config file
+sed -i "/ *cores_to_use_for_make/s/6/$N_AVAILABLE_CORES/" build.yaml
+
+# Run
+jedi_bundle all build.yaml
+```
+
+## Building GEOS
+
+Follow the instructions in the GEOS-ESM repo (https://github.com/geos-esm/geosgcm).
+The instructions below are abbreviated and opinionated, and are meant only to document the configuration used for the current AWS Swell deployment.
+
+Clone GEOS and check out the relevant tag.
+
+```sh
+mkdir -p /shared/GEOSgcm
+git clone https://github.com/geos-esm/geosgcm /shared/GEOSgcm/main
+
+cd /shared/GEOSgcm/main
+git worktree add ../v11.6.0 v11.6.0
+```
+
+Load required modules.
+(NOTE: This includes a `mepo` installation.)
+
+```sh
+module use /shared/spack-stack/envs/swell.my_aws/install/modulefiles/Core/
+module load stack-gcc/13.3.0
+module load stack-openmpi/5.0.5
+module load geos-gcm-env/1.0.0
+```
+
+Clone the sub-repositories that GEOS needs (via `mepo`).
+
+```sh
+cd /shared/GEOSgcm/v11.6.0
+mepo clone
+```
+
+Build using cmake.
+(NOTE: This assumes build directory `./build` and install directory `./install`.)
+
+```sh
+# Configure the build
+cmake -B build -S . --install-prefix=install
+# ...and actually do the build
+cmake --build build --target install
+```
+
+The resulting GEOS installation lives in `/shared/GEOSgcm/v11.6.0/install`.
+
+## Essential data for Swell
+
+NOTE: These instructions are current as of **May 5, 2025**.
+The data used by Swell change frequently as Swell evolves, so these instructions may quickly become outdated.
+Hopefully, they give you a sense of how Swell looks for files.
+
+### `SwellStaticFiles`
+
+On Discover, these are stored in `/discover/nobackup/projects/gmao/advda/SwellStaticFiles`.
+The relevant `task_question`s are: +- `swell_static_files` --- root directory +- `geos_experiment_directory` --- expands to: `/geos/run_dirs/` +- `geos_restarts_directory` --- expands to `/geos/restarts/` + +The complete `SwellStaticFiles` directory on Discover is several hundred GB, but not all of the data are needed for basic Swell tier 1 tests. +You may be able to get away with copying over only the following: + +- `/discover/nobackup/projects/gmao/advda/SwellStaticFiles/` + - `/jedi/` + - `interfaces/` + - `/geos_ocean/model/` + - `/geos_atmosphere/` + - `/crtm_coefficients/` + - `/geos/` + - `/run_dirs/5deg_0701/` + - `/restarts/restarts_20210701_210000_5deg/` + +### `R2D2DataStore` + +On Discover, this is stored in `/discover/nobackup/projects/gmao/advda/R2D2DataStore/Shared`. +As above, the full directory is quite large, but you may be able to get away with just the following: + +- `/discover/nobackup/projects/gmao/advda/R2D2DataStore/Shared` + - `mom6_cice6_UFS/fc/s2s/` + - `geos/fc/x0048/` + +### Local ensemble DA inputs + +NOTE: These are only needed for the `localensembleda` suite, which is not a part of the core Swell tests (yet). +So, you may not need these...which is good, because these are massive (10s of TB). Be judicious about what you copy over! +In both cases, note also the `background_experiment` and background experiment start and end dates, as these will determine exactly which folders and files you need. +- Backgrounds: + - See the `geos_x_background_directory` variable and the `GetEnsembleGeosExperiment` task. + - By default, on Discover, these are in `/discover/nobackup/projects/gmao/dadev/rtodling/archive/Restarts/JEDI/541x`. + - On AWS, these are in `/efs/shared/restarts/jedi/541x/`. 
+ - An rsync command like the following may be useful: + + ```sh + nohup rsync -avz --copy-unsafe-links --progress \ + --dry-run \ + --filter '+ 13/**' \ + --filter '+ 19/**' \ + --filter '+ 181/' \ + --filter '+ 181/x0050/' \ + --filter '+ 181/x0050/atmens' \ + --filter '+ 181/x0050/atmens/Y2023/' \ + --filter '+ 181/x0050/atmens/Y2023/M10/' \ + --filter '+ 181/x0050/atmens/Y2023/M10/*.20231009*' \ + --filter '+ 181/x0050/atmens/Y2023/M10/*.20231010*' \ + --filter '- 181/**'\ + --filter '- *'\ + /discover/nobackup/projects/gmao/dadev/rtodling/archive/Restarts/JEDI/541x/ \ + swelldev:/efs/shared/restarts/jedi/541x/ \ + &> ~/geos_bkg.log & + ``` + +- Ensembles: + - NOTE: These are needed only for the `localensembleda` suite, which is not a part of the core tests. + - See the `geos_x_ensemble_directory` variable, the `GetEnsembleGeosExperiment` task, and the file `src/swell/configuration/jedi/interfaces/geos_atmosphere/task_questions.yaml` + - Background experiment: `x0050` + - By default, on Discover, these are in `/discover/nobackup/projects/gmao/dadev/rtodling/archive/541/Milan`. diff --git a/requirements-aws.txt b/requirements-aws.txt new file mode 100644 index 000000000..56ad5f1eb --- /dev/null +++ b/requirements-aws.txt @@ -0,0 +1,6 @@ +mepo>=2.3.2 +eva @ git+https://github.com/JCSDA-internal/eva@1.6.5 +jedibundle @ git+https://github.com/GEOS-ESM/jedi_bundle@1.0.27 + +git+file:///shared/jedi-src/spack19/solo +git+file:///shared/jedi-src/spack19/r2d2 diff --git a/src/swell/deployment/platforms/aws/__init__.py b/src/swell/deployment/platforms/aws/__init__.py new file mode 100644 index 000000000..f82722eaf --- /dev/null +++ b/src/swell/deployment/platforms/aws/__init__.py @@ -0,0 +1,9 @@ +# (C) Copyright 2021- United States Government as represented by the Administrator of the +# National Aeronautics and Space Administration. All Rights Reserved. 
+# +# This software is licensed under the terms of the Apache Licence Version 2.0 +# which can be obtained at http://www.apache.org/licenses/LICENSE-2.0. + +import os + +repo_directory = os.path.dirname(__file__) diff --git a/src/swell/deployment/platforms/aws/modules b/src/swell/deployment/platforms/aws/modules new file mode 100644 index 000000000..5e25e081b --- /dev/null +++ b/src/swell/deployment/platforms/aws/modules @@ -0,0 +1,69 @@ +# Module initialization +# --------------------- +# NOTE: Maybe not necessary? +# source /usr/share/lmod/lmod/init/bash + +# Purge modules +# ------------- +module purge + +# Spack stack modules +# ------------------- +module use /shared/spack-stack/envs/swell.my_aws/install/modulefiles/Core +module load stack-gcc/11.4.0 +module load stack-openmpi/5.0.5 +module load stack-python/3.11.7 + +module load git-lfs/3.0.2 +module load py-pip/23.1.2 + +module load jedi-fv3-env/1.0.0 +module load soca-env/1.0.0 +module load gmao-swell-env/1.0.0 + +# NOTE: Not sure this is necessary? 
+# module unload gsibec crtm fms +# module load fms/2024.02 + +# JEDI Python Path +# ---------------- +PYTHONPATH={{experiment_root}}/{{experiment_id}}/jedi_bundle/build/lib/python{{python_majmin}}:$PYTHONPATH + +# Aircraft Bias Python Path +# ------------------------- +PYTHONPATH={{experiment_root}}/{{experiment_id}}/jedi_bundle/source/iodaconv/src/gsi_varbc:$PYTHONPATH + +# Load GEOS modules +# ----------------- +# NOTE: Install via pip (requirements-aws.txt) +# module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES15 +# module load other/mepo + +# Load r2d2 modules +# ----------------- +# module use -a /discover/nobackup/projects/gmao/advda/JediOpt/modulefiles/core +# module load solo/sles15_skylab9 +# NOTE: Pull from spack-stack +module load py-boto3/1.34.44 +# NOTE: Install via pip (requirements-aws.txt) +# module load r2d2/sles15_spack19 + +# Load eva and jedi_bundle +# ------------------------ +# NOTE: Install via pip (requirements-aws.txt) +# module load eva/sles15_skylab9 +# NOTE: Install via pip (requirements-aws.txt) +# module load jedi_bundle/sles15_skylab9 + +# Set the swell paths +# ------------------- +PATH={{swell_bin_path}}:$PATH +PYTHONPATH={{swell_lib_path}}:$PYTHONPATH + +# Unlimited Stacksize +# ------------------- +ulimit -S -s unlimited +ulimit -S -v unlimited +umask 022 + +# vim: set filetype=sh : diff --git a/src/swell/deployment/platforms/aws/properties.yaml b/src/swell/deployment/platforms/aws/properties.yaml new file mode 100644 index 000000000..09337873d --- /dev/null +++ b/src/swell/deployment/platforms/aws/properties.yaml @@ -0,0 +1,3 @@ +hostname: + login: ip + compute: compute-dy diff --git a/src/swell/deployment/platforms/aws/r2d2_config.yaml b/src/swell/deployment/platforms/aws/r2d2_config.yaml new file mode 100755 index 000000000..0ef2cf805 --- /dev/null +++ b/src/swell/deployment/platforms/aws/r2d2_config.yaml @@ -0,0 +1,22 @@ +databases: + + ${USER}: + class: LocalDB + root: {{r2d2_local_path}} + cache_fetch: false + + 
gmao-shared:
+    class: LocalDB
+    root: /efs/shared/R2D2DataStore/Shared
+    cache_fetch: false
+
+# when fetching data, in which order should the databases be accessed?
+fetch_order:
+  - ${USER}
+  - gmao-shared
+
+# when storing data, in which order should the databases be accessed?
+store_order:
+  - ${USER}
+
+cache_name: ${USER}
diff --git a/src/swell/deployment/platforms/aws/slurm.yaml b/src/swell/deployment/platforms/aws/slurm.yaml
new file mode 100644
index 000000000..650f80b75
--- /dev/null
+++ b/src/swell/deployment/platforms/aws/slurm.yaml
@@ -0,0 +1,2 @@
+nodes: 1
+no-requeue: ''
diff --git a/src/swell/deployment/platforms/aws/suite_questions.yaml b/src/swell/deployment/platforms/aws/suite_questions.yaml
new file mode 100644
index 000000000..98917f2bb
--- /dev/null
+++ b/src/swell/deployment/platforms/aws/suite_questions.yaml
@@ -0,0 +1,5 @@
+experiment_root:
+  default_value: /efs/${USER}/SwellExperiments
+
+r2d2_local_path:
+  default_value: /efs/${USER}/R2D2DataStore/Local
diff --git a/src/swell/deployment/platforms/aws/task_questions.yaml b/src/swell/deployment/platforms/aws/task_questions.yaml
new file mode 100644
index 000000000..84f1132b1
--- /dev/null
+++ b/src/swell/deployment/platforms/aws/task_questions.yaml
@@ -0,0 +1,35 @@
+crtm_coeff_dir:
+  default_value: /shared/SwellStaticFiles/jedi/crtm_coefficients/2.4.1/
+
+existing_geos_gcm_build_path:
+  default_value: /shared/GEOSgcm/v11.6.0/install/
+
+existing_geos_gcm_source_path:
+  default_value: /shared/GEOSgcm/v11.6.0/
+
+existing_jedi_build_directory:
+  default_value: /shared/build-jedi/build
+
+existing_jedi_source_directory:
+  default_value: /shared/build-jedi/jedi-bundle/
+
+existing_jedi_build_directory_pinned:
+  default_value: /shared/jedi-bundles/current-pinned-jedi-bundle/build
+
+existing_jedi_source_directory_pinned:
+  default_value: /shared/jedi-bundles/current-pinned-jedi-bundle/source
+
+geos_experiment_directory:
+  # Prefix: /geos/run_dirs/
+  default_value: 5deg_0701
+
+geos_restarts_directory:
+  # Prefix: /geos/restarts/
+  default_value: restarts_20210701_210000_5deg
+
+r2d2_local_path:
+  default_value: /efs/${USER}/R2D2DataStore/Local
+
+swell_static_files:
+  default_value: /shared/SwellStaticFiles
+
diff --git a/src/swell/deployment/prepare_config_and_suite/prepare_config_and_suite.py b/src/swell/deployment/prepare_config_and_suite/prepare_config_and_suite.py
index 7b6d756bd..e63ab2145 100644
--- a/src/swell/deployment/prepare_config_and_suite/prepare_config_and_suite.py
+++ b/src/swell/deployment/prepare_config_and_suite/prepare_config_and_suite.py
@@ -276,8 +276,23 @@ def override_with_defaults(self) -> None:
         for suite_task in ['suite', 'task']:
             platform_dict_file = os.path.join(get_swell_path(), 'deployment', 'platforms',
                                               self.platform, f'{suite_task}_questions.yaml')
-            with open(platform_dict_file, 'r') as ymlfile:
-                platform_defaults.update(yaml.safe_load(ymlfile))
+            try:
+                with open(platform_dict_file, 'r') as ymlfile:
+                    platform_defaults.update(yaml.safe_load(ymlfile))
+            except FileNotFoundError:
+                self.logger.info(
+                    f"Platform defaults file {platform_dict_file} not found. "
+                    "Assuming no platform defaults. "
+                    "Note that your workflows are likely to fail unless you "
+                    "have manually configured every platform-specific default "
+                    "in your overrides."
+                )
+            except TypeError as err:
+                if str(err) == "'NoneType' object is not iterable":
+                    self.logger.info(
+                        f"Platform defaults file {platform_dict_file} is empty. "
+                        "Assuming no platform defaults."
+                    )
 
 # Loop over the keys in self.question_dictionary_model_ind and update with platform_defaults
 # if that dictionary shares the key
diff --git a/src/swell/test/code_tests/test_generate_observing_system.py b/src/swell/test/code_tests/test_generate_observing_system.py
index e5b4fbedf..825321fe5 100644
--- a/src/swell/test/code_tests/test_generate_observing_system.py
+++ b/src/swell/test/code_tests/test_generate_observing_system.py
@@ -1,6 +1,8 @@
 import os
 import unittest
 import subprocess
+import shutil
+import atexit
 from datetime import datetime as dt
 from swell.utilities.logger import get_logger
 from swell.utilities.exceptions import SwellError
@@ -26,6 +28,8 @@ def setup_geos_mksi(reference: str):
     if not os.path.exists(geos_mksi_path):
         git_clone_cmd = ["git", "clone", url, geos_mksi_path]
         subprocess.run(git_clone_cmd, stderr=subprocess.DEVNULL)
+        # Delete the cloned GEOS_mksi directory after the tests run, to keep things clean
+        atexit.register(lambda: shutil.rmtree(geos_mksi_path, ignore_errors=True))
     git_checkout_cmd = ["git", "checkout", reference]
     subprocess.run(git_checkout_cmd, cwd=geos_mksi_path, stdout=subprocess.DEVNULL,
diff --git a/src/swell/test/code_tests/test_pinned_versions.py b/src/swell/test/code_tests/test_pinned_versions.py
index 91d8501c1..d46851e8c 100644
--- a/src/swell/test/code_tests/test_pinned_versions.py
+++ b/src/swell/test/code_tests/test_pinned_versions.py
@@ -1,6 +1,8 @@
 import os
 import unittest
 import subprocess
+import shutil
+import atexit
 from swell.utilities.logger import get_logger
 from swell.utilities.exceptions import SwellError
 from swell.test.code_tests.testing_utilities import suppress_stdout
@@ -21,6 +23,8 @@ def test_wrong_hash(self) -> None:
         if not os.path.exists(jedi_bundle_dir):
             os.makedirs(jedi_bundle_dir)
 
+        # Remove the directory after the tests
+        atexit.register(lambda: shutil.rmtree(jedi_bundle_dir, ignore_errors=True))
 
         # Clone oops repository in jedi_bundle (develop hash)
         if not os.path.exists(jedi_bundle_dir + "oops"):
diff --git 
a/src/swell/utilities/question_defaults.py b/src/swell/utilities/question_defaults.py index 8333ffc23..2aae9ff51 100644 --- a/src/swell/utilities/question_defaults.py +++ b/src/swell/utilities/question_defaults.py @@ -1173,7 +1173,7 @@ class path_to_gsi_nc_diags(TaskQuestion): @dataclass class perhost(TaskQuestion): - default_value: str = None + default_value: str | None = None question_name: str = "perhost" ask_question: bool = True options: List[bool] = mutable_field([ @@ -1255,7 +1255,7 @@ class swell_static_files(TaskQuestion): @dataclass class swell_static_files_user(TaskQuestion): - default_value: str = "None" + default_value: str | None = None question_name: str = "swell_static_files_user" prompt: str = "What is the path to the user provided Swell Static Files directory?" widget_type: WType = WType.STRING