diff --git a/.gitignore b/.gitignore
index 52aac2301..4340020ec 100644
--- a/.gitignore
+++ b/.gitignore
@@ -145,7 +145,4 @@ r2d2_credentials.yaml
 # Miscellaneous
 GEOS_mksi/
 jedi_bundle/
-output/
-
-
-*.md
+output/
\ No newline at end of file
diff --git a/docs/_sidebar.md b/docs/_sidebar.md
index eef119c99..b7e19d89f 100644
--- a/docs/_sidebar.md
+++ b/docs/_sidebar.md
@@ -23,11 +23,13 @@
     - [3DVAR](examples/soca/3dvar.md)
     - [3DFGAT_cycle]((examples/soca/3dfgat_cycle.md))
 - **R2D2 - Storing Data**
+  - [Understanding R2D2](examples/r2d2_intro.md)
   - [Storing Observations to R2D2](examples/ingest_obs.md)

 - Configuration files in swell

   - [Observation configuration](configs/observation_configuration.md)
+  - [R2D2 v3 credentials](configs/r2d2_v3_credentials.md)
   - [SLURM configuration](configs/slurm_configuration.md)
   - Model configuration:
     - [CICE6](configs/model_configurations/cice6.md)
diff --git a/docs/creating_an_experiment.md b/docs/creating_an_experiment.md
index 508e1da8c..2875dec28 100644
--- a/docs/creating_an_experiment.md
+++ b/docs/creating_an_experiment.md
@@ -4,12 +4,16 @@ Once you have installed `swell` and configured `cylc` you should be able to crea

 A useful command when using swell is `swell --help`. This will take you through all the options within swell. The help traverses through the applications so you can similarly issue `swell create --help`

+- Make sure you've configured `~/.swell/r2d2_credentials.yaml` as described in [R2D2 v3 credentials](configs/r2d2_v3_credentials.md).
+
 The first step is to create an experiment which is done with

 ```bash
 swell create
 ```

+**During `swell create`**: Credentials are loaded, and the experiment is registered in R2D2 automatically. The experiment ID is stored in `experiment.yaml` and used by STORE operations such as SaveRestart and SaveObsDiags.
+
 This will create a directory with your experiment ID in the experiment root.

 - If you specify no options the resulting experiment will be configured the way that suite is run in the tier 1 testing.
diff --git a/docs/examples/r2d2_intro.md b/docs/examples/r2d2_intro.md
new file mode 100644
index 000000000..3f2295443
--- /dev/null
+++ b/docs/examples/r2d2_intro.md
@@ -0,0 +1,289 @@
+# R2D2: Research Repository for Data and Diagnostics
+
+## Table of Contents
+
+1. [What is R2D2?](#what-is-r2d2)
+2. [How R2D2 Works](#how-r2d2-works)
+3. [R2D2 Concepts](#r2d2-concepts)
+4. [How Swell Uses R2D2](#how-swell-uses-r2d2)
+5. [Store & Fetch Quick Reference](#store--fetch-quick-reference)
+6. [Storing Observations to R2D2](examples/ingest_obs.md)
+
+---
+
+## What is R2D2?
+
+**R2D2** is a metadata + storage system for scientific data: it keeps a **MySQL database** of what files exist and where they live, while the **actual files** go in S3 or local storage. When you `fetch` or `store`, you talk to the R2D2 API for metadata; file transfers go **directly** to/from storage. Swell uses R2D2 to fetch observations, store backgrounds, and manage experiment data.
+
+Think of R2D2 as a **central database for scientific data** that:
+- Knows exactly where every file is stored
+- Tracks what type of data each file contains (observations, forecasts, analyses, etc.)
+- Remembers when data was created and by whom
+- Can quickly retrieve the right file when you need it
+
+**Swell + R2D2**: When you run a Swell experiment, it uses R2D2 to fetch observations, store/retrieve background and analysis files, and manage experiment metadata.
+
+---
+
+### Why R2D2
+
+R2D2 serves as the centralized source for managing and accessing scientific data.
+
+With R2D2 you can:
+- Retrieve specific files easily:
+  -
+    ```python
+    r2d2.fetch(
+        item='observation',
+        provider='nasa',
+        observation_type='airs',
+        window_start='20240103T120000Z',
+        window_length='PT6H',
+        target_file='obs.nc4'
+    )
+    ```
+- Store new data and make it accessible:
+  -
+    ```python
+    r2d2.store(
+        item='analysis',
+        model='geos',
+        experiment='my_exp',
+        file_extension='nc4',
+        date='20240103T120000Z',
+        source_file='./an.nc4'
+    )
+    ```
+- Automatically track data versions and timestamps
+- Share data securely with authorized users across locations
+- Prevent duplicate storage
+
+---
+
+## How R2D2 Works
+
+### Architecture Example:
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                         R2D2 Server (metadata only)                         │
+│                                                                             │
+│   ┌─────────────────────────────────────────────────────────────────────┐   │
+│   │  R2D2 API              │  MySQL / Database                          │   │
+│   │  (HTTP)                │  (what exists)                             │   │
+│   └─────────────────────────────────────────────────────────────────────┘   │
+│              │                                                              │
+│              │  Answers: "What files match? Where are they stored?"         │
+└──────────────┼──────────────────────────────────────────────────────────────┘
+               │
+               │  Client does NOT send files through the server.
+               │  Client talks to server for metadata, then transfers
+               │  files directly to/from storage (S3, local, etc.).
+               │
+       ┌───────┴────────────────────────────────────────────────────┐
+       │   Compute client                                           │
+       │   (HPC/Discover, cloud etc.)                               │
+       │                                                            │
+       │   import r2d2                                              │
+       │   r2d2.fetch(item='observation', provider='nasa', ...)     │
+       │   r2d2.store(item='observation', source_file='obs.nc', ...)│
+       └────────────────────────────────────────────────────────────┘
+                    │                             ▲
+                    │  Fetch: get metadata and    │  Direct transfer
+                    │  download from storage      │  to/from storage
+                    ▼                             │
+       ┌─────────────────────────────────────────────────────────────┐
+       │             Data storage (S3, local disk, etc.)             │
+       │    observation/ forecast/ analysis/ bias_correction/ ...    │
+       └─────────────────────────────────────────────────────────────┘
+```
+
+
+1. **R2D2 Server**: Only handles metadata queries
+   - "What observations exist for this window?"
+   - "Where is this file stored?" (returns S3 path or local path)
+
+2. **S3 / local storage**: Stores the actual data files
+   - File transfers go **directly** between your client and storage, *not through the R2D2 server*
+
+Because the server doesn't proxy file I/O, even a small EC2 instance can serve metadata for terabytes of data.
+
+---
+
+## R2D2 Concepts
+
+### Data Hub
+A **Data Hub** is a storage platform or cloud region where data can be stored.
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `name` | Unique identifier | `aws-us-east-1`, `discover-local`, `azure-eastus` |
+| `platform` | Storage platform type | `aws`, `local`, `azure`, `gcloud` |
+| `region` | Geographic region | `us-east-1`, `us-west-2` |
+
+**Why it exists**: You may access data from different cloud providers or on-premise storage. A data hub tells R2D2 which storage system to use.
+
+### Data Store
+A **Data Store** is a data repository: a specific storage location (such as an S3 bucket or file system path) within a Data Hub.
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `name` | Unique identifier (often the bucket name) | `r2d2-experiments-prod-us-east-1` |
+| `data_hub` | Which Data Hub this belongs to | `aws-us-east-1` |
+| `data_store_type` | Category of data | `experiments`, `archive`, `skylab` |
+| `basedir` | Base directory path | `/data/r2d2/` or empty for S3 root |
+| `read_only` | Whether writes are allowed | `true` or `false` |
+
+
+### Compute Host
+A **Compute Host** represents a computing environment where scientists run their code.
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `name` | Unique identifier | `discover-intel`, `localhost-gnu`, `aws-graviton-gnu` |
+| `hostname` | Machine identifier | `discover`, `localhost`, `ip179-99-99-99` |
+| `compiler` | Compiler used to build software | `intel`, `gnu`, `nvhpc` |
+
+
+### How They Connect
+
+```
+                ┌───────────────────┐
+                │   Compute Host    │
+                │ (discover-intel)  │
+                └───────────────────┘
+                          │
+                          │  "Where should I store/fetch data?"
+                          │
+                          ▼
+        ┌─────────────────────────────────┐
+        │      compute_host_register      │
+        │   (links hosts to data hubs)    │
+        │                                 │
+        │  discover-intel → aws-us-east-1 │
+        │  localhost-gnu  → aws-us-east-1 │
+        └─────────────────┬───────────────┘
+                          │
+                          ▼
+                 ┌─────────────────┐
+                 │    Data Hub     │
+                 │  (aws-us-east-1)│
+                 └────────┬────────┘
+                          │
+                          │  "Which bucket within this hub?"
+                          │
+                          ▼
+                 ┌─────────────────┐
+                 │   Data Store    │
+                 │  (r2d2-bucket)  │
+                 └─────────────────┘
+```
+
+---
+
+## How Swell Uses R2D2
+
+When you run a Swell experiment, R2D2 is used behind the scenes in several tasks:
+
+| Swell Task | What it does with R2D2 |
+|------------|------------------------|
+| **Get Observations** | Fetches observation files from R2D2 by `provider`, `observation_type`, `window_start`, `window_length`; falls back to empty observations if not found |
+| **Store Background** | Stores forecast/background files so they can be reused by later cycles |
+| **Get Background** | Fetches background files for the current cycle from R2D2 |
+| **Ingest Obs** | Ingest suite that stores newly processed observations into R2D2 |
+| **Save Obs Diags** | Stores feedback/diagnostic files (`item='feedback'`) |
+| **Save Restart** | Stores forecast and analysis restart files for model components |
+
+> **Note**: R2D2 adaptation in Swell is under active development. Task behavior and configuration may change as implementation continues.
+
+---
+
+## Store & Fetch Quick Reference
+
+### Observation (shared input data — no experiment)
+
+```python
+# Fetch
+r2d2.fetch(item='observation',
+           provider='ncdiag',
+           observation_type='airs',
+           file_extension='nc4',
+           window_start='20240103T120000Z',
+           window_length='PT6H',
+           target_file='obs.nc4')
+
+# Store
+r2d2.store(item='observation',
+           provider='ncdiag',
+           observation_type='airs',
+           file_extension='nc4',
+           window_start='20240103T120000Z',
+           window_length='PT6H',
+           source_file='./obs.nc4')
+```
+
+**Required:** `provider`, `observation_type`, `file_extension`, `window_start`, `window_length`
+
+---
+
+### Analysis & forecast/background (experiment-specific)
+
+**Required:** `model`, `experiment`, `file_extension`, `date`. For forecast also: `resolution`, `step`.
+
+```python
+# Fetch analysis
+r2d2.fetch(item='analysis',
+           model='geos',
+           experiment='my_exp',
+           file_extension='nc4',
+           date='20240103T120000Z',
+           target_file='an.nc4')
+
+# Fetch forecast (background)
+r2d2.fetch(item='forecast',
+           model='geos',
+           experiment='my_exp',
+           file_extension='nc4',
+           resolution='c90',
+           step='PT6H',
+           date='20240103T120000Z',
+           target_file='bkg.nc4')
+
+# Store analysis
+r2d2.store(item='analysis',
+           model='geos',
+           experiment='my_exp',
+           file_extension='nc4',
+           date='20240103T120000Z',
+           source_file='./an.nc4')
+
+# Store forecast
+r2d2.store(item='forecast',
+           model='geos',
+           experiment='my_exp',
+           file_extension='nc4',
+           resolution='c90',
+           step='PT6H',
+           date='20240103T120000Z',
+           source_file='./bkg.nc4')
+```
+
+**Note:** `experiment` must be registered in R2D2 first.
+
+---
+
+### Bias correction (experiment-specific)
+
+**Required:** `model`, `experiment`, `provider`, `observation_type`, `file_extension`, `file_type`, `date`
+
+```python
+r2d2.fetch(item='bias_correction',
+           model='geos',
+           experiment='my_exp',
+           provider='gsi',
+           observation_type='airs',
+           file_extension='satbias',
+           file_type='satbias',
+           date='20240103T120000Z',
+           target_file='satbias.nc')
+```
diff --git a/docs/installing_swell.md b/docs/installing_swell.md
index 1bb739d76..5bb035f4f 100644
--- a/docs/installing_swell.md
+++ b/docs/installing_swell.md
@@ -32,3 +32,6 @@ pip install --prefix=/path/to/install/swell/ .
 To make the software usable ensure `/path/to/install/swell/bin` is in the `$PATH`. Also ensure that `/path/to/install/swell/lib/python<version>/site-packages` is in the `$PYTHONPATH`, where `<version>` denotes the version of Python used for the install, e.g. `3.9`.

 Swell makes use of additional packages which are located in shared directories on Discover, such as under `/discover/nobackup/projects/gmao`. When installed correctly, many of these libraries should be visible in the `$PYTHONPATH`.
+
+
+Configure `~/.swell/r2d2_credentials.yaml` as described in [R2D2 v3 credentials](configs/r2d2_v3_credentials.md).
\ No newline at end of file
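
A note on the Quick Reference added in `docs/examples/r2d2_intro.md`: the **Required:** lists could be checked on the client before a call reaches R2D2. The following is a minimal sketch under stated assumptions, not part of this change and not part of the `r2d2` package. `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the key lists are copied from the Quick Reference sections in the diff.

```python
# Hypothetical pre-flight check. REQUIRED_KEYS mirrors the "Required:" lists
# in the Quick Reference; it is an illustration, not an r2d2 API.
REQUIRED_KEYS = {
    "observation": ["provider", "observation_type", "file_extension",
                    "window_start", "window_length"],
    "analysis": ["model", "experiment", "file_extension", "date"],
    "forecast": ["model", "experiment", "file_extension", "date",
                 "resolution", "step"],
    "bias_correction": ["model", "experiment", "provider", "observation_type",
                        "file_extension", "file_type", "date"],
}


def missing_keys(item, **kwargs):
    """Return the required keys for `item` that are absent from kwargs."""
    if item not in REQUIRED_KEYS:
        raise ValueError(f"unknown item type: {item!r}")
    return [key for key in REQUIRED_KEYS[item] if key not in kwargs]


# A forecast request that forgot `step`.
forecast_args = dict(model="geos", experiment="my_exp", file_extension="nc4",
                     date="20240103T120000Z", resolution="c90")
print(missing_keys("forecast", **forecast_args))  # ['step']
```

An empty list means every key named in the Quick Reference is present; it says nothing about whether the values are valid or whether the experiment is registered.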