5 changes: 1 addition & 4 deletions .gitignore
@@ -145,7 +145,4 @@ r2d2_credentials.yaml
# Miscellaneous
GEOS_mksi/
jedi_bundle/
output/


*.md
output/
2 changes: 2 additions & 0 deletions docs/_sidebar.md
@@ -23,11 +23,13 @@
- [3DVAR](examples/soca/3dvar.md)
- [3DFGAT_cycle](examples/soca/3dfgat_cycle.md)
- **R2D2 - Storing Data**
- [Understanding R2D2](examples/r2d2_intro.md)
- [Storing Observations to R2D2](examples/ingest_obs.md)

- **Configuration files in swell**

- [Observation configuration](configs/observation_configuration.md)
- [R2D2 v3 credentials](configs/r2d2_v3_credentials.md)
- [SLURM configuration](configs/slurm_configuration.md)
- Model configuration:
- [CICE6](configs/model_configurations/cice6.md)
4 changes: 4 additions & 0 deletions docs/creating_an_experiment.md
@@ -4,12 +4,16 @@ Once you have installed `swell` and configured `cylc` you should be able to crea

A useful command when using swell is `swell --help`, which walks you through all the options within swell. The help traverses the applications, so you can similarly issue `swell create --help`.

- Make sure you've configured `~/.swell/r2d2_credentials.yaml` as described in [R2D2 v3 credentials](configs/r2d2_v3_credentials.md).

The first step is to create an experiment which is done with

```bash
swell create <suite> <options>
```

**During `swell create`**: credentials are loaded and the experiment is registered in R2D2 automatically. The experiment ID is stored in `experiment.yaml` and used by STORE operations such as SaveRestart and SaveObsDiags.

This will create a directory with your experiment ID in the experiment root.

- If you specify no options the resulting experiment will be configured the way that suite is run in the tier 1 testing.
289 changes: 289 additions & 0 deletions docs/examples/r2d2_intro.md
@@ -0,0 +1,289 @@
# R2D2: Research Repository for Data and Diagnostics

## Table of Contents

1. [What is R2D2?](#what-is-r2d2)
2. [How R2D2 Works](#how-r2d2-works)
3. [R2D2 Concepts](#r2d2-concepts)
4. [How Swell Uses R2D2](#how-swell-uses-r2d2)
5. [Store & Fetch Quick Reference](#store--fetch-quick-reference)
6. [Storing Observations to R2D2](examples/ingest_obs.md)

---

## What is R2D2?

**R2D2** is a metadata + storage system for scientific data: it keeps a **MySQL database** of what files exist and where they live, while the **actual files** go in S3 or local storage. When you `fetch` or `store`, you talk to the R2D2 API for metadata; file transfers go **directly** to/from storage. Swell uses R2D2 to fetch observations, store backgrounds, and manage experiment data.

Think of R2D2 as a **central database for scientific data** that:
- Knows exactly where every file is stored
- Tracks what type of data each file contains (observations, forecasts, analyses, etc.)
- Remembers when data was created and by whom
- Can quickly retrieve the right file when you need it

**Swell + R2D2**: When you run a Swell experiment, it uses R2D2 to fetch observations, store/retrieve background and analysis files, and manage experiment metadata.

---

### Why R2D2

R2D2 serves as the centralized source for managing and accessing scientific data.

With R2D2 you can:
- Retrieve specific files easily:
```python
r2d2.fetch(
item='observation',
provider='nasa',
observation_type='airs',
window_start='20240103T120000Z',
window_length='PT6H',
target_file='obs.nc4'
)
```
- Store new data and make it accessible:
```python
r2d2.store(
item='analysis',
model='geos',
experiment='my_exp',
file_extension='nc4',
date='20240103T120000Z',
source_file='./an.nc'
)
```
- Automatically track data versions and timestamps
- Share data securely with authorized users across locations
- Prevent duplicate storage

---

## How R2D2 Works

### Architecture Example:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ R2D2 Server (metadata only) │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ R2D2 API │ MySQL / Database │ │
│ │ (HTTP) │ (what exists) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Answers: "What files match? Where are they stored?" │
└──────────────┼──────────────────────────────────────────────────────────────┘
│ Client does NOT send files through the server.
│ Client talks to server for metadata, then transfers
│ files directly to/from storage (S3, local, etc.).
┌───────┴────────────────────────────────────────────────────┐
│ Compute client │
│ (HPC/Discover, cloud etc.) │
│ │
│ import r2d2 │
│ r2d2.fetch(item='observation', provider='nasa', ...) │
│ r2d2.store(item='observation', source_file='obs.nc', ...)│
└────────────────────────────────────────────────────────────┘
│ ▲
│ Fetch: get metadata and │ Direct transfer
│ download from storage │ to/from storage
▼ │
┌─────────────────────────────────────────────────────────────┐
│ Data storage (S3, local disk, etc.) │
│ observation/ forecast/ analysis/ bias_correction/ ... │
└─────────────────────────────────────────────────────────────┘
```


1. **R2D2 Server**: Only handles metadata queries
- "What observations exist for this window?"
- "Where is this file stored?" (returns S3 path or local path)

2. **S3 / local storage**: Stores the actual data files
- File transfers go **directly** between your client and S3; *not through the R2D2 server*

Even with a small EC2 instance, R2D2 can serve metadata for terabytes of data. The server doesn't proxy file I/O.
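The two-phase flow can be sketched in plain Python. This is a toy model only: the dictionaries below stand in for the metadata database and the object store, and `query_metadata`/`fetch` are illustrative names, not the real `r2d2` API.

```python
# Toy model of R2D2's two-phase design: metadata lookup first,
# then a transfer that goes directly to/from storage.

# Stand-in for the MySQL metadata catalog on the R2D2 server.
CATALOG = {
    ("observation", "nasa", "airs", "20240103T120000Z"):
        "observation/nasa/airs/20240103T120000Z.nc4",
}

# Stand-in for the object store (S3 or local disk in real life).
STORAGE = {
    "observation/nasa/airs/20240103T120000Z.nc4": b"...netCDF bytes...",
}

def query_metadata(item, provider, observation_type, window_start):
    """Phase 1: metadata-only round trip to the server.

    Answers 'what matches, and where is it stored?' -- no file bytes.
    """
    return CATALOG[(item, provider, observation_type, window_start)]

def fetch(target_file, **query):
    """Phase 2: transfer directly between client and storage."""
    key = query_metadata(**query)
    with open(target_file, "wb") as f:
        f.write(STORAGE[key])  # bytes never pass through the metadata server
    return target_file
```

Because the server only answers the phase-1 question, it stays cheap to run no matter how large the stored files are.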

---

## R2D2 Concepts

### Data Hub
A **Data Hub** is a storage platform or cloud region where data can be stored.

| Property | Description | Example Values |
|----------|-------------|----------------|
| `name` | Unique identifier | `aws-us-east-1`, `discover-local`, `azure-eastus` |
| `platform` | Storage platform type | `aws`, `local`, `azure`, `gcloud` |
| `region` | Geographic region | `us-east-1`, `us-west-2` |

**Why it exists**: You may access data from different cloud providers or on-premise storage. A data hub tells R2D2 which storage system to use.

### Data Store
A **Data Store** is a specific storage location within a Data Hub, such as an S3 bucket or a file system path.

| Property | Description | Example Values |
|----------|-------------|----------------|
| `name` | Unique identifier (often the bucket name) | `r2d2-experiments-prod-us-east-1` |
| `data_hub` | Which Data Hub this belongs to | `aws-us-east-1` |
| `data_store_type` | Category of data | `experiments`, `archive`, `skylab` |
| `basedir` | Base directory path | `/data/r2d2/` or empty for S3 root |
| `read_only` | Whether writes are allowed | `true` or `false` |


### Compute Host
A **Compute Host** represents a computing environment where scientists run their code.

| Property | Description | Example Values |
|----------|-------------|----------------|
| `name` | Unique identifier | `discover-intel`, `localhost-gnu`, `aws-graviton-gnu` |
| `hostname` | Machine identifier | `discover`, `localhost`, `ip179-99-99-99` |
| `compiler` | Compiler used to build software | `intel`, `gnu`, `nvhpc` |


### How They Connect

```
┌───────────────────┐
│ Compute Host │
│ (discover-intel) │
└───────────────────┘
│ "Where should I store/fetch data?"
┌─────────────────────────────────┐
│ compute_host_register │
│ (links hosts to data hubs) │
│ │
│ discover-intel → aws-us-east-1 │
│ localhost-gnu → aws-us-east-1 │
└─────────────────┬───────────────┘
┌─────────────────┐
│ Data Hub │
│ (aws-us-east-1)│
└────────┬────────┘
│ "Which bucket within this hub?"
┌─────────────────┐
│ Data Store │
│ (r2d2-bucket) │
└─────────────────┘
```
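The chain above amounts to a two-hop lookup. The sketch below is a toy illustration using the example values from this section; the dictionary names mirror the registry described here, not actual r2d2 code.

```python
# Toy resolution chain: compute host -> data hub -> data store.

# compute_host_register: links hosts to data hubs.
COMPUTE_HOST_REGISTER = {
    "discover-intel": "aws-us-east-1",
    "localhost-gnu": "aws-us-east-1",
}

# Data stores, keyed by the data hub they belong to.
DATA_STORES = {
    "aws-us-east-1": {
        "name": "r2d2-experiments-prod-us-east-1",
        "data_store_type": "experiments",
        "read_only": False,
    },
}

def resolve_data_store(compute_host):
    """Answer 'where should this host store/fetch data?' in two hops."""
    data_hub = COMPUTE_HOST_REGISTER[compute_host]   # hop 1: host -> hub
    return DATA_STORES[data_hub]                     # hop 2: hub -> store
```

Note that two different compute hosts can resolve to the same data hub, which is what lets data stored from one environment be fetched from another.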

---

## How Swell Uses R2D2

When you run a Swell experiment, R2D2 is used behind the scenes in several tasks:

| Swell Task | What it does with R2D2 |
|------------|------------------------|
| **Get Observations** | Fetches observation files from R2D2 by `provider`, `observation_type`, `window_start`, `window_length`; falls back to empty observations if not found |
| **Store Background** | Stores forecast/background files so they can be reused by later cycles |
| **Get Background** | Fetches background files for the current cycle from R2D2 |
| **Ingest Obs** | Ingest suite that stores newly processed observations into R2D2 |
| **Save Obs Diags** | Stores feedback/diagnostic files (`item='feedback'`) |
| **Save Restart** | Stores forecast and analysis restart files for model components |

> **Note**: R2D2 adaptation in Swell is under active development. Task behavior and configuration may change as implementation continues.

---

## Store & Fetch Quick Reference

### Observation (shared input data — no experiment)

```python
# Fetch
r2d2.fetch(item='observation',
provider='ncdiag',
observation_type='airs',
file_extension='nc4',
window_start='20240103T120000Z',
window_length='PT6H',
target_file='obs.nc4')

# Store
r2d2.store(item='observation',
provider='ncdiag',
observation_type='airs',
file_extension='nc4',
window_start='20240103T120000Z',
window_length='PT6H',
source_file='./obs.nc4')
```

**Required:** `provider`, `observation_type`, `file_extension`, `window_start`, `window_length`
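The required-key rule above can be expressed as a small check. This is a toy helper for illustration, not part of the r2d2 package.

```python
# Required metadata keys for item='observation', per the list above.
REQUIRED_OBSERVATION_KEYS = {
    "provider", "observation_type", "file_extension",
    "window_start", "window_length",
}

def missing_observation_keys(**kwargs):
    """Return (sorted) required keys absent from a fetch/store call."""
    return sorted(REQUIRED_OBSERVATION_KEYS - kwargs.keys())
```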

---

### Analysis & forecast/background (experiment-specific)

**Required:** `model`, `experiment`, `file_extension`, `date`. For forecast also: `resolution`, `step`.

```python
# Fetch analysis
r2d2.fetch(item='analysis',
model='geos',
experiment='my_exp',
file_extension='nc4',
date='20240103T120000Z',
target_file='an.nc4')

# Fetch forecast (background)
r2d2.fetch(item='forecast',
model='geos',
experiment='my_exp',
file_extension='nc4',
resolution='c90',
step='PT6H',
date='20240103T120000Z',
target_file='bkg.nc4')

# Store analysis
r2d2.store(item='analysis',
model='geos',
experiment='my_exp',
file_extension='nc4',
date='20240103T120000Z',
source_file='./an.nc4')

# Store forecast
r2d2.store(item='forecast',
model='geos',
experiment='my_exp',
file_extension='nc4',
resolution='c90',
step='PT6H',
date='20240103T120000Z',
source_file='./bkg.nc4')
```

**Note:** `experiment` must be registered in R2D2 first.

---

### Bias correction (experiment-specific)

**Required:** `model`, `experiment`, `provider`, `observation_type`, `file_extension`, `file_type`, `date`

```python
r2d2.fetch(item='bias_correction',
model='geos',
experiment='my_exp',
provider='gsi',
observation_type='airs',
file_extension='satbias',
file_type='satbias',
date='20240103T120000Z',
target_file='satbias.nc')
```
3 changes: 3 additions & 0 deletions docs/installing_swell.md
@@ -32,3 +32,6 @@ pip install --prefix=/path/to/install/swell/ .
To make the software usable ensure `/path/to/install/swell/bin` is in the `$PATH`. Also ensure that `/path/to/install/swell/lib/python<version>/site-packages` is in the `$PYTHONPATH`, where `<version>` denotes the version of Python used for the install, e.g. `3.9`.

Swell makes use of additional packages which are located in shared directories on Discover, such as under `/discover/nobackup/projects/gmao`. When installed correctly, many of these libraries should be visible in the `$PYTHONPATH`.


Configure `~/.swell/r2d2_credentials.yaml` as described in [R2D2 v3 credentials](configs/r2d2_v3_credentials.md).
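For orientation only, the credentials file follows a simple key-value YAML layout. The keys shown below are placeholder assumptions, not the actual schema; consult [R2D2 v3 credentials](configs/r2d2_v3_credentials.md) for the real field names and values.

```yaml
# ~/.swell/r2d2_credentials.yaml
# Illustrative placeholders only; the actual keys are documented in
# the R2D2 v3 credentials page linked above.
api_url: https://r2d2.example.gov
access_key: YOUR_ACCESS_KEY
secret_key: YOUR_SECRET_KEY
```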