rubin-dash

DRP Afterburner for Super HATS — converts Rubin DRP outputs into HATS catalogs suitable for use with lsdb.


Overview

The pipeline runs a sequence of stages that read from a Butler repository and write HATS catalogs to an output directory:

| Stage | Description |
| --- | --- |
| butler | Find catalog parquet files from the Butler repository |
| raw_sizes | Measure raw parquet file sizes |
| import | Import catalogs into HATS format |
| postprocess | Post-process imported catalogs |
| nesting | Build nested (light-curve) catalogs |
| collections | Generate HATS collections |
| crossmatch | Cross-match against external surveys (e.g. ZTF, PS1) |
| generate_json | Generate JSON metadata for the HATS collections |

Setting up the environment

This pipeline requires IDAC access and is normally run on USDF SLAC nodes; it cannot be run on a login node. Use tmux or screen so you can detach and reattach without losing your session: a full run typically takes at least ~5 hours and can take closer to ~15 hours.
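
For example, you can start a named tmux session on the interactive node before requesting anything (the session name here is arbitrary):

# start a named session; detach later with ctrl+b then d
tmux new -s rubin-dash

# reattach after reconnecting
tmux attach -t rubin-dash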

Request a reserved node

Your connection path should look like this:

graph LR
    L["<i>login node</i>"] --> T("<code>tmux/screen</code>")
    T --> I["<i>interactive node</i>"]
    I --> R["<i>reserved node</i>"]
style T fill:lightblue,stroke:darkblue,stroke-width:2px

From an interactive node, request a reserved node:

srun --pty --exclusive --nodes=1 --time=48:00:00 \
     --partition=milano --account=rubin:commissioning bash

Do not exit the reserved-node shell directly. Detach instead (tmux: ctrl+b then d; screen: ctrl+a then d) so the job keeps running.
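
After detaching, you can confirm from any node that the allocation is still alive using standard SLURM tooling:

# your srun job should still appear in the queue
squeue -u $USER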

Load the LSST stack

source /sdf/group/rubin/sw/loadLSST.sh
setup lsst_distrib
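
To sanity-check the stack, you can ask EUPS (the stack's package manager) which products are currently set up:

# should list the lsst_distrib version now active in this shell
eups list -s lsst_distrib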

Install rubin-dash

pip install rubin-dash
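
A quick way to verify the install; pip show is standard, while the --help flag is an assumption about this CLI rather than documented behavior:

# confirm the package and version that pip installed
pip show rubin-dash
# assumed to print usage for the run subcommand
rubin-dash run --help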

Running the pipeline

1. Create a config file

The package ships a default_config.toml with sensible defaults for all catalogs, nested catalogs, collections, crossmatch surveys, and Dask settings. Your config file is merged on top of those defaults — you only need to specify what changes for your run.

Copy example_config.toml and fill in the [run] section. The values come from the JIRA ticket associated with the weekly release. For example, the collection string LSSTCam/runs/DRP/20250417_20250921/w_2025_49/DM-53545 breaks down as:

[run]
instrument = "LSSTCam"
repo       = "/repo/embargo"         # Butler repo path
version    = "w_2025_49"
collection = "DM-53545"
output_dir = "/sdf/data/rubin/shared/lsdb_commissioning"
run        = "20250417_20250921"      # optional — omit for releases without a run segment
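
Read the other way, these four fields presumably recombine into the original collection string from the ticket:

# {instrument}/runs/DRP/{run}/{version}/{collection}
# LSSTCam/runs/DRP/20250417_20250921/w_2025_49/DM-53545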

Overriding stages

By default all stages run. Restrict to a subset:

[stages]
enabled = ["butler", "raw_sizes", "import", "postprocess"]

Overriding catalogs

By default all six catalogs are processed: dia_object, dia_source, dia_object_forced_source, object, source, object_forced_source. Restrict to a subset:

[catalogs]
enabled = ["dia_object", "object"]

Override settings for a specific catalog:

[catalogs.object]
chunksize = 100_000   # DimensionParquetReader batch size (default 250_000 for object)

[catalogs.object.import_args]
pixel_threshold = 500_000   # override any hats-import argument

Add a custom catalog not in the defaults (all fields required):

[catalogs.my_catalog]
dims            = ["tract"]
group_by        = ["tract"]
flux_columns    = []
add_mjds        = false
use_schema_file = false
chunksize       = 500_000

[catalogs.my_catalog.import_args]
ra_column       = "ra"
dec_column      = "dec"
catalog_type    = "object"
pixel_threshold = 1_000_000

Overriding nested catalogs

The defaults define two nested catalogs (dia_object_lc and object_lc). Override settings or restrict which ones are built:

[nested]
enabled = ["object_lc"]   # omit to run all

[nested.object_lc]
pixel_threshold       = 20_000   # override any field
highest_healpix_order = 10

Overriding collections

[collections]
enabled = ["object_collection"]   # omit to run all

[collections.object_collection]
margin_threshold = 10.0
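
As with stages and catalogs, the --collections and --nestings CLI flags (see CLI options below) select these at run time without editing the config.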

Overriding crossmatch surveys

The defaults cross-match against ZTF DR22 and PS1. Add, remove, or reconfigure:

# Disable all crossmatches by leaving surveys empty
[crossmatch]

# Or override a survey's search radius
[crossmatch.surveys.ztf_dr22]
radius_arcsec = 0.5

Overriding Dask settings

Global settings apply to all stages; stage-specific sections override them for that stage only:

[dask]
n_workers          = 32
threads_per_worker = 1
memory_limit       = "16GB"

[dask.stages.nesting]
n_workers    = 8
memory_limit = "32GB"
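
As a rough sanity check on these numbers: the global settings imply up to 32 × 16 GB = 512 GB across workers, while the nesting override uses 8 × 32 GB = 256 GB. Keys not repeated in a stage section (here threads_per_worker) presumably keep their global values.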

Layering multiple config files

You can split settings across files and layer them at run time — later files override earlier ones:

rubin-dash run --config base.toml --config this_week.toml --config overrides.toml

2. Run the full pipeline

rubin-dash run --config my_config.toml

CLI options

rubin-dash run --config CONFIG [--config CONFIG ...]
               [--stages butler,import,postprocess]
               [--from-stage STAGE]
               [--catalogs dia_object,object]
               [--nestings object_lc]
               [--collections object_collection]

| Option | Description |
| --- | --- |
| --config | TOML config file. Repeat to layer overrides (later files win). |
| --stages | Comma-separated list of stages to run. |
| --from-stage | Run all enabled stages starting from this one. |
| --catalogs | Restrict to a subset of catalogs. |
| --nestings | Restrict to specific nested catalogs. |
| --collections | Restrict to specific collections. |

Examples:

# Re-run only the import and postprocess stages
rubin-dash run --config my_config.toml --stages import,postprocess

# Resume from the nesting stage onward
rubin-dash run --config my_config.toml --from-stage nesting

# Layer a base config with per-run overrides
rubin-dash run --config base.toml --config overrides.toml

3. Interactive notebook access

To open the notebooks interactively from within the processing environment:

rubin-dash notebook --port 8769

This starts a Jupyter server and prints the SSH tunnel command you need to run on your laptop to forward the port. It will look something like:

ssh -J user@s3dflogin.slac.stanford.edu,user@sdfiana004 \
    -L 8769:localhost:8769 \
    user@sdfmilan005
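
Once the tunnel is up, point a browser on your laptop at the forwarded port; if Jupyter printed a tokenized URL, paste that instead:

# on your laptop, after establishing the tunnel
open http://localhost:8769   # macOS; use xdg-open on Linux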

4. Rerunning a single stage after a failure

If the pipeline fails partway through, you can rerun from a specific stage:

rubin-dash run --config my_config.toml --from-stage import

Or run a single stage in isolation:

rubin-dash run --config my_config.toml --stages import

If you need to debug interactively, the notebooks/ directory contains a notebook for each stage. Run them individually after confirming the environment variables are set. If you encounter unexpected issues with upstream data, reach out in #dm-algorithms-pipelines on the Rubin Observatory Slack.

Development

conda create -n rubin-dash python=3.11
conda activate rubin-dash
pip install -e ".[dev]"
chmod +x .setup_dev.sh
./.setup_dev.sh
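
Assuming the [dev] extra pulls in the usual test tooling (pytest is a guess, not confirmed here), you can check the checkout with:

# run the unit tests from the repository root
python -m pytest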
