
## Step 1: Create metadata describing each NWP dataset

There will be four parts to the metadata:

  1. Hand-written metadata: NWP name, bucket, ensemble members, forecast steps.
  2. Metadata generated algorithmically by the explore_nwps Python package: NWP variables, vertical levels, horizontal coordinates.
  3. The data contract.
  4. Reference datetimes (AKA init datetimes). These can be grabbed at "inference time" within a few seconds, or even faster if we "burn in" the NWP reference datetimes and only grab the most recent.
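
As a rough illustration of how these four parts might fit together (the field names and values below are invented for this sketch, not the real schema; see the YAML file structure at the end of this document for the layout being considered):

```python
# Invented field names and values -- a sketch only, not the real schema.
gefs_metadata = {
    # 1. Hand-written: NWP name, bucket, ensemble members, forecast steps.
    "name": "GEFS",
    "bucket": "noaa-gefs-pds",
    "ensemble_members": [f"{i:02d}" for i in range(31)],
    "forecast_step_hours": list(range(0, 241, 3)),
    # 2. Generated algorithmically by explore_nwps: variables, levels, horizontal coords.
    "variables": ["TMP", "UGRD", "VGRD"],
    "vertical_levels": ["2 m above ground", "10 m above ground"],
    "horizontal_coords": {"x": [0.0, 0.5, 1.0], "y": [90.0, 89.5, 89.0]},
    # 3. The data contract would be referenced here.
    # 4. Reference (init) datetimes, grabbed at "inference time".
    "reference_datetimes": ["2024-01-01T00", "2024-01-01T06"],
}
```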

## Step 2: Load the metadata and the data

Open dataset:

import xarray as xr

da = xr.open_dataset(URL, engine="hypergrib")

hypergrib loads the metadata and passes to xarray the full list of coordinates and dimension names, e.g.:

dims = ["init_time", "nwp_variable", "vertical_level", "timestep", "ensemble_member"]
coords = {
  "init_time": ["2024-01-01", "2024-01-02"],
  # etc.
}

User request:

da.sel(
  init_time="2024-01-01",
  nwp_variable=["temperature", "wind_speed"],
  vertical_level="2meters",
  # all forecast time steps
  # all ensemble members
)

xarray converts these coordinate labels to integer indexes:

da.isel(
  init_time=0,
  nwp_variable=[0, 1],
  vertical_level=0,
  # all forecast time steps
  # all ensemble members
)

The integer indexes get passed to the hypergrib backend for xarray. (In the future, hypergrib may implement a custom xarray index, so we can avoid the redundant conversion to integer indexes and back to coordinate labels).
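
As a minimal sketch of that round trip (the names below are illustrative, not hypergrib's actual API): xarray maps the user's labels to integer positions, and the backend then has to map those positions back to labels using its own coordinate arrays.

```python
import numpy as np

# Coordinate labels the backend advertised to xarray (illustrative values).
coords = {
    "init_time": np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[D]"),
    "nwp_variable": np.array(["temperature", "wind_speed", "pressure"]),
    "vertical_level": np.array(["2meters", "10meters"]),
}

# What xarray's .sel effectively produces before calling the backend:
integer_indexes = {"init_time": 0, "nwp_variable": [0, 1], "vertical_level": 0}

# The backend's inverse lookup: integer indexes -> coordinate labels.
def labels_for(dim: str, idx) -> np.ndarray:
    return coords[dim][np.atleast_1d(idx)]

for dim, idx in integer_indexes.items():
    print(dim, labels_for(dim, idx))
# init_time ['2024-01-01']
# nwp_variable ['temperature' 'wind_speed']
# vertical_level ['2meters']
```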

## Planned hypergrib MVP features

  • Load the metadata (which was produced by step 1).
  • Convert integer indices back to coordinate labels by looking up the appropriate labels in hypergrib's coords arrays.
  • Find the unique tuples of init date, init hour, ensemble member, and step.
  • Algorithmically generate the locations of all the .idx files we need. For example, the GEFS location strings look like this: noaa-gefs-pds/gefs.<init date>/<init hour>/pgrb2b/gep<ensemble member>.t<init hour>z.pgrb2af<step> (see the sketch after this list).
  • In parallel, submit GET requests for all these .idx files.
  • As soon as an .idx file arrives, decode it, look up the byte ranges of the GRIB messages we need, and immediately submit GET requests for those byte ranges of the GRIB file. (This step is probably fast enough that we don't need to multi-thread it. For the MVP, let's use a single thread for decoding .idx files; if that's too slow then we can add more threads.) Maybe stop decoding rows in the .idx file once we've found all the metadata we need.
  • If an .idx file doesn't exist, let the user decide what happens when hypergrib tries but fails to read it. Three options:
    • Silent: Don't complain about the missing .idx. Just load the GRIB, scan it, and keep it in memory (because we'll soon extract binary data from it).
    • Warn: Log a warning about the missing .idx, then load the GRIB, scan it, and keep it in memory.
    • Fail: Complain loudly about the missing .idx and don't load the GRIB.
    • (Maybe, in a future version, we could offer the option to generate and cache .idx files locally.)
  • If the GRIB file itself doesn't exist then log a warning and insert the missing-data indicator into the array (which will probably be NaN for floating-point data).
  • As soon as GRIB data arrives, decode it, and place it into the final array. Decoding GRIB data should be multi-threaded.
  • Benchmark! See recent discussion on "Large Scale Geospatial Benchmarks" on the Pangeo forum.
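
To make the path-generation and .idx-decoding bullets above concrete, here is a rough sketch (not hypergrib's actual code). It assumes the .idx file sits next to the GRIB object with a `.idx` suffix, that members and steps are passed in as already zero-padded strings, and that the .idx follows the usual wgrib2-style inventory format (`msg_num:start_byte:d=<init>:<variable>:<level>:...`).

```python
from itertools import product

def gefs_paths(init_dates, init_hours, ensemble_members, steps):
    """Yield (grib_path, idx_path) for every combination, using the GEFS template above."""
    for date, hour, member, step in product(init_dates, init_hours, ensemble_members, steps):
        grib = f"noaa-gefs-pds/gefs.{date}/{hour}/pgrb2b/gep{member}.t{hour}z.pgrb2af{step}"
        yield grib, grib + ".idx"  # assumption: the .idx lives alongside the GRIB

def idx_to_byte_ranges(idx_text: str) -> dict:
    """Map (variable, level) -> (start_byte, end_byte) for one .idx file.

    A message's byte range ends where the next message starts; the last
    message runs to the end of the file (end_byte is None).
    """
    lines = [line.split(":") for line in idx_text.splitlines() if line]
    ranges = {}
    for i, fields in enumerate(lines):
        start = int(fields[1])
        end = int(lines[i + 1][1]) - 1 if i + 1 < len(lines) else None
        ranges[(fields[3], fields[4])] = (start, end)
    return ranges
```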

## Features beyond the MVP

### Allow the user to specify whether to load .idx files

Allow the user to set a threshold for when to load .idx files.

If the user requests more than THRESHOLD% of the GRIB messages in any GRIB file then skip the .idx and just load the GRIB. Otherwise, attempt to load the .idx. (The motivation is that, if the user wants to read most of the GRIB file, then fetching the .idx first just adds latency.)

Set the threshold to 100% to always try to load the .idx file before the GRIB.

Set the threshold to 0% to never load the .idx, and always load the GRIB file first.
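
A minimal sketch of that decision rule (the function name and signature are illustrative, not hypergrib's API):

```python
def should_fetch_idx(n_messages_wanted: int, n_messages_in_file: int, threshold: float) -> bool:
    """Return True to try the .idx first, False to read the GRIB file directly.

    threshold is a fraction in [0, 1]: 1.0 means always try the .idx first;
    0.0 means never fetch the .idx.
    """
    fraction_wanted = n_messages_wanted / n_messages_in_file
    return fraction_wanted <= threshold
```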

### Define an extended idx format

See #17

### Schedule the network IO

Balance several different objectives:

  • Keep a few hundred network requests in-flight at any given moment (user-configurable). (Why? Because the AnyBlob paper suggests that's what's required to achieve maximum throughput.)
  • Consolidate nearby byte-range requests (user-configurable) to minimise overhead and reduce the total number of IO operations (see the sketch below).
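
A sketch of what that coalescing might look like (the names and the `max_gap` parameter are assumptions for illustration, not hypergrib's actual API):

```python
def coalesce_byte_ranges(ranges: list[tuple[int, int]], max_gap: int) -> list[tuple[int, int]]:
    """Merge (start, end) byte ranges whose gap is <= max_gap bytes.

    Fewer, larger GET requests reduce per-request overhead, at the cost of
    downloading up to max_gap unwanted bytes between merged ranges.
    """
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged

# Example: with max_gap=1_000, the first two ranges become a single GET request.
print(coalesce_byte_ranges([(0, 5_000), (5_500, 9_000), (100_000, 120_000)], max_gap=1_000))
# [(0, 9000), (100000, 120000)]
```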

### Slice into each GRIB message

For example, some GRIBs are compressed in JPEG2000, and JPEG2000 allows parts of the image to be decompressed. (But would we still have to download the entire GRIB message, and the only benefit would be a small reduction in decompression time?)

And maybe we could decompress each GRIB file and save the state of the decompressor every, say, 4 kB. Then, at query time, if we want a single pixel we'd have to stream at most 4 kB of data from disk. Although that has its own issues.

But, to get a real speed-up, we'd want to read only a subset of each GRIB message. And GRIB data is probably stored as a sequence of horizontal scan lines. So, for example, if you wanted to read a single consecutive byte range for just the United Kingdom from a global NWP, the best you could do might be to read all the horizontal scan lines that include the UK. You'd end up with a short but very wide slice. But that's still a significant speedup, so it might be worth pursuing.
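
As a very rough illustration of the scan-line idea (this assumes a row-major grid with a fixed number of bytes per value, i.e. simple packing, and ignores GRIB section headers and other packings, so it is only a sketch):

```python
def scanline_byte_range(lat_min, lat_max, n_rows, n_cols,
                        lat_start, lat_step, bytes_per_value, data_offset):
    """Byte range covering every scan line that intersects [lat_min, lat_max].

    Assumes the field is stored as consecutive latitude rows (row-major),
    each row n_cols values wide, starting at lat_start and stepping by lat_step.
    """
    row_a = int(round((lat_max - lat_start) / lat_step))
    row_b = int(round((lat_min - lat_start) / lat_step))
    first_row, last_row = sorted((row_a, row_b))
    first_row, last_row = max(first_row, 0), min(last_row, n_rows - 1)
    start = data_offset + first_row * n_cols * bytes_per_value
    end = data_offset + (last_row + 1) * n_cols * bytes_per_value
    return start, end

# UK-ish latitude band (50N to 61N) on an illustrative 0.25-degree global grid:
# we read ~45 full rows of 1440 values each -- "short but very wide".
print(scanline_byte_range(50, 61, n_rows=721, n_cols=1440,
                          lat_start=90, lat_step=-0.25,
                          bytes_per_value=2, data_offset=0))
```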

### Get hypergrib working for as many NWPs as possible

See Issue #30.

### Implement existing protocols

See Issue #19.

### Distribute hypergrib's workload across multiple machines

So, for example, users can get acceptable IO performance even if they ask for "churro-shaped" data arrays.

### If it's too slow to get .idx files

  • For small GRIB files, just read the entirety of each GRIB file?
  • Store .idx files locally?
  • Convert .idx files to a more concise and cloud-friendly file format, which is published in a cloud bucket?
  • Put all the .idx data into a cloud-side database?
  • Put all the .idx data into a local database? DuckDB?
  • We probably want to avoid using a manifest file, or putting metadata for every GRIB message into a database, because we want to scale to datasets with trillions of GRIB messages. See #14

## A file structure for describing NWP datasets

Hand-written YAML per dataset:

datasets/gefs/index.yaml: See https://github.com/JackKelly/explore_nwps/blob/main/metadata/gefs/index.yaml

Output by Python scripts:

datasets/gefs/parameters.yaml:

- desc: Temperature
  abbr: TMP
  unit: K
  max: 373
  min: 173
  data_type: <f8
  compression: jpeg2000

datasets/gefs/spatial_coords/v12_0.5_degree.yaml:

- x:
    coord_labels: [0.0, 0.1, 0.2]
    unit: degrees
- y:
    coord_labels: [0.0, 0.2, 0.4]
    unit: degrees
- z:
    coord_labels: [5, 10, 20]
    unit: mb