There will be four parts to the metadata:
- Hand-written metadata: NWP name, bucket, ensemble members, forecast steps.
- Metadata generated algorithmically by the `explore_nwps` Python package: NWP variables, vertical levels, horizontal coordinates.
- Data contract.
- Reference datetimes (AKA init datetimes). These can be grabbed at "inference time" within a few seconds, or even faster if we "burn in" the NWP reference datetimes and only grab the most recent (see the sketch below).
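As an illustration of grabbing reference datetimes at "inference time", here is a minimal Python sketch (not `hypergrib`'s implementation) that lists the GEFS bucket's `gefs.<init date>` prefixes with `s3fs` and picks the most recent. The bucket name and prefix layout are taken from the GEFS example later in this document; anonymous access is an assumption.

```python
import datetime
import s3fs

# Anonymous access to the public GEFS bucket; top-level prefixes look like
# "noaa-gefs-pds/gefs.20240101".
fs = s3fs.S3FileSystem(anon=True)

init_dates = []
for prefix in fs.ls("noaa-gefs-pds"):
    name = prefix.rsplit("/", 1)[-1]
    if name.startswith("gefs."):
        init_dates.append(datetime.datetime.strptime(name.removeprefix("gefs."), "%Y%m%d"))

print(f"Most recent init date: {max(init_dates):%Y-%m-%d}")
```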
Open dataset:

```python
da = xr.open_dataset(URL, engine="hypergrib")
```

`hypergrib` loads the metadata and passes to xarray the full list of coordinates and dimension names, e.g.:

```python
dims = ["init_time", "variable", "vertical_level", "timestep", "ensemble_member"]
coords = {
    "init_time": ["2024-01-01", "2024-01-02"],
    # etc.
}
```
User request:

```python
da.sel(
    init_time="2024-01-01",
    nwp_variable=["temperature", "wind_speed"],
    vertical_level="2meters",
    # all forecast time steps
    # all ensemble members
)
```
xarray converts these coordinate labels to integer indexes:

```python
da.isel(
    init_time=0,
    nwp_variable=[0, 1],
    vertical_level=0,
    # all forecast time steps
    # all ensemble members
)
```
The integer indexes get passed to the `hypergrib` backend for xarray. (In the future, `hypergrib` may implement a custom xarray index, so we can avoid the redundant conversion to integer indexes and back to coordinate labels.)
- Load the metadata (which was produced by step 1).
- Convert integer indices back to coordinate labels by looking up the appropriate labels in `hypergrib`'s coords arrays.
- Find the unique tuples of init date, init hour, ensemble member, and step.
- Algorithmically generate the location of all the `.idx` files we need. For example, the GEFS location strings look like this: `noaa-gefs-pds/gefs.<init date>/<init hour>/pgrb2b/gep<ensemble member>.t<init hour>z.pgrb2af<step>` (see the sketch after this list).
- In parallel, submit GET requests for all these `.idx` files.
- As soon as an `.idx` file arrives, decode it, look up the byte ranges of the GRIB messages we need, and immediately submit GET requests for those byte ranges of the GRIB file (see the sketch after this list). This step is probably so fast that we may not need to multi-thread it: for the MVP, let's use a single thread for decoding `.idx` files, and if that's too slow then we can add more threads. Maybe stop decoding rows in the `.idx` file once we've found all the metadata we need.
- If an `.idx` file doesn't exist then:
  - Allow the user to determine what happens if `hypergrib` tries but fails to read an `.idx` file. Three options:
    - Silent: Don't complain about the missing `.idx`. Just load the GRIB, scan it, and keep it in memory (because we'll soon extract binary data from it).
    - Warn: Log a warning about the missing `.idx`. Then load the GRIB, scan it, and keep it in memory.
    - Fail: Complain loudly about the missing `.idx`! Don't load the GRIB.
    - (Maybe, in a future version, we could offer the option to generate and cache `.idx` files locally.)
- If no GRIB exists then log another warning and insert the MISSING DATA indicator into the array (which will probably be NaN for floating point data).
- As soon as GRIB data arrives, decode it, and place it into the final array. Decoding GRIB data should be multi-threaded.
- Benchmark! See recent discussion on "Large Scale Geospatial Benchmarks" on the Pangeo forum.
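To make the `.idx` steps above concrete, here is a minimal Python sketch (illustrative only, not `hypergrib`'s implementation). It assumes the `.idx` file sits alongside the GRIB object with a `.idx` suffix, that the `.idx` follows the usual wgrib-style `message_number:start_byte:d=<datetime>:<variable>:<level>:...` layout, and that the caller supplies already-formatted member and step strings; `fetch()` is a hypothetical helper that downloads an object as text.

```python
from dataclasses import dataclass


def gefs_paths(init_date: str, init_hour: str, member: str, step: str) -> tuple[str, str]:
    """Return (grib_path, idx_path) for one (init date, init hour, member, step) tuple."""
    grib = (
        f"noaa-gefs-pds/gefs.{init_date}/{init_hour}/pgrb2b/"
        f"gep{member}.t{init_hour}z.pgrb2af{step}"
    )
    # Assumption: the .idx file sits alongside the GRIB, with a ".idx" suffix.
    return grib, grib + ".idx"


@dataclass
class MessageByteRange:
    abbrev: str        # e.g. "TMP"
    level: str         # e.g. "2 m above ground"
    start: int         # offset of the first byte of this GRIB message
    end: int | None    # None for the final message: read to the end of the file


def parse_idx(idx_text: str) -> list[MessageByteRange]:
    """Decode wgrib-style .idx rows: 'msg_num:start_byte:d=<datetime>:<var>:<level>:...'."""
    rows = [line.split(":") for line in idx_text.strip().splitlines()]
    ranges = []
    for i, row in enumerate(rows):
        start = int(row[1])
        # Each message ends where the next one starts; the last message runs to EOF.
        end = int(rows[i + 1][1]) if i + 1 < len(rows) else None
        ranges.append(MessageByteRange(abbrev=row[3], level=row[4], start=start, end=end))
    return ranges


# Usage sketch: keep only the messages the user asked for, then issue ranged GETs
# for just those byte ranges of the GRIB file. `fetch()` is a hypothetical helper.
grib_path, idx_path = gefs_paths("20240101", "00", "01", "006")
wanted = {("TMP", "2 m above ground")}
# byte_ranges = [r for r in parse_idx(fetch(idx_path)) if (r.abbrev, r.level) in wanted]
```

Note that the `.idx` only needs to record start offsets because each message's byte range ends where the next message begins.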
Allow the user to set a threshold for when to load `.idx` files. If the user requests more than THRESHOLD% of the GRIB messages in any GRIB file then skip the `.idx` and just load the GRIB. Otherwise, attempt to load the `.idx`. (The motivation being that, if the user wants to read most of the GRIB file, then loading the `.idx` first will add unnecessary latency.) Set the threshold to 100% to always try to load the `.idx` file before the GRIB. Set the threshold to 0% to never load the `.idx`, and always load the GRIB file first.
See #17
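A minimal sketch of that threshold decision (the function and parameter names are illustrative, not `hypergrib`'s API):

```python
def should_fetch_idx(n_requested: int, n_messages_in_file: int, threshold_pct: float) -> bool:
    """Fetch the .idx first only if the user wants no more than threshold_pct% of the
    GRIB messages in this file; otherwise skip straight to reading the GRIB.

    threshold_pct=100 -> always try the .idx first; threshold_pct=0 -> never fetch the .idx.
    """
    return 100 * n_requested / n_messages_in_file <= threshold_pct
```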
- Keep a few hundred network requests in flight at any given moment (user configurable). (Why? Because the AnyBlob paper suggests that's what's required to achieve maximum throughput.)
- Consolidate nearby byte-range requests (user configurable) to minimise overhead and reduce the total number of IO operations (see the consolidation sketch at the end of this section).
For example, some GRIBs are compressed in JPEG2000, and JPEG2000 allows parts of the image to be decompressed. (But would we still have to download the entire GRIB message, with the only benefit being a small reduction in decompression time?) And maybe we could decompress each GRIB file and save the state of the decompressor every, say, 4 kB. Then, at query time, if we want a single pixel we'd have to stream at most 4 kB of data from disk. Although that has its own issues.
But, to get a real speed-up, we'd want to read only a subset of each GRIB message. And GRIB data is probably stored as a sequence of horizontal scan lines. So, for example, if you wanted to read a single consecutive byte range for just the United Kingdom from a global NWP, the best you could do might be to read all the horizontal scan lines that include the UK. You'd end up with a short but very wide slice that includes far more than just the UK. But that's still a significant speedup, so it might be worth pursuing.
See Issue #30.
See Issue #19.
So, for example, users can get acceptable IO performance even if they ask for "churro-shaped" data arrays.
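Here is a minimal sketch of the byte-range consolidation idea mentioned above (illustrative only; the 4 kB gap tolerance and the names are assumptions, not `hypergrib`'s API):

```python
def consolidate(ranges: list[tuple[int, int]], max_gap: int = 4096) -> list[tuple[int, int]]:
    """Merge (start, end) byte ranges separated by gaps of at most max_gap bytes.

    Fewer, larger GET requests reduce per-request overhead; the wasted bytes fetched
    in the gaps are bounded by max_gap per merge.
    """
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# e.g. consolidate([(0, 100), (150, 300), (10_000, 10_500)])
# -> [(0, 300), (10_000, 10_500)]
```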
- For small GRIB files, just read the entirety of each GRIB file?
- Store `.idx` files locally?
- Convert `.idx` files to a more concise and cloud-friendly file format, which is published in a cloud bucket?
- Put all the `.idx` data into a cloud-side database?
- Put all the `.idx` data into a local database? DuckDB?
- We probably want to avoid using a manifest file, or putting metadata for every GRIB message into a database, because we want to scale to datasets with trillions of GRIB messages. See #14
Hand-written YAML per dataset:

`datasets/gefs/index.yaml`: See https://github.com/JackKelly/explore_nwps/blob/main/metadata/gefs/index.yaml
Output by Python scripts:

`datasets/gefs/parameters.yaml`:

```yaml
- desc: Temperature
  abbr: TMP
  unit: K
  max: 373
  min: 173
  data_type: <f8
  compression: jpeg2000
```

`datasets/gefs/spatial_coords/v12_0.5_degree.yaml`:

```yaml
- x:
    coord_labels: [0.0, 0.1, 0.2]
    unit: degrees
- y:
    coord_labels: [0.0, 0.2, 0.4]
    unit: degrees
- z:
    coord_labels: [5, 10, 20]
    unit: mb
```