HDF files subsetting incorrectly #394

@ocsmit

We found another regression in L2SSpy for OMSO2_004.
Here is the request URL and a look at the dimensions in the output file:

FILE: OMI-Aura_L2-OMSO2_2025m0616t1143-o111288_v004-2025m0616t170426.he5
6 DIMENSIONS (* means it has a scale variable)
  1644   HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/phony_dim_0
    60   HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/phony_dim_1
    72   HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/phony_dim_2
  1644   HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_3
    60   HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_4
     4   HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_5

The dimensions of the coordinate variables (in the Geolocation Fields group) do not match the dimensions of the science variables (in the Data Fields group), and the end result is that the science variables are not spatially subsetted.
Note the mismatched sizes of the dimensions of the coordinates and the ColumnAmountSO2 variable in the subset file:

FILE: OMI-Aura_L2-OMSO2_2004m1001t1631-o001142_v004-2025m0606t121645_subsetted.he5

4 DIMENSIONS (* means it has a scale variable)
  1496   HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/phony_dim_0
    60   HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/phony_dim_1
   556   HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_3
    52   HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_4

4 VARIABLES
Name: HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/ColumnAmountSO2 
Dims: HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/phony_dim_0 {1496}
      HDFEOS/SWATHS/OMI Total Column Amount SO2/Data Fields/phony_dim_1 {60}

Name: HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/Latitude 
Dims: HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_3 {556}
      HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_4 {52}

Name: HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/Longitude 
Dims: HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_3 {556}
      HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_4 {52}

Name: HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/Time 
Dims: HDFEOS/SWATHS/OMI Total Column Amount SO2/Geolocation Fields/phony_dim_3 {556}

I have found the cause to be a lack of dimension scales. Without them, the backend assigns dimensions as phony_dim_N, where N is simply incremented for each group as it is read, so even groups that have the same dimensions still get different phony_dim_N names. E.g. if there are sibling groups A and B, and A has two dimensions dim1 and dim2, and B has the same dimensions dim1 and dim2, group A will get the dimensions phony_dim_0 and phony_dim_1 while group B will get the dimensions phony_dim_2 and phony_dim_3.
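To make the naming concrete, here is a minimal Python sketch of the behaviour described above. It is only a model, not netcdf-c code: `assign_phony_dims` is a hypothetical helper, the per-variable shapes are made up for illustration, and only the dimension sizes come from the listing of the original granule.

```python
from itertools import count


def assign_phony_dims(groups):
    """Model the naming: one counter shared by all groups, and a new
    phony_dim_N for every distinct size first seen inside a group."""
    counter = count()
    names = {}
    for group, shapes in groups.items():  # groups are visited in read order
        by_size = {}  # size -> phony name, local to this group
        for shape in shapes:
            for size in shape:
                if size not in by_size:
                    by_size[size] = f"phony_dim_{next(counter)}"
        names[group] = by_size
    return names


# Hypothetical variable shapes; sizes taken from the dimension listing.
groups = {
    "Data Fields": [(1644, 60), (1644, 60, 72)],
    "Geolocation Fields": [(1644, 60), (1644,), (1644, 60, 4)],
}

result = assign_phony_dims(groups)
for group, dims in result.items():
    print(group, dims)
```

Under these assumptions the two groups end up with phony_dim_0 to phony_dim_2 and phony_dim_3 to phony_dim_5, even though 1644 and 60 mean the same thing in both.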

So l2ss finds the Latitude, Longitude, and Time variables correctly in the Geolocation Fields sibling group, but then sees that their associated dimensions are phony_dim_3 and phony_dim_4 (which correspond to nTimes and nXtrack).

When iterating over each group, it checks that the variables include these two dimensions. Latitude, Longitude, and Time do contain them, and thus they get subsetted appropriately.

However, for ColumnAmountSO2 we can see that its dimensions have been enumerated as phony_dim_0 and phony_dim_1 (these also correspond to nTimes and nXtrack).

Thus no subsetting actually occurs on this variable because the system only knows that it is trying to subset the dimensions phony_dim_3 and phony_dim_4 based on what is contained in Latitude, Longitude, and Time.
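The mismatch can be reproduced with plain dictionaries. The sketch below is a simplified model, not the actual l2ss code: `matched_by_name` and `equivalent_dims` are hypothetical helpers, the sizes come from the listing of the original granule, and the variable-to-dimension assignments follow the subset listing. `equivalent_dims` shows one possible workaround (matching dimensions across sibling groups by size), which only works when the sizes are unambiguous within each group.

```python
# Dimension sizes per group, from the listing of the original granule.
group_dims = {
    "Data Fields": {"phony_dim_0": 1644, "phony_dim_1": 60, "phony_dim_2": 72},
    "Geolocation Fields": {"phony_dim_3": 1644, "phony_dim_4": 60, "phony_dim_5": 4},
}

# Variable -> (group, dimension names), following the subset listing.
variables = {
    "ColumnAmountSO2": ("Data Fields", ("phony_dim_0", "phony_dim_1")),
    "Latitude": ("Geolocation Fields", ("phony_dim_3", "phony_dim_4")),
    "Longitude": ("Geolocation Fields", ("phony_dim_3", "phony_dim_4")),
    "Time": ("Geolocation Fields", ("phony_dim_3",)),
}

# Dimensions derived from the coordinate variables.
target = ("phony_dim_3", "phony_dim_4")


def matched_by_name(target, variables):
    """Variables that share at least one dimension name with the target."""
    wanted = set(target)
    return sorted(name for name, (_, dims) in variables.items() if wanted & set(dims))


def equivalent_dims(target, coord_group, group_dims):
    """Hypothetical workaround: per group, find the local dimension whose
    size equals each target dimension's size. Groups with missing or
    ambiguous size matches are skipped."""
    sizes = [group_dims[coord_group][d] for d in target]
    mapping = {}
    for group, dims in group_dims.items():
        by_size = {}
        for name, size in dims.items():
            by_size.setdefault(size, []).append(name)
        candidates = [by_size.get(size, []) for size in sizes]
        if all(len(c) == 1 for c in candidates):
            mapping[group] = tuple(c[0] for c in candidates)
    return mapping


print(matched_by_name(target, variables))
print(equivalent_dims(target, "Geolocation Fields", group_dims))
```

Matching by name leaves ColumnAmountSO2 out, while matching by size pairs phony_dim_0 and phony_dim_1 with phony_dim_3 and phony_dim_4. Size matching would be ambiguous for a group that has two dimensions of the same size, so it should be treated as a heuristic.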

This is an artifact of how netcdf-c interoperates with HDF5. See the following from the netCDF documentation:

Reading and Editing HDF5 Files with NetCDF-4
Assuming a HDF5 file is written in accordance with the netCDF-4 rules (i.e. no strange types, no looping groups), and assuming that every dataset has a dimension scale attached to each dimension, the netCDF-4 API can be used to read and edit the file, quite easily.

In HDF5 (version 1.8.0 and later), dimension scales are (generally) 1D datasets, that hold dimension data. A multi-dimensional dataset can then attach a dimension scale to any or all of its dimensions. For example, a user might have 1D dimension scales for lat and lon, and a 2D dataset which has lat attached to the first dimension, and lon to the second.

If dimension scales are not used, then netCDF-4 can still edit the file, and will invent anonymous dimensions for each variable shape. This is done by iterating through the space of each dataset. As each space size is encountered, a phony dimension of that size is checked for. If it does not exist, a new phony dimension is created for that size. In this way, a HDF5 file with datasets that are using shared dimensions can be seen properly in netCDF-4. (There is no shared dimension in HDF5, but data users will frequently write many datasets with the same shape, and intend these to be shared dimensions.)

Starting with version 4.7.3, if a dataset is encountered which uses the same size for two or more of its dataspace lengths, then a new phony dimension will be created for each. That is, a dataset with size [100][100] will result in two phony dimensions, each of size 100.
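As a rough illustration of the rule quoted above (not part of the netCDF documentation), the following sketch assigns phony dimensions inside a single group, including the repeated-size case. `phony_dims_for_group` is a hypothetical helper. How later datasets reuse the two same-sized dimensions is my reading of the text and has not been checked against the netcdf-c source.

```python
def phony_dims_for_group(shapes):
    """Assign phony dimensions to datasets in one group.

    A size that appears k times within a single shape needs k distinct
    phony dimensions of that size; existing ones are reused in order.
    """
    dims = []      # (name, size) in creation order
    per_var = []   # dimension names for each dataset
    for shape in shapes:
        used = {}  # size -> occurrences already seen in this shape
        names = []
        for size in shape:
            k = used.get(size, 0)
            same_size = [n for n, s in dims if s == size]
            if k < len(same_size):
                names.append(same_size[k])  # reuse an existing dimension
            else:
                name = f"phony_dim_{len(dims)}"
                dims.append((name, size))   # invent a new one
                names.append(name)
            used[size] = k + 1
        per_var.append(tuple(names))
    return dims, per_var


dims, per_var = phony_dims_for_group([(100, 100), (100,), (100, 50)])
print(dims)
print(per_var)
```

In this model the [100][100] dataset produces two phony dimensions of size 100, and the later datasets reuse the first of them.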

Status: closed
