Indexing of ASTER LP DAAC Datasets to datacube by ashoka1234 · Pull Request #213 · GeoscienceAustralia/digitalearthau

ashoka1234 · 2019-03-22T05:34:50Z

Reason for this PR

ASTER data consists of visible and near infrared (VNIR) frequencies
with three bands at 15-meter resolution, short-wave infrared (SWIR)
frequencies with six bands at 30-meter resolution, and thermal infrared (TIR)
wavelength with five bands at 90-meter resolution. ASTER SWIR data acquired since April 2008 are not usable, and show saturation of values and severe striping. Data CRS definitions are not available directly
from the datasets but can be computed from the available information from the datasets. This PR attempt to compute CRS definitions and index ASTER datasets.

Proposed Solution

Three products are derived from the datasets corresponding to VNIR, SWIR, and TIR sensors so that each derived product has the same resolution simplifying Geo Transform computations. The CRS definition is computed per original dataset while GeoTransform is computed per sensor, i.e. datacube product.
Each indexed dataset points to a corresponding VRT file which defines the bands corresponding to the product referring to the original .hdf file.
The workflow is similar to modis indexing scripts with additional commands for VRT file generation. Thus there are separate commands for deriving and adding a product definition, generation VRT file, Generating and indexing a dataset all corresponding to a particular hdf file and for a particular product. A seperate bash script is provided to index multiple files.
The VRT references to hdf files uses absulute path names but indexed datasets uses relative path names to the VRT files. This is due to a GDAL/rasterio library restriction.
The definitions of which bands correspond to each product is defined by a global constant PRODUCTS.
The additional metadata to include or exclude are defined by means of common prefixes in a global constant EXTRA_METADATA_PREFIXES.

Tests

A datacube is created with the name aster_lpdaac (Use psql -h agdcdev-db.nci.org.au -d aster_lpdaac) for this work and number of datasets are indexed corresponding to three products, aster_l1t_vnir, aster_l1t_swir, and aster_l1t_tir.
Testing scripts are written (See scripts_tests module) and basic form of VRT schema was derived and tests provided for VRT file content validation (in memory).

codecov · 2019-03-22T05:48:28Z

Codecov Report

Merging #213 into develop will decrease coverage by 0.07%.
The diff coverage is 33.33%.

@@             Coverage Diff             @@
##           develop     #213      +/-   ##
===========================================
- Coverage     66.8%   66.72%   -0.08%     
===========================================
  Files           42       42              
  Lines         3190     3189       -1     
===========================================
- Hits          2131     2128       -3     
- Misses        1059     1061       +2

Impacted Files	Coverage Δ
digitalearthau/testing/plugin.py	`80% <33.33%> (-7.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7425e06...f4a5dc7. Read the comment docs.

Kirill888 · 2019-03-24T23:09:50Z

@ashoka1234, @omad I have a few questions/comments:

Do we really want 3 separate products?
- What is the reason for grouping bands by resolution into separate products? I'm pretty sure that a single product with bands of different resolution would work just fine.
Why custom indexing step instead of just writing yaml file next to vrt file and using datacube dataset add instead?
Probably doesn't matter for a once off tool like this but still, I noticed you use file path as an API argument, as a result, same netcdf file is opened/closed many times. I think it would be more efficient but also clearer from the data flow perspective if those methods operated on some data loaded from file once. In this case tags and whatever other file metadata you require to generate VRT and YAML docs. This would also make writing tests easier as it is easier to mock a data structure than a whole netcdf file.

ashoka1234 · 2019-03-26T22:46:12Z

Thanks @Kirill888 for review comments. Answers to questions respectively:

The decision to have three separate products was driven by the limitations that I saw with datacube and VRT files regarding the specification of Geo Transform. The reference given for computation of CRS adjust bounds used for geo transform values by half of pixel resolution and this resolution is dependent on sensor. VRTRasterBand don't seem to have geo transform elements so had to specify geo tranbsform at VRTDataset level and did not see a way to specify number of VRTDatasets per VRT file. I also didn't see datacube specify geo transforms per measurement. If the geo transform is approximate enough without resolution dependent adjustment or VRT/datacube could specify geo transform per band then we could ideally use a single product.
Hm. for this one, I followed the style of modis indexing script but I agree with this style because it is simplistic but do you see any additional benefits of having yamls?
Yes you are right on this one and I did have this concern as well and, I had extracted data passed to those functions earlier ensuring once read as much as possible but i felt interface was messy and resorted to passing file paths. But I mitigated the multiple file read of band names by means of functools.lru_cache. Still some multiple file opens remains (I think multiple opens for subdataset is unavoidable! There is additional file open to extract metadata, but since this is done only once I thought that was OK. This is probably my opinion but I felt passing internal GDAL representations to functions was low level so I tried to avoid it. But I will have another look and see we could simplify this issue.

Kirill888 · 2019-03-26T23:26:55Z

@ashoka1234

VRT generation is a separate problem from product definition, product can reference different VRT files for different bands. You are right about dataset.transform property being confusing and suggesting that all bands should cover the same area at the same resolution. The reality is that we already have products that have bands with slightly different footprints and different resolutions. Dataset footprint is supposed to be a union of individual band footprints, not "THE footprint for every band in the Dataset".
Having files written out next to VRT would simplify indexing step. Users will want to index this product into their own custom databases. Having "the more standard" way of just calling datacube dataset add instead of learning how to run this custom tool would be preferable.

omad · 2019-03-27T04:01:13Z

I hadn't realised we had products with different resolution bands. Since we do, this is definitely what we should use for this ASTER products.
We had originally specified to not write out YAMLS. However, for this case, since we have to write out and manage VRTs, we should also write out YAML files.

ashoka1234 · 2019-03-27T06:56:47Z

Scanning about 100 ASTER dataset directories, the following are statistics:
Total datasets: 2505
vnir bands present: 1434
swir bands present: 682
tir bands present: 2505

It seems 'tir is present in all datasets checked. If this behaviour generalises to whole collection, then tir could potentially provides minimal measurements to be included in product definition. But, I personally feel product definition should provide maximal measurement info.

Kirill888 · 2019-03-27T22:44:42Z

@ashoka1234 that's another datacube limitation.

I think the most sane way is to have exactly the same set of bands across both product and dataset definitions.

Having extra bands in a dataset (those not defined in the product) is completely useless, as they won't be loadable by dc.load.
Listing all possible bands in a product definition and only including available bands in a given dataset, kinda makes sense, but is not supported by datacube. First of all indexing step will reject such dataset. Secondly there is probably a lot of code in datacube that assumes that any band name listed in a product definition will also appear in every dataset from this product.

I see 2 possible workaround here:

(this is my preferred choice) We generate VRT that replaces missing bands with whatever available construct in the VRT toolkit, then Dataset yaml section as far as bands go is straightforward and there is a consistent mapping from band name to band index within the VRT.
Yaml includes all bands but has a "broken path" for those bands that are missing. That way you can index and load data so long as you use ignore_missing option.

Advantage of option (1) is that it is simple to verify correctness of yaml as you don't need to keep track which netcdf variable ended up with which band index in the VRT.

Option (2) is basically a work around for datacube not having an option to indicate that a particular band is missing for this particular dataset.

ashoka1234 · 2019-03-27T23:06:01Z

@Kirill888
As per discussion we had, we will retain three products corresponding to three different sensors since these sensors can be on-off and depending on that, the corresponding bands are present in the dataset. I will also make sure band mappings are consistent.

ashoka1234 · 2019-03-28T00:10:40Z

As for consistency of band mappings the code currently rely on consistent ordering of subdatasets returned by the corresponding GDAL query. All the dataset files I checked seemed to have the same consistent order! But may be we should not rely on this GDAL feature.

omad

Thanks Ashoka, looks pretty good! I still haven't actually tested it in practice, have some comments and a PR first.

the code currently rely on consistent ordering of subdatasets returned by the corresponding GDAL query. All the dataset files I checked seemed to have the same consistent order! But may be we should not rely on this GDAL feature.

Is there any check done to ensure the ordering is consistent? We really don't want bands accidentally mapped wrongly, and need to error out instead.

scripts/index_aster_lpdaac.py

scripts_tests/test_aster.py

ashoka1234 · 2019-05-01T07:45:36Z

Yes for consistency I sort the bands defined in the global constant PRODUCT and assign integer value as a string in that sorted order.

Replace this review with a new review. All the requested changes are done.

Refactoring, cleanup and vrt generation click indexing, dataset doc, vrt generation, product def working - initial version indexing with vrt file works with absolute path names for bands

…DAL order of subdatasets for band name to band number mapping.

omad · 2019-05-02T03:57:28Z

Looking pretty good. I don't think the tests are being run though. I can see that test_index_modis_lpdaac is run, but nothing for aster.

Try moving the modis test script into scripts_tests/ and add that path at the end of check-code.sh when pytest is run.

ashoka1234 · 2019-05-02T04:13:05Z

OK sure will do! I thought it was unnecessary for scripts tests to run on Travis!

omad · 2019-05-02T04:32:07Z

NP. I think that if we've gone to the effort of writing tests, we might as well do a bit more and have them run all the time in case anything breaks. :)

ashoka1234 requested review from Kirill888, dunkgray and omad March 22, 2019 05:34

omad force-pushed the aster branch from ae1041b to 8c17eb4 Compare April 11, 2019 04:55

omad previously requested changes Apr 30, 2019

View reviewed changes

ashoka1234 requested a review from omad May 1, 2019 07:47

ashoka1234 and others added 13 commits May 2, 2019 10:24

Initial commit

c2c02c4

Refactoring, cleanup and vrt generation click indexing, dataset doc, vrt generation, product def working - initial version indexing with vrt file works with absolute path names for bands

added tests for product def, vrt generation, and dataset docs

e5d9058

Added a test for product addition and dataset indexing. Not rely on G…

7e5ee68

…DAL order of subdatasets for band name to band number mapping.

index-data click command now write out the corresponding yaml file.

632bae0

catch AssertionError fro selected_bands function for failure check

6e1c546

Start trying to shrink the aster test data

4568c47

Shrink ASTER test data

27d6b15

More robust path generation in aster tests

f0fc658

Install pytest plugins for dea config and db initialisation

9adbb80

Re-add shrunken aster test data

7efba27

Update ASTER Indexing documentation

3900045

Don't globally install pytest plugins (yet)

1c6dd68

Updates and refactoring as per Damien's review

aeaccfe

ashoka1234 force-pushed the aster branch from 55a67a6 to aeaccfe Compare May 2, 2019 00:26

ashoka1234 added 2 commits May 3, 2019 12:15

make travis run scripts_tests

6854ddd

try to install rasterio from git repo

f4a5dc7

ashoka1234 force-pushed the aster branch from 52294c1 to f4a5dc7 Compare May 3, 2019 02:16

Conversation

ashoka1234 commented Mar 22, 2019

Reason for this PR

Proposed Solution

Tests

Uh oh!

codecov bot commented Mar 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Kirill888 commented Mar 24, 2019

Uh oh!

ashoka1234 commented Mar 26, 2019

Uh oh!

Kirill888 commented Mar 26, 2019

Uh oh!

omad commented Mar 27, 2019

Uh oh!

ashoka1234 commented Mar 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kirill888 commented Mar 27, 2019

Uh oh!

ashoka1234 commented Mar 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ashoka1234 commented Mar 28, 2019

Uh oh!

omad left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ashoka1234 commented May 1, 2019

Uh oh!

omad commented May 2, 2019

Uh oh!

ashoka1234 commented May 2, 2019

Uh oh!

omad commented May 2, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Mar 22, 2019 •

edited

Loading

ashoka1234 commented Mar 27, 2019 •

edited

Loading

ashoka1234 commented Mar 27, 2019 •

edited

Loading

omad left a comment •

edited

Loading