Merge validation changes to parallel branch #63

Merged · 8 commits · Mar 6, 2025
Changes from all commits
95 changes: 95 additions & 0 deletions docs/source/core/ceda_staff.rst
@@ -0,0 +1,95 @@
============================
Extra Details for CEDA Staff
============================

**Last Updated: 4th March 2025**

The following content is documented specifically to help CEDA staff users, and involves integration with other packages.

CCI: Fill group using Moles ESGF Tag results
============================================

For CCI projects, it can be faster and easier to initialise an empty group and then fill it via the ``esgf_drs.json`` file created using the ``cci-tag-scanner`` package. The process for doing this is documented here; see the ``cci-tag-scanner`` `repo <https://github.com/cedadev/cci-tag-scanner>`_ for instructions on how to run the moles tagging script.

1. Create an empty group
------------------------

This can be done interactively or using the console.

.. code::

$ padocc new -G my_group

Or interactively...

.. code:: python

>>> from padocc import GroupOperation

>>> # Access your working directory from the external environment - if already defined
>>> import os
>>> workdir = os.environ.get("WORKDIR")

>>> my_group = GroupOperation('my_group', workdir)

2. Add new projects using the ``moles_esgf.json`` contents
----------------------------------------------------------

The content can be provided either directly or as a filepath, but in either case this must be done interactively.

.. code:: python

>>> my_group.add_project('moles_esgf.json', moles_tags=True)
INFO [group-operation]: Rejected UNKNOWN_DRS - /neodc/esacci/fire/data/burned_area/Sentinel3_SYN/pixel/v1.1 - not all files are friendly.
INFO [group-operation]: Rejected esacci.fire.mon.l3s.ba.multi-sensor.multi-platform.syn.v1-1.pixel - not all files are friendly.
DEBUG [group-operation]: Creating file "main.txt"
DEBUG [group-operation]: Creating operator for project esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid
DEBUG [group-operation]: Constructing the config file for esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid
DEBUG [group-operation]: Creating file "base-cfg.json"
DEBUG [group-operation]: Skipped setting value in detail-cfg.json
DEBUG [group-operation]: Creating file "allfiles.txt"
DEBUG [group-operation]: Skipped setting value in status_log.csv
DEBUG [group-operation]: Skipped setting value in kj1.1a.json
DEBUG [group-operation]: No 1.3.2 related file issues.
DEBUG [group-operation]: Skipped setting value in faultlist.csv
DEBUG [group-operation]: Skipped setting value in datasets.csv
>>> my_group[0]
DEBUG [group-operation]: Creating operator for project esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid
DEBUG [group-operation]: Creating file "status_log.csv"
DEBUG [group-operation]: content length: 10
esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid:
File count: 10
Group: my_group
Phase: init
Revision: kj1.1

In the above example, the ``UNKNOWN_DRS`` entry was rejected because no DRS was issued (normally meaning non-data files such as READMEs), as was the first DRS, which contained only a set of ``.tar.gz`` files that padocc cannot process. The third entry, with the DRS ``esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid``, was identified as valid and all subsequent files were created.

This project could then be identified as the 0th member of the group, with 10 files detected from the input source. In this way, many projects can be added to a group from a single moles tags file, and multiple groups can later be merged, adding further options for assembling groups.
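Under the hood, the "not all files are friendly" rejections come from an extension check applied to each fileset. The following is a minimal standalone sketch of that acceptance logic, with ``SOURCE_OPTS`` standing in for padocc's internal ``source_opts`` list:

```python
# Sketch of the acceptance check padocc applies to each moles_esgf.json
# entry: a project is accepted only if every file in its fileset ends
# with an approved extension.
SOURCE_OPTS = ['.nc']  # extensions accepted from a moles tags file

def accept_fileset(fileset):
    """Return True only if all files carry an accepted extension."""
    return all('.' + f.split('.')[-1] in SOURCE_OPTS for f in fileset)

tar_set = ['/neodc/esacci/fire/a.tar.gz', '/neodc/esacci/fire/b.tar.gz']
nc_set = ['/neodc/esacci/fire/a.nc', '/neodc/esacci/fire/b.nc']
print(accept_fileset(tar_set))  # False - rejected, "not all files are friendly"
print(accept_fileset(nc_set))   # True
```

Note that compound extensions such as ``.tar.gz`` reduce to their final component (``.gz``), which is why archive sets are rejected.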

CCI: Alter dataset attributes using CCI Tagger JSONs
====================================================

In many cases, the CCI tagger JSON files contain expected default attribute values for the different datasets. Padocc now implements a per-project ``apply_defaults`` method, which can be used to reassign values in the cloud dataset.

In contrast to the first CCI-specific case, this operation must be performed via the interactive shell; it requires opening the JSON file and passing the relevant content:

.. code:: python

>>> import json
>>> with open('fire_syn_v1.1_input.json') as f:
... refs = json.load(f)

>>> defaults = refs['defaults']
>>> p = my_group[0]
>>> p.apply_defaults(defaults)

This will apply the default attributes to the 'dataset' filehandler, which is determined by the ``cloud_format``. If you wish to apply these attributes to a specific product instead, use the ``target`` kwarg to specify e.g. ``kfile`` or ``zstore``. This function can also be used to remove specific values, which is especially useful when using the defaults to correct a naming issue.

.. code:: python

>>> # Quick example of extracting the current value of any property from the main dataset.
>>> defaults = {'PRODUCT_VERSION': p.dataset_attributes['product_version']}
>>> p.apply_defaults(defaults, remove=['product_version'])

This effectively renames the ``product_version`` attribute to ``PRODUCT_VERSION``. Operations performed via ``apply_defaults`` also automatically update the base ``CFA`` dataset alongside the target dataset.
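The rename pattern above can be illustrated with plain dictionaries, independent of padocc; this is a sketch only, as real metadata passes through the project's filehandlers:

```python
# Plain-dict illustration of the rename above: copy the value under the
# new key via the defaults, then drop the old key via the remove list.
meta = {'product_version': 'v1.1', 'title': 'Fire CCI'}

defaults = {'PRODUCT_VERSION': meta['product_version']}
meta.update(defaults)              # what applying the defaults does
for key in ['product_version']:    # what remove=['product_version'] does
    meta.pop(key)

print(meta)  # {'title': 'Fire CCI', 'PRODUCT_VERSION': 'v1.1'}
```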
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -12,7 +12,7 @@ The ``padocc`` tool makes it easy to generate data-aggregated access patterns in

Vast amounts of archival data in a variety of formats can be processed using the package's group mechanics and automatic deployment to a job submission system.

**Latest Release: v1.3 05/02/2025**: This release now adds a huge number of additional features to both projects and groups (see the CLI and Interactive sections in this documentation for details). Several alpha-stage features are still untested or not well documented, please report any issues to the `github repo <https://github.com/cedadev/padocc>`_.

Formats that can be generated
-----------------------------
@@ -42,6 +42,7 @@ The ingestion/cataloging phase is not currently implemented for public use but m
Command Line Tool <core/cli>
Interactive Notebook/Shell <core/interactive>
Extra Details <core/extras>
Extras for CEDA Staff <core/ceda_staff>
Complex (Parallel) Operation <core/complex_operation>

.. toctree::
25 changes: 25 additions & 0 deletions padocc/core/mixins/dataset.py
@@ -194,6 +194,31 @@ def update_attribute(
# Also update the CFA dataset.
self.cfa_dataset.set_meta(meta)

def remove_attribute(
    self,
    attribute: str,
    target: str = 'dataset',
) -> None:
    """
    Remove an attribute within a dataset representation's metadata.

    :param attribute: (str) The name of an attribute within the metadata
        property of the corresponding filehandler.

    :param target: (str) The target product filehandler; uses the
        generic dataset filehandler if not otherwise specified.
    """

    if hasattr(self, target):
        meta = getattr(self, target).get_meta()
        meta.pop(attribute)
        getattr(self, target).set_meta(meta)

        if target != 'cfa_dataset' and self.cloud_format != 'cfa':
            # Also update the CFA dataset.
            self.cfa_dataset.set_meta(meta)

@property
def dataset_attributes(self) -> dict:
"""
20 changes: 19 additions & 1 deletion padocc/core/mixins/properties.py
@@ -295,4 +295,22 @@ def _get_stac_representation(self, stac_mapping: dict) -> dict:
# Apply default value if source not found
if record[k] is None:
record[k] = v[-1]
return record

def apply_defaults(
    self,
    defaults: dict,
    target: str = 'dataset',
    remove: Union[list, None] = None
):
    """
    Apply a default selection of attributes to a dataset.
    """

    for attr, val in defaults.items():
        self.update_attribute(attr, val, target=target)

    # ``remove`` defaults to None, so guard before iterating.
    for attr in (remove or []):
        self.remove_attribute(attr, target=target)

    self.save_files()
18 changes: 18 additions & 0 deletions padocc/core/mixins/status.py
Original file line number Diff line number Diff line change
@@ -4,6 +4,7 @@

from typing import Callable

from padocc.core.filehandlers import JSONFileHandler

class StatusMixin:
"""
@@ -121,3 +122,20 @@ def _rerun_command(self):
Setup for running this specific component interactively.
"""
return f'padocc <operation> -G {self.groupID} -p {self.proj_code} -vv'

def get_report(self) -> dict:
    """
    Get the validation report, if present, for this project.
    """

    full_report = {'data': None, 'metadata': None}

    meta_fh = JSONFileHandler(self.dir, 'metadata_report', logger=self.logger, **self.fh_kwargs)
    data_fh = JSONFileHandler(self.dir, 'data_report', logger=self.logger, **self.fh_kwargs)

    if meta_fh.file_exists():
        full_report['metadata'] = meta_fh.get()
    if data_fh.file_exists():
        full_report['data'] = data_fh.get()

    return full_report
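As an aside, the report-gathering pattern above can be sketched standalone with plain files; the ``metadata_report.json`` and ``data_report.json`` filenames here are assumptions about how ``JSONFileHandler`` resolves its files on disk:

```python
# Standalone sketch of the get_report logic: collect optional JSON
# reports into a single dict, leaving None where a report is absent.
import json
import os
import tempfile

def get_report(project_dir):
    full_report = {'data': None, 'metadata': None}
    for key, fname in [('metadata', 'metadata_report.json'),
                       ('data', 'data_report.json')]:
        path = os.path.join(project_dir, fname)
        if os.path.isfile(path):
            with open(path) as f:
                full_report[key] = json.load(f)
    return full_report

# Usage: a throwaway project directory containing only a data report
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'data_report.json'), 'w') as f:
        json.dump({'variables': 'ok'}, f)
    report = get_report(d)

print(report)  # {'data': {'variables': 'ok'}, 'metadata': None}
```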
22 changes: 12 additions & 10 deletions padocc/core/project.py
@@ -418,13 +418,11 @@ def _configure_filelist(self) -> None:
'"pattern" attribute missing from base config.'
)

if pattern.endswith('.txt'):
content = extract_file(pattern)
if 'substitutions' in self.base_cfg:
content, status = apply_substitutions('datasets', subs=self.base_cfg['substitutions'], content=content)
if status:
self.logger.warning(status)
self.allfiles.set(content)
if isinstance(pattern, list):
# New feature to handle the moles-format data.
fileset = pattern
elif pattern.endswith('.txt'):
fileset = extract_file(pattern)
else:
# Pattern is a wildcard set of files
if 'latest' in pattern:
@@ -434,12 +432,16 @@
fileset = sorted(glob.glob(pattern, recursive=True))
if len(fileset) == 0:
raise ValueError(f'pattern {pattern} returned no files.')

self.allfiles.set(sorted(glob.glob(pattern, recursive=True)))

if 'substitutions' in self.base_cfg:
fileset, status = apply_substitutions('datasets', subs=self.base_cfg['substitutions'], content=fileset)
if status:
self.logger.warning(status)
self.allfiles.set(fileset)

def _setup_config(
self,
pattern : Union[str,None] = None,
pattern : Union[str,list,None] = None,
updates : Union[str,None] = None,
removals : Union[str,None] = None,
substitutions: Union[dict,None] = None,
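The dispatch introduced in ``_configure_filelist`` above can be summarised in a standalone sketch; ``resolve_fileset`` is a hypothetical name, and ``extract_file`` plus the substitutions step are simplified away:

```python
# Sketch of the updated pattern dispatch: a pattern may now be a
# ready-made list of files (moles-format input), a .txt file listing
# paths, or a glob wildcard.
import glob

def resolve_fileset(pattern):
    if isinstance(pattern, list):
        return pattern                      # new: moles-format list input
    if pattern.endswith('.txt'):
        with open(pattern) as f:            # stand-in for extract_file()
            return [line.strip() for line in f if line.strip()]
    fileset = sorted(glob.glob(pattern, recursive=True))
    if not fileset:
        raise ValueError(f'pattern {pattern} returned no files.')
    return fileset

print(resolve_fileset(['/data/a.nc', '/data/b.nc']))  # passed through as-is
```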
6 changes: 6 additions & 0 deletions padocc/core/utils.py
@@ -28,6 +28,12 @@
'catalog'
]

# Which file extensions are acceptable to pull from a Moles Tags file.
source_opts = [
'.nc'
]

# Which operations are parallelisable.
parallel_modes = [
'scan',
'compute',
58 changes: 49 additions & 9 deletions padocc/groups/mixins/modifiers.py
@@ -2,9 +2,12 @@
__contact__ = "[email protected]"
__copyright__ = "Copyright 2024 United Kingdom Research and Innovation"

from typing import Callable
import json

from typing import Callable, Union

from padocc import ProjectOperation
from padocc.core.utils import BASE_CFG, source_opts


class ModifiersMixin:
@@ -38,18 +41,55 @@ def help(cls, func: Callable = print):

def add_project(
self,
config: dict,
):
config: Union[str,dict],
moles_tags: bool = False,
):
"""
Add a project to this group.

:param config: (str | dict) The configuration details for the new project. This can be
either a path to a JSON file or JSON content directly, and either a properly formatted
base config file (requires ``proj_code``, ``pattern``, etc.) or a moles_esgf input file.

:param moles_tags: (bool) Option for CEDA staff to integrate output from another package.
"""
if config['proj_code'] in self.proj_codes['main']:
raise ValueError(
f'proj_code {config["proj_code"]} already exists for this group.'
)

self._init_project(config)
self.proj_codes['main'].append(config['proj_code'])
if isinstance(config, str):
if config.endswith('.json'):
with open(config) as f:
config = json.load(f)
else:
config = json.loads(config)

configs = []
if moles_tags:
for key, fileset in config.items():
conf = dict(BASE_CFG)
conf['proj_code'] = key
conf['pattern'] = fileset

accept = True
for f in fileset:
if str('.' + f.split('.')[-1]) not in source_opts:
accept = False

if accept:
configs.append(conf)
else:
self.logger.info(f'Rejected {key} - not all files are friendly.')
else:
configs.append(config)

for config in configs:

if config['proj_code'] in self.proj_codes['main']:
self.logger.warning(
f'proj_code {config["proj_code"]} already exists for this group - skipping'
)
continue

self._init_project(config)
self.proj_codes['main'].append(config['proj_code'])
self.save_files()

def remove_project(self, proj_code: str, ask: bool = True) -> None:
2 changes: 1 addition & 1 deletion padocc/phases/compute.py
@@ -352,7 +352,7 @@ def _run_cfa(

cfa.create()
if file_limit is None:
cfa.write(instance.cfa_path)
cfa.write(instance.cfa_path + '.nca')

return {
'aggregated_dims': make_tuple(cfa.agg_dims),