Merge validation changes to parallel branch #63

Merged · 8 commits · Mar 6, 2025
Changes from all commits
95 changes: 95 additions & 0 deletions docs/source/core/ceda_staff.rst
@@ -0,0 +1,95 @@
============================
Extra Details for CEDA Staff
============================

**Last Updated: 4th March 2025**

The following content is documented specifically to help CEDA staff users, and involves integration with other packages.

CCI: Fill group using Moles ESGF Tag results
============================================

For CCI projects, it can be faster and easier to initialise an empty group and then fill it via the ``esgf_drs.json`` file created using the ``cci-tag-scanner`` package. The process for doing this is documented here; see the ``cci-tag-scanner`` `repo <https://github.com/cedadev/cci-tag-scanner>`_ for instructions on how to run the moles tagging script.

1. Create an empty group
------------------------

This can be done interactively or using the console.

.. code::

$ padocc new -G my_group

Or interactively...

.. code:: python

>>> from padocc import GroupOperation

>>> # Access your working directory from the external environment - if already defined
>>> import os
>>> workdir = os.environ.get("WORKDIR")

>>> my_group = GroupOperation('my_group', workdir)

2. Add new projects using the ``moles_esgf.json`` contents
----------------------------------------------------------

The content can be provided either directly or as a filepath, but in either case this must be done interactively.

.. code:: python

>>> my_group.add_project('moles_esgf.json', moles_tags=True)
INFO [group-operation]: Rejected UNKNOWN_DRS - /neodc/esacci/fire/data/burned_area/Sentinel3_SYN/pixel/v1.1 - not all files are friendly.
INFO [group-operation]: Rejected esacci.fire.mon.l3s.ba.multi-sensor.multi-platform.syn.v1-1.pixel - not all files are friendly.
DEBUG [group-operation]: Creating file "main.txt"
DEBUG [group-operation]: Creating operator for project esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid
DEBUG [group-operation]: Constructing the config file for esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid
DEBUG [group-operation]: Creating file "base-cfg.json"
DEBUG [group-operation]: Skipped setting value in detail-cfg.json
DEBUG [group-operation]: Creating file "allfiles.txt"
DEBUG [group-operation]: Skipped setting value in status_log.csv
DEBUG [group-operation]: Skipped setting value in kj1.1a.json
DEBUG [group-operation]: No 1.3.2 related file issues.
DEBUG [group-operation]: Skipped setting value in faultlist.csv
DEBUG [group-operation]: Skipped setting value in datasets.csv
>>> my_group[0]
DEBUG [group-operation]: Creating operator for project esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid
DEBUG [group-operation]: Creating file "status_log.csv"
DEBUG [group-operation]: content length: 10
esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid:
File count: 10
Group: my_group
Phase: init
Revision: kj1.1

In the above example, the ``UNKNOWN_DRS`` entry was rejected because no DRS was issued (normally meaning non-data files such as READMEs), as was the first DRS, which contained only a set of ``.tar.gz`` files that padocc cannot process. The third entry, with the DRS ``esacci.fire.mon.l4.ba.multi-sensor.multi-platform.syn.v1-1.grid``, was identified as valid and all subsequent files were created.

This project could then be identified as the 0th member of the group, with 10 files detected from the input source. In this way, many projects can be added to a group from a single moles tags file, and multiple groups can later be merged, adding further options for assembling groups.
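Under the hood, the "not all files are friendly" rejections come from an extension check applied to each fileset. The following is a minimal standalone sketch of that acceptance logic, with ``SOURCE_OPTS`` standing in for padocc's internal ``source_opts`` list:

```python
# Sketch of the acceptance check padocc applies to each moles_esgf.json
# entry: a project is accepted only if every file in its fileset ends
# with an approved extension.
SOURCE_OPTS = ['.nc']  # extensions accepted from a moles tags file

def accept_fileset(fileset):
    """Return True only if all files carry an accepted extension."""
    return all('.' + f.split('.')[-1] in SOURCE_OPTS for f in fileset)

tar_set = ['/neodc/esacci/fire/a.tar.gz', '/neodc/esacci/fire/b.tar.gz']
nc_set = ['/neodc/esacci/fire/a.nc', '/neodc/esacci/fire/b.nc']
print(accept_fileset(tar_set))  # False - rejected, "not all files are friendly"
print(accept_fileset(nc_set))   # True
```

Note that compound extensions such as ``.tar.gz`` reduce to their final component (``.gz``), which is why archive sets are rejected.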

CCI: Alter dataset attributes using CCI Tagger JSONs
====================================================

In many cases, the CCI tagger JSON files contain expected default attribute values for the different datasets. Padocc now implements a per-project ``apply_defaults`` method, which can be used to reassign values in the cloud dataset.

In contrast to the first CCI-specific case, this operation must be performed via the interactive shell; it requires opening the JSON file and passing the relevant content:

.. code:: python

>>> import json
>>> with open('fire_syn_v1.1_input.json') as f:
... refs = json.load(f)

>>> defaults = refs['defaults']
>>> p = my_group[0]
>>> p.apply_defaults(defaults)

This will apply the default attributes to the 'dataset' filehandler, which is determined by the ``cloud_format``. If you wish to apply these attributes to a specific product instead, use the ``target`` kwarg to specify e.g. ``kfile`` or ``zstore``. This function can also be used to remove specific values, which is especially useful when using the defaults to correct a naming issue.

.. code:: python

>>> # Quick example of extracting the current value of any property from the main dataset.
>>> defaults = {'PRODUCT_VERSION': p.dataset_attributes['product_version']}
>>> p.apply_defaults(defaults, remove=['product_version'])

This effectively renames the ``product_version`` attribute to ``PRODUCT_VERSION``. Operations performed via ``apply_defaults`` also automatically update the base ``CFA`` dataset alongside the target dataset.
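The rename pattern above can be illustrated with plain dictionaries, independent of padocc; this is a sketch only, as real metadata passes through the project's filehandlers:

```python
# Plain-dict illustration of the rename above: copy the value under the
# new key via the defaults, then drop the old key via the remove list.
meta = {'product_version': 'v1.1', 'title': 'Fire CCI'}

defaults = {'PRODUCT_VERSION': meta['product_version']}
meta.update(defaults)              # what applying the defaults does
for key in ['product_version']:    # what remove=['product_version'] does
    meta.pop(key)

print(meta)  # {'title': 'Fire CCI', 'PRODUCT_VERSION': 'v1.1'}
```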
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -12,7 +12,7 @@ The ``padocc`` tool makes it easy to generate data-aggregated access patterns in

Vast amounts of archival data in a variety of formats can be processed using the package's group mechanics and automatic deployment to a job submission system.

**Latest Release: v1.3 05/02/2025**: This release now adds a huge number of additional features to both projects and groups (see the CLI and Interactive sections in this documentation for details). Several alpha-stage features are still untested or not well documented, please report any issues to the `github repo <https://github.com/cedadev/padocc>`_.

Formats that can be generated
-----------------------------
@@ -42,6 +42,7 @@ The ingestion/cataloging phase is not currently implemented for public use but m
Command Line Tool <core/cli>
Interactive Notebook/Shell <core/interactive>
Extra Details <core/extras>
Extras for CEDA Staff <core/ceda_staff>
Complex (Parallel) Operation <core/complex_operation>

.. toctree::
25 changes: 25 additions & 0 deletions padocc/core/mixins/dataset.py
@@ -194,6 +194,31 @@ def update_attribute(
# Also update the CFA dataset.
self.cfa_dataset.set_meta(meta)

def remove_attribute(
    self,
    attribute: str,
    target: str = 'dataset',
) -> None:
    """
    Remove an attribute within a dataset representation's metadata.

    :param attribute: (str) The name of an attribute within the metadata
        property of the corresponding filehandler.

    :param target: (str) The target product filehandler; uses the
        generic dataset filehandler if not otherwise specified.
    """

    if hasattr(self, target):
        meta = getattr(self, target).get_meta()
        meta.pop(attribute)
        getattr(self, target).set_meta(meta)

        if target != 'cfa_dataset' and self.cloud_format != 'cfa':
            # Also update the CFA dataset.
            self.cfa_dataset.set_meta(meta)

@property
def dataset_attributes(self) -> dict:
"""
20 changes: 19 additions & 1 deletion padocc/core/mixins/properties.py
@@ -295,4 +295,22 @@ def _get_stac_representation(self, stac_mapping: dict) -> dict:
# Apply default value if source not found
if record[k] is None:
record[k] = v[-1]
return record

def apply_defaults(
    self,
    defaults: dict,
    target: str = 'dataset',
    remove: Union[list, None] = None
):
    """
    Apply a default selection of attributes to a dataset.
    """

    for attr, val in defaults.items():
        self.update_attribute(attr, val, target=target)

    # ``remove`` defaults to None, so guard before iterating.
    for attr in (remove or []):
        self.remove_attribute(attr, target=target)

    self.save_files()
18 changes: 18 additions & 0 deletions padocc/core/mixins/status.py
Original file line number Diff line number Diff line change
@@ -4,6 +4,7 @@

from typing import Callable

from padocc.core.filehandlers import JSONFileHandler

class StatusMixin:
"""
@@ -121,3 +122,20 @@ def _rerun_command(self):
Setup for running this specific component interactively.
"""
return f'padocc <operation> -G {self.groupID} -p {self.proj_code} -vv'

def get_report(self) -> dict:
    """
    Get the validation report, if present, for this project.
    """

    full_report = {'data': None, 'metadata': None}

    meta_fh = JSONFileHandler(self.dir, 'metadata_report', logger=self.logger, **self.fh_kwargs)
    data_fh = JSONFileHandler(self.dir, 'data_report', logger=self.logger, **self.fh_kwargs)

    if meta_fh.file_exists():
        full_report['metadata'] = meta_fh.get()
    if data_fh.file_exists():
        full_report['data'] = data_fh.get()

    return full_report
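As an aside, the report-gathering pattern above can be sketched standalone with plain files; the ``metadata_report.json`` and ``data_report.json`` filenames here are assumptions about how ``JSONFileHandler`` resolves its files on disk:

```python
# Standalone sketch of the get_report logic: collect optional JSON
# reports into a single dict, leaving None where a report is absent.
import json
import os
import tempfile

def get_report(project_dir):
    full_report = {'data': None, 'metadata': None}
    for key, fname in [('metadata', 'metadata_report.json'),
                       ('data', 'data_report.json')]:
        path = os.path.join(project_dir, fname)
        if os.path.isfile(path):
            with open(path) as f:
                full_report[key] = json.load(f)
    return full_report

# Usage: a throwaway project directory containing only a data report
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'data_report.json'), 'w') as f:
        json.dump({'variables': 'ok'}, f)
    report = get_report(d)

print(report)  # {'data': {'variables': 'ok'}, 'metadata': None}
```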
22 changes: 12 additions & 10 deletions padocc/core/project.py
@@ -418,13 +418,11 @@ def _configure_filelist(self) -> None:
'"pattern" attribute missing from base config.'
)

if pattern.endswith('.txt'):
content = extract_file(pattern)
if 'substitutions' in self.base_cfg:
content, status = apply_substitutions('datasets', subs=self.base_cfg['substitutions'], content=content)
if status:
self.logger.warning(status)
self.allfiles.set(content)
if isinstance(pattern, list):
# New feature to handle the moles-format data.
fileset = pattern
elif pattern.endswith('.txt'):
fileset = extract_file(pattern)
else:
# Pattern is a wildcard set of files
if 'latest' in pattern:
@@ -434,12 +432,16 @@
fileset = sorted(glob.glob(pattern, recursive=True))
if len(fileset) == 0:
raise ValueError(f'pattern {pattern} returned no files.')

self.allfiles.set(sorted(glob.glob(pattern, recursive=True)))

if 'substitutions' in self.base_cfg:
fileset, status = apply_substitutions('datasets', subs=self.base_cfg['substitutions'], content=fileset)
if status:
self.logger.warning(status)
self.allfiles.set(fileset)

def _setup_config(
self,
pattern : Union[str,None] = None,
pattern : Union[str,list,None] = None,
updates : Union[str,None] = None,
removals : Union[str,None] = None,
substitutions: Union[dict,None] = None,
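The dispatch introduced in ``_configure_filelist`` above can be summarised in a standalone sketch; ``resolve_fileset`` is a hypothetical name, and ``extract_file`` plus the substitutions step are simplified away:

```python
# Sketch of the updated pattern dispatch: a pattern may now be a
# ready-made list of files (moles-format input), a .txt file listing
# paths, or a glob wildcard.
import glob

def resolve_fileset(pattern):
    if isinstance(pattern, list):
        return pattern                      # new: moles-format list input
    if pattern.endswith('.txt'):
        with open(pattern) as f:            # stand-in for extract_file()
            return [line.strip() for line in f if line.strip()]
    fileset = sorted(glob.glob(pattern, recursive=True))
    if not fileset:
        raise ValueError(f'pattern {pattern} returned no files.')
    return fileset

print(resolve_fileset(['/data/a.nc', '/data/b.nc']))  # passed through as-is
```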
6 changes: 6 additions & 0 deletions padocc/core/utils.py
@@ -28,6 +28,12 @@
'catalog'
]

# Which file extensions are acceptable to pull from a Moles Tags file.
source_opts = [
'.nc'
]

# Which operations are parallelisable.
parallel_modes = [
'scan',
'compute',
58 changes: 49 additions & 9 deletions padocc/groups/mixins/modifiers.py
@@ -2,9 +2,12 @@
__contact__ = "[email protected]"
__copyright__ = "Copyright 2024 United Kingdom Research and Innovation"

from typing import Callable
import json

from typing import Callable, Union

from padocc import ProjectOperation
from padocc.core.utils import BASE_CFG, source_opts


class ModifiersMixin:
@@ -38,18 +41,55 @@ def help(cls, func: Callable = print):

def add_project(
self,
config: dict,
):
config: Union[str,dict],
moles_tags: bool = False,
):
"""
Add a project to this group.

:param config: (str | dict) The configuration details for the new project. This can be
either a path to a JSON file or JSON content directly, and either a properly formatted
base config file (requires ``proj_code``, ``pattern``, etc.) or a moles_esgf input file.

:param moles_tags: (bool) Option for CEDA staff to integrate output from another package.
"""
if config['proj_code'] in self.proj_codes['main']:
raise ValueError(
f'proj_code {config["proj_code"]} already exists for this group.'
)

self._init_project(config)
self.proj_codes['main'].append(config['proj_code'])
if isinstance(config, str):
if config.endswith('.json'):
with open(config) as f:
config = json.load(f)
else:
config = json.loads(config)

configs = []
if moles_tags:
for key, fileset in config.items():
conf = dict(BASE_CFG)
conf['proj_code'] = key
conf['pattern'] = fileset

accept = True
for f in fileset:
if str('.' + f.split('.')[-1]) not in source_opts:
accept = False

if accept:
configs.append(conf)
else:
self.logger.info(f'Rejected {key} - not all files are friendly.')
else:
configs.append(config)

for config in configs:

if config['proj_code'] in self.proj_codes['main']:
self.logger.warning(
f'proj_code {config["proj_code"]} already exists for this group - skipping'
)
continue

self._init_project(config)
self.proj_codes['main'].append(config['proj_code'])
self.save_files()

def remove_project(self, proj_code: str, ask: bool = True) -> None:
2 changes: 1 addition & 1 deletion padocc/phases/compute.py
@@ -352,7 +352,7 @@ def _run_cfa(

cfa.create()
if file_limit is None:
cfa.write(instance.cfa_path)
cfa.write(instance.cfa_path + '.nca')

return {
'aggregated_dims': make_tuple(cfa.agg_dims),