
Lots more improvements + move sections.
pp-mo committed Jan 16, 2025
1 parent a51f251 commit 5e81543
Showing 11 changed files with 271 additions and 206 deletions.
10 changes: 5 additions & 5 deletions docs/change_log.rst
@@ -1,22 +1,22 @@
Versions and Change Notes
=========================

-Project Status
---------------
+Project Development Status
+--------------------------
We intend to follow `PEP 440 <https://peps.python.org/pep-0440/>`_,
or (older) `SemVer <https://semver.org/>`_ versioning principles.
This means the version string has the basic form **"major.minor.bugfix[special-types]"**.

-Current release version is at **"v0.1"**.
+Current release version is at **"v0.2"**.

-This is a first complete implementation,
-with functional operational of all public APIs.
+This is a complete implementation, with functional operation of all public APIs.
The code is however still experimental, and APIs are not stable
(hence no major version yet).


Change Notes
------------
Summary of key features by release number

Unreleased
^^^^^^^^^^
4 changes: 4 additions & 0 deletions docs/details/details_index.rst
@@ -1,8 +1,12 @@
Detail Topics
=============
Detail reference topics

.. toctree::
    :maxdepth: 2

    ../change_log
    ./known_issues
    ./interface_support
    ./threadlock_sharing
    ./developer_notes
14 changes: 12 additions & 2 deletions docs/details/interface_support.rst
@@ -35,6 +35,7 @@ array has the actual variable dtype, and the "scale_factor" and

The existence of a "_FillValue" attribute controls how.. TODO

.. _file-storage:

File storage control
^^^^^^^^^^^^^^^^^^^^
@@ -44,13 +45,21 @@ control the data compression and translation facilities of the NetCDF file
library.
If required, you should use :mod:`iris` or :mod:`xarray` for this.

File-specific storage aspects, such as chunking, data-paths or compression
strategies, are not recorded in the core objects. However, array representations in
variable and attribute data (notably dask lazy arrays) may hold such information.

The concept of "unlimited" dimensions is also, arguably, an exception. However, this is a
core provision in the NetCDF data model itself (see "Dimension" in the `NetCDF Classic Data Model`_).
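
For illustration, a minimal sketch of declaring such an unlimited dimension
(assuming the ``unlimited`` flag of :class:`~ncdata.NcDimension`)::

    from ncdata import NcDimension

    x_dim = NcDimension("x", 100)                      # fixed-size dimension
    time_dim = NcDimension("time", 0, unlimited=True)  # growable "record" dimension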


Dask chunking control
^^^^^^^^^^^^^^^^^^^^^
Loading from netcdf files generates variables whose data arrays are all Dask
lazy arrays. These are created with the "chunks='auto'" setting.
-There is currently no control for this : If required, load via Iris or Xarray
-instead.
+There is a simple user override API available to control this on a per-dimension basis.
+See :func:`ncdata.netcdf4.from_nc4`.
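
For example, a sketch of per-dimension chunking on load. The exact keyword is as
documented in :func:`ncdata.netcdf4.from_nc4`; here we assume a hypothetical
``dim_chunks`` mapping::

    from ncdata.netcdf4 import from_nc4

    # Hypothetical keyword: chunk sizes per dimension name; dimensions not
    # listed here would keep the default "auto" chunking.
    ncdata = from_nc4("input.nc", dim_chunks={"time": 1, "y": 256, "x": 256})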


Xarray Compatibility
@@ -94,3 +103,4 @@ see : `support added in v3.7.0 <https://scitools-iris.readthedocs.io/en/stable/w


.. _Continuous Integration testing on GitHub: https://github.com/pp-mo/ncdata/blob/main/.github/workflows/ci-tests.yml
.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
@@ -43,7 +43,7 @@ There are no current plans to address these, but could be considered in future
* notably, includes compound and variable-length types

* ..and especially **variable-length strings in variables**.
-see : :ref:`string_and_character_data`
+see : :ref:`string-and-character-data`, :ref:`data-types`


Features planned
60 changes: 43 additions & 17 deletions docs/details/threadlock_sharing.rst
@@ -1,28 +1,19 @@
.. _thread-safety:

NetCDF Thread Locking
=====================
-Ncdata includes support for "unifying" the thread-safety mechanisms between
-ncdata and the format packages it supports (Iris and Ncdata).
+Ncdata provides the :mod:`ncdata.threadlock_sharing` module, which can ensure that
+multiple relevant data-format packages use a "unified" thread-safety mechanism to
+prevent them disturbing each other.

This concerns the safe use of the common NetCDF library by multiple threads.
Such multi-threaded access usually occurs when your code has Dask arrays
created from netcdf file data, which it is either computing or storing to an
output netcdf file.

The netCDF4 package (and the underlying C library) does not implement any
threadlock, neither is it thread-safe (re-entrant) by design.
Thus contention is possible unless controlled by the calling packages.
*Each* of the data-format packages (Ncdata, Iris and Xarray) defines its own
locking mechanism to prevent overlapping calls into the netcdf library.

All 3 data-format packages can map variable data into Dask lazy arrays. Iris and
Xarray can also create delayed write operations (but ncdata currently does not).

However, those mechanisms cannot protect an operation of that package from
overlapping with one in *another* package.

-The :mod:`ncdata.threadlock_sharing` module can ensure that all of the relevant
-packages use the *same* thread lock,
-so that they can safely co-operate in parallel operations.
+In short, this is not needed when all your data is loaded with only **one** of the data
+packages (Iris, Xarray or ncdata). The problem only occurs when you try to
+realise/calculate/save results which combine data loaded from a mixture of sources.

sample code::

@@ -48,3 +39,38 @@ or::
cubes = ncdata.iris.to_iris(ncdata)
iris.save(cubes, output_filepath)


Background
^^^^^^^^^^
In practice, Iris, Xarray and Ncdata are all capable of scanning netCDF files and interpreting their metadata,
without reading all the core variable data contained in them.

This generates objects containing `Dask arrays <https://docs.dask.org/en/stable/array.html>`_ which defer access
to the bulk file data, with certain key benefits :

* no data loading or calculation happens until needed
* the work is divided into sectional ‘tasks’, of which only some may ultimately be needed
* it may be possible to perform multiple sections of calculation (including data fetch) in parallel
* it may be possible to localise operations (fetch or calculate) near to data distributed across a cluster

Usually, the most efficient parallelisation of array operations is by multi-threading, since threads can share
large data arrays in memory.

However, the Python netCDF4 library (and the underlying C library) is not threadsafe
(re-entrant) by design, nor does it implement any thread locking itself; therefore
the "netcdf fetch" call in each input task must be guarded by a mutex.
Thus contention is possible unless controlled by the calling packages.

*Each* of Xarray, Iris and ncdata itself creates input data tasks to fetch sections of
the input files. Each uses a mutex lock around netcdf accesses in those tasks, to stop
them accessing the netCDF4 interface at the same time as any of the others.
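
As a purely illustrative sketch (not the actual implementation of any of these
packages), each such deferred read task is essentially of this form::

    import threading

    import netCDF4

    _NETCDF_LOCK = threading.Lock()  # one such lock per package, by default

    def fetch_chunk(path, varname, keys):
        # Only one thread at a time may call into the netCDF4 library.
        with _NETCDF_LOCK:
            with netCDF4.Dataset(path) as ds:
                return ds.variables[varname][keys]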

This works beautifully until ncdata connects lazy data loaded with Iris (say) to
lazy data loaded from Xarray, where each package unfortunately uses its own separate mutex
to protect the same netcdf library. Then, when we attempt to calculate or save this
result, we may get sporadic and unpredictable system-level errors, even a core-dump.

So, the function of :mod:`ncdata.threadlock_sharing` is to connect the thread-locking
schemes of the separate libraries, so that they cannot accidentally overlap an access
call in a different thread *from the other package*, just as they already cannot
overlap *one of their own*.
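
For example, a minimal sketch of a mixed-source operation, assuming the
:func:`~ncdata.threadlock_sharing.enable_lockshare` call accepts per-package flags::

    import iris
    import xarray
    from ncdata.threadlock_sharing import enable_lockshare

    # Make Iris, Xarray and ncdata all guard netcdf access with one shared lock.
    enable_lockshare(iris=True, xarray=True)

    cubes = iris.load("file_a.nc")
    dataset = xarray.open_dataset("file_b.nc", chunks="auto")
    # ... results combining both sources can now safely be computed or saved.
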
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -38,8 +38,9 @@ User Documentation
User Guide <./userdocs/user_guide/user_guide>


-Reference
----------
+Reference Documentation
+-----------------------

.. toctree::
    :maxdepth: 2

116 changes: 116 additions & 0 deletions docs/userdocs/user_guide/common_operations.rst
@@ -0,0 +1,116 @@
.. _common_operations:

Common Operations
=================
A group of common operations is available on all the core component types,
i.e. the operations of extract/remove/insert/rename/copy on the ``.datasets``, ``.groups``,
``.dimensions``, ``.variables`` and ``.attributes`` properties of the core objects.

Most of these are hopefully "obvious" Pythonic methods of the container objects.

Extract and Remove
------------------
These are implemented as :meth:`~ncdata.NameMap.__delitem__` and :meth:`~ncdata.NameMap.pop`
methods, which work in the usual way.

Examples :

* ``var_x = dataset.variables.pop("x")``
* ``del data.variables["x"]``

Insert / Add
------------
A new component can be added under its own name with the
:meth:`~ncdata.NameMap.add` method.

Example : ``dataset.variables.add(NcVariable("x", dimensions=["x"], data=my_data))``

An :class:`~ncdata.NcAttribute` can also be added or set (if already present) with the special
:meth:`~ncdata.NameMap.set_attrval` method.

Example : ``dataset.variables["x"].set_attrval("units", "m s-1")``

Rename
------
A component can be renamed with the :meth:`~ncdata.NameMap.rename` method. This changes
both the name in the container **and** the component's own name -- setting
``component.name`` directly is not recommended, as the two can obviously become inconsistent.

Example : ``dataset.variables.rename("x", "y")``

.. warning::
    Renaming a dimension will not rename references to it (i.e. in variables), which
    obviously may cause problems.
    We may add a utility to do this safely in future.
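
In the meantime, a hand-rolled sketch of a "safe" dimension rename, assuming variable
dimensions are held as a tuple of names (and ignoring any sub-groups)::

    def rename_dim(dataset, old_name, new_name):
        # Rename the dimension itself ...
        dataset.dimensions.rename(old_name, new_name)
        # ... then fix up every variable which references it.
        for var in dataset.variables.values():
            if old_name in var.dimensions:
                var.dimensions = tuple(
                    new_name if dim == old_name else dim
                    for dim in var.dimensions
                )

    rename_dim(dataset, "x", "lon")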

Copying
-------
All core objects support a ``.copy()`` method, which however does not copy array content
(e.g. variable data or attribute arrays). See for instance :meth:`ncdata.NcData.copy`.

There is also a utility function :func:`ncdata.utils.ncdata_copy`, which is effectively
the same as the NcData object copy.
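
For example, a sketch assuming ``data`` is an :class:`~ncdata.NcData` with a variable "x"::

    copy = data.copy()

    # The containers are new, independent objects ...
    assert copy.variables is not data.variables
    # ... but array content is shared rather than duplicated.
    assert copy.variables["x"].data is data.variables["x"].data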


Equality Checking
-----------------
We provide a simple, comprehensive ``==`` check for :class:`~ncdata.NcDimension` and
:class:`~ncdata.NcAttribute` objects, but not at present for :class:`~ncdata.NcVariable` or
:class:`~ncdata.NcData`.

So, using ``==`` on :class:`~ncdata.NcVariable` or :class:`~ncdata.NcData` objects
will only do an identity check -- that is, it tests ``id(A) == id(B)``, or ``A is B``.

However, these objects **can** be properly compared with the dataset comparison
utilities, :func:`ncdata.utils.dataset_differences` and
:func:`ncdata.utils.variable_differences`. By default, these operations are very
comprehensive, and may be very costly (for instance, when comparing large data arrays),
but they also allow more nuanced and controllable checking, e.g. to skip data array
comparisons or ignore variable ordering.
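
For example, a minimal sketch, assuming the functions return a list of difference
descriptions which is empty when the objects match::

    from ncdata.utils import dataset_differences

    diffs = dataset_differences(data_a, data_b)
    if diffs:
        print("\n".join(diffs))
    else:
        print("datasets are equivalent")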


Object Creation
---------------
The constructors should allow reasonably readable inline creation of data.
See here : :ref:`data-constructors`

Ncdata is deliberately not very fussy about 'correctness', since it is not tied to an actual
dataset which must "make sense". See : :ref:`correctness-checks`.

Hence, there is no great need to add things in the 'right' order (e.g. dimensions
before variables which need them). You can create objects in one go, like this :

.. code-block::

    from ncdata import NcData, NcDimension, NcVariable

    data = NcData(
        dimensions=[
            NcDimension("y", 2),
            NcDimension("x", 3),
        ],
        variables=[
            NcVariable("y", dimensions=["y"], data=[10, 11]),
            NcVariable("x", dimensions=["x"], data=[20, 21, 22]),
            NcVariable("dd", dimensions=["y", "x"], data=[[0, 1, 2], [3, 4, 5]])
        ]
    )

or iteratively, like this :

.. code-block::

    import numpy as np

    data = NcData()
    dims = [("y", 2), ("x", 3)]
    data.variables.addall([
        NcVariable(nn, dimensions=[nn], data=np.arange(ll))
        for nn, ll in dims
    ])
    data.variables.add(
        NcVariable("dd", dimensions=["y", "x"],
                   data=np.arange(6).reshape(2, 3))
    )
    data.dimensions.addall([NcDimension(nn, ll) for nn, ll in dims])

Note : here, the variables were created before the dimensions.


6 changes: 1 addition & 5 deletions docs/userdocs/user_guide/data_objects.rst
@@ -265,10 +265,6 @@ will be automatically converted to a NameMap of ``name: NcAttribute(name: value)
Relationship to File Storage
----------------------------
-Note that file-specific storage aspects, such as chunking, data-paths or compression
-strategies, are not recorded in the core objects. However, array representations in
-variable and attribute data (notably dask lazy arrays) may hold such information.
-The concept of "unlimited" dimensions is arguably an exception. However, this is a
-core provision in the NetCDF data model itself (see "Dimension" in the `NetCDF Classic Data Model`_).
+See :ref:`file-storage`

.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
