
Lots more improvements + move sections.
pp-mo committed Jan 16, 2025
1 parent a51f251 commit 5e81543
Showing 11 changed files with 271 additions and 206 deletions.
10 changes: 5 additions & 5 deletions docs/change_log.rst
@@ -1,22 +1,22 @@
Versions and Change Notes
=========================

-Project Status
---------------
+Project Development Status
+--------------------------
We intend to follow `PEP 440 <https://peps.python.org/pep-0440/>`_,
or (older) `SemVer <https://semver.org/>`_ versioning principles.
This means the version string has the basic form **"major.minor.bugfix[special-types]"**.

-Current release version is at **"v0.1"**.
+Current release version is at **"v0.2"**.

-This is a first complete implementation,
-with functional operational of all public APIs.
+This is a complete implementation, with functional operation of all public APIs.
The code is however still experimental, and APIs are not stable
(hence no major version yet).


Change Notes
------------
Summary of key features by release number

Unreleased
^^^^^^^^^^
4 changes: 4 additions & 0 deletions docs/details/details_index.rst
@@ -1,8 +1,12 @@
Detail Topics
=============
Detail reference topics

.. toctree::
    :maxdepth: 2

    ../change_log
    ./known_issues
    ./interface_support
    ./threadlock_sharing
    ./developer_notes
14 changes: 12 additions & 2 deletions docs/details/interface_support.rst
@@ -35,6 +35,7 @@ array has the actual variable dtype, and the "scale_factor" and

The existence of a "_FillValue" attribute controls how.. TODO

.. _file-storage:

File storage control
^^^^^^^^^^^^^^^^^^^^
@@ -44,13 +45,21 @@ control the data compression and translation facilities of the NetCDF file
library.
If required, you should use :mod:`iris` or :mod:`xarray` for this.

File-specific storage aspects, such as chunking, data-paths or compression
strategies, are not recorded in the core objects. However, array representations in
variable and attribute data (notably dask lazy arrays) may hold such information.

The concept of "unlimited" dimensions is also, arguably, an exception. However, this is a
core provision in the NetCDF data model itself (see "Dimension" in the `NetCDF Classic Data Model`_).
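
For illustration, a minimal sketch of declaring such an unlimited dimension
(assuming the ``unlimited`` flag of :class:`~ncdata.NcDimension`)::

    from ncdata import NcDimension

    x_dim = NcDimension("x", 100)                      # fixed-size dimension
    time_dim = NcDimension("time", 0, unlimited=True)  # growable "record" dimension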


Dask chunking control
^^^^^^^^^^^^^^^^^^^^^
Loading from netcdf files generates variables whose data arrays are all Dask
lazy arrays. These are created with the "chunks='auto'" setting.
-There is currently no control for this : If required, load via Iris or Xarray
-instead.
+There is a simple user override API available to control this on a per-dimension basis.
+See :func:`ncdata.netcdf4.from_nc4`.
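
For example, a sketch of per-dimension chunking on load. The exact keyword is as
documented in :func:`ncdata.netcdf4.from_nc4`; here we assume a hypothetical
``dim_chunks`` mapping::

    from ncdata.netcdf4 import from_nc4

    # Hypothetical keyword: chunk sizes per dimension name; dimensions not
    # listed here would keep the default "auto" chunking.
    ncdata = from_nc4("input.nc", dim_chunks={"time": 1, "y": 256, "x": 256})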


Xarray Compatibility
@@ -94,3 +103,4 @@ see : `support added in v3.7.0 <https://scitools-iris.readthedocs.io/en/stable/w


.. _Continuous Integration testing on GitHub: https://github.com/pp-mo/ncdata/blob/main/.github/workflows/ci-tests.yml
.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
@@ -43,7 +43,7 @@ There are no current plans to address these, but could be considered in future
* notably, includes compound and variable-length types

* ..and especially **variable-length strings in variables**.
-see : :ref:`string_and_character_data`
+see : :ref:`string-and-character-data`, :ref:`data-types`


Features planned
60 changes: 43 additions & 17 deletions docs/details/threadlock_sharing.rst
@@ -1,28 +1,19 @@
.. _thread-safety:

NetCDF Thread Locking
=====================
-Ncdata includes support for "unifying" the thread-safety mechanisms between
-ncdata and the format packages it supports (Iris and Ncdata).
+Ncdata provides the :mod:`ncdata.threadlock_sharing` module, which can ensure that
+multiple relevant data-format packages use a "unified" thread-safety mechanism to
+prevent them disturbing each other.

This concerns the safe use of the common NetCDF library by multiple threads.
Such multi-threaded access usually occurs when your code has Dask arrays
created from netcdf file data, which it is either computing or storing to an
output netcdf file.

The netCDF4 package (and the underlying C library) does not implement any
threadlock, neither is it thread-safe (re-entrant) by design.
Thus contention is possible unless controlled by the calling packages.
*Each* of the data-format packages (Ncdata, Iris and Xarray) defines its own
locking mechanism to prevent overlapping calls into the netcdf library.

All 3 data-format packages can map variable data into Dask lazy arrays. Iris and
Xarray can also create delayed write operations (but ncdata currently does not).

However, those mechanisms cannot protect an operation of that package from
overlapping with one in *another* package.

-The :mod:`ncdata.threadlock_sharing` module can ensure that all of the relevant
-packages use the *same* thread lock,
-so that they can safely co-operate in parallel operations.
+In short, this is not needed when all your data is loaded with only **one** of the data
+packages (Iris, Xarray or ncdata). The problem only occurs when you try to
+realise/calculate/save results which combine data loaded from a mixture of sources.

sample code::

@@ -48,3 +39,38 @@ or::
cubes = ncdata.iris.to_iris(ncdata)
iris.save(cubes, output_filepath)


Background
^^^^^^^^^^
In practice, Iris, Xarray and Ncdata are all capable of scanning netCDF files and interpreting their metadata,
without reading all the core variable data contained in them.

This generates objects containing `Dask arrays <https://docs.dask.org/en/stable/array.html>`_ which defer access
to the bulk file data, with certain key benefits :

* no data loading or calculation happens until needed
* the work is divided into sectional ‘tasks’, of which only some may ultimately be needed
* it may be possible to perform multiple sections of calculation (including data fetch) in parallel
* it may be possible to localise operations (fetch or calculate) near to data distributed across a cluster

Usually, the most efficient parallelisation of array operations is by multi-threading, since threads can share
large data arrays in memory.

However, the Python netCDF4 library (and the underlying C library) is not threadsafe
(re-entrant) by design, nor does it implement any thread locking itself; therefore
the "netcdf fetch" call in each input task must be guarded by a mutex.
Thus contention is possible unless controlled by the calling packages.

*Each* of Xarray, Iris and ncdata itself creates input data tasks to fetch sections of
the input files. Each uses a mutex lock around netcdf accesses in those tasks, to stop
them accessing the netCDF4 interface at the same time as any of the others.
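
As a purely illustrative sketch (not the actual implementation of any of these
packages), each such deferred read task is essentially of this form::

    import threading

    import netCDF4

    _NETCDF_LOCK = threading.Lock()  # one such lock per package, by default

    def fetch_chunk(path, varname, keys):
        # Only one thread at a time may call into the netCDF4 library.
        with _NETCDF_LOCK:
            with netCDF4.Dataset(path) as ds:
                return ds.variables[varname][keys]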

This works beautifully until ncdata connects lazy data loaded with Iris (say) to
lazy data loaded from Xarray, where each package unfortunately uses its own separate mutex
to protect the same netcdf library. Then, when we attempt to calculate or save this
result, we may get sporadic and unpredictable system-level errors, even a core-dump.

So, the function of :mod:`ncdata.threadlock_sharing` is to connect the thread-locking
schemes of the separate libraries, so that they cannot accidentally overlap an access
call in a different thread *from the other package*, just as they already cannot
overlap *one of their own*.
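
For example, a minimal sketch of a mixed-source operation, assuming the
:func:`~ncdata.threadlock_sharing.enable_lockshare` call accepts per-package flags::

    import iris
    import xarray
    from ncdata.threadlock_sharing import enable_lockshare

    # Make Iris, Xarray and ncdata all guard netcdf access with one shared lock.
    enable_lockshare(iris=True, xarray=True)

    cubes = iris.load("file_a.nc")
    dataset = xarray.open_dataset("file_b.nc", chunks="auto")
    # ... results combining both sources can now safely be computed or saved.
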
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -38,8 +38,9 @@ User Documentation
User Guide <./userdocs/user_guide/user_guide>


-Reference
----------
+Reference Documentation
+-----------------------

.. toctree::
    :maxdepth: 2

116 changes: 116 additions & 0 deletions docs/userdocs/user_guide/common_operations.rst
@@ -0,0 +1,116 @@
.. _common_operations:

Common Operations
=================
A group of common operations is available on all the core component types,
i.e. the operations of extract/remove/insert/rename/copy on the ``.datasets``, ``.groups``,
``.dimensions``, ``.variables`` and ``.attributes`` properties of the core objects.

Most of these are hopefully "obvious" Pythonic methods of the container objects.

Extract and Remove
------------------
These are implemented as :meth:`~ncdata.NameMap.__delitem__` and :meth:`~ncdata.NameMap.pop`
methods, which work in the usual way.

Examples :

* ``var_x = dataset.variables.pop("x")``
* ``del data.variables["x"]``

Insert / Add
------------
A new component can be added under its own name with the
:meth:`~ncdata.NameMap.add` method.

Example : ``dataset.variables.add(NcVariable("x", dimensions=["x"], data=my_data))``

An :class:`~ncdata.NcAttribute` can also be added or set (if already present) with the special
:meth:`~ncdata.NameMap.set_attrval` method.

Example : ``dataset.variables["x"].set_attrval("units", "m s-1")``

Rename
------
A component can be renamed with the :meth:`~ncdata.NameMap.rename` method. This changes
both the name in the container **and** the component's own name -- setting
``component.name`` directly is not recommended, as the two can obviously become inconsistent.

Example : ``dataset.variables.rename("x", "y")``

.. warning::
    Renaming a dimension will not rename references to it (i.e. in variables), which
    obviously may cause problems.
    We may add a utility to do this safely in future.
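
In the meantime, a hand-rolled sketch of a "safe" dimension rename, assuming variable
dimensions are held as a tuple of names (and ignoring any sub-groups)::

    def rename_dim(dataset, old_name, new_name):
        # Rename the dimension itself ...
        dataset.dimensions.rename(old_name, new_name)
        # ... then fix up every variable which references it.
        for var in dataset.variables.values():
            if old_name in var.dimensions:
                var.dimensions = tuple(
                    new_name if dim == old_name else dim
                    for dim in var.dimensions
                )

    rename_dim(dataset, "x", "lon")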

Copying
-------
All core objects support a ``.copy()`` method, which however does not copy array content
(e.g. variable data or attribute arrays). See for instance :meth:`ncdata.NcData.copy`.

There is also a utility function :func:`ncdata.utils.ncdata_copy`, which is effectively
the same as the NcData object copy.
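
For example, a sketch assuming ``data`` is an :class:`~ncdata.NcData` with a variable "x"::

    copy = data.copy()

    # The containers are new, independent objects ...
    assert copy.variables is not data.variables
    # ... but array content is shared rather than duplicated.
    assert copy.variables["x"].data is data.variables["x"].data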


Equality Checking
-----------------
We provide a simple, comprehensive ``==`` check for :class:`~ncdata.NcDimension` and
:class:`~ncdata.NcAttribute` objects, but not at present for :class:`~ncdata.NcVariable` or
:class:`~ncdata.NcData`.

So, using ``==`` on :class:`~ncdata.NcVariable` or :class:`~ncdata.NcData` objects
will only do an identity check -- that is, it tests ``id(A) == id(B)``, or ``A is B``.

However, these objects **can** be properly compared with the dataset comparison
utilities, :func:`ncdata.utils.dataset_differences` and
:func:`ncdata.utils.variable_differences`. By default, these operations are very
comprehensive, and may be very costly (for instance, when comparing large data arrays),
but they also allow more nuanced and controllable checking, e.g. to skip data array
comparisons or ignore variable ordering.
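
For example, a minimal sketch, assuming the functions return a list of difference
descriptions which is empty when the objects match::

    from ncdata.utils import dataset_differences

    diffs = dataset_differences(data_a, data_b)
    if diffs:
        print("\n".join(diffs))
    else:
        print("datasets are equivalent")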


Object Creation
---------------
The constructors should allow reasonably readable inline creation of data.
See here : :ref:`data-constructors`

Ncdata is deliberately not very fussy about 'correctness', since it is not tied to an actual
dataset which must "make sense". See : :ref:`correctness-checks`.

Hence, there is no great need to add things in the 'right' order (e.g. dimensions
before variables which need them). You can create objects in one go, like this :

.. code-block::

    from ncdata import NcData, NcDimension, NcVariable

    data = NcData(
        dimensions=[
            NcDimension("y", 2),
            NcDimension("x", 3),
        ],
        variables=[
            NcVariable("y", dimensions=["y"], data=[10, 11]),
            NcVariable("x", dimensions=["x"], data=[20, 21, 22]),
            NcVariable("dd", dimensions=["y", "x"], data=[[0, 1, 2], [3, 4, 5]])
        ]
    )

or iteratively, like this :

.. code-block::

    import numpy as np

    data = NcData()
    dims = [("y", 2), ("x", 3)]
    data.variables.addall([
        NcVariable(nn, dimensions=[nn], data=np.arange(ll))
        for nn, ll in dims
    ])
    data.variables.add(
        NcVariable("dd", dimensions=["y", "x"],
                   data=np.arange(6).reshape(2, 3))
    )
    data.dimensions.addall([NcDimension(nn, ll) for nn, ll in dims])

Note : here, the variables were created before the dimensions.


6 changes: 1 addition & 5 deletions docs/userdocs/user_guide/data_objects.rst
@@ -265,10 +265,6 @@ will be automatically converted to a NameMap of ``name: NcAttribute(name: value)
Relationship to File Storage
----------------------------
-Note that file-specific storage aspects, such as chunking, data-paths or compression
-strategies, are not recorded in the core objects. However, array representations in
-variable and attribute data (notably dask lazy arrays) may hold such information.
-The concept of "unlimited" dimensions is arguably an exception. However, this is a
-core provision in the NetCDF data model itself (see "Dimension" in the `NetCDF Classic Data Model`_).
+See :ref:`file-storage`

.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
