Commit

More fixes to correctness, consistency, readability. Add example for string data fix.
pp-mo committed Jan 25, 2025
1 parent 5e81543 commit 8b3c52a
Showing 5 changed files with 88 additions and 49 deletions.
16 changes: 3 additions & 13 deletions docs/details/interface_support.rst
Original file line number Diff line number Diff line change
@@ -14,17 +14,7 @@ Datatypes
^^^^^^^^^
Ncdata supports all the regular datatypes of netcdf, but *not* the
variable-length and user-defined datatypes.

This means, notably, that all string variables will have the basic numpy type
'S1', equivalent to netcdf 'NC_CHAR'. Thus, multi-character string variables
must always have a definite "string-length" dimension.

Attribute values, by contrast, are treated as Python strings with the normal
variable length support. Their basic dtype can be any numpy string dtype,
but will be converted when required.

The NetCDF C library and netCDF4-python do not support arrays of strings in
attributes, so neither does NcData.
Please see : :ref:`data-types`.
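As an illustration of the distinction, here is a sketch in plain numpy (not ncdata API) of why variable-length strings fall outside the regular datatypes:

```python
import numpy as np

# Regular netCDF datatypes correspond to fixed-itemsize numpy dtypes,
# e.g. netCDF 'NC_CHAR' <-> the 1-byte "S1" dtype :
chars = np.array([list("ab"), list("cd")], dtype="S1")
print(chars.dtype.itemsize)   # 1 : one byte per element

# Variable-length strings have no fixed element size, so numpy can only
# represent them as an "object" array -- not a regular netCDF datatype :
vlen = np.array(["a", "longer"], dtype=object)
print(vlen.dtype.itemsize)    # pointer size, not a string length
```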


Data Scaling, Masking and Compression
@@ -45,7 +35,7 @@ control the data compression and translation facilities of the NetCDF file
library.
If required, you should use :mod:`iris` or :mod:`xarray` for this.

Although file-specific storage aspects, such as chunking, data-paths or compression
File-specific storage aspects, such as chunking, data-paths or compression
strategies, are not recorded in the core objects. However, array representations in
variable and attribute data (notably dask lazy arrays) may hold such information.

@@ -58,7 +48,7 @@ Dask chunking control
Loading from netcdf files generates variables whose data arrays are all Dask
lazy arrays. These are created with the "chunks='auto'" setting.

There is simple user override API available to control this on a per-dimension basis.
However, there is a simple per-dimension chunking control available on loading.
See :func:`ncdata.netcdf4.from_nc4`.
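To illustrate the idea, here is a hypothetical helper (not ncdata code; only the ``dim_chunks`` keyword of ``from_nc4`` is real) showing how a per-dimension mapping could translate into a dask-style chunks tuple for one variable:

```python
# Hypothetical sketch (not part of ncdata) : a per-dimension chunking
# mapping becomes a dask-style chunks tuple for a given variable.
# Dimensions not mentioned fall back to dask's "auto" chunking;
# -1 is the dask convention for "one chunk spanning the whole dimension".
def chunks_for(var_dims, dim_chunks):
    return tuple(dim_chunks.get(name, "auto") for name in var_dims)

print(chunks_for(("time", "lat", "lon"), {"time": -1}))
# (-1, 'auto', 'auto')
```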


6 changes: 3 additions & 3 deletions docs/userdocs/getting_started/introduction.rst
@@ -21,7 +21,7 @@ The following code snippets demonstrate the absolute basics.

Likewise, internal consistency is not checked, so it is possible to create
data that cannot be stored in an actual file.
See :func:`ncdata.utils.save_errors`.
See :ref:`correctness-checks`.

We may revisit this in later releases to make data manipulation "safer".

@@ -109,7 +109,7 @@ which behaves like a dictionary::
Attributes
^^^^^^^^^^
Variables live in the ``attributes`` property of a :class:`~ncdata.NcData`
or :class:`~ncdata.Variable`::
or :class:`~ncdata.NcVariable`::

>>> var.set_attrval('a', 1)
NcAttribute('a', 1)
@@ -249,7 +249,7 @@ Thread safety
>>> from ncdata.threadlock_sharing import enable_lockshare
>>> enable_lockshare(iris=True, xarray=True)

See details at :mod:`ncdata.threadlock_sharing`
See details at :ref:`thread_safety`.


Working with NetCDF files
59 changes: 29 additions & 30 deletions docs/userdocs/user_guide/data_objects.rst
@@ -8,7 +8,7 @@ inspect and/or modify it, aiming to do this in the most natural and pythonic way
Data Classes
------------
The data model components are elements of the
`NetCDF Classic Data Model`_ , plus **groups** (from the 'enhanced' netCDF model).
`NetCDF Classic Data Model`_ , plus **groups** (from the
`"enhanced" netCDF data model <NetCDF Enhanced Data Model_>`_ ).

That is, a Dataset(File) consists of just Dimensions, Variables, Attributes and
Groups.
@@ -87,50 +88,47 @@ Attribute Values
In actual netCDF data, the value of an attribute is effectively limited to a one-dimensional
array of certain valid netCDF types, and one-element arrays are exactly equivalent to scalar values.

In ncdata, the ``.value`` of an :class:`ncdata.NcAttribute` must always be a numpy array, and
when creating one the provided ``.value`` is cast with :func:`numpy.asanyarray`.
The ``.value`` of an :class:`ncdata.NcAttribute` must always be a numpy scalar or 1-dimensional array.

However you are not prevented from setting an attributes ``.value`` to something other than
an array, which may cause an error. So for now, if setting the value of an existing attribute,
ensure you always write compatible numpy data, or use :meth:`ncdata.NameMap.set_attrval` which is safe.
When assigning a ``.value``, or creating a new :class:`ncdata.NcAttribute`, the value
is cast with :func:`numpy.asanyarray`; if this fails, or yields a multidimensional array,
an error is raised.

For *reading* attributes, it is best to use :meth:`ncdata.NameMap.get_attrval` or (equivalently)
:meth:`ncdata.NcAttribute.as_python_value()` : These consistently return either
``None`` (if missing); a numpy scalar; or array; or a Python string. Those results are
intended to be equivalent to what you should get from storing in an actual file and reading back,
When *reading* attributes, for consistent results it is best to use the
:meth:`ncdata.NcVariable.get_attrval` method or (equivalently) :meth:`ncdata.NcAttribute.as_python_value` :
These return either ``None`` (if missing), a numpy scalar or array, or a Python string.
These are intended to be equivalent to what you would get from storing in an actual file and reading back,
including re-interpreting a length-one vector as a scalar value.
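This read-back behaviour can be sketched in plain numpy (a hypothetical re-implementation for illustration only, not ncdata's actual code):

```python
import numpy as np

def read_back_value(value):
    # Illustrative sketch of the documented read behaviour (not ncdata code):
    # a length-one vector collapses to a scalar, and string data comes
    # back as a plain Python string.
    if value is None:
        return None                    # missing attribute
    arr = np.asanyarray(value)
    if arr.ndim == 1 and arr.size == 1:
        arr = arr[0]                   # length-1 vector -> scalar
    if arr.dtype.kind == "U":
        return str(arr)                # string data -> Python str
    return arr

print(read_back_value(np.array([3.5])))   # 3.5, a scalar
print(read_back_value("title text"))      # title text
```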

.. attention::
The correct handling and (future) discrimination of string data as character arrays ("char" in netCDF terms)
and/or variable-length strings ("string" type) is still to be determined.
The correct handling and (future) discrimination of attribute values which are character arrays
("char" in netCDF terms) and/or variable-length strings ("string" type) is still to be determined.
( We do not yet properly support any variable-length types. )

For now, we are converting **all** string attributes to python strings.

There is **also** a longstanding known problem with the low-level C (and FORTRAN) interface, which forbids the
creation of vector character attributes, which appear as single concatenated strings. So for now, **all**
string-type attributes appear as single Python strings (you never get an array of strings or list of strings).
For now, :meth:`ncdata.NcAttribute.as_python_value` simply converts **all**
string-like attribute values to Python strings.

See also : :ref:`data-types`

.. _correctness-checks:

Correctness and Consistency
---------------------------
In practice, to support flexibility in construction and manipulation, it is
not practical for ncdata structures to represent valid netCDF at
all times, since this would make changing things awkward.
For example, if a group refers to a dimension *outside* the group, you could not simply
extract it from the dataset because it is not valid in isolation.

Thus, we do allow that ncdata structures represent *invalid* netCDF data.
In order to allow flexibility in construction and manipulation, it is not practical
for ncdata structures to represent valid netCDF at all times, since this would make
changing things awkward.
For example, if a group refers to a dimension *outside* the group, strict correctness
would not allow you to simply extract it from the dataset, because it is not valid in isolation.
Thus, we do allow ncdata structures to represent *invalid* netCDF data :
for example, circular references, missing dimensions or naming mismatches.
Effectively there are a set of data validity rules, which are summarised in the
:func:`ncdata.utils.save_errors` routine.

In practice, there is a minimal set of runtime rules for creating ncdata objects, and
additional requirements when ncdata is converted to actual netCDF. For example,
variables can be initially created with no data. But if subsequently written to a file,
data must be assigned first.
In practice, there are a minimal set of rules which apply when initially creating
ncdata objects, and additional requirements which apply when creating actual netCDF files.
For example, a variable can be initially created with no data. But if subsequently written
to a file, some data must be defined.

The full set of data validity rules is summarised in the
:func:`ncdata.utils.save_errors` routine.
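A minimal sketch of one such rule (hypothetical code for illustration, not ncdata's actual implementation of ``save_errors``): every dimension a variable names must exist in the enclosing dataset.

```python
# Hypothetical sketch of a single save-time validity rule (not ncdata code):
# a variable may only reference dimensions that the dataset defines.
def missing_dimension_errors(dataset_dims, variables):
    errors = []
    for var_name, var_dims in variables.items():
        for dim in var_dims:
            if dim not in dataset_dims:
                errors.append(
                    f"Variable {var_name!r} references unknown dimension {dim!r}"
                )
    return errors

errs = missing_dimension_errors(
    {"time", "x"}, {"vx": ("time", "x"), "vy": ("time", "y")}
)
print(errs)
# ["Variable 'vy' references unknown dimension 'y'"]
```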

.. Note::
These issues are not necessarily all fully resolved. Caution required !
@@ -268,3 +266,4 @@ Relationship to File Storage
See :ref:`file-storage`

.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
.. _NetCDF Enhanced Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#enhanced_model
18 changes: 15 additions & 3 deletions docs/userdocs/user_guide/general_topics.rst
@@ -2,7 +2,7 @@

General Topics
==============
Odd discussion topics realting to core ncdata classes + data management
Odd discussion topics relating to core ncdata classes + data management

Validity Checking
-----------------
@@ -72,16 +72,28 @@ may contain zero bytes so that they convert to variable-width (Python) strings up to
maximum width. The string (maximum) length is a separate dimension, which is recorded
as a normal netCDF dimension like any other.
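For illustration, the conversion this describes can be sketched with plain numpy (hypothetical code, not the ncdata implementation):

```python
import numpy as np

# A char variable of shape (n, strlen) : each row is one fixed-width,
# NUL-padded string, and "strlen" is an ordinary netCDF dimension.
chars = np.array([list("cat\0"), list("bird")], dtype="S1")

# Convert each row back to a variable-width Python string.
# (numpy already drops trailing NUL bytes from "S"-dtype items.)
strings = [b"".join(row).decode().rstrip("\0") for row in chars]
print(strings)   # ['cat', 'bird']
```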

.. note::

Although not formally tested, it has proved possible (and useful) at present to load
files with variables containing variable-length string data, but it is
necessary to supply an explicit user chunking to work around limitations in Dask.
Please see the :ref:`howto example <howto_load_variablewidth_strings>`.

Characters in Attribute Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in string *attribute* values can be written simply as Python
strings. They are stored in an :class:`~ncdata.NcAttribute`'s ``.value`` as a
character array of dtype "<U?", or are returned from
:meth:`ncdata.NcAttribute.as_python_value` as a simple Python string.
A vector of strings does also function as an attribute value, but bear in mind that a
vector of strings is not currently supported in netCDF4 implementations.

A vector of strings is also permitted as an attribute value, but bear in mind that
**a vector of strings is not currently supported in netCDF4 implementations**.
Thus, you cannot have an array or list of strings as an attribute value in an actual file,
and if stored to a file such an attribute will be concatenated into a single string value.

Unicode is supported, and encodes/decodes seamlessly into actual files.

.. _thread_safety:

Thread Safety
-------------
38 changes: 38 additions & 0 deletions docs/userdocs/user_guide/howtos.rst
@@ -552,3 +552,41 @@ or, to convert xarray data variable output to masked integers :
>>> var.set_attrval("_FillValue", -9999)
>>> to_nc4(ncdata, "output.nc")
.. _howto_load_variablewidth_strings:

Load a file containing variable-width string variables
------------------------------------------------------
You must supply a ``dim_chunks`` keyword to the :func:`ncdata.netcdf4.from_nc4` function,
specifying how to chunk the dimension(s) which the string variable uses.

.. code-block:: python

    >>> from ncdata.netcdf4 import from_nc4
    >>> # if we have a "string" type variable using the "date" dimension
    >>> # : don't chunk that dimension.
    >>> dataset = from_nc4(filepath, dim_chunks={"date": -1})

This is needed to avoid a Dask error like
``"auto-chunking with dtype.itemsize == 0 is not supported, please pass in `chunks` explicitly."``

When you have done this, Dask will return the variable data as a numpy *object* array containing Python strings.
You probably still need to (manually) convert that to something more tractable to work with it effectively.
For example, something like :

.. code-block:: python

    >>> var = dataset.variables['name']
    >>> data = var.data.compute()
    >>> maxlen = max(len(s) for s in data)
    >>> # convert to a fixed-width character array
    >>> data = np.array([list(s.ljust(maxlen, "\0")) for s in data])
    >>> print(data.shape, data.dtype)
    (1010, 12) <U1
    >>> dataset.dimensions.add(NcDimension('name_strlen', maxlen))
    >>> var.dimensions = var.dimensions + ("name_strlen",)
    >>> var.data = data
