Commit

More fixes to correctness, consistency, readability. Add example for string data fix.
pp-mo committed Jan 25, 2025
1 parent 5e81543 commit 8b3c52a
Showing 5 changed files with 88 additions and 49 deletions.
16 changes: 3 additions & 13 deletions docs/details/interface_support.rst
Original file line number Diff line number Diff line change
@@ -14,17 +14,7 @@ Datatypes
^^^^^^^^^
Ncdata supports all the regular datatypes of netcdf, but *not* the
variable-length and user-defined datatypes.

This means, notably, that all string variables will have the basic numpy type
'S1', equivalent to netcdf 'NC_CHAR'. Thus, multi-character string variables
must always have a definite "string-length" dimension.

Attribute values, by contrast, are treated as Python strings with the normal
variable length support. Their basic dtype can be any numpy string dtype,
but will be converted when required.

The NetCDF C library and netCDF4-python do not support arrays of strings in
attributes, so neither does NcData.
Please see : :ref:`data-types`.
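As an illustration of the distinction, here is a sketch in plain numpy (not ncdata API) of why variable-length strings fall outside the regular datatypes:

```python
import numpy as np

# Regular netCDF datatypes correspond to fixed-itemsize numpy dtypes,
# e.g. netCDF 'NC_CHAR' <-> the 1-byte "S1" dtype :
chars = np.array([list("ab"), list("cd")], dtype="S1")
print(chars.dtype.itemsize)   # 1 : one byte per element

# Variable-length strings have no fixed element size, so numpy can only
# represent them as an "object" array -- not a regular netCDF datatype :
vlen = np.array(["a", "longer"], dtype=object)
print(vlen.dtype.itemsize)    # pointer size, not a string length
```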


Data Scaling, Masking and Compression
@@ -45,7 +35,7 @@ control the data compression and translation facilities of the NetCDF file
library.
If required, you should use :mod:`iris` or :mod:`xarray` for this.

Although file-specific storage aspects, such as chunking, data-paths or compression
File-specific storage aspects, such as chunking, data-paths or compression
strategies, are not recorded in the core objects. However, array representations in
variable and attribute data (notably dask lazy arrays) may hold such information.

@@ -58,7 +48,7 @@ Dask chunking control
Loading from netcdf files generates variables whose data arrays are all Dask
lazy arrays. These are created with the "chunks='auto'" setting.

There is simple user override API available to control this on a per-dimension basis.
However, there is a simple per-dimension chunking control available on loading.
See :func:`ncdata.netcdf4.from_nc4`.
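To illustrate the idea, here is a hypothetical helper (not ncdata code; only the ``dim_chunks`` keyword of ``from_nc4`` is real) showing how a per-dimension mapping could translate into a dask-style chunks tuple for one variable:

```python
# Hypothetical sketch (not part of ncdata) : a per-dimension chunking
# mapping becomes a dask-style chunks tuple for a given variable.
# Dimensions not mentioned fall back to dask's "auto" chunking;
# -1 is the dask convention for "one chunk spanning the whole dimension".
def chunks_for(var_dims, dim_chunks):
    return tuple(dim_chunks.get(name, "auto") for name in var_dims)

print(chunks_for(("time", "lat", "lon"), {"time": -1}))
# (-1, 'auto', 'auto')
```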


6 changes: 3 additions & 3 deletions docs/userdocs/getting_started/introduction.rst
@@ -21,7 +21,7 @@ The following code snippets demonstrate the absolute basics.

Likewise, internal consistency is not checked, so it is possible to create
data that cannot be stored in an actual file.
See :func:`ncdata.utils.save_errors`.
See :ref:`correctness-checks`.

We may revisit this in later releases to make data manipulation "safer".

@@ -109,7 +109,7 @@ which behaves like a dictionary::
Attributes
^^^^^^^^^^
Variables live in the ``attributes`` property of a :class:`~ncdata.NcData`
or :class:`~ncdata.Variable`::
or :class:`~ncdata.NcVariable`::

>>> var.set_attrval('a', 1)
NcAttribute('a', 1)
@@ -249,7 +249,7 @@ Thread safety
>>> from ncdata.threadlock_sharing import enable_lockshare
>>> enable_lockshare(iris=True, xarray=True)

See details at :mod:`ncdata.threadlock_sharing`
See details at :ref:`thread_safety`.


Working with NetCDF files
59 changes: 29 additions & 30 deletions docs/userdocs/user_guide/data_objects.rst
@@ -8,7 +8,7 @@ inspect and/or modify it, aiming to do this in the most natural and pythonic way
Data Classes
------------
The data model components are elements of the
`NetCDF Classic Data Model`_ , plus **groups** (from the 'enhanced' netCDF model).
`NetCDF Classic Data Model`_ , plus **groups** (from the
`"enhanced" netCDF data model <NetCDF Enhanced Data Model_>`_ ).

That is, a Dataset(File) consists of just Dimensions, Variables, Attributes and
Groups.
@@ -87,50 +88,47 @@ Attribute Values
In actual netCDF data, the value of an attribute is effectively limited to a one-dimensional
array of certain valid netCDF types, and one-element arrays are exactly equivalent to scalar values.

In ncdata, the ``.value`` of an :class:`ncdata.NcAttribute` must always be a numpy array, and
when creating one the provided ``.value`` is cast with :func:`numpy.asanyarray`.
The ``.value`` of an :class:`ncdata.NcAttribute` must always be a numpy scalar or 1-dimensional array.

However you are not prevented from setting an attributes ``.value`` to something other than
an array, which may cause an error. So for now, if setting the value of an existing attribute,
ensure you always write compatible numpy data, or use :meth:`ncdata.NameMap.set_attrval` which is safe.
When assigning a ``.value``, or creating a new :class:`ncdata.NcAttribute`, the value
is cast with :func:`numpy.asanyarray`; if this fails, or yields a multidimensional array,
an error is raised.

For *reading* attributes, it is best to use :meth:`ncdata.NameMap.get_attrval` or (equivalently)
:meth:`ncdata.NcAttribute.as_python_value()` : These consistently return either
``None`` (if missing); a numpy scalar; or array; or a Python string. Those results are
intended to be equivalent to what you should get from storing in an actual file and reading back,
When *reading* attributes, for consistent results it is best to use the
:meth:`ncdata.NcVariable.get_attrval` method or (equivalently) :meth:`ncdata.NcAttribute.as_python_value` :
These return either ``None`` (if missing), a numpy scalar or array, or a Python string.
These are intended to be equivalent to what you would get from storing in an actual file and reading back,
including re-interpreting a length-one vector as a scalar value.
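This read-back behaviour can be sketched in plain numpy (a hypothetical re-implementation for illustration only, not ncdata's actual code):

```python
import numpy as np

def read_back_value(value):
    # Illustrative sketch of the documented read behaviour (not ncdata code):
    # a length-one vector collapses to a scalar, and string data comes
    # back as a plain Python string.
    if value is None:
        return None                    # missing attribute
    arr = np.asanyarray(value)
    if arr.ndim == 1 and arr.size == 1:
        arr = arr[0]                   # length-1 vector -> scalar
    if arr.dtype.kind == "U":
        return str(arr)                # string data -> Python str
    return arr

print(read_back_value(np.array([3.5])))   # 3.5, a scalar
print(read_back_value("title text"))      # title text
```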

.. attention::
The correct handling and (future) discrimination of string data as character arrays ("char" in netCDF terms)
and/or variable-length strings ("string" type) is still to be determined.
The correct handling and (future) discrimination of attribute values which are character arrays
("char" in netCDF terms) and/or variable-length strings ("string" type) is still to be determined.
( We do not yet properly support any variable-length types. )

For now, we are converting **all** string attributes to python strings.

There is **also** a longstanding known problem with the low-level C (and FORTRAN) interface, which forbids the
creation of vector character attributes, which appear as single concatenated strings. So for now, **all**
string-type attributes appear as single Python strings (you never get an array of strings or list of strings).
For now, :meth:`ncdata.NcAttribute.as_python_value` simply converts **all**
string-like attribute values to Python strings.

See also : :ref:`data-types`

.. _correctness-checks:

Correctness and Consistency
---------------------------
In practice, to support flexibility in construction and manipulation, it is
not practical for ncdata structures to represent valid netCDF at
all times, since this would make changing things awkward.
For example, if a group refers to a dimension *outside* the group, you could not simply
extract it from the dataset because it is not valid in isolation.

Thus, we do allow that ncdata structures represent *invalid* netCDF data.
In order to allow flexibility in construction and manipulation, it is not practical
for ncdata structures to represent valid netCDF at all times, since this would make
changing things awkward.
For example, if a group refers to a dimension *outside* the group, strict correctness
would not allow you to simply extract it from the dataset, because it is not valid in isolation.
Thus, we do allow ncdata structures to represent *invalid* netCDF data :
for example, circular references, missing dimensions or naming mismatches.
Effectively there are a set of data validity rules, which are summarised in the
:func:`ncdata.utils.save_errors` routine.

In practice, there is a minimal set of runtime rules for creating ncdata objects, and
additional requirements when ncdata is converted to actual netCDF. For example,
variables can be initially created with no data. But if subsequently written to a file,
data must be assigned first.
In practice, there are a minimal set of rules which apply when initially creating
ncdata objects, and additional requirements which apply when creating actual netCDF files.
For example, a variable can be initially created with no data. But if subsequently written
to a file, some data must be defined.

The full set of data validity rules is summarised in the
:func:`ncdata.utils.save_errors` routine.
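A minimal sketch of one such rule (hypothetical code for illustration, not ncdata's actual implementation of ``save_errors``): every dimension a variable names must exist in the enclosing dataset.

```python
# Hypothetical sketch of a single save-time validity rule (not ncdata code):
# a variable may only reference dimensions that the dataset defines.
def missing_dimension_errors(dataset_dims, variables):
    errors = []
    for var_name, var_dims in variables.items():
        for dim in var_dims:
            if dim not in dataset_dims:
                errors.append(
                    f"Variable {var_name!r} references unknown dimension {dim!r}"
                )
    return errors

errs = missing_dimension_errors(
    {"time", "x"}, {"vx": ("time", "x"), "vy": ("time", "y")}
)
print(errs)
# ["Variable 'vy' references unknown dimension 'y'"]
```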

.. Note::
These issues are not necessarily all fully resolved. Caution required !
@@ -268,3 +266,4 @@ Relationship to File Storage
See :ref:`file-storage`

.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model
.. _NetCDF Enhanced Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#enhanced_model
18 changes: 15 additions & 3 deletions docs/userdocs/user_guide/general_topics.rst
@@ -2,7 +2,7 @@

General Topics
==============
Odd discussion topics realting to core ncdata classes + data management
Odd discussion topics relating to core ncdata classes + data management

Validity Checking
-----------------
@@ -72,16 +72,28 @@ may contain zero bytes so that they convert to variable-width (Python) strings up to
maximum width. The string (maximum) length is a separate dimension, which is recorded
as a normal netCDF dimension like any other.
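For illustration, the conversion this describes can be sketched with plain numpy (hypothetical code, not the ncdata implementation):

```python
import numpy as np

# A char variable of shape (n, strlen) : each row is one fixed-width,
# NUL-padded string, and "strlen" is an ordinary netCDF dimension.
chars = np.array([list("cat\0"), list("bird")], dtype="S1")

# Convert each row back to a variable-width Python string.
# (numpy already drops trailing NUL bytes from "S"-dtype items.)
strings = [b"".join(row).decode().rstrip("\0") for row in chars]
print(strings)   # ['cat', 'bird']
```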

.. note::

Although not formally tested, it has proved possible (and useful) at present to load
files with variables containing variable-length string data, but it is
necessary to supply an explicit user chunking to work around limitations in Dask.
Please see the :ref:`howto example <howto_load_variablewidth_strings>`.

Characters in Attribute Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in string *attribute* values can be written simply as Python
strings. They are stored in an :class:`~ncdata.NcAttribute`'s ``.value`` as a
character array of dtype "<U?", or are returned from
:meth:`ncdata.NcAttribute.as_python_value` as a simple Python string.
A vector of strings does also function as an attribute value, but bear in mind that a
vector of strings is not currently supported in netCDF4 implementations.

A vector of strings is also permitted as an attribute value, but bear in mind that
**a vector of strings is not currently supported in netCDF4 implementations**.
Thus, you cannot have an array or list of strings as an attribute value in an actual file,
and if stored to a file such an attribute will be concatenated into a single string value.

Unicode is supported, and encodes/decodes seamlessly into actual files.

.. _thread_safety:

Thread Safety
-------------
38 changes: 38 additions & 0 deletions docs/userdocs/user_guide/howtos.rst
@@ -552,3 +552,41 @@ or, to convert xarray data variable output to masked integers :
>>> var.set_attrval("_FillValue", -9999)
>>> to_nc4(ncdata, "output.nc")
.. _howto_load_variablewidth_strings:

Load a file containing variable-width string variables
------------------------------------------------------
You must supply a ``dim_chunks`` keyword to the :func:`ncdata.netcdf4.from_nc4` function,
specifying how to chunk the dimension(s) which the string variable uses.

.. code-block:: python

    >>> from ncdata.netcdf4 import from_nc4
    >>> # if we have a "string" type variable using the "date" dimension
    >>> # : don't chunk that dimension.
    >>> dataset = from_nc4(filepath, dim_chunks={"date": -1})

This is needed to avoid a Dask error like
``"auto-chunking with dtype.itemsize == 0 is not supported, please pass in `chunks` explicitly."``

When you have done this, Dask will return the variable data as a numpy *object* array containing Python strings.
You probably still need to (manually) convert that to something more tractable to work with it effectively.
For example, something like :

.. code-block:: python

    >>> var = dataset.variables['name']
    >>> data = var.data.compute()
    >>> maxlen = max(len(s) for s in data)
    >>> # convert to a fixed-width character array
    >>> data = np.array([list(s.ljust(maxlen, "\0")) for s in data])
    >>> print(data.shape, data.dtype)
    (1010, 12) <U1
    >>> dataset.dimensions.add(NcDimension('name_strlen', maxlen))
    >>> var.dimensions = var.dimensions + ("name_strlen",)
    >>> var.data = data
