Commit 250b19c

TomNicholas authored and shoyer committed
Source encoding always set when opening datasets (#2626)
* Add source encoding if not already present when opening dataset
* Test source encoding present
* Updated what's new
* Revert "Updated what's new" (this reverts commit 7858799)
* Don't close file-like objects
* Updated what's new
* DOC: document source encoding for datasets
1 parent 1545b50 commit 250b19c

File tree

4 files changed: +41 -11

  doc/io.rst
  doc/whats-new.rst
  xarray/backends/api.py
  xarray/tests/test_backends.py

doc/io.rst

Lines changed: 15 additions & 9 deletions
@@ -197,24 +197,30 @@ turn this decoding off manually.
 .. _CF conventions: http://cfconventions.org/
 
 You can view this encoding information (among others) in the
-:py:attr:`DataArray.encoding <xarray.DataArray.encoding>` attribute:
+:py:attr:`DataArray.encoding <xarray.DataArray.encoding>` and
+:py:attr:`Dataset.encoding <xarray.Dataset.encoding>` attributes:
 
 .. ipython::
     :verbatim:
 
     In [1]: ds_disk['y'].encoding
     Out[1]:
-    {'calendar': u'proleptic_gregorian',
-     'chunksizes': None,
+    {'zlib': False,
+     'shuffle': False,
      'complevel': 0,
-     'contiguous': True,
-     'dtype': dtype('float64'),
      'fletcher32': False,
-     'least_significant_digit': None,
-     'shuffle': False,
+     'contiguous': True,
+     'chunksizes': None,
      'source': 'saved_on_disk.nc',
-     'units': u'days since 2000-01-01 00:00:00',
-     'zlib': False}
+     'original_shape': (5,),
+     'dtype': dtype('int64'),
+     'units': 'days since 2000-01-01 00:00:00',
+     'calendar': 'proleptic_gregorian'}
+
+    In [9]: ds_disk.encoding
+    Out[9]:
+    {'unlimited_dims': set(),
+     'source': 'saved_on_disk.nc'}
 
 Note that all operations that manipulate variables other than indexing
 will remove encoding information.
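
For context, the documented behaviour can be reproduced end-to-end with a
short sketch like the following (assuming a netCDF backend such as netCDF4
or scipy is installed; the exact encoding keys vary by backend and dtype):

    import numpy as np
    import xarray as xr

    # Write a small dataset so the backend populates encoding on read.
    ds = xr.Dataset({'y': ('x', np.arange(5))})
    ds.to_netcdf('saved_on_disk.nc')

    with xr.open_dataset('saved_on_disk.nc') as ds_disk:
        # Per-variable encoding: compression settings, dtype, 'source', ...
        print(ds_disk['y'].encoding)
        # Dataset-level encoding: after this commit it always includes
        # 'source' when the dataset was opened from a filename.
        print(ds_disk.encoding)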

doc/whats-new.rst

Lines changed: 3 additions & 0 deletions
@@ -67,6 +67,9 @@ Enhancements
 - :py:meth:`DataArray.resample` and :py:meth:`Dataset.resample` now supports the
   ``loffset`` kwarg just like Pandas.
   By `Deepak Cherian <https://github.com/dcherian>`_
+- Datasets are now guaranteed to have a ``'source'`` encoding, so the source
+  file name is always stored (:issue:`2550`).
+  By `Tom Nicholas <http://github.com/TomNicholas>`_.
 - The `apply` methods for `DatasetGroupBy`, `DataArrayGroupBy`,
   `DatasetResample` and `DataArrayResample` can now pass positional arguments to
   the applied function.

xarray/backends/api.py

Lines changed: 12 additions & 2 deletions
@@ -299,6 +299,7 @@ def maybe_decode_store(store, lock=False):
 
     if isinstance(filename_or_obj, backends.AbstractDataStore):
         store = filename_or_obj
+        ds = maybe_decode_store(store)
     elif isinstance(filename_or_obj, basestring):
 
         if (isinstance(filename_or_obj, bytes) and
@@ -339,15 +340,21 @@ def maybe_decode_store(store, lock=False):
                              % engine)
 
         with close_on_error(store):
-            return maybe_decode_store(store)
+            ds = maybe_decode_store(store)
     else:
         if engine is not None and engine != 'scipy':
             raise ValueError('can only read file-like objects with '
                              "default engine or engine='scipy'")
         # assume filename_or_obj is a file-like object
         store = backends.ScipyDataStore(filename_or_obj)
+        ds = maybe_decode_store(store)
 
-    return maybe_decode_store(store)
+    # Ensure source filename always stored in dataset object (GH issue #2550)
+    if 'source' not in ds.encoding:
+        if isinstance(filename_or_obj, basestring):
+            ds.encoding['source'] = filename_or_obj
+
+    return ds
 
 
 def open_dataarray(filename_or_obj, group=None, decode_cf=True,
@@ -484,6 +491,7 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
                    lock=None, data_vars='all', coords='different',
                    autoclose=None, parallel=False, **kwargs):
     """Open multiple files as a single dataset.
+
     Requires dask to be installed. See documentation for details on dask [1].
     Attributes from the first dataset file are used for the combined dataset.
 
@@ -523,6 +531,8 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
         of all non-null values.
     preprocess : callable, optional
         If provided, call this function on each dataset prior to concatenation.
+        You can find the file-name from which each dataset was loaded in
+        ``ds.encoding['source']``.
     engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio', 'cfgrib'},
         optional
         Engine to use when reading files. If not provided, the default engine
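
The docstring addition above points at the main use case: reading the
guaranteed 'source' encoding inside a preprocess callback. A minimal sketch
(the glob pattern and callback name are placeholders):

    import xarray as xr

    def log_source(ds):
        # Each component dataset records the filename it was loaded from.
        print('loading', ds.encoding['source'])
        return ds

    combined = xr.open_mfdataset('data/*.nc', preprocess=log_source)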

xarray/tests/test_backends.py

Lines changed: 11 additions & 0 deletions
@@ -3426,3 +3426,14 @@ def test_no_warning_from_dask_effective_get():
         ds = Dataset()
         ds.to_netcdf(tmpfile)
     assert len(record) == 0
+
+
+@requires_scipy_or_netCDF4
+def test_source_encoding_always_present():
+    # Test for GH issue #2550.
+    rnddata = np.random.randn(10)
+    original = Dataset({'foo': ('x', rnddata)})
+    with create_tmp_file() as tmp:
+        original.to_netcdf(tmp)
+        with open_dataset(tmp) as ds:
+            assert ds.encoding['source'] == tmp
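
Note that the new guard in api.py only assigns 'source' when filename_or_obj
is a string, so a dataset opened from a file-like object may still have no
'source' entry. A sketch of both cases (assuming scipy is available for the
in-memory round trip):

    import io
    import numpy as np
    import xarray as xr

    original = xr.Dataset({'foo': ('x', np.random.randn(10))})

    # Opened by filename: 'source' is guaranteed after this commit.
    original.to_netcdf('tmp.nc')
    with xr.open_dataset('tmp.nc') as ds:
        assert ds.encoding['source'] == 'tmp.nc'

    # Opened from a file-like object: the guard does not apply,
    # so 'source' is likely absent here.
    buf = io.BytesIO(original.to_netcdf())
    with xr.open_dataset(buf) as ds:
        print(ds.encoding.get('source'))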
