Commit d7f4e96

Merge pull request #111 from akleeman/prepare-v0.1
Prepare v0.1
2 parents 9d09b43 + 9f15916 commit d7f4e96

13 files changed: +1212 -194 lines changed

README.md

Lines changed: 17 additions & 34 deletions
@@ -22,7 +22,7 @@ makes many powerful array operations possible:
 dimensions (known in numpy as "broadcasting") based on dimension names,
 regardless of their original order.
 - Flexible split-apply-combine operations with groupby:
-`x.groupby('time.dayofyear').apply(lambda y: y - y.mean())`.
+`x.groupby('time.dayofyear').mean()`.
 - Database like aligment based on coordinate labels that smoothly
 handles missing values: `x, y = xray.align(x, y, join='outer')`.
 - Keep track of arbitrary metadata in the form of a Python dictionary:
@@ -38,9 +38,10 @@ Because **xray** implements the same data model as the NetCDF file format,
 xray datasets have a natural and portable serialization format. But it's
 also easy to robustly convert an xray `DataArray` to and from a numpy
 `ndarray` or a pandas `DataFrame` or `Series`, providing compatibility with
-the full [scientific-python ecosystem][scipy].
+the full [PyData ecosystem][pydata].
 
 [pandas]: http://pandas.pydata.org/
+[pydata]: http://pydata.org/
 [scipy]: http://scipy.org/
 [ndarray]: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
 
@@ -143,43 +144,34 @@ labeled numpy arrays that provided some guidance for the design of xray.
 - Be fast. There shouldn't be a significant overhead for metadata aware
 manipulation of n-dimensional arrays, as long as the arrays are large
 enough. The goal is to be as fast as pandas or raw numpy.
-- Provide a uniform API for loading and saving scientific data in a variety
-of formats (including streaming data).
-- Take a pragmatic approach to metadata (attributes), and be very cautious
-before implementing any functionality that relies on it. Automatically
-maintaining attributes is a tricky and very hard to get right (see
-discussion about Iris above).
+- Support loading and saving labeled scientific data in a variety of formats
+(including streaming data).
 
 ## Getting started
 
-For more details, see the **[full documentation][docs]** (still a work in
-progress) or the source code. **xray** is rapidly maturing, but it is still in
-its early development phase. ***Expect the API to change.***
+For more details, see the **[full documentation][docs]**, particularly the
+**[tutorial][tutorial]**.
 
 xray requires Python 2.7 and recent versions of [numpy][numpy] (1.8.0 or
 later) and [pandas][pandas] (0.13.1 or later). [netCDF4-python][nc4],
 [pydap][pydap] and [scipy][scipy] are optional: they add support for reading
 and writing netCDF files and/or accessing OpenDAP datasets. We plan to
-eventually support Python 3 but aren't there yet. The easiest way to get any
-of these dependencies installed from scratch is to use [Anaconda][anaconda].
+eventually support Python 3 but aren't there yet.
 
-xray is not yet available on the Python package index (prior to its initial
-release). For now, you need to install it from source:
+You can install xray from the pypi with pip:
 
-git clone https://github.com/akleeman/xray.git
-# WARNING: this will automatically upgrade numpy & pandas if necessary!
-pip install -e xray
-
-Don't forget to `git fetch` regular updates!
+pip install xray
 
 [docs]: http://xray.readthedocs.org/
+[tutorial]: http://xray.readthedocs.org/en/latest/tutorial.html
 [numpy]: http://www.numpy.org/
 [pydap]: http://www.pydap.org/
 [anaconda]: https://store.continuum.io/cshop/anaconda/
 
 ## Anticipated API changes
 
-Aspects of the API that we currently intend to change:
+Aspects of the API that we currently intend to change in future versions of
+xray:
 
 - The constructor for `DataArray` objects will probably change, so that it
 is possible to create new `DataArray` objects without putting them into a
@@ -192,19 +184,10 @@ Aspects of the API that we currently intend to change:
 dimensional arrays.
 - Future versions of xray will add better support for working with datasets
 too big to fit into memory, probably by wrapping libraries like
-[blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately:
-- Array indexing will be made lazy, instead of immediately creating an
-ndarray. This will make it easier to subsample from very large Datasets
-incrementally using the `indexed` and `labeled` methods. We might need to
-add a special method to allow for explicitly caching values in memory.
-- We intend to support `Dataset` objects linked to NetCDF or HDF5 files on
-disk to allow for incremental writing of data.
-
-Once we get the API in a state we're comfortable with and improve the
-documentation, we intend to release version 0.1. Our target is to do so before
-the xray talk on May 3, 2014 at [PyData Silicon Valley][pydata].
-
-[pydata]: http://pydata.org/sv2014/
+[blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately, we intend
+to support `Dataset` objects linked to NetCDF or HDF5 files on disk to
+allow for incremental writing of data.
+
 [blaze]: https://github.com/ContinuumIO/blaze/
 [blz]: https://github.com/ContinuumIO/blz
 [biggus]: https://github.com/SciTools/biggus
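
Reviewer note on the first README hunk: the groupby example changes from an anomaly calculation (`apply(lambda y: y - y.mean())`) to a plain per-group mean. The sketch below is not xray code. It uses pandas only, with made-up values and hypothetical variable names, to illustrate what the two expressions compute under the assumption that xray's groupby follows the same split-apply-combine semantics.

```python
import pandas as pd

# Hypothetical daily series covering the same two calendar days in two years.
index = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2001-01-01", "2001-01-02"])
x = pd.Series([1.0, 2.0, 3.0, 6.0], index=index)

# Group key: day of year (1 or 2 here), analogous to 'time.dayofyear'.
day_of_year = x.index.dayofyear

# New README example: reduce each group to a single mean value.
climatology = x.groupby(day_of_year).mean()

# Old README example: subtract each group's mean, keeping the original length.
anomalies = x.groupby(day_of_year).transform(lambda y: y - y.mean())
```

The reduction returns one value per group, while the anomaly form returns a result aligned with the input, so the two README examples demonstrate different kinds of groupby operation.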

doc/_static/opendap-prism-tmax.png

21.5 KB

doc/_static/series_plot_example.png

-143 KB
Binary file not shown.

doc/api.rst

Lines changed: 39 additions & 8 deletions
@@ -7,7 +7,7 @@ Dataset
 -------
 
 Creating a dataset
-~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~
 .. autosummary::
 :toctree: generated/
 
@@ -20,8 +20,6 @@ Attributes and underlying data
 .. autosummary::
 :toctree: generated/
 
-Dataset.variables
-Dataset.virtual_variables
 Dataset.coordinates
 Dataset.noncoordinates
 Dataset.dimensions
@@ -45,10 +43,14 @@ and values given by ``DataArray`` objects.
 Dataset.copy
 Dataset.iteritems
 Dataset.itervalues
+Dataset.virtual_variables
 
 Comparisons
 ~~~~~~~~~~~
 
+.. autosummary::
+:toctree: generated/
+
 Dataset.equals
 Dataset.identical
 
@@ -58,8 +60,8 @@ Selecting
 .. autosummary::
 :toctree: generated/
 
-Dataset.indexed_by
-Dataset.labeled_by
+Dataset.indexed
+Dataset.labeled
 Dataset.reindex
 Dataset.reindex_like
 Dataset.rename
@@ -74,12 +76,26 @@ IO / Conversion
 .. autosummary::
 :toctree: generated/
 
-Dataset.dump
+Dataset.to_netcdf
 Dataset.dumps
 Dataset.dump_to_store
 Dataset.to_dataframe
 Dataset.from_dataframe
 
+Dataset internals
+~~~~~~~~~~~~~~~~~
+
+These attributes and classes provide a low-level interface for working
+with Dataset variables. In general you should use the Dataset dictionary-
+like interface instead and working with DataArray objects:
+
+.. autosummary::
+:toctree: generated/
+
+Dataset.variables
+Variable
+Coordinate
+
 Backends (experimental)
 ~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -109,10 +125,24 @@ Attributes and underlying data
 :toctree: generated/
 
 DataArray.values
+DataArray.as_index
 DataArray.coordinates
 DataArray.name
 DataArray.dataset
 DataArray.attrs
+DataArray.encoding
+DataArray.variable
+
+NDArray attributes
+~~~~~~~~~~~~~~~~~~
+
+.. autosummary::
+:toctree: generated/
+
+DataArray.ndim
+DataArray.shape
+DataArray.size
+DataArray.dtype
 
 Selecting
 ~~~~~~~~~
@@ -123,8 +153,8 @@ Selecting
 DataArray.__getitem__
 DataArray.__setitem__
 DataArray.loc
-DataArray.indexed_by
-DataArray.labeled_by
+DataArray.indexed
+DataArray.labeled
 DataArray.reindex
 DataArray.reindex_like
 DataArray.rename
@@ -150,6 +180,7 @@ Computations
 DataArray.transpose
 DataArray.T
 DataArray.reduce
+DataArray.get_axis_num
 DataArray.all
 DataArray.any
 DataArray.argmax

doc/conf.py

Lines changed: 3 additions & 2 deletions
@@ -85,9 +85,10 @@ def __getattr__(cls, name):
 extensions = [
 'sphinx.ext.autodoc',
 'sphinx.ext.autosummary',
+'sphinx.ext.intersphinx',
 'numpydoc',
-'ipython_directive',
-'ipython_console_highlighting'
+'IPython.sphinxext.ipython_directive',
+'IPython.sphinxext.ipython_console_highlighting',
 ]
 
 autosummary_generate = True
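
Reviewer note on this hunk: enabling `sphinx.ext.intersphinx` is what allows roles such as `` :py:class:`pandas.DataFrame` `` in doc/data-structures.rst to link to external documentation. The extension only resolves targets listed in an `intersphinx_mapping` setting, which this hunk does not show; it may already be defined elsewhere in doc/conf.py. A minimal sketch of such a setting follows. The project names and URLs are assumptions, not taken from the commit.

```python
# Hypothetical addition to doc/conf.py; not part of this commit's diff.
# Each entry maps a project name to (base URL of its docs, inventory location).
# None means "fetch objects.inv from the base URL". All URLs are assumptions.
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    "numpy": ("https://numpy.org/doc/stable", None),
    "pandas": ("https://pandas.pydata.org/docs", None),
}
```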

doc/data-structures.rst

Lines changed: 42 additions & 22 deletions
@@ -1,54 +1,74 @@
 Data structures
 ===============
 
-``xray``'s core data structures are the ``Dataset``, ``Variable`` and
-``DataArray``.
+xray's core data structures are the :py:class:`~xray.Dataset`,
+the :py:class:`~xray.Variable` (including its subclass
+:py:class:`~xray.Coordinate`) and the :py:class:`~xray.DataArray`.
+
+The document is intended as a technical summary of the xray data model. It
+should be mostly of interest to advanced users interested in extending or
+contributing to xray internals.
 
 Dataset
 -------
 
-``Dataset`` is netcdf-like object consisting of **variables** (a dictionary of
-Variable objects) and **attributes** (an ordered dictionary) which together
-form a self-describing data set.
+:py:class:`~xray.Dataset` is a Python object representing a fully self-
+described dataset of labeled N-dimensional arrays. It consists of:
+
+1. **variables**: A dictionary of Variable objects.
+2. **dimensions**: A dictionary giving the lengths of shared dimensions, which
+are required to be consistent across all variables in a Dataset.
+3. **attributes**: An ordered dictionary of metadata.
+
+The design of the Dataset is based by the
+`NetCDF <http://www.unidata.ucar.edu/software/netcdf/>`__ file format for
+self-described scientific data. This is a data model that has become very
+successful and widely used in the geosciences.
+
+The Dataset is an intelligent container. It allows for simultaneous integer
+or label based indexing of all of its variables, supports split-apply-combine
+operations with groupby, and can be converted to and from
+:py:class:`pandas.DataFrame` objects.
 
 Variable
 --------
 
-``Variable`` implements **xray's** basic extended array object. It supports the
-numpy ndarray interface, but is extended to support and use metadata. It
-consists of:
+:py:class:`~xray.Variable` implements xray's basic extended array object. It
+supports the numpy ndarray interface, but is extended to support and use
+basic metadata (not including coordinate values). It consists of:
 
 1. **dimensions**: A tuple of dimension names.
-2. **data**: The n-dimensional array (typically, of type ``numpy.ndarray``)
-storing the array's data. It must have the same number of dimensions as the
-length of the "dimensions" attribute.
+2. **data**: The N-dimensional array (for example, of type
+:py:class:`numpy.ndarray`) storing the array's data. It must have the same
+number of dimensions as the length of the "dimensions" attribute.
 3. **attributes**: An ordered dictionary of additional metadata to associate
 with this array.
 
-The main functional difference between Variables and numpy.ndarrays is that
+The main functional difference between Variables and numpy arrays is that
 numerical operations on Variables implement array broadcasting by dimension
 name. For example, adding an Variable with dimensions `('time',)` to another
 Variable with dimensions `('space',)` results in a new Variable with dimensions
 `('time', 'space')`. Furthermore, numpy reduce operations like ``mean`` or
 ``sum`` are overwritten to take a "dimension" argument instead of an "axis".
 
 Variables are light-weight objects used as the building block for datasets.
-However, usually manipulating data in the form of a DataArray should be
-preferred (see below), because they can use more complete metadata in the full
-of other dataset variables.
+**However, manipulating data in the form of a Dataset or DataArray should
+almost always be preferred** (see below), because they can use more complete
+metadata in context of coordinate labels.
 
 DataArray
 ---------
 
-``DataArray`` is a flexible hybrid of Dataset and Variable that attempts to
-provide the best of both in a single object. Under the covers, DataArrays
-are simply pointers to a dataset (the ``dataset`` attribute) and the name of a
-"focus variable" in the dataset (the ``focus`` attribute), which indicates to
-which variable array operations should be applied.
+A :py:class:`~xray.DataArray` object is a multi-dimensional array with labeled
+dimensions and coordinates. Coordinate labels give it additional power over the
+Variable object, so it should be preferred for all high-level use.
+
+Under the covers, DataArrays are simply pointers to a dataset (the ``dataset``
+attribute) and the name of a variable in the dataset (the ``name`` attribute),
+which indicates to which variable array operations should be applied.
 
 DataArray objects implement the broadcasting rules of Variable objects, but
 also use and maintain coordinates (aka "indices"). This means you can do
 intelligent (and fast!) label based indexing on DataArrays (via the
 ``.loc`` attribute), do flexibly split-apply-combine operations with
-``groupby`` and also easily export them to ``pandas.DataFrame`` or
-``pandas.Series`` objects.
+``groupby`` and convert them to or from :py:class:`pandas.Series` objects.
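
Reviewer note on the Variable section: the docs describe broadcasting by dimension name, where adding a `('time',)` array to a `('space',)` array yields dimensions `('time', 'space')`. The following is a minimal numpy sketch of that rule, not xray's implementation. The helper name `add_by_dimension_name` and its signature are hypothetical, and the choice to order result dimensions as "first operand's dimensions, then new ones from the second" is an assumption.

```python
import numpy as np

def add_by_dimension_name(dims_a, a, dims_b, b):
    """Add two arrays, aligning axes by dimension name instead of position."""
    # Result dimensions: those of `a`, followed by any that only `b` has.
    out_dims = tuple(dims_a) + tuple(d for d in dims_b if d not in dims_a)

    def expand(dims, arr):
        dims = list(dims)
        # Reorder the existing axes so they follow the order in out_dims.
        present = [d for d in out_dims if d in dims]
        arr = np.transpose(arr, [dims.index(d) for d in present])
        # Insert a length-1 axis for every dimension this array lacks,
        # so that ordinary positional numpy broadcasting can take over.
        index = tuple(slice(None) if d in dims else np.newaxis for d in out_dims)
        return arr[index]

    return out_dims, expand(dims_a, a) + expand(dims_b, b)

# Usage: the example from the text, with made-up values.
time_values = np.arange(3)          # dimensions ('time',)
space_values = np.array([10, 20])   # dimensions ('space',)
dims, total = add_by_dimension_name(("time",), time_values, ("space",), space_values)
```

Because axes are matched by name, two operands holding the same dimensions in different orders are transposed into agreement before the addition, which plain positional broadcasting would not do.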
