Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add VLenNDArray #200

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

sofroniewn
Copy link

@sofroniewn sofroniewn commented Sep 6, 2019

This PR closes #199 by adding support for ragged nD arrays, where each array can be a different shape and dimensionality. It does this using the scheme described in #199.

It's usage is as follows:

import numcodecs
import numpy as np
x = np.array([[1, 3, 5], [[4, 3], [2, 1]], [[7, 9]]], dtype='object')
codec = numcodecs.VLenNDArray('<i4')
codec.decode(codec.encode(x))
array([array([1, 3, 5], dtype=int32), array([[4, 3], [2, 1]], dtype=int32),
    array([[7, 9]]], dtype=int32)], dtype=object)

I have not added tests yet, but will do so. I will adapt the tests from test_vlen_array.py.

Any comments on the implementation are appreciated. I'm pretty new to this code base, so may have made some wrong choices.

Oh and I also seem to have a bunch of .c files the came when I ran cythonize -a -i ./numcodecs/vlen_nd.pyx that I may or may not have wanted to change, any advice around those would be appreciated too.

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py37 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@sofroniewn
Copy link
Author

I have now added tests in test_vlen_ndarray.py based on the tests for test_vlen_array.py

@sofroniewn
Copy link
Author

hmm - tests are failing on from numcodecs.vlen_nd import VLenNDArray, but that works fine locally. I'm not quite sure what's going on there.

@sofroniewn
Copy link
Author

sofroniewn commented Sep 7, 2019

Most tests pass now - I had to add the fixtures folder, fix the cython metadata in the .c file (which had included some paths on my computer) and I had to add a setup.py extension.

There's still one doc string test failing for py3.7 because of where the new-line gets split.

@sofroniewn
Copy link
Author

I also want to note that when I try and run pytest -v numcodecs locally I get the following error messages which I think pertain to parts of the codebase I'm not trying to interact with and were likely due to problems with my installation - which followed the procedure described in your contributing guide (inculding running pip install -r requirements_dev.txt and python setup.py build_ext --inplace, but without the virtual environment)

____________________________________________________________________________ ERROR collecting numcodecs/tests/test_blosc.py ____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_blosc.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_blosc.py:12: in <module>
    from numcodecs import blosc
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so, 2): Symbol not found: _blosc_cbuffer_complib
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so
_____________________________________________________________________________ ERROR collecting numcodecs/tests/test_lz4.py _____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_lz4.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_lz4.py:9: in <module>
    from numcodecs.lz4 import LZ4
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so, 2): Symbol not found: _LZ4_compressBound
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so
____________________________________________________________________________ ERROR collecting numcodecs/tests/test_zstd.py _____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_zstd.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_zstd.py:9: in <module>
    from numcodecs.zstd import Zstd
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so, 2): Symbol not found: _ZSTD_compress
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so

@NumesSanguis
Copy link

@sofroniewn Thank you for your effort in trying to support ragged nD arrays. Just want to ask if there is any update on this feature?

@sofroniewn
Copy link
Author

No update - maybe I can ping @jakirkham and @ryan-williams to take another look / help me get the tests passing / resolve conflicts - i'll note that I think the conflicts / test failures come from the failures of my dev environment not the code I'm trying to add

@alimanfoo
Copy link
Member

Hi @sofroniewn, just to apologise for not looking at this sooner.

Re the conflicting .c files, those will just be due to changes in non-essential information that gets output by cython and which depends on which computer the C files where generated on. I'd suggest to just remove any changes to those C files from this PR.

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sofroniewn, apologies again for slow review.

Implementation looks fine to me. Only question is whether to use a single byte for the number of dimensions rather than 4 byte int, could save a bit of space.

A couple of small comments on docstrings.

Also would need some API docs.


def check_out_param(out, n_items):
if not isinstance(out, np.ndarray):
raise TypeError('out must be 1-dimensional array')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise TypeError('out must be 1-dimensional array')
raise TypeError('out must be a numpy array')



class VLenNDArray(Codec):
"""Encode variable-length n-dimensional arrays via UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Encode variable-length n-dimensional arrays via UTF-8.
"""Encode an array of variable-length n-dimensional arrays.

Comment on lines +134 to +137
data_length += l + 4 * (n + 2) # 4 bytes to store number of
# dimensions, 4 bytes per
# dimension to store dimension
# and 4 bytes to store the length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a single byte to store the number of dimensions?

@alimanfoo
Copy link
Member

I also want to note that when I try and run pytest -v numcodecs locally I get the following error messages which I think pertain to parts of the codebase I'm not trying to interact with and were likely due to problems with my installation

These error messages are a bit odd, they suggest some problem with how the other extension modules were compiled. Afraid I haven't got anything very intelligent to suggest, other than cleaning out all the .so and .c files and trying a full build from scratch.

@ericpre
Copy link

ericpre commented Mar 16, 2022

@sofroniewn, by any chance, would you still have interest in finishing this PR? :)

@martindurant
Copy link
Member

This idea could probably be superseded by the proposal for an awkward-zarr project for GSoC.

@sofroniewn
Copy link
Author

So sorry both for dropping the ball on this - if there are already plans for this to be superseded please press on with those, or if you have a contributor who wants to take this over PR please take it over. Thanks!!

@NumesSanguis
Copy link

@martindurant Do you have a link to that GSoC proposal? I could only find this open issue, but not sure if that's related?:
zarr-developers/community#42

@sanketverma1704
Copy link
Member

@NumesSanguis here's the ideas-list.md and here's the Awkward Array project details.

@dstansby dstansby added the New codec Suggestion for a new codec label Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
New codec Suggestion for a new codec
Projects
None yet
Development

Successfully merging this pull request may close these issues.

can VLenArray support 2D arrays
7 participants