Test ManifestStore against S3 and fix key parsing #507

Merged
merged 16 commits into from
Mar 27, 2025

Conversation

maxrjones
Member

@maxrjones maxrjones commented Mar 25, 2025

This PR tests the ManifestStore against S3 using MinIO and fixes a small issue with translating ManifestArray keys to the appropriate key for obstore.get().

  • Closes Test against S3 using minio #97
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@maxrjones maxrjones added CI Continuous Integration testing labels Mar 25, 2025
Member

@TomNicholas TomNicholas left a comment

This is cool, but I'm not totally sure I understand what you're intending to test here, or how it relates to the ManifestStore work. Surely dealing with remote object storage is the responsibility of obstore and icechunk?

@maxrjones
Member Author

This is cool, but I'm not totally sure I understand what you're intending to test here, or how it relates to the ManifestStore work. Surely dealing with remote object storage is the responsibility of obstore and icechunk?

I would like to test that ManifestStore.get() and ManifestStore.list_prefix() return the same expected results for files on both local and object storage. Since ManifestStore requires some URL manipulation, it is possible that we make assumptions valid for local filesystems but not object storage as mentioned as a concern in #490 (comment).

I built this as a separate PR under the assumption that PR scope should be as constrained as possible and that there were some Icechunk tests that would benefit from MinIO, but I didn't pick a good test to start from. I'll modify this PR to test

@requires_network
@requires_s3fs
class TestReadFromS3:
    @pytest.mark.parametrize(
        "indexes",
        [
            None,
            pytest.param({}, marks=pytest.mark.xfail(reason="not implemented")),
        ],
        ids=["None index", "empty dict index"],
    )
    @parametrize_over_hdf_backends
    def test_anon_read_s3(self, indexes, hdf_backend):
        """Parameterized tests for empty vs supplied indexes and filetypes."""
        # TODO: Switch away from this s3 url after minIO is implemented.
        fpath = "s3://carbonplan-share/virtualizarr/local.nc"
        with open_virtual_dataset(
            fpath,
            indexes=indexes,
            reader_options={"storage_options": {"anon": True}},
            backend=hdf_backend,
        ) as vds:
            assert vds.dims == {"time": 2920, "lat": 25, "lon": 53}
            assert isinstance(vds["air"].data, ManifestArray)
            for name in ["time", "lat", "lon"]:
                assert isinstance(vds[name].data, np.ndarray)
instead, which explicitly tests S3 compatibility.

@TomNicholas
Member

TomNicholas commented Mar 25, 2025

I would like to test that ManifestStore.get() and ManifestStore.list_prefix() return the same expected results for files on both local and object storage.

Thanks for the explanation, that makes a lot more sense.

I just want to avoid the rabbit hole of adding tests to VirtualiZarr that are really just testing that Icechunk does what it says it does.

I built this as a separate PR under the assumption that PR scope should be as constrained as possible

FWIW I would be happy to merge your ManifestStore as it is and iterate on testing that.

@maxrjones
Member Author

FWIW I would be happy to merge your ManifestStore as it is and iterate on testing that.

Oh wow, that'd make this so much easier!

@maxrjones maxrjones changed the title Setup minio for testing S3 stores WIP: Setup minio for testing S3 stores Mar 25, 2025
@maxrjones maxrjones marked this pull request as draft March 25, 2025 21:37
@maxrjones maxrjones changed the title WIP: Setup minio for testing S3 stores WIP: Setup MinIO for testing S3 stores Mar 26, 2025
@maxrjones maxrjones changed the title WIP: Setup MinIO for testing S3 stores Test ManifestStore against S3 and fix key parsing Mar 26, 2025
@maxrjones maxrjones added the bug Something isn't working label Mar 26, 2025
Comment on lines +164 to +165
parsed_key = urlparse(request_key)
return StoreRequest(store=stores[key], key=parsed_key.path)
Member Author

This is the only non-test-related change in the PR. It makes the ManifestStore work for both local and object stores.
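For readers unfamiliar with `urlparse`, here is a minimal sketch of how the standard library splits a chunk URL into the part that identifies a store and the part left over for the key. The helper name `split_url` and the example URLs are hypothetical and only illustrate the idea; they are not the code in this PR.

```python
from urllib.parse import urlparse


def split_url(url: str) -> tuple[str, str]:
    """Split a URL into a store identifier and the remaining path.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    # scheme + netloc identify the store (e.g. the bucket for s3:// URLs)
    store_id = f"{parsed.scheme}://{parsed.netloc}"
    # path is what remains once the store-identifying prefix is dropped
    return store_id, parsed.path


# Placeholder URLs: one object-store style, one local-file style.
s3_store_id, s3_path = split_url("s3://my-bucket/data.tmp")
local_store_id, local_path = split_url("file:///tmp/data.tmp")
```

For the `s3://` URL the bucket name lands in `netloc`, so only the object path remains in `path`. For the `file://` URL `netloc` is empty and `path` holds the full filesystem path.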

@maxrjones maxrjones marked this pull request as ready for review March 26, 2025 21:05
Member

@TomNicholas TomNicholas left a comment

Nice! This does seem useful.



@pytest.fixture()
@requires_obstore
Member

Why does this test not require obstore?

Member Author

These markers have no effect on fixtures, so I removed it. The fixture is created whenever pytest collects a test that uses it, so the markers should all go on tests rather than fixtures.

client_options={"allow_http": True},
)
filepath = "data.tmp"
obs.put(
Member

You only did one put, but now there are 4 chunks in there?

Member Author

I added a docstring explaining the approach in 874c6df (#507). This is kinda similar to libraries that concurrently compress each chunk (or skip compression for uncompressed data), concatenate the chunks, and then write them to a file in one go. At least, I think that's a thing.
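A rough sketch of that idea, assuming four 4-byte chunks: the chunks are concatenated into one payload that a single put can store, and each chunk is then addressed by a byte range. The chunk keys, the values, and the path below are placeholders, not the fixture from this PR. Only the `path`/`offset`/`length` entry layout mirrors the snippet under review.

```python
import struct

# Four hypothetical chunks, each a 4-byte little-endian int32.
chunks = [struct.pack("<i", value) for value in (1, 2, 3, 4)]

# Concatenate them so one write (a single put) stores every chunk.
payload = b"".join(chunks)

# Placeholder path and chunk keys, chosen only for illustration.
path = "s3://my-bucket/data.tmp"
keys = ["0.0", "0.1", "1.0", "1.1"]

# Record where each chunk lives inside the single object.
chunk_dict = {}
offset = 0
for key, chunk in zip(keys, chunks):
    chunk_dict[key] = {"path": path, "offset": offset, "length": len(chunk)}
    offset += len(chunk)

# Reading a chunk back is then just a byte-range slice of the payload.
entry = chunk_dict["1.1"]
last_chunk = payload[entry["offset"] : entry["offset"] + entry["length"]]
```

With this layout the last of the four entries has offset 12 and length 4, which matches the values visible in the reviewed fixture.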

@pytest.mark.asyncio
@pytest.mark.parametrize(
"manifest_store",
["local_store", pytest.param("s3_store", marks=pytest.mark.minio)],
Member

Should the mark be requires_minio?

Member Author

They're synonymous, but I will re-add requires_minio since it's more descriptive.

"path": f"s3://{minio_bucket['bucket']}/{filepath}",
"offset": 12,
"length": 4,
},
}
manifest = ChunkManifest(entries=chunk_dict)
Member

Is there a way to split this fixture up a bit, to more clearly delineate the creation of data in a (minio) s3-like obstore from the creation of the ManifestStore object which contains the byte references pointing to it?

Member Author

good call, done in 874c6df (#507)

@maxrjones maxrjones merged commit fd2ccf3 into zarr-developers:develop Mar 27, 2025
11 checks passed
@maxrjones maxrjones deleted the minio branch March 27, 2025 20:00
Labels
bug Something isn't working CI Continuous Integration testing
Development

Successfully merging this pull request may close these issues.

Test against S3 using minio
2 participants