
Chunkwise image loader #279

Open

lucas-diedrich wants to merge 11 commits into main
Conversation

@lucas-diedrich commented Feb 17, 2025

Description

This PR addresses the challenge that the currently implemented and planned image loaders require loading imaging data entirely into memory, typically as NumPy arrays. Given the large size of microscopy datasets, this is not always feasible.

To mitigate this issue, and as discussed with @LucaMarconato, this PR aims to introduce a generalizable approach for reading large microscopy files in chunks, enabling efficient handling of data that does not fit into memory.

Some related discussions.

Strategy

In this PR, we focus on .tiff images, as implemented in the _tiff_to_chunks function.

  1. Get a lazy representation of the image via a suitable reader function (here: tifffile.memmap).
  2. Pre-define chunks that fit into memory, based on the dimensions of the image (_compute_chunks).
  3. Load the small chunks via a custom reader function and wrap them as dask.array blocks, which remain memory-mapped and avoid memory overflow (_read_chunks).
  4. Reassemble the chunks into a single dask.array (via dask.array.block).
  5. Parse the result to Image2DModel.
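The steps above can be sketched roughly as follows. The names here are illustrative, not the PR's actual helpers, and np.block stands in for the lazy dask.array.block assembly so the sketch runs eagerly:

```python
# Minimal NumPy sketch of steps 2-4. In the PR, each chunk is read lazily
# and joined with dask.array.block; np.block stands in here.
import numpy as np

def compute_chunks(shape, chunk_size):
    """Step 2: (y, x, height, width) specs tiling an image of `shape`."""
    ys = range(0, shape[0], chunk_size[0])
    xs = range(0, shape[1], chunk_size[1])
    return [
        (y, x, min(chunk_size[0], shape[0] - y), min(chunk_size[1], shape[1] - x))
        for y in ys for x in xs
    ]

def read_and_assemble(lazy_image, chunk_size):
    """Steps 3-4: read each chunk, then reassemble row by row."""
    rows = []
    for y in range(0, lazy_image.shape[0], chunk_size[0]):
        h = min(chunk_size[0], lazy_image.shape[0] - y)
        row = [
            lazy_image[y:y + h, x:x + min(chunk_size[1], lazy_image.shape[1] - x)]
            for x in range(0, lazy_image.shape[1], chunk_size[1])
        ]
        rows.append(row)
    return np.block(rows)

image = np.arange(35, dtype=np.uint8).reshape(5, 7)
assert len(compute_chunks(image.shape, (2, 3))) == 9  # 3 rows x 3 cols of chunks
assert (read_and_assemble(image, (2, 3)) == image).all()
```

Edge chunks are handled by the min(...) clamps, matching the "remainder of chunk size and dimensions" behavior described in the docstring.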

The strategy is implemented in

  • src/spatialdata_io/readers/generic.py and
  • src/spatialdata_io/readers/_utils/_image.py

Future extensions

The strategy can be implemented for any image type, as long as it is possible to

  1. obtain a lazy image-data loader, and
  2. define a custom reader function.

We have implemented similar readers for openslide-compatible whole slide images and the Carl-Zeiss microscopy format.
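As a sketch of what requirements (1) and (2) amount to in practice: the chunk reader is just a function mapping a chunk spec to a NumPy array. The read_region call mentioned in the comment is from the openslide-python API; the sliceable dummy handle keeps the example runnable without openslide installed (all names here are hypothetical, not spatialdata-io API):

```python
# Illustrative only: requirement (2) as a chunk-reading closure over a
# lazy handle (requirement 1). For OpenSlide the body would instead be
# np.asarray(handle.read_region((x, y), 0, (w, h))).
import numpy as np

def make_chunk_reader(handle):
    def read_chunk(y, x, h, w):
        return np.asarray(handle[y:y + h, x:x + w])
    return read_chunk

handle = np.arange(16).reshape(4, 4)   # dummy lazy handle
read_chunk = make_chunk_reader(handle)
assert (read_chunk(1, 1, 2, 2) == handle[1:3, 1:3]).all()
```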

@lucas-diedrich lucas-diedrich marked this pull request as draft February 17, 2025 17:03
@codecov-commenter commented Feb 17, 2025

Codecov Report

Attention: Patch coverage is 97.91667% with 1 line in your changes missing coverage. Please review.

Project coverage is 50.82%. Comparing base (296d9a5) to head (b7e5874).
Report is 135 commits behind head on main.

Files with missing lines Patch % Lines
src/spatialdata_io/readers/generic.py 96.29% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #279       +/-   ##
===========================================
+ Coverage   39.16%   50.82%   +11.66%     
===========================================
  Files          26       27        +1     
  Lines        2663     2713       +50     
===========================================
+ Hits         1043     1379      +336     
+ Misses       1620     1334      -286     
Files with missing lines Coverage Δ
src/spatialdata_io/readers/_utils/_image.py 100.00% <100.00%> (ø)
src/spatialdata_io/readers/generic.py 91.52% <96.29%> (+2.95%) ⬆️

... and 12 files with indirect coverage changes

@lucas-diedrich lucas-diedrich marked this pull request as ready for review March 21, 2025 15:44
Comment on lines +22 to +35
def _compute_chunks(
dimensions: tuple[int, int],
chunk_size: tuple[int, int],
min_coordinates: tuple[int, int] = (0, 0),
) -> NDArray[np.int_]:
"""Create all chunk specs for a given image and chunk size.

Creates specifications (x, y, width, height) with (x, y) being the upper left corner
of chunks of size chunk_size. Chunks at the edges correspond to the remainder of
chunk size and dimensions

Parameters
----------
dimensions : tuple[int, int]
@melonora commented Mar 24, 2025
Suggested change
def _compute_chunks(
dimensions: tuple[int, int],
chunk_size: tuple[int, int],
min_coordinates: tuple[int, int] = (0, 0),
) -> NDArray[np.int_]:
"""Create all chunk specs for a given image and chunk size.
Creates specifications (x, y, width, height) with (x, y) being the upper left corner
of chunks of size chunk_size. Chunks at the edges correspond to the remainder of
chunk size and dimensions
Parameters
----------
dimensions : tuple[int, int]
def _compute_chunks(
shape: tuple[int, int],
chunk_size: tuple[int, int],
min_coordinates: tuple[int, int] = (0, 0),
) -> NDArray[np.int_]:
"""Create all chunk specs for a given image and chunk size.
Creates specifications (x, y, width, height) with (x, y) being the upper left corner
of chunks of size chunk_size. Chunks at the edges correspond to the remainder of
chunk size and dimensions
Parameters
----------
shape : tuple[int, int]

Just to stick to standard numpy / array api conventions:) Dimensions could be interpreted as TCZYX.

Comment on lines +13 to +17
positions = np.arange(min_coord, min_coord + size, chunk)
lengths = np.full_like(positions, chunk, dtype=int)

if positions[-1] + chunk > size + min_coord:
lengths[-1] = size + min_coord - positions[-1]

Suggested change
positions = np.arange(min_coord, min_coord + size, chunk)
lengths = np.full_like(positions, chunk, dtype=int)
if positions[-1] + chunk > size + min_coord:
lengths[-1] = size + min_coord - positions[-1]
positions = np.arange(min_coord, min_coord + size, chunk)
lengths = np.minimum(chunk, min_coord + size - positions)

I think this is the equivalent two-liner :) but just a bit nitpicky.

@melonora left a comment
Thanks for your contribution! I have two minor suggestions. I also saw that you use the width-by-height convention. Personally, I don't have a strong opinion here, though we could also stick to array API conventions. @LucaMarconato WDYT? Pre-approving for now.

list[list[DaskArray]]
"""
# Lazy file reader
slide = tiffmmemap(input)
tifffile.memmap might not always work, for example with compression or tiling, so I would add a try/except clause.
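A minimal sketch of such a fallback. The reader arguments here are hypothetical stand-ins; in the PR they would be tifffile.memmap and a slower reader that handles compressed or tiled TIFFs (tifffile.memmap raises ValueError when the data are not memory-mappable):

```python
# Hedged sketch of the suggested try/except fallback. `memmap_reader` and
# `fallback_reader` are hypothetical stand-ins, not spatialdata-io API.
def open_lazy(path, memmap_reader, fallback_reader):
    try:
        # Fast path: zero-copy memory map (works for uncompressed TIFFs).
        return memmap_reader(path)
    except ValueError:
        # tifffile.memmap raises ValueError for non-memory-mappable data
        # (e.g. compressed TIFFs); fall back to a slower, chunk-capable reader.
        return fallback_reader(path)

# Stand-ins to demonstrate the control flow:
def failing_memmap(path):
    raise ValueError("image data are not memory-mappable")

def chunk_reader(path):
    return f"lazy-handle-for-{path}"

assert open_lazy("image.tif", failing_memmap, chunk_reader) == "lazy-handle-for-image.tif"
```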

@melonora left a comment
Sorry, I had to change my review after rethinking memmap. It does not always work, for example when dealing with compressed TIFFs, as far as I am aware.

3 participants