
A memory efficient implementation of the .mtx reading function #3389

Open
wants to merge 3 commits into main
Conversation


@gjeuken gjeuken commented Nov 28, 2024

  • Closes #
  • Tests included or not required because: test_datasets.py already implemented
  • Release notes not necessary because: This is a backend change

The pandas read_csv function is very memory intensive, which makes loading large datasets (especially from the EBI Single Cell Expression Atlas) impossible on computers with 16 GB of RAM or less. The subsequent analysis of such datasets with scanpy, however, works well on those computers.

Loading the data in chunks, using the same pandas function, solves this problem.
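The chunked approach can be sketched roughly like this (the function name, file layout, and chunk size here are illustrative, not the PR's actual code; the column order follows the snippet quoted later in this review):

```python
import pandas as pd
from scipy import sparse


def read_mtx_chunked(path, m, n, *, skiprows=0, chunksize=100_000):
    """Read a 1-based coordinate file into an m x n CSR matrix.

    Instead of materializing the whole table at once, pandas yields
    fixed-size chunks, so peak memory stays bounded by the chunk size.
    """
    mtx = sparse.csr_matrix((m, n))  # start from an empty matrix
    chunks = pd.read_csv(
        path, sep=" ", header=None, comment="%",
        skiprows=skiprows, chunksize=chunksize,
    )
    for data in chunks:
        # Column layout follows the PR's snippet: data[1] is the row
        # index, data[0] the column index, data[2] the value (1-based).
        mtx += sparse.csr_matrix(
            (data[2], (data[1] - 1, data[0] - 1)), shape=(m, n)
        )
    return mtx
```

Each chunk becomes a small sparse matrix that is added into the accumulator, so only one chunk's worth of dense DataFrame ever exists at a time.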


codecov bot commented Nov 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.45%. Comparing base (9741ca6) to head (792b5e2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3389   +/-   ##
=======================================
  Coverage   75.44%   75.45%           
=======================================
  Files         113      113           
  Lines       13250    13253    +3     
=======================================
+ Hits         9997    10000    +3     
  Misses       3253     3253           
Files with missing lines                        Coverage Δ
src/scanpy/datasets/_ebi_expression_atlas.py    94.18% <100.00%> (+0.21%) ⬆️

Member

@flying-sheep flying-sheep left a comment


Good idea! Some small notes:

)
mtx = sparse.csr_matrix((data[2], (data[1] - 1, data[0] - 1)), shape=(m, n))
mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))

Suggested change
mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))
mtx = sparse.csr_matrix((m, n), dtype=np.float64)
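For context, a quick comparison of the two constructions (the shapes here are illustrative): the one-element trick goes through a COO matrix holding an explicit stored zero, while the suggested form creates a genuinely empty matrix with the dtype fixed up front.

```python
import numpy as np
from scipy import sparse

m, n = 4, 3

# Original: builds a COO matrix with a single explicit zero at (0, 0),
# then converts it to CSR; the zero typically remains a stored entry.
a = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))

# Suggested: an empty m x n matrix with no stored entries and an
# explicit floating-point dtype, ready to accumulate chunks into.
b = sparse.csr_matrix((m, n), dtype=np.float64)
```

Both are numerically all-zero matrices of the same shape; the second form just says so directly.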

Comment on lines +78 to +82
for data in chunks:
mtx_chunk = sparse.csr_matrix(
(data[2], (data[1] - 1, data[0] - 1)), shape=(m, n)
)
mtx = mtx + mtx_chunk
Member

@flying-sheep flying-sheep Jan 27, 2025


This is probably slightly slower than necessary, since we know the chunks don't overlap, yet + needs to deal with actually summing things up. But I imagine it could also be pretty well optimized, so if the following is not faster, please just add a comment explaining that + is well-enough optimized.

The way csr_matrix((data, (i, j)), [shape]) works is that it first creates a coo_matrix, then converts it to csr.
I think the best way is probably:

  1. build up data, i, j arrays in a loop
  2. create a csr_matrix from the final arrays as the last step
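The two steps above could look roughly like this (function name and parameters are illustrative): collect the triplet arrays across chunks, then pay the COO-to-CSR conversion cost once at the end instead of doing a sparse addition per chunk.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def read_mtx_concat(path, m, n, *, skiprows=0, chunksize=100_000):
    """Accumulate (value, row, col) arrays per chunk; build one CSR at the end."""
    vals, rows, cols = [], [], []
    chunks = pd.read_csv(
        path, sep=" ", header=None, comment="%",
        skiprows=skiprows, chunksize=chunksize,
    )
    for data in chunks:
        vals.append(data[2].to_numpy())
        rows.append(data[1].to_numpy() - 1)  # MTX indices are 1-based
        cols.append(data[0].to_numpy() - 1)
    # Single COO -> CSR conversion over the concatenated arrays.
    return sparse.csr_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(m, n),
    )
```

The index arrays are small compared to a dense DataFrame of the whole file, so this keeps the memory benefit while avoiding repeated matrix additions.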
