
A memory efficient implementation of the .mtx reading function #3389

Open
wants to merge 3 commits into main
Conversation


@gjeuken gjeuken commented Nov 28, 2024

  • Closes #
  • Tests included or not required because: test_datasets.py already implemented
  • Release notes not necessary because: This is a backend change

The pandas read_csv function is very memory intensive, which makes loading large datasets (especially from the EBI Single Cell Expression Atlas) impossible on computers with 16 GB of RAM or less. The subsequent analysis of such datasets with scanpy, however, works well on those computers.

Loading the data in chunks, using the same pandas function, solves this problem.
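The chunked approach can be sketched roughly like this (the function name, file layout, and chunk size here are illustrative, not the PR's actual code; the column order follows the snippet quoted later in this review):

```python
import pandas as pd
from scipy import sparse


def read_mtx_chunked(path, m, n, *, skiprows=0, chunksize=100_000):
    """Read a 1-based coordinate file into an m x n CSR matrix.

    Instead of materializing the whole table at once, pandas yields
    fixed-size chunks, so peak memory stays bounded by the chunk size.
    """
    mtx = sparse.csr_matrix((m, n))  # start from an empty matrix
    chunks = pd.read_csv(
        path, sep=" ", header=None, comment="%",
        skiprows=skiprows, chunksize=chunksize,
    )
    for data in chunks:
        # Column layout follows the PR's snippet: data[1] is the row
        # index, data[0] the column index, data[2] the value (1-based).
        mtx += sparse.csr_matrix(
            (data[2], (data[1] - 1, data[0] - 1)), shape=(m, n)
        )
    return mtx
```

Each chunk becomes a small sparse matrix that is added into the accumulator, so only one chunk's worth of dense DataFrame ever exists at a time.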


codecov bot commented Nov 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.45%. Comparing base (9741ca6) to head (792b5e2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3389   +/-   ##
=======================================
  Coverage   75.44%   75.45%           
=======================================
  Files         113      113           
  Lines       13250    13253    +3     
=======================================
+ Hits         9997    10000    +3     
  Misses       3253     3253           
Files with missing lines                        Coverage Δ
src/scanpy/datasets/_ebi_expression_atlas.py    94.18% <100.00%> (+0.21%) ⬆️

Member

@flying-sheep flying-sheep left a comment


Good idea! Some small notes:

)
mtx = sparse.csr_matrix((data[2], (data[1] - 1, data[0] - 1)), shape=(m, n))
mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))

Suggested change
mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))
mtx = sparse.csr_matrix((m, n), dtype=np.float64)
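For context, a quick comparison of the two constructions (the shapes here are illustrative): the one-element trick goes through a COO matrix holding an explicit stored zero, while the suggested form creates a genuinely empty matrix with the dtype fixed up front.

```python
import numpy as np
from scipy import sparse

m, n = 4, 3

# Original: builds a COO matrix with a single explicit zero at (0, 0),
# then converts it to CSR; the zero typically remains a stored entry.
a = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))

# Suggested: an empty m x n matrix with no stored entries and an
# explicit floating-point dtype, ready to accumulate chunks into.
b = sparse.csr_matrix((m, n), dtype=np.float64)
```

Both are numerically all-zero matrices of the same shape; the second form just says so directly.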

Comment on lines +78 to +82
for data in chunks:
mtx_chunk = sparse.csr_matrix(
(data[2], (data[1] - 1, data[0] - 1)), shape=(m, n)
)
mtx = mtx + mtx_chunk
Member

@flying-sheep flying-sheep Jan 27, 2025


This is probably slightly slower than necessary, since we know the chunks don't overlap, yet + needs to deal with actually summing things up. But I imagine it could also be pretty well optimized, so if the following is not faster, please just add a comment explaining that + is well-enough optimized.

The way csr_matrix((data, (i, j)), [shape]) works is that it first creates a coo_matrix, then converts it to csr.
I think the best way is probably:

  1. build up data, i, j arrays in a loop
  2. create a csr_matrix from the final arrays as the last step
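The two steps above could look roughly like this (function name and parameters are illustrative): collect the triplet arrays across chunks, then pay the COO-to-CSR conversion cost once at the end instead of doing a sparse addition per chunk.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def read_mtx_concat(path, m, n, *, skiprows=0, chunksize=100_000):
    """Accumulate (value, row, col) arrays per chunk; build one CSR at the end."""
    vals, rows, cols = [], [], []
    chunks = pd.read_csv(
        path, sep=" ", header=None, comment="%",
        skiprows=skiprows, chunksize=chunksize,
    )
    for data in chunks:
        vals.append(data[2].to_numpy())
        rows.append(data[1].to_numpy() - 1)  # MTX indices are 1-based
        cols.append(data[0].to_numpy() - 1)
    # Single COO -> CSR conversion over the concatenated arrays.
    return sparse.csr_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(m, n),
    )
```

The index arrays are small compared to a dense DataFrame of the whole file, so this keeps the memory benefit while avoiding repeated matrix additions.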
