A memory-efficient implementation of the .mtx reading function #3389
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## main #3389 +/- ##
=======================================
Coverage 75.44% 75.45%
=======================================
Files 113 113
Lines 13250 13253 +3
=======================================
+ Hits 9997 10000 +3
Misses 3253 3253
for more information, see https://pre-commit.ci
Good idea! Some small notes:
)
mtx = sparse.csr_matrix((data[2], (data[1] - 1, data[0] - 1)), shape=(m, n))
mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))
Suggested change:
- mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))
+ mtx = sparse.csr_matrix((m, n), dtype=np.float64)
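A minimal sketch of what the suggested change relies on (the `m, n` values here are illustrative stand-ins for the dimensions parsed from the .mtx header): passing a shape tuple to `csr_matrix` builds an empty all-zero matrix without allocating any stored entries, which is cheaper than building one from a dummy `([0], ([0], [0]))` triplet.

```python
import numpy as np
from scipy import sparse

# Illustrative dimensions, standing in for the values read from the .mtx header.
m, n = 3, 4

# A shape tuple (plus dtype) yields an empty all-zero CSR matrix
# with no stored entries at all.
mtx = sparse.csr_matrix((m, n), dtype=np.float64)
```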
for data in chunks:
    mtx_chunk = sparse.csr_matrix(
        (data[2], (data[1] - 1, data[0] - 1)), shape=(m, n)
    )
    mtx = mtx + mtx_chunk
This is probably slightly slower than necessary, since we know the chunks don't overlap and `+` needs to deal with actually summing things up. But I imagine it could also be pretty well optimized, so if the following is not faster, please just add a comment instead explaining that `+` is well-enough optimized.
The way `csr_matrix((data, (i, j)), [shape])` works is that it first creates a `coo_matrix`, then converts it to CSR.
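A small demonstration of that COO-based constructor (my own example, not from the PR): because the `(data, (i, j))` form goes through `coo_matrix`, duplicate coordinates are summed when converting to CSR, which is exactly the semantics the chunk-wise `+` relies on.

```python
import numpy as np
from scipy import sparse

# The (data, (i, j)) constructor builds a coo_matrix first; duplicate
# (i, j) coordinates are therefore summed during the conversion to CSR.
mtx = sparse.csr_matrix(
    (np.array([1.0, 2.0]), (np.array([0, 0]), np.array([0, 0]))),
    shape=(2, 2),
)
```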
I think the best way is probably:
- build up `data`, `i`, `j` arrays in a loop
- create a `csr_matrix` from the final arrays as the last step
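The two steps above might look roughly like this (a sketch, not the PR's code; the function name and the assumption that each chunk indexes as `data[0]` = column, `data[1]` = row, `data[2]` = values, mirroring the diff, are mine):

```python
import numpy as np
from scipy import sparse

def build_csr_from_chunks(chunks, m, n):
    """Accumulate COO triplets from all chunks, then build the CSR matrix once.

    Each chunk is assumed (mirroring the diff) to hold 1-based MatrixMarket
    triplets: data[0] = column indices, data[1] = row indices, data[2] = values.
    """
    rows, cols, vals = [], [], []
    for data in chunks:
        cols.append(np.asarray(data[0]) - 1)  # MatrixMarket is 1-based
        rows.append(np.asarray(data[1]) - 1)
        vals.append(np.asarray(data[2], dtype=np.float64))
    # Single construction at the end: one COO -> CSR conversion in total,
    # instead of one per chunk plus repeated sparse additions.
    return sparse.csr_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(m, n),
    )

# Tiny demo: two chunks of (col, row, value) triplets for a 2x2 matrix.
chunks = [
    (np.array([1]), np.array([1]), np.array([5.0])),
    (np.array([2, 1]), np.array([2, 2]), np.array([7.0, 3.0])),
]
mtx = build_csr_from_chunks(chunks, 2, 2)
```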
Pandas' `read_csv` function is very memory-intensive, and this makes loading data (especially large datasets from the EBI Single Cell Expression Atlas) impossible on computers with 16 GB of RAM or less. The subsequent analysis of such datasets with scanpy, however, works well on such computers. Loading the data in chunks, using the same pandas function, solves this problem.
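A self-contained sketch of the chunked-loading idea (assuming, as in the diff, a MatrixMarket-style coordinate body with columns `col row value`; the inline sample data and chunk size are illustrative, and a real reader would first parse the header for the dimensions):

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse

# Illustrative coordinate body: "col row value" per line, 1-based indices.
text = "1 1 5.0\n2 2 7.0\n1 2 3.0\n"
m, n = 2, 2  # would normally come from the .mtx size line

rows, cols, vals = [], [], []
# chunksize makes read_csv yield DataFrames of at most that many rows,
# bounding peak memory instead of materializing the whole file at once.
for chunk in pd.read_csv(io.StringIO(text), sep=" ", header=None, chunksize=2):
    cols.append(chunk[0].to_numpy() - 1)  # 1-based -> 0-based
    rows.append(chunk[1].to_numpy() - 1)
    vals.append(chunk[2].to_numpy(dtype=np.float64))

mtx = sparse.csr_matrix(
    (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
    shape=(m, n),
)
```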