Skip to content

Conversation

@bpiwowar
Copy link
Contributor

With my current cluster, disk access can be painfully slow. What I propose here is to add an argument to doc_stores() with a generic option (that only has a file_access field for now, but just leaving it like this for the future) that allows to control how the content (LZ4 pickled for now) is accessed:

  • FILE: as before
  • MMAP: with a mmap-ed file
  • MEMORY: the content is in memory

I just did the modifications for msmarco-passages so that this works:

import ir_datasets
from ir_datasets.indices import DocstoreOptions, FileAccess

options = DocstoreOptions(file_access=FileAccess.MEMORY)
dataset = ir_datasets.load("msmarco-passage/train")
docstore = dataset.docs_store(options=options)

Before continuing, I wanted to check if you would like such a PR, and if any modifications on the current way I designed it is needed (since it involves modifying all the docs_store() methods.

@seanmacavaney
Copy link
Collaborator

This looks great and should be super helpful!

I wonder if it'd be good to have the default file mode loaded from an environment variable to make it a bit easier?

@bpiwowar
Copy link
Contributor Author

As you wish but I would prefer avoiding environment variables since for some datasets MEMORY is OK, but not for others. The environment variable could control the default.

I had a 5x speedup on a big cluster even with the MMAP strategy.

So if OK, I proceed on updating all the docs_store() methods (I will only implement the strategy for LZ4-based doc stores, leaving a warning for the others)

@bpiwowar
Copy link
Contributor Author

And OK for having options like this (or you prefer a more direct file_access kwargs?)

@bpiwowar
Copy link
Contributor Author

OK, just implemented the change for all the docs_store(...) methods – this works with derived collections and LZ4-based ones; for the other, the option is just ignored (with warnings).

Could you run the full test (I don't have all the collections installed) so as to check if anything's wrong?

@seanmacavaney
Copy link
Collaborator

Amazing, thanks a bunch! Still running through the tests, but these ones with PrefixedDocstore popped up:

FAILED test/util/docs/test_multiple.py::test_multiple_prefixes - TypeError: PrefixedDocstore.__init__() got an unexpected keyword argument 'options'
FAILED test/util/docs/test_multiple.py::test_multiple_prefixes_inlined - TypeError: PrefixedDocstore.__init__() got an unexpected keyword argument 'options'

@bpiwowar
Copy link
Contributor Author

Amazing, thanks a bunch! Still running through the tests, but these ones with PrefixedDocstore popped up:

FAILED test/util/docs/test_multiple.py::test_multiple_prefixes - TypeError: PrefixedDocstore.__init__() got an unexpected keyword argument 'options'
FAILED test/util/docs/test_multiple.py::test_multiple_prefixes_inlined - TypeError: PrefixedDocstore.__init__() got an unexpected keyword argument 'options'

OK, this should be fixed

@seanmacavaney
Copy link
Collaborator

Thanks! I got the following test failures that look related to this change:

FAILED test/indices/lz4_pickle.py::TestLz4PickleLookup::test_lz4_pickle_lookup - IndexError: list index out of range
FAILED test/integration/aol_ia.py::TestAolIa::test_docs - TypeError: AolManager._internal_docs_store() missing 1 required positional argument: 'options'

@bpiwowar
Copy link
Contributor Author

bpiwowar commented Sep 4, 2025

For

  • test/indices/lz4_pickle.py: the test fails on the main branch
  • test/integration/aol_ia.py: it should be fixed

@bpiwowar
Copy link
Contributor Author

Gentle ping here – the tests should be fixed (lz4_pickle is not, but it fails on the main branch too)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants