Make Lucene90 postings format to write FST off heap #12985
Conversation
@mikemccand I'm wondering whether there is already a benchmark that can show the RAM saved by this change.
assert firstBlock.isFloor || newBlocks.size() == 1;

firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);
boolean isRootBlock = prefixLength == 0 && count == pending.size();
I can make this a parameter to be clearer, instead of inferring it from prefixLength and count.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
This change makes only that final FST construction run off-heap, which is a good baby step. But what if in my index all terms start with say We might instead just switch to off-heap building once the expected FST size crosses a threshold? We can use |
Heh, this made me remember the awesome character, Bôh, from the incredible movie Spirited Away:
I think this is a good idea. How should we choose a reasonable threshold? Maybe it could be a parameter? (I was afraid that introducing another parameter would also increase the configuration complexity of the system.) One trade-off is that building off-heap could slow down indexing: apart from the root node, we need to traverse and iterate through the whole FST, and off-heap traversal might be slower than on-heap traversal (I think we saw a 17% increase with the Synonym off-heap reading in #13054). The root node doesn't need to be traversed, and we need to save it to an IndexOutput anyway, so building it off-heap actually saves time: there's no need to construct the on-heap FST.
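To make the threshold idea concrete, here is a minimal sketch in plain Java. It is not the PR's code and does not use Lucene's API: the class name `FstSinkChooser`, the method names, and the 1 MiB default are all hypothetical, and a temp file stands in for an off-heap output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: choose an on-heap or off-heap sink from an estimated FST size. */
final class FstSinkChooser {
  // Assumed default for illustration only; the thread leaves the real value open.
  static final long DEFAULT_OFF_HEAP_THRESHOLD_BYTES = 1L << 20;

  /** Returns true once the estimated size reaches the threshold. */
  static boolean shouldBuildOffHeap(long estimatedBytes, long thresholdBytes) {
    return estimatedBytes >= thresholdBytes;
  }

  /** Small structures stay in a heap buffer; large ones go to a temp file in tempDir. */
  static OutputStream openSink(long estimatedBytes, long thresholdBytes, Path tempDir)
      throws IOException {
    if (shouldBuildOffHeap(estimatedBytes, thresholdBytes)) {
      Path tmp = Files.createTempFile(tempDir, "fst", ".tmp");
      tmp.toFile().deleteOnExit(); // best-effort cleanup for this sketch
      return Files.newOutputStream(tmp);
    }
    return new ByteArrayOutputStream();
  }
}
```

Whether the threshold is a fixed constant or a configurable parameter is exactly the open question above; the sketch takes it as an argument so either choice would fit.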
@mikemccand I tried to use
It seems the index checks every file created under the index directory and disallows any tmp file. I think we could delete the tmp file after use, but there doesn't seem to be a way to delete an IndexInput (maybe because of the abstraction, since not every IndexInput is actually backed by a file?).
If there is a way to create temp file outside of the search index, then it would work too, but I can't find it as all I/O are accessible from |
Hmm, the |
And you can use the |
Oh |
I've updated the PR to use a temp IndexOutput and modified the test. It seems to be working now. I'm open to suggestions for the default block heap threshold and how to configure it.
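For readers following along, the write-to-temp, copy, then delete pattern can be sketched with plain `java.nio` files. This only mimics the idea; it is not the PR's code and does not use Lucene's Directory or IndexOutput classes. The class `TempSpill` and its method are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: spill bytes to a temp file, append them to the final file, clean up. */
final class TempSpill {
  /** Returns the number of bytes copied from the temp file into finalFile. */
  static long spillAndAppend(Path dir, Path finalFile, byte[] payload) throws IOException {
    Path tmp = Files.createTempFile(dir, "fst", ".tmp");
    try {
      // Step 1: write the data to the temp file instead of holding it in a heap structure.
      try (OutputStream out = Files.newOutputStream(tmp)) {
        out.write(payload);
      }
      // Step 2: stream the temp file into the final output.
      try (InputStream in = Files.newInputStream(tmp);
          OutputStream dst =
              Files.newOutputStream(
                  finalFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        return in.transferTo(dst);
      }
    } finally {
      // Step 3: always remove the temp file so no stray file is left in the directory.
      Files.deleteIfExists(tmp);
    }
  }
}
```

Deleting in a `finally` block matters here because, as noted earlier in the thread, a leftover tmp file in the index directory is rejected.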
@mikemccand I've added the suggestion, so baby-giant FSTs will also be written off-heap if they are above some threshold. Let me know if there are other changes needed.
Closing as the block tree term format is no longer using FST |
Description
Only the root block will be written off-heap; the sub-blocks won't be. (They are small, so it might not be worth it: we would need one IndexOutput for each of the sub-blocks.)
Note: The FSTCompiler change should be merged as part of #12981