Make FSTPostingFormat to build FST off-heap #12980
I found another quite tricky issue: if we write the FST directly to the IndexOutput, the FST may accept no terms, and in that case we still write the padding 0 byte. This padding byte ensures that no node has address 0: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L172-L174 However, since we write the FSTs for each field consecutively, appending to the same file, we can end up writing an additional padding byte that maps to no field.
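The stray padding byte can be illustrated with a minimal, self-contained Java sketch. It does not use Lucene's API: `writeToyFst` and `countOrphanBytes` are hypothetical names, and the "FST" here is just a byte array prefixed with a single 0 byte, an assumption standing in for what FSTCompiler writes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

class PaddingByteSketch {

    /**
     * Toy stand-in for an FST writer that streams to a shared output.
     * It always emits one leading 0 byte, then the node bytes.
     * Returns true if the toy FST is non-empty (accepts at least one term).
     */
    static boolean writeToyFst(ByteArrayOutputStream out, byte[] nodeBytes) {
        out.write(0); // padding byte, written even when no term is accepted
        out.write(nodeBytes, 0, nodeBytes.length);
        return nodeBytes.length > 0;
    }

    /**
     * Appends one toy FST per field to a single shared buffer and returns
     * how many bytes in that buffer are not owned by any recorded field.
     */
    static long countOrphanBytes(List<byte[]> fstBytesPerField) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        long ownedBytes = 0;
        for (byte[] nodeBytes : fstBytesPerField) {
            long start = data.size();
            boolean nonEmpty = writeToyFst(data, nodeBytes);
            if (nonEmpty) {
                // Only non-empty FSTs are recorded as belonging to a field.
                ownedBytes += data.size() - start;
            }
            // An empty FST is not recorded, yet its padding byte is already written.
        }
        return data.size() - ownedBytes;
    }

    public static void main(String[] args) {
        List<byte[]> fields = Arrays.asList(new byte[] {1, 2}, new byte[0], new byte[] {3});
        System.out.println(countOrphanBytes(fields)); // prints 1
    }
}
```

In this toy model the second field has an empty FST, so its single padding byte sits in the shared file without any field pointing at it.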
There is only 1 failing test left: TestFSTPostingsFormat.testRandomException
It seems some file might not be closed correctly when there is an exception.
Fixed the unclosed-file issue above by moving […]. The test now passes. I'll add a change log entry and some more comments.
I added the change log and increased the FSTPostingsFormat version (this isn't entirely related to this PR, but the naming convention seems outdated). The change to FSTCompiler can be merged as part of #12981 first. I will publish the PR when the tests finish.
I'm not sure why FSTPostingsFormat differs from the rest in that it writes both the metadata and the data to the same file. I think writing to separate files would be cleaner and more consistent.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Thanks @mikemccand for the clarification! Do you think we should still make this change? One benefit is that it can serve as a reference. Otherwise I'll close this PR.
Yes, I think we should! The FSTs this postings format creates can easily be massive, so they really should be built off-heap now that Lucene's FST implementation has this capability (thank you Tantivy for the inspiration!).
Thanks @mikemccand! I rebased the PR onto the latest Lucene 11. One question: do we need to maintain backward compatibility for this format?
Description
This is an attempt to make FSTPostingsFormat write the FST off-heap. Instead of building it on-heap and then saving it to disk, we configure the compiler to write the FST off-heap right from the start.
Some additional changes:
- Split the tfp file into 2 files: tfp.meta and tfp.data
An alternative is to copy the written FST back into the metadata file, but that would slow down writing.
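The idea of a separate metadata file and data file can be sketched in plain Java. This is a toy model under stated assumptions, not the PR's implementation: the class `MetaDataSplitSketch`, the method `roundTrip`, and the metadata layout (field name, start offset, byte count) are all hypothetical, and in-memory buffers stand in for the two files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class MetaDataSplitSketch {

    /**
     * Writes each field's bytes straight to a "data" buffer while recording
     * (field name, start offset, length) in a separate "meta" buffer, then
     * reads the fields back using only the meta entries.
     */
    static Map<String, byte[]> roundTrip(String[] fields, byte[][] fstBytes) {
        try {
            ByteArrayOutputStream metaBuf = new ByteArrayOutputStream();
            ByteArrayOutputStream dataBuf = new ByteArrayOutputStream();
            DataOutputStream meta = new DataOutputStream(metaBuf);

            for (int i = 0; i < fields.length; i++) {
                long start = dataBuf.size();
                // The bytes are written once, directly to the data buffer.
                dataBuf.write(fstBytes[i], 0, fstBytes[i].length);
                // Only a small fixed record per field goes to the meta buffer.
                meta.writeUTF(fields[i]);
                meta.writeLong(start);
                meta.writeLong(fstBytes[i].length);
            }
            meta.flush();

            // Reading: walk the meta records and slice the data buffer.
            byte[] data = dataBuf.toByteArray();
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(metaBuf.toByteArray()));
            Map<String, byte[]> result = new LinkedHashMap<>();
            while (in.available() > 0) {
                String field = in.readUTF();
                int start = (int) in.readLong();
                int length = (int) in.readLong();
                result.put(field, Arrays.copyOfRange(data, start, start + length));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> fields = roundTrip(
            new String[] {"title", "body"}, new byte[][] {{1, 2, 3}, {4}});
        System.out.println(fields.keySet());
    }
}
```

In this sketch the large payload is written exactly once; copying it back into the metadata buffer afterwards would mean writing the same bytes a second time, which is the slowdown the description mentions.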