Skip to content

Conversation

dungba88
Copy link
Contributor

@dungba88 dungba88 commented Dec 26, 2023

Description

This is an attempt to make FSTPostingFormat to write the FST off-heap. Instead of write it on-heap then save to disk, we configure the compiler to write the FST off-heap right from the start.

Some additional changes:

  • As we can't write the FST metadata and FST data on the same file, now we need to break the tfp file into 2 files: tfp.meta and tfp.data
  • We need to write the starting address of the FST data in the posting metadata file, then seek to that address when read

An alternative way is to copy the written FST back into metadata file, but that will slow down the writing.

@dungba88 dungba88 marked this pull request as draft December 26, 2023 06:36
@dungba88 dungba88 force-pushed the fstpostingformat-off-heap branch 2 times, most recently from fad6363 to aa06750 Compare December 26, 2023 07:21
@dungba88
Copy link
Contributor Author

dungba88 commented Dec 26, 2023

I found another quite tricky issue:

If we write the FST directly to the IndexOutput, there might be a chance that there's no term accepted by the FST, in that case we still write the padding 0 byte. This padding byte is to ensure no node having the 0 address: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L172-L174

However, since we are writing the FST consecutively for each field, appending to the same file, that means there could be a case we still write that additional padding byte, which is mapped to no field:
[ 0 ] [ FST_Field_1 ] [ 0 ] [ 0 ] [ FST_Field_3 ]

@dungba88 dungba88 force-pushed the fstpostingformat-off-heap branch from aa06750 to 83e3fac Compare December 26, 2023 09:11
@dungba88
Copy link
Contributor Author

There is only 1 failed test left: TestFSTPostingFormat.testRandomException

   >         Caused by:
   >         java.lang.RuntimeException: unclosed IndexOutput: _s_FST50_0.tfp.meta
   >             at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783)

Seems like some file might not be closed correctly when there are exception

@dungba88
Copy link
Contributor Author

dungba88 commented Dec 27, 2023

Fixed the above unclosed issue by moving openInput and createOutput to try-catch block

The test passed. I'll add a change log, some more comments

@dungba88
Copy link
Contributor Author

I added the change log, increased the FSTPostingsFormat version (isn't entirely related to this PR, but it seems the naming convention is outdated). The change for FSTCompiler can be merged as part of #12981 first. Will published the PR when the tests finished

@dungba88 dungba88 marked this pull request as ready for review December 28, 2023 08:27
@dungba88 dungba88 mentioned this pull request Dec 29, 2023
4 tasks
@dungba88
Copy link
Contributor Author

I'm not sure why FSTPostingsFormat is different from the rest, that it write both the metadata and data to the same file. I think writing to separate files would be cleaner and more consistent.

Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 26, 2024
@mikemccand
Copy link
Member

mikemccand commented Feb 8, 2024

I'm not sure why FSTPostingsFormat is different from the rest, that it write both the metadata and data to the same file. I think writing to separate files would be cleaner and more consistent.

Let's defer this for a future issue? (Edit: nevermind -- looks like you already tackled this in the PR, great!). This is a rarely used experimental postings format, and might be replaced by the new postings format in #12688.

@github-actions github-actions bot removed the Stale label Feb 9, 2024
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Feb 23, 2024
@dungba88
Copy link
Contributor Author

Thanks @mikemccand for the clarification!

Do you think we should still make this change? One benefit is that it can be used for reference. Otherwise I'll close this PR

@github-actions github-actions bot removed the Stale label Feb 29, 2024
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 15, 2024
@mikemccand
Copy link
Member

Thanks @mikemccand for the clarification!

Do you think we should still make this change? One benefit is that it can be used for reference. Otherwise I'll close this PR

Yes I think we should! The FSTs this postings format creates can easily be massive, so they really should build off-heap now that Lucene FST impl has this capability (thank you Tantivy for this inspiration!).

@dungba88 dungba88 force-pushed the fstpostingformat-off-heap branch from 76cf384 to 09768fc Compare September 13, 2025 20:48
@github-actions github-actions bot added this to the 11.0.0 milestone Sep 13, 2025
@github-actions github-actions bot removed the Stale label Sep 14, 2025
@dungba88
Copy link
Contributor Author

Thanks @mikemccand! I rebased the PR with the latest Lucene 11. One question I have is that do we need to maintain the backward compatibility for this format?

Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Sep 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants