Skip to content

Conversation

dungba88
Copy link
Contributor

@dungba88 dungba88 commented Dec 29, 2023

Description

Only the root block will be written off-heap while the sub blocks won't be (They are small so it might not be worth it: We would need to have 1 IndexOutput for each of the sub blocks)

Note: The FSTCompiler change should be merged as part of #12981

@dungba88 dungba88 mentioned this pull request Dec 29, 2023
4 tasks
@dungba88
Copy link
Contributor Author

dungba88 commented Jan 5, 2024

@mikemccand I'm wondering if there is already some benchmark that can show the RAM saved by this change

assert firstBlock.isFloor || newBlocks.size() == 1;

firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);
boolean isRootBlock = prefixLength == 0 && count == pending.size();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this a parameter to be clearer instead of inferring from prefixLength and count

Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 26, 2024
@github-actions github-actions bot removed the Stale label Jan 31, 2024
@mikemccand
Copy link
Member

BlockTree is kinda crazy how it builds up the final FST: each little backwards-recursive chunk of term-space stores its subset of terms into a baby FST, and then on grouping N such child blocks into a new parent block, decodes those N baby FSTs, appending them into a new big-baby FST. This backwards recursive process continues for all chunks of term space, finally ending in the root block, where all the giant-baby FSTs are finally logically appended to one another to make the final FST.

This change makes only that final FST construction run off-heap, which is a good baby step.

But what if in my index all terms start with say a? I think that will mean we do on-heap construction of basically the full sized FST? Hmm, or, will the root block have the a prefix pointing to it, so we will in fact build the whole FST off-heap, from baby FSTs for aa*, ab*, ac*, ad*, etc.?

We might instead just switch to off-heap building once the expected FST size crosses a threshold? We can use createTempOutput to make temporary files as needed for the non-root FSTs that are "too big"?

@mikemccand
Copy link
Member

all the giant-baby FSTs

Heh, this made me remember the awesome character, Bôh, from the incredible movie Spirited Away:

image

@dungba88
Copy link
Contributor Author

dungba88 commented Feb 8, 2024

We might instead just switch to off-heap building once the expected FST size crosses a threshold? We can use createTempOutput to make temporary files as needed for the non-root FSTs that are "too big"?

I think this is a good idea. Wondering how should we choose a reasonable threshold? Maybe it could be a parameter? (Was afraid introducing another parameter would also increase the configuration complexity of the system).

One of the trade-off here is that they could potentially slow down the indexing: Apart from the root node, we need to traverse and iterate through the whole FST, and off-heap traversal might be slower than on-heap traversal (I think we saw 17% increases in the Synonym off-heap reading #13054). For root node, it doesn't need to be traversed, and we need to save it to IndexOutput anyway, so doing it off-heap actually save time: There's no need to construct the on-heap FST.

Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Feb 23, 2024
@dungba88
Copy link
Contributor Author

dungba88 commented Feb 27, 2024

@mikemccand I tried to use createTempOutput when the size is beyond some threshold. The logic works, but I got this exception

java.lang.IllegalArgumentException: invalid codec filename '_field_temp_2.tmp', cannot end with .tmp extension
	at __randomizedtesting.SeedInfo.seed([7608D2C3E960BC5A:CF77AADE6D1859B5]:0)
	at [email protected]/org.apache.lucene.index.SegmentInfo.checkFileNames(SegmentInfo.java:330)
	at [email protected]/org.apache.lucene.index.SegmentInfo.addFiles(SegmentInfo.java:306)
	at [email protected]/org.apache.lucene.index.SegmentInfo.setFiles(SegmentInfo.java:301)
	at [email protected]/org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:468)
	at [email protected]/org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496)
	at [email protected]/org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450)
	at [email protected]/org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:640)
	at [email protected]/org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4259)
	at [email protected]/org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4233)
	at [email protected]/org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2124)
	at [email protected]/org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2105)

It seems the Index will check every files created under the index directory and will disallow any tmp file. I think we could delete the tmp file after use, but there doesn't seem a functionality to delete IndexInput (maybe due to the abstraction and not all IndexInput is actually a File?)

@dungba88
Copy link
Contributor Author

If there is a way to create temp file outside of the search index, then it would work too, but I can't find it as all I/O are accessible from SegmentWriteState.directory.

@mikemccand
Copy link
Member

Hmm, the .tmp file should never be part of the file set for the actual segment? It should be transient, and then re-copied into the final result (the actual block tree file that stores the FST today), and only that final result file is seen by SegmentInfo?

@mikemccand
Copy link
Member

And you can use the IndexOutput.getName() to get the String name of the temp file, and delete that file (prolly in a finally clause in case disaster strikes) using Directory.deleteFile once its contents have been saved away to the final file?

@dungba88
Copy link
Contributor Author

Oh Directory.deleteFile is exactly what I needed! It's silly that I missed that. I'll post another revision soon.

@github-actions github-actions bot removed the Stale label Feb 28, 2024
@dungba88
Copy link
Contributor Author

I've updated to use temp IndexOutput and modify the test. It seems to be working now. I'm open for suggestion of the default block heap threshold and how to configure it.

Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 14, 2024
@dungba88
Copy link
Contributor Author

dungba88 commented Apr 3, 2024

@mikemccand I've added the suggestion so that baby-giant FST will also be written off-heap if they are above some threshold. Let me know if there is other changes needed.

@github-actions github-actions bot removed the Stale label Apr 4, 2024
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@dungba88
Copy link
Contributor Author

Closing as the block tree term format is no longer using FST

@dungba88 dungba88 closed this Sep 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants