-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Move synonym map off-heap for SynonymGraphFilter #13054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
I think measuring latency impact would also be important for this change as off-heap will be slower (I understand that we are only adding off-heap as an option here, not forcing, but still). |
I did some rough benchmarks using the large synonym file attached to https://issues.apache.org/jira/browse/LUCENE-3233 The benchmark code and input is at msfroh@174f98a
So, it looks like the time to load synonyms is mostly unchanged (1050ms versus 1037ms), and loading "pre-compiled" synonyms is super-duper fast. We do seem to take a 17.5% hit on processing time. (629ms versus 535ms.) I might try profiling to see where that time is being spent. If it's doing FST operations, I'll assume it's a cost of doing business. If it's spent loading the also off-heap output words, I might consider moving those (optionally?) back on heap. |
I decided to try experimenting with moving the output words back onto the heap, since I didn't like the fact that every word lookup was triggering a seek. Running now, I got way less variance on the on-heap runs. I also added some GCs between iterations, since I wanted to measure the heap usage of each. That likely removed some GC pauses from the on-heap version. I then switched back to the off-heap words to confirm the results that I saw last time (and compare against the implementation with on-heap words). The conclusion seems to be roughly:
The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB. With these trade-offs, I think off-heap FST with on-heap words may be a good choice for folks with large sets of synonyms. I don't think I would recommend off-heap FST with off-heap words.
|
@dungba88 -- I'm trying to resolve conflicts with your changes, but I'm a little stuck. I don't understand how we're supposed to use the FST APIs to write the FST to disk now. After merging our changes,
That call to
Is there something else that I'm supposed to be calling on the write path? Note that in the "off-heap" case above (when |
As you only need to write the FST metadata, there is no need to create the FST. You can just call
where as |
Hmm... I'm not sure that will work, because the logic to save the metadata is associated with the I tried extracting that method out into a static, but the line |
You are right, the saveMetadata is still in FST. Now to create the FST written off heap, you need to create the corresponding DataInput and use the FST constructor. However the saveMetadata method can be moved to FSTMetadata as well since all of the information are stored there (including outputs) |
I could put a PR for the saveMetadata change if you prefer. |
I'll update to take care of that. Thanks for the pointers! |
cfdb2fb
to
32ff74b
Compare
I realized I also need the |
Sure -- a standalone PR should work. We can both rebase onto that. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
@dungba88 - I forgot about this change for a while. Did you create a separate PR for the saveMetadata change? Should I? |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
@msfroh I also forgot about this. Let me create a PR |
I published a PR here: #13549. Please take a look when you have time! |
Note: The above PR has been merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this change! Synonym dictionaries can become massive, so having the option for off-heap'ing the FST at a smallish performance hit makes a lot of sense.
Plus it takes advantage of the newish capability of FSTs to be accessed off-heap (thank you Tantivy inspiration!) in Lucene.
I left a few comments...
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the | ||
* FST and output words are kept off-heap. | ||
*/ | ||
public class SynonymMapDirectory implements Closeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mark this with @lucene.experimental
so we are free to change the API within non-major releases?
Or: could this be package private? Does the user need to create this wrapper themselves for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that the user would create a SynonymMapDirectory
and pass it to SynonymMap.Builder.build(SynonymMapDirectory)
as the way of opting in to off-heap FSTs for their synonyms.
Given the issue you called out below regarding the need to close
the IndexInput
for the FST, I feel like the user needs to hold onto "something" (other than the SynonymMap
) that gives them an obligation to close filesystem resources when they're done.
Alternatively, I'd be happy to make SynonymMap
implement Closeable
. Then I'd probably just ask the user to specify a Path
instead. At that point, we could hide SynonymMapDirectory
altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we ever want to bring back the off-heap words, it seems like SynonymMapDirectory
is the way to go, because we need to store two "things" in this directory? Or, were we stuffing both FST and words into a single file when you had words off-heap too?
|
||
/** | ||
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the | ||
* FST and output words are kept off-heap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the user control separately whether FST and output words are off heap or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, if you're using SynonymMapDirectory
, you get off-heap FST and on-heap words.
If you don't use SynonymMapDirectory
(i.e. you're using the existing constructor or the arg-less version of SynonymMap.Builder.build()
), then everything is on-heap like before.
The numbers I posted in #13054 (comment) (which, granted, was just a single synthetic benchmark) seemed to suggest (to me, at least) that the "sweet spot" is off-heap FST with on-heap words. The performance hit from moving words off-heap (at least with my implementation) was pretty bad. Lots of seek
ing involved. Also, the vast majority of heap savings came from moving the FST.
I'm happy to bring back off-heap words as an option if we think someone would be willing to take that perf hit for slightly lower heap utilization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm can you fix the javadoc above to explain that words are on-heap and FST is off-heap?
+1 to default to that sweet spot.
I wonder in practice what the "typical" size of FST vs words is? Like does the FST dominate the storage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder in practice what the "typical" size of FST vs words is? Like does the FST dominate the storage?
Aha, you answered this in an earlier comment:
The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB.
It's wild that the FST is so much larger than the words... I'm not yet understanding why.
} | ||
} | ||
|
||
abstract static class BytesRefHashLike { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be public
since it's used in the public
ctor for SynonymMap
? I thought we had static checking for this though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, maybe rename to remove any reference to BytesRefHash
? E.g. WordProvider
or IDToWord
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... you're right.
More importantly, since the ctor for SynonymMap
is public, I probably shouldn't change its signature.
I'll leave the existing public constructor (that takes a BytesRefHash
), add a new private constructor (that takes a SynonymDictionary
-- the new name I picked for BytesRefHashLike
), and have the public constructor delegate to the private one. (That way, SynonymDictionary
can remain package-private.)
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java
Outdated
Show resolved
Hide resolved
new SynonymMapFormat(); // TODO -- Should this be more flexible/codec-like? Less? | ||
private final Directory directory; | ||
|
||
public SynonymMapDirectory(Path path) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to store several SynonymMap
s in one Directory
? Or one must make a separate Directory
for each? That's maybe fine ... e.g. one could make a FilterDirectory
impl that can share a single underlying filesystem directory and distinguish the files by e.g. a unique filename prefix or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, since I split the synonyms across three files (synonyms.mdt
, synonyms.wrd
, and synonyms.fst
), I assumed that there would be a single synonym map per directory.
That said, I suppose it wouldn't be hard to combine those three into a single file (with a .syn
extension, say), where the SynonymMapDirectory
could look at a prefix. Specifically, the current implementation reads the metadata and words once (keeping the words on-heap), then spends the rest of its time in the FST.
Then a single filesystem directory could have something like:
first_synonyms.syn
second_synonyms.syn
... etc. ...
What do you think? (I'm also happy to let each serialized SynonymMap
live in its own directory.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it as is for now (three separate files)? But let's mark things @lucene.experimental
to reserve the right to change APIs.
Hmm, also: will these synonym files be backwards compatible across releases? Across major releases? I would say we should not promise across major releases? Furthermore, we should enforce that not-promise, by writing the major release into the metadata somewhere and checking if that changed between writing and reading and throw a clear exception if so?
Within minor releases maybe we allow backcompat? If so, we need to add some testing to confirm syns written in 10.x are still readable/usable in 10.y?
} | ||
|
||
public SynonymMap readMap() throws IOException { | ||
return synonymMapFormat.readSynonymMap(directory); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this will hold open IndexInput
file handles right? How do these get closed? (SynonymMap
doesn't have a close
I think?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like a model where the user creates a SynonymMapDirectory
and passes it to SynonymMap.Builder.build())
if they want to store/read compiled synonyms on the filesystem.
Before returning the IndexInput
, the SynonymMapDirectory
will keep a reference to it. When the user calls close
on their SynonymMapDirectory
, it will close the outstanding IndexInput
.
This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.
Moved FST metadata saving into FSTMetadata class per suggestion from @dungba88.
- Reduced visibility of some things - Brought back the old SynonymMap public constructor - Renamed BytesRefHashLike to SynonymDictionary - Hold a reference to FST's IndexInput to close it
49b622f
to
cef4fac
Compare
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
@mikemccand -- do you think this needs more work? Can you work with your team to see if this change would help reduce your heap usage? While we allow custom synonym files on AWS OpenSearch, the heap utilization from synonyms hasn't come up as an issue, as far as I know. I just did this because it looked fun. I don't have real-world metrics to see if it's useful. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Hi @msfroh -- thank you for the ping! Sorry for the slow reply ... I'll try to review again soon, and we might be able to test impact in our Amazon product search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @msfroh! This is a nice added feature, not only for off-heap storage of possibly large synonym FSTs, but also the ability even to save your compiled SynonymMap
to disk and quickly load it later. Today apps must rebuild their SynonymMap
(a not cheap operation) in every JVM that will use it (or maybe do their own save/load of the underlying FST and BytesRefHash
words).
|
||
/** | ||
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the | ||
* FST and output words are kept off-heap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm can you fix the javadoc above to explain that words are on-heap and FST is off-heap?
+1 to default to that sweet spot.
I wonder in practice what the "typical" size of FST vs words is? Like does the FST dominate the storage?
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the | ||
* FST and output words are kept off-heap. | ||
*/ | ||
public class SynonymMapDirectory implements Closeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we ever want to bring back the off-heap words, it seems like SynonymMapDirectory
is the way to go, because we need to store two "things" in this directory? Or, were we stuffing both FST and words into a single file when you had words off-heap too?
} | ||
|
||
/** Builds an {@link SynonymMap} and returns it. */ | ||
/** Buils a {@link SynonymMap} and returns it. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, waaaaay up above, the javadoc for Builder
, it mentions FSTSynonymMap
twice -- can you fix those to SynonymMap
instead? That must be holdover from ancient naming...
Also, it's a bit annoying that GH does not allow me to put comments on parts of the code you did not change :) I guess this is GH's appempt to keep me in "eyes on the prize" mode ... so I only comment on stuff changed in the PR.
* Builds a {@link SynonymMap} and returns it. If directory is non-null, it will write the | ||
* compiled SynonymMap to disk and return an off-heap version. | ||
*/ | ||
public SynonymMap build(SynonymMapDirectory directory) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ability to save a SynonymMap
is new to your PR, right? We cannot save/load them today? So this is a nice new additional feature (in addition to off-heap option, and a nice side effect of it) in your PR?
|
||
/** | ||
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the | ||
* FST and output words are kept off-heap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder in practice what the "typical" size of FST vs words is? Like does the FST dominate the storage?
Aha, you answered this in an earlier comment:
The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB.
It's wild that the FST is so much larger than the words... I'm not yet understanding why.
FST<BytesRef> fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader()); | ||
FST.FSTMetadata<BytesRef> fstMetaData = fstCompiler.compile(); | ||
if (directory != null) { | ||
fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea is a caller could in theory write multiple FSTs into a single IndexOutput
(remove the TODO
?)?
FST.FSTMetadata<BytesRef> fstMetaData = fstCompiler.compile(); | ||
if (directory != null) { | ||
fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this? | ||
try (SynonymMapDirectory.WordsOutput wordsOutput = directory.wordsOutput()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A better on-disk layout might be to write a single big byte[]
blob for all words, and then something like the cool "linear fit" encoding that MonotonicLongValues
uses on-disk. This would be more compact and faster to load and maybe more options of what is on/off heap, etc.
But save all that for later! vInt prefix length encoding is fine for starters! Progress not perfection!
import org.apache.lucene.util.fst.OffHeapFSTStore; | ||
|
||
/** | ||
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm any reason why it must be an FSDirectory
? Can it just be any Directory
? Do we really rely on filesystem backing somehow? It looks like we are just using Lucene's standard IndexInput/Output
...
new SynonymMapFormat(); // TODO -- Should this be more flexible/codec-like? Less? | ||
private final Directory directory; | ||
|
||
public SynonymMapDirectory(Path path) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it as is for now (three separate files)? But let's mark things @lucene.experimental
to reserve the right to change APIs.
Hmm, also: will these synonym files be backwards compatible across releases? Across major releases? I would say we should not promise across major releases? Furthermore, we should enforce that not-promise, by writing the major release into the metadata somewhere and checking if that changed between writing and reading and throw a clear exception if so?
Within minor releases maybe we allow backcompat? If so, we need to add some testing to confirm syns written in 10.x are still readable/usable in 10.y?
} | ||
|
||
boolean hasSynonyms() throws IOException { | ||
// TODO should take the path to the synonyms file to compare file hash against file used to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa, what would this TODO
achieve? Is it somehow trying to check if the compiled synonyms have become stale relative to the original source synonyms (a "make" like capability)? We don't know down here whether the original source synonyms are backed by a file...
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
* Builds a {@link SynonymMap} and returns it. If directory is non-null, it will write the | ||
* compiled SynonymMap to disk and return an off-heap version. | ||
*/ | ||
public SynonymMap build(SynonymMapDirectory directory) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When implementing the new build() method which accepts a directory path parameter for off-heap SynonymMap storage, I encountered FileAlreadyExistsException
despite implementing a unique directory creation mechanism using System.currentTimeMillis()
.
Current implementation looks like something like this:
String synonymPath = ".../lucene/src/build/temp-FST";
Path dirPath = Path.of(synonymPath);
SynonymMap synonymMap;
try {
// Create directory if it doesn't exist
if (!Files.exists(dirPath)) {
Files.createDirectories(dirPath);
}
// Create a unique directory for this run
Path uniqueDirPath = dirPath.resolve("synonyms_" + System.currentTimeMillis());
Files.createDirectory(uniqueDirPath);
synonymMap = builder.build(new SynonymMapDirectory(uniqueDirPath));
...
Error observed:
Caused by: java.nio.file.FileAlreadyExistsException: /temp-FST/synonyms_1745894622943
at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken\\n
at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1205)\\n
at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1183)\\n
at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:735)\\n
at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:609)\\n
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:263)\\n
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)\\n
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1562)\\n
at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1520)\
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:391)\\n
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:160)\\n
at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)\\n
at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1205)\\n
at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1183)\\n
at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:735)\\n
at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:609)\\n
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:263)\\n
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)\\n
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1562)\\n
at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1520)\\n
... [truncated]```
The exception occurs during concurrent synonym map creation, appearing multiple times in the logs. Would appreciate guidance on proper handling of concurrent directory creation for off-heap storage.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Description
This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.
The initial implementation is rough, but the unit tests pass with this change randomly enabled.
Obvious things that need work are:
OffHeapBytesRefHashLike
, but I don't see an alternative (besides moving it on-heap). Given that the original issue was only about moving the FST off-heap, maybe we can keep the word dictionary on-heap.Fixes #13005