Move synonym map off-heap for SynonymGraphFilter #13054

msfroh · 2024-01-30T08:08:21Z

Description

This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.

The initial implementation is rough, but the unit tests pass with this change randomly enabled.

Obvious things that need work are:

- I tried to do something like a codec, but not really a codec for the synonym map files. For a solution that could evolve over time, we should probably at least write something to the metadata file saying what format was used.
- Right now it makes no effort to detect changes to the synonym files. I would suggest that SynonymGraphFilterFactory rebuild the directory if a checksum of the input files doesn't match a value recorded in the metadata file.
- I don't think I like the random seeks in OffHeapBytesRefHashLike, but I don't see an alternative (besides moving it on-heap). Given that the original issue was only about moving the FST off-heap, maybe we can keep the word dictionary on-heap.
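
The checksum idea from the list above could look roughly like the sketch below. This is an illustration only, not code from the PR: `SynonymChecksum`, `checksumOf`, and `isStale` are hypothetical names, CRC32 is just one possible checksum, and the source is passed in as a byte array to keep the example free of file I/O.

```java
import java.util.zip.CRC32;

// Hypothetical helper, not part of Lucene or of this PR.
final class SynonymChecksum {

  /** Computes a CRC32 over the raw bytes of the synonym source. */
  static long checksumOf(byte[] synonymSource) {
    CRC32 crc = new CRC32();
    crc.update(synonymSource, 0, synonymSource.length);
    return crc.getValue();
  }

  /**
   * Returns true when the checksum recorded at compile time (for example in the
   * metadata file) no longer matches the current source, i.e. a rebuild is needed.
   */
  static boolean isStale(long recordedChecksum, byte[] currentSource) {
    return checksumOf(currentSource) != recordedChecksum;
  }
}
```

A factory could record `checksumOf(...)` when it writes the compiled map and call `isStale(...)` before reusing the directory.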

dungba88 · 2024-01-31T04:48:30Z

I think measuring latency impact would also be important for this change as off-heap will be slower (I understand that we are only adding off-heap as an option here, not forcing, but still).

msfroh · 2024-02-01T03:11:54Z

I did some rough benchmarks using the large synonym file attached to https://issues.apache.org/jira/browse/LUCENE-3233

The benchmark code and input is at msfroh@174f98a

Attempt	On-heap load (ms)	Off-heap load (ms)	Off-heap reload (ms)	On-heap process (ms)	Off-heap process (ms)	Off-heap reload process (ms)
1	1146.022381	1117.004359	4.099065	569.120851	656.430684	613.475144
2	1079.578922	1060.926854	1.761465	456.203168	655.596275	622.534246
3	1035.911388	1076.611629	1.750233	579.41094	655.955431	614.788388
4	1037.825728	1085.513933	2.074129	696.390519	688.664985	613.266972
5	1017.489384	1008.209808	1.717748	485.510526	620.800148	620.708538
6	1014.641653	1024.412669	1.740371	483.617261	619.696259	619.910897
7	1027.691397	1045.129567	1.727786	670.49456	622.48759	616.303549
8	984.005971	1009.265777	1.736832	513.543926	615.448442	613.06279
9	1027.841112	1027.057453	1.732985	486.502644	622.535269	620.285635
10	981.689573	1074.613506	1.71059	707.810107	613.417977	624.34832
11	1026.165712	1065.3181	1.689407	479.610417	621.454353	616.183786
12	994.949905	1046.898091	1.730394	498.938696	612.279425	619.965444
13	1035.144288	1043.119169	1.739726	472.821155	619.267425	613.029508
14	996.056368	1017.663948	1.699742	692.135015	619.725163	620.454352
15	1046.605644	1018.287866	1.713526	470.391592	619.723699	612.068366
16	1007.579733	1042.062818	1.70251	508.481346	619.481298	619.178419
17	1038.166702	1054.039165	1.683814	485.439337	620.901934	616.017789
18	1000.900448	1058.492139	1.7267	515.185816	622.204031	627.560895
19	1236.416447	1080.877889	1.643654	434.73928	624.825435	625.622426
20	997.663619	1038.478411	1.657257	497.232157	623.337627	620.943519
Mean	1036.617319	1049.699158	1.8518967	535.1789657	628.7116725	618.4854492
Stddev	59.71799264	28.44516049	0.535792004	86.95026923	19.55324941	4.52695571

So, it looks like the time to load synonyms is mostly unchanged (1050ms versus 1037ms), and loading "pre-compiled" synonyms is super-duper fast.

We do seem to take a 17.5% hit on processing time. (629ms versus 535ms.) I might try profiling to see where that time is being spent. If it's doing FST operations, I'll assume it's a cost of doing business. If it's spent loading the also off-heap output words, I might consider moving those (optionally?) back on heap.

msfroh · 2024-02-09T08:08:49Z

I decided to try experimenting with moving the output words back onto the heap, since I didn't like the fact that every word lookup was triggering a seek.

Running now, I got way less variance on the on-heap runs. I also added some GCs between iterations, since I wanted to measure the heap usage of each. That likely removed some GC pauses from the on-heap version.

I then switched back to the off-heap words to confirm the results that I saw last time (and compare against the implementation with on-heap words).

The conclusion seems to be roughly:

Existing on-heap FST averages about 444ms to process a lot of synonyms.
Off-heap FST with on-heap words averages 515 or 516ms. (About 16% slower than existing on-heap.)
Off-heap FST with off-heap words averages 620ms. (About 40% slower than existing on-heap.)

The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB.

With these trade-offs, I think off-heap FST with on-heap words may be a good choice for folks with large sets of synonyms. I don't think I would recommend off-heap FST with off-heap words.
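
For reference, the slowdown percentages quoted above follow directly from the mean processing times in the table below. A small sketch of the arithmetic (the class and method names are made up for illustration):

```java
// Illustrative arithmetic only; the inputs are the "Mean" values from the benchmark table.
final class Slowdown {

  /** Relative slowdown of a candidate versus a baseline, in percent. */
  static double percentSlower(double baselineMs, double candidateMs) {
    return (candidateMs / baselineMs - 1.0) * 100.0;
  }

  public static void main(String[] args) {
    double onHeap = 443.6585744;          // on-heap FST processing time
    double offHeapFst = 515.115835;       // off-heap FST, on-heap words
    double offHeapEverything = 620.4259376; // off-heap FST, off-heap words
    System.out.printf("off-heap FST, on-heap words: %.1f%% slower%n",
        percentSlower(onHeap, offHeapFst));
    System.out.printf("off-heap FST, off-heap words: %.1f%% slower%n",
        percentSlower(onHeap, offHeapEverything));
  }
}
```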

Attempt	OnHeap FST load time (ms)	OffHeap FST (OnHeap words) load time (ms)	OffHeap FST (OnHeap words) reload time (ms)	OnHeap FST processing time (ms)	OffHeap FST (OnHeap words) processing time (ms)	OffHeap FST (OnHeap words) reloaded processing time (ms)	OffHeap FST (OffHeap words) processing time (ms)	OffHeap FST (OffHeap words) reloaded processing time (ms)
1	1191.339685	1072.285824	9.669646	436.391631	520.550704	516.11297	623.451546	620.531215
2	1030.432454	1033.619768	8.874105	448.848403	516.784387	517.230739	621.522464	622.793343
3	984.83645	1037.807342	8.912252	443.789813	512.066535	517.716981	622.455444	620.468985
4	1049.63589	1048.60113	8.894401	449.237547	518.946226	516.868933	617.837364	616.810236
5	990.22176	1049.618665	8.861166	448.923912	512.559801	511.114898	616.555422	617.122551
6	978.41877	1063.824595	8.930418	440.251675	517.632376	518.175232	621.969759	622.828416
7	985.434177	1049.113913	8.872906	443.209607	511.210536	518.802292	624.151468	622.097039
8	985.376238	1046.102696	8.823786	440.815454	517.491411	517.905752	623.390319	625.387487
9	983.341325	1065.892279	8.871586	449.145252	516.029267	516.916524	622.811992	622.798858
10	985.438642	1046.71167	8.8518	445.970679	512.045037	518.934149	622.592098	614.661805
11	990.592624	1050.377106	8.832753	443.844237	515.758106	510.808005	611.62254	622.956946
12	986.747374	1066.052969	8.884928	444.398327	517.259451	524.770132	622.085785	619.311172
13	984.328191	1052.189621	8.88281	439.612497	517.861131	515.796013	617.862222	615.101452
14	984.405339	1049.06783	8.835775	438.871305	517.885493	515.853446	615.254987	623.464483
15	997.323593	1064.473985	8.90682	443.640208	515.329143	518.807239	623.020916	623.013801
16	997.253932	1066.558928	8.900308	442.534843	511.930766	516.365803	624.316916	615.037306
17	999.464751	1046.464149	8.895899	443.48306	514.841946	517.082166	617.615908	618.661376
18	1001.896073	1045.304622	8.877555	444.875225	515.029862	510.365428	618.540866	624.355309
19	986.055833	1045.208347	8.863339	441.647553	511.489699	517.213428	623.61503	621.198543
20	984.112667	1047.317164	8.940865	451.304206	514.762544	510.45981	621.057397	621.483146
21	988.310511	1046.154648	8.865301	447.25874	514.859414	517.24163	623.916511	614.185296
22	982.874582	1062.113889	8.867098	439.785463	510.387721	516.885653	623.494968	622.527091
23	980.96967	1048.050631	8.867966	439.05464	511.423329	516.984465	621.567988	621.204435
24	983.189843	1046.083632	8.81578	440.574651	518.390122	520.392926	622.34785	614.923018
25	987.033178	1074.553767	8.812579	446.687106	513.914686	521.952744	615.870183	621.089011
26	985.771758	1076.245942	8.845264	444.718264	516.274395	513.5547	615.927497	615.53522
27	981.748774	1046.85677	8.818164	443.252924	513.632714	515.919924	626.659516	622.307368
28	983.979894	1062.317764	8.869256	443.267803	513.965345	509.688356	615.790469	615.712761
29	980.908776	1045.006602	8.855109	444.452376	517.488159	509.770143	621.96871	621.582871
30	981.508232	1046.722776	8.790313	443.952753	513.840793	512.847346	621.747601	621.901271
31	999.165558	1063.517734	8.792905	440.356205	517.677777	517.920992	620.90204	613.422668
32	1000.854281	1060.766663	9.027399	444.385706	510.006231	514.006688	623.684492	620.742008
33	1001.620724	1046.329083	8.72687	443.912072	509.793229	513.313214	620.695915	621.266234
34	1008.677463	1044.437966	8.799494	447.077333	516.263674	514.751767	622.775084	620.885167
35	987.309353	1048.062722	8.763122	440.748052	518.972785	518.608101	621.032898	620.85482
36	980.960836	1052.037316	8.834358	445.210623	518.850346	511.742763	620.719135	621.679027
37	983.807955	1049.894433	8.798039	440.302584	511.351473	510.557417	615.059624	619.802549
38	982.144377	1071.744423	8.81421	444.036711	518.551589	515.779265	614.579103	615.092139
39	984.460101	1051.399337	8.781058	439.998112	518.709639	511.192122	614.797646	621.154429
40	981.16924	1047.739329	8.827411	446.515425	512.815557	519.138446	621.769829	613.83963
Mean	995.5780219	1053.415701	8.87387035	443.6585744	515.115835	515.7387151	620.4259376	619.7447621
StdDev	34.59108879	10.41692769	0.139732265	3.344053675	2.951489941	3.528886642	3.48797555	3.346472427

msfroh · 2024-02-23T18:30:41Z

@dungba88 -- I'm trying to resolve conflicts with your changes, but I'm a little stuck. I don't understand how we're supposed to use the FST APIs to write the FST to disk now.

After merging our changes, SynonymMap contains:

      FST<BytesRef> fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());
      if (directory != null) {
        fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this?
        try (SynonymMapDirectory.WordsOutput wordsOutput = directory.wordsOutput()) {
          BytesRef scratchRef = new BytesRef();
          for (int i = 0; i < words.size(); i++) {
            words.get(i, scratchRef);
            wordsOutput.addWord(scratchRef);
          }
        }
        directory.writeMetadata(words.size(), maxHorizontalContext, fst);
        return directory.readMap();
      }

That call to FST.fromFSTReader(...) fails with:

The DataOutput must implement FSTReader, but got FSIndexOutput(path="/home/froh/ws/lucene/lucene/analysis/common/build/tmp/tests-tmp/lucene.analysis.synonym.TestSynonymGraphFilter_1171182AD5892267-001/tempDir-001/synonyms.fst")

Is there something else that I'm supposed to be calling on the write path? Note that in the "off-heap" case above (when directory != null), we just need to write the FST. The directory.readMap() call loads it fresh from disk, discarding the FST that we constructed on heap.

dungba88 · 2024-02-24T08:53:04Z

@msfroh

As you only need to write the FST metadata, there is no need to create the FST. You can just call

directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);

where fstMetadata is returned by fstCompiler.compile()

msfroh · 2024-02-25T22:11:00Z

@msfroh

As you only need to write the FST metadata, there is no need to create the FST. You can just call
directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);
where fstMetadata is returned by fstCompiler.compile()

Hmm... I'm not sure that will work, because the logic to save the metadata is associated with the FST instance (i.e. in the FST::saveMetadata method).

I tried extracting that method out into a static, but the line outputs.writeFinalOutput(...) breaks things.

dungba88 · 2024-02-26T00:26:25Z

You are right, the saveMetadata is still in FST.

Now to create the FST written off heap, you need to create the corresponding DataInput and use the FST constructor.

However, the saveMetadata method can be moved to FSTMetadata as well, since all of the information is stored there (including outputs).

dungba88 · 2024-02-26T02:23:23Z

I could put a PR for the saveMetadata change if you prefer.

msfroh · 2024-02-26T04:48:47Z

I could put a PR for the saveMetadata change if you prefer.

I'll update to take care of that. Thanks for the pointers!

dungba88 · 2024-02-27T14:51:40Z

I realized I also need the saveMetadata change for #12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)

msfroh · 2024-02-27T17:00:49Z

I realized I also need the saveMetadata change for #12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)

Sure -- a standalone PR should work. We can both rebase onto that.

github-actions · 2024-03-13T00:17:06Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

msfroh · 2024-05-25T03:58:36Z

@dungba88 - I forgot about this change for a while. Did you create a separate PR for the saveMetadata change? Should I?

github-actions · 2024-06-09T00:21:54Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

dungba88 · 2024-07-08T04:05:07Z

@msfroh I also forgot about this. Let me create a PR

dungba88 · 2024-07-08T04:54:19Z

I published a PR here: #13549. Please take a look when you have time!

dungba88 · 2024-07-22T23:49:43Z

Note: The above PR has been merged

mikemccand

I love this change! Synonym dictionaries can become massive, so having the option for off-heap'ing the FST at a smallish performance hit makes a lot of sense.

Plus it takes advantage of the newish capability of FSTs to be accessed off-heap (thank you Tantivy inspiration!) in Lucene.

I left a few comments...

mikemccand · 2024-07-23T10:53:10Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java

+ * Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the
+ * FST and output words are kept off-heap.
+ */
+public class SynonymMapDirectory implements Closeable {


Maybe mark this with @lucene.experimental so we are free to change the API within non-major releases?

Or: could this be package private? Does the user need to create this wrapper themselves for some reason?

I was thinking that the user would create a SynonymMapDirectory and pass it to SynonymMap.Builder.build(SynonymMapDirectory) as the way of opting in to off-heap FSTs for their synonyms.

Given the issue you called out below regarding the need to close the IndexInput for the FST, I feel like the user needs to hold onto "something" (other than the SynonymMap) that gives them an obligation to close filesystem resources when they're done.

Alternatively, I'd be happy to make SynonymMap implement Closeable. Then I'd probably just ask the user to specify a Path instead. At that point, we could hide SynonymMapDirectory altogether.

If we ever want to bring back the off-heap words, it seems like SynonymMapDirectory is the way to go, because we need to store two "things" in this directory? Or, were we stuffing both FST and words into a single file when you had words off-heap too?

mikemccand · 2024-07-23T10:57:53Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java

+
+/**
+ * Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the
+ * FST and output words are kept off-heap.


How does the user control separately whether FST and output words are off heap or not?

At this point, if you're using SynonymMapDirectory, you get off-heap FST and on-heap words.

If you don't use SynonymMapDirectory (i.e. you're using the existing constructor or the arg-less version of SynonymMap.Builder.build()), then everything is on-heap like before.

The numbers I posted in #13054 (comment) (which, granted, was just a single synthetic benchmark) seemed to suggest (to me, at least) that the "sweet spot" is off-heap FST with on-heap words. The performance hit from moving words off-heap (at least with my implementation) was pretty bad. Lots of seeking involved. Also, the vast majority of heap savings came from moving the FST.

I'm happy to bring back off-heap words as an option if we think someone would be willing to take that perf hit for slightly lower heap utilization.

Hmm can you fix the javadoc above to explain that words are on-heap and FST is off-heap?

+1 to default to that sweet spot.

I wonder in practice what the "typical" size of FST vs words is? Like does the FST dominate the storage?

I wonder in practice what the "typical" size of FST vs words is? Like does the FST dominate the storage?

Aha, you answered this in an earlier comment:

The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB.

It's wild that the FST is so much larger than the words... I'm not yet understanding why.

mikemccand · 2024-07-23T11:07:16Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMap.java

    }
  }

+  abstract static class BytesRefHashLike {


Does this need to be public since it's used in the public ctor for SynonymMap? I thought we had static checking for this though...

Also, maybe rename to remove any reference to BytesRefHash? E.g. WordProvider or IDToWord or something?

Hmm... you're right.

More importantly, since the ctor for SynonymMap is public, I probably shouldn't change its signature.

I'll leave the existing public constructor (that takes a BytesRefHash), add a new private constructor (that takes a SynonymDictionary -- the new name I picked for BytesRefHashLike), and have the public constructor delegate to the private one. (That way, SynonymDictionary can remain package-private.)
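
The delegation pattern described above can be sketched with stand-in types. None of the classes below are the real Lucene ones: `WordsSketch` plays the role of BytesRefHash, `SynonymDictionarySketch` the role of the package-private SynonymDictionary, and `SynonymMapSketch` the role of SynonymMap.

```java
import java.util.List;

// Stand-in for the package-private abstraction (not the real SynonymDictionary).
abstract class SynonymDictionarySketch {
  abstract String get(int id);
}

// Stand-in for the type that the existing public constructor accepts.
final class WordsSketch {
  final List<String> words;

  WordsSketch(List<String> words) {
    this.words = words;
  }
}

final class SynonymMapSketch {
  private final SynonymDictionarySketch dictionary;

  // Existing public signature is preserved; it adapts its argument and delegates.
  public SynonymMapSketch(WordsSketch words) {
    this(adapt(words));
  }

  // New private constructor works against the package-private abstraction.
  private SynonymMapSketch(SynonymDictionarySketch dictionary) {
    this.dictionary = dictionary;
  }

  private static SynonymDictionarySketch adapt(WordsSketch words) {
    return new SynonymDictionarySketch() {
      @Override
      String get(int id) {
        return words.words.get(id);
      }
    };
  }

  String word(int id) {
    return dictionary.get(id);
  }
}
```

Callers of the public constructor see no change, while the abstraction never leaks into the public API.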

mikemccand · 2024-07-23T11:11:22Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java

+      new SynonymMapFormat(); // TODO -- Should this be more flexible/codec-like? Less?
+  private final Directory directory;
+
+  public SynonymMapDirectory(Path path) throws IOException {


Is it possible to store several SynonymMaps in one Directory? Or one must make a separate Directory for each? That's maybe fine ... e.g. one could make a FilterDirectory impl that can share a single underlying filesystem directory and distinguish the files by e.g. a unique filename prefix or so.

For now, since I split the synonyms across three files (synonyms.mdt, synonyms.wrd, and synonyms.fst), I assumed that there would be a single synonym map per directory.

That said, I suppose it wouldn't be hard to combine those three into a single file (with a .syn extension, say), where the SynonymMapDirectory could look at a prefix. Specifically, the current implementation reads the metadata and words once (keeping the words on-heap), then spends the rest of its time in the FST.

Then a single filesystem directory could have something like:

first_synonyms.syn
second_synonyms.syn
... etc. ...

What do you think? (I'm also happy to let each serialized SynonymMap live in its own directory.)

Let's leave it as is for now (three separate files)? But let's mark things @lucene.experimental to reserve the right to change APIs.

Hmm, also: will these synonym files be backwards compatible across releases? Across major releases? I would say we should not promise across major releases? Furthermore, we should enforce that not-promise, by writing the major release into the metadata somewhere and checking if that changed between writing and reading and throw a clear exception if so?

Within minor releases maybe we allow backcompat? If so, we need to add some testing to confirm syns written in 10.x are still readable/usable in 10.y?
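
A rough sketch of the "write the major release into the metadata and check it on read" idea follows. It uses plain `java.nio.ByteBuffer` so that it stands alone; the class name, the layout (major version followed by a word count), and the exception message are all assumptions for illustration, and the real implementation would go through Lucene's own input/output abstractions.

```java
import java.nio.ByteBuffer;

// Hypothetical layout: [int majorVersion][int wordCount]. Not the PR's actual format.
final class VersionedMetadataSketch {

  /** Serializes the metadata, recording the major version that wrote it. */
  static byte[] write(int majorVersion, int wordCount) {
    return ByteBuffer.allocate(2 * Integer.BYTES)
        .putInt(majorVersion)
        .putInt(wordCount)
        .array();
  }

  /** Reads the word count, refusing metadata written by a different major version. */
  static int readWordCount(byte[] metadata, int currentMajorVersion) {
    ByteBuffer in = ByteBuffer.wrap(metadata);
    int writtenMajor = in.getInt();
    if (writtenMajor != currentMajorVersion) {
      throw new IllegalStateException(
          "Synonym map was written by major version " + writtenMajor
              + " but is being read by major version " + currentMajorVersion
              + "; please rebuild it from the source synonyms");
    }
    return in.getInt();
  }
}
```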

mikemccand · 2024-07-23T11:15:39Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java

+  }
+
+  public SynonymMap readMap() throws IOException {
+    return synonymMapFormat.readSynonymMap(directory);


Hmm, this will hold open IndexInput file handles right? How do these get closed? (SynonymMap doesn't have a close I think?).

I think I like a model where the user creates a SynonymMapDirectory and passes it to SynonymMap.Builder.build() if they want to store/read compiled synonyms on the filesystem.

Before returning the IndexInput, the SynonymMapDirectory will keep a reference to it. When the user calls close on their SynonymMapDirectory, it will close the outstanding IndexInput.
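
The ownership model described above (the directory remembers what it handed out and closes it later) could be sketched like this. `TrackingDirectorySketch` and `track` are hypothetical names, and plain `java.io.Closeable` stands in for IndexInput.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SynonymMapDirectory's resource tracking.
final class TrackingDirectorySketch implements Closeable {
  private final List<Closeable> openInputs = new ArrayList<>();

  /** Remembers a resource before handing it to the caller. */
  synchronized <T extends Closeable> T track(T input) {
    openInputs.add(input);
    return input;
  }

  /** Closes every outstanding resource, rethrowing the first failure (others are suppressed). */
  @Override
  public synchronized void close() throws IOException {
    IOException first = null;
    for (Closeable input : openInputs) {
      try {
        input.close();
      } catch (IOException e) {
        if (first == null) {
          first = e;
        } else {
          first.addSuppressed(e);
        }
      }
    }
    openInputs.clear();
    if (first != null) {
      throw first;
    }
  }
}
```

With this shape, the user only needs to close the directory object; the SynonymMap itself does not have to become Closeable.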

github-actions · 2024-08-17T00:19:48Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

msfroh · 2025-01-10T08:02:15Z

@mikemccand -- do you think this needs more work? Can you work with your team to see if this change would help reduce your heap usage?

While we allow custom synonym files on AWS OpenSearch, the heap utilization from synonyms hasn't come up as an issue, as far as I know. I just did this because it looked fun. I don't have real-world metrics to see if it's useful.

github-actions · 2025-01-26T00:23:18Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

mikemccand · 2025-02-07T13:31:44Z

Hi @msfroh -- thank you for the ping! Sorry for the slow reply ... I'll try to review again soon, and we might be able to test impact in our Amazon product search SynonymGraphFilter usage.

mikemccand

Thank you @msfroh! This is a nice added feature, not only for off-heap storage of possibly large synonym FSTs, but also the ability even to save your compiled SynonymMap to disk and quickly load it later. Today apps must rebuild their SynonymMap (a not cheap operation) in every JVM that will use it (or maybe do their own save/load of the underlying FST and BytesRefHash words).

mikemccand · 2025-02-07T13:41:18Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMap.java

    }

-    /** Builds an {@link SynonymMap} and returns it. */
+    /** Buils a {@link SynonymMap} and returns it. */


Hmm, waaaaay up above, the javadoc for Builder, it mentions FSTSynonymMap twice -- can you fix those to SynonymMap instead? That must be holdover from ancient naming...

Also, it's a bit annoying that GH does not allow me to put comments on parts of the code you did not change :) I guess this is GH's attempt to keep me in "eyes on the prize" mode ... so I only comment on stuff changed in the PR.

mikemccand · 2025-02-07T13:42:19Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMap.java

+     * Builds a {@link SynonymMap} and returns it. If directory is non-null, it will write the
+     * compiled SynonymMap to disk and return an off-heap version.
+     */
+    public SynonymMap build(SynonymMapDirectory directory) throws IOException {


This ability to save a SynonymMap is new to your PR, right? We cannot save/load them today? So this is a nice new additional feature (in addition to off-heap option, and a nice side effect of it) in your PR?

mikemccand · 2025-02-07T13:52:37Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMap.java

-      FST<BytesRef> fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());
+      FST.FSTMetadata<BytesRef> fstMetaData = fstCompiler.compile();
+      if (directory != null) {
+        fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this?


I think the idea is a caller could in theory write multiple FSTs into a single IndexOutput (remove the TODO?)?

mikemccand · 2025-02-07T13:56:02Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMap.java

+      FST.FSTMetadata<BytesRef> fstMetaData = fstCompiler.compile();
+      if (directory != null) {
+        fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this?
+        try (SynonymMapDirectory.WordsOutput wordsOutput = directory.wordsOutput()) {


A better on-disk layout might be to write a single big byte[] blob for all words, and then something like the cool "linear fit" encoding that MonotonicLongValues uses on-disk. This would be more compact and faster to load and maybe more options of what is on/off heap, etc.

But save all that for later! vInt prefix length encoding is fine for starters! Progress not perfection!
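
For readers unfamiliar with the encoding mentioned above, here is a minimal sketch of vInt length-prefixed words: a count, then for each word its UTF-8 length as a variable-length int followed by the bytes. The class and method names are made up, the layout is an assumption and not necessarily the PR's exact on-disk format, and only non-negative values are handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative encoder/decoder; not the PR's WordsOutput implementation.
final class VIntWordsSketch {

  /** Writes a non-negative int as 7 bits per byte, low-order bits first, high bit = "more". */
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  /** Reads a vInt starting at pos[0] and advances pos[0] past it. */
  static int readVInt(byte[] data, int[] pos) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = data[pos[0]++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  /** Encodes: vInt(count), then vInt(length) + UTF-8 bytes per word. */
  static byte[] encode(List<String> words) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, words.size());
    for (String word : words) {
      byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
      writeVInt(out, utf8.length);
      out.write(utf8, 0, utf8.length);
    }
    return out.toByteArray();
  }

  /** Decodes the format produced by encode(). */
  static List<String> decode(byte[] data) {
    int[] pos = {0};
    int count = readVInt(data, pos);
    List<String> words = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      int length = readVInt(data, pos);
      words.add(new String(data, pos[0], length, StandardCharsets.UTF_8));
      pos[0] += length;
    }
    return words;
  }
}
```

Because each word can only be located by walking the length prefixes before it, this layout suits a one-time sequential load into memory rather than random access, which is where the "single blob plus offsets" idea would help.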

mikemccand · 2025-02-07T13:57:24Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java

+import org.apache.lucene.util.fst.OffHeapFSTStore;
+
+/**
+ * Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the


Hmm any reason why it must be an FSDirectory? Can it just be any Directory? Do we really rely on filesystem backing somehow? It looks like we are just using Lucene's standard IndexInput/Output...

mikemccand · 2025-02-07T14:03:07Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java

+  }
+
+  boolean hasSynonyms() throws IOException {
+    // TODO should take the path to the synonyms file to compare file hash against file used to


Whoa, what would this TODO achieve? Is it somehow trying to check if the compiled synonyms have become stale relative to the original source synonyms (a "make" like capability)? We don't know down here whether the original source synonyms are backed by a file...

github-actions · 2025-02-23T00:25:58Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

pchencal · 2025-04-29T21:17:06Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMap.java

+     * Builds a {@link SynonymMap} and returns it. If directory is non-null, it will write the
+     * compiled SynonymMap to disk and return an off-heap version.
+     */
+    public SynonymMap build(SynonymMapDirectory directory) throws IOException {


When implementing the new build() method which accepts a directory path parameter for off-heap SynonymMap storage, I encountered FileAlreadyExistsException despite implementing a unique directory creation mechanism using System.currentTimeMillis().

The current implementation looks something like this:

      String synonymPath = ".../lucene/src/build/temp-FST";
      Path dirPath = Path.of(synonymPath);
      SynonymMap synonymMap;
      try {
        // Create directory if it doesn't exist
        if (!Files.exists(dirPath)) {
          Files.createDirectories(dirPath);
        }
        // Create a unique directory for this run
        Path uniqueDirPath = dirPath.resolve("synonyms_" + System.currentTimeMillis());
        Files.createDirectory(uniqueDirPath);
        synonymMap = builder.build(new SynonymMapDirectory(uniqueDirPath));
        ...

Error observed:

      Caused by: java.nio.file.FileAlreadyExistsException: /temp-FST/synonyms_1745894622943
        at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken
        at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1205)
        at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1183)
        at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:735)
        at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:609)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:263)
        at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
        at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1562)
        at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1520)
        at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:391)
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:160)
        at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
        at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1205)
        at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1183)
        at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:735)
        at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:609)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:263)
        at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
        at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1562)
        at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1520)
        ... [truncated]

The exception occurs during concurrent synonym map creation, appearing multiple times in the logs.
Would appreciate guidance on proper handling of concurrent directory creation for off-heap storage.
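
One possible explanation, offered as an assumption rather than a diagnosis: if several threads build synonym maps within the same millisecond, `System.currentTimeMillis()` yields the same suffix and the second `Files.createDirectory` call fails. A minimal sketch that avoids timestamp-based names by letting the JDK pick a unique name (`UniqueSynonymDirs` and `create` are hypothetical names):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Lucene or of this PR.
final class UniqueSynonymDirs {

  /** Creates a fresh, uniquely named "synonyms_*" directory under the given parent. */
  static Path create(Path parent) {
    try {
      // Safe to call concurrently: it does not fail if the directory already exists.
      Files.createDirectories(parent);
      // The JDK generates the rest of the name and creates the directory as a new one.
      return Files.createTempDirectory(parent, "synonyms_");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The returned path could then be passed to `new SynonymMapDirectory(...)` as in the snippet above. If the maps are identical across threads, building once and sharing the result may be a simpler fix.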

github-actions · 2025-05-14T00:26:43Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

msfroh mentioned this pull request Jan 30, 2024

SynonymGraphFilter should read FSTs off-heap? #13005

Open

dungba88 mentioned this pull request Feb 8, 2024

Make Lucene90 postings format to write FST off heap #12985

Closed

msfroh force-pushed the synonym_fst_offheap branch 2 times, most recently from cfdb2fb to 32ff74b, February 26, 2024 19:01

github-actions bot added the Stale label Mar 13, 2024

github-actions bot removed the Stale label May 26, 2024

github-actions bot added the Stale label Jun 9, 2024

dungba88 mentioned this pull request Jul 8, 2024

Refactor FST.saveMetadata() to FSTMetadata.save() #13549

Merged

github-actions bot removed the Stale label Jul 9, 2024

mikemccand reviewed Jul 23, 2024

msfroh added 3 commits August 1, 2024 17:18

Move synonym map off-heap for SynonymGraphFilter

a371b7b

This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.

Tidy code

139179b

Move output words on-heap

2d24031

msfroh added 2 commits August 1, 2024 17:18

Resolve merge conflicts

8bfe105

Moved FST metadata saving into FSTMetadata class per suggestion from @dungba88.

Address comments from mikemccand

cef4fac

- Reduced visibility of some things
- Brought back the old SynonymMap public constructor
- Renamed BytesRefHashLike to SynonymDictionary
- Hold a reference to FST's IndexInput to close it

msfroh force-pushed the synonym_fst_offheap branch from 49b622f to cef4fac, August 2, 2024 00:37

github-actions bot added the Stale label Aug 17, 2024

github-actions bot removed the Stale label Jan 11, 2025

github-actions bot added the Stale label Jan 26, 2025

mikemccand reviewed Feb 7, 2025

github-actions bot removed the Stale label Feb 8, 2025

github-actions bot added the Stale label Feb 23, 2025

pchencal reviewed Apr 29, 2025

github-actions bot removed the Stale label Apr 30, 2025

github-actions bot added the Stale label May 14, 2025

github-actions bot removed the Stale label Oct 1, 2025
