Description
By default, Lucene currently uses compound files for flushed segments, and for merged segments that account for less than 10% of the total index size (computed either as a number of docs or as a byte size, depending on the merge policy).
I am considering switching to a fixed threshold, e.g. using compound files for all segments below 64MB for byte-size-based merge policies (TieredMergePolicy, LogByteSizeMergePolicy) or 65,536 docs for doc-based merge policies (LogDocMergePolicy).
I like it better for a few reasons:
- Whether a segment is compound or not is more deterministic (and thus easier to reason about) as it doesn't depend on the total size of the index at the time of merging.
- The current ratio doesn't work well in multi-tenant scenarios where you could still have plenty of small files overall due to many small indexes.
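To make the two reasons above concrete, here is a minimal sketch that compares the ratio-based rule with a fixed threshold for a byte-size-based policy. It is a standalone illustration, not Lucene code: the class `CompoundFileRules` and both method names are hypothetical, and the strict "less than" comparison simply follows the wording of this description.

```java
public class CompoundFileRules {
    static final long MB = 1024L * 1024L;

    // Ratio-based rule as described above: the segment is compound when it
    // is smaller than `ratio` times the total index size. (Hypothetical helper.)
    static boolean compoundByRatio(long segmentBytes, long totalIndexBytes, double ratio) {
        return segmentBytes < ratio * totalIndexBytes;
    }

    // Proposed rule: the segment is compound when it is below a fixed size,
    // regardless of how large the index is. (Hypothetical helper.)
    static boolean compoundByThreshold(long segmentBytes, long thresholdBytes) {
        return segmentBytes < thresholdBytes;
    }

    public static void main(String[] args) {
        long segment = 20 * MB;
        long smallIndex = 100 * MB;     // e.g. one tenant among many small indexes
        long largeIndex = 10_000 * MB;  // a single large index

        // Ratio rule: the same 20MB segment gets a different answer per index.
        System.out.println(compoundByRatio(segment, smallIndex, 0.10)); // false (20MB >= 10MB)
        System.out.println(compoundByRatio(segment, largeIndex, 0.10)); // true  (20MB < 1000MB)

        // Fixed 64MB threshold: the answer depends only on the segment itself.
        System.out.println(compoundByThreshold(segment, 64 * MB));      // true
    }
}
```

With the ratio, a 20MB segment in a 100MB index is written as separate files, so many small indexes can still add up to many small files; with the fixed threshold it is always compound.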
Ideally we would have a single switch on IndexWriterConfig instead of having flushes and merges independently make decisions about whether a segment qualifies for being compound.
I'm also wondering if we need to keep the current approach that is based on a ratio, or if only supporting a fixed threshold would be good enough.