Exception raised when using FixedShingleFilter with WordDelimiterGraphFilter #14137

Open
binshengliu opened this issue Jan 14, 2025 · 0 comments

Description

Hi, I'd like to report an issue when using FixedShingleFilter together with WordDelimiterGraphFilter. An exception is raised under the following conditions:

  • The tokenizer produces a single token
  • WordDelimiterGraphFilter produces multiple tokens from it
  • FixedShingleFilter is applied afterwards

I ran into the issue when using Elasticsearch's search_as_you_type field type, which uses FixedShingleFilter.

Exception in thread "main" java.lang.IllegalArgumentException: first position increment must be > 0 (got 0) for field 'contents'
        at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1232)
        at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
        at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
        at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
        at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
        at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1477)
        at org.apache.lucene.demo.IndexFiles.indexDoc(IndexFiles.java:283)
        at org.apache.lucene.demo.IndexFiles.indexDocs(IndexFiles.java:234)
        at org.apache.lucene.demo.IndexFiles.main(IndexFiles.java:167)
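
For context, the check that rejects the document can be modeled roughly as below. This is a simplified sketch inferred from the error message, not Lucene's actual IndexingChain code. The class name PositionIncrementCheck and the method finalPosition are hypothetical, and the increment sequences are invented for illustration; they are not the actual output of the analyzer chain in this report.

```java
public class PositionIncrementCheck {

    /**
     * Simplified model (an assumption, not Lucene source): the position
     * starts at -1 and each token advances it by its position increment.
     * If the position is still negative after a token, the first token
     * must have carried an increment of 0, which is rejected.
     */
    public static int finalPosition(String field, int... increments) {
        int position = -1;
        for (int inc : increments) {
            if (inc < 0) {
                throw new IllegalArgumentException(
                    "position increment must be >= 0 (got " + inc + ")");
            }
            position += inc;
            if (position < 0) {
                throw new IllegalArgumentException(
                    "first position increment must be > 0 (got " + inc
                        + ") for field '" + field + "'");
            }
        }
        return position;
    }

    public static void main(String[] args) {
        // Well-formed stream: the first increment is 1; a later token
        // stacked at the same position may use an increment of 0.
        System.out.println(finalPosition("contents", 1, 0, 1));

        // Malformed stream: the very first token has an increment of 0.
        try {
            finalPosition("contents", 0, 1);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model, the trace above suggests that the first token reaching the indexer from the filter chain carries a position increment of 0.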

This is the change to IndexFiles.java that can trigger the exception, tested on c20e09e.

diff --git a/lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java b/lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java
index dca01f61254..e39f6e440d4 100644
--- a/lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java
+++ b/lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java
@@ -30,7 +30,11 @@ import java.nio.file.attribute.BasicFileAttributes;
 import java.util.Date;
 import java.util.Objects;
 import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.analysis.core.FlattenGraphFilterFactory;
+import org.apache.lucene.analysis.custom.CustomAnalyzer;
+import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;
+import org.apache.lucene.analysis.shingle.FixedShingleFilterFactory;
+import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
 import org.apache.lucene.demo.knn.DemoEmbeddings;
 import org.apache.lucene.demo.knn.KnnVectorDict;
 import org.apache.lucene.document.Document;
@@ -126,7 +130,12 @@ public class IndexFiles implements AutoCloseable {
       System.out.println("Indexing to directory '" + indexPath + "'...");
 
       Directory dir = FSDirectory.open(Paths.get(indexPath));
-      Analyzer analyzer = new StandardAnalyzer();
+      Analyzer analyzer = CustomAnalyzer.builder()
+              .withTokenizer(StandardTokenizerFactory.NAME)
+              .addTokenFilter(WordDelimiterGraphFilterFactory.NAME, "catenateNumbers", "1")
+              .addTokenFilter(FlattenGraphFilterFactory.NAME)
+              .addTokenFilter(FixedShingleFilterFactory.NAME)
+              .build();
       IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
 
       if (create) {

Test data:
I created a file with the following content and fed it to IndexFiles.

555,0

Version and environment details

Tested on commit c20e09e.
