
Commit 8f879aa

sobychacko authored and markpollack committed
GH-1831: Add auto-truncation support strategies when batching documents
Fixes: #1831

- Document auto-truncation configuration with high token limits
- Add integration tests for auto-truncation behavior
- Include Spring Boot and manual configuration examples
- Test large documents and batching scenarios

Enables proper use of embedding model auto-truncation while avoiding batching strategy exceptions.

Signed-off-by: Soby Chacko <[email protected]>
1 parent 11e3c8f commit 8f879aa

File tree

3 files changed: +333 −0 lines changed

spring-ai-docs/src/main/antora/modules/ROOT/pages/api/vectordbs.adoc

+95 lines

@@ -236,6 +236,101 @@
);
----

=== Working with Auto-Truncation

Some embedding models, such as Vertex AI text embedding, support an `auto_truncate` feature. When enabled, the model silently truncates text inputs that exceed the maximum size and continues processing; when disabled, it throws an explicit error for inputs that are too large.

==== Configuration for Auto-Truncation

When enabling auto-truncation, set your batching strategy's maximum input token count much higher than the model's actual limit. This prevents the batching strategy from raising exceptions for large documents and allows the embedding model to handle truncation internally.

Here is an example configuration that uses Vertex AI with auto-truncation and a custom `BatchingStrategy`, and then wires both into a `PgVectorStore`:

[source,java]
----
@Configuration
public class AutoTruncationEmbeddingConfig {

    @Bean
    public VertexAiTextEmbeddingModel vertexAiEmbeddingModel(
            VertexAiEmbeddingConnectionDetails connectionDetails) {

        VertexAiTextEmbeddingOptions options = VertexAiTextEmbeddingOptions.builder()
            .model(VertexAiTextEmbeddingOptions.DEFAULT_MODEL_NAME)
            .autoTruncate(true) // Enable auto-truncation
            .build();

        return new VertexAiTextEmbeddingModel(connectionDetails, options);
    }

    @Bean
    public BatchingStrategy batchingStrategy() {
        // Only use a high token limit if auto-truncation is enabled in your embedding model.
        // Set a much higher token count than the model actually supports
        // (e.g., 132,900 when Vertex AI supports only up to 20,000).
        return new TokenCountBatchingStrategy(
            EncodingType.CL100K_BASE,
            132900, // Artificially high limit
            0.1     // 10% reserve
        );
    }

    @Bean
    public VectorStore vectorStore(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel,
            BatchingStrategy batchingStrategy) {
        return PgVectorStore.builder(jdbcTemplate, embeddingModel)
            .batchingStrategy(batchingStrategy)
            // other properties omitted here
            .build();
    }
}
----

In this configuration:

1. The embedding model has auto-truncation enabled, allowing it to handle oversized inputs gracefully.
2. The batching strategy uses an artificially high token limit (132,900) that is much larger than the actual model limit (20,000).
3. The vector store uses the configured embedding model and the custom `BatchingStrategy` bean.
==== Why This Works

This approach works because:

1. `TokenCountBatchingStrategy` checks whether any single document exceeds the configured maximum input token count and throws an `IllegalArgumentException` if it does.
2. Setting a very high limit in the batching strategy ensures that this check never fails.
3. Documents or batches exceeding the model's real limit are silently truncated and processed by the embedding model's auto-truncation feature.
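The per-document check and batch packing described above can be sketched in plain Java. This is a hypothetical, simplified version for illustration only: the real `TokenCountBatchingStrategy` uses a jtokkit tokenizer and Spring AI `Document` objects, whereas this sketch estimates tokens as whitespace-separated words and works on plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sketch of the checks TokenCountBatchingStrategy
// performs. Tokens are estimated as whitespace-separated words (assumption);
// documents are plain strings.
public class SimpleTokenCountBatching {

    private final int maxInputTokenCount;

    public SimpleTokenCountBatching(int maxInputTokenCount) {
        this.maxInputTokenCount = maxInputTokenCount;
    }

    // Rough token estimate: one token per whitespace-separated word.
    static int estimateTokens(String text) {
        return text.isBlank() ? 0 : text.trim().split("\\s+").length;
    }

    public List<List<String>> batch(List<String> documents) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentTokens = 0;
        for (String doc : documents) {
            int tokens = estimateTokens(doc);
            // The per-document check: a single oversized document cannot be
            // split here, so it is rejected outright. Raising the configured
            // limit far above the model's real maximum makes this check
            // effectively a no-op, which the auto-truncation setup relies on.
            if (tokens > this.maxInputTokenCount) {
                throw new IllegalArgumentException("Document exceeds max token count: " + tokens);
            }
            // Close the current batch when adding this document would overflow it.
            if (currentTokens + tokens > this.maxInputTokenCount && !current.isEmpty()) {
                batches.add(current);
                current = new ArrayList<>();
                currentTokens = 0;
            }
            current.add(doc);
            currentTokens += tokens;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        SimpleTokenCountBatching strategy = new SimpleTokenCountBatching(5);
        // Two 3-token documents do not fit together under a 5-token limit,
        // so they land in separate batches.
        System.out.println(strategy.batch(List.of("a b c", "d e f")).size());
    }
}
```

Under this sketch, a limit of 132,900 passes a ~25,000-token document straight through to the model, which then truncates it itself.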
==== Best Practices

When using auto-truncation:

- Set the batching strategy's maximum input token count at least 5-10x larger than the model's actual limit to avoid premature exceptions from the batching strategy.
- Monitor your logs for truncation warnings from the embedding model (note: not all models log truncation events).
- Consider the implications of silent truncation for your embedding quality.
- Test with sample documents to ensure truncated embeddings still meet your requirements.
- Document this configuration for future maintainers, as it is non-standard.

CAUTION: While auto-truncation prevents errors, it can result in incomplete embeddings. Important information at the end of long documents may be lost. If your application requires all content to be embedded, split documents into smaller chunks before embedding.
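The chunking approach recommended in the caution above can be sketched with a minimal word-based splitter. This is purely illustrative and treats one word as one token (assumption); production code would use a token-aware splitter such as Spring AI's `TokenTextSplitter`.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal chunker: splits text into chunks of at most maxWords
// whitespace-separated words, so no chunk exceeds the model limit and
// nothing is silently truncated.
public class WordChunker {

    public static List<String> chunk(String text, int maxWords) {
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String word : words) {
            // Flush the current chunk once it reaches the word limit.
            if (count == maxWords) {
                chunks.add(current.toString());
                current.setLength(0);
                count = 0;
            }
            if (count > 0) {
                current.append(' ');
            }
            current.append(word);
            count++;
        }
        if (count > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Five words with a two-word limit yield three chunks.
        System.out.println(chunk("one two three four five", 2));
    }
}
```

Each chunk can then be embedded as its own `Document`, trading one partially embedded document for several fully embedded ones.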
==== Spring Boot Auto-Configuration

If you're using Spring Boot auto-configuration, you must provide a custom `BatchingStrategy` bean to override the default one that ships with Spring AI:

[source,java]
----
@Bean
public BatchingStrategy customBatchingStrategy() {
    // This bean will override the default BatchingStrategy
    return new TokenCountBatchingStrategy(
        EncodingType.CL100K_BASE,
        132900, // Much higher than the model's actual limit
        0.1
    );
}
----

The presence of this bean in the application context automatically replaces the default batching strategy used by all vector stores.

=== Custom Implementation

While `TokenCountBatchingStrategy` provides a robust default implementation, you can customize the batching strategy to fit your specific needs.
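As a starting point for such a customization, here is a sketch of a fixed-count strategy. It mirrors the shape of Spring AI's `BatchingStrategy` contract (a `batch` method turning a flat document list into a list of batches) but, to stay self-contained, operates on plain strings rather than Spring AI `Document` objects.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative custom strategy: groups documents into batches of a fixed
// size, ignoring token counts entirely. Useful when the backend enforces a
// per-request item limit rather than a token limit.
public class FixedCountBatching {

    private final int batchSize;

    public FixedCountBatching(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    public List<List<String>> batch(List<String> documents) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < documents.size(); i += this.batchSize) {
            // subList gives a view of [i, i + batchSize), clamped at the end.
            batches.add(documents.subList(i, Math.min(i + this.batchSize, documents.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        FixedCountBatching strategy = new FixedCountBatching(2);
        // Three documents with batch size two yield two batches.
        System.out.println(strategy.batch(List.of("a", "b", "c")));
    }
}
```

A real implementation of the Spring AI interface would apply the same partitioning to `List<Document>` and be exposed as a `@Bean` so vector stores pick it up.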

vector-stores/spring-ai-pgvector-store/pom.xml

+7 lines

@@ -77,6 +77,13 @@
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.springframework.ai</groupId>
			<artifactId>spring-ai-vertex-ai-embedding</artifactId>
			<version>${project.parent.version}</version>
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.springframework.ai</groupId>
+231 lines

@@ -0,0 +1,231 @@
/*
 * Copyright 2025-2025 the original author or authors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * https://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.springframework.ai.vectorstore.pgvector;

import java.util.ArrayList;
import java.util.List;

import javax.sql.DataSource;

import com.knuddels.jtokkit.api.EncodingType;
import com.zaxxer.hikari.HikariDataSource;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.BatchingStrategy;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.embedding.TokenCountBatchingStrategy;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.ai.vertexai.embedding.VertexAiEmbeddingConnectionDetails;
import org.springframework.ai.vertexai.embedding.text.VertexAiTextEmbeddingModel;
import org.springframework.ai.vertexai.embedding.text.VertexAiTextEmbeddingOptions;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringBootConfiguration;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.jdbc.DataSourceAutoConfiguration;
import org.springframework.boot.autoconfigure.jdbc.DataSourceProperties;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.test.context.runner.ApplicationContextRunner;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Primary;
import org.springframework.jdbc.core.JdbcTemplate;

import static org.assertj.core.api.Assertions.assertThat;
import static org.junit.Assert.assertThrows;
import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

/**
 * Integration tests for PgVectorStore with auto-truncation enabled. Tests the behavior
 * when using artificially high token limits with Vertex AI's auto-truncation feature.
 *
 * @author Soby Chacko
 */
@Testcontainers
@EnabledIfEnvironmentVariable(named = "VERTEX_AI_GEMINI_PROJECT_ID", matches = ".*")
@EnabledIfEnvironmentVariable(named = "VERTEX_AI_GEMINI_LOCATION", matches = ".*")
public class PgVectorStoreAutoTruncationIT {

	private static final int ARTIFICIAL_TOKEN_LIMIT = 132_900;

	@Container
	@SuppressWarnings("resource")
	static PostgreSQLContainer<?> postgresContainer = new PostgreSQLContainer<>(PgVectorImage.DEFAULT_IMAGE)
		.withUsername("postgres")
		.withPassword("postgres");

	private final ApplicationContextRunner contextRunner = new ApplicationContextRunner()
		.withUserConfiguration(PgVectorStoreAutoTruncationIT.TestApplication.class)
		.withPropertyValues("test.spring.ai.vectorstore.pgvector.distanceType=COSINE_DISTANCE",

				// JdbcTemplate configuration
				String.format("app.datasource.url=jdbc:postgresql://%s:%d/%s", postgresContainer.getHost(),
						postgresContainer.getMappedPort(5432), "postgres"),
				"app.datasource.username=postgres", "app.datasource.password=postgres",
				"app.datasource.type=com.zaxxer.hikari.HikariDataSource");

	private static void dropTable(ApplicationContext context) {
		JdbcTemplate jdbcTemplate = context.getBean(JdbcTemplate.class);
		jdbcTemplate.execute("DROP TABLE IF EXISTS vector_store");
	}

	@Test
	public void testAutoTruncationWithLargeDocument() {
		this.contextRunner.run(context -> {
			VectorStore vectorStore = context.getBean(VectorStore.class);

			// Test with a document that exceeds normal token limits but is within our
			// artificially high limit
			String largeContent = "This is a test document. ".repeat(5000); // ~25,000 tokens
			Document largeDocument = new Document(largeContent);
			largeDocument.getMetadata().put("test", "auto-truncation");

			// This should not throw an exception due to our high token limit in
			// BatchingStrategy
			assertDoesNotThrow(() -> vectorStore.add(List.of(largeDocument)));

			// Verify the document was stored
			List<Document> results = vectorStore
				.similaritySearch(SearchRequest.builder().query("test document").topK(1).build());

			assertThat(results).hasSize(1);
			Document resultDoc = results.get(0);
			assertThat(resultDoc.getMetadata()).containsEntry("test", "auto-truncation");

			// Test with multiple large documents to ensure batching still works
			List<Document> largeDocs = new ArrayList<>();
			for (int i = 0; i < 5; i++) {
				Document doc = new Document("Large content " + i + " ".repeat(4000));
				doc.getMetadata().put("batch", String.valueOf(i));
				largeDocs.add(doc);
			}

			assertDoesNotThrow(() -> vectorStore.add(largeDocs));

			// Verify all documents were processed
			List<Document> batchResults = vectorStore
				.similaritySearch(SearchRequest.builder().query("Large content").topK(5).build());

			assertThat(batchResults).hasSizeGreaterThanOrEqualTo(5);

			// Clean up
			vectorStore.delete(List.of(largeDocument.getId()));
			largeDocs.forEach(doc -> vectorStore.delete(List.of(doc.getId())));

			dropTable(context);
		});
	}

	@Test
	public void testExceedingArtificialLimit() {
		this.contextRunner.run(context -> {
			BatchingStrategy batchingStrategy = context.getBean(BatchingStrategy.class);

			// Create a document that exceeds even our artificially high limit
			String massiveContent = "word ".repeat(150000); // ~150,000 tokens (exceeds 132,900)
			Document massiveDocument = new Document(massiveContent);

			// This should throw an exception as it exceeds our configured limit
			assertThrows(IllegalArgumentException.class, () -> {
				batchingStrategy.batch(List.of(massiveDocument));
			});

			dropTable(context);
		});
	}

	@SpringBootConfiguration
	@EnableAutoConfiguration(exclude = { DataSourceAutoConfiguration.class })
	public static class TestApplication {

		@Value("${test.spring.ai.vectorstore.pgvector.distanceType}")
		PgVectorStore.PgDistanceType distanceType;

		@Value("${test.spring.ai.vectorstore.pgvector.initializeSchema:true}")
		boolean initializeSchema;

		@Value("${test.spring.ai.vectorstore.pgvector.idType:UUID}")
		PgVectorStore.PgIdType idType;

		@Bean
		public VectorStore vectorStore(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel,
				BatchingStrategy batchingStrategy) {
			return PgVectorStore.builder(jdbcTemplate, embeddingModel)
				.dimensions(PgVectorStore.INVALID_EMBEDDING_DIMENSION)
				.batchingStrategy(batchingStrategy)
				.idType(this.idType)
				.distanceType(this.distanceType)
				.initializeSchema(this.initializeSchema)
				.indexType(PgVectorStore.PgIndexType.HNSW)
				.removeExistingVectorStoreTable(true)
				.build();
		}

		@Bean
		public JdbcTemplate myJdbcTemplate(DataSource dataSource) {
			return new JdbcTemplate(dataSource);
		}

		@Bean
		@Primary
		@ConfigurationProperties("app.datasource")
		public DataSourceProperties dataSourceProperties() {
			return new DataSourceProperties();
		}

		@Bean
		public HikariDataSource dataSource(DataSourceProperties dataSourceProperties) {
			return dataSourceProperties.initializeDataSourceBuilder().type(HikariDataSource.class).build();
		}

		@Bean
		public VertexAiTextEmbeddingModel vertexAiEmbeddingModel(VertexAiEmbeddingConnectionDetails connectionDetails) {
			VertexAiTextEmbeddingOptions options = VertexAiTextEmbeddingOptions.builder()
				.model(VertexAiTextEmbeddingOptions.DEFAULT_MODEL_NAME)
				// Although this might be the default in Vertex, we explicitly set this
				// to true to ensure auto-truncation is on, as this is crucial for the
				// verifications in this test suite.
				.autoTruncate(true)
				.build();

			return new VertexAiTextEmbeddingModel(connectionDetails, options);
		}

		@Bean
		public VertexAiEmbeddingConnectionDetails connectionDetails() {
			return VertexAiEmbeddingConnectionDetails.builder()
				.projectId(System.getenv("VERTEX_AI_GEMINI_PROJECT_ID"))
				.location(System.getenv("VERTEX_AI_GEMINI_LOCATION"))
				.build();
		}

		@Bean
		BatchingStrategy pgVectorStoreBatchingStrategy() {
			return new TokenCountBatchingStrategy(EncodingType.CL100K_BASE, ARTIFICIAL_TOKEN_LIMIT, 0.1);
		}

	}

}
