@@ -16,6 +16,7 @@ This module provides document classification capabilities for the IDP Accelerato
- Structured data models for results
- Grouping of pages into sections by classification
- Comprehensive error handling and retry mechanisms
- **DynamoDB caching for resilient page-level classification**

## Usage Example

@@ -226,6 +227,136 @@ def handler(event, context):
- `ClassificationResult`: Overall result of a classification operation
- `Document`: Core document data model used throughout the IDP pipeline

## DynamoDB Caching for Resilient Classification

The classification service supports optional DynamoDB caching to improve efficiency and resilience when processing multi-page documents. It targets throttling scenarios in which some pages succeed while others fail: on retry, pages that were already classified successfully are not reclassified.

### How It Works

1. **Cache Check**: Before processing, the service checks for cached classification results for the document
2. **Selective Processing**: Only pages without cached results are classified
3. **Exception-Safe Caching**: Successful page results are cached even when other pages fail
4. **Retry Efficiency**: Subsequent retries only process previously failed pages

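The four steps above can be sketched with an in-memory dictionary standing in for the DynamoDB table. This is a minimal illustration under stated assumptions, not the service's actual implementation: `classify_pages_with_cache`, `classify_fn`, and the `cache` dictionary are hypothetical names introduced here.

```python
from typing import Callable, Dict, Iterable


def classify_pages_with_cache(
    cache_key: str,
    page_ids: Iterable[str],
    classify_fn: Callable[[str], str],
    cache: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Classify pages, skipping cached ones and caching successes even on failure.

    All names are hypothetical; `cache` stands in for the DynamoDB table.
    """
    # 1. Cache check: load any results already stored for this document.
    results = dict(cache.get(cache_key, {}))
    errors = {}
    try:
        # 2. Selective processing: only classify pages without a cached result.
        for page_id in page_ids:
            if page_id in results:
                continue
            try:
                results[page_id] = classify_fn(page_id)
            except Exception as exc:  # e.g. a throttling error from the backend
                errors[page_id] = exc
    finally:
        # 3. Exception-safe caching: persist successes before any error is raised.
        cache[cache_key] = dict(results)
    if errors:
        raise RuntimeError(f"Classification failed for pages: {sorted(errors)}")
    return results
```

Calling the function a second time with the same `cache_key` covers step 4: only the pages that are missing from the cache are passed to `classify_fn`.
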
### Configuration

#### Via Constructor Parameter
```python
from idp_common import classification, get_config

config = get_config()
service = classification.ClassificationService(
    region="us-east-1",
    config=config,
    backend="bedrock",
    cache_table="classification-cache-table"  # Enable caching
)
```

#### Via Environment Variable
```bash
export CLASSIFICATION_CACHE_TABLE=classification-cache-table
```

```python
from idp_common import classification, get_config

config = get_config()

# The cache table is detected automatically from the environment
service = classification.ClassificationService(
    region="us-east-1",
    config=config,
    backend="bedrock"
)
```

### DynamoDB Table Schema

The cache uses the following DynamoDB table structure:

- **Partition Key (PK)**: `classcache#{document_id}#{workflow_execution_arn}`
- **Sort Key (SK)**: `none` (a constant value)
- **Attributes**:
  - `page_classifications` (String): JSON-encoded successful page results
  - `cached_at` (String): Unix timestamp (in seconds) of cache creation
  - `document_id` (String): Document identifier
  - `workflow_execution_arn` (String): Workflow execution ARN
  - `ExpiresAfter` (Number): TTL attribute for automatic cleanup (24 hours after creation)

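However the table is provisioned in your deployment, its shape can be expressed as the arguments to the DynamoDB `CreateTable` and `UpdateTimeToLive` operations. The sketch below uses boto3's `create_table`, `update_time_to_live`, and the `table_exists` waiter. The function names and the on-demand billing mode are assumptions made for illustration.

```python
from typing import Any, Dict


def cache_table_definition(table_name: str) -> Dict[str, Any]:
    """Keyword arguments for CreateTable (a sketch, not the project's template)."""
    return {
        "TableName": table_name,
        "KeySchema": [
            {"AttributeName": "PK", "KeyType": "HASH"},   # partition key
            {"AttributeName": "SK", "KeyType": "RANGE"},  # sort key
        ],
        # Only key attributes are declared; all other attributes are schemaless.
        "AttributeDefinitions": [
            {"AttributeName": "PK", "AttributeType": "S"},
            {"AttributeName": "SK", "AttributeType": "S"},
        ],
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
    }


def ttl_specification(table_name: str) -> Dict[str, Any]:
    """Keyword arguments for UpdateTimeToLive, enabling TTL on ExpiresAfter."""
    return {
        "TableName": table_name,
        "TimeToLiveSpecification": {
            "Enabled": True,
            "AttributeName": "ExpiresAfter",
        },
    }


def create_cache_table(table_name: str) -> None:
    """Create the table and enable TTL. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("dynamodb")
    client.create_table(**cache_table_definition(table_name))
    client.get_waiter("table_exists").wait(TableName=table_name)
    client.update_time_to_live(**ttl_specification(table_name))
```
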
#### Example DynamoDB Item
```json
{
  "PK": "classcache#doc-123#arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
  "SK": "none",
  "page_classifications": "{\"1\":{\"doc_type\":\"invoice\",\"confidence\":1.0,\"metadata\":{\"metering\":{...}},\"image_uri\":\"s3://...\",\"text_uri\":\"s3://...\",\"raw_text_uri\":\"s3://...\"},\"2\":{...}}",
  "cached_at": "1672531200",
  "document_id": "doc-123",
  "workflow_execution_arn": "arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
  "ExpiresAfter": 1672617600
}
```

### Benefits

- **Cost Reduction**: Avoids redundant API calls to Bedrock/SageMaker for already-classified pages
- **Improved Resilience**: Handles partial failures gracefully during concurrent processing
- **Faster Retries**: Subsequent attempts only process failed pages, not the entire document
- **Automatic Cleanup**: TTL ensures cache entries don't accumulate indefinitely
- **Thread Safety**: Safe for concurrent page processing within the same document

### Example: Resilient Processing Flow

```python
from idp_common import classification, get_config
from idp_common.models import Document

config = get_config()
service = classification.ClassificationService(
    region="us-east-1",
    config=config,
    backend="bedrock",
    cache_table="classification-cache-table"
)

# Create document with 5 pages
document = Document(
    id="doc-123",
    workflow_execution_arn="arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
    pages={
        "1": {...},
        "2": {...},
        "3": {...},
        "4": {...},
        "5": {...}
    }
)

try:
    # First attempt: pages 1,2,4 succeed, pages 3,5 fail due to throttling
    document = service.classify_document(document)
except Exception as e:
    # Pages 1,2,4 are cached automatically before exception is raised
    print(f"Classification failed: {e}")

try:
    # Retry: only pages 3,5 are processed (1,2,4 loaded from cache)
    document = service.classify_document(document)
    print("Document classified successfully on retry")
except Exception as e:
    print(f"Retry failed: {e}")
```

### Cache Lifecycle

1. **Creation**: Cache entries are written when `classify_document()` finishes, whether it completes successfully or raises an exception
2. **Retrieval**: The cache is checked at the start of each `classify_document()` call
3. **Update**: Cache entries are updated with new successful results from each processing attempt
4. **Expiration**: Entries automatically expire after 24 hours via DynamoDB TTL

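The update and expiration steps can be sketched as two small functions. DynamoDB TTL removes expired items in the background rather than at the exact expiry time, so a reader of the cache may want to check `ExpiresAfter` itself. Whether the service does so is not stated here, and both helper names below are hypothetical.

```python
import time
from typing import Any, Dict, Optional


def merge_page_results(
    cached: Dict[str, Any], new: Dict[str, Any]
) -> Dict[str, Any]:
    """Combine cached page results with new successes; new results win on conflict."""
    merged = dict(cached)
    merged.update(new)
    return merged


def is_expired(item: Dict[str, Any], now: Optional[float] = None) -> bool:
    """Return True once the item's ExpiresAfter timestamp has passed."""
    now = time.time() if now is None else now
    return now >= item["ExpiresAfter"]
```
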
### Important Notes

- Caching only applies to the `classify_document()` method, not to individual `classify_page()` calls
- Cache entries are scoped to a specific combination of document and workflow execution
- Only successful page classifications (without errors in metadata) are cached
- The cache is transparent: existing code continues to work without modification

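The rule that only successful pages are cached amounts to a filter over the per-page results. The sketch below assumes a failure is flagged by an `"error"` key inside each result's `metadata`. That key name and the helper `cacheable_pages` are illustrative assumptions and may differ from how the service actually marks errors.

```python
from typing import Any, Dict


def cacheable_pages(
    page_results: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Keep only results whose metadata carries no error marker.

    Assumption: failed pages have an "error" key in their `metadata` dict.
    """
    return {
        page_id: result
        for page_id, result in page_results.items()
        if "error" not in (result.get("metadata") or {})
    }
```
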
## Backend Options

### Bedrock Backend