feat: add RAG chunk testing script with LangChain splitters #1

Open
2561056571 wants to merge 1 commit into main from wegent/rag-chunk-test-langchain

Conversation

@2561056571
Owner

Summary

  • Add a standalone Python script for inspecting how documents are chunked for RAG indexing, using LangChain's MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter
  • Use DoclingReader to read mixed-format files (PDF, Markdown, TXT, DOCX, etc.) from a local directory
  • Implement a two-stage chunking strategy (header-based, then size-based) for better document segmentation

Changes

backend/rag_chunk_test.py:

  • Created RAGChunkTester class with configurable chunk parameters
  • Integrated DoclingReader for multi-format document reading (PDF, MD, TXT, DOCX, HTML)
  • Implemented two-stage chunking:
    1. MarkdownHeaderTextSplitter for H1, H2, H3 header-based splitting
    2. RecursiveCharacterTextSplitter for fine-grained chunking (default: chunk_size=1024, chunk_overlap=50)
  • Added comprehensive metadata display for each chunk:
    • Source file path
    • Header hierarchy information
    • Chunk content
    • Character length
  • Added a command-line interface with configurable directory path and chunk parameters
  • Added error handling for file reading and processing
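
The two-stage flow and the per-chunk metadata listed above can be illustrated with a minimal, standard-library-only sketch. This is not the code in backend/rag_chunk_test.py and it does not call LangChain: the header pass is a simplified stand-in for MarkdownHeaderTextSplitter (it does not skip fenced code blocks), and the size pass uses plain fixed-size character windows rather than the separator-aware logic of RecursiveCharacterTextSplitter. The function names and the dictionary keys are hypothetical.

```python
import re
from typing import Dict, List

# Matches H1-H3 Markdown headers such as "## Setup".
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(text: str) -> List[Dict]:
    """Stage 1: group body lines under the H1/H2/H3 headers above them."""
    sections: List[Dict] = []
    headers: Dict[str, str] = {}
    body: List[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(headers), "content": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces headers at the same or deeper level.
            for stale in range(level, 4):
                headers.pop(f"Header {stale}", None)
            headers[f"Header {level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


def split_by_size(content: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Stage 2: fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(content):
        chunks.append(content[start:start + chunk_size])
        if start + chunk_size >= len(content):
            break  # the window already reached the end of the section
        start += step
    return chunks


def chunk_document(text: str, source: str,
                   chunk_size: int = 1024, chunk_overlap: int = 50) -> List[Dict]:
    """Run both stages and attach source, headers, and length to each chunk."""
    chunks: List[Dict] = []
    for section in split_by_headers(text):
        for piece in split_by_size(section["content"], chunk_size, chunk_overlap):
            chunks.append({
                "source": source,
                "headers": section["headers"],
                "content": piece,
                "length": len(piece),
            })
    return chunks
```

Running the header pass first means every size-limited chunk inherits the header hierarchy of its section, which is what makes the "Header hierarchy information" field meaningful for each chunk.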

Usage

# Default usage (reads from ./documents directory)
python backend/rag_chunk_test.py

# Specify custom directory
python backend/rag_chunk_test.py ./my_documents

# Custom chunk parameters
python backend/rag_chunk_test.py ./documents --chunk-size 2048 --chunk-overlap 100
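
For reference, a command line with this shape could be parsed roughly as follows. This is a sketch based only on the invocations above, not the parser in the script: the flag names and defaults come from this description, while the function names, help strings, and the overlap check are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Preview how documents are chunked for RAG indexing."
    )
    # Optional positional argument: falls back to ./documents when omitted.
    parser.add_argument("directory", nargs="?", default="./documents",
                        help="directory containing the documents to chunk")
    parser.add_argument("--chunk-size", type=int, default=1024,
                        help="maximum characters per chunk")
    parser.add_argument("--chunk-overlap", type=int, default=50,
                        help="characters shared between adjacent chunks")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed sanity check: an overlap as large as the chunk makes no progress.
    if args.chunk_overlap >= args.chunk_size:
        parser.error("--chunk-overlap must be smaller than --chunk-size")
    return args
```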

Test plan

  • Verify script runs without errors on a directory containing markdown files
  • Verify script handles PDF files correctly using DoclingReader
  • Verify two-stage chunking produces expected results
  • Verify header metadata is correctly extracted and displayed
  • Verify chunk sizes respect the specified chunk_size parameter
  • Verify chunk overlap is applied correctly
  • Test with mixed-format files (PDF, MD, TXT) in the same directory
  • Verify error handling for non-existent directories
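
Some of these checks could be automated with small helpers like the ones below. This is a hypothetical sketch that operates on plain chunk strings; none of these function names exist in the PR. Note that a separator-aware splitter may produce less overlap than the configured chunk_overlap, so the overlap helper is better used to confirm an upper bound than an exact value.

```python
from pathlib import Path
from typing import List


def require_directory(path: str) -> Path:
    """Fail early with a clear error when the input directory is missing."""
    directory = Path(path)
    if not directory.is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")
    return directory


def oversized_chunks(chunks: List[str], chunk_size: int) -> List[int]:
    """Return the indexes of chunks longer than chunk_size."""
    return [i for i, chunk in enumerate(chunks) if len(chunk) > chunk_size]


def shared_overlap(previous: str, current: str) -> int:
    """Length of the longest suffix of previous that is a prefix of current."""
    limit = min(len(previous), len(current))
    for size in range(limit, 0, -1):
        if previous.endswith(current[:size]):
            return size
    return 0
```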

Add a standalone Python script for inspecting how documents are chunked for RAG indexing, using LangChain's MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter.

Features:
- Use DoclingReader to read mixed-format files (PDF, Markdown, TXT, DOCX, etc.) from a local directory
- Two-stage chunking strategy:
  1. MarkdownHeaderTextSplitter for H1, H2, H3 header-based splitting
  2. RecursiveCharacterTextSplitter for fine-grained chunking (chunk_size=1024, chunk_overlap=50)
- Display comprehensive metadata for each chunk:
  - Source file path
  - Header hierarchy information
  - Chunk content
  - Character length
- Command-line interface with configurable parameters
- Support for custom directory path via argument

Usage:
  python rag_chunk_test.py [directory_path]
  python rag_chunk_test.py ./documents