feat: add RAG chunk testing script with LangChain splitters by 2561056571 · Pull Request #1 · 2561056571/wegent-evaluate

2561056571 · 2026-02-10T07:22:23Z

Summary

- Add a standalone Python script for inspecting how chunking behaves during RAG indexing, using LangChain's MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter
- Use DoclingReader to read mixed-format files (PDF, Markdown, TXT, DOCX, etc.) from a local directory
- Implement a two-stage chunking strategy: split on headers first, then split by size
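
As a rough illustration of the "read mixed-format files from a local directory" step, the sketch below only collects candidate files. It does not call DoclingReader. The `SUPPORTED_SUFFIXES` set, the `collect_documents` name, and the recursive search are assumptions made for this example, not details taken from `rag_chunk_test.py`.

```python
from pathlib import Path

# Hypothetical extension list, based on the formats named in this PR.
SUPPORTED_SUFFIXES = {".pdf", ".md", ".txt", ".docx", ".html"}


def collect_documents(directory: str) -> list[Path]:
    """Return supported files under `directory`, sorted for stable output."""
    root = Path(directory)
    if not root.is_dir():
        # Fail early so a mistyped path is not reported as "0 documents".
        raise FileNotFoundError(f"Directory not found: {root}")
    return sorted(
        path
        for path in root.rglob("*")  # assumption: subdirectories are included
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Each returned path would then be handed to the document reader; files with other extensions are skipped silently in this sketch.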

Changes

backend/rag_chunk_test.py:

- Created RAGChunkTester class with configurable chunk parameters
- Integrated DoclingReader for multi-format document reading (PDF, MD, TXT, DOCX, HTML)
- Implemented two-stage chunking:
  1. MarkdownHeaderTextSplitter for H1, H2, H3 header-based splitting
  2. RecursiveCharacterTextSplitter for fine-grained chunking (default: chunk_size=1024, chunk_overlap=50)
- Added comprehensive metadata display for each chunk:
  - Source file path
  - Header hierarchy information
  - Chunk content
  - Character length
- Added a command-line interface with configurable directory path and chunk parameters
- Added error handling for file reading and processing
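
The two-stage idea can be sketched with the standard library alone. This is a simplified approximation, not LangChain's implementation and not the code in this PR: stage 1 groups text under H1 to H3 headers, and stage 2 cuts each section into fixed-size windows with overlap, whereas RecursiveCharacterTextSplitter prefers to break on separators such as paragraph boundaries. The function names and the `Header N` metadata keys are assumptions chosen for the example.

```python
import re

# Matches H1-H3 only; deeper headers stay in the body text.
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Stage 1: group body lines under the active H1-H3 header hierarchy."""
    sections: list[dict] = []
    headers: dict[str, str] = {}
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            sections.append({"text": text, "metadata": dict(headers)})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header invalidates headers at the same or deeper level.
            for stale in range(level, 4):
                headers.pop(f"Header {stale}", None)
            headers[f"Header {level}"] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections


def split_by_size(text: str, chunk_size: int = 1024, chunk_overlap: int = 50) -> list[str]:
    """Stage 2: fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += chunk_size - chunk_overlap
    return chunks


def two_stage_chunks(markdown: str, source: str,
                     chunk_size: int = 1024, chunk_overlap: int = 50) -> list[dict]:
    """Run both stages and attach the metadata fields listed in this PR."""
    chunks = []
    for section in split_by_headers(markdown):
        for piece in split_by_size(section["text"], chunk_size, chunk_overlap):
            chunks.append({
                "source": source,
                "headers": section["metadata"],
                "content": piece,
                "length": len(piece),
            })
    return chunks
```

One known limitation of this sketch: a line starting with `#` inside a fenced code block would be treated as a header.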

Usage

# Default usage (reads from ./documents directory)
python backend/rag_chunk_test.py

# Specify custom directory
python backend/rag_chunk_test.py ./my_documents

# Custom chunk parameters
python backend/rag_chunk_test.py ./documents --chunk-size 2048 --chunk-overlap 100
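
For reference, the invocations above are consistent with an `argparse` setup along these lines. This is a minimal sketch: the positional argument name, the help strings, and the `build_parser` function are assumptions, while the flag names and defaults come from the usage and the Changes section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Inspect RAG chunking results for a directory of documents."
    )
    # Optional positional argument, so running with no arguments still works.
    parser.add_argument("directory", nargs="?", default="./documents",
                        help="directory containing the documents to chunk")
    parser.add_argument("--chunk-size", type=int, default=1024,
                        help="maximum characters per chunk")
    parser.add_argument("--chunk-overlap", type=int, default=50,
                        help="characters shared between consecutive chunks")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.directory, args.chunk_size, args.chunk_overlap)
```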

Test plan

- Verify script runs without errors on a directory containing markdown files
- Verify script handles PDF files correctly using DoclingReader
- Verify two-stage chunking produces expected results
- Verify header metadata is correctly extracted and displayed
- Verify chunk sizes respect the specified chunk_size parameter
- Verify chunk overlap is applied correctly
- Test with mixed format files (PDF, MD, TXT) in the same directory
- Verify error handling for non-existent directories
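
The chunk-size and overlap items could be checked with small helpers like the ones below, which work on plain chunk strings. Both helper names are hypothetical. Note that a separator-aware splitter such as RecursiveCharacterTextSplitter may produce less overlap than `chunk_overlap`, so a reasonable expectation is that the measured overlap does not exceed the configured value, not that it equals it.

```python
def oversized_chunks(chunks: list[str], chunk_size: int) -> list[int]:
    """Return the indexes of chunks longer than `chunk_size` characters."""
    return [i for i, chunk in enumerate(chunks) if len(chunk) > chunk_size]


def shared_overlap(previous: str, following: str) -> int:
    """Length of the longest suffix of `previous` that is a prefix of `following`."""
    limit = min(len(previous), len(following))
    for size in range(limit, 0, -1):
        if previous.endswith(following[:size]):
            return size
    return 0
```

A test could then assert that `oversized_chunks(...)` is empty and that `shared_overlap(a, b) <= chunk_overlap` for each consecutive pair of chunks from the same section.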

Add a standalone Python script for testing RAG indexing chunk effects using LangChain's MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter.

Features:
- Use DoclingReader to read mixed format files (PDF, Markdown, TXT, DOCX, etc.) from local directory
- Two-stage chunking strategy:
  1. MarkdownHeaderTextSplitter for H1, H2, H3 header-based splitting
  2. RecursiveCharacterTextSplitter for fine-grained chunking (chunk_size=1024, chunk_overlap=50)
- Display comprehensive metadata for each chunk:
  - Source file path
  - Header hierarchy information
  - Chunk content
  - Character length
- Command-line interface with configurable parameters
- Support for custom directory path via argument

Usage:
python rag_chunk_test.py [directory_path]
python rag_chunk_test.py ./documents

2561056571 wants to merge 1 commit into main from wegent/rag-chunk-test-langchain
