This repository provides a comprehensive set of tools for natural language processing, text chunking, semantic analysis, and question answering with RAG (Retrieval-Augmented Generation) capabilities. It's designed to process, analyze, and extract meaningful information from text documents, particularly focusing on Spanish-language texts. The toolkit implements a sophisticated hybrid search system that combines traditional full-text search with modern vector-based semantic search, delivering highly relevant results for complex queries.
Conceptual details are covered in the accompanying Medium post.
The core functionality in `tools.py` implements a complete RAG (Retrieval-Augmented Generation) pipeline with the following components:
The system includes specialized functions for handling Spanish text, such as `normalize_spanish_text()`, which removes accents and normalizes special characters through character mapping. This ensures consistent text representation regardless of the accent variations common in Spanish-language texts. The text processing pipeline also includes `format_text_with_line_breaks()` for proper text formatting and `process_text_into_chunks()`, which uses regex patterns and language structure rules to divide text into semantically coherent units.
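For illustration, here is a minimal sketch of the same accent-stripping idea using Python's standard `unicodedata` module; the repository's actual implementation uses an explicit character map, so treat this as an approximation:

```python
import unicodedata

def strip_spanish_accents(text: str) -> str:
    """Sketch of accent removal via Unicode decomposition.

    Decomposes characters (e.g. 'a' with acute -> 'a' + combining accent) and
    drops the combining marks. Note this also maps 'ñ' -> 'n'; the repository's
    explicit character map may handle such cases differently.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_spanish_accents("canción única"))  # -> 'cancion unica'
```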
Text is broken into semantically meaningful chunks through a multi-stage process:
- Initial chunking using paragraph and sentence boundaries
- Refinement using semantic similarity thresholds, specifically detecting significant content shifts at the 95th percentile of cosine distance between adjacent chunk embeddings
- The `SemanticSplitGenerator.get_breakpoints()` method identifies these semantic boundaries by analyzing embedding distance patterns: wherever the cosine distance between adjacent embeddings exceeds the threshold, a topic or content transition is assumed (see the sketch after this list)
- Further semantic analysis using embedding-based clustering to identify thematic relationships
- The `process_text_into_chunks()` function implements an algorithm that joins lines where the next line doesn't start with an uppercase letter, splits on double newlines, and further divides text where a period is followed by an uppercase letter, preserving the logical document structure
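As a rough illustration of the breakpoint logic described above (function names and array shapes here are ours, not the repository's exact signatures):

```python
import numpy as np

def consecutive_cosine_distances(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance between each chunk embedding and the next one."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = np.sum(normed[:-1] * normed[1:], axis=1)
    return 1.0 - similarities

def find_breakpoints(embeddings: np.ndarray, percentile: float = 95.0) -> np.ndarray:
    """Indices where the distance to the next chunk crosses the percentile
    threshold, marking a likely topic or content transition."""
    distances = consecutive_cosine_distances(embeddings)
    threshold = np.percentile(distances, percentile)
    return np.where(distances > threshold)[0]
```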
The system employs the Okapi BM25 ranking algorithm (implemented in `extract_keywords_with_bm25()`) to identify the most relevant keywords in each text fragment:
- For each text fragment, the function tokenizes and cleans both the fragment and the full text
- Creates a corpus from paragraphs or sentences in the full text
- Applies BM25 scoring to calculate the relevance of each term in the fragment against the full corpus
- Filters Spanish stopwords using NLTK or a fallback list when NLTK is unavailable
- Parallelizes keyword extraction for efficiency using Python's `concurrent.futures` with configurable worker limits (see the sketch after this list)
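A condensed sketch of this extraction step, assuming the `rank-bm25` package; the tokenizer, stopword list, and scoring loop here are simplified stand-ins for the repository's versions:

```python
from concurrent.futures import ThreadPoolExecutor
from rank_bm25 import BM25Okapi

# Small fallback subset; the real pipeline prefers NLTK's Spanish stopwords.
SPANISH_STOPWORDS = {"de", "la", "el", "en", "y", "que", "los", "las"}

def tokenize(text: str) -> list[str]:
    return [t for t in text.lower().split() if t.isalpha() and t not in SPANISH_STOPWORDS]

def extract_keywords_with_bm25(fragment: str, full_text: str, top_n: int = 10) -> list[str]:
    # Build a corpus from the full text's paragraphs, then score each
    # candidate term from the fragment against that corpus.
    corpus = [tokenize(p) for p in full_text.split("\n\n") if p.strip()]
    bm25 = BM25Okapi(corpus)
    candidates = set(tokenize(fragment))
    scored = {tok: bm25.get_scores([tok]).max() for tok in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

def extract_keywords_from_fragments(fragments, full_text, top_n=10, max_workers=4):
    # Parallel extraction across fragments with a configurable worker limit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda f: extract_keywords_with_bm25(f, full_text, top_n), fragments))
```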
The system uses PostgreSQL with the pgvector extension to store and retrieve vector embeddings:
- `initialize_vector_database()` sets up PostgreSQL with the pgvector extension and creates the necessary tables
- `check_chunks_table()` ensures the table structure uses vector dimensions matching the embedding model
- The database schema includes dedicated columns for:
  - Raw text content (`content`)
  - Vector embeddings (`content_vector`, using pgvector's `vector` type)
  - A text search index (`content_tsvector`, a generated column using PostgreSQL's `to_tsvector('spanish', content)`)
  - Metadata in JSONB format (storing keywords, timestamps, etc.)
- Creates optimal indices for both vector and text search (see the schema sketch after this list):
  - An HNSW (Hierarchical Navigable Small World) index for efficient vector search with configurable parameters (`ef_construction = 200`, `m = 16`)
  - A GIN index for PostgreSQL's full-text search on the tsvector column
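A sketch of what this schema and its indices could look like as DDL run through `psycopg2`; the exact statements in `initialize_vector_database()` may differ (the table name, dimensions, and index parameters here simply follow the description above):

```python
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id               SERIAL PRIMARY KEY,
    content          TEXT NOT NULL,
    content_vector   vector(1024),  -- dimension must match the embedding model
    content_tsvector tsvector GENERATED ALWAYS AS (to_tsvector('spanish', content)) STORED,
    metadata         JSONB          -- keywords, timestamps, etc.
);

-- HNSW index for approximate nearest-neighbour vector search.
CREATE INDEX IF NOT EXISTS chunks_vector_idx
    ON chunks USING hnsw (content_vector vector_cosine_ops)
    WITH (ef_construction = 200, m = 16);

-- GIN index for Spanish full-text search.
CREATE INDEX IF NOT EXISTS chunks_tsv_idx
    ON chunks USING gin (content_tsvector);
"""

with psycopg2.connect(host="localhost", dbname="workshop_rag",
                      user="postgres", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```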
The `hybrid_search()` function implements a sophisticated dual-retrieval approach (sketched after this list):

- **Vector Similarity Search**:
  - Converts the query text to a vector embedding using OpenAI-compatible models
  - Uses pgvector's vector distance operator (`<=>`) to find semantically similar documents
  - Calculates a normalized similarity score (0-100) based on vector cosine distance
- **Full-Text Keyword Search**:
  - Extracts keywords from the query after removing Spanish stopwords
  - Uses PostgreSQL's `websearch_to_tsquery('spanish', keywords)` to create a text search query
  - Employs `ts_rank_cd` with the custom weight configuration `'{0.1, 0.2, 0.4, 1.0}'` to prioritize document sections
  - Boosts scores based on exact keyword matches with context-aware weighting
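Roughly, the two retrieval legs map to SQL like the following (a sketch; the 0-100 normalization and the exact-match boosting inside `hybrid_search()` are paraphrased from the description above, not copied from the source):

```python
def vector_leg(cur, query_embedding, top_k=20):
    """query_embedding: a pgvector-compatible string, e.g. '[0.1, 0.2, ...]'."""
    # pgvector's <=> operator returns cosine distance in [0, 2];
    # map it onto a 0-100 similarity score (one plausible normalization).
    cur.execute(
        """
        SELECT id, content,
               (1 - (content_vector <=> %s::vector) / 2) * 100 AS score
        FROM chunks
        ORDER BY content_vector <=> %s::vector
        LIMIT %s
        """,
        (query_embedding, query_embedding, top_k),
    )
    return cur.fetchall()

def text_leg(cur, keywords, top_k=20):
    # The weight array gives {D, C, B, A} label-class weights to ts_rank_cd.
    cur.execute(
        """
        SELECT id, content,
               ts_rank_cd('{0.1, 0.2, 0.4, 1.0}'::float4[], content_tsvector,
                          websearch_to_tsquery('spanish', %s)) AS score
        FROM chunks
        WHERE content_tsvector @@ websearch_to_tsquery('spanish', %s)
        ORDER BY score DESC
        LIMIT %s
        """,
        (keywords, keywords, top_k),
    )
    return cur.fetchall()
```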
The reciprocal_rank_fusion() function implements a modified version of the Reciprocal Rank Fusion (RRF) algorithm:
- Combines results from both vector and text searches using a weighted approach
- For each document, calculates a fusion score using the formula `1 / (k + rank)` (sketched after this section), where `k` is a constant (default: 60)
- Weights the RRF components based on the original relevance scores from each method
- Applies a 20% boost to documents found by both search methods
- Normalizes final scores to a 0-100 range for consistency
- Returns a combined list of results sorted by the fused relevance score
The algorithm specifically:
- Processes vector and text search results separately, calculating RRF components for each
- Assigns weights based on original relevance scores to preserve quality signals
- Incorporates both rank position and score magnitude in the final fusion formula
- Accounts for the "methods_count" to give preference to documents found through multiple search methods
- Produces a unified, re-ranked list of the most relevant context documents
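A compact sketch of that fusion logic (the exact weighting and boost arithmetic in `tools.py` may differ; this follows the description above):

```python
def reciprocal_rank_fusion(vector_results, text_results, top_k=10, k_constant=60):
    """Fuse two ranked result lists; each result is a (doc_id, score) pair.

    Assumes at least one non-empty result list.
    """
    fused = {}
    for results in (vector_results, text_results):
        for rank, (doc_id, score) in enumerate(results, start=1):
            entry = fused.setdefault(doc_id, {"score": 0.0, "methods": 0})
            # RRF component 1/(k + rank), weighted by the original score.
            entry["score"] += score * (1.0 / (k_constant + rank))
            entry["methods"] += 1
    for entry in fused.values():
        if entry["methods"] > 1:   # found by both search methods
            entry["score"] *= 1.2  # 20% boost
    # Normalize to a 0-100 range and return the top_k documents.
    top = max(e["score"] for e in fused.values()) or 1.0
    ranked = sorted(fused.items(), key=lambda kv: kv[1]["score"], reverse=True)
    return [(doc_id, 100 * e["score"] / top) for doc_id, e in ranked[:top_k]]
```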
The system implements two approaches for answer generation:
- `answer_question_with_context()`: Standard synchronous response generation
- `answer_question_with_context_streaming()`: Token-by-token streaming for real-time responses
Both functions:
- Format retrieved context documents with sequential numbering
- Construct a prompt that includes the query and retrieved context
- Call the language model (OpenAI API or compatible local model)
- Apply appropriate system instructions to generate concise, accurate answers
The implementation is designed to work with either OpenAI's API or local language models through compatible interfaces like vLLM, making it flexible for various deployment scenarios. For local deployments, the system uses identical API signatures but points to local endpoints, ensuring a consistent interface regardless of the underlying model provider.
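As an illustration, a minimal streaming answer function against the `openai` v1 client might look like this (the prompt wording and helper name are ours; only `base_url` and `api_key` change between OpenAI and a local vLLM endpoint):

```python
from openai import OpenAI

def stream_answer(query, context_docs, api_key, base_url, model="gpt-4o"):
    client = OpenAI(api_key=api_key, base_url=base_url)
    # Number the retrieved documents so the model can refer to them.
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context_docs))
    stream = client.chat.completions.create(
        model=model,
        stream=True,
        messages=[
            {"role": "system",
             "content": "Answer concisely using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```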
- `tools.py`: Core utilities for text processing, keyword extraction, vector database operations, and question answering
- `text_chunking/`: Specialized modules for semantic text chunking and visualization
- `workshop.ipynb`: Demonstration notebook showing the complete workflow
- `book.txt`: Sample text used in the demonstrations
- `requirements.txt`: Dependencies required for the project
- `normalize_spanish_text(text)`: Handles Spanish-specific text normalization, removing accents and special characters
- `load_book(path)`: Loads text content from a file
- `format_text_with_line_breaks(text, line_length)`: Formats text with appropriate line breaks
- `extract_keywords_with_bm25(fragment_and_full_text)`: Extracts relevant keywords from text fragments using the BM25 algorithm
- `extract_keywords_from_fragments(fragments, full_text, top_n, max_workers)`: Handles parallel keyword extraction from multiple text fragments
- `generate_wordcloud(keywords_results, title, output_dir, ...)`: Creates visual wordcloud representations of extracted keywords
- `connect_to_postgres(host, port, dbname, user, password)`: Establishes a connection to the PostgreSQL database
- `check_chunks_table(connection, vector_dimensions)`: Ensures proper table structure for vector storage
- `get_embedding(text, api_key, base_url_embeddings, model)`: Obtains vector embeddings for text
- `hybrid_search(connection, query_text, api_key, ...)`: Performs combined semantic and keyword-based search
- `initialize_vector_database(host, port, dbname, user, password, vector_dimensions)`: Sets up the vector database with the pgvector extension
- `insert_text_fragments(connection, text_fragments, keywords_lists, api_key, ...)`: Stores text fragments with their embeddings and keywords
- `reciprocal_rank_fusion(results, top_k, k_constant)`: Reranks search results using the RRF algorithm
- `get_openai_response(prompt, api_key, base_url, model, temperature)`: Obtains responses from OpenAI models
- `answer_question_with_context(query, context_docs, api_key, base_url, model, prompt_template)`: Generates answers based on retrieved context
- `answer_question_with_context_streaming(query, context_docs, api_key, base_url, model, prompt_template)`: Stream-based version of answer generation
- `process_text_into_chunks(fulltext)`: Divides text into appropriate chunks for processing
This class is responsible for semantically aware text splitting:
- `__init__(llm_chain, split_texts, split_text_embeddings)`: Initializes with text and embedding data
- `build_chunk_cosine_distances()`: Calculates similarity between consecutive chunks
- `get_breakpoints(embeddings, start, end, threshold)`: Identifies semantic breakpoints in text
- `build_chunks_stack(length_threshold, cosine_distance_percentile_threshold)`: Creates text chunks respecting semantic boundaries
- `build_semantic_groups(breakpoints)`: Groups text splits based on semantic similarity
- `build_semantic_group_clusters(semantic_groups, cluster_ids)`: Aggregates semantic groups into clusters
- `build_semantic_group_summaries(semantic_groups_to_summarize, verbose)`: Generates summaries for semantic groups
This class provides visualization and management capabilities for semantic clusters:
- `__init__(api_key, llm_model, temperature, base_url_llm, base_url_embeddings, embeddings_model)`: Initializes visualization capabilities
- `split_documents(splitter, documents, min_chunk_len, verbose)`: Splits documents and handles short chunks
- `merge_short_documents(split_texts, min_len)`: Combines small text fragments for better analysis
- `embed_original_document_splits(doc_splits)`: Creates embeddings for document chunks
- `embed_semantic_groups(semantic_groups)`: Creates embeddings for semantic groups
- `generate_breakpoints(doc_splits, doc_split_embeddings, length_threshold, percentile_threshold, plot, verbose)`: Creates semantic breakpoints for document organization
- `vizualize_semantic_groups(semantic_groups, semantic_group_embeddings, n_clusters)`: Visualizes semantic relationships between text groups
- `generate_cluster_labels(semantic_group_clusters, plot)`: Creates labels and summaries for semantic clusters
The notebook demonstrates a complete workflow for text processing and RAG-based question answering:
- Database Setup: Initializes a PostgreSQL database with vector capabilities
- Text Ingestion: Loads and performs initial processing of text
- Semantic Processing: Creates embeddings and performs semantic chunking
- Keyword Extraction: Identifies important terms in each text fragment
- Hybrid Index Creation: Builds combined semantic and keyword-based search capabilities
- Hybrid Retrieval: Performs searches using both semantic and keyword methods
- Relevance Ranking: Reorganizes results by relevance using Reciprocal Rank Fusion
- Response Generation: Creates natural language answers based on retrieved context
The workshop.ipynb notebook provides step-by-step examples of using the toolkit, including:
- Setting up a vector database
- Loading and processing text
- Extracting keywords and generating visualizations
- Performing semantic analysis and clustering
- Building hybrid search capabilities
- Asking questions and generating contextualized answers
Key dependencies include:
- Python 3.10+
- PostgreSQL with pgvector extension
- OpenAI API or compatible local models
- Various Python libraries (numpy, scipy, langchain, etc.)
See requirements.txt for a complete list of dependencies.
The toolkit is configured to work with the OpenAI API by default. To use OpenAI's services:
- Obtain an API key from OpenAI
- In the notebook, make sure the `base_url` parameters point to the OpenAI API endpoints:

  ```python
  semantic_chunker = SemanticClusterVisualizer(
      api_key="your-openai-api-key",
      llm_model="gpt-4o",
      base_url_llm="https://api.openai.com/v1",
      base_url_embeddings="https://api.openai.com/v1",
      embeddings_model="text-embedding-3-small",
  )
  ```
For privacy, cost efficiency, or customization, you can run models locally using vLLM as an OpenAI-compatible API backend:
1. Install vLLM:

   ```bash
   pip install vllm
   ```

2. Start the vLLM server with an OpenAI-compatible API:

   ```bash
   python -m vllm.entrypoints.openai.api_server \
       --model your-local-model-name \
       --host 127.0.0.1 \
       --port 3000
   ```

3. Launch an embedding model server on a different port:

   ```bash
   python -m vllm.entrypoints.openai.api_server \
       --model your-embedding-model-name \
       --host 127.0.0.1 \
       --port 3001
   ```

4. In the notebook, simply change the API URLs to point to your local servers:

   ```python
   semantic_chunker = SemanticClusterVisualizer(
       api_key="123",  # Any string works when using vLLM locally
       llm_model="gpt-4o",
       base_url_llm="http://localhost:3000/v1",
       base_url_embeddings="http://localhost:3001/v1",
       embeddings_model="text-embedding-3-small",
   )
   ```
That's it! The toolkit will now use your local LLM infrastructure instead of the OpenAI API, while maintaining the same functionality. No other code changes are required beyond updating the hostnames and API keys in the notebook.
When using the notebook, the following points need to be modified according to your specific setup:
1. **Database Connection Parameters** (Cell 3):

   ```python
   connection = initialize_vector_database(
       host="localhost",       # Change to your Postgres server
       port=5432,              # Change if using a different port
       dbname="workshop_rag",  # Your database name
       user="postgres",        # Your database username
       password="dev.2m",      # Your database password
       vector_dimensions=1024,
   )
   ```

2. **LLM and Embedding Services Configuration** (Cell 4):

   ```python
   semantic_chunker = SemanticClusterVisualizer(
       api_key="123",                                   # Your API key for OpenAI or local service
       llm_model="gpt-4o",                              # Change to your preferred model
       base_url_llm="http://localhost:3000/v1",         # URL for LLM service
       base_url_embeddings="http://localhost:3001/v1",  # URL for embeddings service
       embeddings_model="text-embedding-3-small",       # Change to your preferred embedding model
   )
   ```

3. **Hybrid Search Parameters** (Cell 10):

   ```python
   results = hybrid_search(
       connection=connection,
       query_text=query,
       api_key="123",                                   # Your API key
       base_url_embeddings="http://localhost:3001/v1",  # URL for embeddings service
       top_k=20,
   )
   ```

4. **Question Answering Service Configuration** (Cell 12):

   ```python
   stream = answer_question_with_context_streaming(
       query=query,
       context_docs=results,
       api_key="123",                         # Your API key
       base_url="http://localhost:3000/v1",   # URL for LLM service
       model="gpt-4o",                        # Change to your preferred model
       prompt_template=prompt,
   )
   ```
These are all the configuration points that need to be modified in the notebook to adapt it to your environment, whether you're using OpenAI's services or a local setup with vLLM.
The semantic chunking and visualization functionality in this repository is based on the excellent work by rmartinshort in the text_chunking repository. We've extended and adapted these core mechanisms to handle Spanish text, implement hybrid search capabilities, and integrate them with a RAG system. We are grateful for the original implementation that provided a strong foundation for semantic text processing.
With ❤️ from Dolfs AI (https://www.dolfs.io)
This project is licensed under the MIT License.