This repository contains a Python script that processes PDF documents, chunks their content, generates embeddings using a text embedding model, and stores these embeddings in ChromaDB. The script also allows for querying the stored embeddings to retrieve and rank relevant document chunks based on a user's query.
- PDF Text Extraction: Extracts text from PDF files using the
PyPDF2
library. - Intelligent Chunking: Splits the extracted text into manageable chunks using
RecursiveCharacterTextSplitter
. - Embedding Generation: Generates embeddings for each text chunk using a model from the Ollama server.
- Document Indexing: Indexes the generated embeddings in ChromaDB, a scalable vector search engine.
- Query and Retrieval: Performs similarity search on the indexed embeddings and retrieves relevant document chunks based on a query.
- Conda (Anaconda or Miniconda)
- A running instance of ChromaDB
- Ollama server for embedding generation
-
Clone the repository:
git clone https://github.com/Teachings/rag-tools.git cd rag-tools
-
Create and activate the Conda environment:
conda create --name rag-tools python=3.12 conda activate rag-tools
-
Install the required Python packages:
pip install -r requirements.txt
-
Create a
.env
file in the root directory to configure the following environment variables:OLLAMA_API_URL=http://localhost:11424 # URL of your Ollama server CHROMA_HOST=localhost # Host for ChromaDB CHROMA_PORT=8085 # Port for ChromaDB
docker pull mukultripathi/chroma-db:latest
To persist data across container restarts, create a Docker volume:
docker volume create chroma-db-data
Run the container with your custom image, ensuring persistent storage and automatic restarts:
docker run -d \
--name chroma-db-instance \
--network ollama-network \
-p 8085:8000 \
--restart always \
-v chroma-db-data:/chroma/data \
mukultripathi/chroma-db:latest
Explanation:
--name chroma-db-instance
: Assigns a name to the container for easy reference.--network ollama-network
: Connects to the specified Docker network.-p 8085:8000
: Maps port 8085 on the host machine to port 8000 inside the container, making the Chroma DB accessible via port 8085.--restart always
: Ensures the container automatically restarts if it crashes or the Docker service restarts.-v chroma-db-data:/chroma/data
: Mounts the Docker volumechroma-db-data
to the/chroma/data
directory inside the container, ensuring persistent storage of the Chroma DB data.
Ensure the container is running correctly:
docker ps
docker inspect chroma-db-instance
These commands will show the running container's status and details, including the mounted volume.
-
docs/
: This folder is where you should place your PDF files. The script will reference files from this directory when processing documents.├── docs/ │ ├── sample-document.pdf
-
Place your PDF files in the
docs/
directory in the root of this repository. -
Update the
file_path
in the script to reference the PDF file in thedocs/
directory. For example:file_path = "./docs/sample-document.pdf"
-
Run the script:
cd langchain python ra_pipeline.py
-
The script will:
- Extract text from the specified PDF.
- Split the text into chunks.
- Generate embeddings for each chunk.
- Index the embeddings in ChromaDB.
- Allow you to query the indexed data and retrieve relevant document chunks.
The following is a brief example of how to use the script:
file_path = "./docs/sample-document.pdf"
collection_name = "sample_collection"
query = "What are the key events discussed in the document?"
# Step 1: Chunk the document and generate embeddings
documents = intelligent_chunking(file_path)
# Step 2: Index the documents in the collection
index_documents(documents, collection_name, delete=False)
# Step 3: Retrieve and rank the documents based on the query
results = retrieve_text(query, collection_name)
Feel free to open issues or submit pull requests if you find any bugs or have suggestions for new features.
This project is licensed under the MIT License. See the LICENSE file for details.