This project is an implementation of a vector database written from scratch.
To install this project, simply clone the repository:

```bash
git clone https://github.com/farhan0167/vector-db.git
cd vector-db/src/
```
Then create a `.env` file and paste the following:

```bash
COHERE_API_KEY=<cohere-api-key>
PYTEST_DB_URI=http://localhost:8000
```
Once cloned, and within `src/`, you have two options to spin up the database:

- via Docker, or
- through a standalone Kubernetes cluster.
Via Docker:

- Build the container:

  ```bash
  docker build -t vector-db-image:latest .
  ```

- Spin up a container:

  ```bash
  docker run --name vector-db -p 8000:8000 vector-db-image:latest
  ```
Via Kubernetes:

Make sure you have the following installed on your system: Docker, minikube, Helm, and kubectl (all used in the steps below).

- Spin up minikube:

  ```bash
  minikube start
  ```

- Build the Docker image:

  ```bash
  eval $(minikube docker-env)
  docker build -t vector-db-image .
  ```

- Deploy using Helm:

  ```bash
  helm install vector-db ./vector_db-chart/
  ```
  Note: If you previously installed `vector-db`, you'll need to uninstall it first:

  ```bash
  helm uninstall vector-db
  ```
- Check that the service is available:

  ```bash
  kubectl get svc
  ```

- Launch the service:

  ```bash
  minikube service vector-db
  ```
A great way to see all the available methods and endpoints for interacting with the library is to navigate to `http://localhost:<port>/redoc`.
At a high level, vector-db can be summarized by the following diagram:
The vector database is made up of:
- Database: The top-level object that holds a collection of libraries. It serves as the central structure for organizing and storing all content.
- Library: A library is a group of related documents. It acts as a controller for both documents and their corresponding chunks.
- Document: A document represents a single piece of text (e.g., an article or report). To enable more efficient retrieval, documents are split into smaller segments called chunks. This allows the system to return only the most relevant parts, avoiding the limitations of an LLM's context window.
- Chunk: Chunks are the smaller text segments derived from a document. In vector-db, each chunk is embedded individually and is the core unit returned in a retrieval query.
- API: The API provides an interface for users to interact with the database, enabling operations such as storing, querying, and retrieving documents or chunks.
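The hierarchy above can be sketched as a set of nested containers. The class and field names below are illustrative stand-ins, not the project's actual models:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Database -> Library -> Document -> Chunk
# hierarchy described above; fields are illustrative only.
@dataclass
class Chunk:
    id: str
    text: str
    embedding: list = field(default_factory=list)  # one vector per chunk

@dataclass
class Document:
    id: str
    name: str
    chunks: list = field(default_factory=list)  # a document is split into chunks

@dataclass
class Library:
    id: str
    documents: list = field(default_factory=list)  # a group of related documents

@dataclass
class Database:
    libraries: list = field(default_factory=list)  # top-level container
```

Each chunk carries its own embedding, which is why it is the unit returned by a retrieval query.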
When designing a retrieval system, the system should be able to:

- Provide accurate recommendations
- Retrieve with low latency, without sacrificing accuracy
For this project, we aim for low search latency as the primary objective. Before we discuss vector search indexes, I also want to introduce a few other indexes we'll use to organize Libraries, Documents, and Chunks.
A database holds a collection of libraries, a library consists of a list of documents, and each document consists of a list of chunks. Each parent object stores its child objects in a list, and a search over a list usually costs O(n) time, where n is the number of child objects the parent holds. Therefore, we implement what we call a `CollectionsIndex`. Put simply, it's a hash map that maps a child object's identifier to its index in the collection/list it resides in. Consider the following:
```
documents = [c1, c2, c3, ..., cN]
index     =   0,  1,  2, ..., N-1
```

What a `CollectionsIndex` does is simply map:

```
{
  c1: 0,
  c2: 1,
  ...
}
```
You could argue that a single dictionary mapping an id to the object would suffice, but this approach lets us track other metadata as well. For instance, in a Library a document can be identified by both its name and its id. Instead of storing two dictionaries holding two copies of each Document, we store a single Document in the list of documents, plus two dictionaries that each map one identifier of interest to the Document's index within the list. This way, we can look up documents in O(1) time. Insertion is O(1) as well, since we append to the list and assign the index as `len(collection) - 1`. Removal, however, is O(N): once we remove an item from the list, we need to recompute the entire index. This is a cost we're okay with, as long as search and insertion are fast.
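A minimal sketch of the idea, using plain strings as stand-ins for child objects (illustrative, not the project's exact class):

```python
class CollectionsIndex:
    """Hash map from a child object's identifier to its position in the
    parent's list. Illustrative sketch, not the project's exact code."""

    def __init__(self):
        self._index = {}

    def add(self, key, position):
        self._index[key] = position  # O(1) insertion

    def get(self, key):
        return self._index.get(key)  # O(1) lookup

    def rebuild(self, keys):
        # O(N) recompute, needed after a removal from the underlying list
        self._index = {k: i for i, k in enumerate(keys)}


documents = ["c1", "c2", "c3"]  # stand-ins for Document objects
by_id = CollectionsIndex()
for i, doc in enumerate(documents):
    by_id.add(doc, i)

print(documents[by_id.get("c3")])  # O(1) lookup, prints "c3"
documents.remove("c2")             # removal shifts later indices...
by_id.rebuild(documents)           # ...so the whole index is recomputed
```

In the actual project, a Library would keep one such index per identifier type (e.g. one keyed by document id and one keyed by name), both pointing into the same list.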
When it comes to searching over a vector space, we have the following problem: given a query vector Q, search all the embeddings and return the k nearest vectors. This problem can nicely be summarized by the diagram below:

In this project, we call this the FlatL2 index. Given a query, we perform a brute-force search over the entire vector space by calculating the Euclidean (L2) distance between the query vector and every existing vector, and return only the k nearest neighbors to the query vector (here k is a configurable parameter set by the user).
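The FlatL2 idea fits in a few lines of pure Python. This is an illustrative sketch, not the project's actual implementation:

```python
import math

def flat_l2_search(query, vectors, k=2):
    """Brute-force k-NN: compute the L2 distance from the query to every
    stored vector and return the indices of the k closest. Sketch of the
    FlatL2 index described above."""
    distances = [
        (i, math.dist(query, v))  # Euclidean (L2) distance
        for i, v in enumerate(vectors)
    ]
    distances.sort(key=lambda pair: pair[1])  # nearest first
    return [i for i, _ in distances[:k]]

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(flat_l2_search([0.9, 1.1], vectors, k=2))  # → [1, 0]
```

This is O(n) per query in the number of stored vectors, which motivates the clustering optimization below.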
However, we can do better. If you look at the image, the query vector Q is located near the top cluster of the three clusters of data. Why then search the entire vector space? To optimize, we'll do what every search algorithm does: reduce the search surface. To do this, we can run k-means clustering to find K clusters. When a user issues a query, we first compare the query vector to the K cluster centroids, pick the nearest one, and then perform a FlatL2 search over only the vectors in that cluster.
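This two-stage scheme can be sketched as follows; the centroids and cluster assignments below stand in for the output of a k-means run, which is not shown:

```python
import math

def clustered_search(query, vectors, centroids, assignments, k=1):
    """Two-stage search sketch: pick the nearest cluster centroid, then
    run a brute-force L2 search over only that cluster's members.
    Illustrative, not the project's exact code."""
    # Stage 1: compare the query to the K cluster centroids
    best_cluster = min(
        range(len(centroids)),
        key=lambda c: math.dist(query, centroids[c]),
    )
    # Stage 2: FlatL2 search restricted to that cluster's members
    members = [i for i, a in enumerate(assignments) if a == best_cluster]
    members.sort(key=lambda i: math.dist(query, vectors[i]))
    return members[:k]

vectors = [[0.0, 0.1], [0.1, 0.0], [9.0, 9.1], [9.1, 9.0]]
centroids = [[0.05, 0.05], [9.05, 9.05]]  # hypothetical k-means output
assignments = [0, 0, 1, 1]                # cluster id of each vector
print(clustered_search([9.2, 9.0], vectors, centroids, assignments, k=1))  # → [3]
```

The trade-off is that the nearest neighbor is no longer guaranteed to live in the nearest cluster, so this trades a little recall for a much smaller search surface.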