This project is an implementation of a vector database written from scratch.
To install this project, simply clone the repository:

```bash
git clone https://github.com/farhan0167/vector-db.git
cd vector-db/src/
```
Then create a `.env` file and paste the following:

```bash
COHERE_API_KEY=<cohere-api-key>
PYTEST_DB_URI=http://localhost:8000
```
Once cloned, and within `src/`, you have two options to spin up the database:

- via Docker, or
- through a standalone Kubernetes cluster.
Via Docker:

- Build the container:

  ```bash
  docker build -t vector-db-image:latest .
  ```

- Spin up a container:

  ```bash
  docker run --name vector-db -p 8000:8000 vector-db-image:latest
  ```
Via Kubernetes:

Make sure you have the following installed on your system: Docker, minikube, Helm, and kubectl (all used in the steps below).

- Spin up minikube:

  ```bash
  minikube start
  ```

- Build the Docker image:

  ```bash
  eval $(minikube docker-env)
  docker build -t vector-db-image .
  ```

- Deploy using Helm:

  ```bash
  helm install vector-db ./vector_db-chart/
  ```
  Note: If you previously installed `vector-db`, you'll need to uninstall it first:

  ```bash
  helm uninstall vector-db
  ```
- Check that the service is available:

  ```bash
  kubectl get svc
  ```

- Launch the service:

  ```bash
  minikube service vector-db
  ```
A great way to see all the available methods and endpoints for interacting with the library is to navigate to `http://localhost:<port>/redoc`.
At a high level, vector-db can be summarized by the following diagram:
The vector database is made up of:
- Database: The top-level object that holds a collection of libraries. It serves as the central structure for organizing and storing all content.
- Library: A library is a group of related documents. It acts as a controller for both documents and their corresponding chunks.
- Document: A document represents a single piece of text (e.g., an article or report). To enable more efficient retrieval, documents are split into smaller segments called chunks. This allows the system to return only the most relevant parts, avoiding the limitations of an LLM's context window.
- Chunk: Chunks are the smaller text segments derived from a document. In vector-db, each chunk is embedded individually and is the core unit returned in a retrieval query.
- API: The API provides an interface for users to interact with the database, enabling operations such as storing, querying, and retrieving documents or chunks.
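The hierarchy above can be sketched as a set of nested containers. The class and field names below are illustrative stand-ins, not the project's actual models:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Database -> Library -> Document -> Chunk
# hierarchy described above; fields are illustrative only.
@dataclass
class Chunk:
    id: str
    text: str
    embedding: list = field(default_factory=list)  # one vector per chunk

@dataclass
class Document:
    id: str
    name: str
    chunks: list = field(default_factory=list)  # a document is split into chunks

@dataclass
class Library:
    id: str
    documents: list = field(default_factory=list)  # a group of related documents

@dataclass
class Database:
    libraries: list = field(default_factory=list)  # top-level container
```

Each chunk carries its own embedding, which is why it is the unit returned by a retrieval query.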
When designing a retrieval system, the system should be able to:

- Provide accurate recommendations
- Retrieve with low latency, without sacrificing accuracy
For this project, we aim for low search latency as the primary objective. Before we discuss vector search indexes, I also want to introduce a few other indexes we'll use to organize Libraries, Documents, and Chunks.
A database holds a collection of libraries, a library consists of a list of documents, and each document consists of a list of chunks. Each parent object stores its child objects in a list, and a search over a list usually costs O(n) time, where n is the number of child objects the parent holds. Therefore, we implement what we call a `CollectionsIndex`. Put simply, it's a hash map that maps a child object's identifier to its index in the collection/list it resides in. Consider the following:
```
documents = [c1, c2, c3, ..., cN]
index     =   0,  1,  2, ..., N-1
```

What a `CollectionsIndex` does is simply map:

```
{
  c1: 0,
  c2: 1,
  ...
}
```
You could argue that a single dictionary mapping an id to the object would suffice, but this approach lets us track other metadata as well. For instance, in a Library a document can be identified by both its name and its id. Instead of storing two dictionaries holding two copies of each Document, we store a single Document in the list of documents, plus two dictionaries that each map one identifier of interest to the Document's index within the list. This way, we can look up documents in O(1) time. Insertion is O(1) as well, since we append to the list and assign the index as `len(collection) - 1`. Removal, however, is O(N): once we remove an item from the list, we need to recompute the entire index. This is a cost we're okay with, as long as search and insertion are fast.
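A minimal sketch of the idea, using plain strings as stand-ins for child objects (illustrative, not the project's exact class):

```python
class CollectionsIndex:
    """Hash map from a child object's identifier to its position in the
    parent's list. Illustrative sketch, not the project's exact code."""

    def __init__(self):
        self._index = {}

    def add(self, key, position):
        self._index[key] = position  # O(1) insertion

    def get(self, key):
        return self._index.get(key)  # O(1) lookup

    def rebuild(self, keys):
        # O(N) recompute, needed after a removal from the underlying list
        self._index = {k: i for i, k in enumerate(keys)}


documents = ["c1", "c2", "c3"]  # stand-ins for Document objects
by_id = CollectionsIndex()
for i, doc in enumerate(documents):
    by_id.add(doc, i)

print(documents[by_id.get("c3")])  # O(1) lookup, prints "c3"
documents.remove("c2")             # removal shifts later indices...
by_id.rebuild(documents)           # ...so the whole index is recomputed
```

In the actual project, a Library would keep one such index per identifier type (e.g. one keyed by document id and one keyed by name), both pointing into the same list.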
When it comes to searching over a vector space, we have the following problem: given a query vector Q, search all the embeddings and return the k nearest vectors. This problem can nicely be summarized by the diagram below:

In this project, we call this the FlatL2 index. Given a query, we perform a brute-force search over the entire vector space by calculating the Euclidean (L2) distance between the query vector and every existing vector, and return only the k nearest neighbors to the query vector (here k is a configurable parameter set by the user).
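The FlatL2 idea fits in a few lines of pure Python. This is an illustrative sketch, not the project's actual implementation:

```python
import math

def flat_l2_search(query, vectors, k=2):
    """Brute-force k-NN: compute the L2 distance from the query to every
    stored vector and return the indices of the k closest. Sketch of the
    FlatL2 index described above."""
    distances = [
        (i, math.dist(query, v))  # Euclidean (L2) distance
        for i, v in enumerate(vectors)
    ]
    distances.sort(key=lambda pair: pair[1])  # nearest first
    return [i for i, _ in distances[:k]]

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(flat_l2_search([0.9, 1.1], vectors, k=2))  # → [1, 0]
```

This is O(n) per query in the number of stored vectors, which motivates the clustering optimization below.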
However, we can do better. If you look at the image, the query vector Q is located near the top cluster of the three clusters of data. Why then search the entire vector space? To optimize, we'll do what every search algorithm does: reduce the search surface. To do this, we can run k-means clustering to find K clusters. When a user issues a query, we first compare the query vector to the K cluster centroids, pick the nearest one, and then perform a FlatL2 search over only the vectors in that cluster.
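This two-stage scheme can be sketched as follows; the centroids and cluster assignments below stand in for the output of a k-means run, which is not shown:

```python
import math

def clustered_search(query, vectors, centroids, assignments, k=1):
    """Two-stage search sketch: pick the nearest cluster centroid, then
    run a brute-force L2 search over only that cluster's members.
    Illustrative, not the project's exact code."""
    # Stage 1: compare the query to the K cluster centroids
    best_cluster = min(
        range(len(centroids)),
        key=lambda c: math.dist(query, centroids[c]),
    )
    # Stage 2: FlatL2 search restricted to that cluster's members
    members = [i for i, a in enumerate(assignments) if a == best_cluster]
    members.sort(key=lambda i: math.dist(query, vectors[i]))
    return members[:k]

vectors = [[0.0, 0.1], [0.1, 0.0], [9.0, 9.1], [9.1, 9.0]]
centroids = [[0.05, 0.05], [9.05, 9.05]]  # hypothetical k-means output
assignments = [0, 0, 1, 1]                # cluster id of each vector
print(clustered_search([9.2, 9.0], vectors, centroids, assignments, k=1))  # → [3]
```

The trade-off is that the nearest neighbor is no longer guaranteed to live in the nearest cluster, so this trades a little recall for a much smaller search surface.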