
Serverless Vector Database System for Multi-Modal Document/Image Processing

This documentation was created with the help of Amazon Q Developer (see Generating documentation with Amazon Q Developer).

Build a Serverless Embedding App with the AWS Cloud Development Kit (CDK) to create four AWS Lambda Functions.

This serverless solution creates, manages, and queries vector databases for PDF documents and images with Amazon Bedrock embeddings. You can use FAISS vector stores or Aurora PostgreSQL with pgvector for efficient similarity searches across multiple data types.

This project offers a complete serverless infrastructure for document processing and retrieval, using AWS Lambda functions to manage workflow tasks. With Amazon Bedrock for embedding generation, you can perform both text-based and image-based queries, making it ideal for multi-modal search applications.

💰 Cost to complete:

Learn how to test Lambda Functions in the console with test events.

AWS Lambda Functions to Generate Embeddings for Text and Image Files:

(Architecture diagram)

To handle the embedding process, there is a dedicated Lambda Function for each file type:

Event to trigger:

{
    "location": "REPLACE-YOUR-KEY",
    "vectorStoreLocation": "REPLACE-NAME.vdb",
    "bucketName": "REPLACE-YOUR-BUCKET",
    "vectorStoreType": "faiss",
    "splitStrategy": "semantic",
    "fileType": "application/pdf",
    "embeddingModel": "amazon.titan-embed-text-v1"
}
Executing the function with this event: succeeded.
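If you prefer to trigger the function from a script instead of the console, a minimal boto3 sketch looks like this (the function name is a placeholder; use the name printed in the CDK deploy output):

```python
import json
import boto3

# Placeholder name -- use the function name from the CDK deploy output.
FUNCTION_NAME = "YOUR-PDF-EMBEDDING-FUNCTION"

lambda_client = boto3.client("lambda")

event = {
    "location": "REPLACE-YOUR-KEY",
    "vectorStoreLocation": "REPLACE-NAME.vdb",
    "bucketName": "REPLACE-YOUR-BUCKET",
    "vectorStoreType": "faiss",
    "splitStrategy": "semantic",
    "fileType": "application/pdf",
    "embeddingModel": "amazon.titan-embed-text-v1",
}

# Synchronous invocation with the same payload you would paste into a console test event.
response = lambda_client.invoke(
    FunctionName=FUNCTION_NAME,
    InvocationType="RequestResponse",
    Payload=json.dumps(event),
)
print(json.loads(response["Payload"].read()))
```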

Event to trigger:

{
    "location": "REPLACE-YOUR-KEY-FOLDER",
    "vectorStoreLocation": "REPLACE-NAME.vdb",
    "bucketName": "REPLACE-YOUR-BUCKET",
    "vectorStoreType": "faiss",
    "splitStrategy": "semantic",
    "embeddingModel": "amazon.titan-embed-image-v1"
}
Executing the function with this event: succeeded.
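The image function relies on the Titan Multimodal Embeddings model (amazon.titan-embed-image-v1) through the bedrock-runtime API. Below is a minimal sketch of a single-image call; the file path and region are assumptions, and the deployed handler may batch or structure its calls differently:

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

# Assumed local image path, used only for illustration.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Titan Multimodal Embeddings accepts an image, text, or both in one request.
response = bedrock.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=json.dumps({"inputImage": image_b64}),
    accept="application/json",
    contentType="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # 1024-dimensional vector by default
```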

(Architecture diagram)

💡 Before testing this Lambda Function, keep in mind that it must run in the same VPC as the Amazon Aurora PostgreSQL DB and be able to reach it. For that, check Automatically connecting a Lambda function and an Aurora DB cluster, Using Amazon RDS Proxy for Aurora, and Use interface VPC endpoints (AWS PrivateLink) for Amazon Bedrock.

Event to trigger:

{
  "location": "YOUR-KEY",
  "bucketName": "YOUR-BUCKET-NAME",
  "fileType": "pdf or image",
  "embeddingModel": "amazon.titan-embed-text-v1",
  "PGVECTOR_USER": "YOUR-RDS-USER",
  "PGVECTOR_PASSWORD": "YOUR-RDS-PASSWORD",
  "PGVECTOR_HOST": "YOUR-RDS-ENDPOINT-PROXY",
  "PGVECTOR_DATABASE": "YOUR-RDS-DATABASE",
  "PGVECTOR_PORT": "5432",
  "collectioName": "YOUR-COLLECTION-NAME",
  "bedrock_endpoint": "https://vpce-...-.....bedrock-runtime.YOUR-REGION.vpce.amazonaws.com"
}
Executing the function with the PDF event: succeeded.
Executing the function with the image event: succeeded.
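For reference, here is a minimal sketch of what writing embeddings into Aurora PostgreSQL with pgvector involves. The table and column names are hypothetical and chosen for illustration only; the deployed function manages its own schema:

```python
import psycopg2  # psycopg2-binary

# Connection details mirror the PGVECTOR_* fields in the event above.
conn = psycopg2.connect(
    host="YOUR-RDS-ENDPOINT-PROXY",
    port=5432,
    dbname="YOUR-RDS-DATABASE",
    user="YOUR-RDS-USER",
    password="YOUR-RDS-PASSWORD",
)
cur = conn.cursor()

# Enable pgvector and create a hypothetical table for illustration.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute(
    """CREATE TABLE IF NOT EXISTS documents (
           id SERIAL PRIMARY KEY,
           content TEXT,
           embedding vector(1536)
       );"""
)

# A Titan text embedding is a list of 1,536 floats; an all-zeros vector stands in here.
embedding = [0.0] * 1536
vector_literal = "[" + ",".join(map(str, embedding)) + "]"
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector);",
    ("example chunk", vector_literal),
)
conn.commit()
cur.close()
conn.close()
```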

AWS Lambda Functions to Query for Text and Image Files in a Vector DB:

(Architecture diagram)

To handle the query process, there is a dedicated Lambda Function for each file type:

Event to trigger:

{
  "vectorStoreLocation": "REPLACE-NAME.vdb",
  "bucketName": "REPLACE-YOUR-BUCKET",
  "vectorStoreType": "faiss",
  "query": "YOUR-QUERY",
  "numDocs": 5,
  "embeddingModel": "amazon.titan-embed-text-v1"
}
Executing the function with this event: succeeded.
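Inside the function the flow is roughly: download the .vdb FAISS index from S3, rebuild the embeddings client, and run a similarity search. A minimal local sketch with LangChain follows; package and class names assume recent langchain-aws / langchain-community releases, and the deployed handler may differ:

```python
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import FAISS

# Assumes the FAISS index (REPLACE-NAME.vdb) has already been downloaded
# from S3 to ./local_index -- the deployed Lambda does that step for you.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

vector_store = FAISS.load_local(
    "./local_index",
    embeddings,
    allow_dangerous_deserialization=True,  # FAISS indexes are pickle-based
)

docs = vector_store.similarity_search("YOUR-QUERY", k=5)  # k mirrors numDocs
for doc in docs:
    print(doc.page_content[:200])
```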

You can search by text or by image:

  • Text event to trigger
{
  "vectorStoreLocation": "REPLACE-NAME.vdb",
  "bucketName": "REPLACE-YOUR-BUCKET",
  "vectorStoreType": "faiss",
  "InputType": "text",
  "query": "TEXT-QUERY",
  "embeddingModel": "amazon.titan-embed-text-v1"
}
Executing the function with this event: succeeded.
  • Image event to trigger
{
  "vectorStoreLocation": "REPLACE-NAME.vdb",
  "bucketName": "REPLACE-YOUR-BUCKET",
  "vectorStoreType": "faiss",
  "InputType": "image",
  "query": "IMAGE-BUCKET-LOCATION-QUERY",
  "embeddingModel": "amazon.titan-embed-text-v1"
}
Executing the function with this event: succeeded.
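For an image query the idea is the same, except the query image is first embedded with the Titan Multimodal model and the search runs against that vector. The sketch below shows one way to do this; the helper function and the FAISS method choice are assumptions for illustration, not the deployed handler's exact code:

```python
import base64
import json
import boto3
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import FAISS

bedrock = boto3.client("bedrock-runtime")

def embed_image(local_path: str) -> list[float]:
    """Embed a local image with Titan Multimodal Embeddings (illustrative helper)."""
    with open(local_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({"inputImage": image_b64}),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]

# Load the multimodal index (already downloaded from S3) and search by vector.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-image-v1")
vector_store = FAISS.load_local(
    "./local_index", embeddings, allow_dangerous_deserialization=True
)
query_vector = embed_image("query.jpg")  # assumed local copy of the query image
docs = vector_store.similarity_search_by_vector(query_vector, k=3)
for doc in docs:
    print(doc.metadata)  # the matched documents should carry the stored image location
```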

💡 The next step is to take the image_path value from the response and download the file from the Amazon S3 bucket with the boto3 download_file method.
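A minimal boto3 sketch of that step (the key shown is illustrative):

```python
import boto3

s3 = boto3.client("s3")

bucket_name = "REPLACE-YOUR-BUCKET"
image_path = "images/photo-01.jpg"  # illustrative key returned as image_path

# Download the matched image locally so you can inspect it.
s3.download_file(bucket_name, image_path, "/tmp/result.jpg")
```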

(Architecture diagram)

Event to trigger:

{
  "location": "YOUR-KEY",
  "bucketName": "YOUR-BUCKET-NAME",
  "fileType": "pdf or image",
  "embeddingModel": "amazon.titan-embed-text-v1",
  "PGVECTOR_USER": "YOUR-RDS-USER",
  "PGVECTOR_PASSWORD": "YOUR-RDS-PASSWORD",
  "PGVECTOR_HOST": "YOUR-RDS-ENDPOINT-PROXY",
  "PGVECTOR_DATABASE": "YOUR-RDS-DATABASE",
  "PGVECTOR_PORT": "5432",
  "collectioName": "YOUR-COLLECTION-NAME",
  "bedrock_endpoint": "https://vpce-...-.....bedrock-runtime.YOUR-REGION.vpce.amazonaws.com",
  "QUERY": "YOUR-TEXT-QUESTION"
}

💡 Use location and bucketName to provide the image location when making an image query.
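For reference, a similarity query against pgvector looks roughly like the sketch below. The documents table reuses the hypothetical schema from the earlier storage sketch; the deployed function issues its own equivalent query:

```python
import psycopg2

conn = psycopg2.connect(
    host="YOUR-RDS-ENDPOINT-PROXY",
    port=5432,
    dbname="YOUR-RDS-DATABASE",
    user="YOUR-RDS-USER",
    password="YOUR-RDS-PASSWORD",
)
cur = conn.cursor()

# The query embedding would come from Amazon Bedrock (Titan) for the text or image query;
# an all-zeros vector stands in here.
query_embedding = [0.0] * 1536
vector_literal = "[" + ",".join(map(str, query_embedding)) + "]"

# '<=>' is pgvector's cosine-distance operator; smaller distance means more similar.
cur.execute(
    """SELECT content, embedding <=> %s::vector AS distance
       FROM documents
       ORDER BY distance
       LIMIT 5;""",
    (vector_literal,),
)
for content, distance in cur.fetchall():
    print(round(distance, 4), content[:120])

cur.close()
conn.close()
```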

Executing the function with the PDF event: succeeded.
Executing the function with the image query (text) event: succeeded.
Executing the function with the image query (image) event: succeeded.

🚀 Let's build!

The AWS Lambda Functions in this deployment are built from container images, so you must have Docker Desktop installed and running on your computer.

Step 1: App Setup

✅ Clone the repo

git clone https://github.com/build-on-aws/langchain-embeddings

✅ Go to:

cd serveless-embeddings

Step 2: Deploy architecture with CDK.

✅ Create The Virtual Environment by following the steps in the README:

python3 -m venv .venv
source .venv/bin/activate

For Windows:

.venv\Scripts\activate.bat

✅ Install The Requirements:

pip install -r requirements.txt

✅ Synthesize The CloudFormation Template With The Following Command:

cdk synth

✅🚀 The Deployment:

cdk deploy

🧹 Clean the house!

When you finish testing and want to clean up the application, just follow these two steps:

  1. Delete the files from the Amazon S3 bucket created in the deployment.
  2. Run this command in your terminal:
cdk destroy

Conclusion:

In this post, you built a powerful multimodal search engine capable of handling both text and images using Amazon Titan Embeddings, Amazon Bedrock, Amazon Aurora PostgreSQL, and LangChain. You generated embeddings, stored the data in both FAISS vector databases and Amazon Aurora PostgreSQL with pgvector, and developed applications for semantic text and image search.

Additionally, you deployed a serverless application using AWS CDK with Lambda Functions to integrate embedding and retrieval capabilities through events, providing a scalable solution.

Now you have the tools to create your own multimodal search engines, unlocking new possibilities for your applications. Explore the code, experiment, and share your experiences in the comments.

🚀 Some links for you to continue learning and building: