A FastAPI-based OCR and document analysis service that extracts text and structured data from images using PaddleOCR, BLIP image captioning, and TrOCR models. Optimized for processing insurance claim forms and other structured documents.
- Python 3.10+
- A CUDA-capable GPU is optional but recommended for faster inference (used by both BLIP and PaddleOCR).
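
To check whether a GPU will be picked up before starting the service, something like the following sketch may help. It assumes the BLIP model runs on PyTorch, which this README does not state explicitly; `pick_device` is a hypothetical helper, not part of the service. PaddleOCR selects its device separately, so this check does not cover it.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    # Assumption: BLIP is loaded through PyTorch. If torch is not installed,
    # fall back to CPU instead of raising ImportError.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference device: {pick_device()}")
```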
cd ImageToTextLab
python -m venv .venv
.venv\Scripts\activate # On macOS/Linux use: source .venv/bin/activate
pip install -r requirements.txt

uvicorn app:app --reload --port 8080

Send a multipart/form-data POST request to /extract with the following fields (PaddleOCR extracts the structured text, while BLIP provides a descriptive caption):
- formType: string describing the type of form being processed
- documentType: string describing the document category
- attachment: the image file that contains Name, Age, and Location
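
For Python clients, the three fields listed above can be assembled into a multipart/form-data request using only the standard library. This is a minimal sketch, not the service's own client: `build_multipart` and `make_extract_request` are hypothetical helpers, and the base URL assumes the server was started on port 8080 as shown earlier.

```python
import io
import uuid
from urllib import request


def build_multipart(fields, file_field, filename, file_bytes, content_type="image/png"):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # Plain text parts: formType and documentType.
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    # File part: the image sent as the attachment field.
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    buf.write(f"Content-Type: {content_type}\r\n\r\n".encode())
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def make_extract_request(image_bytes, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the /extract endpoint."""
    body, content_type = build_multipart(
        {"formType": "registration", "documentType": "id-card"},
        file_field="attachment",
        filename="image.png",
        file_bytes=image_bytes,
    )
    return request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

With the server running, the request can be sent with `urllib.request.urlopen(make_extract_request(image_bytes))` and the response body decoded as JSON.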
Example with curl:
curl -X POST http://localhost:8080/extract \
-F "formType=registration" \
-F "documentType=id-card" \
-F "attachment=@/path/to/image.png"

Example response:

{
"formType": "registration",
"documentType": "id-card",
"rawText": "NAME : JANE DOE\nAGE : 32\nLOCATION : AUSTIN TX",
"blipCaption": "Name: Jane Doe; Age: 32; Location: Austin, TX",
"data": {
"name": "Jane Doe",
"age": "32",
"location": "Austin, TX"
}
}

If BLIP cannot confidently read one of the fields, the corresponding value is returned as null. The rawText field always contains the plain text generated by BLIP for auditing purposes.
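
Because any field under `data` may come back as null, callers will likely want to separate the fields that were read from those that were not. The sketch below shows one way to do that; `split_fields` and `EXPECTED_FIELDS` are hypothetical names, and the field list is taken from the example response rather than from a documented schema.

```python
import json

# Assumption: these are the keys returned under "data", per the example response.
EXPECTED_FIELDS = ("name", "age", "location")


def split_fields(response_text):
    """Split a JSON response from /extract into found and missing fields.

    Returns (found, missing, raw_text), where raw_text is kept for auditing.
    """
    payload = json.loads(response_text)
    data = payload.get("data") or {}
    # JSON null decodes to None, so absent and null fields are treated alike.
    found = {key: data[key] for key in EXPECTED_FIELDS if data.get(key) is not None}
    missing = [key for key in EXPECTED_FIELDS if data.get(key) is None]
    return found, missing, payload.get("rawText", "")


if __name__ == "__main__":
    sample = json.dumps({
        "rawText": "NAME : JANE DOE\nAGE : 32",
        "data": {"name": "Jane Doe", "age": "32", "location": None},
    })
    found, missing, raw_text = split_fields(sample)
    print(found, missing)
```

Missing fields can then be flagged for manual review alongside the `rawText` value.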