🎬 get-imdb-data — Text + Images + AI

A professional-grade Python desktop application for downloading IMDb metadata, poster images, and generating AI-powered image descriptions — all offline-capable.

Built on get-imdb-json | Python 3.12+ | SQLAlchemy + SQLite | HTTPX | Ollama VLMs

📋 Table of Contents

Overview
Architecture
Features
Installation
Quick Start
Configuration
CLI Tools
AI Image Description (Ollama)
Database Schema
Project Structure
GitHub Projects Used
Testing
Changelog

Overview

get-imdb-data is an evolution of get-imdb-json that extends beyond text metadata to include poster image downloading and AI-powered image analysis using local vision language models (VLMs) via Ollama.

What it does

Downloads IMDb metadata → titles, actors, ratings, genres — stored as JSON snapshots and in SQLite
Downloads poster images → atomic download with SHA-256 verification and deduplication
Generates AI descriptions → sends posters to a local Ollama vision model for structured scene analysis

High-Level Data Flow

flowchart LR
    A["👤 User Input<br/>(IMDb ID / Title / Name)"] --> B["🔍 IMDb Fetcher<br/>(Suggestion API + JSON-LD)"]
    B --> C["💾 SQLite Database<br/>(titles, actors, history)"]
    B --> D["📁 JSON Snapshots<br/>(content-hash dedup)"]
    B --> E{"🖼️ Poster URL<br/>found?"}
    E -->|Yes| F["⬇️ Image Downloader<br/>(HTTPX + atomic write)"]
    F --> G["📸 photos table<br/>(hash, path, size)"]
    G --> H{"🤖 Ollama<br/>enabled?"}
    H -->|Yes| I["🧠 Vision Model<br/>(llama3.2-vision)"]
    I --> J["📝 photo_descriptions<br/>(structured JSON)"]
    E -->|No| K["✅ Done<br/>(text only)"]
    H -->|No| K

Architecture

System Architecture

graph TB
    subgraph "GUI Layer"
        MW["main_window.py<br/>Tkinter Dark Theme"]
        DLG["dialogs.py<br/>Bootstrap / Export / Detail"]
        SP["status_panel.py<br/>StatusBar + LogPanel"]
    end

    subgraph "Service Layer"
        DS["download_service.py<br/>Main Orchestrator"]
        PDS["photo_download_service.py<br/>Poster Downloads"]
        IDS["image_description_service.py<br/>AI Descriptions"]
        BS["bootstrap_service.py<br/>Health Checks"]
        HS["history_service.py<br/>Version Tracking"]
        VS["validation_service.py<br/>Input Validation"]
        ES["export_service.py<br/>JSON/CSV Export"]
    end

    subgraph "Integration Layer"
        IF["imdb_fetcher.py<br/>JSON-LD + Suggestion API"]
        PA["pymoviedb_adapter.py<br/>PyMovieDb Fallback"]
        ID["image_downloader.py<br/>HTTPX + Atomic Write"]
        OV["ollama_vision.py<br/>Local VLM Client"]
    end

    subgraph "Data Layer"
        DB["engine.py<br/>SQLAlchemy Sessions"]
        MD["models.py<br/>ORM Models"]
        FM["folder_manager.py<br/>Directory Hierarchy"]
        SM["snapshot_manager.py<br/>Content-Hash Dedup"]
        JW["json_writer.py<br/>Atomic File Writes"]
    end

    subgraph "External Services"
        IMDB["🌐 IMDb<br/>Suggestion API"]
        OLL["🖥️ Ollama Server<br/>localhost:11434"]
        SQLITE["💾 SQLite<br/>WAL Mode"]
        DISK["📁 Filesystem<br/>data/ directory"]
    end

    MW --> DS
    MW --> ES
    DS --> VS
    DS --> IF
    DS --> PDS
    DS --> IDS
    DS --> HS
    DS --> SM
    PDS --> ID
    IDS --> OV
    BS --> IF
    BS --> OV

    IF --> IMDB
    ID --> IMDB
    OV --> OLL
    DB --> SQLITE
    FM --> DISK
    JW --> DISK
    SM --> FM
    SM --> JW

    DS --> DB
    PDS --> DB
    IDS --> DB

Download Pipeline (Detailed)

sequenceDiagram
    participant U as 👤 User
    participant GUI as 🖥️ GUI
    participant DS as 📦 DownloadService
    participant IF as 🔍 IMDbFetcher
    participant SM as 💾 SnapshotManager
    participant DB as 🗃️ Database
    participant PDS as 📸 PhotoService
    participant ID as ⬇️ ImageDownloader
    participant IDS as 🤖 DescriptionService
    participant OV as 🧠 OllamaVision

    U->>GUI: Enter "tt0111161"
    GUI->>DS: download("tt0111161")
    
    Note over DS: Step 1: Validate input
    DS->>DS: validate_input() → TITLE_ID
    
    Note over DS: Step 2: Fetch metadata
    DS->>IF: fetch_title("tt0111161")
    IF-->>DS: (data_dict, raw_json, None)
    
    Note over DS: Step 3: Save snapshot
    DS->>SM: save_snapshot("title", "tt0111161", raw_json)
    SM-->>DS: SnapshotResult(is_duplicate=False)
    
    Note over DS: Step 4: Persist to DB
    DS->>DB: INSERT/UPDATE titles
    
    Note over DS: Step 5: Download poster
    DS->>PDS: download_poster("tt0111161", poster_url)
    PDS->>ID: download_image(url, "tt0111161.jpg")
    ID-->>PDS: ImageDownloadResult(hash, path)
    PDS->>DB: INSERT photos
    PDS-->>DS: PhotoResult(success=True)
    
    Note over DS: Step 6: AI description
    DS->>IDS: describe_by_imdb_id("tt0111161")
    IDS->>OV: describe_image(image_path)
    OV-->>IDS: {objects, scene, colors, ...}
    IDS->>DB: INSERT photo_descriptions
    IDS-->>DS: DescriptionResult(success=True)
    
    DS-->>GUI: DownloadResult ✅
    GUI-->>U: Show in results tree

Atomic File + Database Consistency

flowchart TD
    A["Start Download"] --> B["Create temp file<br/>.tmp/img_XXXXX.part"]
    B --> C["Stream bytes<br/>+ compute SHA-256"]
    C --> D{"Download<br/>complete?"}
    D -->|No / Error| E["Delete temp file<br/>❌ No orphans"]
    D -->|Yes| F["os.replace()<br/>temp → final path"]
    F --> G{"DB INSERT<br/>photo row"}
    G -->|Success| H["✅ Done<br/>File + DB consistent"]
    G -->|Failure| I["Delete final file<br/>❌ No orphan rows"]

    style E fill:#f38ba8,color:#1e1e2e
    style I fill:#f38ba8,color:#1e1e2e
    style H fill:#a6e3a1,color:#1e1e2e

Features

Feature	Status	Description
🔍 IMDb Fetcher	✅	JSON-LD extraction + suggestion API (no browser needed)
📦 PyMovieDb Fallback	✅	Secondary data source when primary fails
💾 SQLite + WAL	✅	Reliable database with write-ahead logging
📁 JSON Snapshots	✅	Content-hash deduplication, immutable snapshots
📜 Version History	✅	Every update archives the previous version
📸 Poster Downloads	✅	HTTPX streaming + SHA-256 + atomic writes
🤖 AI Descriptions	✅	Local Ollama VLM for offline image analysis
🖥️ Dark GUI	✅	Catppuccin Mocha themed Tkinter interface
📤 Export	✅	JSON and CSV export for titles and actors
🧪 Tests	✅	pytest with mocked HTTP and in-memory SQLite
🛠️ CLI Tools	✅	Bulk poster download + bulk AI description

Installation

Prerequisites

Python 3.12+ — Download
Ollama (optional) — Download (for AI image descriptions)

Setup

# Clone the repository
git clone https://github.com/filipetorresdecarvalho/get-imdb-data.git
cd get-imdb-data

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

Ollama Setup (Optional — for AI Descriptions)

# Install Ollama from https://ollama.com/download

# Pull a vision-capable model
ollama pull llama3.2-vision

# Start the server (if not auto-started)
ollama serve

Quick Start

GUI Mode

python app.py

This launches the Tkinter GUI where you can:

Enter an IMDb ID (e.g. tt0111161) or title name (e.g. Inception)
Click Download — fetches metadata, saves poster, and (optionally) generates AI description
Double-click a result row to see full JSON details

CLI Mode — Bulk Operations

# Download posters for all titles in your database
python -m tools.bulk_download_posters --rate 5

# Generate AI descriptions for all posters
python -m tools.bulk_describe_images --model llama3.2-vision

Configuration

All settings are in config/default.ini and can be overridden with environment variables.

`config/default.ini`

[database]
url = sqlite:///data/imdb_data.db

[storage]
data_dir = data
log_dir = logs

[images]
enabled = true              # Auto-download posters
images_subdir = images      # Subdirectory under data_dir
max_concurrent_downloads = 10
rate_limit_per_second = 10
retry_max_attempts = 3
max_file_size_mb = 10

[ollama]
enabled = false             # Set to true to enable AI descriptions
base_url = http://localhost:11434
model = llama3.2-vision     # Any vision-capable model
timeout = 120               # Seconds (vision models can be slow)
max_retries = 1             # Retry on invalid JSON response

[app]
log_level = INFO
window_title = IMDb Data Manager — Text + Images + AI
request_timeout = 30

Environment Variables

Variable	Overrides	Example
`IMDB_APP_DB_URL`	database.url	`sqlite:///custom.db`
`IMDB_APP_DATA_DIR`	storage.data_dir	`/opt/imdb/data`
`IMDB_APP_LOG_DIR`	storage.log_dir	`/var/log/imdb`
`IMDB_APP_LOG_LEVEL`	app.log_level	`DEBUG`
`IMDB_APP_IMAGES_ENABLED`	images.enabled	`true` / `false`
`IMDB_APP_OLLAMA_ENABLED`	ollama.enabled	`true` / `false`
`IMDB_APP_OLLAMA_MODEL`	ollama.model	`llava`

CLI Tools

`bulk_download_posters`

Downloads poster images for all titles in the database that don't already have one.

python -m tools.bulk_download_posters [OPTIONS]

Options:
  --limit N     Max titles to process (0 = all, default: 0)
  --rate N      Max downloads per second (default: 5.0)
  --dry-run     Show what would be downloaded without doing it

`bulk_describe_images`

Generates AI descriptions for all poster images using local Ollama.

python -m tools.bulk_describe_images [OPTIONS]

Options:
  --limit N     Max images to process (0 = all, default: 0)
  --model NAME  Override Ollama model (default: from config)
  --dry-run     Show what would be described without doing it
  --force       Re-describe even if description already exists

AI Image Description (Ollama)

How It Works

flowchart LR
    A["📸 Poster Image<br/>tt0111161.jpg"] -->|base64 encode| B["📤 POST /api/chat<br/>Ollama localhost:11434"]
    B -->|System prompt<br/>+ Image| C["🧠 Vision Model<br/>(llama3.2-vision)"]
    C -->|Structured JSON| D["✅ Parse & Validate<br/>Required keys check"]
    D -->|Valid| E["💾 photo_descriptions<br/>table"]
    D -->|Invalid JSON| F["🔄 Retry with<br/>repair prompt"]
    F -->|Still invalid| G["❌ Log error<br/>skip this image"]

Supported Models

Any vision-capable Ollama model works. Tested recommendations:

Model	Size	Speed	Quality	Command
`llama3.2-vision`	7B	⚡ Fast	⭐⭐⭐ Good	`ollama pull llama3.2-vision`
`llava`	7B	⚡ Fast	⭐⭐ Decent	`ollama pull llava`
`llava:13b`	13B	🐢 Slow	⭐⭐⭐⭐ Great	`ollama pull llava:13b`

JSON Description Schema

Every image description follows this structured format:

{
  "objects": [
    {"name": "person", "confidence": 0.95, "attributes": {"clothing": "suit"}}
  ],
  "scene": {
    "setting": "prison courtyard",
    "location_type": "outdoor",
    "time_of_day": "day",
    "weather": "overcast",
    "activity_summary": "A man stands alone in a prison yard"
  },
  "colors": {
    "dominant": [{"color": "grey", "hex": "#808080", "percent": 0.4}],
    "palette_summary": "Muted greys and blues conveying confinement"
  },
  "emotions": {
    "mood": "contemplative",
    "cues": "distant gaze, isolated figure",
    "confidence": 0.8
  },
  "text_in_image": {
    "present": true,
    "extracted_text": ["The Shawshank Redemption"]
  },
  "safety": {
    "sensitive_content": false,
    "notes": "Movie poster, no sensitive content"
  },
  "detailed_analysis": "The poster depicts a solitary figure...",
  "metadata": {
    "model_used": "llama3.2-vision",
    "created_at_iso": "2026-04-02T22:00:00+00:00",
    "image_file": "tt0111161.jpg",
    "file_size_bytes": 45230
  }
}

Database Schema

erDiagram
    actors ||--o{ titles : "has"
    titles ||--o{ title_details : "has"
    titles ||--o{ title_history : "has"
    photos ||--o| photo_descriptions : "has"

    actors {
        int id PK
        string imdb_id UK
        string name
        text raw_json
        datetime created_at
        datetime updated_at
        bool is_active
    }

    titles {
        int id PK
        string imdb_id UK
        string title
        string type
        int year
        text raw_json
        int actor_id FK
        datetime created_at
        datetime updated_at
        bool is_active
    }

    title_details {
        int id PK
        int title_id FK
        string data_type
        text raw_json
        datetime created_at
        bool is_active
    }

    title_history {
        int id PK
        int title_id FK
        string snapshot_hash
        text raw_json
        datetime archived_at
        string reason
    }

    photos {
        int id PK
        string imdb_id
        string image_kind
        text source_url
        string image_path UK
        string image_hash
        int file_size
        int width
        int height
        datetime created_at
    }

    photo_descriptions {
        int id PK
        int photo_id FK_UK
        string model_used
        text description_json
        string image_hash_at_description
        float confidence_score
        datetime created_at
    }

Key Constraints

photos(imdb_id, image_kind) — UNIQUE: one poster per title
photos(image_path) — UNIQUE: no duplicate file references
photo_descriptions(photo_id) — UNIQUE: one description per photo
title_history — append-only: immutable snapshots

Project Structure

get-imdb-data/
├── app.py                          # Application entry point
├── requirements.txt                # Python dependencies
├── CHANGELOG.md                    # Version history
├── config/
│   ├── settings.py                 # Configuration loader (INI + env vars)
│   └── default.ini                 # Default settings
├── db/
│   ├── engine.py                   # SQLAlchemy engine + session_scope()
│   └── models.py                   # ORM: Actor, Title, Photo, PhotoDescription
├── gui/
│   ├── main_window.py              # Dark-themed Tkinter main window
│   ├── dialogs.py                  # Bootstrap, Export, Detail dialogs
│   └── status_panel.py             # StatusBar + LogPanel widgets
├── integrations/
│   ├── imdb_fetcher.py             # JSON-LD + Suggestion API fetcher
│   ├── pymoviedb_adapter.py        # PyMovieDb fallback wrapper
│   ├── image_downloader.py         # HTTPX streaming + atomic writes
│   └── ollama_vision.py            # Local VLM client for Ollama
├── services/
│   ├── download_service.py         # Main orchestrator (text+image+AI)
│   ├── photo_download_service.py   # Poster download + DB persistence
│   ├── image_description_service.py # AI description + DB persistence
│   ├── bootstrap_service.py        # Startup health checks
│   ├── history_service.py          # Title version tracking
│   ├── validation_service.py       # Input validation + classification
│   └── export_service.py           # JSON/CSV export
├── storage/
│   ├── folder_manager.py           # Directory hierarchy management
│   ├── json_writer.py              # Atomic JSON file read/write
│   └── snapshot_manager.py         # Content-hash dedup snapshots
├── tools/
│   ├── bulk_download_posters.py    # CLI: batch poster downloads
│   └── bulk_describe_images.py     # CLI: batch AI descriptions
├── utils/
│   ├── errors.py                   # Exception hierarchy
│   ├── hashing.py                  # SHA-256 (string + file)
│   ├── logger.py                   # Logging setup + TkinterHandler
│   └── paths.py                    # Path constants + helpers
├── tests/
│   ├── test_models.py              # Photo/PhotoDescription CRUD tests
│   ├── test_ollama_vision.py       # Vision parsing + health check tests
│   └── test_hashing.py             # Hashing utility tests
├── data/                           # (gitignored) Runtime data
│   ├── images/                     #   Downloaded poster images
│   │   └── .tmp/                   #   In-progress downloads
│   └── titles/                     #   JSON snapshots
└── logs/                           # (gitignored) Log files

GitHub Projects Used

This project builds upon and is inspired by several open-source projects:

Foundation

Project	Author	Role in get-imdb-data	Link
get-imdb-json	filipetorresdecarvalho	🏗️ Base architecture — ORM models, download pipeline, GUI framework, snapshot system. The entire service layer, storage layer, and GUI were ported from this project.	GitHub
PyMovieDb	itsmehemant7	📦 Fallback data source — web scraping library used as secondary fetcher when the primary JSON-LD approach fails.	GitHub

AI Vision (Ollama Integration) — Studied for Patterns

Project	Author	What we learned	Link
AutoDescribe-Images	hydropix	📝 Batch prompting patterns for Ollama vision models. Influenced our system prompt design and batch workflow in `bulk_describe_images.py`.	GitHub
Image-AltText-Generator	elbruno	🔄 Clean "image → caption" flow architecture. Confirmed the `base64 → /api/chat → parse` pattern as the standard approach.	GitHub
ollamavision	venkatarangan	🎯 Minimal Python reference for single-image description. Validated our prompt tuning approach for structured output.	GitHub
Ollama-Image-Processing-CLI-Tool	tristan-mcinnis	🛠️ CLI ergonomics and pluggable prompts. Influenced our `--model` and `--dry-run` CLI flags.	GitHub
vlm-batch-descriptor	psychedelicmojo	⚙️ Post-processing pipeline (description → structured metadata storage). Inspired our `photo_descriptions` table design.	GitHub

Core Dependencies

Library	Purpose	Link
SQLAlchemy	ORM + database abstraction (SQLite → MariaDB migration-ready)	Docs
HTTPX	Modern async HTTP client for image downloads with streaming	Docs
Ollama Python	Official Python SDK for the Ollama API	GitHub

How Everything Works Together

graph LR
    subgraph "Data Sources"
        IMDB["🌐 IMDb.com<br/>(Suggestion API + JSON-LD)"]
        PMDB["📦 PyMovieDb<br/>(Web scraping fallback)"]
    end

    subgraph "get-imdb-data"
        FETCH["🔍 IMDb Fetcher<br/>(imdb_fetcher.py)"]
        ADAPT["🔌 PyMovieDb Adapter<br/>(pymoviedb_adapter.py)"]
        ORCH["📦 Download Service<br/>(Orchestrator)"]
        SNAP["💾 Snapshot Manager<br/>(Content-hash dedup)"]
        IMGDL["⬇️ Image Downloader<br/>(HTTPX + atomic writes)"]
        VISION["🧠 Ollama Vision<br/>(ollama_vision.py)"]
        DB["🗃️ SQLAlchemy ORM<br/>(SQLite / MariaDB)"]
    end

    subgraph "External Tools"
        OLLAMA["🖥️ Ollama Server<br/>(localhost:11434)"]
        MODEL["🤖 Vision Model<br/>(llama3.2-vision)"]
    end

    IMDB --> FETCH
    PMDB --> ADAPT
    FETCH --> ORCH
    ADAPT --> ORCH
    ORCH --> SNAP
    ORCH --> IMGDL
    ORCH --> VISION
    ORCH --> DB
    VISION --> OLLAMA
    OLLAMA --> MODEL
    IMGDL -->|"poster.jpg"| VISION

Testing

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_models.py -v
python -m pytest tests/test_ollama_vision.py -v
python -m pytest tests/test_hashing.py -v

Test Coverage

Test File	What's Tested
`test_models.py`	Photo/PhotoDescription CRUD, unique constraints, relationships
`test_ollama_vision.py`	JSON parsing, markdown fence removal, default filling, health check
`test_hashing.py`	SHA-256 determinism, file hashing, unicode handling

All Ollama tests use mocked HTTP responses — no running Ollama server required.

License

See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
config		config
db		db
gui		gui
integrations		integrations
services		services
storage		storage
tests		tests
tools		tools
utils		utils
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
app.py		app.py
implementation_plan_v002.md		implementation_plan_v002.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

🎬 get-imdb-data — Text + Images + AI

📋 Table of Contents

Overview

What it does

High-Level Data Flow

Architecture

System Architecture

Download Pipeline (Detailed)

Atomic File + Database Consistency

Features

Installation

Prerequisites

Setup

Ollama Setup (Optional — for AI Descriptions)

Quick Start

GUI Mode

CLI Mode — Bulk Operations

Configuration

config/default.ini

Environment Variables

CLI Tools

bulk_download_posters

bulk_describe_images

AI Image Description (Ollama)

How It Works

Supported Models

JSON Description Schema

Database Schema

Key Constraints

Project Structure

GitHub Projects Used

Foundation

AI Vision (Ollama Integration) — Studied for Patterns

Core Dependencies

How Everything Works Together

Testing

Test Coverage

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`config/default.ini`

`bulk_download_posters`

`bulk_describe_images`

Packages