Skip to content

filipetorresdecarvalho/get-imdb-data

Repository files navigation

🎬 get-imdb-data — Text + Images + AI

A professional-grade Python desktop application for downloading IMDb metadata, poster images, and generating AI-powered image descriptions — all offline-capable.

Built on get-imdb-json | Python 3.12+ | SQLAlchemy + SQLite | HTTPX | Ollama VLMs


📋 Table of Contents


Overview

get-imdb-data is an evolution of get-imdb-json that extends beyond text metadata to include poster image downloading and AI-powered image analysis using local vision language models (VLMs) via Ollama.

What it does

  1. Downloads IMDb metadata → titles, actors, ratings, genres — stored as JSON snapshots and in SQLite
  2. Downloads poster images → atomic download with SHA-256 verification and deduplication
  3. Generates AI descriptions → sends posters to a local Ollama vision model for structured scene analysis

High-Level Data Flow

flowchart LR
    A["👤 User Input<br/>(IMDb ID / Title / Name)"] --> B["🔍 IMDb Fetcher<br/>(Suggestion API + JSON-LD)"]
    B --> C["💾 SQLite Database<br/>(titles, actors, history)"]
    B --> D["📁 JSON Snapshots<br/>(content-hash dedup)"]
    B --> E{"🖼️ Poster URL<br/>found?"}
    E -->|Yes| F["⬇️ Image Downloader<br/>(HTTPX + atomic write)"]
    F --> G["📸 photos table<br/>(hash, path, size)"]
    G --> H{"🤖 Ollama<br/>enabled?"}
    H -->|Yes| I["🧠 Vision Model<br/>(llama3.2-vision)"]
    I --> J["📝 photo_descriptions<br/>(structured JSON)"]
    E -->|No| K["✅ Done<br/>(text only)"]
    H -->|No| K
Loading

Architecture

System Architecture

graph TB
    subgraph "GUI Layer"
        MW["main_window.py<br/>Tkinter Dark Theme"]
        DLG["dialogs.py<br/>Bootstrap / Export / Detail"]
        SP["status_panel.py<br/>StatusBar + LogPanel"]
    end

    subgraph "Service Layer"
        DS["download_service.py<br/>Main Orchestrator"]
        PDS["photo_download_service.py<br/>Poster Downloads"]
        IDS["image_description_service.py<br/>AI Descriptions"]
        BS["bootstrap_service.py<br/>Health Checks"]
        HS["history_service.py<br/>Version Tracking"]
        VS["validation_service.py<br/>Input Validation"]
        ES["export_service.py<br/>JSON/CSV Export"]
    end

    subgraph "Integration Layer"
        IF["imdb_fetcher.py<br/>JSON-LD + Suggestion API"]
        PA["pymoviedb_adapter.py<br/>PyMovieDb Fallback"]
        ID["image_downloader.py<br/>HTTPX + Atomic Write"]
        OV["ollama_vision.py<br/>Local VLM Client"]
    end

    subgraph "Data Layer"
        DB["engine.py<br/>SQLAlchemy Sessions"]
        MD["models.py<br/>ORM Models"]
        FM["folder_manager.py<br/>Directory Hierarchy"]
        SM["snapshot_manager.py<br/>Content-Hash Dedup"]
        JW["json_writer.py<br/>Atomic File Writes"]
    end

    subgraph "External Services"
        IMDB["🌐 IMDb<br/>Suggestion API"]
        OLL["🖥️ Ollama Server<br/>localhost:11434"]
        SQLITE["💾 SQLite<br/>WAL Mode"]
        DISK["📁 Filesystem<br/>data/ directory"]
    end

    MW --> DS
    MW --> ES
    DS --> VS
    DS --> IF
    DS --> PDS
    DS --> IDS
    DS --> HS
    DS --> SM
    PDS --> ID
    IDS --> OV
    BS --> IF
    BS --> OV

    IF --> IMDB
    ID --> IMDB
    OV --> OLL
    DB --> SQLITE
    FM --> DISK
    JW --> DISK
    SM --> FM
    SM --> JW

    DS --> DB
    PDS --> DB
    IDS --> DB
Loading

Download Pipeline (Detailed)

sequenceDiagram
    participant U as 👤 User
    participant GUI as 🖥️ GUI
    participant DS as 📦 DownloadService
    participant IF as 🔍 IMDbFetcher
    participant SM as 💾 SnapshotManager
    participant DB as 🗃️ Database
    participant PDS as 📸 PhotoService
    participant ID as ⬇️ ImageDownloader
    participant IDS as 🤖 DescriptionService
    participant OV as 🧠 OllamaVision

    U->>GUI: Enter "tt0111161"
    GUI->>DS: download("tt0111161")
    
    Note over DS: Step 1: Validate input
    DS->>DS: validate_input() → TITLE_ID
    
    Note over DS: Step 2: Fetch metadata
    DS->>IF: fetch_title("tt0111161")
    IF-->>DS: (data_dict, raw_json, None)
    
    Note over DS: Step 3: Save snapshot
    DS->>SM: save_snapshot("title", "tt0111161", raw_json)
    SM-->>DS: SnapshotResult(is_duplicate=False)
    
    Note over DS: Step 4: Persist to DB
    DS->>DB: INSERT/UPDATE titles
    
    Note over DS: Step 5: Download poster
    DS->>PDS: download_poster("tt0111161", poster_url)
    PDS->>ID: download_image(url, "tt0111161.jpg")
    ID-->>PDS: ImageDownloadResult(hash, path)
    PDS->>DB: INSERT photos
    PDS-->>DS: PhotoResult(success=True)
    
    Note over DS: Step 6: AI description
    DS->>IDS: describe_by_imdb_id("tt0111161")
    IDS->>OV: describe_image(image_path)
    OV-->>IDS: {objects, scene, colors, ...}
    IDS->>DB: INSERT photo_descriptions
    IDS-->>DS: DescriptionResult(success=True)
    
    DS-->>GUI: DownloadResult ✅
    GUI-->>U: Show in results tree
Loading

Atomic File + Database Consistency

flowchart TD
    A["Start Download"] --> B["Create temp file<br/>.tmp/img_XXXXX.part"]
    B --> C["Stream bytes<br/>+ compute SHA-256"]
    C --> D{"Download<br/>complete?"}
    D -->|No / Error| E["Delete temp file<br/>❌ No orphans"]
    D -->|Yes| F["os.replace()<br/>temp → final path"]
    F --> G{"DB INSERT<br/>photo row"}
    G -->|Success| H["✅ Done<br/>File + DB consistent"]
    G -->|Failure| I["Delete final file<br/>❌ No orphan rows"]

    style E fill:#f38ba8,color:#1e1e2e
    style I fill:#f38ba8,color:#1e1e2e
    style H fill:#a6e3a1,color:#1e1e2e
Loading

Features

Feature Status Description
🔍 IMDb Fetcher JSON-LD extraction + suggestion API (no browser needed)
📦 PyMovieDb Fallback Secondary data source when primary fails
💾 SQLite + WAL Reliable database with write-ahead logging
📁 JSON Snapshots Content-hash deduplication, immutable snapshots
📜 Version History Every update archives the previous version
📸 Poster Downloads HTTPX streaming + SHA-256 + atomic writes
🤖 AI Descriptions Local Ollama VLM for offline image analysis
🖥️ Dark GUI Catppuccin Mocha themed Tkinter interface
📤 Export JSON and CSV export for titles and actors
🧪 Tests pytest with mocked HTTP and in-memory SQLite
🛠️ CLI Tools Bulk poster download + bulk AI description

Installation

Prerequisites

  • Python 3.12+Download
  • Ollama (optional) — Download (for AI image descriptions)

Setup

# Clone the repository
git clone https://github.com/filipetorresdecarvalho/get-imdb-data.git
cd get-imdb-data

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

Ollama Setup (Optional — for AI Descriptions)

# Install Ollama from https://ollama.com/download

# Pull a vision-capable model
ollama pull llama3.2-vision

# Start the server (if not auto-started)
ollama serve

Quick Start

GUI Mode

python app.py

This launches the Tkinter GUI where you can:

  1. Enter an IMDb ID (e.g. tt0111161) or title name (e.g. Inception)
  2. Click Download — fetches metadata, saves poster, and (optionally) generates AI description
  3. Double-click a result row to see full JSON details

CLI Mode — Bulk Operations

# Download posters for all titles in your database
python -m tools.bulk_download_posters --rate 5

# Generate AI descriptions for all posters
python -m tools.bulk_describe_images --model llama3.2-vision

Configuration

All settings are in config/default.ini and can be overridden with environment variables.

config/default.ini

[database]
url = sqlite:///data/imdb_data.db

[storage]
data_dir = data
log_dir = logs

[images]
enabled = true              # Auto-download posters
images_subdir = images      # Subdirectory under data_dir
max_concurrent_downloads = 10
rate_limit_per_second = 10
retry_max_attempts = 3
max_file_size_mb = 10

[ollama]
enabled = false             # Set to true to enable AI descriptions
base_url = http://localhost:11434
model = llama3.2-vision     # Any vision-capable model
timeout = 120               # Seconds (vision models can be slow)
max_retries = 1             # Retry on invalid JSON response

[app]
log_level = INFO
window_title = IMDb Data Manager — Text + Images + AI
request_timeout = 30

Environment Variables

Variable Overrides Example
IMDB_APP_DB_URL database.url sqlite:///custom.db
IMDB_APP_DATA_DIR storage.data_dir /opt/imdb/data
IMDB_APP_LOG_DIR storage.log_dir /var/log/imdb
IMDB_APP_LOG_LEVEL app.log_level DEBUG
IMDB_APP_IMAGES_ENABLED images.enabled true / false
IMDB_APP_OLLAMA_ENABLED ollama.enabled true / false
IMDB_APP_OLLAMA_MODEL ollama.model llava

CLI Tools

bulk_download_posters

Downloads poster images for all titles in the database that don't already have one.

python -m tools.bulk_download_posters [OPTIONS]

Options:
  --limit N     Max titles to process (0 = all, default: 0)
  --rate N      Max downloads per second (default: 5.0)
  --dry-run     Show what would be downloaded without doing it

bulk_describe_images

Generates AI descriptions for all poster images using local Ollama.

python -m tools.bulk_describe_images [OPTIONS]

Options:
  --limit N     Max images to process (0 = all, default: 0)
  --model NAME  Override Ollama model (default: from config)
  --dry-run     Show what would be described without doing it
  --force       Re-describe even if description already exists

AI Image Description (Ollama)

How It Works

flowchart LR
    A["📸 Poster Image<br/>tt0111161.jpg"] -->|base64 encode| B["📤 POST /api/chat<br/>Ollama localhost:11434"]
    B -->|System prompt<br/>+ Image| C["🧠 Vision Model<br/>(llama3.2-vision)"]
    C -->|Structured JSON| D["✅ Parse & Validate<br/>Required keys check"]
    D -->|Valid| E["💾 photo_descriptions<br/>table"]
    D -->|Invalid JSON| F["🔄 Retry with<br/>repair prompt"]
    F -->|Still invalid| G["❌ Log error<br/>skip this image"]
Loading

Supported Models

Any vision-capable Ollama model works. Tested recommendations:

Model Size Speed Quality Command
llama3.2-vision 7B ⚡ Fast ⭐⭐⭐ Good ollama pull llama3.2-vision
llava 7B ⚡ Fast ⭐⭐ Decent ollama pull llava
llava:13b 13B 🐢 Slow ⭐⭐⭐⭐ Great ollama pull llava:13b

JSON Description Schema

Every image description follows this structured format:

{
  "objects": [
    {"name": "person", "confidence": 0.95, "attributes": {"clothing": "suit"}}
  ],
  "scene": {
    "setting": "prison courtyard",
    "location_type": "outdoor",
    "time_of_day": "day",
    "weather": "overcast",
    "activity_summary": "A man stands alone in a prison yard"
  },
  "colors": {
    "dominant": [{"color": "grey", "hex": "#808080", "percent": 0.4}],
    "palette_summary": "Muted greys and blues conveying confinement"
  },
  "emotions": {
    "mood": "contemplative",
    "cues": "distant gaze, isolated figure",
    "confidence": 0.8
  },
  "text_in_image": {
    "present": true,
    "extracted_text": ["The Shawshank Redemption"]
  },
  "safety": {
    "sensitive_content": false,
    "notes": "Movie poster, no sensitive content"
  },
  "detailed_analysis": "The poster depicts a solitary figure...",
  "metadata": {
    "model_used": "llama3.2-vision",
    "created_at_iso": "2026-04-02T22:00:00+00:00",
    "image_file": "tt0111161.jpg",
    "file_size_bytes": 45230
  }
}

Database Schema

erDiagram
    actors ||--o{ titles : "has"
    titles ||--o{ title_details : "has"
    titles ||--o{ title_history : "has"
    photos ||--o| photo_descriptions : "has"

    actors {
        int id PK
        string imdb_id UK
        string name
        text raw_json
        datetime created_at
        datetime updated_at
        bool is_active
    }

    titles {
        int id PK
        string imdb_id UK
        string title
        string type
        int year
        text raw_json
        int actor_id FK
        datetime created_at
        datetime updated_at
        bool is_active
    }

    title_details {
        int id PK
        int title_id FK
        string data_type
        text raw_json
        datetime created_at
        bool is_active
    }

    title_history {
        int id PK
        int title_id FK
        string snapshot_hash
        text raw_json
        datetime archived_at
        string reason
    }

    photos {
        int id PK
        string imdb_id
        string image_kind
        text source_url
        string image_path UK
        string image_hash
        int file_size
        int width
        int height
        datetime created_at
    }

    photo_descriptions {
        int id PK
        int photo_id FK_UK
        string model_used
        text description_json
        string image_hash_at_description
        float confidence_score
        datetime created_at
    }
Loading

Key Constraints

  • photos(imdb_id, image_kind)UNIQUE: one poster per title
  • photos(image_path)UNIQUE: no duplicate file references
  • photo_descriptions(photo_id)UNIQUE: one description per photo
  • title_historyappend-only: immutable snapshots

Project Structure

get-imdb-data/
├── app.py                          # Application entry point
├── requirements.txt                # Python dependencies
├── CHANGELOG.md                    # Version history
├── config/
│   ├── settings.py                 # Configuration loader (INI + env vars)
│   └── default.ini                 # Default settings
├── db/
│   ├── engine.py                   # SQLAlchemy engine + session_scope()
│   └── models.py                   # ORM: Actor, Title, Photo, PhotoDescription
├── gui/
│   ├── main_window.py              # Dark-themed Tkinter main window
│   ├── dialogs.py                  # Bootstrap, Export, Detail dialogs
│   └── status_panel.py             # StatusBar + LogPanel widgets
├── integrations/
│   ├── imdb_fetcher.py             # JSON-LD + Suggestion API fetcher
│   ├── pymoviedb_adapter.py        # PyMovieDb fallback wrapper
│   ├── image_downloader.py         # HTTPX streaming + atomic writes
│   └── ollama_vision.py            # Local VLM client for Ollama
├── services/
│   ├── download_service.py         # Main orchestrator (text+image+AI)
│   ├── photo_download_service.py   # Poster download + DB persistence
│   ├── image_description_service.py # AI description + DB persistence
│   ├── bootstrap_service.py        # Startup health checks
│   ├── history_service.py          # Title version tracking
│   ├── validation_service.py       # Input validation + classification
│   └── export_service.py           # JSON/CSV export
├── storage/
│   ├── folder_manager.py           # Directory hierarchy management
│   ├── json_writer.py              # Atomic JSON file read/write
│   └── snapshot_manager.py         # Content-hash dedup snapshots
├── tools/
│   ├── bulk_download_posters.py    # CLI: batch poster downloads
│   └── bulk_describe_images.py     # CLI: batch AI descriptions
├── utils/
│   ├── errors.py                   # Exception hierarchy
│   ├── hashing.py                  # SHA-256 (string + file)
│   ├── logger.py                   # Logging setup + TkinterHandler
│   └── paths.py                    # Path constants + helpers
├── tests/
│   ├── test_models.py              # Photo/PhotoDescription CRUD tests
│   ├── test_ollama_vision.py       # Vision parsing + health check tests
│   └── test_hashing.py             # Hashing utility tests
├── data/                           # (gitignored) Runtime data
│   ├── images/                     #   Downloaded poster images
│   │   └── .tmp/                   #   In-progress downloads
│   └── titles/                     #   JSON snapshots
└── logs/                           # (gitignored) Log files

GitHub Projects Used

This project builds upon and is inspired by several open-source projects:

Foundation

Project Author Role in get-imdb-data Link
get-imdb-json filipetorresdecarvalho 🏗️ Base architecture — ORM models, download pipeline, GUI framework, snapshot system. The entire service layer, storage layer, and GUI were ported from this project. GitHub
PyMovieDb itsmehemant7 📦 Fallback data source — web scraping library used as secondary fetcher when the primary JSON-LD approach fails. GitHub

AI Vision (Ollama Integration) — Studied for Patterns

Project Author What we learned Link
AutoDescribe-Images hydropix 📝 Batch prompting patterns for Ollama vision models. Influenced our system prompt design and batch workflow in bulk_describe_images.py. GitHub
Image-AltText-Generator elbruno 🔄 Clean "image → caption" flow architecture. Confirmed the base64 → /api/chat → parse pattern as the standard approach. GitHub
ollamavision venkatarangan 🎯 Minimal Python reference for single-image description. Validated our prompt tuning approach for structured output. GitHub
Ollama-Image-Processing-CLI-Tool tristan-mcinnis 🛠️ CLI ergonomics and pluggable prompts. Influenced our --model and --dry-run CLI flags. GitHub
vlm-batch-descriptor psychedelicmojo ⚙️ Post-processing pipeline (description → structured metadata storage). Inspired our photo_descriptions table design. GitHub

Core Dependencies

Library Purpose Link
SQLAlchemy ORM + database abstraction (SQLite → MariaDB migration-ready) Docs
HTTPX Modern async HTTP client for image downloads with streaming Docs
Ollama Python Official Python SDK for the Ollama API GitHub

How Everything Works Together

graph LR
    subgraph "Data Sources"
        IMDB["🌐 IMDb.com<br/>(Suggestion API + JSON-LD)"]
        PMDB["📦 PyMovieDb<br/>(Web scraping fallback)"]
    end

    subgraph "get-imdb-data"
        FETCH["🔍 IMDb Fetcher<br/>(imdb_fetcher.py)"]
        ADAPT["🔌 PyMovieDb Adapter<br/>(pymoviedb_adapter.py)"]
        ORCH["📦 Download Service<br/>(Orchestrator)"]
        SNAP["💾 Snapshot Manager<br/>(Content-hash dedup)"]
        IMGDL["⬇️ Image Downloader<br/>(HTTPX + atomic writes)"]
        VISION["🧠 Ollama Vision<br/>(ollama_vision.py)"]
        DB["🗃️ SQLAlchemy ORM<br/>(SQLite / MariaDB)"]
    end

    subgraph "External Tools"
        OLLAMA["🖥️ Ollama Server<br/>(localhost:11434)"]
        MODEL["🤖 Vision Model<br/>(llama3.2-vision)"]
    end

    IMDB --> FETCH
    PMDB --> ADAPT
    FETCH --> ORCH
    ADAPT --> ORCH
    ORCH --> SNAP
    ORCH --> IMGDL
    ORCH --> VISION
    ORCH --> DB
    VISION --> OLLAMA
    OLLAMA --> MODEL
    IMGDL -->|"poster.jpg"| VISION
Loading

Testing

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_models.py -v
python -m pytest tests/test_ollama_vision.py -v
python -m pytest tests/test_hashing.py -v

Test Coverage

Test File What's Tested
test_models.py Photo/PhotoDescription CRUD, unique constraints, relationships
test_ollama_vision.py JSON parsing, markdown fence removal, default filling, health check
test_hashing.py SHA-256 determinism, file hashing, unicode handling

All Ollama tests use mocked HTTP responses — no running Ollama server required.


License

See LICENSE for details.

About

based on get-imdb-json, but this will get text and images

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages