IMDb Data Downloader & Manager — MVP Implementation Plan

Background & Problem

Build a production-grade Python desktop application for downloading, organizing, storing, versioning, and exporting IMDb-related data. The app builds on two public repos (PyMovieDb as the data-acquisition dependency, imdb-scraper as an architectural reference only) and adds a complete orchestration layer: GUI, database, versioning, exports, logging, and bootstrap automation.


User Review Required

Important

Database choice for MVP: The brief specifies MariaDB/MySQL via SQLAlchemy. For the MVP, I recommend SQLite via SQLAlchemy so the app runs out of the box with zero database setup. The SQLAlchemy models stay the same, so switching to MariaDB later should mostly come down to changing the connection string and installing a MySQL driver. This lowers the barrier to first-run success.

Important

Python version: The system has Python 3.14.3. The PyMovieDb release on PyPI may not be fully compatible with 3.14. I'll test during installation and fall back to vendoring the relevant source if needed.

Important

imdb-scraper strategy: Rather than cloning/depending on imdb-scraper at runtime, I'll use it purely as an architectural reference for database patterns. This avoids an unlicensed dependency and a Flask/BeautifulSoup/matplotlib stack the user doesn't need. The MVP will only depend on PyMovieDb for data acquisition.


Proposed Changes

MVP Scope (what will be built now)

| Feature | Status |
| --- | --- |
| Tkinter GUI (search, download, view, export, log viewer) | ✅ Build |
| PyMovieDb integration (search by name/ID, person lookup) | ✅ Build |
| SQLite database via SQLAlchemy ORM | ✅ Build |
| Record versioning / history snapshots | ✅ Build |
| Folder-based JSON storage | ✅ Build |
| Content-hash duplicate detection | ✅ Build |
| JSON & CSV export | ✅ Build |
| Timestamped file logging + GUI log panel | ✅ Build |
| Bootstrap / dependency checker | ✅ Build |
| Configuration system (INI file) | ✅ Build |
| Comprehensive error handling | ✅ Build |
| Professional README.md | ✅ Build |

Deferred to v2 (well-commented stubs)

| Feature | Notes |
| --- | --- |
| MariaDB/MySQL backend | Connection string swap; models ready |
| SQL dump export | Needs DB-specific tooling |
| Auto-clone dependency repos from GitHub | Stubs + README instructions |
| Retry queue for failed fetches | Commented architecture |
| Migration helpers | Alembic integration stub |
| Actor/title history browser panel | GUI stub |
| Configurable endpoint sets | Config stub |

Project File Structure

c:\Users\Administrator\GitHub\get-imdb-json\
├── app.py                          # Entry point
├── README.md                       # Professional documentation
├── requirements.txt                # Dependencies
├── config/
│   ├── __init__.py
│   ├── settings.py                 # Config loader / dataclass
│   └── default.ini                 # Default configuration
├── gui/
│   ├── __init__.py
│   ├── main_window.py              # Main Tkinter window
│   ├── dialogs.py                  # Setup wizard, error dialogs
│   └── status_panel.py             # Status bar, log viewer panel
├── services/
│   ├── __init__.py
│   ├── bootstrap_service.py        # Dependency & DB health checks
│   ├── download_service.py         # IMDb data acquisition orchestrator
│   ├── export_service.py           # JSON/CSV export
│   ├── validation_service.py       # Input validation (IMDb IDs, names)
│   └── history_service.py          # Versioning / snapshot logic
├── db/
│   ├── __init__.py
│   ├── engine.py                   # SQLAlchemy engine + session factory
│   └── models.py                   # ORM models (Actor, Title, TitleDetail, TitleHistory)
├── storage/
│   ├── __init__.py
│   ├── folder_manager.py           # Folder hierarchy creation
│   ├── json_writer.py              # JSON read/write with atomic safety
│   └── snapshot_manager.py         # Content-hash snapshots
├── integrations/
│   ├── __init__.py
│   └── pymoviedb_adapter.py        # Adapter wrapping PyMovieDb calls
├── utils/
│   ├── __init__.py
│   ├── logger.py                   # Logging setup (file + GUI handler)
│   ├── hashing.py                  # SHA-256 content hashing
│   ├── paths.py                    # Path constants & helpers
│   └── errors.py                   # Custom exception hierarchy
├── logs/                           # Runtime log files (gitignored)
├── data/                           # Downloaded JSON data (gitignored)
└── tests/                          # Future test suite
    └── __init__.py

Component Details

[NEW] app.py

Entry point. Initializes logging, loads config, runs bootstrap checks, launches the Tkinter GUI.


Config Component

[NEW] settings.py
  • @dataclass AppConfig with all tunables (DB URL, data dir, log level, etc.)
  • Loads from default.ini, overridable by environment variables
  • Creates directories on first access
[NEW] default.ini
[database]
url = sqlite:///data/imdb_data.db

[storage]
data_dir = data
log_dir = logs

[app]
log_level = INFO
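
As a rough illustration of how settings.py might combine the INI defaults with environment overrides, here is a minimal sketch. The field names, the `load_config` helper, and the `IMDB_APP_*` variable names are assumptions made for this example, not part of the brief.

```python
import configparser
import os
from dataclasses import dataclass

# Mirrors default.ini so the sketch runs without touching the filesystem.
DEFAULT_INI = """
[database]
url = sqlite:///data/imdb_data.db

[storage]
data_dir = data
log_dir = logs

[app]
log_level = INFO
"""


@dataclass(frozen=True)
class AppConfig:
    db_url: str
    data_dir: str
    log_dir: str
    log_level: str


def load_config(ini_text: str = DEFAULT_INI, env=None) -> AppConfig:
    """Parse INI text, then let environment variables override each value."""
    env = os.environ if env is None else env
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # The IMDB_APP_* names are hypothetical, chosen only for this sketch.
    return AppConfig(
        db_url=env.get("IMDB_APP_DB_URL", parser.get("database", "url")),
        data_dir=env.get("IMDB_APP_DATA_DIR", parser.get("storage", "data_dir")),
        log_dir=env.get("IMDB_APP_LOG_DIR", parser.get("storage", "log_dir")),
        log_level=env.get("IMDB_APP_LOG_LEVEL", parser.get("app", "log_level")),
    )
```

Passing `env` explicitly keeps the loader easy to test. Directory creation is left out here so that loading configuration has no side effects.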

Database Component

[NEW] engine.py
  • get_engine(db_url) — creates SQLAlchemy engine
  • get_session_factory(engine) — returns sessionmaker
  • init_db(engine) — creates all tables via Base.metadata.create_all
  • Context manager session_scope() for safe commit/rollback
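
The commit/rollback behaviour of session_scope() can be sketched without a database. The version below is an assumption about its shape: it accepts any zero-argument factory (with SQLAlchemy, that would be the sessionmaker returned by get_session_factory) and relies only on the `commit`, `rollback`, and `close` methods. `FakeSession` is a hypothetical stand-in used to show the call order.

```python
from contextlib import contextmanager


@contextmanager
def session_scope(session_factory):
    """Yield a session; commit on success, roll back on error, always close."""
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


class FakeSession:
    """Stand-in for a real session that records which methods were called."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")
```

Re-raising after the rollback keeps errors visible to the service layer, which can then translate them into the app's own exception types.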
[NEW] models.py

Four SQLAlchemy ORM models matching the ER diagram:

  • Actor: imdb_id, name, raw_json, created_at, updated_at, is_active
  • Title: imdb_id, title, type, year, raw_json, actor_id FK, timestamps, is_active
  • TitleDetail: title_id FK, data_type, raw_json, timestamps, is_active
  • TitleHistory: title_id FK, snapshot_hash, raw_json, archived_at, reason

Integration Component

[NEW] pymoviedb_adapter.py

Thin adapter around PyMovieDb.IMDB:

  • search_title(name, year, tv) → parsed dict
  • get_title_by_id(imdb_id) → parsed dict
  • get_person_by_name(name) → parsed dict
  • get_person_by_id(imdb_id) → parsed dict
  • search_person(name) → parsed dict
  • All methods return (data, error) tuples — never raise to caller
  • JSON parse safety (PyMovieDb returns JSON strings, not dicts)
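
One way to get the "(data, error) tuple, never raise" contract is to route every library call through a small wrapper. The sketch below is illustrative only: `parse_response` and `call_safely` are hypothetical helper names, and `fetch` stands in for whichever PyMovieDb method the adapter ends up calling.

```python
import json
from typing import Any, Callable, Optional, Tuple

Result = Tuple[Optional[dict], Optional[str]]


def parse_response(raw: Any) -> Result:
    """Turn a JSON string into (data, None), or (None, error) if unusable."""
    if not isinstance(raw, str):
        return None, f"expected a JSON string, got {type(raw).__name__}"
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "JSON payload is not an object"
    return data, None


def call_safely(fetch: Callable[..., Any], *args: Any) -> Result:
    """Run a fetch callable and keep its exceptions away from the caller."""
    try:
        raw = fetch(*args)
    except Exception as exc:  # network failures, library errors, and so on
        return None, f"{type(exc).__name__}: {exc}"
    return parse_response(raw)
```

Each adapter method would then reduce to a single `call_safely(...)` line, so the error-handling policy lives in one place.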

Services Component

[NEW] download_service.py

Orchestrates the full download flow:

  1. Validate input
  2. Call adapter
  3. Compute content hash
  4. Check for duplicates
  5. Save to folder storage
  6. Upsert to database (version old record first)
  7. Return result summary
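
The seven steps above can be sketched as a single function with its collaborators injected. This is a simplified illustration, not the planned implementation: `download_record`, the `fetch` and `save` callables, and the in-memory `known_hashes` dict (standing in for the database lookup) are all assumptions.

```python
import hashlib
import json
import re
from typing import Callable, Dict, Optional, Tuple

ID_PATTERN = re.compile(r"(tt|nm)\d{7,}")


def download_record(
    imdb_id: str,
    fetch: Callable[[str], Tuple[Optional[dict], Optional[str]]],
    known_hashes: Dict[str, str],
    save: Callable[[str, str], None],
) -> dict:
    """Validate, fetch, hash, dedupe, save, and return a result summary."""
    # Step 1: validate input
    if not ID_PATTERN.fullmatch(imdb_id):
        return {"status": "invalid", "imdb_id": imdb_id}
    # Step 2: call the adapter, which reports errors instead of raising
    data, error = fetch(imdb_id)
    if error is not None:
        return {"status": "failed", "imdb_id": imdb_id, "error": error}
    # Step 3: hash a key-sorted serialization so key order does not matter
    payload = json.dumps(data, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Step 4: skip work when the content has not changed
    if known_hashes.get(imdb_id) == digest:
        return {"status": "unchanged", "imdb_id": imdb_id, "hash": digest}
    # Step 5: persist to folder storage
    save(imdb_id, payload)
    # Step 6 is reduced to remembering the newest hash; the real service
    # would archive the old row and upsert through the database instead.
    known_hashes[imdb_id] = digest
    # Step 7: summary for the GUI
    return {"status": "saved", "imdb_id": imdb_id, "hash": digest}
```

Returning a status dict on every path gives the GUI one shape to render, whether the download succeeded, was skipped, or failed.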
[NEW] history_service.py
  • archive_record(session, model_instance, reason) — copies current state to TitleHistory
  • get_history(session, title_id) — returns all historical snapshots
  • Never deletes — only marks is_active = False
[NEW] export_service.py
  • export_json(session, output_path, filters) — export DB records to JSON
  • export_csv(session, output_path, filters) — export to CSV
  • Respects active/inactive filters
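
To illustrate the export path, the sketch below works on plain dictionaries rather than a database session, so its signatures differ from the ones listed above. `select_records` and the `include_inactive` flag are hypothetical names for the active/inactive filter.

```python
import csv
import json
from pathlib import Path
from typing import Iterable, List


def select_records(records: Iterable[dict], include_inactive: bool = False) -> List[dict]:
    """Keep active rows only, unless inactive rows are explicitly requested."""
    return [r for r in records if include_inactive or r.get("is_active", True)]


def export_json(records: List[dict], output_path: Path) -> int:
    """Write records as a pretty-printed JSON array; return the row count."""
    text = json.dumps(records, indent=2, ensure_ascii=False)
    output_path.write_text(text, encoding="utf-8")
    return len(records)


def export_csv(records: List[dict], output_path: Path) -> int:
    """Write records as CSV, using the sorted union of all keys as the header."""
    fieldnames = sorted({key for record in records for key in record})
    # newline="" lets the csv module control line endings itself
    with output_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

In the real service, the list of dictionaries would presumably come from a query over the ORM models.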
[NEW] validation_service.py
  • validate_imdb_id(text) — checks tt\d{7,} or nm\d{7,} patterns
  • detect_input_type(text) — classifies as IMDB_ID, PERSON_ID, TITLE_NAME, PERSON_NAME
  • sanitize_filename(text) — safe filesystem names
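
A possible shape for these three helpers is sketched below. The regular expressions follow the patterns named above; the `default_name_type` parameter and the set of characters treated as unsafe are assumptions, since free text alone cannot reveal whether it names a title or a person.

```python
import re

TITLE_ID = re.compile(r"tt\d{7,}")
PERSON_ID = re.compile(r"nm\d{7,}")
# Characters rejected by common filesystems, plus ASCII control characters.
UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def validate_imdb_id(text: str) -> bool:
    """Return True for a well-formed title (tt...) or person (nm...) ID."""
    candidate = text.strip()
    return bool(TITLE_ID.fullmatch(candidate) or PERSON_ID.fullmatch(candidate))


def detect_input_type(text: str, default_name_type: str = "TITLE_NAME") -> str:
    """Classify input as IMDB_ID, PERSON_ID, or the caller's chosen name type."""
    candidate = text.strip()
    if TITLE_ID.fullmatch(candidate):
        return "IMDB_ID"
    if PERSON_ID.fullmatch(candidate):
        return "PERSON_ID"
    # Names are ambiguous, so defer to the search-type dropdown's value.
    return default_name_type


def sanitize_filename(text: str, max_length: int = 100) -> str:
    """Replace unsafe characters and trim leading/trailing dots and spaces."""
    cleaned = UNSAFE_CHARS.sub("_", text).strip(" .")
    return cleaned[:max_length] or "untitled"
```

Using `fullmatch` rather than `match` rejects inputs such as `tt0111161abc` that merely start with a valid ID.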
[NEW] bootstrap_service.py
  • check_dependencies() — verifies PyMovieDb is importable
  • check_database(db_url) — tests DB connectivity
  • ensure_directories(config) — creates data/logs folders
  • Returns structured health report for GUI display
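
A structured health report could look roughly like the following. This sketch covers only the dependency and directory checks; the database connectivity check is omitted, and `health_report` with its dictionary layout is a hypothetical design.

```python
import importlib.util
from pathlib import Path
from typing import Dict, Iterable


def check_dependencies(modules: Iterable[str]) -> Dict[str, bool]:
    """Report whether each top-level module can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def ensure_directories(paths: Iterable[Path]) -> Dict[str, bool]:
    """Create each directory if needed and report whether it now exists."""
    report = {}
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
        report[str(path)] = path.is_dir()
    return report


def health_report(modules: Iterable[str], paths: Iterable[Path]) -> dict:
    """Combine the individual checks into one structure for the GUI."""
    deps = check_dependencies(modules)
    dirs = ensure_directories(paths)
    return {
        "dependencies": deps,
        "directories": dirs,
        "ok": all(deps.values()) and all(dirs.values()),
    }
```

Keeping per-item results, rather than a single boolean, would let a setup dialog tell the user exactly which dependency or folder needs attention.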

Storage Component

[NEW] folder_manager.py
  • Creates data/{type}/{imdb_id}/{timestamp}/ structure
  • Ensures no collisions
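
As an illustration of collision-free folder creation, the sketch below lets `mkdir` itself detect an existing folder and appends a numeric suffix when two snapshots share a timestamp. The timestamp format, the suffix scheme, and the `create_snapshot_dir` name are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def create_snapshot_dir(
    data_dir: Path,
    record_type: str,
    imdb_id: str,
    now: Optional[datetime] = None,
) -> Path:
    """Create data/{type}/{imdb_id}/{timestamp}/ and return the new path."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%S")
    parent = data_dir / record_type / imdb_id
    parent.mkdir(parents=True, exist_ok=True)
    candidate = parent / stamp
    suffix = 1
    while True:
        try:
            # exist_ok=False makes mkdir fail if the folder already exists,
            # which avoids a separate check-then-create race.
            candidate.mkdir(exist_ok=False)
            return candidate
        except FileExistsError:
            suffix += 1
            candidate = parent / f"{stamp}-{suffix}"
```

Accepting `now` as a parameter makes the naming deterministic in tests.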
[NEW] json_writer.py
  • Atomic JSON write (write to temp, rename)
  • Pretty-print with 2-space indent
  • Read-back with validation
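
The write-to-temp-then-rename approach can be sketched with the standard library alone. The function names below are hypothetical; the key point is that the temp file is created in the same directory as the target, so that `os.replace` stays on one filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temp file next to the target, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file; the original target is untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_json_validated(path: Path) -> dict:
    """Read a JSON file back and confirm it holds an object."""
    with path.open("r", encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return data
```

If serialization fails midway, readers still see the previous complete file rather than a truncated one.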
[NEW] snapshot_manager.py
  • SHA-256 hash of JSON content
  • Skip-if-duplicate logic
  • Snapshot metadata sidecar files
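
Skip-if-duplicate logic depends on hashing a stable serialization, so that two payloads with the same content but different key order produce the same digest. The sketch below assumes a canonical form with sorted keys; the file names `data.json` and `data.meta.json` and the sidecar fields are illustrative choices.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def content_hash(data: dict) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_snapshot(folder: Path, data: dict, previous_hash: Optional[str]) -> Optional[str]:
    """Write data plus a metadata sidecar; return None when nothing changed."""
    digest = content_hash(data)
    if digest == previous_hash:
        return None  # identical content, so no new snapshot is written
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "data.json").write_text(json.dumps(data, indent=2), encoding="utf-8")
    sidecar = {"sha256": digest, "previous_sha256": previous_hash}
    (folder / "data.meta.json").write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return digest
```

Recording the previous hash in the sidecar would make it possible to walk a record's history on disk without consulting the database.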

GUI Component

[NEW] main_window.py

Tkinter main window with:

  • Search frame — input field, search type dropdown (Title/Person/IMDb ID), search button
  • Results frame — treeview showing results, download button
  • Details frame — display selected record details
  • Action buttons — Export JSON, Export CSV, Init Database, Refresh
  • Log panel — scrolled text widget showing live log output
  • Status bar — current operation status
  • Dark theme using ttk.Style customization
  • Threading for non-blocking downloads
[NEW] dialogs.py
  • SetupWizardDialog — shown when DB is missing, guides through setup
  • ErrorDialog — friendly error display with technical details expandable
  • ExportDialog — choose export format and path
[NEW] status_panel.py
  • Reusable status bar widget
  • Log viewer with level-based coloring

Utils Component

[NEW] logger.py
  • Configures root logger with file handler (rotating, timestamped)
  • Custom TkinterHandler that pushes log records to GUI
  • Format: 2026-04-02 14:21:09 [ERROR] services.download_service: message
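
A handler that feeds the GUI can be sketched as a `logging.Handler` subclass that places formatted lines on a thread-safe queue. This is one possible design for the TkinterHandler idea, under the assumption that the GUI thread drains the queue itself; `GuiQueueHandler` and `attach_gui_handler` are hypothetical names.

```python
import logging
import queue

# Matches the format above: 2026-04-02 14:21:09 [ERROR] name: message
LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


class GuiQueueHandler(logging.Handler):
    """Push formatted log lines onto a queue for the GUI thread to drain."""

    def __init__(self, sink: queue.Queue) -> None:
        super().__init__()
        self.sink = sink
        self.setFormatter(logging.Formatter(LOG_FORMAT, DATE_FORMAT))

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.sink.put_nowait(self.format(record))
        except Exception:
            self.handleError(record)


def attach_gui_handler(logger: logging.Logger, sink: queue.Queue) -> GuiQueueHandler:
    """Attach the handler to a logger and return it for later removal."""
    handler = GuiQueueHandler(sink)
    logger.addHandler(handler)
    return handler
```

Worker threads never touch a widget in this design. The main window would poll the queue periodically (for example with the widget's `after` method) and append each line to the log panel.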
[NEW] hashing.py
  • compute_hash(data: str) -> str — SHA-256
[NEW] paths.py
  • PROJECT_ROOT, DATA_DIR, LOGS_DIR constants
  • ensure_dir(path) helper
[NEW] errors.py
  • IMDbAppError base exception
  • DownloadError, DatabaseError, ValidationError, BootstrapError subclasses
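
The hierarchy itself is small. A minimal sketch follows, where the optional `details` attribute and the `to_user_message` helper are assumptions added to show how an error dialog might separate friendly text from technical detail.

```python
class IMDbAppError(Exception):
    """Base class so callers can catch every app-specific failure at once."""

    def __init__(self, message: str, details: str = "") -> None:
        super().__init__(message)
        self.details = details  # hypothetical field for expandable detail


class DownloadError(IMDbAppError):
    """Fetching data from the upstream source failed."""


class DatabaseError(IMDbAppError):
    """A database operation failed."""


class ValidationError(IMDbAppError):
    """User input did not pass validation."""


class BootstrapError(IMDbAppError):
    """A startup health check failed."""


def to_user_message(exc: Exception) -> str:
    """Show app errors by name; hide the internals of unexpected ones."""
    if isinstance(exc, IMDbAppError):
        return f"{type(exc).__name__}: {exc}"
    return "Unexpected error. See the log file for details."
```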

Documentation

[MODIFY] README.md

Complete rewrite with: project overview, features, prerequisites, installation, database setup, usage guide, architecture diagram (Mermaid), folder structure, troubleshooting, licensing notes, contribution guidance.

[NEW] requirements.txt
PyMovieDb>=0.1.0
SQLAlchemy>=2.0

Open Questions

Important

  1. SQLite vs MariaDB for MVP — I strongly recommend SQLite for MVP (zero setup, portable). The models are identical either way. Do you agree, or do you want MariaDB from day one?

Important

2. imdb-scraper as reference only — Since it has no license and adds heavy dependencies (Flask, matplotlib, lxml), I plan to use it only as a design reference, not as a runtime dependency. Is that acceptable?

Note

3. Scope for MVP — The plan builds all core features (GUI, DB, versioning, exports, logging) as working code, with v2 features as well-commented stubs. Does this scope feel right?


Verification Plan

Automated Tests

  1. pip install -r requirements.txt completes without errors
  2. python app.py launches the GUI without crashes
  3. Search for a known title (e.g., "The Shawshank Redemption") returns results
  4. Download stores JSON to data/ folder and inserts to SQLite
  5. Re-downloading unchanged data is detected by its content hash and stores no duplicate; re-downloading changed data archives the previous version as a history snapshot
  6. Export JSON and CSV produce valid output files
  7. Log file appears in logs/ with correct format

Manual Verification

  • Visual inspection of GUI layout and dark theme
  • Test with invalid IMDb IDs to verify error handling
  • Test with network disabled to verify graceful degradation