[Price Compare] Step 14: Product image crawler & thumbnail storage #85

@frankieboxx

Description

Step 14 — Product Image Crawler & Thumbnail Storage

Crawl product images from retailer websites for products that exist in the Atrium DB, convert them to 200×200 thumbnails, and store them in the cijene-api (DB_DSN) database together with the product code and EAN.

Scope:

  • Only crawl images for products that appear in Atrium troskovi_detalji (matched by sifra or fuzzy name)
  • Sources: Metro, Konzum, Tommy, Studenac, Lidl, Ribola
  • Store as JPEG thumbnail (200×200, quality 85) in product_images table

DB Model (SQL migration created):

service/db/product_images.sql:

CREATE TABLE IF NOT EXISTS product_images (
    id SERIAL PRIMARY KEY,
    chain_product_id INTEGER NOT NULL REFERENCES chain_products (id),
    ean VARCHAR(50),
    image_data BYTEA NOT NULL,
    image_format VARCHAR(10) NOT NULL DEFAULT 'jpeg',
    width INTEGER NOT NULL DEFAULT 200,
    height INTEGER NOT NULL DEFAULT 200,
    source_url TEXT,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (chain_product_id)
);

CREATE INDEX IF NOT EXISTS idx_product_images_ean ON product_images (ean);
-- No separate index on chain_product_id: the UNIQUE constraint above already creates one.
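
Because of the UNIQUE (chain_product_id) constraint, a re-crawl can overwrite an existing row with an upsert instead of failing. The sketch below shows the shape of such a statement. It is an illustration under stated assumptions, not the migration or the script itself: an in-memory SQLite table with a trimmed copy of the columns stands in for PostgreSQL so the example runs without a server, the function name save_thumbnail is hypothetical, and the EAN and URLs are placeholders. With a PostgreSQL driver such as psycopg, the ? placeholders would be %s.

```python
import sqlite3

# ON CONFLICT ... DO UPDATE uses the same syntax in PostgreSQL and SQLite >= 3.24.
UPSERT_SQL = """
INSERT INTO product_images (chain_product_id, ean, image_data, source_url)
VALUES (?, ?, ?, ?)
ON CONFLICT (chain_product_id) DO UPDATE SET
    ean = excluded.ean,
    image_data = excluded.image_data,
    source_url = excluded.source_url,
    updated_at = CURRENT_TIMESTAMP
"""


def save_thumbnail(conn, chain_product_id, ean, image_data, source_url):
    """Insert a thumbnail, or replace the stored one for the same chain product."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT_SQL, (chain_product_id, ean, image_data, source_url))


def demo():
    conn = sqlite3.connect(":memory:")
    # Trimmed stand-in for the product_images table defined above.
    conn.execute(
        """
        CREATE TABLE product_images (
            id INTEGER PRIMARY KEY,
            chain_product_id INTEGER NOT NULL UNIQUE,
            ean TEXT,
            image_data BLOB NOT NULL,
            source_url TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    save_thumbnail(conn, 42, "0000000000000", b"old", "https://example.com/a.jpg")
    save_thumbnail(conn, 42, "0000000000000", b"new", "https://example.com/b.jpg")
    return conn.execute(
        "SELECT chain_product_id, image_data, source_url FROM product_images"
    ).fetchall()
```

Calling demo() leaves a single row for chain product 42 holding the second payload, which is the behaviour a weekly re-crawl would rely on.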

Implementation:

  1. Script: scripts/crawl_images.py
  2. Flow:
    • Query Atrium DB → get all unique sifra values from troskovi_detalji
    • For each sifra, find matching chain_products in cijene-api DB (across all chains)
    • For each matched chain_product, check if image already exists in product_images
    • If not, crawl the product page, extract product photo
    • Download the image, resize it to 200×200 with Pillow, and convert it to JPEG (quality 85)
    • Insert into product_images with chain_product_id, EAN (from products.ean), and thumbnail bytes
  3. Dependencies: Pillow>=10.0, httpx (already used), optionally playwright for JS-heavy sites
  4. Rate limiting: 1 req/sec per domain, respect robots.txt
  5. Cron: Run weekly: 0 10 * * 0 (Sundays 10:00)
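
The rate-limiting rule in item 4 (1 req/sec per domain, respect robots.txt) could be sketched with the standard library alone, as below. The class name DomainThrottle, the function is_allowed, and the user agent string "cijene-image-crawler" are hypothetical names for illustration. The clock and sleep functions are injectable so the logic can be checked without real waiting, and the robots.txt body is passed in as text, so fetching it (for example with httpx) is left to the caller.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit


class DomainThrottle:
    """Allow at most one request per `min_interval` seconds for each host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until a request to the URL's host is allowed; return the delay."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot relative to when this request actually starts.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay


def is_allowed(robots_txt, url, user_agent="cijene-image-crawler"):
    """Check a URL against robots.txt content that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In the crawl loop, each request would first pass is_allowed for its host's robots.txt and then call throttle.wait(url) before the download. Keeping one throttle entry per host means a slow chain does not hold up the others.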

Image URL patterns (to be verified per chain):

  • Metro: scrape from metrocjenik.com.hr or product page
  • Konzum: konzum.hr product page
  • Tommy, Studenac, Lidl, Ribola: product page scrape

Optional API endpoint (extend #83):

GET /api/v1/product-image/{chain_product_id}
→ Returns image/jpeg (200×200 thumbnail)
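
The issue does not name the web framework behind #83, so the following is a framework-agnostic sketch of what the handler would return. The function name image_response, the 404 body, and the one-week Cache-Control value (chosen here to match the weekly crawl) are assumptions; image_data stands for the BYTEA value loaded from product_images, or None when no row exists.

```python
from typing import Dict, Optional, Tuple


def image_response(image_data: Optional[bytes]) -> Tuple[int, Dict[str, str], bytes]:
    """Build (status, headers, body) for GET /api/v1/product-image/{chain_product_id}."""
    if image_data is None:
        # No thumbnail stored for this chain product.
        body = b'{"detail": "image not found"}'
        return 404, {"Content-Type": "application/json"}, body

    headers = {
        "Content-Type": "image/jpeg",
        "Content-Length": str(len(image_data)),
        # Assumption: thumbnails change at most weekly, so clients may cache them.
        "Cache-Control": "public, max-age=604800",
    }
    return 200, headers, image_data
```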

Use cases:

  • Atrium ERP — display product images in purchase comparison UI
  • Price compare email — embed thumbnails for top items
  • Future dashboard — visual product catalog

Files: scripts/crawl_images.py, service/db/product_images.sql ✅ (already created)
Dependencies: Pillow>=10.0
Priority: P3
