[Price Compare] Step 14: Product image crawler & thumbnail storage #85
Open
Description
Step 14 — Product Image Crawler & Thumbnail Storage
Crawl product images from retailer websites for products that exist in the Atrium DB, convert them to 200×200 thumbnails, and store them in the cijene-api database (`DB_DSN`) together with the product code and EAN.
Scope:
- Only crawl images for products that appear in Atrium `troskovi_detalji` (matched by `sifra` or fuzzy name)
- Sources: Metro, Konzum, Tommy, Studenac, Lidl, Ribola
- Store as JPEG thumbnail (200×200, quality 85) in the `product_images` table
DB Model (SQL migration created):
service/db/product_images.sql:
CREATE TABLE IF NOT EXISTS product_images (
    id SERIAL PRIMARY KEY,
    chain_product_id INTEGER NOT NULL REFERENCES chain_products (id),
    ean VARCHAR(50),
    image_data BYTEA NOT NULL,
    image_format VARCHAR(10) NOT NULL DEFAULT 'jpeg',
    width INTEGER NOT NULL DEFAULT 200,
    height INTEGER NOT NULL DEFAULT 200,
    source_url TEXT,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (chain_product_id)
);
CREATE INDEX IF NOT EXISTS idx_product_images_ean ON product_images (ean);
CREATE INDEX IF NOT EXISTS idx_product_images_chain_product_id ON product_images (chain_product_id);

Implementation:
- Script: `scripts/crawl_images.py`
- Flow:
  - Query Atrium DB → get all unique `sifra` values from `troskovi_detalji`
  - For each `sifra`, find matching `chain_products` in cijene-api DB (across all chains)
  - For each matched `chain_product`, check if image already exists in `product_images`
  - If not, crawl the product page, extract product photo
  - Download image, resize to 200×200 with Pillow, convert to JPEG
  - Insert into `product_images` with `chain_product_id`, EAN (from `products.ean`), and thumbnail bytes
- Dependencies: `Pillow>=10.0`, `httpx` (already used), optionally `playwright` for JS-heavy sites
- Rate limiting: 1 req/sec per domain, respect robots.txt
- Cron: Run weekly: `0 10 * * 0` (Sundays 10:00)
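The rate-limiting rule above (1 req/sec per domain, respect robots.txt) might be sketched with the standard library alone, as below. The class `DomainRateLimiter`, the helper `allowed_by_robots`, and the user-agent string `cijene-image-crawler` are all hypothetical names, not part of the existing codebase. The robots.txt text is assumed to have been fetched already (for example with `httpx`), and the clock and sleep functions are injectable so the limiter can be tested without real delays.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the most recent request

    def wait(self, url):
        """Block until the URL's domain may be hit again; return seconds slept."""
        domain = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[domain] = now + delay
        return delay


def allowed_by_robots(robots_txt, url, user_agent="cijene-image-crawler"):
    """Check a URL against the text of an already-fetched robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In the crawl loop, `limiter.wait(url)` would be called right before each HTTP request, and URLs for which `allowed_by_robots` returns `False` would be skipped. Each domain keeps its own timestamp, so requests to different retailers do not slow each other down.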
Image URL patterns (to be verified per chain):
- Metro: scrape from `metrocjenik.com.hr` or product page
- Konzum: `konzum.hr` product page
- Tommy, Studenac, Lidl, Ribola: product page scrape
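Since the per-chain patterns are still to be verified, one generic starting point for the "product page scrape" is to look for an Open Graph `og:image` meta tag. This is a sketch under that assumption only: whether any of the listed retailers expose `og:image` has not been checked, and the names `extract_og_image` and `_OgImageParser` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _OgImageParser(HTMLParser):
    """Collect the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.image_url is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.image_url = attr["content"]


def extract_og_image(html, page_url):
    """Return the absolute og:image URL for a product page, or None."""
    parser = _OgImageParser()
    parser.feed(html)
    if parser.image_url is None:
        return None
    # Resolve relative image paths against the page URL.
    return urljoin(page_url, parser.image_url)
```

Chains whose pages lack the tag, or render it only via JavaScript, would need a chain-specific selector or the optional `playwright` path.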
Optional API endpoint (extend #83):
GET /api/v1/product-image/{chain_product_id}
→ Returns image/jpeg (200×200 thumbnail)
Use cases:
- Atrium ERP — display product images in purchase comparison UI
- Price compare email — embed thumbnails for top items
- Future dashboard — visual product catalog
Files: scripts/crawl_images.py, service/db/product_images.sql ✅ (already created)
Dependencies: Pillow>=10.0
Priority: P3