# Scraper Refactor Guide - Image Download Integration

## 🎯 Objective

Refactor the Troostwijk scraper to **download and store images locally**, eliminating the 57M+ duplicate image problem in the monitoring process.

## 📋 Current vs. New Architecture

### **Before** (Current Architecture)

```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Stores URLs  │         │ images table │         │ Downloads +  │
│ downloaded=0 │         │              │         │ Detection    │
└──────────────┘         └──────────────┘         └──────────────┘
                                                         │
                                                         ▼
                                                 57M+ duplicates!
```

### **After** (New Architecture)

```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Downloads +  │         │ images table │         │  Detection   │
│ Stores path  │         │ local_path ✓ │         │    Only      │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                                                         │
                                                         ▼
                                                  No duplicates!
```

## 🗄️ Database Schema Changes

### Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)

```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,    -- Currently NULL
    downloaded INTEGER  -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);
```

### Required Schema (Already Compatible!)

```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,       -- ✅ SET by scraper after download
    downloaded INTEGER,    -- ✅ SET to 1 by scraper after download
    labels TEXT,           -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,  -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```

**Good News**: The scraper's schema already has the `local_path` and `downloaded` columns! You just need to populate them.

## 🔧 Implementation Steps

### **Step 1: Enable Image Downloading in Configuration**

**File**: Your scraper's config file (e.g., `config.py` or environment variables)

```python
# Current setting
DOWNLOAD_IMAGES = False  # ❌ Change this!

# New setting
DOWNLOAD_IMAGES = True   # ✅ Enable downloads

# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images"  # Or your preferred path
```

### **Step 2: Update Image Download Logic**

Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change:

**Current Code** (Conceptual):

```python
# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))
```

**New Code** (Required):

```python
import os
import time
from pathlib import Path

import requests

def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)

        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still insert a record, but leave it marked as not downloaded;
            # DO NOTHING avoids clobbering a previously successful download
            # and avoids violating the Step 3 unique index on re-scrape
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))

def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to an organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to the saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL (or the content type; see below)
    ext = Path(image_url).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
```
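The sketch above falls back to `.jpg` whenever the URL carries no file extension. If the image CDN serves extension-less URLs, one hedged option is to derive the extension from the response's `Content-Type` header instead; `extension_for` and its mapping below are illustrative assumptions, not part of the existing scraper:

```python
from pathlib import Path
from urllib.parse import urlparse

# Illustrative mapping; extend for whatever types the CDN actually serves
_CONTENT_TYPE_EXT = {'image/jpeg': '.jpg', 'image/png': '.png', 'image/webp': '.webp'}

def extension_for(image_url, response):
    """Prefer the URL's suffix; fall back to the Content-Type header, then .jpg."""
    # urlparse strips query strings, so 'photo.jpg?w=800' still yields '.jpg'
    ext = Path(urlparse(image_url).path).suffix
    if ext:
        return ext
    content_type = response.headers.get('Content-Type', '').split(';')[0].strip().lower()
    return _CONTENT_TYPE_EXT.get(content_type, '.jpg')
```

Note that using the header means fetching the response before choosing the filename, so `download_image` would need a small reordering.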
### **Step 3: Add Unique Constraint to Prevent Duplicates**

**Migration SQL**:

```sql
-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```

Note: if the table already contains duplicate `(lot_id, url)` rows, creating this index will fail; clean them up first (see the dedup query in Troubleshooting).

Add this to your scraper's schema initialization:

```python
import sqlite3

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()
```

### **Step 4: Handle Image Download Failures Gracefully**

```python
def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads an image with retries and exponential backoff."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}")
                return None  # Return None on failure
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}")
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
```
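Wiring the retry wrapper into the Step 2 loop is then a one-line swap; a minimal sketch (the failure handling mirrors the `except` branch shown earlier):

```python
for idx, img_url in enumerate(lot_data['images'], start=1):
    local_path = download_with_retry(img_url, lot_data['lot_id'], idx)
    if local_path is None:
        # Record the URL but leave it marked as not downloaded
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
            ON CONFLICT(lot_id, url) DO NOTHING
        """, (lot_data['lot_id'], img_url))
        continue
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path))
    time.sleep(0.5)  # Rate limiting between downloads
```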
### **Step 5: Update Database Queries**

Make sure your INSERT uses `INSERT ... ON CONFLICT` to handle re-scraping:

```python
# Good: Handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: Creates duplicates on re-scrape
# (and fails outright once the Step 3 unique index is in place)
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```

## 📊 Expected Outcomes

### Before Refactor

```sql
SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)
```

### After Refactor

```sql
SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)

SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)
```

## 🚀 Deployment Checklist

### Pre-Deployment

- [ ] Back up the current database: `cp cache.db cache.db.backup`
- [ ] Verify disk space: at least 10GB free for images
- [ ] Test the download function on 5 sample lots
- [ ] Verify the `IMAGES_DIR` path exists and is writable

### Deployment

- [ ] Update configuration: `DOWNLOAD_IMAGES = True`
- [ ] Run the schema migration to add the unique index
- [ ] Deploy the updated scraper code
- [ ] Monitor the first 100 lots for errors

### Post-Deployment Verification

```sql
-- Check download success rate
SELECT
    COUNT(*) AS total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) AS downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) AS failed,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) AS success_rate
FROM images;

-- Check for duplicates (should return 0 rows)
SELECT lot_id, url, COUNT(*) AS dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Verify downloaded rows have file paths recorded
SELECT COUNT(*) FROM images
WHERE downloaded = 1
  AND local_path IS NOT NULL
  AND local_path != '';
```

## 🔍 Monitoring Process Impact

The monitoring process (auctiora) will automatically:

- ✅ Stop downloading images (network I/O eliminated)
- ✅ Only run object detection on `local_path` files
- ✅ Query: `WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '')`
- ✅ Update only the `labels` and `processed_at` columns

**No changes needed in the monitoring process!** It has already been updated to work with scraper-downloaded images.
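For reference only (no changes required), a sketch of the selection the monitor now runs, expressed in Python/sqlite3; the `WHERE` clause comes from the query above, while the column list and function name are assumptions:

```python
import sqlite3

def images_needing_detection(db_path='/mnt/okcomputer/output/cache.db'):
    """Return images that have a local file but no labels yet."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("""
            SELECT id, lot_id, local_path FROM images
            WHERE local_path IS NOT NULL
              AND (labels IS NULL OR labels = '')
        """).fetchall()
    finally:
        conn.close()
```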
logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes") return local_path ``` ### Problem: Duplicate images after refactor ```sql -- Find duplicates SELECT lot_id, url, COUNT(*) FROM images GROUP BY lot_id, url HAVING COUNT(*) > 1; -- Clean up duplicates (keep newest) DELETE FROM images WHERE id NOT IN ( SELECT MAX(id) FROM images GROUP BY lot_id, url ); ``` ## πŸ“ˆ Performance Comparison | Metric | Before (Monitor Downloads) | After (Scraper Downloads) | |----------------------|---------------------------------|---------------------------| | **Image records** | 57,376,293 | ~16,807 | | **Duplicates** | 57,359,486 (99.97%!) | 0 | | **Network I/O** | Monitor process | Scraper process | | **Disk usage** | 0 (URLs only) | ~1.6GB (actual files) | | **Processing speed** | 500ms/image (download + detect) | 100ms/image (detect only) | | **Error handling** | Complex (download failures) | Simple (files exist) | ## πŸŽ“ Code Examples by Language ### Python (Most Likely) See **Step 2** above for complete implementation. ## πŸ“š References - **Current Scraper Architecture**: `wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md` - **Database Schema**: `wiki/DATABASE_ARCHITECTURE.md` - **Monitor Changes**: See commit history for `ImageProcessingService.java`, `DatabaseService.java` ## βœ… Success Criteria You'll know the refactor is successful when: 1. βœ… Database query `SELECT COUNT(*) FROM images` returns ~16,807 (not 57M+) 2. βœ… All images have `downloaded = 1` and `local_path IS NOT NULL` 3. βœ… No duplicate records: `SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1` returns 0 rows 4. βœ… Monitor logs show "Found N images needing detection" with reasonable numbers 5. βœ… Files exist at paths in `local_path` column 6. βœ… Monitor process speed increases (100ms vs 500ms per image) --- **Questions?** Check the troubleshooting section or inspect the monitor's updated code in: - `src/main/java/auctiora/ImageProcessingService.java` - `src/main/java/auctiora/DatabaseService.java:695-719`