Scraper Refactor Guide - Image Download Integration

🎯 Objective

Refactor the Troostwijk scraper to download and store images locally, eliminating the 57M+ duplicate image-record problem in the monitoring process.

📋 Current vs. New Architecture

Before (Current Architecture)

┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Stores URLs  │         │ images table │         │ Downloads +  │
│ downloaded=0 │         │              │         │ Detection    │
└──────────────┘         └──────────────┘         └──────────────┘
                                                         │
                                                         ▼
                                                   57M+ duplicates!

After (New Architecture)

┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Downloads +  │         │ images table │         │ Detection    │
│ Stores path  │         │ local_path ✓ │         │ Only         │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                                                         │
                                                         ▼
                                                   No duplicates!

🗄️ Database Schema Changes

Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)

CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,           -- Currently NULL
    downloaded INTEGER         -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);

Required Schema (Already Compatible!)

CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,           -- ✅ SET by scraper after download
    downloaded INTEGER,        -- ✅ SET to 1 by scraper after download
    labels TEXT,               -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,      -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);

Good News: The scraper's schema already has local_path and downloaded columns! You just need to populate them.
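To confirm compatibility before enabling downloads, a quick check sketch (assuming the cache.db path used elsewhere in this guide):

import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
columns = {row[1] for row in conn.execute("PRAGMA table_info(images)")}
conn.close()

# The scraper only needs these two; labels/processed_at are added by the monitor
missing = {'local_path', 'downloaded'} - columns
print("Schema OK" if not missing else f"Missing columns: {missing}")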

🔧 Implementation Steps

Step 1: Enable Image Downloading in Configuration

File: Your scraper's config file (e.g., config.py or environment variables)

# Current setting
DOWNLOAD_IMAGES = False  # ❌ Change this!

# New setting
DOWNLOAD_IMAGES = True   # ✅ Enable downloads

# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images"  # Or your preferred path
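If you drive these settings from environment variables instead of config.py, a minimal sketch (names mirror the settings above):

import os

DOWNLOAD_IMAGES = os.getenv('DOWNLOAD_IMAGES', 'false').lower() == 'true'
IMAGES_DIR = os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images')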

Step 2: Update Image Download Logic

Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change:

Current Code (Conceptual):

# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))

New Code (Required):

import os
import time
from pathlib import Path
from urllib.parse import urlparse

import requests

def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)

        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still insert record but mark as not downloaded
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))

def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL path (Path(image_url).suffix alone
    # would drag query strings like ?w=500 into the suffix)
    ext = Path(urlparse(image_url).path).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
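If you would rather derive the extension from the response when the URL path has none, a sketch using the stdlib mimetypes module (pick_extension is a hypothetical helper; it needs the response object, so it would be called after requests.get):

import mimetypes
from pathlib import Path
from urllib.parse import urlparse

def pick_extension(image_url, response):
    """Prefer the URL path's suffix; fall back to the Content-Type header."""
    ext = Path(urlparse(image_url).path).suffix
    if ext:
        return ext
    content_type = response.headers.get('Content-Type', '').split(';')[0].strip()
    return mimetypes.guess_extension(content_type) or '.jpg'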

Step 3: Add Unique Constraint to Prevent Duplicates

Migration SQL:

-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);

Add this to your scraper's schema initialization:

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
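    # NOTE: creating a unique index fails if duplicate (lot_id, url) rows
    # already exist; dedupe first (see Troubleshooting)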
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()

Step 4: Handle Image Download Failures Gracefully

def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads image with retry logic."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}")
                return None  # Return None on failure
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}")
            time.sleep(2 ** attempt)  # Exponential backoff
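A sketch of wiring the retry helper into the Step 2 loop (same names as above); a None result keeps the row but leaves downloaded = 0:

local_path = download_with_retry(img_url, lot_data['lot_id'], idx)
if local_path is not None:
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path))
else:
    db.execute("""
        INSERT INTO images (lot_id, url, downloaded)
        VALUES (?, ?, 0)
        ON CONFLICT(lot_id, url) DO NOTHING
    """, (lot_data['lot_id'], img_url))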

Step 5: Update Database Queries

Make sure your INSERT uses INSERT ... ON CONFLICT to handle re-scraping (SQLite only accepts the ON CONFLICT(lot_id, url) target once the Step 3 unique index exists):

# Good: Handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: creates duplicates on re-scrape (or raises IntegrityError once the unique index exists)
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))

📊 Expected Outcomes

Before Refactor

SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)

After Refactor

SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)

SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)

🚀 Deployment Checklist

Pre-Deployment

  • Back up current database: cp cache.db cache.db.backup
  • Verify disk space: At least 10GB free for images
  • Test download function on 5 sample lots (see the smoke-test sketch below)
  • Verify IMAGES_DIR path exists and is writable
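A smoke-test sketch for a single image using download_image from Step 2 (the URL here is a hypothetical placeholder; the lot_id format matches the example in Step 2):

import os

path = download_image('https://example.com/lots/123/photo.jpg', 'A1-28505-5', 1)
assert os.path.isfile(path), f"Download reported success but {path} is missing"
print(f"Saved {os.path.getsize(path)} bytes to {path}")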

Deployment

  • Update configuration: DOWNLOAD_IMAGES = True
  • Run schema migration to add unique index
  • Deploy updated scraper code
  • Monitor first 100 lots for errors

Post-Deployment Verification

-- Check download success rate
SELECT
    COUNT(*) as total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) as downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) as failed,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
FROM images;

-- Check for duplicates (should be 0)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Verify a path is recorded for every download (file existence is checked on disk below)
SELECT COUNT(*) FROM images
WHERE downloaded = 1
  AND local_path IS NOT NULL
  AND local_path != '';
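The SQL above only confirms a path was recorded; a sketch that checks the files actually exist on disk:

import os
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
rows = conn.execute(
    "SELECT local_path FROM images WHERE downloaded = 1 AND local_path IS NOT NULL"
).fetchall()
conn.close()

missing = [path for (path,) in rows if not os.path.isfile(path)]
print(f"{len(rows)} recorded, {len(missing)} missing on disk")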

🔍 Monitoring Process Impact

The monitoring process (auctiora) will automatically:

  • Stop downloading images (network I/O eliminated)
  • Only run object detection on local_path files
  • Query: WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '') (see the sketch below)
  • Update only the labels and processed_at columns

No changes needed in the monitoring process! It's already updated to work with scraper-downloaded images.
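For reference, a sketch of the selection the monitor runs, based on the query condition above (the exact column list is illustrative):

SELECT id, lot_id, local_path
FROM images
WHERE local_path IS NOT NULL
  AND (labels IS NULL OR labels = '');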

🐛 Troubleshooting

Problem: "No space left on device"

# Check disk usage
df -h /mnt/okcomputer/output/images

# Estimate needed space: ~100KB per image
# 16,807 images × 100KB = ~1.6GB

Problem: "Permission denied" when writing images

# Fix permissions
chmod 755 /mnt/okcomputer/output/images
chown -R scraper_user:scraper_group /mnt/okcomputer/output/images

Problem: Images downloading but not recorded in DB

# Add logging
import logging
logging.basicConfig(level=logging.INFO)

def download_image(...):
    logging.info(f"Downloading {image_url} to {local_path}")
    # ... download code ...
    logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes")
    return local_path

Problem: Duplicate images after refactor

-- Find duplicates
SELECT lot_id, url, COUNT(*)
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Clean up duplicates (keep newest)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);

📈 Performance Comparison

| Metric | Before (Monitor Downloads) | After (Scraper Downloads) |
|---|---|---|
| Image records | 57,376,293 | ~16,807 |
| Duplicates | 57,359,486 (99.97%!) | 0 |
| Network I/O | Monitor process | Scraper process |
| Disk usage | 0 (URLs only) | ~1.6GB (actual files) |
| Processing speed | 500ms/image (download + detect) | 100ms/image (detect only) |
| Error handling | Complex (download failures) | Simple (files exist) |

🎓 Code Examples by Language

Python (Most Likely)

See Step 2 above for the complete implementation.

📚 References

  • Current Scraper Architecture: wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md
  • Database Schema: wiki/DATABASE_ARCHITECTURE.md
  • Monitor Changes: See commit history for ImageProcessingService.java, DatabaseService.java

Success Criteria

You'll know the refactor is successful when:

  1. Database query SELECT COUNT(*) FROM images returns ~16,807 (not 57M+)
  2. All images have downloaded = 1 and local_path IS NOT NULL
  3. No duplicate records: SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1 returns 0 rows
  4. Monitor logs show "Found N images needing detection" with reasonable numbers
  5. Files exist at paths in local_path column
  6. Monitor process speed increases (100ms vs 500ms per image)

Questions? Check the troubleshooting section or inspect the monitor's updated code in:

  • src/main/java/auctiora/ImageProcessingService.java
  • src/main/java/auctiora/DatabaseService.java:695-719