# Scraper Refactor Guide - Image Download Integration
## 🎯 Objective
Refactor the Troostwijk scraper to **download and store images locally**, eliminating the 57M+ duplicate image problem in the monitoring process.
## 📋 Current vs. New Architecture
### **Before** (Current Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Stores URLs  │         │ images table │         │ Downloads +  │
│ downloaded=0 │         │              │         │ Detection    │
└──────────────┘         └──────────────┘         └──────────────┘
                           57M+ duplicates!
```
### **After** (New Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Downloads +  │         │ images table │         │  Detection   │
│ Stores path  │         │ local_path ✓ │         │    Only      │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                            No duplicates!
```
## 🗄️ Database Schema Changes
### Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,      -- Currently NULL
    downloaded INTEGER    -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);
```
### Required Schema (Already Compatible!)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- ✅ SET by scraper after download
    downloaded INTEGER,     -- ✅ SET to 1 by scraper after download
    labels TEXT,            -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,   -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```
**Good News**: The scraper's schema already has `local_path` and `downloaded` columns! You just need to populate them.
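To confirm this before deploying, a quick check of the live schema (a minimal sketch, assuming the `cache.db` path used throughout this guide):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
columns = {row[1] for row in conn.execute("PRAGMA table_info(images)")}
conn.close()

# The scraper needs these two; labels/processed_at belong to the monitor.
assert {'local_path', 'downloaded'} <= columns, f"missing from images: {columns}"
```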
## 🔧 Implementation Steps
### **Step 1: Enable Image Downloading in Configuration**
**File**: Your scraper's config file (e.g., `config.py` or environment variables)
```python
# Current setting
DOWNLOAD_IMAGES = False # ❌ Change this!
# New setting
DOWNLOAD_IMAGES = True # ✅ Enable downloads
# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images" # Or your preferred path
```
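If the scraper reads configuration from environment variables instead of a `config.py`, a hedged equivalent (variable names are illustrative, not taken from the scraper's codebase):
```python
import os

# Hypothetical env-based config; adjust names to your scraper's conventions.
DOWNLOAD_IMAGES = os.getenv('DOWNLOAD_IMAGES', 'false').lower() in ('1', 'true', 'yes')
IMAGES_DIR = os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images')
```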
### **Step 2: Update Image Download Logic**
Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change:
**Current Code** (Conceptual):
```python
# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))
```
**New Code** (Required):
```python
import os
import time
from pathlib import Path
from urllib.parse import urlparse

import requests


def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)
        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still insert a record, but mark it as not downloaded;
            # DO NOTHING keeps any existing row intact on re-scrape.
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))


def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to an organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to the saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL path (strip any query string)
    ext = Path(urlparse(image_url).path).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
```
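Before wiring this into the full pipeline, it's worth smoke-testing `download_image` on a single lot (the URL below is a placeholder; substitute a real image URL from one scraped lot):
```python
import os

# Hypothetical smoke test with a placeholder URL and lot id.
path = download_image("https://example.com/photos/sample.jpg", "A1-28505-5", 1)
print(path, os.path.getsize(path), "bytes")
```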
### **Step 3: Add Unique Constraint to Prevent Duplicates**
**Migration SQL**:
```sql
-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```
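One caveat: `CREATE UNIQUE INDEX` fails if the table already holds duplicate `(lot_id, url)` pairs, which an existing database will. A minimal migration sketch (assumes the `cache.db` path used throughout this guide; keeps the newest row per pair, matching the cleanup query in Troubleshooting):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')

# Deduplicate first, keeping the newest row per (lot_id, url)...
conn.execute("""
    DELETE FROM images
    WHERE id NOT IN (SELECT MAX(id) FROM images GROUP BY lot_id, url)
""")

# ...then the unique index can be created safely.
conn.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
    ON images(lot_id, url)
""")
conn.commit()
conn.close()
```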
Add this to your scraper's schema initialization:
```python
import sqlite3

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()
```
### **Step 4: Handle Image Download Failures Gracefully**
```python
import time

import requests

def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads an image with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}")
                return None  # Return None on failure
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s...
```
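In `scrape_lot`, the try/except from Step 2 can then be replaced by the retrying variant (a sketch; the `WHERE` clause keeps a previously successful download from being overwritten by a later failure):
```python
for idx, img_url in enumerate(lot_data['images'], start=1):
    local_path = download_with_retry(img_url, lot_data['lot_id'], idx)
    downloaded = 1 if local_path else 0
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = excluded.downloaded
        WHERE excluded.downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path, downloaded))
    time.sleep(0.5)  # keep the rate limit between downloads
```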
### **Step 5: Update Database Queries**
Make sure your `INSERT` uses `INSERT ... ON CONFLICT` so re-scraping updates existing rows instead of duplicating them. Note that SQLite rejects `ON CONFLICT(lot_id, url)` unless a matching unique index exists, so Step 3 must be applied first:
```python
# Good: handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: creates duplicates on re-scrape (or raises an IntegrityError
# once the unique index from Step 3 is in place)
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```
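If you never need to refresh `local_path` on re-scrape, `INSERT OR IGNORE` is a simpler alternative; like the upsert, it relies on the unique index from Step 3:
```python
# Simpler alternative: skip rows that already exist, never update them.
db.execute("""
    INSERT OR IGNORE INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```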
## 📊 Expected Outcomes
### Before Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)
SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)
```
### After Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)
SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)
SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)
```
## 🚀 Deployment Checklist
### Pre-Deployment
- [ ] Back up current database: `cp cache.db cache.db.backup`
- [ ] Verify disk space: At least 10GB free for images
- [ ] Test download function on 5 sample lots
- [ ] Verify `IMAGES_DIR` path exists and is writable
### Deployment
- [ ] Update configuration: `DOWNLOAD_IMAGES = True`
- [ ] Run schema migration to add unique index
- [ ] Deploy updated scraper code
- [ ] Monitor first 100 lots for errors
### Post-Deployment Verification
```sql
-- Check download success rate
SELECT
    COUNT(*) AS total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) AS downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) AS failed,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) AS success_rate
FROM images;

-- Check for duplicates (should return 0 rows)
SELECT lot_id, url, COUNT(*) AS dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Verify downloaded rows have file paths recorded
SELECT COUNT(*) FROM images
WHERE downloaded = 1
  AND local_path IS NOT NULL
  AND local_path != '';
```
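To cross-check the database against the filesystem, a small script (a sketch; assumes the default paths used above):
```python
import os
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
rows = conn.execute("""
    SELECT local_path FROM images
    WHERE downloaded = 1 AND local_path IS NOT NULL
""").fetchall()
conn.close()

# Count recorded files that are actually present on disk.
missing = [p for (p,) in rows if not os.path.isfile(p)]
print(f"{len(rows) - len(missing)}/{len(rows)} files present, {len(missing)} missing")
```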
## 🔍 Monitoring Process Impact
The monitoring process (auctiora) will automatically:
- ✅ Stop downloading images (network I/O eliminated)
- ✅ Only run object detection on `local_path` files
- ✅ Query: `WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '')`
- ✅ Update only the `labels` and `processed_at` columns
**No changes needed in monitoring process!** It's already updated to work with scraper-downloaded images.
## 🐛 Troubleshooting
### Problem: "No space left on device"
```bash
# Check disk usage
df -h /mnt/okcomputer/output/images
# Estimate needed space: ~100KB per image
# 16,807 images × 100KB = ~1.6GB
```
### Problem: "Permission denied" when writing images
```bash
# Fix permissions
chmod 755 /mnt/okcomputer/output/images
chown -R scraper_user:scraper_group /mnt/okcomputer/output/images
```
### Problem: Images downloading but not recorded in DB
```python
# Add logging
import logging

logging.basicConfig(level=logging.INFO)

def download_image(image_url, lot_id, index):
    # ... build local_path as in Step 2 ...
    logging.info(f"Downloading {image_url} to {local_path}")
    # ... download code ...
    logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes")
    return local_path
```
### Problem: Duplicate images after refactor
```sql
-- Find duplicates
SELECT lot_id, url, COUNT(*)
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Clean up duplicates (keep the newest row per pair)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);
```
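Note that deleting millions of rows does not shrink `cache.db` on disk; SQLite only reclaims the space on `VACUUM`. A one-off cleanup, run while the scraper and monitor are idle (`VACUUM` rewrites the whole file and locks the database):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
conn.execute("VACUUM")  # rebuild the file, reclaiming pages freed by DELETE
conn.close()
```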
## 📈 Performance Comparison
| Metric | Before (Monitor Downloads) | After (Scraper Downloads) |
|----------------------|---------------------------------|---------------------------|
| **Image records** | 57,376,293 | ~16,807 |
| **Duplicates** | 57,359,486 (99.97%!) | 0 |
| **Network I/O** | Monitor process | Scraper process |
| **Disk usage** | 0 (URLs only) | ~1.6GB (actual files) |
| **Processing speed** | 500ms/image (download + detect) | 100ms/image (detect only) |
| **Error handling** | Complex (download failures) | Simple (files exist) |
## 🎓 Code Examples by Language
### Python (Most Likely)
See **Step 2** above for complete implementation.
## 📚 References
- **Current Scraper Architecture**: `wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md`
- **Database Schema**: `wiki/DATABASE_ARCHITECTURE.md`
- **Monitor Changes**: See commit history for `ImageProcessingService.java`, `DatabaseService.java`
## ✅ Success Criteria
You'll know the refactor is successful when (see the verification sketch after this list):
1. ✅ Database query `SELECT COUNT(*) FROM images` returns ~16,807 (not 57M+)
2. ✅ All images have `downloaded = 1` and `local_path IS NOT NULL`
3. ✅ No duplicate records: `SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1` returns 0 rows
4. ✅ Monitor logs show "Found N images needing detection" with reasonable numbers
5. ✅ Files exist at paths in `local_path` column
6. ✅ Monitor process speed increases (100ms vs 500ms per image)
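The first three criteria can be checked in one go (a sketch against the default `cache.db` path; the ~16,807 figure is this guide's estimate, not an exact target):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
total = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
incomplete = conn.execute(
    "SELECT COUNT(*) FROM images WHERE downloaded = 0 OR local_path IS NULL"
).fetchone()[0]
dup_pairs = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT 1 FROM images GROUP BY lot_id, url HAVING COUNT(*) > 1
    )
""").fetchone()[0]
conn.close()

print(f"total={total} (expect ~16,807), incomplete={incomplete}, duplicate pairs={dup_pairs}")
```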
---
**Questions?** Check the troubleshooting section or inspect the monitor's updated code in:
- `src/main/java/auctiora/ImageProcessingService.java`
- `src/main/java/auctiora/DatabaseService.java:695-719`