Fix mock tests
docs/SCRAPER_REFACTOR_GUIDE.md (new file, 399 lines)
# Scraper Refactor Guide - Image Download Integration

## 🎯 Objective

Refactor the Troostwijk scraper to **download and store images locally**, eliminating the 57M+ duplicate image records currently created by the monitoring process.

## 📋 Current vs. New Architecture

### **Before** (Current Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│  Stores URLs │         │ images table │         │  Downloads + │
│ downloaded=0 │         │              │         │  Detection   │
└──────────────┘         └──────────────┘         └──────────────┘
                                │
                                ▼
                        57M+ duplicates!
```

### **After** (New Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│  Downloads + │         │ images table │         │  Detection   │
│  Stores path │         │ local_path ✓ │         │  Only        │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                                │
                                ▼
                         No duplicates!
```

## 🗄️ Database Schema Changes

### Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,      -- Currently NULL
    downloaded INTEGER    -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);
```

### Required Schema (Already Compatible!)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- ✅ SET by scraper after download
    downloaded INTEGER,     -- ✅ SET to 1 by scraper after download
    labels TEXT,            -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,   -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```

**Good News**: The scraper's schema already has `local_path` and `downloaded` columns! You just need to populate them.
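
If an older database predates the monitor's two columns, they can be added in place rather than recreating the table; a sketch (SQLite's `ALTER TABLE ... ADD COLUMN` leaves existing rows NULL, which is what the monitor expects):

```sql
ALTER TABLE images ADD COLUMN labels TEXT;
ALTER TABLE images ADD COLUMN processed_at INTEGER;
```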

## 🔧 Implementation Steps

### **Step 1: Enable Image Downloading in Configuration**

**File**: Your scraper's config file (e.g., `config.py` or environment variables)

```python
# Current setting
DOWNLOAD_IMAGES = False  # ❌ Change this!

# New setting
DOWNLOAD_IMAGES = True   # ✅ Enable downloads

# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images"  # Or your preferred path
```
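
If the scraper is configured through environment variables instead of `config.py`, a minimal sketch of the same settings (the variable names follow this guide's convention and are assumptions, not an existing API):

```python
import os

# Read the toggles from the environment, falling back to the defaults above
DOWNLOAD_IMAGES = os.getenv("DOWNLOAD_IMAGES", "true").strip().lower() in ("1", "true", "yes")
IMAGES_DIR = os.getenv("IMAGES_DIR", "/mnt/okcomputer/output/images")
```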

### **Step 2: Update Image Download Logic**

Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change:

**Current Code** (Conceptual):
```python
# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))
```

**New Code** (Required):
```python
import os
import time
from pathlib import Path
from urllib.parse import urlparse

import requests

def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            # (ON CONFLICT relies on the unique index from Step 3)
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)

        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still insert a record, but mark it as not downloaded;
            # DO NOTHING avoids a constraint error if the row already exists
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))

def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to an organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to the saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL path (ignoring any query string)
    ext = Path(urlparse(image_url).path).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
```

### **Step 3: Add Unique Constraint to Prevent Duplicates**

**Migration SQL**:
```sql
-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```

**Note**: Creating this index fails if the table already contains duplicate `(lot_id, url)` pairs, so run the duplicate cleanup from the Troubleshooting section first.

Add this to your scraper's schema initialization:

```python
import sqlite3

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()
```
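
To confirm the migration took effect, one quick check is SQLite's `PRAGMA index_list` (a sketch; assumes the same `cache.db` path as above):

```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
# PRAGMA index_list returns one row per index: (seq, name, unique, origin, partial)
index_names = [row[1] for row in conn.execute("PRAGMA index_list('images')")]
conn.close()
assert 'idx_images_unique' in index_names, "unique index is missing"
```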

### **Step 4: Handle Image Download Failures Gracefully**

```python
def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads an image with retry logic."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}: {e}")
                return None  # Return None on failure
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}")
            time.sleep(2 ** attempt)  # Exponential backoff
```
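
Wired into `scrape_lot` from Step 2, the retry wrapper replaces the direct `download_image` call; a sketch of the failure-aware call site (loop variables as in Step 2):

```python
local_path = download_with_retry(img_url, lot_data['lot_id'], idx)
if local_path is None:
    # Record the URL but leave it marked as not downloaded
    db.execute("""
        INSERT INTO images (lot_id, url, downloaded)
        VALUES (?, ?, 0)
        ON CONFLICT(lot_id, url) DO NOTHING
    """, (lot_data['lot_id'], img_url))
else:
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path))
```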

### **Step 5: Update Database Queries**

Make sure your INSERT uses `INSERT ... ON CONFLICT` to handle re-scraping:

```python
# Good: Handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: Creates duplicates on re-scrape
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```
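
Since a lot can carry dozens of images, committing once per lot rather than once per row keeps SQLite overhead down. A sketch, assuming a plain `sqlite3` connection named `conn` and a `rows` list of `(lot_id, url, local_path)` tuples collected during download:

```python
with conn:  # one transaction per lot: commits on success, rolls back on error
    conn.executemany("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, rows)
```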

## 📊 Expected Outcomes

### Before Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)
```

### After Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)

-- SQLite's COUNT(DISTINCT) takes a single expression, so use a subquery
SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)
```

## 🚀 Deployment Checklist

### Pre-Deployment
- [ ] Back up current database: `cp cache.db cache.db.backup`
- [ ] Verify disk space: at least 10GB free for images (estimated usage is ~1.6GB; see Performance Comparison)
- [ ] Test download function on 5 sample lots
- [ ] Verify `IMAGES_DIR` path exists and is writable

### Deployment
- [ ] Update configuration: `DOWNLOAD_IMAGES = True`
- [ ] Run schema migration to add unique index
- [ ] Deploy updated scraper code
- [ ] Monitor first 100 lots for errors

### Post-Deployment Verification
```sql
-- Check download success rate
SELECT
    COUNT(*) as total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) as downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) as failed,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
FROM images;

-- Check for duplicates (should return 0 rows)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Verify paths are recorded (spot-check the files on disk separately)
SELECT COUNT(*) FROM images
WHERE downloaded = 1
  AND local_path IS NOT NULL
  AND local_path != '';
```
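
To cross-check the database against the file system, a small script can compare the recorded paths with what is actually on disk (a sketch; assumes the `cache.db` path used throughout this guide):

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
paths = [row[0] for row in conn.execute(
    "SELECT local_path FROM images WHERE downloaded = 1 AND local_path IS NOT NULL")]
conn.close()

# Any path recorded in the DB but absent on disk indicates a failed or moved file
missing = [p for p in paths if not Path(p).is_file()]
print(f"{len(paths)} paths recorded, {len(missing)} missing on disk")
```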

## 🔍 Monitoring Process Impact

The monitoring process (auctiora) will automatically:
- ✅ Stop downloading images (network I/O eliminated)
- ✅ Only run object detection on `local_path` files
- ✅ Query: `WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '')`
- ✅ Update only the `labels` and `processed_at` columns

**No changes are needed in the monitoring process!** It has already been updated to work with scraper-downloaded images.
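
For reference, the monitor's read/write cycle amounts to roughly the following (a sketch inferred from the bullets above, not the monitor's literal Java code):

```sql
-- Select work: downloaded images that have not been labeled yet
SELECT id, local_path
FROM images
WHERE local_path IS NOT NULL
  AND (labels IS NULL OR labels = '');

-- After detection: record labels and the processing timestamp
UPDATE images
SET labels = ?, processed_at = CAST(strftime('%s', 'now') AS INTEGER)
WHERE id = ?;
```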

## 🐛 Troubleshooting

### Problem: "No space left on device"
```bash
# Check disk usage
df -h /mnt/okcomputer/output/images

# Estimate needed space: ~100KB per image
# 16,807 images × 100KB = ~1.6GB
```

### Problem: "Permission denied" when writing images
```bash
# Fix permissions
chmod 755 /mnt/okcomputer/output/images
chown -R scraper_user:scraper_group /mnt/okcomputer/output/images
```

### Problem: Images downloading but not recorded in DB
```python
# Add logging
import logging
logging.basicConfig(level=logging.INFO)

def download_image(image_url, lot_id, index):
    logging.info(f"Downloading {image_url}")
    # ... existing download code from Step 2 builds local_path ...
    logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes")
    return local_path
```

### Problem: Duplicate images after refactor
```sql
-- Find duplicates
SELECT lot_id, url, COUNT(*)
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Clean up duplicates (keep newest)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);
```

## 📈 Performance Comparison

| Metric               | Before (Monitor Downloads)      | After (Scraper Downloads) |
|----------------------|---------------------------------|---------------------------|
| **Image records**    | 57,376,293                      | ~16,807                   |
| **Duplicates**       | 57,359,486 (99.97%!)            | 0                         |
| **Network I/O**      | Monitor process                 | Scraper process           |
| **Disk usage**       | 0 (URLs only)                   | ~1.6GB (actual files)     |
| **Processing speed** | 500ms/image (download + detect) | 100ms/image (detect only) |
| **Error handling**   | Complex (download failures)     | Simple (files exist)      |

## 🎓 Code Examples by Language

### Python (Most Likely)
See **Step 2** above for the complete implementation.

## 📚 References

- **Current Scraper Architecture**: `wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md`
- **Database Schema**: `wiki/DATABASE_ARCHITECTURE.md`
- **Monitor Changes**: See commit history for `ImageProcessingService.java`, `DatabaseService.java`

## ✅ Success Criteria

You'll know the refactor is successful when:

1. ✅ Database query `SELECT COUNT(*) FROM images` returns ~16,807 (not 57M+)
2. ✅ All images have `downloaded = 1` and `local_path IS NOT NULL`
3. ✅ No duplicate records: `SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1` returns 0 rows
4. ✅ Monitor logs show "Found N images needing detection" with reasonable numbers
5. ✅ Files exist at the paths in the `local_path` column
6. ✅ Monitor processing speed increases (100ms vs 500ms per image)

---

**Questions?** Check the troubleshooting section or inspect the monitor's updated code in:
- `src/main/java/auctiora/ImageProcessingService.java`
- `src/main/java/auctiora/DatabaseService.java:695-719`