# Scraper Refactor Guide - Image Download Integration

## 🎯 Objective

Refactor the Troostwijk scraper to download and store images locally, eliminating the 57M+ duplicate image problem in the monitoring process.

## 📋 Current vs. New Architecture

### Before (Current Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Stores URLs  │         │ images table │         │ Downloads +  │
│ downloaded=0 │         │              │         │ Detection    │
└──────────────┘         └──────────────┘         └──────────────┘
                                                          │
                                                          ▼
                                                  57M+ duplicates!
```
### After (New Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Downloads +  │         │ images table │         │  Detection   │
│ Stores path  │         │ local_path ✓ │         │    Only      │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                                 │
                                 ▼
                          No duplicates!
```
## 🗄️ Database Schema Changes

### Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)

```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- Currently NULL
    downloaded INTEGER      -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);
```
### Required Schema (Already Compatible!)

```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- ✅ SET by scraper after download
    downloaded INTEGER,     -- ✅ SET to 1 by scraper after download
    labels TEXT,            -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,   -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```
**Good news:** the scraper's schema already has the `local_path` and `downloaded` columns. You just need to populate them.
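To confirm this against the live database before deploying, a quick check along these lines works (a minimal sketch, assuming the `cache.db` path used elsewhere in this guide):

```python
import sqlite3

def verify_images_schema(db_path="/mnt/okcomputer/output/cache.db"):
    """Fail fast if the images table lacks the columns this refactor populates."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA table_info yields one row per column: (cid, name, type, ...)
        columns = {row[1] for row in conn.execute("PRAGMA table_info(images)")}
        missing = {"local_path", "downloaded"} - columns
        if missing:
            raise RuntimeError(f"images table is missing columns: {sorted(missing)}")
    finally:
        conn.close()
```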
## 🔧 Implementation Steps

### Step 1: Enable Image Downloading in Configuration

**File:** your scraper's config file (e.g., `config.py` or environment variables)

```python
# Current setting
DOWNLOAD_IMAGES = False  # ❌ Change this!

# New setting
DOWNLOAD_IMAGES = True   # ✅ Enable downloads

# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images"  # Or your preferred path
```
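If the settings live in environment variables instead of `config.py`, an equivalent would look roughly like this (a sketch; the variable names are assumptions, chosen to match the `os.getenv('IMAGES_DIR', ...)` fallback used in Step 2):

```python
import os

# Hypothetical environment-driven equivalents of the settings above
DOWNLOAD_IMAGES = os.getenv("DOWNLOAD_IMAGES", "false").lower() == "true"
IMAGES_DIR = os.getenv("IMAGES_DIR", "/mnt/okcomputer/output/images")
```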
### Step 2: Update Image Download Logic

Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change.

**Current code (conceptual):**

```python
# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))
```
**New code (required):**

```python
import os
import time
from pathlib import Path

import requests


def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)

        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still insert a record, but mark it as not downloaded.
            # DO NOTHING keeps the unique index from raising on re-scrapes.
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))


def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to an organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to the saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL (falls back to .jpg)
    ext = Path(image_url).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
```
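One caveat: the code above infers the extension from the URL alone. If the CDN serves extensionless URLs, a Content-Type fallback like the following could be used, which means choosing the filename after the response arrives (hypothetical helper, not part of the original scraper):

```python
from pathlib import Path

# Hypothetical fallback map; extend for whatever formats the CDN serves
_CONTENT_TYPE_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/webp": ".webp",
    "image/gif": ".gif",
}

def extension_for(image_url, response):
    """Prefer the URL suffix; fall back to the response's Content-Type header."""
    ext = Path(image_url).suffix
    if ext:
        return ext
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
    return _CONTENT_TYPE_EXT.get(content_type.lower(), ".jpg")
```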
### Step 3: Add Unique Constraint to Prevent Duplicates

**Migration SQL:**

```sql
-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```
Add this to your scraper's schema initialization:
```python
import sqlite3

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()
```
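One migration caveat: if the live table still contains the duplicate rows, `CREATE UNIQUE INDEX` will fail with a uniqueness error. On an existing database, run the cleanup from the Troubleshooting section first:

```sql
-- Remove duplicates (keep the newest row per lot_id/url pair) BEFORE adding the index
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);

CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```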
### Step 4: Handle Image Download Failures Gracefully

```python
def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads an image with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}: {e}")
                return None  # Give up and signal failure to the caller
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}")
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s...
```
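In `scrape_lot`, the retry helper then slots in by handling the `None` return (a sketch, matching the Step 2 code):

```python
for idx, img_url in enumerate(lot_data['images'], start=1):
    local_path = download_with_retry(img_url, lot_data['lot_id'], idx)
    if local_path is None:
        # Keep a record so a later run can retry the download
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
            ON CONFLICT(lot_id, url) DO NOTHING
        """, (lot_data['lot_id'], img_url))
        continue
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path))
    time.sleep(0.5)  # Rate limiting, as in Step 2
```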
### Step 5: Update Database Queries

Make sure your inserts use `INSERT ... ON CONFLICT` to handle re-scraping. (Note that SQLite only accepts the `ON CONFLICT(lot_id, url)` target once the unique index from Step 3 exists.)

```python
# Good: handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: creates duplicates on re-scrape (or fails outright once the unique index exists)
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```
## 📊 Expected Outcomes

### Before Refactor

```sql
SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)
```
### After Refactor

```sql
SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)

-- SQLite's COUNT(DISTINCT ...) takes a single expression,
-- so count distinct (lot_id, url) pairs via a subquery:
SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)
```
## 🚀 Deployment Checklist

### Pre-Deployment

- Back up the current database: `cp cache.db cache.db.backup`
- Verify disk space: at least 10GB free for images
- Test the download function on 5 sample lots
- Verify the `IMAGES_DIR` path exists and is writable

### Deployment

- Update configuration: `DOWNLOAD_IMAGES = True`
- Run the schema migration to add the unique index
- Deploy the updated scraper code
- Monitor the first 100 lots for errors
### Post-Deployment Verification

```sql
-- Check download success rate
SELECT
    COUNT(*) AS total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) AS downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) AS failed,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) AS success_rate
FROM images;

-- Check for duplicates (should return 0 rows)
SELECT lot_id, url, COUNT(*) AS dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Verify file paths are recorded
SELECT COUNT(*) FROM images
WHERE downloaded = 1
  AND local_path IS NOT NULL
  AND local_path != '';
```
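The last query only proves the paths are recorded, not that the files exist. A quick spot check against the file system (a minimal sketch, assuming the `cache.db` path used above):

```python
import os
import sqlite3

def spot_check_files(db_path="/mnt/okcomputer/output/cache.db", sample=100):
    """Samples downloaded rows and reports any local_path missing on disk."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("""
        SELECT local_path FROM images
        WHERE downloaded = 1 AND local_path IS NOT NULL
        ORDER BY RANDOM() LIMIT ?
    """, (sample,)).fetchall()
    conn.close()
    missing = [path for (path,) in rows if not os.path.isfile(path)]
    print(f"{len(rows) - len(missing)}/{len(rows)} sampled files exist on disk")
    for path in missing:
        print("MISSING:", path)
```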
## 🔍 Monitoring Process Impact

The monitoring process (auctiora) will automatically:

- ✅ Stop downloading images (network I/O eliminated)
- ✅ Only run object detection on `local_path` files
- ✅ Query: `WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '')`
- ✅ Update only the `labels` and `processed_at` columns

No changes are needed in the monitoring process; it has already been updated to work with scraper-downloaded images.
## 🐛 Troubleshooting

### Problem: "No space left on device"

```bash
# Check disk usage
df -h /mnt/okcomputer/output/images

# Estimate needed space: ~100KB per image
# 16,807 images × 100KB ≈ 1.6GB
```
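The scraper could also guard against this at startup (a sketch; the 2GB threshold is an assumption based on the ~1.6GB estimate above):

```python
import shutil

def check_disk_space(path="/mnt/okcomputer/output/images", min_free_gb=2.0):
    """Abort early if the images volume is running out of space."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    if free_gb < min_free_gb:
        raise RuntimeError(f"Only {free_gb:.1f}GB free on {path}; need {min_free_gb}GB")
```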
Problem: "Permission denied" when writing images
# Fix permissions
chmod 755 /mnt/okcomputer/output/images
chown -R scraper_user:scraper_group /mnt/okcomputer/output/images
### Problem: Images downloading but not recorded in DB

```python
# Add logging
import logging
logging.basicConfig(level=logging.INFO)

def download_image(...):
    logging.info(f"Downloading {image_url} to {local_path}")
    # ... download code ...
    logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes")
    return local_path
```
### Problem: Duplicate images after refactor

```sql
-- Find duplicates
SELECT lot_id, url, COUNT(*)
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Clean up duplicates (keep newest)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);
```
## 📈 Performance Comparison
| Metric | Before (Monitor Downloads) | After (Scraper Downloads) |
|---|---|---|
| Image records | 57,376,293 | ~16,807 |
| Duplicates | 57,359,486 (99.97%!) | 0 |
| Network I/O | Monitor process | Scraper process |
| Disk usage | 0 (URLs only) | ~1.6GB (actual files) |
| Processing speed | 500ms/image (download + detect) | 100ms/image (detect only) |
| Error handling | Complex (download failures) | Simple (files exist) |
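For scale: at the table's own estimates, detection-only processing saves roughly 400ms per image, so a full pass over the ~16,807 images drops by about 16,807 × 0.4s ≈ 112 minutes.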
## 🎓 Code Examples by Language

### Python (Most Likely)

See Step 2 above for the complete implementation.
## 📚 References

- Current Scraper Architecture: `wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md`
- Database Schema: `wiki/DATABASE_ARCHITECTURE.md`
- Monitor Changes: see commit history for `ImageProcessingService.java`, `DatabaseService.java`
## ✅ Success Criteria

You'll know the refactor is successful when:

- ✅ The query `SELECT COUNT(*) FROM images` returns ~16,807 (not 57M+)
- ✅ All images have `downloaded = 1` and `local_path IS NOT NULL`
- ✅ No duplicate records: `SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1` returns 0 rows
- ✅ Monitor logs show "Found N images needing detection" with reasonable numbers
- ✅ Files exist at the paths in the `local_path` column
- ✅ Monitor processing speed increases (100ms vs 500ms per image)

Questions? Check the troubleshooting section or inspect the monitor's updated code in:

- `src/main/java/auctiora/ImageProcessingService.java`
- `src/main/java/auctiora/DatabaseService.java:695-719`