# Scraper Refactor Guide - Image Download Integration
## 🎯 Objective
Refactor the Troostwijk scraper to **download and store images locally**, eliminating the 57M+ duplicate image problem in the monitoring process.
## 📋 Current vs. New Architecture
### **Before** (Current Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Stores URLs  │         │ images table │         │ Downloads +  │
│ downloaded=0 │         │              │         │ Detection    │
└──────────────┘         └──────────────┘         └──────────────┘
                           57M+ duplicates!
```
### **After** (New Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Downloads +  │         │ images table │         │  Detection   │
│ Stores path  │         │ local_path ✓ │         │    Only      │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                            No duplicates!
```
## 🗄️ Database Schema Changes
### Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,      -- Currently NULL
    downloaded INTEGER    -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);
```
### Required Schema (Already Compatible!)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- ✅ SET by scraper after download
    downloaded INTEGER,     -- ✅ SET to 1 by scraper after download
    labels TEXT,            -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,   -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```
**Good News**: The scraper's schema already has `local_path` and `downloaded` columns! You just need to populate them.
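To confirm this before deploying, a quick check of the live schema (a minimal sketch, assuming the `cache.db` path used throughout this guide):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
columns = {row[1] for row in conn.execute("PRAGMA table_info(images)")}
conn.close()

# The scraper needs these two; labels/processed_at belong to the monitor.
assert {'local_path', 'downloaded'} <= columns, f"missing from images: {columns}"
```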
## 🔧 Implementation Steps
### **Step 1: Enable Image Downloading in Configuration**
**File**: Your scraper's config file (e.g., `config.py` or environment variables)
```python
# Current setting
DOWNLOAD_IMAGES = False # ❌ Change this!
# New setting
DOWNLOAD_IMAGES = True # ✅ Enable downloads
# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images" # Or your preferred path
```
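If the scraper reads configuration from environment variables instead of a `config.py`, a hedged equivalent (variable names are illustrative, not taken from the scraper's codebase):
```python
import os

# Hypothetical env-based config; adjust names to your scraper's conventions.
DOWNLOAD_IMAGES = os.getenv('DOWNLOAD_IMAGES', 'false').lower() in ('1', 'true', 'yes')
IMAGES_DIR = os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images')
```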
### **Step 2: Update Image Download Logic**
Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change:
**Current Code** (Conceptual):
```python
# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))
```
**New Code** (Required):
```python
import os
import time
from pathlib import Path
from urllib.parse import urlparse

import requests


def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)
        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still insert a record, but mark it as not downloaded;
            # DO NOTHING keeps any existing row intact on re-scrape.
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))


def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to an organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to the saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL path (strip any query string)
    ext = Path(urlparse(image_url).path).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
```
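Before wiring this into the full pipeline, it's worth smoke-testing `download_image` on a single lot (the URL below is a placeholder; substitute a real image URL from one scraped lot):
```python
import os

# Hypothetical smoke test with a placeholder URL and lot id.
path = download_image("https://example.com/photos/sample.jpg", "A1-28505-5", 1)
print(path, os.path.getsize(path), "bytes")
```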
### **Step 3: Add Unique Constraint to Prevent Duplicates**
**Migration SQL**:
```sql
-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```
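One caveat: `CREATE UNIQUE INDEX` fails if the table already holds duplicate `(lot_id, url)` pairs, which an existing database will. A minimal migration sketch (assumes the `cache.db` path used throughout this guide; keeps the newest row per pair, matching the cleanup query in Troubleshooting):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')

# Deduplicate first, keeping the newest row per (lot_id, url)...
conn.execute("""
    DELETE FROM images
    WHERE id NOT IN (SELECT MAX(id) FROM images GROUP BY lot_id, url)
""")

# ...then the unique index can be created safely.
conn.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
    ON images(lot_id, url)
""")
conn.commit()
conn.close()
```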
Add this to your scraper's schema initialization:
```python
import sqlite3

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()
```
### **Step 4: Handle Image Download Failures Gracefully**
```python
import time

import requests

def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads an image with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}")
                return None  # Return None on failure
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s...
```
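In `scrape_lot`, the try/except from Step 2 can then be replaced by the retrying variant (a sketch; the `WHERE` clause keeps a previously successful download from being overwritten by a later failure):
```python
for idx, img_url in enumerate(lot_data['images'], start=1):
    local_path = download_with_retry(img_url, lot_data['lot_id'], idx)
    downloaded = 1 if local_path else 0
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = excluded.downloaded
        WHERE excluded.downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path, downloaded))
    time.sleep(0.5)  # keep the rate limit between downloads
```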
### **Step 5: Update Database Queries**
Make sure your `INSERT` uses `INSERT ... ON CONFLICT` so re-scraping updates existing rows instead of duplicating them. Note that SQLite rejects `ON CONFLICT(lot_id, url)` unless a matching unique index exists, so Step 3 must be applied first:
```python
# Good: handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: creates duplicates on re-scrape (or raises an IntegrityError
# once the unique index from Step 3 is in place)
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```
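If you never need to refresh `local_path` on re-scrape, `INSERT OR IGNORE` is a simpler alternative; like the upsert, it relies on the unique index from Step 3:
```python
# Simpler alternative: skip rows that already exist, never update them.
db.execute("""
    INSERT OR IGNORE INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```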
## 📊 Expected Outcomes
### Before Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)
SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)
```
### After Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)
SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)
SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)
```
## 🚀 Deployment Checklist
### Pre-Deployment
- [ ] Back up current database: `cp cache.db cache.db.backup`
- [ ] Verify disk space: At least 10GB free for images
- [ ] Test download function on 5 sample lots
- [ ] Verify `IMAGES_DIR` path exists and is writable
### Deployment
- [ ] Update configuration: `DOWNLOAD_IMAGES = True`
- [ ] Run schema migration to add unique index
- [ ] Deploy updated scraper code
- [ ] Monitor first 100 lots for errors
### Post-Deployment Verification
```sql
-- Check download success rate
SELECT
    COUNT(*) AS total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) AS downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) AS failed,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) AS success_rate
FROM images;

-- Check for duplicates (should return 0 rows)
SELECT lot_id, url, COUNT(*) AS dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Verify downloaded rows have file paths recorded
SELECT COUNT(*) FROM images
WHERE downloaded = 1
  AND local_path IS NOT NULL
  AND local_path != '';
```
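To cross-check the database against the filesystem, a small script (a sketch; assumes the default paths used above):
```python
import os
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
rows = conn.execute("""
    SELECT local_path FROM images
    WHERE downloaded = 1 AND local_path IS NOT NULL
""").fetchall()
conn.close()

# Count recorded files that are actually present on disk.
missing = [p for (p,) in rows if not os.path.isfile(p)]
print(f"{len(rows) - len(missing)}/{len(rows)} files present, {len(missing)} missing")
```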
## 🔍 Monitoring Process Impact
The monitoring process (auctiora) will automatically:
- ✅ Stop downloading images (network I/O eliminated)
- ✅ Only run object detection on `local_path` files
- ✅ Query: `WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '')`
- ✅ Update only the `labels` and `processed_at` columns
**No changes needed in monitoring process!** It's already updated to work with scraper-downloaded images.
## 🐛 Troubleshooting
### Problem: "No space left on device"
```bash
# Check disk usage
df -h /mnt/okcomputer/output/images
# Estimate needed space: ~100KB per image
# 16,807 images × 100KB = ~1.6GB
```
### Problem: "Permission denied" when writing images
```bash
# Fix permissions
chmod 755 /mnt/okcomputer/output/images
chown -R scraper_user:scraper_group /mnt/okcomputer/output/images
```
### Problem: Images downloading but not recorded in DB
```python
# Add logging
import logging

logging.basicConfig(level=logging.INFO)

def download_image(image_url, lot_id, index):
    # ... build local_path as in Step 2 ...
    logging.info(f"Downloading {image_url} to {local_path}")
    # ... download code ...
    logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes")
    return local_path
```
### Problem: Duplicate images after refactor
```sql
-- Find duplicates
SELECT lot_id, url, COUNT(*)
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Clean up duplicates (keep the newest row per pair)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);
```
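Note that deleting millions of rows does not shrink `cache.db` on disk; SQLite only reclaims the space on `VACUUM`. A one-off cleanup, run while the scraper and monitor are idle (`VACUUM` rewrites the whole file and locks the database):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
conn.execute("VACUUM")  # rebuild the file, reclaiming pages freed by DELETE
conn.close()
```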
## 📈 Performance Comparison
| Metric | Before (Monitor Downloads) | After (Scraper Downloads) |
|----------------------|---------------------------------|---------------------------|
| **Image records** | 57,376,293 | ~16,807 |
| **Duplicates** | 57,359,486 (99.97%!) | 0 |
| **Network I/O** | Monitor process | Scraper process |
| **Disk usage** | 0 (URLs only) | ~1.6GB (actual files) |
| **Processing speed** | 500ms/image (download + detect) | 100ms/image (detect only) |
| **Error handling** | Complex (download failures) | Simple (files exist) |
## 🎓 Code Examples by Language
### Python (Most Likely)
See **Step 2** above for complete implementation.
## 📚 References
- **Current Scraper Architecture**: `wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md`
- **Database Schema**: `wiki/DATABASE_ARCHITECTURE.md`
- **Monitor Changes**: See commit history for `ImageProcessingService.java`, `DatabaseService.java`
## ✅ Success Criteria
You'll know the refactor is successful when (see the verification sketch after this list):
1. ✅ Database query `SELECT COUNT(*) FROM images` returns ~16,807 (not 57M+)
2. ✅ All images have `downloaded = 1` and `local_path IS NOT NULL`
3. ✅ No duplicate records: `SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1` returns 0 rows
4. ✅ Monitor logs show "Found N images needing detection" with reasonable numbers
5. ✅ Files exist at paths in `local_path` column
6. ✅ Monitor process speed increases (100ms vs 500ms per image)
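The first three criteria can be checked in one go (a sketch against the default `cache.db` path; the ~16,807 figure is this guide's estimate, not an exact target):
```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
total = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
incomplete = conn.execute(
    "SELECT COUNT(*) FROM images WHERE downloaded = 0 OR local_path IS NULL"
).fetchone()[0]
dup_pairs = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT 1 FROM images GROUP BY lot_id, url HAVING COUNT(*) > 1
    )
""").fetchone()[0]
conn.close()

print(f"total={total} (expect ~16,807), incomplete={incomplete}, duplicate pairs={dup_pairs}")
```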
---
**Questions?** Check the troubleshooting section or inspect the monitor's updated code in:
- `src/main/java/auctiora/ImageProcessingService.java`
- `src/main/java/auctiora/DatabaseService.java:695-719`