# Scraper Refactor Guide - Image Download Integration

## 🎯 Objective

Refactor the Troostwijk scraper to **download and store images locally**, eliminating the 57M+ duplicate image records currently created by the monitoring process.

## 📋 Current vs. New Architecture

### **Before** (Current Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Stores URLs  │         │ images table │         │ Downloads +  │
│ downloaded=0 │         │              │         │  Detection   │
└──────────────┘         └──────────────┘         └──────────────┘
                                                         │
                                                         ▼
                                                 57M+ duplicates!
```

### **After** (New Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│ Downloads +  │         │ images table │         │  Detection   │
│ Stores path  │         │ local_path ✓ │         │     Only     │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                                                         │
                                                         ▼
                                                  No duplicates!
```

## 🗄️ Database Schema Changes

### Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- Currently NULL
    downloaded INTEGER      -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);
```

### Required Schema (Already Compatible!)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- ✅ SET by scraper after download
    downloaded INTEGER,     -- ✅ SET to 1 by scraper after download
    labels TEXT,            -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,   -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```

**Good News**: The scraper's schema already has `local_path` and `downloaded` columns! You just need to populate them.
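
Before changing any code, it may help to confirm the starting state of those columns. A quick check against the `cache.db` path used throughout this guide:

```sql
-- How many rows are already downloaded vs. pending?
SELECT downloaded, COUNT(*) FROM images GROUP BY downloaded;

-- Are any local paths already populated?
SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
```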

## 🔧 Implementation Steps

### **Step 1: Enable Image Downloading in Configuration**

**File**: Your scraper's config file (e.g., `config.py` or environment variables)

```python
# Current setting
DOWNLOAD_IMAGES = False  # ❌ Change this!

# New setting
DOWNLOAD_IMAGES = True   # ✅ Enable downloads

# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images"  # Or your preferred path
```
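
If the scraper is configured through environment variables rather than `config.py`, a minimal sketch of the same two settings (the variable names here simply mirror this guide; adjust to your scraper's actual config keys):

```python
import os

# Hypothetical env-driven equivalents of the config.py settings above
DOWNLOAD_IMAGES = os.getenv("DOWNLOAD_IMAGES", "true").lower() == "true"
IMAGES_DIR = os.getenv("IMAGES_DIR", "/mnt/okcomputer/output/images")
```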

### **Step 2: Update Image Download Logic**

Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change:

**Current Code** (Conceptual):
```python
# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))
```

**New Code** (Required):
```python
import os
import time
from pathlib import Path
from urllib.parse import urlparse

import requests

def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)

        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still record the URL, but leave it marked as not downloaded.
            # DO NOTHING (rather than a plain INSERT) keeps re-scrapes from
            # violating the unique index added in Step 3.
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))

def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to an organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to the saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL path (strips any query string)
    ext = Path(urlparse(image_url).path).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
```
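
Note that `db` above is conceptual. If your scraper doesn't already have a database wrapper, here is a minimal sketch matching the `db.execute(sql, params)` calls used in this guide (the `cache.db` path follows the rest of the document; committing per statement is an assumption, and batching commits may suit a high-volume scraper better):

```python
import sqlite3

class Db:
    """Minimal wrapper matching the db.execute(sql, params) calls above."""

    def __init__(self, path='/mnt/okcomputer/output/cache.db'):
        self.conn = sqlite3.connect(path)

    def execute(self, sql, params=()):
        self.conn.execute(sql, params)
        self.conn.commit()  # per-statement commit; batch for performance

db = Db()
```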

### **Step 3: Add Unique Constraint to Prevent Duplicates**

**Migration SQL**:
```sql
-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```
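
One caveat: `CREATE UNIQUE INDEX` fails if the table already contains duplicate `(lot_id, url)` pairs, which the current 57M-row table certainly does. Run the cleanup from the Troubleshooting section first, so the full migration is effectively:

```sql
-- 1. Remove existing duplicates (keep the newest record per lot_id/url)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);

-- 2. Then the unique index can be created
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```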

Add this to your scraper's schema initialization:

```python
import sqlite3

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()
```

### **Step 4: Handle Image Download Failures Gracefully**

```python
def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads an image with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}")
                return None  # Return None on failure
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}")
            time.sleep(2 ** attempt)  # Exponential backoff
```
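
To wire this into Step 2, the download call inside `scrape_lot` would become something like the following sketch, where the `None` return decides which INSERT branch runs:

```python
# Inside the image loop of scrape_lot():
local_path = download_with_retry(img_url, lot_data['lot_id'], idx)

if local_path is not None:
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path))
else:
    db.execute("""
        INSERT INTO images (lot_id, url, downloaded)
        VALUES (?, ?, 0)
        ON CONFLICT(lot_id, url) DO NOTHING
    """, (lot_data['lot_id'], img_url))
```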

### **Step 5: Update Database Queries**

Make sure your INSERT uses `INSERT ... ON CONFLICT` to handle re-scraping:

```python
# Good: Handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: Creates duplicates on re-scrape (or raises IntegrityError
# once the unique index from Step 3 exists)
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```

## 📊 Expected Outcomes

### Before Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)
```

### After Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)

SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)
```

## 🚀 Deployment Checklist

### Pre-Deployment
- [ ] Back up current database: `cp cache.db cache.db.backup`
- [ ] Verify disk space: At least 10GB free for images
- [ ] Test download function on 5 sample lots
- [ ] Verify `IMAGES_DIR` path exists and is writable

### Deployment
- [ ] Update configuration: `DOWNLOAD_IMAGES = True`
- [ ] Run schema migration to add unique index
- [ ] Deploy updated scraper code
- [ ] Monitor first 100 lots for errors

### Post-Deployment Verification
```sql
-- Check download success rate
SELECT
    COUNT(*) as total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) as downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) as failed,
    ROUND(100.0 * SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
FROM images;

-- Check for duplicates (should return 0 rows)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Confirm downloaded rows have a recorded path
-- (spot-check that the files actually exist on disk separately)
SELECT COUNT(*) FROM images
WHERE downloaded = 1
AND local_path IS NOT NULL
AND local_path != '';
```

## 🔍 Monitoring Process Impact

The monitoring process (auctiora) will automatically:
- ✅ Stop downloading images (network I/O eliminated)
- ✅ Only run object detection on `local_path` files
- ✅ Query: `WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '')`
- ✅ Update only the `labels` and `processed_at` columns

**No changes needed in the monitoring process!** It's already updated to work with scraper-downloaded images.
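
For reference, the monitor's selection query presumably looks something like the sketch below, based on the WHERE clause above (not the exact SQL in `DatabaseService.java`):

```sql
-- Images that are on disk but not yet processed by object detection
SELECT id, lot_id, local_path
FROM images
WHERE local_path IS NOT NULL
  AND (labels IS NULL OR labels = '');
```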

## 🐛 Troubleshooting

### Problem: "No space left on device"
```bash
# Check disk usage
df -h /mnt/okcomputer/output/images

# Estimate needed space: ~100KB per image
# 16,807 images × 100KB = ~1.6GB
```
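
To measure what the downloads actually consume (rather than the estimate):

```bash
# Actual size of everything downloaded so far
du -sh /mnt/okcomputer/output/images

# Largest per-lot directories, if one lot looks suspicious
du -sh /mnt/okcomputer/output/images/* | sort -rh | head -20
```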

### Problem: "Permission denied" when writing images
```bash
# Fix permissions
chmod 755 /mnt/okcomputer/output/images
chown -R scraper_user:scraper_group /mnt/okcomputer/output/images
```

### Problem: Images downloading but not recorded in DB
```python
# Add logging
import logging
logging.basicConfig(level=logging.INFO)

def download_image(image_url, lot_id, index):
    # ... path setup as in Step 2 ...
    logging.info(f"Downloading {image_url} to {local_path}")
    # ... download code ...
    logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes")
    return local_path
```

### Problem: Duplicate images after refactor
```sql
-- Find duplicates
SELECT lot_id, url, COUNT(*)
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Clean up duplicates (keep newest)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);
```
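
After deleting tens of millions of rows, SQLite keeps the freed pages inside the database file; reclaiming the disk space requires a `VACUUM`, which rewrites the whole database (expect it to take a while and need temporary space roughly the size of the file):

```sql
VACUUM;
```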

## 📈 Performance Comparison

| Metric               | Before (Monitor Downloads)      | After (Scraper Downloads) |
|----------------------|---------------------------------|---------------------------|
| **Image records**    | 57,376,293                      | ~16,807                   |
| **Duplicates**       | 57,359,486 (99.97%!)            | 0                         |
| **Network I/O**      | Monitor process                 | Scraper process           |
| **Disk usage**       | 0 (URLs only)                   | ~1.6GB (actual files)     |
| **Processing speed** | 500ms/image (download + detect) | 100ms/image (detect only) |
| **Error handling**   | Complex (download failures)     | Simple (files exist)      |

## 🎓 Code Examples by Language

### Python (Most Likely)
See **Step 2** above for the complete implementation.

## 📚 References

- **Current Scraper Architecture**: `wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md`
- **Database Schema**: `wiki/DATABASE_ARCHITECTURE.md`
- **Monitor Changes**: See commit history for `ImageProcessingService.java`, `DatabaseService.java`

## ✅ Success Criteria

You'll know the refactor is successful when:

1. ✅ Database query `SELECT COUNT(*) FROM images` returns ~16,807 (not 57M+)
2. ✅ All images have `downloaded = 1` and `local_path IS NOT NULL`
3. ✅ No duplicate records: `SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1` returns 0 rows
4. ✅ Monitor logs show "Found N images needing detection" with reasonable numbers
5. ✅ Files exist at the paths in the `local_path` column
6. ✅ Monitor process speed increases (100ms vs 500ms per image)
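
A small script can check criteria 1, 2, 3, and 5 in one pass. A minimal sketch, assuming the `cache.db` path used throughout this guide (criteria 4 and 6 still have to be read from the monitor's logs):

```python
import os
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
cur = conn.cursor()

# 1. Total record count (expect ~16,807, not 57M+)
total = cur.execute("SELECT COUNT(*) FROM images").fetchone()[0]
print(f"total images: {total}")

# 2. Every image downloaded with a path (expect 0)
pending = cur.execute(
    "SELECT COUNT(*) FROM images WHERE downloaded != 1 OR local_path IS NULL"
).fetchone()[0]
print(f"not downloaded / missing path: {pending}")

# 3. No duplicate (lot_id, url) pairs (expect 0)
dupes = cur.execute("""
    SELECT COUNT(*) FROM (
        SELECT lot_id, url FROM images
        GROUP BY lot_id, url HAVING COUNT(*) > 1
    )
""").fetchone()[0]
print(f"duplicate pairs: {dupes}")

# 5. Files actually exist on disk (expect 0 missing)
missing = [p for (p,) in cur.execute(
    "SELECT local_path FROM images WHERE downloaded = 1 AND local_path IS NOT NULL"
) if not os.path.exists(p)]
print(f"paths missing on disk: {len(missing)}")

conn.close()
```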

---

**Questions?** Check the troubleshooting section or inspect the monitor's updated code in:
- `src/main/java/auctiora/ImageProcessingService.java`
- `src/main/java/auctiora/DatabaseService.java:695-719`