Fix mock tests
docs/SCRAPER_REFACTOR_GUIDE.md (new file, 399 lines)
# Scraper Refactor Guide - Image Download Integration

## 🎯 Objective

Refactor the Troostwijk scraper to **download and store images locally**, eliminating the 57M+ duplicate image records currently created by the monitoring process.

## 📋 Current vs. New Architecture

### **Before** (Current Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│  Stores URLs │         │ images table │         │  Downloads + │
│ downloaded=0 │         │              │         │  Detection   │
└──────────────┘         └──────────────┘         └──────────────┘
                                │
                                ▼
                        57M+ duplicates!
```

### **After** (New Architecture)
```
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   Scraper    │────────▶│   Database   │◀────────│   Monitor    │
│              │         │              │         │              │
│  Downloads + │         │ images table │         │  Detection   │
│  Stores path │         │ local_path ✓ │         │  Only        │
│ downloaded=1 │         │              │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                                │
                                ▼
                         No duplicates!
```

## 🗄️ Database Schema Changes

### Current Schema (ARCHITECTURE-TROOSTWIJK-SCRAPER.md:113-122)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,      -- Currently NULL
    downloaded INTEGER    -- Currently 0
    -- Missing: processed_at, labels (added by monitor)
);
```

### Required Schema (Already Compatible!)
```sql
CREATE TABLE images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,        -- ✅ SET by scraper after download
    downloaded INTEGER,     -- ✅ SET to 1 by scraper after download
    labels TEXT,            -- ⚠️ SET by monitor (object detection)
    processed_at INTEGER,   -- ⚠️ SET by monitor (timestamp)
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```

**Good News**: The scraper's schema already has `local_path` and `downloaded` columns! You just need to populate them.
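
If an older database predates the monitor's two columns, they can be added in place rather than recreating the table; a sketch (SQLite's `ALTER TABLE ... ADD COLUMN` leaves existing rows NULL, which is what the monitor expects):

```sql
ALTER TABLE images ADD COLUMN labels TEXT;
ALTER TABLE images ADD COLUMN processed_at INTEGER;
```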

## 🔧 Implementation Steps

### **Step 1: Enable Image Downloading in Configuration**

**File**: Your scraper's config file (e.g., `config.py` or environment variables)

```python
# Current setting
DOWNLOAD_IMAGES = False  # ❌ Change this!

# New setting
DOWNLOAD_IMAGES = True   # ✅ Enable downloads

# Image storage path
IMAGES_DIR = "/mnt/okcomputer/output/images"  # Or your preferred path
```
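
If the scraper is configured through environment variables instead of `config.py`, a minimal sketch of the same settings (the variable names follow this guide's convention and are assumptions, not an existing API):

```python
import os

# Read the toggles from the environment, falling back to the defaults above
DOWNLOAD_IMAGES = os.getenv("DOWNLOAD_IMAGES", "true").strip().lower() in ("1", "true", "yes")
IMAGES_DIR = os.getenv("IMAGES_DIR", "/mnt/okcomputer/output/images")
```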

### **Step 2: Update Image Download Logic**

Based on ARCHITECTURE-TROOSTWIJK-SCRAPER.md:211-228, you already have the structure. Here's what needs to change:

**Current Code** (Conceptual):
```python
# Phase 3: Scrape lot details
def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Save image URLs to database (NOT DOWNLOADED)
    for img_url in lot_data['images']:
        db.execute("""
            INSERT INTO images (lot_id, url, downloaded)
            VALUES (?, ?, 0)
        """, (lot_data['lot_id'], img_url))
```

**New Code** (Required):
```python
import os
import time
from pathlib import Path
from urllib.parse import urlparse

import requests

def scrape_lot(lot_url):
    lot_data = parse_lot_page(lot_url)

    # Save lot to database
    db.insert_lot(lot_data)

    # Download and save images
    for idx, img_url in enumerate(lot_data['images'], start=1):
        try:
            # Download image
            local_path = download_image(img_url, lot_data['lot_id'], idx)

            # Insert with local_path and downloaded=1
            # (ON CONFLICT relies on the unique index from Step 3)
            db.execute("""
                INSERT INTO images (lot_id, url, local_path, downloaded)
                VALUES (?, ?, ?, 1)
                ON CONFLICT(lot_id, url) DO UPDATE SET
                    local_path = excluded.local_path,
                    downloaded = 1
            """, (lot_data['lot_id'], img_url, local_path))

            # Rate limiting (0.5s between downloads)
            time.sleep(0.5)

        except Exception as e:
            print(f"Failed to download {img_url}: {e}")
            # Still insert a record, but mark it as not downloaded;
            # DO NOTHING avoids a constraint error if the row already exists
            db.execute("""
                INSERT INTO images (lot_id, url, downloaded)
                VALUES (?, ?, 0)
                ON CONFLICT(lot_id, url) DO NOTHING
            """, (lot_data['lot_id'], img_url))

def download_image(image_url, lot_id, index):
    """
    Downloads an image and saves it to an organized directory structure.

    Args:
        image_url: Remote URL of the image
        lot_id: Lot identifier (e.g., "A1-28505-5")
        index: Image sequence number (1, 2, 3, ...)

    Returns:
        Absolute path to the saved file
    """
    # Create directory structure: /images/{lot_id}/
    images_dir = Path(os.getenv('IMAGES_DIR', '/mnt/okcomputer/output/images'))
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    # Determine file extension from the URL path (ignoring any query string)
    ext = Path(urlparse(image_url).path).suffix or '.jpg'
    filename = f"{index:03d}{ext}"  # 001.jpg, 002.jpg, etc.
    local_path = lot_dir / filename

    # Download with timeout
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Save to disk
    with open(local_path, 'wb') as f:
        f.write(response.content)

    return str(local_path.absolute())
```

### **Step 3: Add Unique Constraint to Prevent Duplicates**

**Migration SQL**:
```sql
-- Add unique constraint to prevent duplicate image records
CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
ON images(lot_id, url);
```

**Note**: Creating this index fails if the table already contains duplicate `(lot_id, url)` pairs, so run the duplicate cleanup from the Troubleshooting section first.

Add this to your scraper's schema initialization:

```python
import sqlite3

def init_database():
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.cursor()

    # Existing table creation...
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images (...)
    """)

    # Add unique constraint (NEW)
    cursor.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS idx_images_unique
        ON images(lot_id, url)
    """)

    conn.commit()
    conn.close()
```
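
To confirm the migration took effect, one quick check is SQLite's `PRAGMA index_list` (a sketch; assumes the same `cache.db` path as above):

```python
import sqlite3

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
# PRAGMA index_list returns one row per index: (seq, name, unique, origin, partial)
index_names = [row[1] for row in conn.execute("PRAGMA index_list('images')")]
conn.close()
assert 'idx_images_unique' in index_names, "unique index is missing"
```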

### **Step 4: Handle Image Download Failures Gracefully**

```python
def download_with_retry(image_url, lot_id, index, max_retries=3):
    """Downloads an image with retry logic."""
    for attempt in range(max_retries):
        try:
            return download_image(image_url, lot_id, index)
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {image_url}: {e}")
                return None  # Return None on failure
            print(f"Retry {attempt + 1}/{max_retries} for {image_url}")
            time.sleep(2 ** attempt)  # Exponential backoff
```
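
Wired into `scrape_lot` from Step 2, the retry wrapper replaces the direct `download_image` call; a sketch of the failure-aware call site (loop variables as in Step 2):

```python
local_path = download_with_retry(img_url, lot_data['lot_id'], idx)
if local_path is None:
    # Record the URL but leave it marked as not downloaded
    db.execute("""
        INSERT INTO images (lot_id, url, downloaded)
        VALUES (?, ?, 0)
        ON CONFLICT(lot_id, url) DO NOTHING
    """, (lot_data['lot_id'], img_url))
else:
    db.execute("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, (lot_data['lot_id'], img_url, local_path))
```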

### **Step 5: Update Database Queries**

Make sure your INSERT uses `INSERT ... ON CONFLICT` to handle re-scraping:

```python
# Good: Handles re-scraping without duplicates
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
    ON CONFLICT(lot_id, url) DO UPDATE SET
        local_path = excluded.local_path,
        downloaded = 1
""", (lot_id, img_url, local_path))

# Bad: Creates duplicates on re-scrape
db.execute("""
    INSERT INTO images (lot_id, url, local_path, downloaded)
    VALUES (?, ?, ?, 1)
""", (lot_id, img_url, local_path))
```
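
Since a lot can carry dozens of images, committing once per lot rather than once per row keeps SQLite overhead down. A sketch, assuming a plain `sqlite3` connection named `conn` and a `rows` list of `(lot_id, url, local_path)` tuples collected during download:

```python
with conn:  # one transaction per lot: commits on success, rolls back on error
    conn.executemany("""
        INSERT INTO images (lot_id, url, local_path, downloaded)
        VALUES (?, ?, ?, 1)
        ON CONFLICT(lot_id, url) DO UPDATE SET
            local_path = excluded.local_path,
            downloaded = 1
    """, rows)
```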

## 📊 Expected Outcomes

### Before Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 0;
-- Result: 57,376,293 (57M+ undownloaded!)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: 0 (no files downloaded)
```

### After Refactor
```sql
SELECT COUNT(*) FROM images WHERE downloaded = 1;
-- Result: ~16,807 (one per actual lot image)

SELECT COUNT(*) FROM images WHERE local_path IS NOT NULL;
-- Result: ~16,807 (all downloaded images have paths)

-- SQLite's COUNT(DISTINCT) takes a single expression, so use a subquery
SELECT COUNT(*) FROM (SELECT DISTINCT lot_id, url FROM images);
-- Result: ~16,807 (no duplicates!)
```

## 🚀 Deployment Checklist

### Pre-Deployment
- [ ] Back up current database: `cp cache.db cache.db.backup`
- [ ] Verify disk space: at least 10GB free for images (estimated usage is ~1.6GB; see Performance Comparison)
- [ ] Test download function on 5 sample lots
- [ ] Verify `IMAGES_DIR` path exists and is writable

### Deployment
- [ ] Update configuration: `DOWNLOAD_IMAGES = True`
- [ ] Run schema migration to add unique index
- [ ] Deploy updated scraper code
- [ ] Monitor first 100 lots for errors

### Post-Deployment Verification
```sql
-- Check download success rate
SELECT
    COUNT(*) as total_images,
    SUM(CASE WHEN downloaded = 1 THEN 1 ELSE 0 END) as downloaded,
    SUM(CASE WHEN downloaded = 0 THEN 1 ELSE 0 END) as failed,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
FROM images;

-- Check for duplicates (should return 0 rows)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Verify paths are recorded (spot-check the files on disk separately)
SELECT COUNT(*) FROM images
WHERE downloaded = 1
  AND local_path IS NOT NULL
  AND local_path != '';
```
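
To cross-check the database against the file system, a small script can compare the recorded paths with what is actually on disk (a sketch; assumes the `cache.db` path used throughout this guide):

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
paths = [row[0] for row in conn.execute(
    "SELECT local_path FROM images WHERE downloaded = 1 AND local_path IS NOT NULL")]
conn.close()

# Any path recorded in the DB but absent on disk indicates a failed or moved file
missing = [p for p in paths if not Path(p).is_file()]
print(f"{len(paths)} paths recorded, {len(missing)} missing on disk")
```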

## 🔍 Monitoring Process Impact

The monitoring process (auctiora) will automatically:
- ✅ Stop downloading images (network I/O eliminated)
- ✅ Only run object detection on `local_path` files
- ✅ Query: `WHERE local_path IS NOT NULL AND (labels IS NULL OR labels = '')`
- ✅ Update only the `labels` and `processed_at` columns

**No changes are needed in the monitoring process!** It has already been updated to work with scraper-downloaded images.
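
For reference, the monitor's read/write cycle amounts to roughly the following (a sketch inferred from the bullets above, not the monitor's literal Java code):

```sql
-- Select work: downloaded images that have not been labeled yet
SELECT id, local_path
FROM images
WHERE local_path IS NOT NULL
  AND (labels IS NULL OR labels = '');

-- After detection: record labels and the processing timestamp
UPDATE images
SET labels = ?, processed_at = CAST(strftime('%s', 'now') AS INTEGER)
WHERE id = ?;
```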

## 🐛 Troubleshooting

### Problem: "No space left on device"
```bash
# Check disk usage
df -h /mnt/okcomputer/output/images

# Estimate needed space: ~100KB per image
# 16,807 images × 100KB = ~1.6GB
```

### Problem: "Permission denied" when writing images
```bash
# Fix permissions
chmod 755 /mnt/okcomputer/output/images
chown -R scraper_user:scraper_group /mnt/okcomputer/output/images
```

### Problem: Images downloading but not recorded in DB
```python
# Add logging
import logging
logging.basicConfig(level=logging.INFO)

def download_image(image_url, lot_id, index):
    logging.info(f"Downloading {image_url}")
    # ... existing download code from Step 2 builds local_path ...
    logging.info(f"Saved to {local_path}, size: {os.path.getsize(local_path)} bytes")
    return local_path
```

### Problem: Duplicate images after refactor
```sql
-- Find duplicates
SELECT lot_id, url, COUNT(*)
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

-- Clean up duplicates (keep newest)
DELETE FROM images
WHERE id NOT IN (
    SELECT MAX(id)
    FROM images
    GROUP BY lot_id, url
);
```

## 📈 Performance Comparison

| Metric               | Before (Monitor Downloads)      | After (Scraper Downloads) |
|----------------------|---------------------------------|---------------------------|
| **Image records**    | 57,376,293                      | ~16,807                   |
| **Duplicates**       | 57,359,486 (99.97%!)            | 0                         |
| **Network I/O**      | Monitor process                 | Scraper process           |
| **Disk usage**       | 0 (URLs only)                   | ~1.6GB (actual files)     |
| **Processing speed** | 500ms/image (download + detect) | 100ms/image (detect only) |
| **Error handling**   | Complex (download failures)     | Simple (files exist)      |

## 🎓 Code Examples by Language

### Python (Most Likely)
See **Step 2** above for the complete implementation.

## 📚 References

- **Current Scraper Architecture**: `wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md`
- **Database Schema**: `wiki/DATABASE_ARCHITECTURE.md`
- **Monitor Changes**: See commit history for `ImageProcessingService.java`, `DatabaseService.java`

## ✅ Success Criteria

You'll know the refactor is successful when:

1. ✅ Database query `SELECT COUNT(*) FROM images` returns ~16,807 (not 57M+)
2. ✅ All images have `downloaded = 1` and `local_path IS NOT NULL`
3. ✅ No duplicate records: `SELECT lot_id, url, COUNT(*) ... HAVING COUNT(*) > 1` returns 0 rows
4. ✅ Monitor logs show "Found N images needing detection" with reasonable numbers
5. ✅ Files exist at the paths in the `local_path` column
6. ✅ Monitor processing speed increases (100ms vs 500ms per image)

---

**Questions?** Check the troubleshooting section or inspect the monitor's updated code in:
- `src/main/java/auctiora/ImageProcessingService.java`
- `src/main/java/auctiora/DatabaseService.java:695-719`