# Integration Guide: Troostwijk Monitor ↔ Scraper
## Overview
This document describes how **Troostwijk Monitor** (this Java project) integrates with the **ARCHITECTURE-TROOSTWIJK-SCRAPER** (Python scraper process).
## Architecture
```
┌────────────────────────────────────────────────────┐
│      ARCHITECTURE-TROOSTWIJK-SCRAPER (Python)      │
│                                                    │
│  • Discovers auctions from website                 │
│  • Scrapes lot details via Playwright              │
│  • Parses __NEXT_DATA__ JSON                       │
│  • Stores image URLs (not downloads)               │
└──────────────────────────┬─────────────────────────┘
                           │ Writes to
                           ▼
┌────────────────────────────────────────────────────┐
│               SHARED SQLite DATABASE               │
│                     (cache.db)                     │
│                                                    │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐    │
│  │  auctions  │  │    lots    │  │   images   │    │
│  │ (Scraper)  │  │ (Scraper)  │  │   (Both)   │    │
│  └────────────┘  └────────────┘  └────────────┘    │
└──────────────────────────┬─────────────────────────┘
                           │ Reads from / writes to
                           ▼
┌────────────────────────────────────────────────────┐
│      TROOSTWIJK MONITOR (Java - This Project)      │
│                                                    │
│  • Reads auction/lot data from database            │
│  • Downloads images from URLs                      │
│  • Runs YOLO object detection                      │
│  • Monitors bid changes                            │
│  • Sends notifications                             │
└────────────────────────────────────────────────────┘
```
## Database Schema Mapping
### Scraper Schema → Monitor Schema
The scraper and monitor use **slightly different schemas** that need to be reconciled:
| Scraper Table | Monitor Table | Integration Notes |
|---------------|---------------|-----------------------------------------------|
| `auctions` | `auctions` | ✅ **Compatible** - same structure |
| `lots` | `lots` | ⚠️ **Needs mapping** - field name differences |
| `images` | `images` | ⚠️ **Partial overlap** - different purposes |
| `cache` | N/A | ❌ Monitor doesn't use cache |
### Field Mapping: `auctions` Table
| Scraper Field | Monitor Field | Notes |
|--------------------------|-------------------------------|---------------------------------------------------------------------|
| `auction_id` (TEXT) | `auction_id` (INTEGER) | ⚠️ **TYPE MISMATCH** - Scraper uses "A7-39813", Monitor expects INT |
| `url` | `url` | ✅ Compatible |
| `title` | `title` | ✅ Compatible |
| `location` | `location`, `city`, `country` | ⚠️ Monitor splits into 3 fields |
| `lots_count` | `lot_count` | ⚠️ Name difference |
| `first_lot_closing_time` | `closing_time` | ⚠️ Name difference |
| `scraped_at` | `discovered_at` | ⚠️ Name + type difference (TEXT vs INTEGER timestamp) |
### Field Mapping: `lots` Table
| Scraper Field | Monitor Field | Notes |
|----------------------|----------------------|--------------------------------------------------|
| `lot_id` (TEXT) | `lot_id` (INTEGER) | ⚠️ **TYPE MISMATCH** - "A1-28505-5" vs INT |
| `auction_id` | `sale_id` | ⚠️ Different name |
| `url` | `url` | ✅ Compatible |
| `title` | `title` | ✅ Compatible |
| `current_bid` (TEXT) | `current_bid` (REAL) | ⚠️ **TYPE MISMATCH** - "€123.45" vs 123.45 |
| `bid_count` | N/A | Monitor doesn't track |
| `closing_time` | `closing_time` | ⚠️ Format difference (TEXT vs LocalDateTime) |
| `viewing_time` | N/A | Monitor doesn't track |
| `pickup_date` | N/A | Monitor doesn't track |
| `location` | N/A | Monitor doesn't track lot location separately |
| `description` | `description` | ✅ Compatible |
| `category` | `category` | ✅ Compatible |
| N/A | `manufacturer` | Monitor has additional field |
| N/A | `type` | Monitor has additional field |
| N/A | `year` | Monitor has additional field |
| N/A | `currency` | Monitor has additional field |
| N/A | `closing_notified` | Monitor tracking field |
### Field Mapping: `images` Table
| Scraper Field | Monitor Field | Notes |
|------------------------|--------------------------|----------------------------------------|
| `id` | `id` | ✅ Compatible |
| `lot_id` | `lot_id` | ⚠️ Type difference (TEXT vs INTEGER) |
| `url` | `url` | ✅ Compatible |
| `local_path` | `Local_path` | ⚠️ Capitalization differs (SQLite column names are case-insensitive) |
| `downloaded` (INTEGER) | N/A | Monitor uses `processed_at` instead |
| N/A | `labels` (TEXT) | Monitor adds detected objects |
| N/A | `processed_at` (INTEGER) | Monitor tracking field |
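
For reference, a minimal sketch of the monitor's pass over this table, assuming plain JDBC (sqlite-jdbc) and a row that carries both the scraper's `downloaded` flag and the monitor's `labels`/`processed_at` columns, as in the unified schema of Option 2 below. `download(...)` and `detectObjects(...)` are hypothetical stand-ins for the monitor's download and YOLO steps.

```java
import java.nio.file.Path;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

class ImagePass {

    record PendingImage(long id, String lotId, String url) {}

    static void processPendingImages(String dbPath) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath)) {
            // Collect pending rows first so the updates don't race the open cursor
            List<PendingImage> pending = new ArrayList<>();
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT id, lot_id, url FROM images WHERE downloaded = 0")) {
                while (rs.next()) {
                    pending.add(new PendingImage(
                            rs.getLong("id"), rs.getString("lot_id"), rs.getString("url")));
                }
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE images SET local_path = ?, labels = ?, downloaded = 1, "
                    + "processed_at = strftime('%s','now') WHERE id = ?")) {
                for (PendingImage img : pending) {
                    Path local = download(img.url(), img.lotId());  // hypothetical download helper
                    String labels = detectObjects(local);           // hypothetical YOLO wrapper
                    update.setString(1, local.toString());
                    update.setString(2, labels);
                    update.setLong(3, img.id());
                    update.executeUpdate();
                }
            }
        }
    }

    // Hypothetical stubs; the real implementations live elsewhere in the monitor
    static Path download(String url, String lotId) { throw new UnsupportedOperationException(); }
    static String detectObjects(Path image) { throw new UnsupportedOperationException(); }
}
```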
## Integration Options
### Option 1: Database Schema Adapter (Recommended)
Create a compatibility layer that transforms scraper data to monitor format.
**Implementation:**
```java
// Add to DatabaseService.java
// Requires: java.sql.ResultSet, java.sql.SQLException, java.time.LocalDateTime

class ScraperDataAdapter {

    /**
     * Imports an auction row from scraper format to monitor format.
     */
    static AuctionInfo fromScraperAuction(ResultSet rs) throws SQLException {
        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int auctionId = extractNumericId(auctionIdStr);

        // Split "Cluj-Napoca, RO" → city="Cluj-Napoca", country="RO"
        String location = rs.getString("location");
        if (location == null) location = "";
        String[] parts = location.split(",\\s*");
        String city = parts.length > 0 ? parts[0] : "";
        String country = parts.length > 1 ? parts[1] : "";

        return new AuctionInfo(
                auctionId,
                rs.getString("title"),
                location,
                city,
                country,
                rs.getString("url"),
                extractTypePrefix(auctionIdStr),   // "A7-39813" → "A7"
                rs.getInt("lots_count"),
                parseTimestamp(rs.getString("first_lot_closing_time"))
        );
    }

    /**
     * Imports a lot row from scraper format to monitor format.
     */
    static Lot fromScraperLot(ResultSet rs) throws SQLException {
        // Parse "A1-28505-5" → 285055 (combine the digits after the prefix)
        String lotIdStr = rs.getString("lot_id");
        int lotId = extractNumericId(lotIdStr);

        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int saleId = extractNumericId(auctionIdStr);

        // Parse "€123.45" → 123.45
        String currentBidStr = rs.getString("current_bid");
        double currentBid = parseBid(currentBidStr);

        return new Lot(
                saleId,
                lotId,
                rs.getString("title"),
                rs.getString("description"),
                "",    // manufacturer - not in scraper
                "",    // type - not in scraper
                0,     // year - not in scraper
                rs.getString("category"),
                currentBid,
                "EUR", // currency - inferred from €
                rs.getString("url"),
                parseTimestamp(rs.getString("closing_time")),
                false  // not yet notified
        );
    }

    private static int extractNumericId(String id) {
        // "A7-39813" → 39813, "A1-28505-5" → 285055.
        // Strip the type prefix first; stripping *all* non-digits up front
        // would wrongly keep the prefix digit ("A7-39813" → 739813).
        int dashIndex = id.indexOf('-');
        String numericPart = dashIndex >= 0 ? id.substring(dashIndex + 1) : id;
        return Integer.parseInt(numericPart.replaceAll("[^0-9]", ""));
    }

    private static String extractTypePrefix(String id) {
        // "A7-39813" → "A7"
        int dashIndex = id.indexOf('-');
        return dashIndex > 0 ? id.substring(0, dashIndex) : "";
    }

    private static double parseBid(String bid) {
        // "€123.45" → 123.45; "No bids" (or null/empty) → 0.0
        if (bid == null || bid.contains("No")) return 0.0;
        String digits = bid.replaceAll("[^0-9.]", "");
        return digits.isEmpty() ? 0.0 : Double.parseDouble(digits);
    }

    private static LocalDateTime parseTimestamp(String timestamp) {
        if (timestamp == null) return null;
        // Assumes the scraper writes ISO 8601 timestamps, e.g. "2025-12-06T05:39:59"
        return LocalDateTime.parse(timestamp);
    }
}
```
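
A hypothetical import pass wiring the adapter into `DatabaseService`; `scraperConnection()`, `saveAuction(...)`, and `saveLot(...)` are illustrative names for methods the service would provide, not existing monitor APIs:

```java
// Hypothetical usage inside DatabaseService
void importFromScraper() throws SQLException {
    try (Statement st = scraperConnection().createStatement()) {
        try (ResultSet rs = st.executeQuery("SELECT * FROM auctions")) {
            while (rs.next()) {
                saveAuction(ScraperDataAdapter.fromScraperAuction(rs));
            }
        }
        try (ResultSet rs = st.executeQuery("SELECT * FROM lots")) {
            while (rs.next()) {
                saveLot(ScraperDataAdapter.fromScraperLot(rs));
            }
        }
    }
}
```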
### Option 2: Unified Schema (Better Long-term)
Modify **both** scraper and monitor to use a unified schema.
**Create**: `SHARED_SCHEMA.sql`
```sql
-- Unified schema that both projects use
CREATE TABLE IF NOT EXISTS auctions (
    auction_id         TEXT PRIMARY KEY,  -- Use TEXT to support "A7-39813"
    auction_id_numeric INTEGER,           -- For monitor's integer needs
    title              TEXT NOT NULL,
    location           TEXT,              -- Full: "Cluj-Napoca, RO"
    city               TEXT,              -- Parsed: "Cluj-Napoca"
    country            TEXT,              -- Parsed: "RO"
    url                TEXT NOT NULL,
    type               TEXT,              -- "A7", "A1"
    lot_count          INTEGER DEFAULT 0,
    closing_time       TEXT,              -- ISO 8601 format
    scraped_at         INTEGER,           -- Unix timestamp
    discovered_at      INTEGER            -- Unix timestamp (same as scraped_at)
);

CREATE TABLE IF NOT EXISTS lots (
    lot_id           TEXT PRIMARY KEY,    -- Use TEXT: "A1-28505-5"
    lot_id_numeric   INTEGER,             -- For monitor's integer needs
    auction_id       TEXT,                -- FK: "A7-39813"
    sale_id          INTEGER,             -- For monitor (same as auction_id_numeric)
    title            TEXT,
    description      TEXT,
    manufacturer     TEXT,
    type             TEXT,
    year             INTEGER,
    category         TEXT,
    current_bid_text TEXT,                -- "€123.45" or "No bids"
    current_bid      REAL,                -- 123.45
    bid_count        INTEGER,
    currency         TEXT DEFAULT 'EUR',
    url              TEXT UNIQUE,
    closing_time     TEXT,
    viewing_time     TEXT,
    pickup_date      TEXT,
    location         TEXT,
    closing_notified INTEGER DEFAULT 0,
    scraped_at       TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);

CREATE TABLE IF NOT EXISTS images (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id       TEXT,                    -- FK: "A1-28505-5"
    url          TEXT,                    -- Image URL from website
    local_path   TEXT,                    -- Local path after download
    labels       TEXT,                    -- Detected objects (comma-separated)
    downloaded   INTEGER DEFAULT 0,       -- 0=pending, 1=downloaded
    processed_at INTEGER,                 -- Unix timestamp when processed
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_auctions_country ON auctions(country);
CREATE INDEX IF NOT EXISTS idx_lots_auction_id ON lots(auction_id);
CREATE INDEX IF NOT EXISTS idx_images_lot_id ON images(lot_id);
CREATE INDEX IF NOT EXISTS idx_images_downloaded ON images(downloaded);
```
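
Either project can apply this file idempotently at startup, since every statement uses `IF NOT EXISTS`. A minimal Java sketch, assuming `SHARED_SCHEMA.sql` sits in the working directory (the path is an assumption):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

class SchemaBootstrap {
    static void apply(String dbPath) throws Exception {
        String ddl = Files.readString(Path.of("SHARED_SCHEMA.sql"));
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
             Statement st = conn.createStatement()) {
            // Naive split on ';' is adequate here: the schema file contains no
            // semicolons inside strings or triggers
            for (String stmt : ddl.split(";")) {
                if (!stmt.isBlank()) st.execute(stmt);
            }
        }
    }
}
```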
### Option 3: API Integration (Most Flexible)
Have the scraper expose a REST API for the monitor to query.
```python
# In scraper: add a Flask API endpoint
# (CACHE_DB and extract_numeric_id come from the scraper's own config/helpers;
#  extract_numeric_id mirrors the Java extractNumericId above)
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)


@app.route('/api/auctions', methods=['GET'])
def get_auctions():
    """Returns auctions in monitor-compatible format."""
    conn = sqlite3.connect(CACHE_DB)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM auctions WHERE location LIKE '%NL%'")
    auctions = []
    for row in cursor.fetchall():
        auctions.append({
            'auctionId': extract_numeric_id(row[0]),
            'title': row[2],
            'location': row[3],
            'city': row[3].split(',')[0] if row[3] else '',
            'country': row[3].split(',')[1].strip() if row[3] and ',' in row[3] else '',
            'url': row[1],
            'type': row[0].split('-')[0],
            'lotCount': row[4],
            'closingTime': row[5],
        })
    conn.close()
    return jsonify(auctions)
```
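
On the monitor side, the JDK's built-in `java.net.http` client (Java 11+) is enough to consume this endpoint. A sketch; the host and port are assumptions (Flask defaults to 5000), and JSON parsing is left to whatever library the monitor already uses:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ScraperApiClient {
    static String fetchAuctions() throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5000/api/auctions")) // adjust host/port as needed
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // JSON array of monitor-compatible auction objects
    }
}
```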
## Recommended Integration Steps
### Phase 1: Immediate (Adapter Pattern)
1. ✅ Keep separate schemas
2. ✅ Create `ScraperDataAdapter` in Monitor
3. ✅ Add import methods to `DatabaseService`
4. ✅ Monitor reads from scraper's tables using adapter
### Phase 2: Short-term (Unified Schema)
1. 📋 Design unified schema (see Option 2)
2. 📋 Update scraper to use unified schema
3. 📋 Update monitor to use unified schema
4. 📋 Migrate existing data
### Phase 3: Long-term (API + Event-driven)
1. 📋 Add REST API to scraper
2. 📋 Add webhook/event notification when new data arrives (see the sketch after this list)
3. 📋 Monitor subscribes to events
4. 📋 Process images asynchronously
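
For step 2, a minimal sketch of a monitor-side listener using the JDK's built-in `com.sun.net.httpserver` (no extra dependencies); the `/webhook` path and port 8080 are assumptions:

```java
import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;

class WebhookListener {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/webhook", exchange -> {
            // The scraper would POST here after each run; the body could carry
            // the IDs of new/updated lots so the monitor knows what to process
            byte[] body = exchange.getRequestBody().readAllBytes();
            System.out.println("Scrape finished: " + new String(body));
            exchange.sendResponseHeaders(204, -1); // 204 No Content, empty response
            exchange.close();
            // Hand off to the image pipeline asynchronously (step 4)
        });
        server.start();
        System.out.println("Listening for scraper events on :8080/webhook");
    }
}
```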
## Current Integration Flow
### Scraper Process (Python)
```bash
# 1. Run scraper to populate database
cd /path/to/scraper
python scraper.py
# Output:
# ✅ Scraped 42 auctions
# ✅ Scraped 1,234 lots
# ✅ Saved 3,456 image URLs
# ✅ Data written to: /mnt/okcomputer/output/cache.db
```
### Monitor Process (Java)
```bash
# 2. Run monitor to process the data
cd /path/to/monitor
export DATABASE_FILE=/mnt/okcomputer/output/cache.db
java -jar troostwijk-monitor.jar
# Output:
# 📊 Current Database State:
# Total lots in database: 1,234
# Total images processed: 0
#
# [1/2] Processing images...
# Downloading and analyzing 3,456 images...
#
# [2/2] Starting bid monitoring...
# ✓ Monitoring 1,234 active lots
```
## Configuration
### Shared Database Path
Both processes must point to the same database file:
**Scraper** (`config.py`):
```python
CACHE_DB = '/mnt/okcomputer/output/cache.db'
```
**Monitor** (`Main.java`):
```java
String databaseFile = System.getenv().getOrDefault(
"DATABASE_FILE",
"/mnt/okcomputer/output/cache.db"
);
```
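
Since both processes open the same SQLite file, it is also worth enabling WAL mode and a busy timeout so the monitor's reads don't fail while the scraper is writing. A minimal monitor-side sketch (the scraper can issue the same PRAGMAs through `sqlite3`):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

class SharedDbConnection {
    static Connection open(String dbPath) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
        try (Statement st = conn.createStatement()) {
            st.execute("PRAGMA journal_mode=WAL");  // readers no longer block the writer
            st.execute("PRAGMA busy_timeout=5000"); // wait up to 5s on a lock instead of erroring
        }
        return conn;
    }
}
```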
### Recommended Directory Structure
```
/mnt/okcomputer/
├── scraper/                    # Python scraper code
│   ├── scraper.py
│   └── requirements.txt
├── monitor/                    # Java monitor code
│   ├── troostwijk-monitor.jar
│   └── models/                 # YOLO models
│       ├── yolov4.cfg
│       ├── yolov4.weights
│       └── coco.names
└── output/                     # Shared data directory
    ├── cache.db                # Shared SQLite database
    └── images/                 # Downloaded images
        ├── A1-28505-5/
        │   ├── 001.jpg
        │   └── 002.jpg
        └── ...
```
## Monitoring & Coordination
### Option A: Sequential Execution
```bash
#!/bin/bash
# run-pipeline.sh
echo "Step 1: Scraping..."
python scraper/scraper.py
echo "Step 2: Processing images..."
java -jar monitor/troostwijk-monitor.jar --process-images-only
echo "Step 3: Starting monitor..."
java -jar monitor/troostwijk-monitor.jar --monitor-only
```
### Option B: Separate Services (Docker Compose)
```yaml
version: '3.8'

services:
  scraper:
    build: ./scraper
    volumes:
      - ./output:/data
    environment:
      - CACHE_DB=/data/cache.db
    command: python scraper.py

  monitor:
    build: ./monitor
    volumes:
      - ./output:/data
    environment:
      - DATABASE_FILE=/data/cache.db
      - NOTIFICATION_CONFIG=desktop
    depends_on:
      - scraper
    command: java -jar troostwijk-monitor.jar
```
### Option C: Cron-based Scheduling
```cron
# Scrape every 6 hours
0 */6 * * * cd /mnt/okcomputer/scraper && python scraper.py
# Process images every hour (if new lots found)
0 * * * * cd /mnt/okcomputer/monitor && java -jar monitor.jar --process-new
# Monitor runs continuously
@reboot cd /mnt/okcomputer/monitor && java -jar monitor.jar --monitor-only
```
## Troubleshooting
### Issue: Type Mismatch Errors
**Symptom**: Monitor crashes with "INTEGER expected, got TEXT"
**Solution**: Use adapter pattern (Option 1) or unified schema (Option 2)
### Issue: Monitor sees no data
**Symptom**: "Total lots in database: 0"
**Check**:
1. Is `DATABASE_FILE` env var set correctly?
2. Did scraper actually write data?
3. Are both processes using the same database file?
```bash
# Verify database has data
sqlite3 /mnt/okcomputer/output/cache.db "SELECT COUNT(*) FROM lots"
```
### Issue: Images not downloading
**Symptom**: "Total images processed: 0" but scraper found images
**Check**:
1. Does the scraper actually write image URLs to the `images` table?
2. Does the monitor see pending rows? Try: `sqlite3 /mnt/okcomputer/output/cache.db "SELECT COUNT(*) FROM images WHERE downloaded=0"`
3. Field name mapping: `local_path` (scraper) vs `Local_path` (monitor) — the names differ only in capitalization, which SQLite treats as equivalent
## Next Steps
1. **Immediate**: Implement `ScraperDataAdapter` for compatibility
2. **This Week**: Test end-to-end integration with sample data
3. **Next Sprint**: Migrate to unified schema
4. **Future**: Add event-driven architecture with webhooks