start

2025-12-03 15:40:19 +01:00
parent d3dc37576d
commit febd08821a
6 changed files with 861 additions and 47 deletions
--- a/INTEGRATION_GUIDE.md
+++ b/INTEGRATION_GUIDE.md
@@ -0,0 +1,479 @@
+# Integration Guide: Troostwijk Monitor ↔ Scraper
+
+## Overview
+
+This document describes how **Troostwijk Monitor** (this Java project) integrates with the **ARCHITECTURE-TROOSTWIJK-SCRAPER** (Python scraper process).
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│              ARCHITECTURE-TROOSTWIJK-SCRAPER (Python)           │
+│                                                                  │
+│  • Discovers auctions from website                              │
+│  • Scrapes lot details via Playwright                           │
+│  • Parses __NEXT_DATA__ JSON                                    │
+│  • Stores image URLs (not downloads)                            │
+│                                                                  │
+│         ↓ Writes to                                             │
+└─────────┼───────────────────────────────────────────────────────┘
+          │
+          ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                     SHARED SQLite DATABASE                       │
+│                          (cache.db)                              │
+│                                                                  │
+│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐   │
+│  │   auctions     │  │     lots       │  │    images      │   │
+│  │  (Scraper)     │  │   (Scraper)    │  │  (Both)        │   │
+│  └────────────────┘  └────────────────┘  └────────────────┘   │
+│                                                                  │
+│         ↑ Reads from                    ↓ Writes to             │
+└─────────┼──────────────────────────────┼──────────────────────┘
+          │                              │
+          │                              ▼
+┌─────────┴──────────────────────────────────────────────────────┐
+│          TROOSTWIJK MONITOR (Java - This Project)              │
+│                                                                  │
+│  • Reads auction/lot data from database                         │
+│  • Downloads images from URLs                                   │
+│  • Runs YOLO object detection                                   │
+│  • Monitors bid changes                                         │
+│  • Sends notifications                                          │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## Database Schema Mapping
+
+### Scraper Schema → Monitor Schema
+
+The scraper and monitor use **slightly different schemas** that need to be reconciled:
+
+| Scraper Table | Monitor Table | Integration Notes |
+|---------------|---------------|-------------------|
+| `auctions` | `auctions` | ✅ **Compatible** - same structure |
+| `lots` | `lots` | ⚠️ **Needs mapping** - field name differences |
+| `images` | `images` | ⚠️ **Partial overlap** - different purposes |
+| `cache` | N/A | ❌ Monitor doesn't use cache |
+
+### Field Mapping: `auctions` Table
+
+| Scraper Field | Monitor Field | Notes |
+|---------------|---------------|-------|
+| `auction_id` (TEXT) | `auction_id` (INTEGER) | ⚠️ **TYPE MISMATCH** - Scraper uses "A7-39813", Monitor expects INT |
+| `url` | `url` | ✅ Compatible |
+| `title` | `title` | ✅ Compatible |
+| `location` | `location`, `city`, `country` | ⚠️ Monitor splits into 3 fields |
+| `lots_count` | `lot_count` | ⚠️ Name difference |
+| `first_lot_closing_time` | `closing_time` | ⚠️ Name difference |
+| `scraped_at` | `discovered_at` | ⚠️ Name + type difference (TEXT vs INTEGER timestamp) |
+
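The `scraped_at` → `discovered_at` row implies a TEXT-to-INTEGER conversion. A minimal Python sketch of that step, assuming the scraper stores ISO 8601 text and the monitor wants Unix seconds (neither is confirmed by the tables above; the function name `to_unix_seconds` is hypothetical):

```python
from datetime import datetime, timezone


def to_unix_seconds(iso_text):
    """Convert an ISO 8601 string to Unix seconds; returns None for empty input."""
    if not iso_text:
        return None
    parsed = datetime.fromisoformat(iso_text)
    if parsed.tzinfo is None:
        # Assumption: timestamps stored without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

If the scraper turns out to write local time without an offset, the UTC assumption in the middle branch is the line to change.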
+### Field Mapping: `lots` Table
+
+| Scraper Field | Monitor Field | Notes |
+|---------------|---------------|-------|
+| `lot_id` (TEXT) | `lot_id` (INTEGER) | ⚠️ **TYPE MISMATCH** - "A1-28505-5" vs INT |
+| `auction_id` | `sale_id` | ⚠️ Different name |
+| `url` | `url` | ✅ Compatible |
+| `title` | `title` | ✅ Compatible |
+| `current_bid` (TEXT) | `current_bid` (REAL) | ⚠️ **TYPE MISMATCH** - "€123.45" vs 123.45 |
+| `bid_count` | N/A | ℹ️ Monitor doesn't track |
+| `closing_time` | `closing_time` | ⚠️ Format difference (TEXT vs LocalDateTime) |
+| `viewing_time` | N/A | ℹ️ Monitor doesn't track |
+| `pickup_date` | N/A | ℹ️ Monitor doesn't track |
+| `location` | N/A | ℹ️ Monitor doesn't track lot location separately |
+| `description` | `description` | ✅ Compatible |
+| `category` | `category` | ✅ Compatible |
+| N/A | `manufacturer` | ℹ️ Monitor has additional field |
+| N/A | `type` | ℹ️ Monitor has additional field |
+| N/A | `year` | ℹ️ Monitor has additional field |
+| N/A | `currency` | ℹ️ Monitor has additional field |
+| N/A | `closing_notified` | ℹ️ Monitor tracking field |
+
+### Field Mapping: `images` Table
+
+| Scraper Field | Monitor Field | Notes |
+|---------------|---------------|-------|
+| `id` | `id` | ✅ Compatible |
+| `lot_id` | `lot_id` | ⚠️ Type difference (TEXT vs INTEGER) |
+| `url` | `url` | ✅ Compatible |
+| `local_path` | `file_path` | ⚠️ Different name |
+| `downloaded` (INTEGER) | N/A | ℹ️ Monitor uses `processed_at` instead |
+| N/A | `labels` (TEXT) | ℹ️ Monitor adds detected objects |
+| N/A | `processed_at` (INTEGER) | ℹ️ Monitor tracking field |
+
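To make the `downloaded` / `local_path` columns concrete, here is a small Python sketch of the hand-off on the scraper's `images` table: select rows that are still pending, then record the local path once a file is saved. The column types in the DDL and the helper names `pending_images` and `mark_downloaded` are assumptions for illustration; the monitor itself is Java and may do this differently.

```python
import sqlite3

# Scraper-side columns as listed in the mapping table; types are assumed.
SCRAPER_IMAGES_DDL = """
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0
)
"""


def pending_images(conn):
    """Return (id, lot_id, url) tuples for images not yet downloaded."""
    return conn.execute(
        "SELECT id, lot_id, url FROM images WHERE downloaded = 0 ORDER BY id"
    ).fetchall()


def mark_downloaded(conn, image_id, path):
    """Record the local file path and set the downloaded flag."""
    conn.execute(
        "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
        (path, image_id),
    )
    conn.commit()
```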
+## Integration Options
+
+### Option 1: Database Schema Adapter (Recommended)
+
+Create a compatibility layer that transforms scraper data to monitor format.
+
+**Implementation:**
+```java
+// Add to DatabaseService.java
+import java.sql.ResultSet;
+import java.sql.SQLException;
+import java.time.LocalDateTime;
+
+class ScraperDataAdapter {
+
+    /**
+     * Imports auction from scraper format to monitor format
+     */
+    static AuctionInfo fromScraperAuction(ResultSet rs) throws SQLException {
+        // Parse "A7-39813" → 39813
+        String auctionIdStr = rs.getString("auction_id");
+        int auctionId = extractNumericId(auctionIdStr);
+
+        // Split "Cluj-Napoca, RO" → city="Cluj-Napoca", country="RO"
+        // (guard against a NULL location in the scraper's table)
+        String location = rs.getString("location");
+        String[] parts = location == null ? new String[0] : location.split(",\\s*");
+        String city = parts.length > 0 ? parts[0].trim() : "";
+        String country = parts.length > 1 ? parts[1].trim() : "";
+
+        return new AuctionInfo(
+            auctionId,
+            rs.getString("title"),
+            location,
+            city,
+            country,
+            rs.getString("url"),
+            extractTypePrefix(auctionIdStr), // "A7-39813" → "A7"
+            rs.getInt("lots_count"),
+            parseTimestamp(rs.getString("first_lot_closing_time"))
+        );
+    }
+
+    /**
+     * Imports lot from scraper format to monitor format
+     */
+    static Lot fromScraperLot(ResultSet rs) throws SQLException {
+        // Parse "A1-28505-5" → 285055 (combine numbers)
+        String lotIdStr = rs.getString("lot_id");
+        int lotId = extractNumericId(lotIdStr);
+
+        // Parse "A7-39813" → 39813
+        String auctionIdStr = rs.getString("auction_id");
+        int saleId = extractNumericId(auctionIdStr);
+
+        // Parse "€123.45" → 123.45
+        String currentBidStr = rs.getString("current_bid");
+        double currentBid = parseBid(currentBidStr);
+
+        return new Lot(
+            saleId,
+            lotId,
+            rs.getString("title"),
+            rs.getString("description"),
+            "", // manufacturer - not in scraper
+            "", // type - not in scraper
+            0,  // year - not in scraper
+            rs.getString("category"),
+            currentBid,
+            "EUR", // currency - inferred from €
+            rs.getString("url"),
+            parseTimestamp(rs.getString("closing_time")),
+            false // not yet notified
+        );
+    }
+
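+    private static int extractNumericId(String id) {
+        // Drop the type prefix first so its digit is not folded into the ID:
+        // "A7-39813" → 39813 (not 739813)
+        // "A1-28505-5" → 285055
+        // Caveat: concatenating segments can collide ("A1-2850-55" → 285055 too)
+        String numericPart = id.substring(id.indexOf('-') + 1);
+        return Integer.parseInt(numericPart.replaceAll("[^0-9]", ""));
+    }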
+
+    private static String extractTypePrefix(String id) {
+        // "A7-39813" → "A7"
+        int dashIndex = id.indexOf('-');
+        return dashIndex > 0 ? id.substring(0, dashIndex) : "";
+    }
+
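+    private static double parseBid(String bid) {
+        // "€123.45" → 123.45
+        // "No bids" → 0.0
+        if (bid == null || bid.contains("No")) return 0.0;
+        String digits = bid.replaceAll("[^0-9.]", "");
+        // Text with no digits at all (or an empty string) also counts as no bid
+        return digits.isEmpty() ? 0.0 : Double.parseDouble(digits);
+    }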
+
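+    private static LocalDateTime parseTimestamp(String timestamp) {
+        if (timestamp == null) return null;
+        // Assumes the scraper stores ISO-8601 local date-times without an offset,
+        // e.g. "2025-12-03T15:40:19"; other formats throw DateTimeParseException
+        return LocalDateTime.parse(timestamp);
+    }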
+}
+```
+
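The ID conversion is the easiest part of the adapter to get subtly wrong, so a quick check is useful. The Python sketch below mirrors the conversion the adapter's comments describe ("A7-39813" → 39813) and contrasts it with stripping every non-digit from the whole string, which also keeps the digit in the type prefix. Both function names are hypothetical, and Python is used only because it is quick to run; the adapter itself is Java.

```python
import re


def extract_numeric_id(text_id):
    """Digits after the type prefix: 'A7-39813' -> 39813."""
    _, _, rest = text_id.partition("-")
    # Fall back to the whole string when there is no dash at all.
    return int(re.sub(r"[^0-9]", "", rest or text_id))


def strip_all_non_digits(text_id):
    """Naive variant: also keeps the digit from the 'A7' prefix."""
    return int(re.sub(r"[^0-9]", "", text_id))
```

Note that concatenating the remaining segments is not collision-free: `A1-28505-5` and `A1-2850-55` map to the same integer. Whether such IDs occur in practice is not known from this guide, which is one argument for keeping the TEXT ID as the key.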
+### Option 2: Unified Schema (Better Long-term)
+
+Modify **both** scraper and monitor to use a unified schema.
+
+**Create**: `SHARED_SCHEMA.sql`
+```sql
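+-- Unified schema that both projects use
+-- Note: SQLite enforces the FOREIGN KEY clauses below only on connections
+-- that run PRAGMA foreign_keys = ON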
+
+CREATE TABLE IF NOT EXISTS auctions (
+    auction_id TEXT PRIMARY KEY,        -- Use TEXT to support "A7-39813"
+    auction_id_numeric INTEGER,         -- For monitor's integer needs
+    title TEXT NOT NULL,
+    location TEXT,                      -- Full: "Cluj-Napoca, RO"
+    city TEXT,                          -- Parsed: "Cluj-Napoca"
+    country TEXT,                       -- Parsed: "RO"
+    url TEXT NOT NULL,
+    type TEXT,                          -- "A7", "A1"
+    lot_count INTEGER DEFAULT 0,
+    closing_time TEXT,                  -- ISO 8601 format
+    scraped_at INTEGER,                 -- Unix timestamp
+    discovered_at INTEGER               -- Unix timestamp (same as scraped_at)
+);
+
+CREATE TABLE IF NOT EXISTS lots (
+    lot_id TEXT PRIMARY KEY,            -- Use TEXT: "A1-28505-5"
+    lot_id_numeric INTEGER,             -- For monitor's integer needs
+    auction_id TEXT,                    -- FK: "A7-39813"
+    sale_id INTEGER,                    -- For monitor (same as auction_id_numeric)
+    title TEXT,
+    description TEXT,
+    manufacturer TEXT,
+    type TEXT,
+    year INTEGER,
+    category TEXT,
+    current_bid_text TEXT,              -- "€123.45" or "No bids"
+    current_bid REAL,                   -- 123.45
+    bid_count INTEGER,
+    currency TEXT DEFAULT 'EUR',
+    url TEXT UNIQUE,
+    closing_time TEXT,
+    viewing_time TEXT,
+    pickup_date TEXT,
+    location TEXT,
+    closing_notified INTEGER DEFAULT 0,
+    scraped_at INTEGER,                 -- Unix timestamp (same type as auctions.scraped_at)
+    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
+);
+
+CREATE TABLE IF NOT EXISTS images (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    lot_id TEXT,                        -- FK: "A1-28505-5"
+    url TEXT,                           -- Image URL from website
+    file_path TEXT,                     -- Local path after download
+    local_path TEXT,                    -- Duplicate of file_path for scraper compatibility (keep in sync)
+    labels TEXT,                        -- Detected objects (comma-separated)
+    downloaded INTEGER DEFAULT 0,       -- 0=pending, 1=downloaded
+    processed_at INTEGER,               -- Unix timestamp when processed
+    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
+);
+
+-- Indexes
+CREATE INDEX IF NOT EXISTS idx_auctions_country ON auctions(country);
+CREATE INDEX IF NOT EXISTS idx_lots_auction_id ON lots(auction_id);
+CREATE INDEX IF NOT EXISTS idx_images_lot_id ON images(lot_id);
+CREATE INDEX IF NOT EXISTS idx_images_downloaded ON images(downloaded);
+```
+
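The foreign keys in a schema like this only take effect if each connection enables them, because SQLite leaves enforcement off by default. A minimal Python sketch with a trimmed-down version of the two parent/child tables (the helper names `connect_with_foreign_keys` and `insert_lot` are hypothetical, and most columns are omitted for brevity):

```python
import sqlite3

TRIMMED_SCHEMA = """
CREATE TABLE IF NOT EXISTS auctions (
    auction_id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    url TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);
"""


def connect_with_foreign_keys(path):
    """Open a connection and turn on foreign key enforcement for it."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # per-connection setting
    conn.executescript(TRIMMED_SCHEMA)
    return conn


def insert_lot(conn, lot_id, auction_id):
    """Insert a lot; return False if the parent auction does not exist."""
    try:
        conn.execute(
            "INSERT INTO lots (lot_id, auction_id) VALUES (?, ?)",
            (lot_id, auction_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

The Java side would need the equivalent pragma on its own connections; how that is configured depends on the JDBC driver in use.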
+### Option 3: API Integration (Most Flexible)
+
+Have the scraper expose a REST API for the monitor to query.
+
+```python
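+# In scraper: Add Flask API endpoint
+import re
+import sqlite3
+
+from flask import Flask, jsonify
+
+from config import CACHE_DB  # shared database path (see Configuration)
+
+app = Flask(__name__)
+
+def extract_numeric_id(text_id):
+    """Hypothetical helper mirroring the Java adapter: 'A7-39813' -> 39813"""
+    return int(re.sub(r'[^0-9]', '', text_id.split('-', 1)[-1]))
+
+@app.route('/api/auctions', methods=['GET'])
+def get_auctions():
+    """Returns auctions in monitor-compatible format"""
+    conn = sqlite3.connect(CACHE_DB)
+    try:
+        # Name the columns instead of SELECT * so positions cannot drift
+        rows = conn.execute(
+            "SELECT auction_id, url, title, location, lots_count, "
+            "first_lot_closing_time FROM auctions WHERE location LIKE '%NL%'"
+        ).fetchall()
+    finally:
+        conn.close()
+
+    auctions = []
+    for auction_id, url, title, location, lots_count, closing_time in rows:
+        # location may be NULL; partition() is safe on an empty string
+        city, _, country = (location or '').partition(',')
+        auctions.append({
+            'auctionId': extract_numeric_id(auction_id),
+            'title': title,
+            'location': location,
+            'city': city.strip(),
+            'country': country.strip(),
+            'url': url,
+            'type': auction_id.split('-')[0],
+            'lotCount': lots_count,
+            'closingTime': closing_time
+        })
+
+    return jsonify(auctions)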
+```
+
+## Recommended Integration Steps
+
+### Phase 1: Immediate (Adapter Pattern)
+1. ✅ Keep separate schemas
+2. ✅ Create `ScraperDataAdapter` in Monitor
+3. ✅ Add import methods to `DatabaseService`
+4. ✅ Monitor reads from scraper's tables using adapter
+
+### Phase 2: Short-term (Unified Schema)
+1. 📋 Design unified schema (see Option 2)
+2. 📋 Update scraper to use unified schema
+3. 📋 Update monitor to use unified schema
+4. 📋 Migrate existing data
+
+### Phase 3: Long-term (API + Event-driven)
+1. 📋 Add REST API to scraper
+2. 📋 Add webhook/event notification when new data arrives
+3. 📋 Monitor subscribes to events
+4. 📋 Process images asynchronously
+
+## Current Integration Flow
+
+### Scraper Process (Python)
+```bash
+# 1. Run scraper to populate database
+cd /path/to/scraper
+python scraper.py
+
+# Output:
+# ✅ Scraped 42 auctions
+# ✅ Scraped 1,234 lots
+# ✅ Saved 3,456 image URLs
+# ✅ Data written to: /mnt/okcomputer/output/cache.db
+```
+
+### Monitor Process (Java)
+```bash
+# 2. Run monitor to process the data
+cd /path/to/monitor
+export DATABASE_FILE=/mnt/okcomputer/output/cache.db
+java -jar troostwijk-monitor.jar
+
+# Output:
+# 📊 Current Database State:
+#   Total lots in database: 1,234
+#   Total images processed: 0
+#
+# [1/2] Processing images...
+#   Downloading and analyzing 3,456 images...
+#
+# [2/2] Starting bid monitoring...
+# ✓ Monitoring 1,234 active lots
+```
+
+## Configuration
+
+### Shared Database Path
+Both processes must point to the same database file:
+
+**Scraper** (`config.py`):
+```python
+CACHE_DB = '/mnt/okcomputer/output/cache.db'
+```
+
+**Monitor** (`Main.java`):
+```java
+String databaseFile = System.getenv().getOrDefault(
+    "DATABASE_FILE",
+    "/mnt/okcomputer/output/cache.db"
+);
+```
+
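The Docker Compose example later in this guide passes `CACHE_DB` to the scraper as an environment variable, which only has an effect if `config.py` reads it. A minimal sketch of how that could look, assuming the scraper keeps the same default path (the function name `resolve_cache_db` is hypothetical):

```python
import os

DEFAULT_CACHE_DB = "/mnt/okcomputer/output/cache.db"


def resolve_cache_db(environ=os.environ):
    """Return the database path, preferring the CACHE_DB environment variable."""
    return environ.get("CACHE_DB", DEFAULT_CACHE_DB)


# Module-level constant, so existing `from config import CACHE_DB` keeps working.
CACHE_DB = resolve_cache_db()
```

This mirrors the monitor's `System.getenv().getOrDefault(...)` lookup, so both processes fall back to the same file when neither variable is set.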
+### Recommended Directory Structure
+```
+/mnt/okcomputer/
+├── scraper/                          # Python scraper code
+│   ├── scraper.py
+│   └── requirements.txt
+├── monitor/                          # Java monitor code
+│   ├── troostwijk-monitor.jar
+│   └── models/                       # YOLO models
+│       ├── yolov4.cfg
+│       ├── yolov4.weights
+│       └── coco.names
+└── output/                           # Shared data directory
+    ├── cache.db                      # Shared SQLite database
+    └── images/                       # Downloaded images
+        ├── A1-28505-5/
+        │   ├── 001.jpg
+        │   └── 002.jpg
+        └── ...
+```
+
+## Monitoring & Coordination
+
+### Option A: Sequential Execution
+```bash
+#!/bin/bash
+# run-pipeline.sh
+
+echo "Step 1: Scraping..."
+python scraper/scraper.py
+
+echo "Step 2: Processing images..."
+java -jar monitor/troostwijk-monitor.jar --process-images-only
+
+echo "Step 3: Starting monitor..."
+java -jar monitor/troostwijk-monitor.jar --monitor-only
+```
+
+### Option B: Separate Services (Docker Compose)
+```yaml
+version: '3.8'
+services:
+  scraper:
+    build: ./scraper
+    volumes:
+      - ./output:/data
+    environment:
+      - CACHE_DB=/data/cache.db
+    command: python scraper.py
+
+  monitor:
+    build: ./monitor
+    volumes:
+      - ./output:/data
+    environment:
+      - DATABASE_FILE=/data/cache.db
+      - NOTIFICATION_CONFIG=desktop
+    depends_on:
+      - scraper                       # start order only; does not wait for the scrape to finish
+    command: java -jar troostwijk-monitor.jar
+```
+
+### Option C: Cron-based Scheduling
+```cron
+# Scrape every 6 hours
+0 */6 * * * cd /mnt/okcomputer/scraper && python scraper.py
+
+# Process images every hour (if new lots found)
+0 * * * * cd /mnt/okcomputer/monitor && java -jar troostwijk-monitor.jar --process-new
+
+# Monitor runs continuously
+@reboot cd /mnt/okcomputer/monitor && java -jar troostwijk-monitor.jar --monitor-only
+```
+
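With Options B and C the scraper and monitor can touch the same SQLite file at the same time, so lock contention becomes possible. One common mitigation, sketched here in Python for the scraper side, is to open the database in WAL mode with a busy timeout. The function name `open_shared_db` and the 30-second default are assumptions, not settings taken from either project.

```python
import sqlite3


def open_shared_db(path, busy_timeout_s=30.0):
    """Open the shared database with settings that tolerate a concurrent writer."""
    # timeout: how long to wait on a locked database before raising an error.
    conn = sqlite3.connect(path, timeout=busy_timeout_s)
    # WAL lets readers proceed while another connection is writing.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

WAL mode is stored in the database file, so it persists across connections once set. It relies on shared memory between processes on the same host and is not suitable for network filesystems, which matters if `/mnt/okcomputer` is a remote mount.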
+## Troubleshooting
+
+### Issue: Type Mismatch Errors
+**Symptom**: Monitor crashes with "INTEGER expected, got TEXT"
+
+**Solution**: Use adapter pattern (Option 1) or unified schema (Option 2)
+
+### Issue: Monitor sees no data
+**Symptom**: "Total lots in database: 0"
+
+**Check**:
+1. Is `DATABASE_FILE` env var set correctly?
+2. Did scraper actually write data?
+3. Are both processes using the same database file?
+
+```bash
+# Verify database has data
+sqlite3 /mnt/okcomputer/output/cache.db "SELECT COUNT(*) FROM lots"
+```
+
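The same check can be scripted for all three tables at once. The Python sketch below reports a row count per table and returns `None` for a table that does not exist; the function name `table_counts` is hypothetical.

```python
import sqlite3


def table_counts(db_path, tables=("auctions", "lots", "images")):
    """Return {table: row count}; tables that do not exist map to None."""
    conn = sqlite3.connect(db_path)
    try:
        existing = {
            name
            for (name,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        return {
            t: (
                conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                if t in existing
                else None
            )
            for t in tables
        }
    finally:
        conn.close()
```

A result of `None` for every table usually points at the wrong path: SQLite creates an empty database file when asked to open one that does not exist, so a mistyped `DATABASE_FILE` fails silently rather than with an error.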
+### Issue: Images not downloading
+**Symptom**: "Total images processed: 0" but scraper found images
+
+**Check**:
+1. Did the scraper write image URLs to the `images` table?
+2. Does the monitor select `images` rows with `downloaded=0`?
+3. Is the `local_path` vs `file_path` field name mapping applied?
+
+## Next Steps
+
+1. **Immediate**: Implement `ScraperDataAdapter` for compatibility
+2. **This Week**: Test end-to-end integration with sample data
+3. **Next Sprint**: Migrate to unified schema
+4. **Future**: Add event-driven architecture with webhooks