# Integration Guide: Troostwijk Monitor ↔ Scraper

## Overview

This document describes how **Troostwijk Monitor** (this Java project) integrates with **ARCHITECTURE-TROOSTWIJK-SCRAPER** (the Python scraper process).

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│            ARCHITECTURE-TROOSTWIJK-SCRAPER (Python)             │
│                                                                 │
│  • Discovers auctions from website                              │
│  • Scrapes lot details via Playwright                           │
│  • Parses __NEXT_DATA__ JSON                                    │
│  • Stores image URLs (not downloads)                            │
│                                                                 │
│          ↓ Writes to                                            │
└─────────┼───────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                     SHARED SQLite DATABASE                      │
│                         (cache.db)                              │
│                                                                 │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐     │
│  │    auctions    │  │      lots      │  │     images     │     │
│  │   (Scraper)    │  │   (Scraper)    │  │     (Both)     │     │
│  └────────────────┘  └────────────────┘  └────────────────┘     │
│                                                                 │
│          ↑ Reads from               ↓ Writes to                 │
└─────────┼──────────────────────────┼────────────────────────────┘
          │                          │
          │                          ▼
┌─────────┴───────────────────────────────────────────────────────┐
│             TROOSTWIJK MONITOR (Java - This Project)            │
│                                                                 │
│  • Reads auction/lot data from database                         │
│  • Downloads images from URLs                                   │
│  • Runs YOLO object detection                                   │
│  • Monitors bid changes                                         │
│  • Sends notifications                                          │
└─────────────────────────────────────────────────────────────────┘
```

## Database Schema Mapping

### Scraper Schema → Monitor Schema

The scraper and monitor use **slightly different schemas** that need to be reconciled:

| Scraper Table | Monitor Table | Integration Notes |
|---------------|---------------|-------------------|
| `auctions` | `auctions` | ✅ **Compatible** - same structure |
| `lots` | `lots` | ⚠️ **Needs mapping** - field name differences |
| `images` | `images` | ⚠️ **Partial overlap** - different purposes |
| `cache` | N/A | ❌ Monitor doesn't use cache |

### Field Mapping: `auctions` Table
| Scraper Field | Monitor Field | Notes |
|---------------|---------------|-------|
| `auction_id` (TEXT) | `auction_id` (INTEGER) | ⚠️ **TYPE MISMATCH** - Scraper uses "A7-39813", Monitor expects INT |
| `url` | `url` | ✅ Compatible |
| `title` | `title` | ✅ Compatible |
| `location` | `location`, `city`, `country` | ⚠️ Monitor splits into 3 fields |
| `lots_count` | `lot_count` | ⚠️ Name difference |
| `first_lot_closing_time` | `closing_time` | ⚠️ Name difference |
| `scraped_at` | `discovered_at` | ⚠️ Name + type difference (TEXT vs INTEGER timestamp) |

### Field Mapping: `lots` Table

| Scraper Field | Monitor Field | Notes |
|---------------|---------------|-------|
| `lot_id` (TEXT) | `lot_id` (INTEGER) | ⚠️ **TYPE MISMATCH** - "A1-28505-5" vs INT |
| `auction_id` | `sale_id` | ⚠️ Different name |
| `url` | `url` | ✅ Compatible |
| `title` | `title` | ✅ Compatible |
| `current_bid` (TEXT) | `current_bid` (REAL) | ⚠️ **TYPE MISMATCH** - "€123.45" vs 123.45 |
| `bid_count` | N/A | ℹ️ Monitor doesn't track |
| `closing_time` | `closing_time` | ⚠️ Format difference (TEXT vs LocalDateTime) |
| `viewing_time` | N/A | ℹ️ Monitor doesn't track |
| `pickup_date` | N/A | ℹ️ Monitor doesn't track |
| `location` | N/A | ℹ️ Monitor doesn't track lot location separately |
| `description` | `description` | ✅ Compatible |
| `category` | `category` | ✅ Compatible |
| N/A | `manufacturer` | ℹ️ Monitor has additional field |
| N/A | `type` | ℹ️ Monitor has additional field |
| N/A | `year` | ℹ️ Monitor has additional field |
| N/A | `currency` | ℹ️ Monitor has additional field |
| N/A | `closing_notified` | ℹ️ Monitor tracking field |

### Field Mapping: `images` Table

| Scraper Field | Monitor Field | Notes |
|---------------|---------------|-------|
| `id` | `id` | ✅ Compatible |
| `lot_id` | `lot_id` | ⚠️ Type difference (TEXT vs INTEGER) |
| `url` | `url` | ✅ Compatible |
| `local_path` | `file_path` | ⚠️ Different name |
| `downloaded` (INTEGER) | N/A | ℹ️ Monitor uses `processed_at` instead |
| N/A | `labels` (TEXT) | ℹ️ Monitor adds detected objects |
| N/A | `processed_at` (INTEGER) | ℹ️ Monitor tracking field |
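Two of the `auctions` mappings above involve real conversions rather than renames: `location` must be split into `city`/`country`, and the scraper's TEXT timestamps must become the monitor's INTEGER (Unix) timestamps. A minimal sketch of both conversions, assuming the scraper writes ISO-8601 local datetimes in UTC (the class and method names are illustrative, not existing monitor code):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class AuctionFieldMapping {
    // "Cluj-Napoca, RO" → ["Cluj-Napoca", "RO"]; tolerate a missing country part
    static String[] splitLocation(String location) {
        if (location == null || location.isBlank()) return new String[] {"", ""};
        String[] parts = location.split(",\\s*", 2);
        return new String[] {parts[0].trim(), parts.length > 1 ? parts[1].trim() : ""};
    }

    // "1970-01-01T00:00:00" (scraper TEXT) → Unix seconds (monitor INTEGER)
    static long toUnixTimestamp(String isoText) {
        return LocalDateTime.parse(isoText).toEpochSecond(ZoneOffset.UTC);
    }
}
```

Whether the scraper's timestamps are really UTC (rather than local time) is worth confirming before relying on the epoch conversion.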
## Integration Options

### Option 1: Database Schema Adapter (Recommended)

Create a compatibility layer that transforms scraper data to monitor format.

**Implementation:**

```java
// Add to DatabaseService.java
class ScraperDataAdapter {

    /**
     * Imports an auction from scraper format to monitor format.
     */
    static AuctionInfo fromScraperAuction(ResultSet rs) throws SQLException {
        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int auctionId = extractNumericId(auctionIdStr);

        // Split "Cluj-Napoca, RO" → city="Cluj-Napoca", country="RO"
        String location = rs.getString("location");
        String[] parts = location.split(",\\s*");
        String city = parts.length > 0 ? parts[0] : "";
        String country = parts.length > 1 ? parts[1] : "";

        return new AuctionInfo(
            auctionId,
            rs.getString("title"),
            location,
            city,
            country,
            rs.getString("url"),
            extractTypePrefix(auctionIdStr),  // "A7-39813" → "A7"
            rs.getInt("lots_count"),
            parseTimestamp(rs.getString("first_lot_closing_time"))
        );
    }

    /**
     * Imports a lot from scraper format to monitor format.
     */
    static Lot fromScraperLot(ResultSet rs) throws SQLException {
        // Parse "A1-28505-5" → 285055 (combine the numeric parts)
        String lotIdStr = rs.getString("lot_id");
        int lotId = extractNumericId(lotIdStr);

        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int saleId = extractNumericId(auctionIdStr);

        // Parse "€123.45" → 123.45
        String currentBidStr = rs.getString("current_bid");
        double currentBid = parseBid(currentBidStr);

        return new Lot(
            saleId,
            lotId,
            rs.getString("title"),
            rs.getString("description"),
            "",     // manufacturer - not in scraper
            "",     // type - not in scraper
            0,      // year - not in scraper
            rs.getString("category"),
            currentBid,
            "EUR",  // currency - inferred from €
            rs.getString("url"),
            parseTimestamp(rs.getString("closing_time")),
            false   // not yet notified
        );
    }

    private static int extractNumericId(String id) {
        // "A7-39813" → 39813
        // "A1-28505-5" → 285055
        // Strip the type prefix before collecting digits; stripping non-digits
        // from the whole string would keep the prefix digit ("A7-39813" → 739813).
        int dashIndex = id.indexOf('-');
        String tail = dashIndex >= 0 ? id.substring(dashIndex + 1) : id;
        return Integer.parseInt(tail.replaceAll("[^0-9]", ""));
    }

    private static String extractTypePrefix(String id) {
        // "A7-39813" → "A7"
        int dashIndex = id.indexOf('-');
        return dashIndex > 0 ? id.substring(0, dashIndex) : "";
    }

    private static double parseBid(String bid) {
        // "€123.45" → 123.45
        // "No bids" → 0.0
        if (bid == null || bid.contains("No")) return 0.0;
        return Double.parseDouble(bid.replaceAll("[^0-9.]", ""));
    }

    private static LocalDateTime parseTimestamp(String timestamp) {
        if (timestamp == null) return null;
        // Parse the scraper's timestamp format (assumed ISO-8601)
        return LocalDateTime.parse(timestamp);
    }
}
```
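The adapter's `parseTimestamp` assumes strict ISO-8601; if the scraper's actual output differs (a space instead of `T` is a common variation), every import would throw. A more tolerant variant is cheap insurance — the accepted patterns below are assumptions to check against real scraper output, and the class name is illustrative:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class TolerantTimestamps {
    private static final DateTimeFormatter[] FORMATS = {
        DateTimeFormatter.ISO_LOCAL_DATE_TIME,               // 2025-01-15T14:30:00
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),  // 2025-01-15 14:30:00
    };

    static LocalDateTime parseTimestamp(String text) {
        if (text == null || text.isBlank()) return null;
        for (DateTimeFormatter format : FORMATS) {
            try {
                return LocalDateTime.parse(text.trim(), format);
            } catch (DateTimeParseException ignored) {
                // fall through and try the next format
            }
        }
        return null;  // unknown format: caller decides (skip row, log, etc.)
    }
}
```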
"A7-39813" → 39813 // "A1-28505-5" → 285055 return Integer.parseInt(id.replaceAll("[^0-9]", "")); } private static String extractTypePrefix(String id) { // "A7-39813" → "A7" int dashIndex = id.indexOf('-'); return dashIndex > 0 ? id.substring(0, dashIndex) : ""; } private static double parseBid(String bid) { // "€123.45" → 123.45 // "No bids" → 0.0 if (bid == null || bid.contains("No")) return 0.0; return Double.parseDouble(bid.replaceAll("[^0-9.]", "")); } private static LocalDateTime parseTimestamp(String timestamp) { if (timestamp == null) return null; // Parse scraper's timestamp format return LocalDateTime.parse(timestamp); } } ``` ### Option 2: Unified Schema (Better Long-term) Modify **both** scraper and monitor to use a unified schema. **Create**: `SHARED_SCHEMA.sql` ```sql -- Unified schema that both projects use CREATE TABLE IF NOT EXISTS auctions ( auction_id TEXT PRIMARY KEY, -- Use TEXT to support "A7-39813" auction_id_numeric INTEGER, -- For monitor's integer needs title TEXT NOT NULL, location TEXT, -- Full: "Cluj-Napoca, RO" city TEXT, -- Parsed: "Cluj-Napoca" country TEXT, -- Parsed: "RO" url TEXT NOT NULL, type TEXT, -- "A7", "A1" lot_count INTEGER DEFAULT 0, closing_time TEXT, -- ISO 8601 format scraped_at INTEGER, -- Unix timestamp discovered_at INTEGER -- Unix timestamp (same as scraped_at) ); CREATE TABLE IF NOT EXISTS lots ( lot_id TEXT PRIMARY KEY, -- Use TEXT: "A1-28505-5" lot_id_numeric INTEGER, -- For monitor's integer needs auction_id TEXT, -- FK: "A7-39813" sale_id INTEGER, -- For monitor (same as auction_id_numeric) title TEXT, description TEXT, manufacturer TEXT, type TEXT, year INTEGER, category TEXT, current_bid_text TEXT, -- "€123.45" or "No bids" current_bid REAL, -- 123.45 bid_count INTEGER, currency TEXT DEFAULT 'EUR', url TEXT UNIQUE, closing_time TEXT, viewing_time TEXT, pickup_date TEXT, location TEXT, closing_notified INTEGER DEFAULT 0, scraped_at TEXT, FOREIGN KEY (auction_id) REFERENCES auctions(auction_id) ); CREATE TABLE 
### Option 3: API Integration (Most Flexible)

Have the scraper expose a REST API for the monitor to query.

```python
# In scraper: add a Flask API endpoint
@app.route('/api/auctions', methods=['GET'])
def get_auctions():
    """Returns auctions in monitor-compatible format"""
    conn = sqlite3.connect(CACHE_DB)
    cursor = conn.cursor()
    # Positional columns assumed: auction_id, url, title, location,
    # lots_count, first_lot_closing_time
    cursor.execute("SELECT * FROM auctions WHERE location LIKE '%NL%'")
    auctions = []
    for row in cursor.fetchall():
        location = row[3] or ''
        auctions.append({
            'auctionId': extract_numeric_id(row[0]),  # digits after the "A7-" prefix
            'title': row[2],
            'location': location,
            'city': location.split(',')[0] if location else '',
            'country': location.split(',')[1].strip() if ',' in location else '',
            'url': row[1],
            'type': row[0].split('-')[0],
            'lotCount': row[4],
            'closingTime': row[5]
        })
    conn.close()
    return jsonify(auctions)
```

## Recommended Integration Steps

### Phase 1: Immediate (Adapter Pattern)

1. ✅ Keep separate schemas
2. ✅ Create `ScraperDataAdapter` in Monitor
3. ✅ Add import methods to `DatabaseService`
4. ✅ Monitor reads from scraper's tables using adapter

### Phase 2: Short-term (Unified Schema)

1. 📋 Design unified schema (see Option 2)
2. 📋 Update scraper to use unified schema
3. 📋 Update monitor to use unified schema
4. 📋 Migrate existing data

### Phase 3: Long-term (API + Event-driven)

1. 📋 Add REST API to scraper
2. 📋 Add webhook/event notification when new data arrives
3. 📋 Monitor subscribes to events
4. 📋 Process images asynchronously
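On the monitor side, Phase 3's REST API could be consumed with the JDK's built-in HTTP client, so no extra dependency is needed. A sketch, assuming the Flask app from Option 3 runs on port 5000 (the base URL and class name are assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AuctionApiClient {
    // Build the GET request for Option 3's /api/auctions endpoint
    static HttpRequest auctionsRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/auctions"))
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                auctionsRequest("http://localhost:5000"),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // JSON array of auctions
    }
}
```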
## Current Integration Flow

### Scraper Process (Python)

```bash
# 1. Run scraper to populate database
cd /path/to/scraper
python scraper.py

# Output:
# ✅ Scraped 42 auctions
# ✅ Scraped 1,234 lots
# ✅ Saved 3,456 image URLs
# ✅ Data written to: /mnt/okcomputer/output/cache.db
```

### Monitor Process (Java)

```bash
# 2. Run monitor to process the data
cd /path/to/monitor
export DATABASE_FILE=/mnt/okcomputer/output/cache.db
java -jar troostwijk-monitor.jar

# Output:
# 📊 Current Database State:
#    Total lots in database: 1,234
#    Total images processed: 0
#
# [1/2] Processing images...
#    Downloading and analyzing 3,456 images...
#
# [2/2] Starting bid monitoring...
#    ✓ Monitoring 1,234 active lots
```

## Configuration

### Shared Database Path

Both processes must point to the same database file:

**Scraper** (`config.py`):

```python
CACHE_DB = '/mnt/okcomputer/output/cache.db'
```

**Monitor** (`Main.java`):

```java
String databaseFile = System.getenv().getOrDefault(
    "DATABASE_FILE",
    "/mnt/okcomputer/output/cache.db"
);
```

### Recommended Directory Structure

```
/mnt/okcomputer/
├── scraper/                    # Python scraper code
│   ├── scraper.py
│   └── requirements.txt
├── monitor/                    # Java monitor code
│   ├── troostwijk-monitor.jar
│   └── models/                 # YOLO models
│       ├── yolov4.cfg
│       ├── yolov4.weights
│       └── coco.names
└── output/                     # Shared data directory
    ├── cache.db                # Shared SQLite database
    └── images/                 # Downloaded images
        ├── A1-28505-5/
        │   ├── 001.jpg
        │   └── 002.jpg
        └── ...
```
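The `images/<lot_id>/NNN.jpg` layout above is deterministic, so the monitor can derive a download path from a lot ID and sequence number without extra bookkeeping. A sketch (the zero-padded naming follows the tree shown; `imagePath` is an illustrative helper, not an existing monitor method):

```java
import java.nio.file.Path;

public class ImagePaths {
    // ("/mnt/okcomputer/output", "A1-28505-5", 1) → .../images/A1-28505-5/001.jpg
    static Path imagePath(Path outputDir, String lotId, int index) {
        return outputDir.resolve("images")
                        .resolve(lotId)
                        .resolve(String.format("%03d.jpg", index));
    }
}
```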
## Monitoring & Coordination

### Option A: Sequential Execution

```bash
#!/bin/bash
# run-pipeline.sh

echo "Step 1: Scraping..."
python scraper/scraper.py

echo "Step 2: Processing images..."
java -jar monitor/troostwijk-monitor.jar --process-images-only

echo "Step 3: Starting monitor..."
java -jar monitor/troostwijk-monitor.jar --monitor-only
```

### Option B: Separate Services (Docker Compose)

```yaml
version: '3.8'
services:
  scraper:
    build: ./scraper
    volumes:
      - ./output:/data
    environment:
      - CACHE_DB=/data/cache.db
    command: python scraper.py

  monitor:
    build: ./monitor
    volumes:
      - ./output:/data
    environment:
      - DATABASE_FILE=/data/cache.db
      - NOTIFICATION_CONFIG=desktop
    depends_on:
      - scraper
    command: java -jar troostwijk-monitor.jar
```

### Option C: Cron-based Scheduling

```cron
# Scrape every 6 hours
0 */6 * * * cd /mnt/okcomputer/scraper && python scraper.py

# Process images every hour (if new lots found)
0 * * * * cd /mnt/okcomputer/monitor && java -jar monitor.jar --process-new

# Monitor runs continuously
@reboot cd /mnt/okcomputer/monitor && java -jar monitor.jar --monitor-only
```

## Troubleshooting

### Issue: Type Mismatch Errors

**Symptom**: Monitor crashes with "INTEGER expected, got TEXT"

**Solution**: Use the adapter pattern (Option 1) or the unified schema (Option 2)

### Issue: Monitor sees no data

**Symptom**: "Total lots in database: 0"

**Check**:
1. Is the `DATABASE_FILE` env var set correctly?
2. Did the scraper actually write data?
3. Are both processes using the same database file?

```bash
# Verify database has data
sqlite3 /mnt/okcomputer/output/cache.db "SELECT COUNT(*) FROM lots"
```

### Issue: Images not downloading

**Symptom**: "Total images processed: 0" even though the scraper found images

**Check**:
1. The scraper writes image URLs to the `images` table
2. The monitor reads rows from the `images` table with `downloaded=0`
3. Field name mapping: `local_path` vs `file_path`

## Next Steps

1. **Immediate**: Implement `ScraperDataAdapter` for compatibility
2. **This Week**: Test end-to-end integration with sample data
3. **Next Sprint**: Migrate to unified schema
4. **Future**: Add event-driven architecture with webhooks