Integration Guide: Troostwijk Monitor ↔ Scraper

Overview

This document describes how Troostwijk Monitor (this Java project) integrates with the ARCHITECTURE-TROOSTWIJK-SCRAPER (Python scraper process).

Architecture

┌─────────────────────────────────────────────────────────────────┐
│              ARCHITECTURE-TROOSTWIJK-SCRAPER (Python)           │
│                                                                  │
│  • Discovers auctions from website                              │
│  • Scrapes lot details via Playwright                           │
│  • Parses __NEXT_DATA__ JSON                                    │
│  • Stores image URLs (not downloads)                            │
│                                                                  │
│         ↓ Writes to                                             │
└─────────┼───────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                     SHARED SQLite DATABASE                       │
│                            (cache.db)                            │
│                                                                  │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐   │
│  │   auctions     │  │     lots       │  │    images      │   │
│  │  (Scraper)     │  │   (Scraper)    │  │  (Both)        │   │
│  └────────────────┘  └────────────────┘  └────────────────┘   │
│                                                                  │
│         ↑ Reads from                    ↓ Writes to             │
└─────────┼──────────────────────────────┼──────────────────────┘
          │                              │
          │                              ▼
┌─────────┴──────────────────────────────────────────────────────┐
│          TROOSTWIJK MONITOR (Java - This Project)              │
│                                                                  │
│  • Reads auction/lot data from database                         │
│  • Downloads images from URLs                                   │
│  • Runs YOLO object detection                                   │
│  • Monitors bid changes                                         │
│  • Sends notifications                                          │
└─────────────────────────────────────────────────────────────────┘

Database Schema Mapping

Scraper Schema → Monitor Schema

The scraper and monitor use slightly different schemas that need to be reconciled:

| Scraper Table | Monitor Table | Integration Notes |
|---------------|---------------|-------------------|
| auctions | auctions | Compatible - same structure |
| lots | lots | ⚠️ Needs mapping - field name differences |
| images | images | ⚠️ Partial overlap - different purposes |
| cache | N/A | Monitor doesn't use cache |
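
Before writing any mapping code, it is worth checking the live column lists against the tables below, since either project may have drifted. A minimal inspection sketch using Python's built-in sqlite3 module (the database path is the shared file from the Configuration section):

```python
import sqlite3

def table_columns(db_path, table):
    """Return the column names of `table` in the SQLite file at `db_path`."""
    with sqlite3.connect(db_path) as conn:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# Example (against the shared database from the Configuration section):
# for t in ("auctions", "lots", "images"):
#     print(t, table_columns("/mnt/okcomputer/output/cache.db", t))
```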

Field Mapping: auctions Table

| Scraper Field | Monitor Field | Notes |
|---------------|---------------|-------|
| auction_id (TEXT) | auction_id (INTEGER) | ⚠️ TYPE MISMATCH - Scraper uses "A7-39813", Monitor expects INT |
| url | url | Compatible |
| title | title | Compatible |
| location | location, city, country | ⚠️ Monitor splits into 3 fields |
| lots_count | lot_count | ⚠️ Name difference |
| first_lot_closing_time | closing_time | ⚠️ Name difference |
| scraped_at | discovered_at | ⚠️ Name + type difference (TEXT vs INTEGER timestamp) |
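
The location split in the table above is the only non-trivial transformation here. A sketch of the rule (mirroring the Java adapter below; assumes the scraper always uses a "City, CC" comma format):

```python
def split_location(location):
    """Split the scraper's combined location ("Cluj-Napoca, RO") into
    the monitor's (city, country) pair. Missing parts become ""."""
    parts = [p.strip() for p in (location or "").split(",")]
    city = parts[0] if parts else ""
    country = parts[1] if len(parts) > 1 else ""
    return city, country
```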

Field Mapping: lots Table

| Scraper Field | Monitor Field | Notes |
|---------------|---------------|-------|
| lot_id (TEXT) | lot_id (INTEGER) | ⚠️ TYPE MISMATCH - "A1-28505-5" vs INT |
| auction_id | sale_id | ⚠️ Different name |
| url | url | Compatible |
| title | title | Compatible |
| current_bid (TEXT) | current_bid (REAL) | ⚠️ TYPE MISMATCH - "€123.45" vs 123.45 |
| bid_count | N/A | Monitor doesn't track |
| closing_time | closing_time | ⚠️ Format difference (TEXT vs LocalDateTime) |
| viewing_time | N/A | Monitor doesn't track |
| pickup_date | N/A | Monitor doesn't track |
| location | N/A | Monitor doesn't track lot location separately |
| description | description | Compatible |
| category | category | Compatible |
| N/A | manufacturer | Monitor has additional field |
| N/A | type | Monitor has additional field |
| N/A | year | Monitor has additional field |
| N/A | currency | Monitor has additional field |
| N/A | closing_notified | Monitor tracking field |
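
The current_bid TEXT→REAL conversion is the riskiest mapping in this table. A sketch of the parsing rule (assumes a dot decimal separator; locale-formatted values like "€1.234,56" would need extra handling):

```python
import re

def parse_bid(text):
    """Convert the scraper's bid text to a float.
    "€123.45" -> 123.45; "No bids" or None -> 0.0."""
    if not text or "No" in text:
        return 0.0
    # Strip everything except digits and the decimal point
    return float(re.sub(r"[^0-9.]", "", text))
```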

Field Mapping: images Table

| Scraper Field | Monitor Field | Notes |
|---------------|---------------|-------|
| id | id | Compatible |
| lot_id | lot_id | ⚠️ Type difference (TEXT vs INTEGER) |
| url | url | Compatible |
| local_path | file_path | ⚠️ Different name |
| downloaded (INTEGER) | N/A | Monitor uses processed_at instead |
| N/A | labels (TEXT) | Monitor adds detected objects |
| N/A | processed_at (INTEGER) | Monitor tracking field |
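
Because the images differences are pure renames plus extra columns, a SQL view can bridge them without changing either codebase. A sketch via Python's sqlite3 (the view name `monitor_images` is illustrative, not part of either project):

```python
import sqlite3

def create_images_view(conn):
    """Expose the scraper's images table under the monitor's
    field names (local_path -> file_path) via a read-only view."""
    conn.execute("""
        CREATE VIEW IF NOT EXISTS monitor_images AS
        SELECT id,
               lot_id,
               url,
               local_path AS file_path,
               downloaded
        FROM images
    """)
```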

Integration Options

Option 1: Adapter Pattern (Quick Integration)

Create a compatibility layer that transforms scraper data to monitor format.

Implementation:

// Add to DatabaseService.java
// (requires java.sql.ResultSet, java.sql.SQLException, java.time.LocalDateTime)
class ScraperDataAdapter {

    /**
     * Imports auction from scraper format to monitor format
     */
    static AuctionInfo fromScraperAuction(ResultSet rs) throws SQLException {
        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int auctionId = extractNumericId(auctionIdStr);

        // Split "Cluj-Napoca, RO" → city="Cluj-Napoca", country="RO"
        String location = rs.getString("location");
        String[] parts = location.split(",\\s*");
        String city = parts.length > 0 ? parts[0] : "";
        String country = parts.length > 1 ? parts[1] : "";

        return new AuctionInfo(
            auctionId,
            rs.getString("title"),
            location,
            city,
            country,
            rs.getString("url"),
            extractTypePrefix(auctionIdStr), // "A7-39813" → "A7"
            rs.getInt("lots_count"),
            parseTimestamp(rs.getString("first_lot_closing_time"))
        );
    }

    /**
     * Imports lot from scraper format to monitor format
     */
    static Lot fromScraperLot(ResultSet rs) throws SQLException {
        // Parse "A1-28505-5" → 285055 (combine numbers)
        String lotIdStr = rs.getString("lot_id");
        int lotId = extractNumericId(lotIdStr);

        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int saleId = extractNumericId(auctionIdStr);

        // Parse "€123.45" → 123.45
        String currentBidStr = rs.getString("current_bid");
        double currentBid = parseBid(currentBidStr);

        return new Lot(
            saleId,
            lotId,
            rs.getString("title"),
            rs.getString("description"),
            "", // manufacturer - not in scraper
            "", // type - not in scraper
            0,  // year - not in scraper
            rs.getString("category"),
            currentBid,
            "EUR", // currency - inferred from €
            rs.getString("url"),
            parseTimestamp(rs.getString("closing_time")),
            false // not yet notified
        );
    }

    private static int extractNumericId(String id) {
        // Strip the type prefix first so its digit isn't folded in:
        // "A7-39813" → "39813" → 39813
        // "A1-28505-5" → "28505-5" → 285055
        int dashIndex = id.indexOf('-');
        String numericPart = dashIndex > 0 ? id.substring(dashIndex + 1) : id;
        return Integer.parseInt(numericPart.replaceAll("[^0-9]", ""));
    }

    private static String extractTypePrefix(String id) {
        // "A7-39813" → "A7"
        int dashIndex = id.indexOf('-');
        return dashIndex > 0 ? id.substring(0, dashIndex) : "";
    }

    private static double parseBid(String bid) {
        // "€123.45" → 123.45
        // "No bids" or null → 0.0
        // Assumes a dot decimal separator; locale formats like
        // "€1.234,56" would need NumberFormat-based parsing
        if (bid == null || bid.contains("No")) return 0.0;
        return Double.parseDouble(bid.replaceAll("[^0-9.]", ""));
    }

    private static LocalDateTime parseTimestamp(String timestamp) {
        if (timestamp == null) return null;
        // Assumes ISO-8601 (e.g. "2025-12-04T04:30:44"); if the scraper
        // emits a different format, pass a DateTimeFormatter here
        return LocalDateTime.parse(timestamp);
    }
}

Option 2: Unified Schema (Better Long-term)

Modify both scraper and monitor to use a unified schema.

Create: SHARED_SCHEMA.sql

-- Unified schema that both projects use

CREATE TABLE IF NOT EXISTS auctions (
    auction_id TEXT PRIMARY KEY,        -- Use TEXT to support "A7-39813"
    auction_id_numeric INTEGER,         -- For monitor's integer needs
    title TEXT NOT NULL,
    location TEXT,                      -- Full: "Cluj-Napoca, RO"
    city TEXT,                          -- Parsed: "Cluj-Napoca"
    country TEXT,                       -- Parsed: "RO"
    url TEXT NOT NULL,
    type TEXT,                          -- "A7", "A1"
    lot_count INTEGER DEFAULT 0,
    closing_time TEXT,                  -- ISO 8601 format
    scraped_at INTEGER,                 -- Unix timestamp
    discovered_at INTEGER               -- Unix timestamp (same as scraped_at)
);

CREATE TABLE IF NOT EXISTS lots (
    lot_id TEXT PRIMARY KEY,            -- Use TEXT: "A1-28505-5"
    lot_id_numeric INTEGER,             -- For monitor's integer needs
    auction_id TEXT,                    -- FK: "A7-39813"
    sale_id INTEGER,                    -- For monitor (same as auction_id_numeric)
    title TEXT,
    description TEXT,
    manufacturer TEXT,
    type TEXT,
    year INTEGER,
    category TEXT,
    current_bid_text TEXT,              -- "€123.45" or "No bids"
    current_bid REAL,                   -- 123.45
    bid_count INTEGER,
    currency TEXT DEFAULT 'EUR',
    url TEXT UNIQUE,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    closing_notified INTEGER DEFAULT 0,
    scraped_at TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);

CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,                        -- FK: "A1-28505-5"
    url TEXT,                           -- Image URL from website
    file_path TEXT,                     -- Local path after download (monitor name)
    local_path TEXT,                    -- Same value under scraper's name, kept for compatibility
    labels TEXT,                        -- Detected objects (comma-separated)
    downloaded INTEGER DEFAULT 0,       -- 0=pending, 1=downloaded
    processed_at INTEGER,               -- Unix timestamp when processed
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_auctions_country ON auctions(country);
CREATE INDEX IF NOT EXISTS idx_lots_auction_id ON lots(auction_id);
CREATE INDEX IF NOT EXISTS idx_images_lot_id ON images(lot_id);
CREATE INDEX IF NOT EXISTS idx_images_downloaded ON images(downloaded);
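
Migrating existing data into this schema means backfilling the `*_numeric` helper columns from the TEXT IDs. A sketch of that step (the `numeric_id` rule mirrors the Java adapter: strip the type prefix first, then concatenate the remaining digits):

```python
import re
import sqlite3

def numeric_id(text_id):
    """ "A7-39813" -> 39813, "A1-28505-5" -> 285055.
    Drop the type prefix, then keep only digits."""
    _, _, rest = text_id.partition("-")
    return int(re.sub(r"\D", "", rest or text_id))

def backfill_numeric_ids(conn):
    # Populate auction_id_numeric for every existing auction row
    rows = conn.execute("SELECT auction_id FROM auctions").fetchall()
    for (aid,) in rows:
        conn.execute(
            "UPDATE auctions SET auction_id_numeric = ? WHERE auction_id = ?",
            (numeric_id(aid), aid),
        )
    conn.commit()
```

The same loop applies to `lots.lot_id_numeric` and `lots.sale_id`.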

Option 3: API Integration (Most Flexible)

Have the scraper expose a REST API for the monitor to query.

# In scraper: Add Flask API endpoint
# (requires: from flask import Flask, jsonify; import sqlite3)
@app.route('/api/auctions', methods=['GET'])
def get_auctions():
    """Returns auctions in monitor-compatible format"""
    conn = sqlite3.connect(CACHE_DB)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM auctions WHERE location LIKE '%NL%'")

    auctions = []
    for row in cursor.fetchall():
        location = row[3] or ''  # guard against NULL locations
        auctions.append({
            'auctionId': extract_numeric_id(row[0]),
            'title': row[2],
            'location': location,
            'city': location.split(',')[0] if location else '',
            'country': location.split(',')[1].strip() if ',' in location else '',
            'url': row[1],
            'type': row[0].split('-')[0],
            'lotCount': row[4],
            'closingTime': row[5]
        })

    conn.close()
    return jsonify(auctions)

Recommended Migration Path

Phase 1: Immediate (Adapter Pattern)

  1. Keep separate schemas
  2. Create ScraperDataAdapter in Monitor
  3. Add import methods to DatabaseService
  4. Monitor reads from scraper's tables using adapter

Phase 2: Short-term (Unified Schema)

  1. 📋 Design unified schema (see Option 2)
  2. 📋 Update scraper to use unified schema
  3. 📋 Update monitor to use unified schema
  4. 📋 Migrate existing data

Phase 3: Long-term (API + Event-driven)

  1. 📋 Add REST API to scraper
  2. 📋 Add webhook/event notification when new data arrives
  3. 📋 Monitor subscribes to events
  4. 📋 Process images asynchronously

Current Integration Flow

Scraper Process (Python)

# 1. Run scraper to populate database
cd /path/to/scraper
python scraper.py

# Output:
# ✅ Scraped 42 auctions
# ✅ Scraped 1,234 lots
# ✅ Saved 3,456 image URLs
# ✅ Data written to: /mnt/okcomputer/output/cache.db

Monitor Process (Java)

# 2. Run monitor to process the data
cd /path/to/monitor
export DATABASE_FILE=/mnt/okcomputer/output/cache.db
java -jar troostwijk-monitor.jar

# Output:
# 📊 Current Database State:
#   Total lots in database: 1,234
#   Total images processed: 0
#
# [1/2] Processing images...
#   Downloading and analyzing 3,456 images...
#
# [2/2] Starting bid monitoring...
# ✓ Monitoring 1,234 active lots

Configuration

Shared Database Path

Both processes must point to the same database file:

Scraper (config.py):

CACHE_DB = '/mnt/okcomputer/output/cache.db'

Monitor (Main.java):

String databaseFile = System.getenv().getOrDefault(
    "DATABASE_FILE",
    "/mnt/okcomputer/output/cache.db"
);

Directory Layout

/mnt/okcomputer/
├── scraper/                          # Python scraper code
│   ├── scraper.py
│   └── requirements.txt
├── monitor/                          # Java monitor code
│   ├── troostwijk-monitor.jar
│   └── models/                       # YOLO models
│       ├── yolov4.cfg
│       ├── yolov4.weights
│       └── coco.names
└── output/                           # Shared data directory
    ├── cache.db                      # Shared SQLite database
    └── images/                       # Downloaded images
        ├── A1-28505-5/
        │   ├── 001.jpg
        │   └── 002.jpg
        └── ...

Monitoring & Coordination

Option A: Sequential Execution

#!/bin/bash
# run-pipeline.sh

echo "Step 1: Scraping..."
python scraper/scraper.py

echo "Step 2: Processing images..."
java -jar monitor/troostwijk-monitor.jar --process-images-only

echo "Step 3: Starting monitor..."
java -jar monitor/troostwijk-monitor.jar --monitor-only

Option B: Separate Services (Docker Compose)

version: '3.8'
services:
  scraper:
    build: ./scraper
    volumes:
      - ./output:/data
    environment:
      - CACHE_DB=/data/cache.db
    command: python scraper.py

  monitor:
    build: ./monitor
    volumes:
      - ./output:/data
    environment:
      - DATABASE_FILE=/data/cache.db
      - NOTIFICATION_CONFIG=desktop
    depends_on:
      - scraper
    command: java -jar troostwijk-monitor.jar
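
When the scraper and monitor run concurrently (Option B), both processes hit the same SQLite file and can trip over "database is locked" errors. Enabling SQLite's WAL journal mode lets readers proceed while the other process writes. A sketch of the scraper-side setup (the monitor would issue the same PRAGMA over JDBC):

```python
import sqlite3

def open_shared(db_path):
    """Open the shared database with settings friendlier to
    concurrent access from two processes."""
    conn = sqlite3.connect(db_path, timeout=30)  # wait up to 30s on a lock
    # WAL mode: readers are not blocked by the other process's writes
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```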

Option C: Cron-based Scheduling

# Scrape every 6 hours
0 */6 * * * cd /mnt/okcomputer/scraper && python scraper.py

# Process images every hour (if new lots found)
0 * * * * cd /mnt/okcomputer/monitor && java -jar monitor.jar --process-new

# Monitor runs continuously
@reboot cd /mnt/okcomputer/monitor && java -jar monitor.jar --monitor-only

Troubleshooting

Issue: Type Mismatch Errors

Symptom: Monitor crashes with "INTEGER expected, got TEXT"

Solution: Use adapter pattern (Option 1) or unified schema (Option 2)

Issue: Monitor sees no data

Symptom: "Total lots in database: 0"

Check:

  1. Is DATABASE_FILE env var set correctly?
  2. Did scraper actually write data?
  3. Are both processes using the same database file?

# Verify database has data
sqlite3 /mnt/okcomputer/output/cache.db "SELECT COUNT(*) FROM lots"

Issue: Images not downloading

Symptom: "Total images processed: 0" but scraper found images

Check:

  1. Scraper writes image URLs to images table
  2. Monitor reads from images table with downloaded=0
  3. Field name mapping: local_path vs file_path
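
Check 2 can be run by hand with a short query against the scraper schema (assumes downloaded = 0 marks images the monitor still has to fetch):

```python
import sqlite3

def pending_images(conn, limit=10):
    """List images awaiting download in the scraper's schema
    (downloaded = 0 means not yet fetched)."""
    return conn.execute(
        "SELECT lot_id, url FROM images WHERE downloaded = 0 LIMIT ?",
        (limit,),
    ).fetchall()
```

If this returns rows but the monitor still reports zero processed images, the field-name mismatch in check 3 is the likely cause.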

Next Steps

  1. Immediate: Implement ScraperDataAdapter for compatibility
  2. This Week: Test end-to-end integration with sample data
  3. Next Sprint: Migrate to unified schema
  4. Future: Add event-driven architecture with webhooks