Integration Guide: Troostwijk Monitor ↔ Scraper
Overview
This document describes how Troostwijk Monitor (this Java project) integrates with the ARCHITECTURE-TROOSTWIJK-SCRAPER (Python scraper process).
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│            ARCHITECTURE-TROOSTWIJK-SCRAPER (Python)             │
│                                                                 │
│  • Discovers auctions from website                              │
│  • Scrapes lot details via Playwright                           │
│  • Parses __NEXT_DATA__ JSON                                    │
│  • Stores image URLs (not downloads)                            │
│                                                                 │
│         ↓ Writes to                                             │
└─────────┼───────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                     SHARED SQLite DATABASE                      │
│                          (cache.db)                             │
│                                                                 │
│   ┌────────────────┐  ┌────────────────┐  ┌────────────────┐    │
│   │    auctions    │  │      lots      │  │     images     │    │
│   │   (Scraper)    │  │   (Scraper)    │  │     (Both)     │    │
│   └────────────────┘  └────────────────┘  └────────────────┘    │
│                                                                 │
│         ↑ Reads from              ↓ Writes to                   │
└─────────┼──────────────────────────────┼────────────────────────┘
          │                              │
          │                              ▼
┌─────────┴───────────────────────────────────────────────────────┐
│             TROOSTWIJK MONITOR (Java - This Project)            │
│                                                                 │
│  • Reads auction/lot data from database                         │
│  • Downloads images from URLs                                   │
│  • Runs YOLO object detection                                   │
│  • Monitors bid changes                                         │
│  • Sends notifications                                          │
└─────────────────────────────────────────────────────────────────┘
```
Database Schema Mapping
Scraper Schema → Monitor Schema
The scraper and monitor use slightly different schemas that need to be reconciled:
| Scraper Table | Monitor Table | Integration Notes |
|---|---|---|
| `auctions` | `auctions` | ✅ Compatible - same structure |
| `lots` | `lots` | ⚠️ Needs mapping - field name differences |
| `images` | `images` | ⚠️ Partial overlap - different purposes |
| `cache` | N/A | ❌ Monitor doesn't use cache |
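Before reconciling the schemas, it helps to confirm what each table actually contains rather than trusting the documentation. A minimal sketch using SQLite's `PRAGMA table_info`; the helper names are illustrative, not part of either project:

```python
import sqlite3

def table_columns(conn: sqlite3.Connection, table: str) -> list:
    """Column names of `table`, in declaration order (via PRAGMA table_info)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

def schema_diff(conn: sqlite3.Connection, table_a: str, table_b: str):
    """Columns only in table_a, and columns only in table_b."""
    a, b = set(table_columns(conn, table_a)), set(table_columns(conn, table_b))
    return sorted(a - b), sorted(b - a)
```

Running `schema_diff` against the scraper's and monitor's `lots` definitions surfaces exactly the rows flagged ⚠️ in the tables below.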
Field Mapping: auctions Table
| Scraper Field | Monitor Field | Notes |
|---|---|---|
| `auction_id` (TEXT) | `auction_id` (INTEGER) | ⚠️ TYPE MISMATCH - Scraper uses "A7-39813", Monitor expects INT |
| `url` | `url` | ✅ Compatible |
| `title` | `title` | ✅ Compatible |
| `location` | `location`, `city`, `country` | ⚠️ Monitor splits into 3 fields |
| `lots_count` | `lot_count` | ⚠️ Name difference |
| `first_lot_closing_time` | `closing_time` | ⚠️ Name difference |
| `scraped_at` | `discovered_at` | ⚠️ Name + type difference (TEXT vs INTEGER timestamp) |
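The auction mapping above can be sketched in a few lines. This is an illustrative Python version of the conversion (the `to_monitor_auction` helper and the dict shapes are assumptions for the example, not actual project code); note that the numeric ID must drop the `"A7"` type prefix first, otherwise its digit would leak into the result:

```python
import re

def numeric_id(scraper_id: str) -> int:
    """ "A7-39813" -> 39813: drop the type prefix, then keep the digits."""
    _prefix, _, rest = scraper_id.partition("-")
    return int(re.sub(r"\D", "", rest))

def to_monitor_auction(row: dict) -> dict:
    """Rename/convert scraper auction fields to the monitor's shape."""
    location = row.get("location") or ""
    parts = [p.strip() for p in location.split(",")]
    return {
        "auction_id": numeric_id(row["auction_id"]),   # TEXT -> INTEGER
        "type": row["auction_id"].split("-", 1)[0],    # "A7-39813" -> "A7"
        "title": row["title"],
        "location": location,                          # "Cluj-Napoca, RO"
        "city": parts[0] if parts else "",             # "Cluj-Napoca"
        "country": parts[1] if len(parts) > 1 else "", # "RO"
        "url": row["url"],
        "lot_count": row.get("lots_count", 0),         # lots_count -> lot_count
        "closing_time": row.get("first_lot_closing_time"),
    }
```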
Field Mapping: lots Table
| Scraper Field | Monitor Field | Notes |
|---|---|---|
| `lot_id` (TEXT) | `lot_id` (INTEGER) | ⚠️ TYPE MISMATCH - "A1-28505-5" vs INT |
| `auction_id` | `sale_id` | ⚠️ Different name |
| `url` | `url` | ✅ Compatible |
| `title` | `title` | ✅ Compatible |
| `current_bid` (TEXT) | `current_bid` (REAL) | ⚠️ TYPE MISMATCH - "€123.45" vs 123.45 |
| `bid_count` | N/A | ℹ️ Monitor doesn't track |
| `closing_time` | `closing_time` | ⚠️ Format difference (TEXT vs LocalDateTime) |
| `viewing_time` | N/A | ℹ️ Monitor doesn't track |
| `pickup_date` | N/A | ℹ️ Monitor doesn't track |
| `location` | N/A | ℹ️ Monitor doesn't track lot location separately |
| `description` | `description` | ✅ Compatible |
| `category` | `category` | ✅ Compatible |
| N/A | `manufacturer` | ℹ️ Monitor has additional field |
| N/A | `type` | ℹ️ Monitor has additional field |
| N/A | `year` | ℹ️ Monitor has additional field |
| N/A | `currency` | ℹ️ Monitor has additional field |
| N/A | `closing_notified` | ℹ️ Monitor tracking field |
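The trickiest row above is `current_bid`: the scraper stores display text, the monitor wants a REAL. A hedged sketch of that conversion (it assumes dot-decimal formatting like "€123.45"; locale variants such as "€1.234,56" would need extra handling):

```python
import re

def parse_bid(text) -> float:
    """ "€123.45" -> 123.45; "No bids" or None -> 0.0."""
    if not text or "No" in text:
        return 0.0
    return float(re.sub(r"[^0-9.]", "", text))
```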
Field Mapping: images Table
| Scraper Field | Monitor Field | Notes |
|---|---|---|
| `id` | `id` | ✅ Compatible |
| `lot_id` | `lot_id` | ⚠️ Type difference (TEXT vs INTEGER) |
| `url` | `url` | ✅ Compatible |
| `local_path` | `local_path` | ⚠️ Verify the name matches exactly in both schemas |
| `downloaded` (INTEGER) | N/A | ℹ️ Monitor uses `processed_at` instead |
| N/A | `labels` (TEXT) | ℹ️ Monitor adds detected objects |
| N/A | `processed_at` (INTEGER) | ℹ️ Monitor tracking field |
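Because the scraper owns `downloaded` and the monitor owns `labels`/`processed_at`, both sides can update the same `images` row without clashing. A sketch of the monitor-side update (column names follow the table above; the `mark_processed` helper is illustrative):

```python
import sqlite3
import time

def mark_processed(conn: sqlite3.Connection, image_id: int, labels: list) -> None:
    """Store detection results as comma-separated labels plus a unix timestamp."""
    conn.execute(
        "UPDATE images SET labels = ?, processed_at = ? WHERE id = ?",
        (",".join(labels), int(time.time()), image_id),
    )
    conn.commit()
```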
Integration Options
Option 1: Database Schema Adapter (Recommended)
Create a compatibility layer that transforms scraper data to monitor format.
Implementation:
```java
// Add to DatabaseService.java
// Imports needed at the top of the file:
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.LocalDateTime;

class ScraperDataAdapter {

    /** Imports an auction from scraper format to monitor format. */
    static AuctionInfo fromScraperAuction(ResultSet rs) throws SQLException {
        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int auctionId = extractNumericId(auctionIdStr);

        // Split "Cluj-Napoca, RO" → city="Cluj-Napoca", country="RO"
        String location = rs.getString("location");
        String[] parts = location.split(",\\s*");
        String city = parts.length > 0 ? parts[0] : "";
        String country = parts.length > 1 ? parts[1] : "";

        return new AuctionInfo(
                auctionId,
                rs.getString("title"),
                location,
                city,
                country,
                rs.getString("url"),
                extractTypePrefix(auctionIdStr), // "A7-39813" → "A7"
                rs.getInt("lots_count"),
                parseTimestamp(rs.getString("first_lot_closing_time"))
        );
    }

    /** Imports a lot from scraper format to monitor format. */
    static Lot fromScraperLot(ResultSet rs) throws SQLException {
        // Parse "A1-28505-5" → 285055 (digits after the type prefix)
        String lotIdStr = rs.getString("lot_id");
        int lotId = extractNumericId(lotIdStr);

        // Parse "A7-39813" → 39813
        String auctionIdStr = rs.getString("auction_id");
        int saleId = extractNumericId(auctionIdStr);

        // Parse "€123.45" → 123.45
        String currentBidStr = rs.getString("current_bid");
        double currentBid = parseBid(currentBidStr);

        return new Lot(
                saleId,
                lotId,
                rs.getString("title"),
                rs.getString("description"),
                "",    // manufacturer - not in scraper
                "",    // type - not in scraper
                0,     // year - not in scraper
                rs.getString("category"),
                currentBid,
                "EUR", // currency - inferred from €
                rs.getString("url"),
                parseTimestamp(rs.getString("closing_time")),
                false  // not yet notified
        );
    }

    private static int extractNumericId(String id) {
        // "A7-39813" → 39813; "A1-28505-5" → 285055
        // Strip the type prefix first: removing non-digits from the whole
        // string would wrongly keep the "7" in "A7" (→ 739813).
        int dash = id.indexOf('-');
        String tail = dash >= 0 ? id.substring(dash + 1) : id;
        return Integer.parseInt(tail.replaceAll("[^0-9]", ""));
    }

    private static String extractTypePrefix(String id) {
        // "A7-39813" → "A7"
        int dashIndex = id.indexOf('-');
        return dashIndex > 0 ? id.substring(0, dashIndex) : "";
    }

    private static double parseBid(String bid) {
        // "€123.45" → 123.45; "No bids" → 0.0
        if (bid == null || bid.contains("No")) return 0.0;
        return Double.parseDouble(bid.replaceAll("[^0-9.]", ""));
    }

    private static LocalDateTime parseTimestamp(String timestamp) {
        if (timestamp == null) return null;
        // Assumes the scraper writes ISO-8601, e.g. "2024-01-15T14:30:00"
        return LocalDateTime.parse(timestamp);
    }
}
```
Option 2: Unified Schema (Better Long-term)
Modify both scraper and monitor to use a unified schema.
Create: SHARED_SCHEMA.sql
```sql
-- Unified schema that both projects use
CREATE TABLE IF NOT EXISTS auctions (
    auction_id TEXT PRIMARY KEY,     -- Use TEXT to support "A7-39813"
    auction_id_numeric INTEGER,      -- For monitor's integer needs
    title TEXT NOT NULL,
    location TEXT,                   -- Full: "Cluj-Napoca, RO"
    city TEXT,                       -- Parsed: "Cluj-Napoca"
    country TEXT,                    -- Parsed: "RO"
    url TEXT NOT NULL,
    type TEXT,                       -- "A7", "A1"
    lot_count INTEGER DEFAULT 0,
    closing_time TEXT,               -- ISO 8601 format
    scraped_at INTEGER,              -- Unix timestamp
    discovered_at INTEGER            -- Unix timestamp (same as scraped_at)
);

CREATE TABLE IF NOT EXISTS lots (
    lot_id TEXT PRIMARY KEY,         -- Use TEXT: "A1-28505-5"
    lot_id_numeric INTEGER,          -- For monitor's integer needs
    auction_id TEXT,                 -- FK: "A7-39813"
    sale_id INTEGER,                 -- For monitor (same as auction_id_numeric)
    title TEXT,
    description TEXT,
    manufacturer TEXT,
    type TEXT,
    year INTEGER,
    category TEXT,
    current_bid_text TEXT,           -- "€123.45" or "No bids"
    current_bid REAL,                -- 123.45
    bid_count INTEGER,
    currency TEXT DEFAULT 'EUR',
    url TEXT UNIQUE,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    closing_notified INTEGER DEFAULT 0,
    scraped_at TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);

CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,                     -- FK: "A1-28505-5"
    url TEXT,                        -- Image URL from website
    local_path TEXT,                 -- Local path after download
    labels TEXT,                     -- Detected objects (comma-separated)
    downloaded INTEGER DEFAULT 0,    -- 0=pending, 1=downloaded
    processed_at INTEGER,            -- Unix timestamp when processed
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_auctions_country ON auctions(country);
CREATE INDEX IF NOT EXISTS idx_lots_auction_id ON lots(auction_id);
CREATE INDEX IF NOT EXISTS idx_images_lot_id ON images(lot_id);
CREATE INDEX IF NOT EXISTS idx_images_downloaded ON images(downloaded);
```
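One way to sanity-check the dual-ID design is to load the schema and insert a row carrying both forms of the ID. A sketch with an abridged copy of the `auctions` DDL above (the sample title and URL are illustrative):

```python
import sqlite3

# Abridged from SHARED_SCHEMA.sql: only the columns exercised here
DDL = """
CREATE TABLE IF NOT EXISTS auctions (
    auction_id         TEXT PRIMARY KEY,   -- "A7-39813"
    auction_id_numeric INTEGER,            -- 39813
    title              TEXT NOT NULL,
    url                TEXT NOT NULL
);
"""

def insert_auction(conn, scraper_id: str, title: str, url: str) -> None:
    """Store both the scraper's TEXT id and the monitor's numeric id."""
    numeric = int(scraper_id.split("-", 1)[1].replace("-", ""))
    conn.execute(
        "INSERT INTO auctions (auction_id, auction_id_numeric, title, url) "
        "VALUES (?, ?, ?, ?)",
        (scraper_id, numeric, title, url),
    )
```

With both columns populated, the scraper keeps joining on TEXT keys while the monitor reads `auction_id_numeric` unchanged.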
Option 3: API Integration (Most Flexible)
Have the scraper expose a REST API for the monitor to query.
```python
# In scraper: add a Flask API endpoint
from flask import Flask, jsonify
import sqlite3

from config import CACHE_DB  # shared DB path, defined in config.py

app = Flask(__name__)

@app.route('/api/auctions', methods=['GET'])
def get_auctions():
    """Returns auctions in monitor-compatible format."""
    conn = sqlite3.connect(CACHE_DB)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM auctions WHERE location LIKE '%NL%'")
    auctions = []
    for row in cursor.fetchall():
        auctions.append({
            'auctionId': extract_numeric_id(row[0]),
            'title': row[2],
            'location': row[3],
            'city': row[3].split(',')[0] if row[3] else '',
            'country': row[3].split(',')[1].strip() if row[3] and ',' in row[3] else '',
            'url': row[1],
            'type': row[0].split('-')[0],
            'lotCount': row[4],
            'closingTime': row[5],
        })
    conn.close()
    return jsonify(auctions)
```
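On the monitor side, the JSON from `/api/auctions` still has to be parsed and filtered. A client-side sketch (the payload shape matches the endpoint above; the country filter is an illustrative example, not part of either project):

```python
import json

def auctions_in_country(payload: str, country: str) -> list:
    """Parse an /api/auctions JSON payload and keep auctions in one country."""
    return [a for a in json.loads(payload) if a.get("country") == country]
```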
Recommended Integration Steps
Phase 1: Immediate (Adapter Pattern)
- ✅ Keep separate schemas
- ✅ Create `ScraperDataAdapter` in Monitor
- ✅ Add import methods to `DatabaseService`
- ✅ Monitor reads from scraper's tables using adapter
Phase 2: Short-term (Unified Schema)
- 📋 Design unified schema (see Option 2)
- 📋 Update scraper to use unified schema
- 📋 Update monitor to use unified schema
- 📋 Migrate existing data
Phase 3: Long-term (API + Event-driven)
- 📋 Add REST API to scraper
- 📋 Add webhook/event notification when new data arrives
- 📋 Monitor subscribes to events
- 📋 Process images asynchronously
Current Integration Flow
Scraper Process (Python)
```bash
# 1. Run scraper to populate database
cd /path/to/scraper
python scraper.py

# Output:
# ✅ Scraped 42 auctions
# ✅ Scraped 1,234 lots
# ✅ Saved 3,456 image URLs
# ✅ Data written to: /mnt/okcomputer/output/cache.db
```
Monitor Process (Java)
```bash
# 2. Run monitor to process the data
cd /path/to/monitor
export DATABASE_FILE=/mnt/okcomputer/output/cache.db
java -jar troostwijk-monitor.jar

# Output:
# 📊 Current Database State:
#    Total lots in database: 1,234
#    Total images processed: 0
#
# [1/2] Processing images...
# Downloading and analyzing 3,456 images...
#
# [2/2] Starting bid monitoring...
# ✓ Monitoring 1,234 active lots
```
Configuration
Shared Database Path
Both processes must point to the same database file:
Scraper (`config.py`):

```python
CACHE_DB = '/mnt/okcomputer/output/cache.db'
```

Monitor (`Main.java`):

```java
String databaseFile = System.getenv().getOrDefault(
        "DATABASE_FILE",
        "/mnt/okcomputer/output/cache.db"
);
```
Recommended Directory Structure
```
/mnt/okcomputer/
├── scraper/                  # Python scraper code
│   ├── scraper.py
│   └── requirements.txt
├── monitor/                  # Java monitor code
│   ├── troostwijk-monitor.jar
│   └── models/               # YOLO models
│       ├── yolov4.cfg
│       ├── yolov4.weights
│       └── coco.names
└── output/                   # Shared data directory
    ├── cache.db              # Shared SQLite database
    └── images/               # Downloaded images
        ├── A1-28505-5/
        │   ├── 001.jpg
        │   └── 002.jpg
        └── ...
```
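With that layout fixed, both processes can derive an image's `local_path` deterministically from the lot ID and a running index. A sketch of the convention shown in the tree above (the helper name is illustrative):

```python
from pathlib import Path

def image_path(output_dir: str, lot_id: str, index: int) -> Path:
    """<output_dir>/images/<lot_id>/<NNN>.jpg, e.g. images/A1-28505-5/001.jpg."""
    return Path(output_dir) / "images" / lot_id / f"{index:03d}.jpg"
```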
Monitoring & Coordination
Option A: Sequential Execution
```bash
#!/bin/bash
# run-pipeline.sh

echo "Step 1: Scraping..."
python scraper/scraper.py

echo "Step 2: Processing images..."
java -jar monitor/troostwijk-monitor.jar --process-images-only

echo "Step 3: Starting monitor..."
java -jar monitor/troostwijk-monitor.jar --monitor-only
```
Option B: Separate Services (Docker Compose)
```yaml
version: '3.8'

services:
  scraper:
    build: ./scraper
    volumes:
      - ./output:/data
    environment:
      - CACHE_DB=/data/cache.db
    command: python scraper.py

  monitor:
    build: ./monitor
    volumes:
      - ./output:/data
    environment:
      - DATABASE_FILE=/data/cache.db
      - NOTIFICATION_CONFIG=desktop
    depends_on:
      - scraper
    command: java -jar troostwijk-monitor.jar
```
Option C: Cron-based Scheduling
```
# Scrape every 6 hours
0 */6 * * * cd /mnt/okcomputer/scraper && python scraper.py

# Process images every hour (if new lots found)
0 * * * * cd /mnt/okcomputer/monitor && java -jar monitor.jar --process-new

# Monitor runs continuously
@reboot cd /mnt/okcomputer/monitor && java -jar monitor.jar --monitor-only
```
Troubleshooting
Issue: Type Mismatch Errors
Symptom: Monitor crashes with "INTEGER expected, got TEXT"
Solution: Use adapter pattern (Option 1) or unified schema (Option 2)
Issue: Monitor sees no data
Symptom: "Total lots in database: 0"
Check:
- Is the `DATABASE_FILE` env var set correctly?
- Did the scraper actually write data?
- Are both processes using the same database file?

```bash
# Verify database has data
sqlite3 /mnt/okcomputer/output/cache.db "SELECT COUNT(*) FROM lots"
```
Issue: Images not downloading
Symptom: "Total images processed: 0" but scraper found images
Check:
- Scraper writes image URLs to the `images` table
- Monitor reads from the `images` table with `downloaded=0`
- Field name mapping: confirm the `local_path` column name matches exactly in both schemas
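A quick way to verify the handoff is to count how many rows are still pending on each side. A diagnostic sketch against the shared database (table and column names follow the images mapping above):

```python
import sqlite3

def pending_counts(conn: sqlite3.Connection) -> tuple:
    """(images not yet downloaded, images not yet processed by detection)."""
    not_downloaded = conn.execute(
        "SELECT COUNT(*) FROM images WHERE downloaded = 0").fetchone()[0]
    not_processed = conn.execute(
        "SELECT COUNT(*) FROM images WHERE processed_at IS NULL").fetchone()[0]
    return not_downloaded, not_processed
```

If the first number stays high while the scraper reports saved URLs, the monitor is likely pointed at the wrong database file.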
Next Steps
- Immediate: Implement `ScraperDataAdapter` for compatibility
- This Week: Test end-to-end integration with sample data
- Next Sprint: Migrate to unified schema
- Future: Add event-driven architecture with webhooks