integrating with monitor app

This commit is contained in:
Tour
2025-12-05 06:48:08 +01:00
parent 72afdf772b
commit aea188699f
7 changed files with 1234 additions and 7 deletions

_wiki/FIXING_MALFORMED_ENTRIES.md (new file)

@@ -0,0 +1,262 @@
# Fixing Malformed Database Entries
## Problem
After the initial scrape run with less strict validation, the database contains entries with incomplete or incorrect data:
### Examples of Malformed Data
```csv
A1-34327,"",https://...,"",€Huidig bod,0,gap,"","","",...
A1-39577,"",https://...,"",€Huidig bod,0,gap,"","","",...
```
**Issues identified:**
1. ❌ Missing `auction_id` (empty string)
2. ❌ Missing `title` (empty string)
3. ❌ Invalid bid value: `€Huidig bod` (Dutch for "Current bid" - placeholder text)
4. ❌ Invalid timestamp: `gap` (should be empty or valid date)
5. ❌ Missing `viewing_time`, `pickup_date`, and other fields
## Root Cause
Earlier scraping runs:
- Used less strict validation
- Fell back to HTML parsing when `__NEXT_DATA__` JSON extraction failed
- HTML parser extracted placeholder text as actual values
- Continued on errors instead of flagging incomplete data
## Solution
### Step 1: Parser Improvements ✅
**Fixed in `src/parse.py`:**
1. **Timestamp parsing** (lines 37-70):
- Filters invalid strings like "gap", "materieel wegens vereffening"
- Returns empty string instead of invalid value
- Handles Unix timestamps in seconds and milliseconds
2. **Bid extraction** (lines 246-280):
- Rejects placeholder text like "€Huidig bod" (including variants padded with zero-width characters)
- Removes zero-width Unicode spaces
- Returns "No bids" instead of invalid placeholder text
### Step 2: Detection and Repair Scripts ✅
Created two scripts to fix existing data:
#### A. `script/migrate_reparse_lots.py`
**Purpose:** Re-parse ALL cached entries with improved JSON extraction
```bash
# Preview against the production database path (Windows-style path shown)
python script/migrate_reparse_lots.py --dry-run --db C:/mnt/okcomputer/output/cache.db
```
```bash
# Preview what would be changed
python script/migrate_reparse_lots.py --dry-run
# Apply changes
python script/migrate_reparse_lots.py
# Use custom database path
python script/migrate_reparse_lots.py --db /path/to/cache.db
```
**What it does:**
- Reads all cached HTML pages from `cache` table
- Re-parses using improved `__NEXT_DATA__` JSON extraction
- Updates existing database entries with newly extracted fields
- Populates missing `auction_id`, `viewing_time`, `pickup_date`, etc.
#### B. `script/fix_malformed_entries.py` ⭐ **RECOMMENDED**
**Purpose:** Detect and fix ONLY malformed entries
```bash
# Preview malformed entries and fixes
python script/fix_malformed_entries.py --dry-run
# Fix malformed entries
python script/fix_malformed_entries.py
# Use custom database path
python script/fix_malformed_entries.py --db /path/to/cache.db
```
**What it detects:**
```sql
-- Auctions with issues
SELECT * FROM auctions WHERE
  auction_id = '' OR auction_id IS NULL
  OR title = '' OR title IS NULL
  OR first_lot_closing_time = 'gap';

-- Lots with issues
SELECT * FROM lots WHERE
  auction_id = '' OR auction_id IS NULL
  OR title = '' OR title IS NULL
  OR current_bid LIKE '%Huidig%bod%'
  OR closing_time = 'gap' OR closing_time = '';
```
**Example output:**
```
=================================================================
MALFORMED ENTRY DETECTION AND REPAIR
=================================================================
1. CHECKING AUCTIONS...
Found 23 malformed auction entries
Fixing auction: A1-39577
URL: https://www.troostwijkauctions.com/a/...-A1-39577
✓ Parsed successfully:
auction_id: A1-39577
title: Bootveiling Rotterdam - Console boten, RIB, speedboten...
location: Rotterdam, NL
lots: 45
✓ Database updated
2. CHECKING LOTS...
Found 127 malformed lot entries
Fixing lot: A1-39529-10
URL: https://www.troostwijkauctions.com/l/...-A1-39529-10
✓ Parsed successfully:
lot_id: A1-39529-10
auction_id: A1-39529
title: Audi A7 Sportback Personenauto
bid: No bids
closing: 2024-12-08 15:30:00
✓ Database updated
=================================================================
SUMMARY
=================================================================
Auctions:
- Found: 23
- Fixed: 21
- Failed: 2
Lots:
- Found: 127
- Fixed: 124
- Failed: 3
```
### Step 3: Verification
After running the fix script, verify the data:
```bash
# Check if malformed entries still exist
python -c "
import sqlite3
conn = sqlite3.connect('path/to/cache.db')
print('Auctions with empty auction_id:')
print(conn.execute(\"SELECT COUNT(*) FROM auctions WHERE auction_id = '' OR auction_id IS NULL\").fetchone()[0])
print('Lots with invalid bids:')
print(conn.execute(\"SELECT COUNT(*) FROM lots WHERE current_bid LIKE '%Huidig%bod%'\").fetchone()[0])
print('Lots with gap timestamps:')
print(conn.execute(\"SELECT COUNT(*) FROM lots WHERE closing_time = 'gap'\").fetchone()[0])
"
```
Expected result after fix: **All counts should be 0**
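Since this commit wires the database into the Java monitor app, the same verification can also run from the Java side over JDBC (a sketch; the SQLite JDBC driver is already a dependency of the monitor project, but the database path here is an assumption):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run the same malformed-entry count from the Java monitor.
// The database path is an assumption; point it at the shared cache.db.
public class MalformedCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:cache.db");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT COUNT(*) FROM lots WHERE current_bid LIKE '%Huidig%bod%'")) {
            rs.next();
            System.out.println("Lots with invalid bids: " + rs.getLong(1));
        }
    }
}
```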
### Step 4: Prevention
To prevent future occurrences:
1. **Validation in scraper** - Add validation before saving to database:
```python
from typing import Dict

def validate_lot_data(lot_data: Dict) -> bool:
    """Validate lot data before saving."""
    required_fields = ['lot_id', 'title', 'url']
    invalid_values = {'gap', '€Huidig bod'}
    for field in required_fields:
        # Strip zero-width spaces so padded placeholder text is still caught
        value = (lot_data.get(field) or '').replace('\u200b', '').strip()
        if not value or value in invalid_values:
            print(f"  ⚠️ Invalid {field}: {value}")
            return False
    return True

# In save_lot method:
if not validate_lot_data(lot_data):
    print(f"  ❌ Skipping invalid lot: {lot_data.get('url')}")
    return
```
2. **Prefer JSON over HTML** - Ensure `__NEXT_DATA__` parsing is tried first (already implemented)
3. **Logging** - Add logging for fallback to HTML parsing:
```python
if next_data:
return next_data
else:
print(f" ⚠️ No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
# HTML parsing...
```
## Recommended Workflow
```bash
# 1. First, run dry-run to see what will be fixed
python script/fix_malformed_entries.py --dry-run
# 2. Review the output - check if fixes look correct
# 3. Run the actual fix
python script/fix_malformed_entries.py
# 4. Verify the results
python script/fix_malformed_entries.py --dry-run
# Should show "Found 0 malformed auction entries" and "Found 0 malformed lot entries"
# 5. (Optional) Run full migration to ensure all fields are populated
python script/migrate_reparse_lots.py
```
## Files Modified/Created
### Modified:
- `src/parse.py` - Improved timestamp and bid parsing with validation
### Created:
- `script/fix_malformed_entries.py` - Targeted fix for malformed entries
- `script/migrate_reparse_lots.py` - Full re-parse migration
- `_wiki/JAVA_FIXES_NEEDED.md` - Java-side fixes documentation
- `_wiki/FIXING_MALFORMED_ENTRIES.md` - This file
## Database Location
If you get "no such table" errors, find your actual database:
```bash
# Find all .db files
find . -name "*.db"
# Check which one has data
sqlite3 path/to/cache.db "SELECT COUNT(*) FROM lots"
# Use that path with --db flag
python script/fix_malformed_entries.py --db /actual/path/to/cache.db
```
## Next Steps
After fixing malformed entries:
1. ✅ Run `fix_malformed_entries.py` to repair bad data
2. ⏳ Apply Java-side fixes (see `_wiki/JAVA_FIXES_NEEDED.md`)
3. ⏳ Re-run Java monitoring process
4. ✅ Add validation to prevent future issues

_wiki/JAVA_FIXES_NEEDED.md (new file)

@@ -0,0 +1,170 @@
# Java Monitoring Process Fixes
## Issues Identified
Based on the error logs from the Java monitoring process, the following bugs need to be fixed:
### 1. Integer Overflow - `extractNumericId()` method
**Error:**
```
For input string: "239144949705335"
at java.lang.Integer.parseInt(Integer.java:565)
at auctiora.ScraperDataAdapter.extractNumericId(ScraperDataAdapter.java:81)
```
**Problem:**
- Lot IDs are being parsed as `int` (32-bit, max value: 2,147,483,647)
- Actual lot IDs can exceed this limit (e.g., "239144949705335")
**Solution:**
Change from `Integer.parseInt()` to `Long.parseLong()`:
```java
// BEFORE (ScraperDataAdapter.java:81)
int numericId = Integer.parseInt(lotId);
// AFTER
long numericId = Long.parseLong(lotId);
```
**Additional changes needed:**
- Update all related fields/variables from `int` to `long`
- Update database schema if numeric ID is stored (change INTEGER to BIGINT)
- Update any method signatures that return/accept `int` for lot IDs
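Putting the pieces together, a defensive variant could look like this (a sketch, not the actual `ScraperDataAdapter` code; the soft-fail behavior is an assumption):
```java
// Sketch: parse lot IDs as 64-bit longs and fail soft on non-numeric input
// instead of crashing the whole import.
static long extractNumericId(String lotId) {
    if (lotId == null || lotId.isBlank()) return -1L; // sentinel: no usable ID
    try {
        return Long.parseLong(lotId.trim());
    } catch (NumberFormatException e) {
        System.err.println("Skipping non-numeric lot ID: " + lotId);
        return -1L;
    }
}
```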
---
### 2. UNIQUE Constraint Failures
**Error:**
```
Failed to import lot: [SQLITE_CONSTRAINT_UNIQUE] A UNIQUE constraint failed (UNIQUE constraint failed: lots.url)
```
**Problem:**
- Attempting to re-insert lots that already exist
- No graceful handling of duplicate entries
**Solution:**
Use `INSERT OR REPLACE` or `INSERT OR IGNORE`:
```java
// BEFORE
String sql = "INSERT INTO lots (lot_id, url, ...) VALUES (?, ?, ...)";
// AFTER - Option 1: Update existing records
String sql = "INSERT OR REPLACE INTO lots (lot_id, url, ...) VALUES (?, ?, ...)";
// AFTER - Option 2: Skip duplicates silently
String sql = "INSERT OR IGNORE INTO lots (lot_id, url, ...) VALUES (?, ?, ...)";
```
**Alternative with try-catch:**
```java
try {
insertLot(lotData);
} catch (SQLException e) {
if (e.getMessage().contains("UNIQUE constraint")) {
logger.debug("Lot already exists, skipping: " + lotData.getUrl());
return; // Or update instead
}
throw e;
}
```
---
### 3. Timestamp Parsing - Already Fixed in Python
**Error:**
```
Unable to parse timestamp: materieel wegens vereffening
Unable to parse timestamp: gap
```
**Status:** ✅ Fixed in `parse.py` (src/parse.py:37-70)
The Python parser now:
- Filters out invalid timestamp strings like "gap", "materieel wegens vereffening"
- Returns empty string for invalid values
- Handles Unix timestamps in both seconds and milliseconds
**Java side action:**
If the Java code also parses timestamps, apply similar validation (see the sketch after this list):
- Check for known invalid values before parsing
- Use try-catch and return null/empty for unparseable timestamps
- Don't fail the entire import if one timestamp is invalid
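A minimal sketch of such a guard, mirroring the Python-side behavior described above (class and method names are illustrative, not existing code):
```java
import java.time.Instant;
import java.util.Set;

// Sketch: validate raw timestamp strings before parsing; mirrors the Python fix.
final class Timestamps {
    private static final Set<String> INVALID =
            Set.of("gap", "materieel wegens vereffening");

    /** Returns an Instant, or null when the raw value is not a usable timestamp. */
    static Instant parseOrNull(String raw) {
        if (raw == null) return null;
        String s = raw.trim().toLowerCase();
        if (s.isEmpty() || INVALID.contains(s)) return null; // known placeholder text
        try {
            long v = Long.parseLong(s);
            // Heuristic: values above ~1e10 are epoch milliseconds, below are seconds.
            return v > 10_000_000_000L ? Instant.ofEpochMilli(v) : Instant.ofEpochSecond(v);
        } catch (NumberFormatException e) {
            return null; // one bad field should not fail the whole import
        }
    }
}
```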
---
## Migration Strategy
### Step 1: Fix Python Parser ✅
- [x] Updated `format_timestamp()` to handle invalid strings
- [x] Created migration script `script/migrate_reparse_lots.py`
### Step 2: Run Migration
```bash
cd /path/to/scaev
python script/migrate_reparse_lots.py --dry-run # Preview changes
python script/migrate_reparse_lots.py # Apply changes
```
This will:
- Re-parse all cached HTML pages using the improved `__NEXT_DATA__` extraction
- Update existing database entries with newly extracted fields
- Populate missing `viewing_time`, `pickup_date`, and other fields
### Step 3: Fix Java Code
1. Update `ScraperDataAdapter.java:81` - use `Long.parseLong()`
2. Update `DatabaseService.java` - use `INSERT OR REPLACE` or handle duplicates
3. Update timestamp parsing - add validation for invalid strings
4. Update database schema - change numeric ID columns to BIGINT if needed
### Step 4: Re-run Monitoring Process
After fixes, the monitoring process should:
- Successfully import all lots without crashes
- Gracefully skip duplicates
- Handle large numeric IDs
- Ignore invalid timestamp values
---
## Database Schema Changes (if needed)
If lot IDs are stored as numeric values in Java's database:
```sql
-- Check current schema
PRAGMA table_info(lots);
-- If numeric ID field exists and is INTEGER, change to BIGINT:
ALTER TABLE lots ADD COLUMN lot_id_numeric BIGINT;
UPDATE lots SET lot_id_numeric = CAST(lot_id AS BIGINT) WHERE lot_id GLOB '[0-9]*';
-- Then update code to use lot_id_numeric
```
---
## Testing Checklist
After applying fixes:
- [ ] Import lot with ID > 2,147,483,647 (e.g., "239144949705335"); see the test sketch after this list
- [ ] Re-import existing lot (should update or skip gracefully)
- [ ] Import lot with invalid timestamp (should not crash)
- [ ] Verify all newly extracted fields are populated (viewing_time, pickup_date, etc.)
- [ ] Check logs for any remaining errors
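A minimal JUnit 5 sketch for the first and third items (it assumes the `Timestamps` sketch from the timestamp section above; nothing here is existing test code):
```java
import static org.junit.jupiter.api.Assertions.*;

import org.junit.jupiter.api.Test;

// Sketch tests for the checklist; assumes the Timestamps sketch above.
class ImportFixesTest {

    @Test
    void largeLotIdParsesAsLong() {
        // This value overflows Integer.parseInt but fits comfortably in a long.
        assertEquals(239144949705335L, Long.parseLong("239144949705335"));
    }

    @Test
    void invalidTimestampsReturnNullInsteadOfThrowing() {
        assertNull(Timestamps.parseOrNull("gap"));
        assertNull(Timestamps.parseOrNull("materieel wegens vereffening"));
    }
}
```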
---
## Files Modified
Python side (completed):
- `src/parse.py` - Fixed `format_timestamp()` method
- `script/migrate_reparse_lots.py` - New migration script
Java side (needs implementation):
- `auctiora/ScraperDataAdapter.java` - Line 81: Change Integer.parseInt to Long.parseLong
- `auctiora/DatabaseService.java` - Line ~569: Handle UNIQUE constraints gracefully
- Database schema - Consider BIGINT for numeric IDs


@@ -0,0 +1,118 @@
# Refactoring Summary: Troostwijk Auction Monitor
## Overview
This project has been refactored to focus on **image processing and monitoring**, removing all auction/lot scraping functionality which is now handled by the external `ARCHITECTURE-TROOSTWIJK-SCRAPER` process.
## Architecture Changes
### Removed Components
- **TroostwijkScraper.java** - Removed (replaced by TroostwijkMonitor)
- ❌ Auction discovery and scraping logic
- ❌ Lot scraping via Playwright/JSoup
- ❌ CacheDatabase (can be removed if not used elsewhere)
### New/Updated Components
#### New Classes
- **TroostwijkMonitor.java** - Monitors bids and coordinates services (no scraping)
- **ImageProcessingService.java** - Downloads images and runs object detection
- **Console.java** - Simple output utility (renamed from IO to avoid a conflict with Java 25)
#### Modernized Classes
- **AuctionInfo** - Converted to immutable `record`
- **Lot** - Converted to immutable `record` with `minutesUntilClose()` method (sketched after this list)
- **DatabaseService.java** - Uses modern Java features:
- Text blocks (`"""`) for SQL
- Record accessor methods
- Added `getImagesForLot()` method
- Added `processed_at` timestamp to images table
- Nested `ImageRecord` record
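As a rough illustration of the `Lot` record mentioned above (a sketch; the real field list in the project may differ):
```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the Lot record's likely shape; the exact fields are assumptions.
public record Lot(String lotId, String auctionId, String title,
                  String currentBid, Instant closingTime) {

    /** Minutes until this lot closes; negative once it has already closed. */
    public long minutesUntilClose() {
        return Duration.between(Instant.now(), closingTime).toMinutes();
    }
}
```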
#### Preserved Components
- **NotificationService.java** - Desktop/email notifications
- **ObjectDetectionService.java** - YOLO-based object detection
- **Main.java** - Updated to use the new architecture
## Database Schema
### Populated by External Scraper
- `auctions` table - Auction metadata
- `lots` table - Lot details with bidding info
### Populated by This Process
- `images` table - Downloaded images, with the following columns (a DDL sketch follows the list):
- `file_path` - Local storage path
- `labels` - Detected objects (comma-separated)
- `processed_at` - Processing timestamp
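A sketch of how that table might be declared in `DatabaseService` using a text block (column names come from this document; the types and the `lot_id` linkage are assumptions):
```java
// Sketch: images table DDL as a text block; types/constraints are assumed.
private static final String CREATE_IMAGES = """
        CREATE TABLE IF NOT EXISTS images (
            lot_id       TEXT NOT NULL,   -- assumed link to the lots table
            file_path    TEXT,            -- local storage path
            labels       TEXT,            -- detected objects, comma-separated
            processed_at TEXT             -- processing timestamp
        )
        """;
```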
## Modern Java Features Used
- **Records** - Immutable data carriers (AuctionInfo, Lot, ImageRecord)
- **Text Blocks** - Multi-line SQL queries
- **var** - Type inference throughout
- **Switch expressions** - Where applicable
- **Pattern matching** - Ready for future enhancements
## Responsibilities
### This Project
1. ✅ Image downloading from URLs in database
2. ✅ Object detection using YOLO/OpenCV
3. ✅ Bid monitoring and change detection
4. ✅ Desktop and email notifications
5. ✅ Data enrichment with image analysis
### External ARCHITECTURE-TROOSTWIJK-SCRAPER
1. 🔄 Discover auctions from Troostwijk website
2. 🔄 Scrape lot details via API
3. 🔄 Populate `auctions` and `lots` tables
4. 🔄 Share database with this process
## Usage
### Running the Monitor
```bash
# With environment variables
export DATABASE_FILE=troostwijk.db
export NOTIFICATION_CONFIG=desktop # or smtp:user:pass:email
java -jar troostwijk-monitor.jar
```
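Inside the monitor these variables are presumably read from the environment; a minimal sketch (the fallback defaults are assumptions):
```java
// Sketch: reading the documented environment variables; defaults are assumed.
String dbFile = System.getenv().getOrDefault("DATABASE_FILE", "troostwijk.db");
// Either "desktop" or "smtp:user:pass:email", per the usage above.
String notificationConfig = System.getenv().getOrDefault("NOTIFICATION_CONFIG", "desktop");
```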
### Expected Output
```
=== Troostwijk Auction Monitor ===
✓ OpenCV loaded
Initializing monitor...
📊 Current Database State:
Total lots in database: 42
Total images processed: 0
[1/2] Processing images...
Processing pending images...
[2/2] Starting bid monitoring...
✓ Monitoring service started
✓ Monitor is running. Press Ctrl+C to stop.
NOTE: This process expects auction/lot data from the external scraper.
Make sure ARCHITECTURE-TROOSTWIJK-SCRAPER is running and populating the database.
```
## Migration Notes
1. The project now compiles successfully with Java 25
2. All scraping logic removed - rely on external scraper
3. Shared database architecture for inter-process communication
4. Clean separation of concerns
5. Modern, maintainable codebase with records and text blocks
## Next Steps
- Remove `CacheDatabase.java` if not needed
- Consider adding API endpoint for external scraper to trigger image processing
- Add metrics/logging framework
- Consider message queue (e.g., Redis, RabbitMQ) for better inter-process communication

_wiki/RUN_INSTRUCTIONS.md (new file)

@@ -0,0 +1,164 @@
# Troostwijk Auction Extractor - Run Instructions
## Fixed Warnings
All warnings have been resolved:
- ✅ SLF4J logging configured (slf4j-simple)
- ✅ Native access enabled for SQLite JDBC
- ✅ Logging output controlled via simplelogger.properties
## Prerequisites
1. **Java 21** installed
2. **Maven** installed
3. **IntelliJ IDEA** (recommended) or command line
## Setup (First Time Only)
### 1. Install Dependencies
In IntelliJ Terminal or PowerShell:
```bash
# Reload Maven dependencies
mvn clean install
# Install Playwright browser binaries (first time only)
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```
## Running the Application
### Option A: Using IntelliJ IDEA (Easiest)
1. **Add VM Options for native access:**
- Run → Edit Configurations
- Select or create configuration for `TroostwijkAuctionExtractor`
- In "VM options" field, add:
```
--enable-native-access=ALL-UNNAMED
```
2. **Add Program Arguments (optional):**
- In "Program arguments" field, add:
```
--max-visits 3
```
3. **Run the application:**
- Click the green Run button
### Option B: Using Maven (Command Line)
```bash
# Run with the pom.xml default arguments (3 page limit)
mvn exec:java
# Run with custom arguments (override pom.xml defaults)
mvn exec:java -Dexec.args="--max-visits 5"
# Run without cache
mvn exec:java -Dexec.args="--no-cache --max-visits 2"
# Run with unlimited visits
mvn exec:java -Dexec.args=""
```
### Option C: Using Java Directly
```bash
# Compile first
mvn clean compile
# Run with native access enabled
java --enable-native-access=ALL-UNNAMED \
-cp target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
com.auction.TroostwijkAuctionExtractor --max-visits 3
```
## Command Line Arguments
```
--max-visits <n> Limit actual page fetches to n (0 = unlimited, default)
--no-cache Disable page caching
--help Show help message
```
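A sketch of how these flags could be wired up (illustrative only, not the actual `TroostwijkAuctionExtractor` code):
```java
// Sketch: parsing the documented flags; names and wiring are assumptions.
static void parseArgs(String[] args) {
    int maxVisits = 0;      // 0 = unlimited, the documented default
    boolean useCache = true;
    for (int i = 0; i < args.length; i++) {
        switch (args[i]) {
            case "--max-visits" -> maxVisits = Integer.parseInt(args[++i]);
            case "--no-cache"   -> useCache = false;
            case "--help"       -> System.out.println(
                    "usage: [--max-visits <n>] [--no-cache] [--help]");
        }
    }
    System.out.printf("maxVisits=%d, cache=%b%n", maxVisits, useCache);
}
```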
## Examples
### Test with 3 page visits (cached pages don't count):
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```
### Fresh extraction without cache:
```bash
mvn exec:java -Dexec.args="--no-cache --max-visits 5"
```
### Full extraction (all pages, unlimited):
```bash
mvn exec:java -Dexec.args=""
```
## Expected Output (No Warnings)
```
=== Troostwijk Auction Extractor ===
Max page visits set to: 3
Initializing Playwright browser...
✓ Browser ready
✓ Cache database initialized
Starting auction extraction from https://www.troostwijkauctions.com/auctions
[Page 1] Fetching auctions...
✓ Fetched from website (visit 1/3)
✓ Found 20 auctions
[Page 2] Fetching auctions...
✓ Loaded from cache
✓ Found 20 auctions
[Page 3] Fetching auctions...
✓ Fetched from website (visit 2/3)
✓ Found 20 auctions
✓ Total auctions extracted: 60
=== Results ===
Total auctions found: 60
Dutch auctions (NL): 45
Actual page visits: 2
✓ Browser and cache closed
```
## Cache Management
- Cache is stored in: `cache/page_cache.db`
- Cache expires after: 24 hours (configurable in code)
- To clear cache: Delete `cache/page_cache.db` file
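The expiry check presumably reduces to comparing the stored fetch time against a 24-hour TTL; a sketch (names are illustrative):
```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the 24-hour freshness check; names are illustrative.
final class CacheTtl {
    static final Duration CACHE_TTL = Duration.ofHours(24); // "configurable in code"

    /** True while a cached page is younger than the TTL and may be reused. */
    static boolean isFresh(Instant fetchedAt) {
        return Duration.between(fetchedAt, Instant.now()).compareTo(CACHE_TTL) < 0;
    }
}
```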
## Troubleshooting
### If you still see warnings:
1. **Reload Maven project in IntelliJ:**
- Right-click `pom.xml` → Maven → Reload project
2. **Verify VM options:**
- Ensure `--enable-native-access=ALL-UNNAMED` is in VM options
3. **Clean and rebuild:**
```bash
mvn clean install
```
### If Playwright fails:
```bash
# Reinstall browser binaries
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install chromium"
```