integrating with monitor app

This commit is contained in:
Tour
2025-12-05 06:48:08 +01:00
parent 72afdf772b
commit aea188699f
7 changed files with 1234 additions and 7 deletions

_wiki/FIXING_MALFORMED_ENTRIES.md (new file)

@@ -0,0 +1,262 @@
# Fixing Malformed Database Entries
## Problem
After the initial scrape run with less strict validation, the database contains entries with incomplete or incorrect data:
### Examples of Malformed Data
```csv
A1-34327,"",https://...,"",€Huidig bod,0,gap,"","","",...
A1-39577,"",https://...,"",€Huidig bod,0,gap,"","","",...
```
**Issues identified:**
1. ❌ Missing `auction_id` (empty string)
2. ❌ Missing `title` (empty string)
3. ❌ Invalid bid value: `€Huidig bod` (Dutch for "Current bid" - placeholder text)
4. ❌ Invalid timestamp: `gap` (should be empty or valid date)
5. ❌ Missing `viewing_time`, `pickup_date`, and other fields
## Root Cause
Earlier scraping runs:
- Used less strict validation
- Fell back to HTML parsing when `__NEXT_DATA__` JSON extraction failed
- HTML parser extracted placeholder text as actual values
- Continued on errors instead of flagging incomplete data
## Solution
### Step 1: Parser Improvements ✅
**Fixed in `src/parse.py`:**
1. **Timestamp parsing** (lines 37-70):
- Filters invalid strings like "gap", "materieel wegens vereffening"
- Returns empty string instead of invalid value
- Handles Unix timestamps in seconds and milliseconds
2. **Bid extraction** (lines 246-280):
- Rejects placeholder text like "€Huidig bod" (including variants padded with zero-width characters)
- Removes zero-width Unicode spaces
- Returns "No bids" instead of invalid placeholder text
### Step 2: Detection and Repair Scripts ✅
Created two scripts to fix existing data:
#### A. `script/migrate_reparse_lots.py`
**Purpose:** Re-parse ALL cached entries with improved JSON extraction
```bash
# Preview against the production database path (Windows-style path shown)
python script/migrate_reparse_lots.py --dry-run --db C:/mnt/okcomputer/output/cache.db
```
```bash
# Preview what would be changed
python script/migrate_reparse_lots.py --dry-run
# Apply changes
python script/migrate_reparse_lots.py
# Use custom database path
python script/migrate_reparse_lots.py --db /path/to/cache.db
```
**What it does:**
- Reads all cached HTML pages from `cache` table
- Re-parses using improved `__NEXT_DATA__` JSON extraction
- Updates existing database entries with newly extracted fields
- Populates missing `auction_id`, `viewing_time`, `pickup_date`, etc.
#### B. `script/fix_malformed_entries.py` ⭐ **RECOMMENDED**
**Purpose:** Detect and fix ONLY malformed entries
```bash
# Preview malformed entries and fixes
python script/fix_malformed_entries.py --dry-run
# Fix malformed entries
python script/fix_malformed_entries.py
# Use custom database path
python script/fix_malformed_entries.py --db /path/to/cache.db
```
**What it detects:**
```sql
-- Auctions with issues
SELECT * FROM auctions WHERE
  auction_id = '' OR auction_id IS NULL
  OR title = '' OR title IS NULL
  OR first_lot_closing_time = 'gap';

-- Lots with issues
SELECT * FROM lots WHERE
  auction_id = '' OR auction_id IS NULL
  OR title = '' OR title IS NULL
  OR current_bid LIKE '%Huidig%bod%'
  OR closing_time = 'gap' OR closing_time = '';
```
**Example output:**
```
=================================================================
MALFORMED ENTRY DETECTION AND REPAIR
=================================================================
1. CHECKING AUCTIONS...
Found 23 malformed auction entries
Fixing auction: A1-39577
URL: https://www.troostwijkauctions.com/a/...-A1-39577
✓ Parsed successfully:
auction_id: A1-39577
title: Bootveiling Rotterdam - Console boten, RIB, speedboten...
location: Rotterdam, NL
lots: 45
✓ Database updated
2. CHECKING LOTS...
Found 127 malformed lot entries
Fixing lot: A1-39529-10
URL: https://www.troostwijkauctions.com/l/...-A1-39529-10
✓ Parsed successfully:
lot_id: A1-39529-10
auction_id: A1-39529
title: Audi A7 Sportback Personenauto
bid: No bids
closing: 2024-12-08 15:30:00
✓ Database updated
=================================================================
SUMMARY
=================================================================
Auctions:
- Found: 23
- Fixed: 21
- Failed: 2
Lots:
- Found: 127
- Fixed: 124
- Failed: 3
```
### Step 3: Verification
After running the fix script, verify the data:
```bash
# Check if malformed entries still exist
python -c "
import sqlite3
conn = sqlite3.connect('path/to/cache.db')
print('Auctions with empty auction_id:')
print(conn.execute(\"SELECT COUNT(*) FROM auctions WHERE auction_id = '' OR auction_id IS NULL\").fetchone()[0])
print('Lots with invalid bids:')
print(conn.execute(\"SELECT COUNT(*) FROM lots WHERE current_bid LIKE '%Huidig%bod%'\").fetchone()[0])
print('Lots with gap timestamps:')
print(conn.execute(\"SELECT COUNT(*) FROM lots WHERE closing_time = 'gap'\").fetchone()[0])
"
```
Expected result after fix: **All counts should be 0**
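Since this commit wires the database into the Java monitor app, the same verification can also run from the Java side over JDBC (a sketch; the SQLite JDBC driver is already a dependency of the monitor project, but the database path here is an assumption):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run the same malformed-entry count from the Java monitor.
// The database path is an assumption; point it at the shared cache.db.
public class MalformedCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:cache.db");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT COUNT(*) FROM lots WHERE current_bid LIKE '%Huidig%bod%'")) {
            rs.next();
            System.out.println("Lots with invalid bids: " + rs.getLong(1));
        }
    }
}
```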
### Step 4: Prevention
To prevent future occurrences:
1. **Validation in scraper** - Add validation before saving to database:
```python
from typing import Dict

def validate_lot_data(lot_data: Dict) -> bool:
    """Validate lot data before saving."""
    required_fields = ['lot_id', 'title', 'url']
    invalid_values = {'gap', '€Huidig bod'}
    for field in required_fields:
        # Strip zero-width spaces so padded placeholder text is still caught
        value = (lot_data.get(field) or '').replace('\u200b', '').strip()
        if not value or value in invalid_values:
            print(f"  ⚠️ Invalid {field}: {value}")
            return False
    return True

# In save_lot method:
if not validate_lot_data(lot_data):
    print(f"  ❌ Skipping invalid lot: {lot_data.get('url')}")
    return
```
2. **Prefer JSON over HTML** - Ensure `__NEXT_DATA__` parsing is tried first (already implemented)
3. **Logging** - Add logging for fallback to HTML parsing:
```python
if next_data:
return next_data
else:
print(f" ⚠️ No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
# HTML parsing...
```
## Recommended Workflow
```bash
# 1. First, run dry-run to see what will be fixed
python script/fix_malformed_entries.py --dry-run
# 2. Review the output - check if fixes look correct
# 3. Run the actual fix
python script/fix_malformed_entries.py
# 4. Verify the results
python script/fix_malformed_entries.py --dry-run
# Should show "Found 0 malformed auction entries" and "Found 0 malformed lot entries"
# 5. (Optional) Run full migration to ensure all fields are populated
python script/migrate_reparse_lots.py
```
## Files Modified/Created
### Modified:
- `src/parse.py` - Improved timestamp and bid parsing with validation
### Created:
- `script/fix_malformed_entries.py` - Targeted fix for malformed entries
- `script/migrate_reparse_lots.py` - Full re-parse migration
- `_wiki/JAVA_FIXES_NEEDED.md` - Java-side fixes documentation
- `_wiki/FIXING_MALFORMED_ENTRIES.md` - This file
## Database Location
If you get "no such table" errors, find your actual database:
```bash
# Find all .db files
find . -name "*.db"
# Check which one has data
sqlite3 path/to/cache.db "SELECT COUNT(*) FROM lots"
# Use that path with --db flag
python script/fix_malformed_entries.py --db /actual/path/to/cache.db
```
## Next Steps
After fixing malformed entries:
1. ✅ Run `fix_malformed_entries.py` to repair bad data
2. ⏳ Apply Java-side fixes (see `_wiki/JAVA_FIXES_NEEDED.md`)
3. ⏳ Re-run Java monitoring process
4. ✅ Add validation to prevent future issues

_wiki/JAVA_FIXES_NEEDED.md (new file)

@@ -0,0 +1,170 @@
# Java Monitoring Process Fixes
## Issues Identified
Based on the error logs from the Java monitoring process, the following bugs need to be fixed:
### 1. Integer Overflow - `extractNumericId()` method
**Error:**
```
For input string: "239144949705335"
at java.lang.Integer.parseInt(Integer.java:565)
at auctiora.ScraperDataAdapter.extractNumericId(ScraperDataAdapter.java:81)
```
**Problem:**
- Lot IDs are being parsed as `int` (32-bit, max value: 2,147,483,647)
- Actual lot IDs can exceed this limit (e.g., "239144949705335")
**Solution:**
Change from `Integer.parseInt()` to `Long.parseLong()`:
```java
// BEFORE (ScraperDataAdapter.java:81)
int numericId = Integer.parseInt(lotId);
// AFTER
long numericId = Long.parseLong(lotId);
```
**Additional changes needed:**
- Update all related fields/variables from `int` to `long`
- Update database schema if numeric ID is stored (change INTEGER to BIGINT)
- Update any method signatures that return/accept `int` for lot IDs
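Putting the pieces together, a defensive variant could look like this (a sketch, not the actual `ScraperDataAdapter` code; the soft-fail behavior is an assumption):
```java
// Sketch: parse lot IDs as 64-bit longs and fail soft on non-numeric input
// instead of crashing the whole import.
static long extractNumericId(String lotId) {
    if (lotId == null || lotId.isBlank()) return -1L; // sentinel: no usable ID
    try {
        return Long.parseLong(lotId.trim());
    } catch (NumberFormatException e) {
        System.err.println("Skipping non-numeric lot ID: " + lotId);
        return -1L;
    }
}
```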
---
### 2. UNIQUE Constraint Failures
**Error:**
```
Failed to import lot: [SQLITE_CONSTRAINT_UNIQUE] A UNIQUE constraint failed (UNIQUE constraint failed: lots.url)
```
**Problem:**
- Attempting to re-insert lots that already exist
- No graceful handling of duplicate entries
**Solution:**
Use `INSERT OR REPLACE` or `INSERT OR IGNORE`:
```java
// BEFORE
String sql = "INSERT INTO lots (lot_id, url, ...) VALUES (?, ?, ...)";
// AFTER - Option 1: Update existing records
String sql = "INSERT OR REPLACE INTO lots (lot_id, url, ...) VALUES (?, ?, ...)";
// AFTER - Option 2: Skip duplicates silently
String sql = "INSERT OR IGNORE INTO lots (lot_id, url, ...) VALUES (?, ?, ...)";
```
**Alternative with try-catch:**
```java
try {
insertLot(lotData);
} catch (SQLException e) {
if (e.getMessage().contains("UNIQUE constraint")) {
logger.debug("Lot already exists, skipping: " + lotData.getUrl());
return; // Or update instead
}
throw e;
}
```
---
### 3. Timestamp Parsing - Already Fixed in Python
**Error:**
```
Unable to parse timestamp: materieel wegens vereffening
Unable to parse timestamp: gap
```
**Status:** ✅ Fixed in `parse.py` (src/parse.py:37-70)
The Python parser now:
- Filters out invalid timestamp strings like "gap", "materieel wegens vereffening"
- Returns empty string for invalid values
- Handles Unix timestamps in both seconds and milliseconds
**Java side action:**
If the Java code also parses timestamps, apply similar validation (see the sketch after this list):
- Check for known invalid values before parsing
- Use try-catch and return null/empty for unparseable timestamps
- Don't fail the entire import if one timestamp is invalid
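A minimal sketch of such a guard, mirroring the Python-side behavior described above (class and method names are illustrative, not existing code):
```java
import java.time.Instant;
import java.util.Set;

// Sketch: validate raw timestamp strings before parsing; mirrors the Python fix.
final class Timestamps {
    private static final Set<String> INVALID =
            Set.of("gap", "materieel wegens vereffening");

    /** Returns an Instant, or null when the raw value is not a usable timestamp. */
    static Instant parseOrNull(String raw) {
        if (raw == null) return null;
        String s = raw.trim().toLowerCase();
        if (s.isEmpty() || INVALID.contains(s)) return null; // known placeholder text
        try {
            long v = Long.parseLong(s);
            // Heuristic: values above ~1e10 are epoch milliseconds, below are seconds.
            return v > 10_000_000_000L ? Instant.ofEpochMilli(v) : Instant.ofEpochSecond(v);
        } catch (NumberFormatException e) {
            return null; // one bad field should not fail the whole import
        }
    }
}
```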
---
## Migration Strategy
### Step 1: Fix Python Parser ✅
- [x] Updated `format_timestamp()` to handle invalid strings
- [x] Created migration script `script/migrate_reparse_lots.py`
### Step 2: Run Migration
```bash
cd /path/to/scaev
python script/migrate_reparse_lots.py --dry-run # Preview changes
python script/migrate_reparse_lots.py # Apply changes
```
This will:
- Re-parse all cached HTML pages using the improved `__NEXT_DATA__` extraction
- Update existing database entries with newly extracted fields
- Populate missing `viewing_time`, `pickup_date`, and other fields
### Step 3: Fix Java Code
1. Update `ScraperDataAdapter.java:81` - use `Long.parseLong()`
2. Update `DatabaseService.java` - use `INSERT OR REPLACE` or handle duplicates
3. Update timestamp parsing - add validation for invalid strings
4. Update database schema - change numeric ID columns to BIGINT if needed
### Step 4: Re-run Monitoring Process
After fixes, the monitoring process should:
- Successfully import all lots without crashes
- Gracefully skip duplicates
- Handle large numeric IDs
- Ignore invalid timestamp values
---
## Database Schema Changes (if needed)
If lot IDs are stored as numeric values in Java's database:
```sql
-- Check current schema
PRAGMA table_info(lots);
-- If numeric ID field exists and is INTEGER, change to BIGINT:
ALTER TABLE lots ADD COLUMN lot_id_numeric BIGINT;
UPDATE lots SET lot_id_numeric = CAST(lot_id AS BIGINT) WHERE lot_id GLOB '[0-9]*';
-- Then update code to use lot_id_numeric
```
---
## Testing Checklist
After applying fixes:
- [ ] Import lot with ID > 2,147,483,647 (e.g., "239144949705335"); see the test sketch after this list
- [ ] Re-import existing lot (should update or skip gracefully)
- [ ] Import lot with invalid timestamp (should not crash)
- [ ] Verify all newly extracted fields are populated (viewing_time, pickup_date, etc.)
- [ ] Check logs for any remaining errors
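A minimal JUnit 5 sketch for the first and third items (it assumes the `Timestamps` sketch from the timestamp section above; nothing here is existing test code):
```java
import static org.junit.jupiter.api.Assertions.*;

import org.junit.jupiter.api.Test;

// Sketch tests for the checklist; assumes the Timestamps sketch above.
class ImportFixesTest {

    @Test
    void largeLotIdParsesAsLong() {
        // This value overflows Integer.parseInt but fits comfortably in a long.
        assertEquals(239144949705335L, Long.parseLong("239144949705335"));
    }

    @Test
    void invalidTimestampsReturnNullInsteadOfThrowing() {
        assertNull(Timestamps.parseOrNull("gap"));
        assertNull(Timestamps.parseOrNull("materieel wegens vereffening"));
    }
}
```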
---
## Files Modified
Python side (completed):
- `src/parse.py` - Fixed `format_timestamp()` method
- `script/migrate_reparse_lots.py` - New migration script
Java side (needs implementation):
- `auctiora/ScraperDataAdapter.java` - Line 81: Change Integer.parseInt to Long.parseLong
- `auctiora/DatabaseService.java` - Line ~569: Handle UNIQUE constraints gracefully
- Database schema - Consider BIGINT for numeric IDs


@@ -0,0 +1,118 @@
# Refactoring Summary: Troostwijk Auction Monitor
## Overview
This project has been refactored to focus on **image processing and monitoring**, removing all auction/lot scraping functionality which is now handled by the external `ARCHITECTURE-TROOSTWIJK-SCRAPER` process.
## Architecture Changes
### Removed Components
- **TroostwijkScraper.java** - Removed (replaced by TroostwijkMonitor)
- ❌ Auction discovery and scraping logic
- ❌ Lot scraping via Playwright/JSoup
- ❌ CacheDatabase (can be removed if not used elsewhere)
### New/Updated Components
#### New Classes
- **TroostwijkMonitor.java** - Monitors bids and coordinates services (no scraping)
- **ImageProcessingService.java** - Downloads images and runs object detection
- **Console.java** - Simple output utility (renamed from IO to avoid a conflict with Java 25)
#### Modernized Classes
- **AuctionInfo** - Converted to immutable `record`
- **Lot** - Converted to immutable `record` with `minutesUntilClose()` method (sketched after this list)
- **DatabaseService.java** - Uses modern Java features:
- Text blocks (`"""`) for SQL
- Record accessor methods
- Added `getImagesForLot()` method
- Added `processed_at` timestamp to images table
- Nested `ImageRecord` record
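As a rough illustration of the `Lot` record mentioned above (a sketch; the real field list in the project may differ):
```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the Lot record's likely shape; the exact fields are assumptions.
public record Lot(String lotId, String auctionId, String title,
                  String currentBid, Instant closingTime) {

    /** Minutes until this lot closes; negative once it has already closed. */
    public long minutesUntilClose() {
        return Duration.between(Instant.now(), closingTime).toMinutes();
    }
}
```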
#### Preserved Components
- **NotificationService.java** - Desktop/email notifications
- **ObjectDetectionService.java** - YOLO-based object detection
- **Main.java** - Updated to use the new architecture
## Database Schema
### Populated by External Scraper
- `auctions` table - Auction metadata
- `lots` table - Lot details with bidding info
### Populated by This Process
- `images` table - Downloaded images, with the following columns (a DDL sketch follows the list):
- `file_path` - Local storage path
- `labels` - Detected objects (comma-separated)
- `processed_at` - Processing timestamp
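A sketch of how that table might be declared in `DatabaseService` using a text block (column names come from this document; the types and the `lot_id` linkage are assumptions):
```java
// Sketch: images table DDL as a text block; types/constraints are assumed.
private static final String CREATE_IMAGES = """
        CREATE TABLE IF NOT EXISTS images (
            lot_id       TEXT NOT NULL,   -- assumed link to the lots table
            file_path    TEXT,            -- local storage path
            labels       TEXT,            -- detected objects, comma-separated
            processed_at TEXT             -- processing timestamp
        )
        """;
```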
## Modern Java Features Used
- **Records** - Immutable data carriers (AuctionInfo, Lot, ImageRecord)
- **Text Blocks** - Multi-line SQL queries
- **var** - Type inference throughout
- **Switch expressions** - Where applicable
- **Pattern matching** - Ready for future enhancements
## Responsibilities
### This Project
1. ✅ Image downloading from URLs in database
2. ✅ Object detection using YOLO/OpenCV
3. ✅ Bid monitoring and change detection
4. ✅ Desktop and email notifications
5. ✅ Data enrichment with image analysis
### External ARCHITECTURE-TROOSTWIJK-SCRAPER
1. 🔄 Discover auctions from Troostwijk website
2. 🔄 Scrape lot details via API
3. 🔄 Populate `auctions` and `lots` tables
4. 🔄 Share database with this process
## Usage
### Running the Monitor
```bash
# With environment variables
export DATABASE_FILE=troostwijk.db
export NOTIFICATION_CONFIG=desktop # or smtp:user:pass:email
java -jar troostwijk-monitor.jar
```
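Inside the monitor these variables are presumably read from the environment; a minimal sketch (the fallback defaults are assumptions):
```java
// Sketch: reading the documented environment variables; defaults are assumed.
String dbFile = System.getenv().getOrDefault("DATABASE_FILE", "troostwijk.db");
// Either "desktop" or "smtp:user:pass:email", per the usage above.
String notificationConfig = System.getenv().getOrDefault("NOTIFICATION_CONFIG", "desktop");
```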
### Expected Output
```
=== Troostwijk Auction Monitor ===
✓ OpenCV loaded
Initializing monitor...
📊 Current Database State:
Total lots in database: 42
Total images processed: 0
[1/2] Processing images...
Processing pending images...
[2/2] Starting bid monitoring...
✓ Monitoring service started
✓ Monitor is running. Press Ctrl+C to stop.
NOTE: This process expects auction/lot data from the external scraper.
Make sure ARCHITECTURE-TROOSTWIJK-SCRAPER is running and populating the database.
```
## Migration Notes
1. The project now compiles successfully with Java 25
2. All scraping logic removed - rely on external scraper
3. Shared database architecture for inter-process communication
4. Clean separation of concerns
5. Modern, maintainable codebase with records and text blocks
## Next Steps
- Remove `CacheDatabase.java` if not needed
- Consider adding API endpoint for external scraper to trigger image processing
- Add metrics/logging framework
- Consider message queue (e.g., Redis, RabbitMQ) for better inter-process communication

_wiki/RUN_INSTRUCTIONS.md (new file)

@@ -0,0 +1,164 @@
# Troostwijk Auction Extractor - Run Instructions
## Fixed Warnings
All warnings have been resolved:
- ✅ SLF4J logging configured (slf4j-simple)
- ✅ Native access enabled for SQLite JDBC
- ✅ Logging output controlled via simplelogger.properties
## Prerequisites
1. **Java 21** installed
2. **Maven** installed
3. **IntelliJ IDEA** (recommended) or command line
## Setup (First Time Only)
### 1. Install Dependencies
In IntelliJ Terminal or PowerShell:
```bash
# Reload Maven dependencies
mvn clean install
# Install Playwright browser binaries (first time only)
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```
## Running the Application
### Option A: Using IntelliJ IDEA (Easiest)
1. **Add VM Options for native access:**
- Run → Edit Configurations
- Select or create configuration for `TroostwijkAuctionExtractor`
- In "VM options" field, add:
```
--enable-native-access=ALL-UNNAMED
```
2. **Add Program Arguments (optional):**
- In "Program arguments" field, add:
```
--max-visits 3
```
3. **Run the application:**
- Click the green Run button
### Option B: Using Maven (Command Line)
```bash
# Run with the pom.xml default arguments (3 page limit)
mvn exec:java
# Run with custom arguments (override pom.xml defaults)
mvn exec:java -Dexec.args="--max-visits 5"
# Run without cache
mvn exec:java -Dexec.args="--no-cache --max-visits 2"
# Run with unlimited visits
mvn exec:java -Dexec.args=""
```
### Option C: Using Java Directly
```bash
# Compile first
mvn clean compile
# Run with native access enabled
java --enable-native-access=ALL-UNNAMED \
-cp target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
com.auction.TroostwijkAuctionExtractor --max-visits 3
```
## Command Line Arguments
```
--max-visits <n> Limit actual page fetches to n (0 = unlimited, default)
--no-cache Disable page caching
--help Show help message
```
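A sketch of how these flags could be wired up (illustrative only, not the actual `TroostwijkAuctionExtractor` code):
```java
// Sketch: parsing the documented flags; names and wiring are assumptions.
static void parseArgs(String[] args) {
    int maxVisits = 0;      // 0 = unlimited, the documented default
    boolean useCache = true;
    for (int i = 0; i < args.length; i++) {
        switch (args[i]) {
            case "--max-visits" -> maxVisits = Integer.parseInt(args[++i]);
            case "--no-cache"   -> useCache = false;
            case "--help"       -> System.out.println(
                    "usage: [--max-visits <n>] [--no-cache] [--help]");
        }
    }
    System.out.printf("maxVisits=%d, cache=%b%n", maxVisits, useCache);
}
```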
## Examples
### Test with 3 page visits (cached pages don't count):
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```
### Fresh extraction without cache:
```bash
mvn exec:java -Dexec.args="--no-cache --max-visits 5"
```
### Full extraction (all pages, unlimited):
```bash
mvn exec:java -Dexec.args=""
```
## Expected Output (No Warnings)
```
=== Troostwijk Auction Extractor ===
Max page visits set to: 3
Initializing Playwright browser...
✓ Browser ready
✓ Cache database initialized
Starting auction extraction from https://www.troostwijkauctions.com/auctions
[Page 1] Fetching auctions...
✓ Fetched from website (visit 1/3)
✓ Found 20 auctions
[Page 2] Fetching auctions...
✓ Loaded from cache
✓ Found 20 auctions
[Page 3] Fetching auctions...
✓ Fetched from website (visit 2/3)
✓ Found 20 auctions
✓ Total auctions extracted: 60
=== Results ===
Total auctions found: 60
Dutch auctions (NL): 45
Actual page visits: 2
✓ Browser and cache closed
```
## Cache Management
- Cache is stored in: `cache/page_cache.db`
- Cache expires after: 24 hours (configurable in code)
- To clear cache: Delete `cache/page_cache.db` file
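The expiry check presumably reduces to comparing the stored fetch time against a 24-hour TTL; a sketch (names are illustrative):
```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the 24-hour freshness check; names are illustrative.
final class CacheTtl {
    static final Duration CACHE_TTL = Duration.ofHours(24); // "configurable in code"

    /** True while a cached page is younger than the TTL and may be reused. */
    static boolean isFresh(Instant fetchedAt) {
        return Duration.between(fetchedAt, Instant.now()).compareTo(CACHE_TTL) < 0;
    }
}
```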
## Troubleshooting
### If you still see warnings:
1. **Reload Maven project in IntelliJ:**
- Right-click `pom.xml` → Maven → Reload project
2. **Verify VM options:**
- Ensure `--enable-native-access=ALL-UNNAMED` is in VM options
3. **Clean and rebuild:**
```bash
mvn clean install
```
### If Playwright fails:
```bash
# Reinstall browser binaries
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install chromium"
```