# Troostwijk Scraper - Architecture & Data Flow
## System Overview
The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website.
## Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│ TROOSTWIJK SCRAPER │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: COLLECT AUCTION URLs │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Listing Page │────────▶│ Extract /a/ │ │
│ │ /auctions? │ │ auction URLs │ │
│ │ page=1..N │ └──────────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ [ List of Auction URLs ] │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: EXTRACT LOT URLs FROM AUCTIONS │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Auction Page │────────▶│ Parse │ │
│ │ /a/... │ │ __NEXT_DATA__│ │
│ └──────────────┘ │ JSON │ │
│ │ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Save Auction │ │ Extract /l/ │ │
│ │ Metadata │ │ lot URLs │ │
│ │ to DB │ └──────────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ [ List of Lot URLs ] │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: SCRAPE LOT DETAILS │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Lot Page │────────▶│ Parse │ │
│ │ /l/... │ │ __NEXT_DATA__│ │
│ └──────────────┘ │ JSON │ │
│ └──────────────┘ │
│ │ │
│ ┌─────────────────────────┴─────────────────┐ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Save Lot │ │ Save Images │ │
│ │ Details │ │ URLs to DB │ │
│ │ to DB │ └──────────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ [Optional Download] │
└─────────────────────────────────────────────────────────────────┘
```
## Database Schema
```sql
-- CACHE: HTML storage with compression
CREATE TABLE cache (
    url         TEXT PRIMARY KEY,
    content     BLOB,               -- compressed HTML (zlib)
    timestamp   REAL,
    status_code INTEGER,
    compressed  INTEGER             -- 1 = compressed, 0 = plain
);

-- AUCTIONS
CREATE TABLE auctions (
    auction_id             TEXT PRIMARY KEY,  -- e.g. "A7-39813"
    url                    TEXT UNIQUE,
    title                  TEXT,
    location               TEXT,              -- e.g. "Cluj-Napoca, RO"
    lots_count             INTEGER,
    first_lot_closing_time TEXT,
    scraped_at             TEXT
);

-- LOTS
CREATE TABLE lots (
    lot_id       TEXT PRIMARY KEY,            -- e.g. "A1-28505-5"
    auction_id   TEXT,                        -- FK to auctions
    url          TEXT UNIQUE,
    title        TEXT,
    current_bid  TEXT,                        -- "€123.45" or "No bids"
    bid_count    INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date  TEXT,
    location     TEXT,                        -- e.g. "Dongen, NL"
    description  TEXT,
    category     TEXT,
    scraped_at   TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);

-- IMAGES: image URLs & download status
CREATE TABLE images (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id     TEXT,                          -- FK to lots
    url        TEXT,                          -- image URL
    local_path TEXT,                          -- path after download
    downloaded INTEGER,                       -- 0 = pending, 1 = downloaded
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```
## Sequence Diagram
```
User Scraper Playwright Cache DB Data Tables
│ │ │ │ │
│ Run │ │ │ │
├──────────────▶│ │ │ │
│ │ │ │ │
│ │ Phase 1: Listing Pages │ │
│ ├───────────────▶│ │ │
│ │ goto() │ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ ├───────────────────────────────▶│ │
│ │ compress & cache │ │
│ │ │ │ │
│ │ Phase 2: Auction Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT auctions
│ │ │ │ │
│ │ Phase 3: Lot Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT lots │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT images│
│ │ │ │ │
│ │ Export to CSV/JSON │ │
│ │◀────────────────────────────────────────────────┤
│ │ Query all data │ │
│◀──────────────┤ │ │ │
│ Results │ │ │ │
```
## Data Flow Details
### 1. **Page Retrieval & Caching**
```
Request URL
├──▶ Check cache DB (with timestamp validation)
│ │
│ ├─[HIT]──▶ Decompress (if compressed=1)
│ │ └──▶ Return HTML
│ │
│ └─[MISS]─▶ Fetch via Playwright
│ │
│ ├──▶ Compress HTML (zlib level 9)
│ │ ~70-90% size reduction
│ │
│ └──▶ Store in cache DB (compressed=1)
└──▶ Return HTML for parsing
```
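A minimal Python sketch of this flow, assuming the `cache` table above and a hypothetical `fetch_with_playwright(url)` helper that returns rendered HTML (the real function names may differ):

```python
import sqlite3
import time
import zlib

CACHE_TTL_S = 24 * 3600  # assumed default: 24-hour cache validity

def get_page(conn: sqlite3.Connection, url: str) -> str:
    """Return HTML for url, serving from the cache table when fresh."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is not None and time.time() - row[1] < CACHE_TTL_S:
        content, _ts, compressed = row
        # assumes legacy plain rows (compressed=0) were stored as bytes
        return (zlib.decompress(content) if compressed else content).decode("utf-8")

    html = fetch_with_playwright(url)  # hypothetical helper wrapping page.goto()
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, zlib.compress(html.encode("utf-8"), 9), time.time(), 200),
    )
    conn.commit()
    return html
```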
### 2. **JSON Parsing Strategy**
```
HTML Content
└──▶ Extract <script id="__NEXT_DATA__">
├──▶ Parse JSON
│ │
│ ├─[has pageProps.lot]──▶ Individual LOT
│ │ └──▶ Extract: title, bid, location, images, etc.
│ │
│ └─[has pageProps.auction]──▶ AUCTION
│ │
│ ├─[has lots[] array]──▶ Auction with lots
│ │ └──▶ Extract: title, location, lots_count
│ │
│ └─[no lots[] array]──▶ Old format lot
│ └──▶ Parse as lot
└──▶ Fallback to HTML regex parsing (if JSON fails)
```
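A sketch of this classification logic in Python; the `props.pageProps` path follows the usual Next.js payload layout and is an assumption here:

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

def classify_page(html: str):
    """Return ("lot", data) or ("auction", data) per the branching above."""
    m = NEXT_DATA_RE.search(html)
    if m is None:
        return None  # caller falls back to HTML regex parsing
    page_props = json.loads(m.group(1)).get("props", {}).get("pageProps", {})
    if "lot" in page_props:
        return "lot", page_props["lot"]       # individual lot page
    if "auction" in page_props:
        auction = page_props["auction"]
        if auction.get("lots"):               # auction with a lots[] array
            return "auction", auction
        return "lot", auction                 # old-format page: parse as a lot
    return None
```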
### 3. **Image Handling**
```
Lot Page Parsed
├──▶ Extract images[] from JSON
│ │
│ └──▶ INSERT INTO images (lot_id, url, downloaded=0)
└──▶ [If DOWNLOAD_IMAGES=True]
├──▶ Download each image
│ │
│ ├──▶ Save to: /images/{lot_id}/001.jpg
│ │
│ └──▶ UPDATE images SET local_path=?, downloaded=1
└──▶ Rate limit between downloads (0.5s)
```
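A hedged sketch of this step, assuming the image URLs have already been extracted from the lot JSON:

```python
import os
import time
import urllib.request

IMAGES_DIR = "/mnt/okcomputer/output/images"

def record_images(conn, lot_id: str, image_urls: list[str],
                  download: bool = False) -> None:
    """Insert image URLs for a lot; optionally download them immediately."""
    for url in image_urls:
        conn.execute(
            "INSERT INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
            (lot_id, url),
        )
    conn.commit()
    if not download:  # mirrors the DOWNLOAD_IMAGES=False default
        return
    for i, url in enumerate(image_urls, start=1):
        path = os.path.join(IMAGES_DIR, lot_id, f"{i:03d}.jpg")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(url, path)  # sketch: no retry handling
        conn.execute(
            "UPDATE images SET local_path = ?, downloaded = 1 "
            "WHERE lot_id = ? AND url = ?",
            (path, lot_id, url),
        )
        conn.commit()
        time.sleep(0.5)  # rate limit between downloads
```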
## Key Configuration
| Setting | Value | Purpose |
|---------|-------|---------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
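If these settings live in a module-level config (an assumption; the real layout may differ), they would look roughly like:

```python
# config.py - central settings (a sketch; names mirror the table above)
CACHE_DB = "/mnt/okcomputer/output/cache.db"   # SQLite database path
IMAGES_DIR = "/mnt/okcomputer/output/images"   # downloaded images storage
RATE_LIMIT_SECONDS = 0.5                       # delay between requests
DOWNLOAD_IMAGES = False                        # toggle image downloading
MAX_PAGES = 50                                 # listing pages to crawl
```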
## Output Files
```
/mnt/okcomputer/output/
├── cache.db # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json # Exported auctions
├── auctions_{timestamp}.csv # Exported auctions
├── lots_{timestamp}.json # Exported lots
├── lots_{timestamp}.csv # Exported lots
└── images/ # Downloaded images (if enabled)
├── A1-28505-5/
│ ├── 001.jpg
│ └── 002.jpg
└── A1-28505-6/
└── 001.jpg
```
## Extension Points for Integration
### 1. **Downstream Processing Pipeline**
```sql
-- Query lots without downloaded images
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images (OCR, classification, etc.), then update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```
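Wired together in Python, a downstream worker might look like this sketch; `process_image` is a hypothetical stand-in for your OCR or classification step:

```python
import sqlite3

def process_pending_images(db_path: str) -> None:
    """Fetch pending image rows, process each, and mark it complete."""
    conn = sqlite3.connect(db_path)
    pending = conn.execute(
        "SELECT id, lot_id, url FROM images WHERE downloaded = 0"
    ).fetchall()
    for row_id, lot_id, url in pending:
        local_path = process_image(url, lot_id)  # hypothetical processing step
        conn.execute(
            "UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?",
            (local_path, row_id),
        )
    conn.commit()
    conn.close()
```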
### 2. **Real-time Monitoring**
```sql
-- Check for new lots every N minutes
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```
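As a sketch, a polling loop over the first query (assuming `scraped_at` holds ISO-8601 UTC strings, which compare correctly against `datetime('now', ...)`):

```python
import sqlite3
import time

def watch_new_lots(db_path: str, interval_s: int = 600) -> None:
    """Poll the lots table and report how many lots arrived in the last hour."""
    conn = sqlite3.connect(db_path)
    while True:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour')"
        ).fetchone()
        print(f"{count} lots scraped in the last hour")
        time.sleep(interval_s)
```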
### 3. **Analytics & Reporting**
```sql
-- Top locations
SELECT location, COUNT(*) AS lot_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```
### 4. **Image Processing Integration**
```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process downloaded images that have a local path
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```
## Performance Characteristics
- **Compression**: ~70-90% HTML size reduction (1GB → ~100-300MB)
- **Rate Limiting**: Exactly 0.5s between requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: up to ~7,200 pages/hour (the ceiling implied by the 0.5s delay; actual throughput is lower once fetch time is included)
- **Scalability**: SQLite handles millions of rows efficiently
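The compression figure is easy to verify against any cached page; a quick sketch:

```python
import zlib

def compression_ratio(html: str) -> float:
    """Fraction of bytes saved by zlib level 9 on one HTML document."""
    raw = html.encode("utf-8")
    return 1 - len(zlib.compress(raw, 9)) / len(raw)

# Typical HTML pages land around 0.7-0.9, matching the figure above.
```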
## Error Handling
- **Network failures**: Cached as status_code=500, retry after cache expiry
- **Parse failures**: Falls back to HTML regex patterns
- **Compression errors**: Auto-detects and handles uncompressed legacy data
- **Missing fields**: Defaults to "No bids", empty string, or 0
## Rate Limiting & Ethics
- **REQUIRED**: 0.5 second delay between ALL requests
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as standard browser
- **No parallelization**: Single-threaded sequential crawling

---

**File:** `wiki/QUICKSTART.md`

# Quick Start Guide
Get the scraper running in minutes without downloading YOLO models!
## Minimal Setup (No Object Detection)
The scraper works perfectly fine **without** YOLO object detection. You can run it immediately and add object detection later if needed.
### Step 1: Run the Scraper
```bash
# Using Maven
mvn clean compile exec:java -Dexec.mainClass="com.auction.scraper.TroostwijkScraper"
```
Or in IntelliJ IDEA:
1. Open `TroostwijkScraper.java`
2. Right-click on the `main` method
3. Select "Run 'TroostwijkScraper.main()'"
### What You'll See
```
=== Troostwijk Auction Scraper ===
Initializing scraper...
⚠️ Object detection disabled: YOLO model files not found
Expected files:
- models/yolov4.cfg
- models/yolov4.weights
- models/coco.names
Scraper will continue without image analysis.
[1/3] Discovering Dutch auctions...
✓ Found 5 auctions: [12345, 12346, 12347, 12348, 12349]
[2/3] Fetching lot details...
Processing sale 12345...
[3/3] Starting monitoring service...
✓ Monitoring active. Press Ctrl+C to stop.
```
### Step 2: Test Desktop Notifications
The scraper will automatically send desktop notifications when:
- A new bid is placed on a monitored lot
- An auction is closing within 5 minutes
**No setup required** - desktop notifications work out of the box!
---
## Optional: Add Email Notifications
If you want email notifications in addition to desktop notifications:
```bash
# Set environment variable
export NOTIFICATION_CONFIG="smtp:your.email@gmail.com:app_password:your.email@gmail.com"
# Then run the scraper
mvn exec:java -Dexec.mainClass="com.auction.scraper.TroostwijkScraper"
```
**Get Gmail App Password:**
1. Enable 2FA in Google Account
2. Go to: Google Account → Security → 2-Step Verification → App passwords
3. Generate password for "Mail"
4. Use that password (not your regular Gmail password)
---
## Optional: Add Object Detection Later
If you want AI-powered image analysis to detect objects in auction photos:
### 1. Create models directory
```bash
mkdir models
cd models
```
### 2. Download YOLO files
```bash
# YOLOv4 config (small)
curl -O https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4.cfg
# YOLOv4 weights (245 MB - takes a few minutes)
curl -LO https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights
# COCO class names
curl -O https://raw.githubusercontent.com/AlexeyAB/darknet/master/data/coco.names
```
### 3. Run again
```bash
mvn exec:java -Dexec.mainClass="com.auction.scraper.TroostwijkScraper"
```
Now you'll see:
```
✓ Object detection enabled with YOLO
```
The scraper will now analyze auction images and detect objects like:
- Vehicles (cars, trucks, forklifts)
- Equipment (machines, tools)
- Furniture
- Electronics
- And 80+ other object types
---
## Features Without Object Detection
Even without YOLO, the scraper provides:
- **Full auction scraping** - Discovers all Dutch auctions
- **Lot tracking** - Monitors bids and closing times
- **Desktop notifications** - Real-time alerts
- **SQLite database** - All data persisted locally
- **Image downloading** - Saves all lot images
- **Scheduled monitoring** - Automatic updates every hour
Object detection simply adds:
- AI-powered image analysis
- Automatic object labeling
- Searchable image database
---
## Database Location
The scraper creates `troostwijk.db` in your current directory with:
- All auction data
- Lot details (title, description, bids, etc.)
- Downloaded image paths
- Object labels (if detection enabled)
View the database with any SQLite browser:
```bash
sqlite3 troostwijk.db
.tables
SELECT * FROM lots LIMIT 5;
```
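Or programmatically, e.g. a minimal Python sketch against the same file:

```python
import sqlite3

# Assumes the default troostwijk.db in the current working directory
conn = sqlite3.connect("troostwijk.db")
for row in conn.execute("SELECT * FROM lots LIMIT 5"):
    print(row)
conn.close()
```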
---
## Stopping the Scraper
Press **Ctrl+C** to stop the monitoring service.
---
## Next Steps
1. **Run the scraper** without YOLO to test it
2. **Verify desktop notifications** work
3. ⚙️ **Optional**: Add email notifications
4. ⚙️ **Optional**: Download YOLO models for object detection
5. 🔧 **Customize**: Edit monitoring frequency, closing alerts, etc.
---
## Troubleshooting
### Desktop notifications not appearing?
- **Windows**: Check if Java has notification permissions
- **Linux**: Ensure desktop environment is running (not headless)
- **macOS**: Check System Preferences → Notifications
### OpenCV warnings?
These are normal and can be ignored:
```
WARNING: A restricted method in java.lang.System has been called
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid warning
```
The scraper works fine despite these warnings.
---
## Full Documentation
See [README.md](../README.md) for complete documentation including:
- Email setup details
- YOLO installation guide
- Configuration options
- Database schema
- API endpoints

---

**File:** `wiki/RUN_INSTRUCTIONS.md`

# Troostwijk Auction Extractor - Run Instructions
## Fixed Warnings
All warnings have been resolved:
- ✅ SLF4J logging configured (slf4j-simple)
- ✅ Native access enabled for SQLite JDBC
- ✅ Logging output controlled via simplelogger.properties
## Prerequisites
1. **Java 21** installed
2. **Maven** installed
3. **IntelliJ IDEA** (recommended) or command line
## Setup (First Time Only)
### 1. Install Dependencies
In IntelliJ Terminal or PowerShell:
```bash
# Reload Maven dependencies
mvn clean install
# Install Playwright browser binaries (first time only)
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```
## Running the Application
### Option A: Using IntelliJ IDEA (Easiest)
1. **Add VM Options for native access:**
- Run → Edit Configurations
- Select or create configuration for `TroostwijkAuctionExtractor`
- In "VM options" field, add:
```
--enable-native-access=ALL-UNNAMED
```
2. **Add Program Arguments (optional):**
- In "Program arguments" field, add:
```
--max-visits 3
```
3. **Run the application:**
- Click the green Run button
### Option B: Using Maven (Command Line)
```bash
# Run with the defaults configured in pom.xml (e.g. a 3-page limit)
mvn exec:java
# Run with custom arguments (override pom.xml defaults)
mvn exec:java -Dexec.args="--max-visits 5"
# Run without cache
mvn exec:java -Dexec.args="--no-cache --max-visits 2"
# Run with unlimited visits
mvn exec:java -Dexec.args=""
```
### Option C: Using Java Directly
```bash
# Compile first
mvn clean compile
# Run with native access enabled (Unix shell syntax; on Windows use ';' as the classpath separator)
java --enable-native-access=ALL-UNNAMED \
-cp target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
com.auction.TroostwijkAuctionExtractor --max-visits 3
```
## Command Line Arguments
```
--max-visits <n> Limit actual page fetches to n (0 = unlimited, default)
--no-cache Disable page caching
--help Show help message
```
## Examples
### Test with 3 page visits (cached pages don't count):
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```
### Fresh extraction without cache:
```bash
mvn exec:java -Dexec.args="--no-cache --max-visits 5"
```
### Full extraction (all pages, unlimited):
```bash
mvn exec:java -Dexec.args=""
```
## Expected Output (No Warnings)
```
=== Troostwijk Auction Extractor ===
Max page visits set to: 3
Initializing Playwright browser...
✓ Browser ready
✓ Cache database initialized
Starting auction extraction from https://www.troostwijkauctions.com/auctions
[Page 1] Fetching auctions...
✓ Fetched from website (visit 1/3)
✓ Found 20 auctions
[Page 2] Fetching auctions...
✓ Loaded from cache
✓ Found 20 auctions
[Page 3] Fetching auctions...
✓ Fetched from website (visit 2/3)
✓ Found 20 auctions
✓ Total auctions extracted: 60
=== Results ===
Total auctions found: 60
Dutch auctions (NL): 45
Actual page visits: 2
✓ Browser and cache closed
```
## Cache Management
- Cache is stored in: `cache/page_cache.db`
- Cache expires after: 24 hours (configurable in code)
- To clear cache: Delete `cache/page_cache.db` file
## Troubleshooting
### If you still see warnings:
1. **Reload Maven project in IntelliJ:**
- Right-click `pom.xml` → Maven → Reload project
2. **Verify VM options:**
- Ensure `--enable-native-access=ALL-UNNAMED` is in VM options
3. **Clean and rebuild:**
```bash
mvn clean install
```
### If Playwright fails:
```bash
# Reinstall browser binaries
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install chromium"
```