# Troostwijk Scraper - Architecture & Data Flow

## System Overview

The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website.

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                       TROOSTWIJK SCRAPER                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: COLLECT AUCTION URLs                                  │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Listing Page │────────▶│ Extract /a/  │                      │
│  │ /auctions?   │         │ auction URLs │                      │
│  │ page=1..N    │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                       [ List of Auction URLs ]                  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Auction Page │────────▶│ Parse        │                      │
│  │ /a/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│         │                 └──────────────┘                      │
│         │                        │                              │
│         ▼                        ▼                              │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Save Auction │         │ Extract /l/  │                      │
│  │ Metadata     │         │ lot URLs     │                      │
│  │ to DB        │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                         [ List of Lot URLs ]                    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: SCRAPE LOT DETAILS                                    │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Lot Page     │────────▶│ Parse        │                      │
│  │ /l/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│                           └──────────────┘                      │
│                                  │                              │
│         ┌────────────────────────┴──────────────────┐           │
│         ▼                                           ▼           │
│  ┌──────────────┐                          ┌──────────────┐     │
│  │ Save Lot     │                          │ Save Images  │     │
│  │ Details      │                          │ URLs to DB   │     │
│  │ to DB        │                          └──────────────┘     │
│  └──────────────┘                                 │             │
│                                                   ▼             │
│                                        [Optional Download]      │
└─────────────────────────────────────────────────────────────────┘
```

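Expressed as code, the three phases chain into a simple driver loop. A minimal sketch, assuming the configuration described later in this document; `fetch_html`, `extract_links`, and the `save_*` helpers are illustrative names, not the actual module API:

```python
BASE_URL = "https://www.troostwijkauctions.com"
MAX_PAGES = 50  # listing pages to crawl (see Key Configuration)

def run_scraper(fetch_html, extract_links, save_auction, save_lot, save_image_urls):
    """Three-phase crawl: listing pages -> auction pages -> lot pages."""
    # Phase 1: walk the paginated listing and collect /a/ auction URLs
    auction_urls = []
    for page in range(1, MAX_PAGES + 1):
        html = fetch_html(f"{BASE_URL}/auctions?page={page}")
        auction_urls += extract_links(html, prefix="/a/")

    # Phase 2: visit each auction, persist its metadata, collect /l/ lot URLs
    lot_urls = []
    for url in auction_urls:
        html = fetch_html(url)
        save_auction(html)  # parses __NEXT_DATA__ and writes the auctions row
        lot_urls += extract_links(html, prefix="/l/")

    # Phase 3: visit each lot and persist details plus image URLs
    for url in lot_urls:
        html = fetch_html(url)
        save_lot(html)
        save_image_urls(html)
```
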
## Database Schema

```
┌──────────────────────────────────────────────────────────────────┐
│ CACHE TABLE (HTML Storage with Compression)                      │
├──────────────────────────────────────────────────────────────────┤
│ cache                                                            │
│ ├── url (TEXT, PRIMARY KEY)                                      │
│ ├── content (BLOB)          -- Compressed HTML (zlib)            │
│ ├── timestamp (REAL)                                             │
│ ├── status_code (INTEGER)                                        │
│ └── compressed (INTEGER)    -- 1=compressed, 0=plain             │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ AUCTIONS TABLE                                                   │
├──────────────────────────────────────────────────────────────────┤
│ auctions                                                         │
│ ├── auction_id (TEXT, PRIMARY KEY)   -- e.g. "A7-39813"          │
│ ├── url (TEXT, UNIQUE)                                           │
│ ├── title (TEXT)                                                 │
│ ├── location (TEXT)                  -- e.g. "Cluj-Napoca, RO"   │
│ ├── lots_count (INTEGER)                                         │
│ ├── first_lot_closing_time (TEXT)                                │
│ └── scraped_at (TEXT)                                            │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ LOTS TABLE                                                       │
├──────────────────────────────────────────────────────────────────┤
│ lots                                                             │
│ ├── lot_id (TEXT, PRIMARY KEY)       -- e.g. "A1-28505-5"        │
│ ├── auction_id (TEXT)                -- FK to auctions           │
│ ├── url (TEXT, UNIQUE)                                           │
│ ├── title (TEXT)                                                 │
│ ├── current_bid (TEXT)               -- "€123.45" or "No bids"   │
│ ├── bid_count (INTEGER)                                          │
│ ├── closing_time (TEXT)                                          │
│ ├── viewing_time (TEXT)                                          │
│ ├── pickup_date (TEXT)                                           │
│ ├── location (TEXT)                  -- e.g. "Dongen, NL"        │
│ ├── description (TEXT)                                           │
│ ├── category (TEXT)                                              │
│ └── scraped_at (TEXT)                                            │
│ FOREIGN KEY (auction_id) → auctions(auction_id)                  │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ IMAGES TABLE (Image URLs & Download Status)                      │
├──────────────────────────────────────────────────────────────────┤
│ images                         ◀── THIS TABLE HOLDS IMAGE LINKS  │
│ ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                      │
│ ├── lot_id (TEXT)               -- FK to lots                    │
│ ├── url (TEXT)                  -- Image URL                     │
│ ├── local_path (TEXT)           -- Path after download           │
│ └── downloaded (INTEGER)        -- 0=pending, 1=downloaded       │
│ FOREIGN KEY (lot_id) → lots(lot_id)                              │
└──────────────────────────────────────────────────────────────────┘
```

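For reference, the same schema as executable DDL. This is a sketch inferred from the boxes above; the authoritative CREATE statements live in the scraper's initialization code:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url TEXT PRIMARY KEY,
    content BLOB,              -- zlib-compressed HTML
    timestamp REAL,
    status_code INTEGER,
    compressed INTEGER         -- 1=compressed, 0=plain
);
CREATE TABLE IF NOT EXISTS auctions (
    auction_id TEXT PRIMARY KEY,   -- e.g. "A7-39813"
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id TEXT PRIMARY KEY,       -- e.g. "A1-28505-5"
    auction_id TEXT REFERENCES auctions(auction_id),
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,              -- "€123.45" or "No bids"
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT REFERENCES lots(lot_id),
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0   -- 0=pending, 1=downloaded
);
"""

def init_db(path):
    """Create all tables if they do not exist and return the connection."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con
```
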
## Sequence Diagram

```
User          Scraper          Playwright      Cache DB        Data Tables
 │              │                │               │               │
 │  Run         │                │               │               │
 ├─────────────▶│                │               │               │
 │              │                │               │               │
 │              │ Phase 1: Listing Pages         │               │
 │              ├───────────────▶│               │               │
 │              │    goto()      │               │               │
 │              │◀───────────────┤               │               │
 │              │    HTML        │               │               │
 │              ├───────────────────────────────▶│               │
 │              │          compress & cache      │               │
 │              │                │               │               │
 │              │ Phase 2: Auction Pages         │               │
 │              ├───────────────▶│               │               │
 │              │◀───────────────┤               │               │
 │              │    HTML        │               │               │
 │              │                │               │               │
 │              │ Parse __NEXT_DATA__ JSON       │               │
 │              ├───────────────────────────────────────────────▶│
 │              │                │               │  INSERT auctions
 │              │                │               │               │
 │              │ Phase 3: Lot Pages             │               │
 │              ├───────────────▶│               │               │
 │              │◀───────────────┤               │               │
 │              │    HTML        │               │               │
 │              │                │               │               │
 │              │ Parse __NEXT_DATA__ JSON       │               │
 │              ├───────────────────────────────────────────────▶│
 │              │                │               │  INSERT lots  │
 │              ├───────────────────────────────────────────────▶│
 │              │                │               │  INSERT images│
 │              │                │               │               │
 │              │ Export to CSV/JSON             │               │
 │              │◀───────────────────────────────────────────────┤
 │              │          Query all data        │               │
 │◀─────────────┤                │               │               │
 │   Results    │                │               │               │
```

## Data Flow Details

### 1. **Page Retrieval & Caching**
```
Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │         │
    │         ├─[HIT]──▶ Decompress (if compressed=1)
    │         │              └──▶ Return HTML
    │         │
    │         └─[MISS]─▶ Fetch via Playwright
    │                        │
    │                        ├──▶ Compress HTML (zlib level 9)
    │                        │       ~70-90% size reduction
    │                        │
    │                        └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
```

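A minimal sketch of that cache-or-fetch path, assuming the `cache` schema above; `fetch_html` stands in for the Playwright fetch and the helper name is illustrative:

```python
import sqlite3
import time
import zlib

CACHE_DB = "/mnt/okcomputer/output/cache.db"
CACHE_TTL = 24 * 3600  # 24-hour default cache validity

def get_page(url, fetch_html):
    """Return HTML for url, serving a fresh cache entry when possible."""
    con = sqlite3.connect(CACHE_DB)
    try:
        row = con.execute(
            "SELECT content, timestamp, compressed FROM cache WHERE url = ?",
            (url,),
        ).fetchone()
        if row and time.time() - row[1] < CACHE_TTL:        # cache HIT
            content, _, compressed = row
            if compressed:
                return zlib.decompress(content).decode("utf-8")
            # legacy rows may be stored uncompressed
            return content if isinstance(content, str) else content.decode("utf-8")
        html = fetch_html(url)                              # cache MISS: fetch via Playwright
        blob = zlib.compress(html.encode("utf-8"), 9)       # level 9, ~70-90% smaller
        con.execute(
            "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
            "VALUES (?, ?, ?, ?, 1)",
            (url, blob, time.time(), 200),
        )
        con.commit()
        return html
    finally:
        con.close()
```
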
### 2. **JSON Parsing Strategy**
```
HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
              │
              ├──▶ Parse JSON
              │        │
              │        ├─[has pageProps.lot]──▶ Individual LOT
              │        │        └──▶ Extract: title, bid, location, images, etc.
              │        │
              │        └─[has pageProps.auction]──▶ AUCTION
              │                 │
              │                 ├─[has lots[] array]──▶ Auction with lots
              │                 │        └──▶ Extract: title, location, lots_count
              │                 │
              │                 └─[no lots[] array]──▶ Old format lot
              │                          └──▶ Parse as lot
              │
              └──▶ Fallback to HTML regex parsing (if JSON fails)
```

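A compact version of that dispatch in Python. The `props.pageProps` path follows standard Next.js conventions and is an assumption here, not confirmed from the scraper source:

```python
import json
import re

def parse_page(html):
    """Classify a page from its embedded __NEXT_DATA__ JSON; None means regex fallback."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if not m:
        return None
    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # caller falls back to HTML regex parsing
    props = data.get("props", {}).get("pageProps", {})
    if "lot" in props:                      # individual lot page
        return ("lot", props["lot"])
    if "auction" in props:
        auction = props["auction"]
        if auction.get("lots"):             # auction with lots[] array
            return ("auction", auction)
        return ("legacy_lot", auction)      # old-format lot, parsed as a lot
    return None
```
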
### 3. **Image Handling**
```
Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │        │
    │        └──▶ INSERT INTO images (lot_id, url, downloaded=0)
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
             │
             ├──▶ Download each image
             │        │
             │        ├──▶ Save to: /images/{lot_id}/001.jpg
             │        │
             │        └──▶ UPDATE images SET local_path=?, downloaded=1
             │
             └──▶ Rate limit between downloads (0.5s)
```

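A sketch of the download pass, assuming the `images` schema above; the per-lot `001.jpg` numbering mirrors the layout shown under Output Files, and restarting the numbering on each run is a simplification:

```python
import sqlite3
import time
import urllib.request
from pathlib import Path

IMAGES_DIR = Path("/mnt/okcomputer/output/images")
RATE_LIMIT_SECONDS = 0.5

def download_pending_images(db_path):
    """Fetch rows still marked downloaded=0 and record their local paths."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT id, lot_id, url FROM images WHERE downloaded = 0"
    ).fetchall()
    counters = {}  # per-lot sequence numbers (001.jpg, 002.jpg, ...)
    for image_id, lot_id, url in rows:
        counters[lot_id] = counters.get(lot_id, 0) + 1
        dest = IMAGES_DIR / lot_id / f"{counters[lot_id]:03d}.jpg"
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
        con.execute(
            "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
            (str(dest), image_id),
        )
        con.commit()
        time.sleep(RATE_LIMIT_SECONDS)  # rate limit between downloads
    con.close()
```
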
## Key Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |

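As module-level constants these settings might look like the following; names and defaults are taken from the table above, the layout is illustrative:

```python
CACHE_DB = "/mnt/okcomputer/output/cache.db"   # SQLite database path
IMAGES_DIR = "/mnt/okcomputer/output/images"   # downloaded images storage
RATE_LIMIT_SECONDS = 0.5                       # delay between requests
DOWNLOAD_IMAGES = False                        # toggle image downloading
MAX_PAGES = 50                                 # listing pages to crawl
```
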
## Output Files

```
/mnt/okcomputer/output/
├── cache.db                    # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json   # Exported auctions
├── auctions_{timestamp}.csv    # Exported auctions
├── lots_{timestamp}.json       # Exported lots
├── lots_{timestamp}.csv        # Exported lots
└── images/                     # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
```

## Extension Points for Integration

### 1. **Downstream Processing Pipeline**
```sql
-- Query image URLs that have not been downloaded yet
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images (OCR, classification, etc.), then mark them complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```

### 2. **Real-time Monitoring**
```sql
-- Poll for newly scraped lots (here: the last hour)
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```

### 3. **Analytics & Reporting**
```sql
-- Top locations
SELECT location, COUNT(*) AS lot_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```

### 4. **Image Processing Integration**
```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch-process downloaded images together with their lot metadata
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```

## Performance Characteristics

- **Compression**: ~70-90% HTML size reduction (1 GB → ~100-300 MB)
- **Rate Limiting**: Fixed 0.5 s delay between requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: Up to ~7,200 pages/hour at the 0.5 s rate limit
- **Scalability**: SQLite handles millions of rows efficiently

## Error Handling

- **Network failures**: Cached with status_code=500; retried after the cache entry expires
- **Parse failures**: Falls back to HTML regex patterns
- **Compression errors**: Auto-detects and handles uncompressed legacy data
- **Missing fields**: Defaults to "No bids", an empty string, or 0

## Rate Limiting & Ethics

- **REQUIRED**: 0.5 second delay between ALL requests
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as a standard browser
- **No parallelization**: Single-threaded, sequential crawling

# Quick Start Guide

Get the scraper running in minutes, without downloading YOLO models!

## Minimal Setup (No Object Detection)

The scraper works perfectly fine **without** YOLO object detection. You can run it immediately and add object detection later if needed.

### Step 1: Run the Scraper

```bash
# Using Maven
mvn clean compile exec:java -Dexec.mainClass="com.auction.scraper.TroostwijkScraper"
```

Or in IntelliJ IDEA:

1. Open `TroostwijkScraper.java`
2. Right-click the `main` method
3. Select "Run 'TroostwijkScraper.main()'"

### What You'll See

```
=== Troostwijk Auction Scraper ===

Initializing scraper...
⚠️  Object detection disabled: YOLO model files not found
   Expected files:
   - models/yolov4.cfg
   - models/yolov4.weights
   - models/coco.names
   Scraper will continue without image analysis.

[1/3] Discovering Dutch auctions...
✓ Found 5 auctions: [12345, 12346, 12347, 12348, 12349]

[2/3] Fetching lot details...
  Processing sale 12345...

[3/3] Starting monitoring service...
✓ Monitoring active. Press Ctrl+C to stop.
```

### Step 2: Test Desktop Notifications

The scraper automatically sends desktop notifications when:
- A new bid is placed on a monitored lot
- An auction is closing within 5 minutes

**No setup required** - desktop notifications work out of the box!

---

## Optional: Add Email Notifications

If you want email notifications in addition to desktop notifications:

```bash
# Set the notification environment variable
export NOTIFICATION_CONFIG="smtp:your.email@gmail.com:app_password:your.email@gmail.com"

# Then run the scraper
mvn exec:java -Dexec.mainClass="com.auction.scraper.TroostwijkScraper"
```

**Get a Gmail App Password:**

1. Enable 2FA on your Google Account
2. Go to: Google Account → Security → 2-Step Verification → App passwords
3. Generate a password for "Mail"
4. Use that password (not your regular Gmail password)

---

## Optional: Add Object Detection Later

If you want AI-powered image analysis to detect objects in auction photos:

### 1. Create models directory
```bash
mkdir models
cd models
```

### 2. Download YOLO files
```bash
# YOLOv4 config (small)
curl -O https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4.cfg

# YOLOv4 weights (245 MB - takes a few minutes)
curl -LO https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights

# COCO class names
curl -O https://raw.githubusercontent.com/AlexeyAB/darknet/master/data/coco.names
```

### 3. Run again
```bash
mvn exec:java -Dexec.mainClass="com.auction.scraper.TroostwijkScraper"
```

Now you'll see:
```
✓ Object detection enabled with YOLO
```

The scraper will now analyze auction images and detect objects such as:
- Vehicles (cars, trucks, forklifts)
- Equipment (machines, tools)
- Furniture
- Electronics
- ...and the rest of the 80 COCO object classes

---

## Features Without Object Detection

Even without YOLO, the scraper provides:

✅ **Full auction scraping** - Discovers all Dutch auctions
✅ **Lot tracking** - Monitors bids and closing times
✅ **Desktop notifications** - Real-time alerts
✅ **SQLite database** - All data persisted locally
✅ **Image downloading** - Saves all lot images
✅ **Scheduled monitoring** - Automatic updates every hour

Object detection simply adds:
- AI-powered image analysis
- Automatic object labeling
- A searchable image database

---

## Database Location

The scraper creates `troostwijk.db` in your current directory with:
- All auction data
- Lot details (title, description, bids, etc.)
- Downloaded image paths
- Object labels (if detection is enabled)

View the database with any SQLite browser:
```bash
sqlite3 troostwijk.db
.tables
SELECT * FROM lots LIMIT 5;
```

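Or query it programmatically. A minimal Python sketch; the column names follow the lots table described in the architecture notes and may need adjusting to your build:

```python
import sqlite3

con = sqlite3.connect("troostwijk.db")
# Print a small sample of scraped lots
for lot_id, title, bid in con.execute(
    "SELECT lot_id, title, current_bid FROM lots LIMIT 5"
):
    print(lot_id, title, bid)
con.close()
```
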
---

## Stopping the Scraper

Press **Ctrl+C** to stop the monitoring service.

---

## Next Steps

1. ✅ **Run the scraper** without YOLO to test it
2. ✅ **Verify desktop notifications** work
3. ⚙️ **Optional**: Add email notifications
4. ⚙️ **Optional**: Download YOLO models for object detection
5. 🔧 **Customize**: Edit monitoring frequency, closing alerts, etc.

---

## Troubleshooting

### Desktop notifications not appearing?
- **Windows**: Check whether Java has notification permissions
- **Linux**: Ensure a desktop environment is running (not headless)
- **macOS**: Check System Preferences → Notifications

### OpenCV warnings?
These are normal and can be ignored:
```
WARNING: A restricted method in java.lang.System has been called
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid warning
```

The scraper works fine despite these warnings.

---

## Full Documentation

See [README.md](../README.md) for complete documentation, including:
- Email setup details
- YOLO installation guide
- Configuration options
- Database schema
- API endpoints

# Troostwijk Auction Extractor - Run Instructions

## Fixed Warnings

All warnings have been resolved:
- ✅ SLF4J logging configured (slf4j-simple)
- ✅ Native access enabled for SQLite JDBC
- ✅ Logging output controlled via simplelogger.properties

## Prerequisites

1. **Java 21** installed
2. **Maven** installed
3. **IntelliJ IDEA** (recommended) or the command line

## Setup (First Time Only)

### 1. Install Dependencies

In the IntelliJ terminal or PowerShell:

```bash
# Reload Maven dependencies
mvn clean install

# Install Playwright browser binaries (first time only)
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```

## Running the Application

### Option A: Using IntelliJ IDEA (Easiest)

1. **Add VM options for native access:**
   - Run → Edit Configurations
   - Select or create a configuration for `TroostwijkAuctionExtractor`
   - In the "VM options" field, add:
     ```
     --enable-native-access=ALL-UNNAMED
     ```

2. **Add program arguments (optional):**
   - In the "Program arguments" field, add:
     ```
     --max-visits 3
     ```

3. **Run the application:**
   - Click the green Run button

### Option B: Using Maven (Command Line)

```bash
# Run with the arguments preconfigured in pom.xml (3-page visit limit)
mvn exec:java

# Run with custom arguments (overrides the pom.xml defaults)
mvn exec:java -Dexec.args="--max-visits 5"

# Run without cache
mvn exec:java -Dexec.args="--no-cache --max-visits 2"

# Run with unlimited visits (empty args fall back to the program default)
mvn exec:java -Dexec.args=""
```

### Option C: Using Java Directly

```bash
# Compile first
mvn clean compile

# Run with native access enabled
java --enable-native-access=ALL-UNNAMED \
     -cp target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
     com.auction.TroostwijkAuctionExtractor --max-visits 3
```

## Command Line Arguments

```
--max-visits <n>   Limit actual page fetches to n (0 = unlimited, default)
--no-cache         Disable page caching
--help             Show help message
```

## Examples

### Test with 3 page visits (cached pages don't count):
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```

### Fresh extraction without cache:
```bash
mvn exec:java -Dexec.args="--no-cache --max-visits 5"
```

### Full extraction (all pages, unlimited):
```bash
mvn exec:java -Dexec.args=""
```

## Expected Output (No Warnings)

```
=== Troostwijk Auction Extractor ===
Max page visits set to: 3

Initializing Playwright browser...
✓ Browser ready
✓ Cache database initialized

Starting auction extraction from https://www.troostwijkauctions.com/auctions

[Page 1] Fetching auctions...
  ✓ Fetched from website (visit 1/3)
  ✓ Found 20 auctions

[Page 2] Fetching auctions...
  ✓ Loaded from cache
  ✓ Found 20 auctions

[Page 3] Fetching auctions...
  ✓ Fetched from website (visit 2/3)
  ✓ Found 20 auctions

✓ Total auctions extracted: 60

=== Results ===
Total auctions found: 60
Dutch auctions (NL): 45
Actual page visits: 2

✓ Browser and cache closed
```

## Cache Management

- Cache is stored in: `cache/page_cache.db`
- Cache expires after: 24 hours (configurable in code)
- To clear the cache: delete the `cache/page_cache.db` file

## Troubleshooting

### If you still see warnings:

1. **Reload the Maven project in IntelliJ:**
   - Right-click `pom.xml` → Maven → Reload project

2. **Verify VM options:**
   - Ensure `--enable-native-access=ALL-UNNAMED` is in the VM options

3. **Clean and rebuild:**
   ```bash
   mvn clean install
   ```

### If Playwright fails:

```bash
# Reinstall browser binaries
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install chromium"
```