2025-12-03 15:09:39 +01:00
parent 7fa3e4a545
commit 853c3cf53e
16 changed files with 1405 additions and 2000 deletions
--- a/wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md
+++ b/wiki/ARCHITECTURE-TROOSTWIJK-SCRAPER.md
@@ -0,0 +1,326 @@
+# Troostwijk Scraper - Architecture & Data Flow
+
+## System Overview
+
+The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website.
+
+## Architecture Diagram
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                     TROOSTWIJK SCRAPER                          │
+└─────────────────────────────────────────────────────────────────┘
+
+┌─────────────────────────────────────────────────────────────────┐
+│  PHASE 1: COLLECT AUCTION URLs                                  │
+│  ┌──────────────┐         ┌──────────────┐                      │
+│  │ Listing Page │────────▶│ Extract /a/  │                      │
+│  │ /auctions?   │         │ auction URLs │                      │
+│  │ page=1..N    │         └──────────────┘                      │
+│  └──────────────┘                │                              │
+│                                   ▼                             │
+│                        [ List of Auction URLs ]                 │
+└─────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
+│  ┌──────────────┐         ┌──────────────┐                      │
+│  │ Auction Page │────────▶│ Parse        │                      │
+│  │ /a/...       │         │ __NEXT_DATA__│                      │
+│  └──────────────┘         │ JSON         │                      │
+│         │                 └──────────────┘                      │
+│         │                        │                              │
+│         ▼                        ▼                              │
+│  ┌──────────────┐         ┌──────────────┐                      │
+│  │ Save Auction │         │ Extract /l/  │                      │
+│  │ Metadata     │         │ lot URLs     │                      │
+│  │ to DB        │         └──────────────┘                      │
+│  └──────────────┘                │                              │
+│                                   ▼                             │
+│                          [ List of Lot URLs ]                   │
+└─────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  PHASE 3: SCRAPE LOT DETAILS                                    │
+│  ┌──────────────┐         ┌──────────────┐                      │
+│  │ Lot Page     │────────▶│ Parse        │                      │
+│  │ /l/...       │         │ __NEXT_DATA__│                      │
+│  └──────────────┘         │ JSON         │                      │
+│                           └──────────────┘                      │
+│                                   │                             │
+│         ┌─────────────────────────┴─────────────────┐           │
+│         ▼                                           ▼           │
+│  ┌──────────────┐                          ┌──────────────┐     │
+│  │ Save Lot     │                          │ Save Images  │     │
+│  │ Details      │                          │ URLs to DB   │     │
+│  │ to DB        │                          └──────────────┘     │
+│  └──────────────┘                                  │            │
+│                                                    ▼            │
+│                                          [Optional Download]    │
+└─────────────────────────────────────────────────────────────────┘
+```
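
The three phases in the diagram can be sketched as a small driver. This is a hypothetical outline, not the scraper's own code: `crawl`, `unique`, and both regexes are assumptions, `fetch` is injected so the sketch runs without Playwright, and matching `href` attributes stands in for the `__NEXT_DATA__` parsing that the diagram describes.

```python
import re
from typing import Callable, Dict, List

# Hypothetical patterns: the diagram only says auction URLs contain /a/ and lot URLs /l/.
AUCTION_RE = re.compile(r'href="(/a/[^"#?]+)"')
LOT_RE = re.compile(r'href="(/l/[^"#?]+)"')


def unique(items: List[str]) -> List[str]:
    """De-duplicate while keeping first-seen order."""
    return list(dict.fromkeys(items))


def crawl(fetch: Callable[[str], str], base: str, max_pages: int) -> Dict[str, List[str]]:
    """Run phases 1-3 and return {auction_path: [lot_path, ...]}."""
    # Phase 1: collect auction URLs from the paginated listing.
    auctions: List[str] = []
    for page in range(1, max_pages + 1):
        html = fetch(f"{base}/auctions?page={page}")
        auctions += AUCTION_RE.findall(html)

    result: Dict[str, List[str]] = {}
    for auction in unique(auctions):
        # Phase 2: each auction page yields its lot URLs.
        lots = unique(LOT_RE.findall(fetch(base + auction)))
        # Phase 3: visit every lot page (parsing and saving are omitted here).
        for lot in lots:
            fetch(base + lot)
        result[auction] = lots
    return result
```

Passing `fetch` as a parameter keeps the crawl order testable with canned HTML and leaves caching and rate limiting to whatever function is supplied.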
+
+## Database Schema
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│  CACHE TABLE (HTML Storage with Compression)                     │
+├──────────────────────────────────────────────────────────────────┤
+│  cache                                                            │
+│  ├── url (TEXT, PRIMARY KEY)                                     │
+│  ├── content (BLOB)              -- Compressed HTML (zlib)       │
+│  ├── timestamp (REAL)                                            │
+│  ├── status_code (INTEGER)                                       │
+│  └── compressed (INTEGER)        -- 1=compressed, 0=plain        │
+└──────────────────────────────────────────────────────────────────┘
+
+┌──────────────────────────────────────────────────────────────────┐
+│  AUCTIONS TABLE                                                   │
+├──────────────────────────────────────────────────────────────────┤
+│  auctions                                                         │
+│  ├── auction_id (TEXT, PRIMARY KEY)  -- e.g. "A7-39813"         │
+│  ├── url (TEXT, UNIQUE)                                          │
+│  ├── title (TEXT)                                                │
+│  ├── location (TEXT)                 -- e.g. "Cluj-Napoca, RO"  │
+│  ├── lots_count (INTEGER)                                        │
+│  ├── first_lot_closing_time (TEXT)                              │
+│  └── scraped_at (TEXT)                                           │
+└──────────────────────────────────────────────────────────────────┘
+
+┌──────────────────────────────────────────────────────────────────┐
+│  LOTS TABLE                                                       │
+├──────────────────────────────────────────────────────────────────┤
+│  lots                                                             │
+│  ├── lot_id (TEXT, PRIMARY KEY)      -- e.g. "A1-28505-5"       │
+│  ├── auction_id (TEXT)               -- FK to auctions          │
+│  ├── url (TEXT, UNIQUE)                                          │
+│  ├── title (TEXT)                                                │
+│  ├── current_bid (TEXT)              -- "€123.45" or "No bids"  │
+│  ├── bid_count (INTEGER)                                         │
+│  ├── closing_time (TEXT)                                         │
+│  ├── viewing_time (TEXT)                                         │
+│  ├── pickup_date (TEXT)                                          │
+│  ├── location (TEXT)                 -- e.g. "Dongen, NL"       │
+│  ├── description (TEXT)                                          │
+│  ├── category (TEXT)                                             │
+│  └── scraped_at (TEXT)                                           │
+│      FOREIGN KEY (auction_id) → auctions(auction_id)             │
+└──────────────────────────────────────────────────────────────────┘
+
+┌──────────────────────────────────────────────────────────────────┐
+│  IMAGES TABLE (Image URLs & Download Status)                     │
+├──────────────────────────────────────────────────────────────────┤
+│  images                          ◀── THIS TABLE HOLDS IMAGE LINKS│
+│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
+│  ├── lot_id (TEXT)               -- FK to lots                  │
+│  ├── url (TEXT)                  -- Image URL                   │
+│  ├── local_path (TEXT)           -- Path after download         │
+│  └── downloaded (INTEGER)        -- 0=pending, 1=downloaded     │
+│      FOREIGN KEY (lot_id) → lots(lot_id)                         │
+└──────────────────────────────────────────────────────────────────┘
+```
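
The schema diagram can be read as the following SQLite DDL. This is a sketch reconstructed from the diagram only: the scraper's actual `CREATE TABLE` statements may differ, `create_schema` is a hypothetical helper name, and the `DEFAULT 0` on `downloaded` is an assumption.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url TEXT PRIMARY KEY,
    content BLOB,                 -- zlib-compressed HTML
    timestamp REAL,
    status_code INTEGER,
    compressed INTEGER            -- 1=compressed, 0=plain
);
CREATE TABLE IF NOT EXISTS auctions (
    auction_id TEXT PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT,
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0, -- assumed default; 0=pending, 1=downloaded
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the four tables from the diagram if they do not exist yet."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

SQLite enforces the foreign keys only on connections that run `PRAGMA foreign_keys = ON`; without it the `FOREIGN KEY` clauses are documentation only.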
+
+## Sequence Diagram
+
+```
+User          Scraper         Playwright      Cache DB        Data Tables
+ │               │                │               │                │
+ │  Run          │                │               │                │
+ ├──────────────▶│                │               │                │
+ │               │                │               │                │
+ │               │ Phase 1: Listing Pages         │                │
+ │               ├───────────────▶│               │                │
+ │               │   goto()       │               │                │
+ │               │◀───────────────┤               │                │
+ │               │   HTML         │               │                │
+ │               ├───────────────────────────────▶│                │
+ │               │   compress & cache             │                │
+ │               │                │               │                │
+ │               │ Phase 2: Auction Pages         │                │
+ │               ├───────────────▶│               │                │
+ │               │◀───────────────┤               │                │
+ │               │   HTML         │               │                │
+ │               │                │               │                │
+ │               │ Parse __NEXT_DATA__ JSON       │                │
+ │               │────────────────────────────────────────────────▶│
+ │               │                │               │   INSERT auctions
+ │               │                │               │                │
+ │               │ Phase 3: Lot Pages             │                │
+ │               ├───────────────▶│               │                │
+ │               │◀───────────────┤               │                │
+ │               │   HTML         │               │                │
+ │               │                │               │                │
+ │               │ Parse __NEXT_DATA__ JSON       │                │
+ │               │────────────────────────────────────────────────▶│
+ │               │                │               │   INSERT lots  │
+ │               │────────────────────────────────────────────────▶│
+ │               │                │               │   INSERT images│
+ │               │                │               │                │
+ │               │ Export to CSV/JSON             │                │
+ │               │◀────────────────────────────────────────────────┤
+ │               │   Query all data               │                │
+ │◀──────────────┤                │               │                │
+ │   Results     │                │               │                │
+```
+
+## Data Flow Details
+
+### 1. **Page Retrieval & Caching**
+```
+Request URL
+    │
+    ├──▶ Check cache DB (with timestamp validation)
+    │    │
+    │    ├─[HIT]──▶ Decompress (if compressed=1)
+    │    │          └──▶ Return HTML
+    │    │
+    │    └─[MISS]─▶ Fetch via Playwright
+    │               │
+    │               ├──▶ Compress HTML (zlib level 9)
+    │               │    ~70-90% size reduction
+    │               │
+    │               └──▶ Store in cache DB (compressed=1)
+    │
+    └──▶ Return HTML for parsing
+```
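
A minimal sketch of the cache lookup and store steps, assuming the `cache` table layout from the schema section. The helper names (`open_cache`, `cache_put`, `cache_get`) are hypothetical, and the 24-hour TTL is taken from the Performance Characteristics section.

```python
import sqlite3
import time
import zlib
from typing import Optional

CACHE_TTL_SECONDS = 24 * 3600  # 24-hour default validity


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache database and make sure the cache table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, content BLOB, "
        "timestamp REAL, status_code INTEGER, compressed INTEGER)"
    )
    return conn


def cache_put(conn: sqlite3.Connection, url: str, html: str, status_code: int = 200) -> None:
    """Compress the HTML at zlib level 9 and store it with compressed=1."""
    blob = zlib.compress(html.encode("utf-8"), 9)
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), status_code),
    )
    conn.commit()


def cache_get(conn: sqlite3.Connection, url: str,
              ttl: float = CACHE_TTL_SECONDS) -> Optional[str]:
    """Return cached HTML, or None on a miss or an expired entry."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None or time.time() - row[1] > ttl:
        return None
    content, _, compressed = row
    if compressed:
        return zlib.decompress(content).decode("utf-8")
    # Legacy rows stored without compression may be bytes or text.
    return content.decode("utf-8") if isinstance(content, bytes) else content
```

On a `None` result the caller would fetch the page via Playwright and pass the HTML to `cache_put`.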
+
+### 2. **JSON Parsing Strategy**
+```
+HTML Content
+    │
+    └──▶ Extract <script id="__NEXT_DATA__">
+         │
+         ├──▶ Parse JSON
+         │    │
+         │    ├─[has pageProps.lot]──▶ Individual LOT
+         │    │   └──▶ Extract: title, bid, location, images, etc.
+         │    │
+         │    └─[has pageProps.auction]──▶ AUCTION
+         │        │
+         │        ├─[has lots[] array]──▶ Auction with lots
+         │        │   └──▶ Extract: title, location, lots_count
+         │        │
+         │        └─[no lots[] array]──▶ Old format lot
+         │            └──▶ Parse as lot
+         │
+         └──▶ Fallback to HTML regex parsing (if JSON fails)
+```
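
The decision tree above could look roughly like this in Python. It is a sketch under assumptions: `classify_page` is a hypothetical name, and `pageProps` is assumed to sit under a top-level `props` key, as is usual for Next.js pages. The regex fallback itself is not shown, only the signal that it is needed.

```python
import json
import re
from typing import Optional, Tuple

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def classify_page(html: str) -> Tuple[str, Optional[dict]]:
    """Return ("lot" | "auction" | "fallback", payload) following the tree above."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return "fallback", None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return "fallback", None
    if not isinstance(data, dict):
        return "fallback", None

    # Assumed location of pageProps inside the JSON document.
    page_props = (data.get("props") or {}).get("pageProps") or {}
    if "lot" in page_props:
        return "lot", page_props["lot"]
    auction = page_props.get("auction")
    if isinstance(auction, dict):
        # An auction object without a lots[] array is treated as an old-format lot.
        kind = "auction" if isinstance(auction.get("lots"), list) else "lot"
        return kind, auction
    return "fallback", None
```

A `"fallback"` result tells the caller to switch to the HTML regex parser.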
+
+### 3. **Image Handling**
+```
+Lot Page Parsed
+    │
+    ├──▶ Extract images[] from JSON
+    │    │
+    │    └──▶ INSERT INTO images (lot_id, url, downloaded=0)
+    │
+    └──▶ [If DOWNLOAD_IMAGES=True]
+         │
+         ├──▶ Download each image
+         │    │
+    │    ├──▶ Save to: {IMAGES_DIR}/{lot_id}/001.jpg
+         │    │
+         │    └──▶ UPDATE images SET local_path=?, downloaded=1
+         │
+         └──▶ Rate limit between downloads (0.5s)
+```
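
The bookkeeping side of image handling can be sketched as below, assuming an `images` table shaped like the one in the schema section already exists. All three function names are hypothetical, the `.jpg` extension is taken from the example paths in this document, and the download itself is left out.

```python
import sqlite3
from pathlib import PurePosixPath
from typing import List


def save_image_urls(conn: sqlite3.Connection, lot_id: str, urls: List[str]) -> None:
    """Insert one pending row (downloaded=0) per image URL."""
    conn.executemany(
        "INSERT INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
        [(lot_id, u) for u in urls],
    )
    conn.commit()


def local_image_path(images_dir: str, lot_id: str, index: int) -> str:
    """Build <images_dir>/<lot_id>/001.jpg style paths from a 1-based index."""
    return str(PurePosixPath(images_dir) / lot_id / f"{index:03d}.jpg")


def mark_downloaded(conn: sqlite3.Connection, image_id: int, path: str) -> None:
    """Record where the file was saved and flip the downloaded flag."""
    conn.execute(
        "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
        (path, image_id),
    )
    conn.commit()
```

Storing the URL first and updating the row after the download means an interrupted run can resume from the rows still marked `downloaded = 0`.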
+
+## Key Configuration
+
+| Setting | Value | Purpose |
+|---------|-------|---------|
+| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
+| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
+| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
+| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
+| `MAX_PAGES` | `50` | Number of listing pages to crawl |
+
+## Output Files
+
+```
+/mnt/okcomputer/output/
+├── cache.db                              # SQLite database (compressed HTML + data)
+├── auctions_{timestamp}.json             # Exported auctions
+├── auctions_{timestamp}.csv              # Exported auctions
+├── lots_{timestamp}.json                 # Exported lots
+├── lots_{timestamp}.csv                  # Exported lots
+└── images/                               # Downloaded images (if enabled)
+    ├── A1-28505-5/
+    │   ├── 001.jpg
+    │   └── 002.jpg
+    └── A1-28505-6/
+        └── 001.jpg
+```
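
One way the timestamped export files could be produced is sketched here. `export_table` is a hypothetical helper, and the `%Y%m%d_%H%M%S` timestamp format is an assumption, since the document only shows `{timestamp}`.

```python
import csv
import json
import sqlite3
import time
from pathlib import Path
from typing import Tuple


def export_table(conn: sqlite3.Connection, table: str, out_dir: str) -> Tuple[Path, Path]:
    """Write <table>_<timestamp>.json and .csv into out_dir and return both paths."""
    # Table names cannot be bound as SQL parameters, so restrict them explicitly.
    if table not in ("auctions", "lots"):
        raise ValueError(f"unexpected table: {table}")
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()

    stamp = time.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / f"{table}_{stamp}.json"
    json_path.write_text(
        json.dumps([dict(zip(columns, r)) for r in rows], ensure_ascii=False, indent=2),
        encoding="utf-8",
    )

    csv_path = out / f"{table}_{stamp}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(columns)
        writer.writerows(rows)
    return json_path, csv_path
```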
+
+## Extension Points for Integration
+
+### 1. **Downstream Processing Pipeline**
+```sql
+-- Images that have not been downloaded yet
+SELECT id, lot_id, url FROM images WHERE downloaded = 0;
+
+-- Process images: OCR, classification, etc.
+-- Update status when complete
+UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
+```
+
+### 2. **Real-time Monitoring**
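+```sql
+-- Lots scraped in the last hour (run every N minutes to detect new lots).
+-- Assumes scraped_at is stored as UTC text in 'YYYY-MM-DD HH:MM:SS' form.
+SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');
+
+-- Monitor bid changes
+SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
+```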
+
+### 3. **Analytics & Reporting**
+```sql
+-- Top locations
+SELECT location, COUNT(*) AS lot_count
+FROM lots
+GROUP BY location
+ORDER BY lot_count DESC;
+
+-- Auction statistics
+SELECT
+    a.auction_id,
+    a.title,
+    COUNT(l.lot_id) AS actual_lots,
+    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
+FROM auctions a
+LEFT JOIN lots l ON a.auction_id = l.auction_id
+GROUP BY a.auction_id;
+```
+
+### 4. **Image Processing Integration**
+```sql
+-- Get all images for a lot
+SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';
+
+-- Downloaded images ready for batch processing, with lot context
+SELECT i.id, i.lot_id, i.local_path, l.title, l.category
+FROM images i
+JOIN lots l ON i.lot_id = l.lot_id
+WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
+```
+
+## Performance Characteristics
+
+- **Compression**: ~70-90% HTML size reduction (1GB → ~100-300MB)
+- **Rate Limiting**: At least 0.5s between requests (respectful scraping)
+- **Caching**: 24-hour default cache validity (configurable)
+- **Throughput**: Up to ~7,200 pages/hour (3,600s / 0.5s delay; page load time lowers this)
+- **Scalability**: SQLite handles millions of rows efficiently
+
+## Error Handling
+
+- **Network failures**: Cached with status_code=500 and retried after the cache entry expires
+- **Parse failures**: Falls back to HTML regex patterns
+- **Compression errors**: Auto-detects and handles uncompressed legacy data
+- **Missing fields**: Defaults to "No bids", empty string, or 0
+
+## Rate Limiting & Ethics
+
+- **REQUIRED**: 0.5 second delay between ALL requests
+- **Respects cache**: Avoids unnecessary re-fetching
+- **User-Agent**: Identifies as a standard browser
+- **No parallelization**: Single-threaded sequential crawling
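
The delay rule above can be enforced with a small helper like the one below. This is a sketch, not the scraper's implementation: `RateLimiter` is a hypothetical class, and the injectable `clock` and `sleep` parameters exist only so the behaviour can be tested without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float = 0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first one

    def wait(self) -> float:
        """Sleep just long enough to honour the interval; return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `wait()` before every request in a single-threaded loop gives at least 0.5s between requests, which caps throughput at 3600 / 0.5 = 7,200 requests per hour.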