wiki/ARCHITECTURE.md

# Scaev - Architecture & Data Flow

## System Overview

The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website.

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                       TROOSTWIJK SCRAPER                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: COLLECT AUCTION URLs                                  │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Listing Page │────────▶│ Extract /a/  │                      │
│  │  /auctions?  │         │ auction URLs │                      │
│  │  page=1..N   │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                    [ List of Auction URLs ]                     │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Auction Page │────────▶│ Parse        │                      │
│  │   /a/...     │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│         │                 └──────────────┘                      │
│         │                        │                              │
│         ▼                        ▼                              │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Save Auction │         │ Extract /l/  │                      │
│  │ Metadata     │         │ lot URLs     │                      │
│  │ to DB        │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                      [ List of Lot URLs ]                       │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: SCRAPE LOT DETAILS                                    │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │   Lot Page   │────────▶│ Parse        │                      │
│  │   /l/...     │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│                           └──────────────┘                      │
│                                  │                              │
│         ┌────────────────────────┴─────────────────┐            │
│         ▼                                          ▼            │
│  ┌──────────────┐                          ┌──────────────┐     │
│  │ Save Lot     │                          │ Save Images  │     │
│  │ Details      │                          │ URLs to DB   │     │
│  │ to DB        │                          └──────────────┘     │
│  └──────────────┘                                 │             │
│                                                   ▼             │
│                                        [Optional Download]      │
└─────────────────────────────────────────────────────────────────┘
```

## Database Schema

```
┌──────────────────────────────────────────────────────────────────┐
│  CACHE TABLE (HTML Storage with Compression)                     │
├──────────────────────────────────────────────────────────────────┤
│  cache                                                           │
│  ├── url (TEXT, PRIMARY KEY)                                     │
│  ├── content (BLOB)           -- Compressed HTML (zlib)          │
│  ├── timestamp (REAL)                                            │
│  ├── status_code (INTEGER)                                       │
│  └── compressed (INTEGER)     -- 1=compressed, 0=plain           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  AUCTIONS TABLE                                                  │
├──────────────────────────────────────────────────────────────────┤
│  auctions                                                        │
│  ├── auction_id (TEXT, PRIMARY KEY)   -- e.g. "A7-39813"         │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── location (TEXT)                  -- e.g. "Cluj-Napoca, RO"  │
│  ├── lots_count (INTEGER)                                        │
│  ├── first_lot_closing_time (TEXT)                               │
│  └── scraped_at (TEXT)                                           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  LOTS TABLE                                                      │
├──────────────────────────────────────────────────────────────────┤
│  lots                                                            │
│  ├── lot_id (TEXT, PRIMARY KEY)       -- e.g. "A1-28505-5"       │
│  ├── auction_id (TEXT)                -- FK to auctions          │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── current_bid (TEXT)               -- "€123.45" or "No bids"  │
│  ├── bid_count (INTEGER)                                         │
│  ├── closing_time (TEXT)                                         │
│  ├── viewing_time (TEXT)                                         │
│  ├── pickup_date (TEXT)                                          │
│  ├── location (TEXT)                  -- e.g. "Dongen, NL"       │
│  ├── description (TEXT)                                          │
│  ├── category (TEXT)                                             │
│  └── scraped_at (TEXT)                                           │
│  FOREIGN KEY (auction_id) → auctions(auction_id)                 │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  IMAGES TABLE (Image URLs & Download Status)                     │
├──────────────────────────────────────────────────────────────────┤
│  images                          ◀── THIS TABLE HOLDS IMAGE LINKS│
│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
│  ├── lot_id (TEXT)                    -- FK to lots              │
│  ├── url (TEXT)                       -- Image URL               │
│  ├── local_path (TEXT)                -- Path after download     │
│  └── downloaded (INTEGER)             -- 0=pending, 1=downloaded │
│  FOREIGN KEY (lot_id) → lots(lot_id)                             │
└──────────────────────────────────────────────────────────────────┘
```
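
For reference, the schema above translates into DDL along these lines — a sketch using Python's built-in `sqlite3`, with column types taken from the diagram (the actual `CREATE TABLE` statements live in `main.py` and may differ in detail):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url         TEXT PRIMARY KEY,
    content     BLOB,               -- zlib-compressed HTML
    timestamp   REAL,
    status_code INTEGER,
    compressed  INTEGER DEFAULT 1   -- 1=compressed, 0=plain
);
CREATE TABLE IF NOT EXISTS auctions (
    auction_id             TEXT PRIMARY KEY,   -- e.g. "A7-39813"
    url                    TEXT UNIQUE,
    title                  TEXT,
    location               TEXT,
    lots_count             INTEGER,
    first_lot_closing_time TEXT,
    scraped_at             TEXT
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id       TEXT PRIMARY KEY,             -- e.g. "A1-28505-5"
    auction_id   TEXT REFERENCES auctions(auction_id),
    url          TEXT UNIQUE,
    title        TEXT,
    current_bid  TEXT,                         -- "€123.45" or "No bids"
    bid_count    INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date  TEXT,
    location     TEXT,
    description  TEXT,
    category     TEXT,
    scraped_at   TEXT
);
CREATE TABLE IF NOT EXISTS images (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id     TEXT REFERENCES lots(lot_id),
    url        TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0               -- 0=pending, 1=downloaded
);
"""

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
conn.executescript(SCHEMA)
conn.commit()
```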

## Sequence Diagram

```
User           Scraper          Playwright      Cache DB       Data Tables
 │                │                 │               │               │
 │  Run           │                 │               │               │
 ├───────────────▶│                 │               │               │
 │                │                 │               │               │
 │                │  Phase 1: Listing Pages         │               │
 │                ├────────────────▶│               │               │
 │                │     goto()      │               │               │
 │                │◀────────────────┤               │               │
 │                │      HTML       │               │               │
 │                ├────────────────────────────────▶│               │
 │                │        compress & cache         │               │
 │                │                 │               │               │
 │                │  Phase 2: Auction Pages         │               │
 │                ├────────────────▶│               │               │
 │                │◀────────────────┤               │               │
 │                │      HTML       │               │               │
 │                │                 │               │               │
 │                │  Parse __NEXT_DATA__ JSON       │               │
 │                │────────────────────────────────────────────────▶│
 │                │                 │               │  INSERT auctions
 │                │                 │               │               │
 │                │  Phase 3: Lot Pages             │               │
 │                ├────────────────▶│               │               │
 │                │◀────────────────┤               │               │
 │                │      HTML       │               │               │
 │                │                 │               │               │
 │                │  Parse __NEXT_DATA__ JSON       │               │
 │                │────────────────────────────────────────────────▶│
 │                │                 │               │  INSERT lots  │
 │                │────────────────────────────────────────────────▶│
 │                │                 │               │  INSERT images│
 │                │                 │               │               │
 │                │  Export to CSV/JSON             │               │
 │                │◀────────────────────────────────────────────────┤
 │                │         Query all data          │               │
 │◀───────────────┤                 │               │               │
 │    Results     │                 │               │               │
```

## Data Flow Details

### 1. **Page Retrieval & Caching**
```
Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │         │
    │         ├─[HIT]──▶ Decompress (if compressed=1)
    │         │              └──▶ Return HTML
    │         │
    │         └─[MISS]─▶ Fetch via Playwright
    │                        │
    │                        ├──▶ Compress HTML (zlib level 9)
    │                        │        ~70-90% size reduction
    │                        │
    │                        └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
```
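
A condensed sketch of this flow (illustrative names; the real scraper runs Playwright asynchronously, shown here with the sync API for brevity):

```python
import time
import zlib

CACHE_MAX_AGE_HOURS = 24
RATE_LIMIT_SECONDS = 0.5

def get_page(conn, page, url):
    """Return HTML for `url`, serving it from the cache when fresh."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?",
        (url,),
    ).fetchone()
    if row:
        content, ts, compressed = row
        if time.time() - ts < CACHE_MAX_AGE_HOURS * 3600:  # timestamp validation
            return zlib.decompress(content).decode("utf-8") if compressed else content

    time.sleep(RATE_LIMIT_SECONDS)            # rate limit only on cache misses
    page.goto(url, wait_until="networkidle")  # fetch via Playwright
    html = page.content()

    blob = zlib.compress(html.encode("utf-8"), level=9)  # ~70-90% smaller
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), 200),  # real code records the actual status
    )
    conn.commit()
    return html
```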

### 2. **JSON Parsing Strategy**
```
HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
              │
              ├──▶ Parse JSON
              │        │
              │        ├─[has pageProps.lot]──▶ Individual LOT
              │        │        └──▶ Extract: title, bid, location, images, etc.
              │        │
              │        └─[has pageProps.auction]──▶ AUCTION
              │                 │
              │                 ├─[has lots[] array]──▶ Auction with lots
              │                 │        └──▶ Extract: title, location, lots_count
              │                 │
              │                 └─[no lots[] array]──▶ Old format lot
              │                          └──▶ Parse as lot
              │
              └──▶ Fallback to HTML regex parsing (if JSON fails)
```
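
In code, the strategy reduces to a few branches. A sketch, assuming the standard Next.js `props.pageProps` layout described above:

```python
import json
import re

def fallback_regex_parse(html: str):
    """Placeholder for the regex-based HTML fallback described above."""
    return None

def parse_page(html: str):
    """Classify a page as lot or auction from its embedded Next.js state."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if not m:
        return fallback_regex_parse(html)
    try:
        props = json.loads(m.group(1))["props"]["pageProps"]
    except (json.JSONDecodeError, KeyError):
        return fallback_regex_parse(html)

    if "lot" in props:                   # individual lot page
        return ("lot", props["lot"])
    if "auction" in props:
        auction = props["auction"]
        if auction.get("lots"):          # auction with a lots[] array
            return ("auction", auction)
        return ("lot", auction)          # old format: parse as a lot
    return fallback_regex_parse(html)
```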

### 3. **Image Handling**
```
Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │         │
    │         └──▶ INSERT INTO images (lot_id, url, downloaded=0)
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
              │
              ├──▶ Download each image
              │        │
              │        ├──▶ Save to: /images/{lot_id}/001.jpg
              │        │
              │        └──▶ UPDATE images SET local_path=?, downloaded=1
              │
              └──▶ Rate limit between downloads (0.5s)
```
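
A sketch of the download pass, following the path scheme and columns above (the real implementation is async and shares the scraper's rate limiter; `urllib` stands in for it here):

```python
import time
import urllib.request
from pathlib import Path

IMAGES_DIR = Path("/mnt/okcomputer/output/images")
RATE_LIMIT_SECONDS = 0.5

def download_pending_images(conn):
    rows = conn.execute(
        "SELECT id, lot_id, url FROM images WHERE downloaded = 0 ORDER BY id"
    ).fetchall()
    counters = {}  # per-lot sequence numbers: 001.jpg, 002.jpg, ...
    for image_id, lot_id, url in rows:
        seq = counters[lot_id] = counters.get(lot_id, 0) + 1
        dest = IMAGES_DIR / lot_id / f"{seq:03d}.jpg"  # e.g. images/A1-28505-5/001.jpg
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
        conn.execute(
            "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
            (str(dest), image_id),
        )
        conn.commit()
        time.sleep(RATE_LIMIT_SECONDS)  # rate limit between downloads
```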

## Key Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |

## Output Files

```
/mnt/okcomputer/output/
├── cache.db                    # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json   # Exported auctions
├── auctions_{timestamp}.csv    # Exported auctions
├── lots_{timestamp}.json       # Exported lots
├── lots_{timestamp}.csv        # Exported lots
└── images/                     # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
```

## Extension Points for Integration

### 1. **Downstream Processing Pipeline**
```sql
-- Query images that have not been downloaded yet
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```
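
The same pipeline from Python — a sketch where `download_and_process` is a hypothetical hook for whatever OCR or classification step you plug in:

```python
import sqlite3

def download_and_process(url: str) -> str:
    """Hypothetical hook: fetch the image, run OCR/classification,
    and return the local path it was saved to."""
    ...

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
pending = conn.execute(
    "SELECT id, url FROM images WHERE downloaded = 0"
).fetchall()
for image_id, url in pending:
    local_path = download_and_process(url)
    conn.execute(
        "UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?",
        (local_path, image_id),
    )
conn.commit()
```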

### 2. **Real-time Monitoring**
```sql
-- Check for new lots every N minutes
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```

### 3. **Analytics & Reporting**
```sql
-- Top locations
SELECT location, COUNT(*) AS lot_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```
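
These queries can be run from Python as well; a small sketch that prints the auction statistics above:

```python
import sqlite3

QUERY = """
SELECT a.auction_id, a.title,
       COUNT(l.lot_id)                                  AS actual_lots,
       SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
"""

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
for auction_id, title, actual_lots, lots_with_bids in conn.execute(QUERY):
    print(f"{auction_id}: {actual_lots} lots, {lots_with_bids} with bids - {title}")
```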

### 4. **Image Processing Integration**
```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```

## Performance Characteristics

- **Compression**: ~70-90% HTML size reduction (1GB → ~100-300MB)
- **Rate Limiting**: Exactly 0.5s between requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: ~7,200 pages/hour (with 0.5s rate limit)
- **Scalability**: SQLite handles millions of rows efficiently

## Error Handling

- **Network failures**: Cached as status_code=500, retried after cache expiry
- **Parse failures**: Falls back to HTML regex patterns
- **Compression errors**: Auto-detects and handles uncompressed legacy data
- **Missing fields**: Defaults to "No bids", empty string, or 0
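
The legacy-data handling amounts to a guarded decompress. A sketch, mirroring the `compressed` flag in the cache schema:

```python
import zlib

def load_cached_html(content, compressed: int) -> str:
    """Decompress a cache row, tolerating uncompressed legacy entries."""
    if compressed:
        try:
            return zlib.decompress(content).decode("utf-8")
        except zlib.error:
            pass  # flag was wrong: fall through and treat as plain data
    return content.decode("utf-8") if isinstance(content, bytes) else content
```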

## Rate Limiting & Ethics

- **REQUIRED**: 0.5 second delay between ALL requests
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as a standard browser
- **No parallelization**: Single-threaded sequential crawling

wiki/Deployment.md

# Deployment

## Prerequisites

- Python 3.8+ installed
- Access to a server (Linux/Windows)
- Playwright and dependencies installed

## Production Setup

### 1. Install on Server

```bash
# Clone repository
git clone git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
playwright install chromium
playwright install-deps  # Install system dependencies
```

### 2. Configuration

Adjust the configuration constants in `main.py` (or wire them up to environment variables):

```python
# main.py configuration
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/var/troost-scraper/cache.db"
OUTPUT_DIR = "/var/troost-scraper/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```

### 3. Create Output Directories

```bash
sudo mkdir -p /var/troost-scraper/output
sudo chown $USER:$USER /var/troost-scraper
```

### 4. Run as Cron Job

Add to crontab (`crontab -e`):

```bash
# Run scraper daily at 2 AM
0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
```

## Docker Deployment (Optional)

Create a `Dockerfile`:

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps

COPY main.py .

CMD ["python", "main.py"]
```

Build and run:

```bash
docker build -t troost-scraper .
docker run -v /path/to/output:/output troost-scraper
```

## Monitoring

### Check Logs

```bash
tail -f /var/log/troost-scraper.log
```

### Monitor Output

```bash
ls -lh /var/troost-scraper/output/
```

## Troubleshooting

### Playwright Browser Issues

```bash
# Reinstall browsers
playwright install --force chromium
```

### Permission Issues

```bash
# Fix permissions
sudo chown -R $USER:$USER /var/troost-scraper
```

### Memory Issues

- Reduce `MAX_PAGES` in configuration
- Run on a machine with more RAM (Playwright needs ~1GB)

wiki/Getting-Started.md

# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation

### 1. Clone the repository

```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Install Playwright browsers

```bash
playwright install chromium
```

## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"   # Path to cache database
OUTPUT_DIR = "/path/to/output"   # Output directory
RATE_LIMIT_SECONDS = 0.5         # Delay between requests
MAX_PAGES = 50                   # Number of listing pages
```

**Windows users:** use escaped backslashes in paths, e.g. `C:\\output\\cache.db`.

## Usage

### Basic scraping

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs

### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/l/lot-url"
```

## Output

The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)

wiki/HOLISTIC.md

# Architecture

## Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Core Components

### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection

### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`

### 3. **Rate Limiter**
- Enforces exactly 0.5 seconds between requests
- Prevents server overload
- Tracks last request time globally
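
A global rate limiter of this kind can be as small as a timestamp check before each request — a sketch of the idea, not the exact implementation:

```python
import asyncio
import time

RATE_LIMIT_SECONDS = 0.5
_last_request = 0.0  # shared across all fetches

async def rate_limit() -> None:
    """Sleep just long enough to keep 0.5 s between consecutive requests."""
    global _last_request
    wait = RATE_LIMIT_SECONDS - (time.monotonic() - _last_request)
    if wait > 0:
        await asyncio.sleep(wait)
    _last_request = time.monotonic()
```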

### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions

### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking

## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
          ↓
2. For each lot URL → Check cache → If cached: use cached content
          ↓                         If not: fetch with rate limit
          ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
          ↓
4. Every 10 lots → Save progress checkpoint
          ↓
5. All lots complete → Export final JSON + CSV
```

## Key Design Decisions

### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing

### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources

## File Structure

```
troost-scraper/
├── main.py              # Main scraper logic
├── requirements.txt     # Python dependencies
├── README.md            # Documentation
├── .gitignore           # Git exclusions
└── output/              # Generated files (not in git)
    ├── cache.db         # SQLite cache
    ├── *_partial_*.json # Progress checkpoints
    ├── *_final_*.json   # Final JSON output
    └── *_final_*.csv    # Final CSV output
```

## Classes

### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries

### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
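
Wired together, the two classes are used roughly like this (a sketch based only on the signatures above; it assumes both classes are importable from `main.py` and that `crawl_auctions` is async, per the scalability notes below):

```python
import asyncio

from main import CacheManager, TroostwijkScraper  # assumed module layout

async def run() -> None:
    cache = CacheManager("output/cache.db")
    cache.clear_old(max_age_hours=7 * 24)  # drop entries older than 7 days

    scraper = TroostwijkScraper()          # constructor arguments, if any, omitted
    await scraper.crawl_auctions(max_pages=50)

asyncio.run(run())
```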

## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O

wiki/Home.md

# Scaev Wiki

Welcome to the Scaev documentation.

## Contents

- [Getting Started](Getting-Started)
- [Architecture](Architecture)
- [Deployment](Deployment)

## Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Quick Links

- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)

wiki/TESTING.md

# Testing & Migration Guide

## Overview

This guide covers:
1. Migrating existing cache to compressed format
2. Running the test suite
3. Understanding test results

## Step 1: Migrate Cache to Compressed Format

If you have an existing database with uncompressed entries (from before compression was added), run the migration script:

```bash
python migrate_compress_cache.py
```

### What it does:
- Finds all cache entries where data is uncompressed
- Compresses them using zlib (level 9)
- Reports compression statistics and space saved
- Verifies all entries are compressed
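
Conceptually, the core of the migration is a single loop — a simplified sketch; the real script also tracks the statistics shown in the report below:

```python
import sqlite3
import zlib

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
rows = conn.execute(
    "SELECT url, content FROM cache WHERE compressed = 0 OR compressed IS NULL"
).fetchall()

for url, content in rows:
    raw = content if isinstance(content, bytes) else content.encode("utf-8")
    conn.execute(
        "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
        (zlib.compress(raw, level=9), url),
    )
conn.commit()
```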

### Expected output:
```
Cache Compression Migration Tool
============================================================
Initial database size: 1024.56 MB

Found 1134 uncompressed cache entries
Starting compression...
  Compressed 100/1134 entries... (78.3% reduction so far)
  Compressed 200/1134 entries... (79.1% reduction so far)
  ...

============================================================
MIGRATION COMPLETE
============================================================
Entries compressed: 1134
Original size: 1024.56 MB
Compressed size: 198.34 MB
Space saved: 826.22 MB
Compression ratio: 80.6%
============================================================

VERIFICATION:
  Compressed entries: 1134
  Uncompressed entries: 0
  ✓ All cache entries are compressed!

Final database size: 1024.56 MB
Database size reduced by: 0.00 MB

✓ Migration complete! You can now run VACUUM to reclaim disk space:
  sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```

### Reclaim disk space:
After migration, the database file still contains the space used by the old uncompressed data. To actually reclaim the disk space:

```bash
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```

This will rebuild the database file and reduce its size significantly.

## Step 2: Run Tests

The test suite validates that auction and lot parsing works correctly using **cached data only** (no live requests to the server).

```bash
python test_scraper.py
```

### What it tests:

**Auction Pages:**
- Type detection (must be 'auction')
- auction_id extraction
- title extraction
- location extraction
- lots_count extraction
- first_lot_closing_time extraction

**Lot Pages:**
- Type detection (must be 'lot')
- lot_id extraction
- title extraction (must not be '...', 'N/A', or empty)
- location extraction (must not be 'Locatie', 'Location', or empty)
- current_bid extraction (must not be '€Huidig bod' or invalid)
- closing_time extraction
- images array extraction
- bid_count validation
- viewing_time and pickup_date (optional)

### Expected output:

```
======================================================================
TROOSTWIJK SCRAPER TEST SUITE
======================================================================

This test suite uses CACHED data only - no live requests to server
======================================================================

======================================================================
CACHE STATUS CHECK
======================================================================
Total cache entries: 1134
  Compressed: 1134 (100.0%)
  Uncompressed: 0 (0.0%)

✓ All cache entries are compressed!

======================================================================
TEST URL CACHE STATUS:
======================================================================
  ✓ https://www.troostwijkauctions.com/a/online-auction-cnc-lat...
  ✓ https://www.troostwijkauctions.com/a/faillissement-bab-sho...
  ✓ https://www.troostwijkauctions.com/a/industriele-goederen-...
  ✓ https://www.troostwijkauctions.com/l/%25282x%2529-duo-bure...
  ✓ https://www.troostwijkauctions.com/l/tos-sui-50-1000-unive...
  ✓ https://www.troostwijkauctions.com/l/rolcontainer-%25282x%...

6/6 test URLs are cached

======================================================================
TESTING AUCTIONS
======================================================================

======================================================================
Testing Auction: https://www.troostwijkauctions.com/a/online-auction...
======================================================================
  ✓ Cache hit (age: 12.3 hours)
  ✓ auction_id: A7-39813
  ✓ title: Online Auction: CNC Lathes, Machining Centres & Precision...
  ✓ location: Cluj-Napoca, RO
  ✓ first_lot_closing_time: 2024-12-05 14:30:00
  ✓ lots_count: 45

======================================================================
TESTING LOTS
======================================================================

======================================================================
Testing Lot: https://www.troostwijkauctions.com/l/%25282x%2529-duo...
======================================================================
  ✓ Cache hit (age: 8.7 hours)
  ✓ lot_id: A1-28505-5
  ✓ title: (2x) Duo Bureau - 160x168 cm
  ✓ location: Dongen, NL
  ✓ current_bid: No bids
  ✓ closing_time: 2024-12-10 16:00:00
  ✓ images: 2 images
      1. https://media.tbauctions.com/image-media/c3f9825f-e3fd...
      2. https://media.tbauctions.com/image-media/45c85ced-9c63...
  ✓ bid_count: 0
  ✓ viewing_time: 2024-12-08 09:00:00 - 2024-12-08 17:00:00
  ✓ pickup_date: 2024-12-11 09:00:00 - 2024-12-11 15:00:00

======================================================================
TEST SUMMARY
======================================================================

Total tests: 6
Passed: 6 ✓
Failed: 0 ✗
Success rate: 100.0%

======================================================================
```

## Test URLs

The test suite tests these specific URLs (you can modify them in `test_scraper.py`):

**Auctions:**
- https://www.troostwijkauctions.com/a/online-auction-cnc-lathes-machining-centres-precision-measurement-romania-A7-39813
- https://www.troostwijkauctions.com/a/faillissement-bab-shortlease-i-ii-b-v-%E2%80%93-2024-big-ass-energieopslagsystemen-A1-39557
- https://www.troostwijkauctions.com/a/industriele-goederen-uit-diverse-bedrijfsbeeindigingen-A1-38675

**Lots:**
- https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5
- https://www.troostwijkauctions.com/l/tos-sui-50-1000-universele-draaibank-A7-39568-9
- https://www.troostwijkauctions.com/l/rolcontainer-%25282x%2529-A1-40191-101

## Adding More Test Cases

To add more test URLs, edit `test_scraper.py`:

```python
TEST_AUCTIONS = [
    "https://www.troostwijkauctions.com/a/your-auction-url",
    # ... add more
]

TEST_LOTS = [
    "https://www.troostwijkauctions.com/l/your-lot-url",
    # ... add more
]
```

Then run the main scraper to cache these URLs:
```bash
python main.py
```

Then run the tests:
```bash
python test_scraper.py
```

## Troubleshooting

### "NOT IN CACHE" errors
If tests show URLs are not cached, run the main scraper first:
```bash
python main.py
```

### "Failed to decompress cache" warnings
This means you have uncompressed legacy data. Run the migration:
```bash
python migrate_compress_cache.py
```

### Tests failing with parsing errors
Check the detailed error output in the TEST SUMMARY section. It will show:
- Which field failed validation
- The actual value that was extracted
- Why it failed (empty, wrong type, invalid format)

## Cache Behavior

The test suite uses cached data with these characteristics:
- **No rate limiting** - reads from the DB instantly
- **No server load** - zero HTTP requests
- **Repeatable** - same results every time
- **Fast** - all tests run in < 5 seconds

This allows you to:
- Test parsing changes without re-scraping
- Run tests repeatedly during development
- Validate changes before deploying
- Ensure data quality without server impact

## Continuous Integration

You can integrate these tests into CI/CD:

```bash
# Run migration if needed
python migrate_compress_cache.py

# Run tests
python test_scraper.py

# Exit code: 0 = success, 1 = failure
```

## Performance Benchmarks

Based on typical HTML sizes:

| Metric | Before Compression | After Compression | Improvement |
|--------|--------------------|-------------------|-------------|
| Avg page size | 800 KB | 150 KB | 81.3% |
| 1,000 pages | 800 MB | 150 MB | 650 MB saved |
| 10,000 pages | 8 GB | 1.5 GB | 6.5 GB saved |
| DB read speed | ~50 ms | ~5 ms | 10x faster |

## Best Practices

1. **Always run migration after upgrading** to the compressed cache version
2. **Run VACUUM** after migration to reclaim disk space
3. **Run tests after major changes** to parsing logic
4. **Add test cases for edge cases** you encounter in production
5. **Keep test URLs diverse** - different auctions, lot types, languages
6. **Monitor cache hit rates** to ensure effective caching