Tour
2025-12-04 14:53:55 +01:00
parent 79e14be37a
commit b12f3a5ee2
6 changed files with 923 additions and 0 deletions

_wiki/ARCHITECTURE.md Normal file

@@ -0,0 +1,326 @@
# Scaev - Architecture & Data Flow
## System Overview
The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website.
## Architecture Diagram
```
TROOSTWIJK SCRAPER

PHASE 1: COLLECT AUCTION URLs
  Listing pages (/auctions?page=1..N)
      └──▶ Extract /a/ auction URLs
                └──▶ [ List of Auction URLs ]

PHASE 2: EXTRACT LOT URLs FROM AUCTIONS
  Auction page (/a/...)
      └──▶ Parse __NEXT_DATA__ JSON
                ├──▶ Save auction metadata to DB
                └──▶ Extract /l/ lot URLs
                          └──▶ [ List of Lot URLs ]

PHASE 3: SCRAPE LOT DETAILS
  Lot page (/l/...)
      └──▶ Parse __NEXT_DATA__ JSON
                ├──▶ Save lot details to DB
                └──▶ Save image URLs to DB
                          └──▶ [ Optional image download ]
```
## Database Schema
```sql
-- CACHE TABLE (HTML storage with compression)
CREATE TABLE cache (
    url         TEXT PRIMARY KEY,
    content     BLOB,        -- compressed HTML (zlib)
    timestamp   REAL,
    status_code INTEGER,
    compressed  INTEGER      -- 1 = compressed, 0 = plain
);

-- AUCTIONS TABLE
CREATE TABLE auctions (
    auction_id             TEXT PRIMARY KEY,  -- e.g. "A7-39813"
    url                    TEXT UNIQUE,
    title                  TEXT,
    location               TEXT,              -- e.g. "Cluj-Napoca, RO"
    lots_count             INTEGER,
    first_lot_closing_time TEXT,
    scraped_at             TEXT
);

-- LOTS TABLE
CREATE TABLE lots (
    lot_id       TEXT PRIMARY KEY,  -- e.g. "A1-28505-5"
    auction_id   TEXT,              -- FK to auctions
    url          TEXT UNIQUE,
    title        TEXT,
    current_bid  TEXT,              -- "€123.45" or "No bids"
    bid_count    INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date  TEXT,
    location     TEXT,              -- e.g. "Dongen, NL"
    description  TEXT,
    category     TEXT,
    scraped_at   TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);

-- IMAGES TABLE (image URLs & download status)
CREATE TABLE images (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id     TEXT,      -- FK to lots
    url        TEXT,      -- image URL
    local_path TEXT,      -- path after download
    downloaded INTEGER,   -- 0 = pending, 1 = downloaded
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
```
## Sequence Diagram
```
User Scraper Playwright Cache DB Data Tables
│ │ │ │ │
│ Run │ │ │ │
├──────────────▶│ │ │ │
│ │ │ │ │
│ │ Phase 1: Listing Pages │ │
│ ├───────────────▶│ │ │
│ │ goto() │ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ ├───────────────────────────────▶│ │
│ │ compress & cache │ │
│ │ │ │ │
│ │ Phase 2: Auction Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT auctions
│ │ │ │ │
│ │ Phase 3: Lot Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT lots │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT images│
│ │ │ │ │
│ │ Export to CSV/JSON │ │
│ │◀────────────────────────────────────────────────┤
│ │ Query all data │ │
│◀──────────────┤ │ │ │
│ Results │ │ │ │
```
## Data Flow Details
### 1. **Page Retrieval & Caching**
```
Request URL
├──▶ Check cache DB (with timestamp validation)
│ │
│ ├─[HIT]──▶ Decompress (if compressed=1)
│ │ └──▶ Return HTML
│ │
│ └─[MISS]─▶ Fetch via Playwright
│ │
│ ├──▶ Compress HTML (zlib level 9)
│ │ ~70-90% size reduction
│ │
│ └──▶ Store in cache DB (compressed=1)
└──▶ Return HTML for parsing
```
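In code, this cache path might look roughly like the following sketch, written against the `cache` table from the schema above; the function names and the freshness check are illustrative, not the scraper's actual API.

```python
# Minimal sketch of the cache lookup/store path above, against the `cache`
# table from the schema. Function names and freshness handling are illustrative.
import sqlite3
import time
import zlib
from typing import Optional


def get_cached(conn: sqlite3.Connection, url: str, max_age_hours: float = 24) -> Optional[str]:
    """Return cached HTML for `url` if present and still fresh, else None."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None                                # cache miss
    content, ts, compressed = row
    if time.time() - ts > max_age_hours * 3600:
        return None                                # stale entry, treat as miss
    if compressed:
        return zlib.decompress(content).decode("utf-8")
    return content if isinstance(content, str) else content.decode("utf-8")


def set_cached(conn: sqlite3.Connection, url: str, html: str, status_code: int = 200) -> None:
    """Compress the HTML with zlib level 9 and upsert it into the cache table."""
    blob = zlib.compress(html.encode("utf-8"), level=9)
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), status_code),
    )
    conn.commit()
```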
### 2. **JSON Parsing Strategy**
```
HTML Content
└──▶ Extract <script id="__NEXT_DATA__">
├──▶ Parse JSON
│ │
│ ├─[has pageProps.lot]──▶ Individual LOT
│ │ └──▶ Extract: title, bid, location, images, etc.
│ │
│ └─[has pageProps.auction]──▶ AUCTION
│ │
│ ├─[has lots[] array]──▶ Auction with lots
│ │ └──▶ Extract: title, location, lots_count
│ │
│ └─[no lots[] array]──▶ Old format lot
│ └──▶ Parse as lot
└──▶ Fallback to HTML regex parsing (if JSON fails)
```
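A hedged sketch of that branching logic follows. It assumes the standard Next.js `props.pageProps` nesting around the `lot`/`auction` keys named in the diagram; the exact payload shape on the live site may differ.

```python
# Hedged sketch of the branching above; payload nesting beyond pageProps.lot /
# pageProps.auction is an assumption.
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL)


def classify_page(html: str) -> Optional[dict]:
    """Return {'type': 'lot'|'auction', 'payload': ...}, or None to trigger the regex fallback."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    props = data.get("props", {}).get("pageProps", {})
    if "lot" in props:
        return {"type": "lot", "payload": props["lot"]}        # individual lot page
    if "auction" in props:
        auction = props["auction"]
        if auction.get("lots"):                                # auction with lots[] array
            return {"type": "auction", "payload": auction}
        return {"type": "lot", "payload": auction}             # old-format lot page
    return None
```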
### 3. **Image Handling**
```
Lot Page Parsed
├──▶ Extract images[] from JSON
│ │
│ └──▶ INSERT INTO images (lot_id, url, downloaded=0)
└──▶ [If DOWNLOAD_IMAGES=True]
├──▶ Download each image
│ │
│ ├──▶ Save to: /images/{lot_id}/001.jpg
│ │
│ └──▶ UPDATE images SET local_path=?, downloaded=1
└──▶ Rate limit between downloads (0.5s)
```
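If `DOWNLOAD_IMAGES` is enabled, the download loop could look like this sketch. The `images` table comes from the schema above; the sequential-filename logic and the use of `urllib` are assumptions rather than the scraper's confirmed implementation.

```python
# Illustrative sketch of the optional image download step.
import os
import sqlite3
import time
import urllib.request


def download_pending_images(conn: sqlite3.Connection, images_dir: str, delay: float = 0.5) -> None:
    rows = conn.execute("SELECT id, lot_id, url FROM images WHERE downloaded = 0").fetchall()
    for image_id, lot_id, url in rows:
        lot_dir = os.path.join(images_dir, lot_id)
        os.makedirs(lot_dir, exist_ok=True)
        seq = len(os.listdir(lot_dir)) + 1                     # 001.jpg, 002.jpg, ...
        local_path = os.path.join(lot_dir, f"{seq:03d}.jpg")
        urllib.request.urlretrieve(url, local_path)            # fetch the image
        conn.execute(
            "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
            (local_path, image_id),
        )
        conn.commit()
        time.sleep(delay)                                      # rate limit between downloads
```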
## Key Configuration
| Setting | Value | Purpose |
|---------|-------|---------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
## Output Files
```
/mnt/okcomputer/output/
├── cache.db # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json # Exported auctions
├── auctions_{timestamp}.csv # Exported auctions
├── lots_{timestamp}.json # Exported lots
├── lots_{timestamp}.csv # Exported lots
└── images/ # Downloaded images (if enabled)
├── A1-28505-5/
│ ├── 001.jpg
│ └── 002.jpg
└── A1-28505-6/
└── 001.jpg
```
## Extension Points for Integration
### 1. **Downstream Processing Pipeline**
```sql
-- Query lots without downloaded images
SELECT lot_id, url FROM images WHERE downloaded = 0;
-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```
### 2. **Real-time Monitoring**
```sql
-- Check for new lots every N minutes
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');
-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```
### 3. **Analytics & Reporting**
```sql
-- Top locations
SELECT location, COUNT(*) as lot_count FROM lots GROUP BY location;
-- Auction statistics
SELECT
a.auction_id,
a.title,
COUNT(l.lot_id) as actual_lots,
SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) as lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```
### 4. **Image Processing Integration**
```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';
-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```
## Performance Characteristics
- **Compression**: ~70-90% HTML size reduction (1GB → ~100-300MB)
- **Rate Limiting**: Exactly 0.5s between requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: ~7,200 pages/hour (with 0.5s rate limit)
- **Scalability**: SQLite handles millions of rows efficiently
## Error Handling
- **Network failures**: Cached as status_code=500, retry after cache expiry
- **Parse failures**: Falls back to HTML regex patterns
- **Compression errors**: Auto-detects and handles uncompressed legacy data
- **Missing fields**: Default to "No bids", an empty string, or 0 (see the sketch below)
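A small sketch of that defaulting behaviour, assuming the parsed `__NEXT_DATA__` payload is a dict; the payload key names (`currentBid`, `bidCount`, ...) are illustrative, not confirmed field names.

```python
# Hedged sketch: fall back to safe defaults instead of raising when a field is
# absent. Payload key names are illustrative.
def lot_fields(payload: dict) -> dict:
    return {
        "title": payload.get("title") or "",
        "current_bid": payload.get("currentBid") or "No bids",
        "bid_count": int(payload.get("bidCount") or 0),
        "location": payload.get("location") or "",
    }
```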
## Rate Limiting & Ethics
- **REQUIRED**: 0.5 second delay between ALL requests
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as standard browser
- **No parallelization**: Single-threaded sequential crawling

_wiki/Deployment.md Normal file

@@ -0,0 +1,122 @@
# Deployment
## Prerequisites
- Python 3.8+ installed
- Access to a server (Linux/Windows)
- Playwright and dependencies installed
## Production Setup
### 1. Install on Server
```bash
# Clone repository
git clone git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
playwright install chromium
playwright install-deps # Install system dependencies
```
### 2. Configuration
Create a configuration file or set environment variables:
```python
# main.py configuration
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/var/troost-scraper/cache.db"
OUTPUT_DIR = "/var/troost-scraper/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```
### 3. Create Output Directories
```bash
sudo mkdir -p /var/troost-scraper/output
sudo chown $USER:$USER /var/troost-scraper
```
### 4. Run as Cron Job
Add to crontab (`crontab -e`):
```bash
# Run scraper daily at 2 AM
0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
```
## Docker Deployment (Optional)
Create `Dockerfile`:
```dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
wget \
gnupg \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps
COPY main.py .
CMD ["python", "main.py"]
```
Build and run:
```bash
docker build -t troost-scraper .
docker run -v /path/to/output:/output troost-scraper
```
## Monitoring
### Check Logs
```bash
tail -f /var/log/troost-scraper.log
```
### Monitor Output
```bash
ls -lh /var/troost-scraper/output/
```
## Troubleshooting
### Playwright Browser Issues
```bash
# Reinstall browsers
playwright install --force chromium
```
### Permission Issues
```bash
# Fix permissions
sudo chown -R $USER:$USER /var/troost-scraper
```
### Memory Issues
- Reduce `MAX_PAGES` in configuration
- Run on machine with more RAM (Playwright needs ~1GB)

_wiki/Getting-Started.md Normal file

@@ -0,0 +1,71 @@
# Getting Started
## Prerequisites
- Python 3.8+
- Git
- pip (Python package manager)
## Installation
### 1. Clone the repository
```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
### 3. Install Playwright browsers
```bash
playwright install chromium
```
## Configuration
Edit the configuration in `main.py`:
```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db" # Path to cache database
OUTPUT_DIR = "/path/to/output" # Output directory
RATE_LIMIT_SECONDS = 0.5 # Delay between requests
MAX_PAGES = 50 # Number of listing pages
```
**Windows users:** Use paths like `C:\\output\\cache.db`
## Usage
### Basic scraping
```bash
python main.py
```
This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs
### Test mode
Debug extraction on a specific URL:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output
The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)

_wiki/HOLISTIC.md Normal file

@@ -0,0 +1,107 @@
# Architecture
## Overview
The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Core Components
### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
### 3. **Rate Limiter**
- Enforces exactly 0.5 seconds between requests
- Prevents server overload
- Tracks the last request time globally (see the sketch below)
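A minimal sketch of such a limiter for a single-threaded async scraper; the class and method names are illustrative.

```python
# Keeps at least min_interval between consecutive requests.
import asyncio
import time


class RateLimiter:
    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_request = 0.0

    async def wait(self) -> None:
        """Sleep just long enough to honour the minimum interval, then record the request time."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```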
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
## Data Flow
```
1. Listing Pages → Extract lot URLs → Store in memory
2. For each lot URL → Check cache → If cached: use cached content
↓ If not: fetch with rate limit
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
4. Every 10 lots → Save progress checkpoint
5. All lots complete → Export final JSON + CSV
```
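The checkpoint cadence in step 4 might be implemented along these lines. The partial-file naming mirrors the `*_partial_*.json` pattern listed under File Structure; everything else here is an assumption.

```python
# Sketch of the "every 10 lots, save a checkpoint" step.
import json
import os
import time


def maybe_checkpoint(results: list, output_dir: str, every: int = 10) -> None:
    """Dump accumulated results to a timestamped partial file every `every` lots."""
    if results and len(results) % every == 0:
        stamp = time.strftime("%Y%m%d_%H%M%S")
        path = os.path.join(output_dir, f"troostwijk_lots_partial_{stamp}.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(results, fh, ensure_ascii=False, indent=2)
```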
## Key Design Decisions
### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs
### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing
### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources
## File Structure
```
troost-scraper/
├── main.py # Main scraper logic
├── requirements.txt # Python dependencies
├── README.md # Documentation
├── .gitignore # Git exclusions
└── output/ # Generated files (not in git)
├── cache.db # SQLite cache
├── *_partial_*.json # Progress checkpoints
├── *_final_*.json # Final JSON output
└── *_final_*.csv # Final CSV output
```
## Classes
### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries
### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
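Based only on the signatures above, the two entry points might be wired together like this; argument handling and the async surface of the real `main.py` may differ.

```python
# Usage sketch: run the full crawl and export the results.
import asyncio

from main import TroostwijkScraper


async def run() -> None:
    scraper = TroostwijkScraper()
    data = await scraper.crawl_auctions(max_pages=50)  # drives phases 1-3
    scraper.save_final_results(data)                   # JSON + CSV export


if __name__ == "__main__":
    asyncio.run(run())
```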
## Scalability Notes
- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O

_wiki/Home.md Normal file

@@ -0,0 +1,18 @@
# Scaev Wiki
Welcome to the Scaev documentation.
## Contents
- [Getting Started](Getting-Started)
- [Architecture](Architecture)
- [Deployment](Deployment)
## Overview
Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Quick Links
- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)

_wiki/TESTING.md Normal file

@@ -0,0 +1,279 @@
# Testing & Migration Guide
## Overview
This guide covers:
1. Migrating existing cache to compressed format
2. Running the test suite
3. Understanding test results
## Step 1: Migrate Cache to Compressed Format
If you have an existing database with uncompressed entries (from before compression was added), run the migration script:
```bash
python migrate_compress_cache.py
```
### What it does:
- Finds all cache entries where data is uncompressed
- Compresses them using zlib (level 9)
- Reports compression statistics and space saved
- Verifies that all entries are compressed (a minimal sketch follows)
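A hedged sketch of what such a migration does; the real `migrate_compress_cache.py` also prints statistics and may batch its commits differently.

```python
# Compress all remaining uncompressed cache entries in place.
import sqlite3
import zlib

DB_PATH = "/mnt/okcomputer/output/cache.db"


def migrate(db_path: str = DB_PATH) -> int:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT url, content FROM cache WHERE compressed = 0 OR compressed IS NULL"
    ).fetchall()
    for url, content in rows:
        if isinstance(content, str):
            content = content.encode("utf-8")          # legacy TEXT rows
        conn.execute(
            "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
            (zlib.compress(content, level=9), url),
        )
    conn.commit()
    conn.close()
    return len(rows)                                   # entries migrated


if __name__ == "__main__":
    print(f"Compressed {migrate()} cache entries")
```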
### Expected output:
```
Cache Compression Migration Tool
============================================================
Initial database size: 1024.56 MB
Found 1134 uncompressed cache entries
Starting compression...
Compressed 100/1134 entries... (78.3% reduction so far)
Compressed 200/1134 entries... (79.1% reduction so far)
...
============================================================
MIGRATION COMPLETE
============================================================
Entries compressed: 1134
Original size: 1024.56 MB
Compressed size: 198.34 MB
Space saved: 826.22 MB
Compression ratio: 80.6%
============================================================
VERIFICATION:
Compressed entries: 1134
Uncompressed entries: 0
✓ All cache entries are compressed!
Final database size: 1024.56 MB
Database size reduced by: 0.00 MB
✓ Migration complete! You can now run VACUUM to reclaim disk space:
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```
### Reclaim disk space:
After migration, the database file still contains the space used by old uncompressed data. To actually reclaim the disk space:
```bash
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```
This will rebuild the database file and reduce its size significantly.
## Step 2: Run Tests
The test suite validates that auction and lot parsing works correctly using **cached data only** (no live requests to server).
```bash
python test_scraper.py
```
### What it tests:
**Auction Pages:**
- Type detection (must be 'auction')
- auction_id extraction
- title extraction
- location extraction
- lots_count extraction
- first_lot_closing_time extraction
**Lot Pages:**
- Type detection (must be 'lot')
- lot_id extraction
- title extraction (must not be '...', 'N/A', or empty)
- location extraction (must not be 'Locatie', 'Location', or empty)
- current_bid extraction (must not be '€Huidig bod' or invalid)
- closing_time extraction
- images array extraction
- bid_count validation
- viewing_time and pickup_date (optional)
### Expected output:
```
======================================================================
TROOSTWIJK SCRAPER TEST SUITE
======================================================================
This test suite uses CACHED data only - no live requests to server
======================================================================
======================================================================
CACHE STATUS CHECK
======================================================================
Total cache entries: 1134
Compressed: 1134 (100.0%)
Uncompressed: 0 (0.0%)
✓ All cache entries are compressed!
======================================================================
TEST URL CACHE STATUS:
======================================================================
✓ https://www.troostwijkauctions.com/a/online-auction-cnc-lat...
✓ https://www.troostwijkauctions.com/a/faillissement-bab-sho...
✓ https://www.troostwijkauctions.com/a/industriele-goederen-...
✓ https://www.troostwijkauctions.com/l/%25282x%2529-duo-bure...
✓ https://www.troostwijkauctions.com/l/tos-sui-50-1000-unive...
✓ https://www.troostwijkauctions.com/l/rolcontainer-%25282x%...
6/6 test URLs are cached
======================================================================
TESTING AUCTIONS
======================================================================
======================================================================
Testing Auction: https://www.troostwijkauctions.com/a/online-auction...
======================================================================
✓ Cache hit (age: 12.3 hours)
✓ auction_id: A7-39813
✓ title: Online Auction: CNC Lathes, Machining Centres & Precision...
✓ location: Cluj-Napoca, RO
✓ first_lot_closing_time: 2024-12-05 14:30:00
✓ lots_count: 45
======================================================================
TESTING LOTS
======================================================================
======================================================================
Testing Lot: https://www.troostwijkauctions.com/l/%25282x%2529-duo...
======================================================================
✓ Cache hit (age: 8.7 hours)
✓ lot_id: A1-28505-5
✓ title: (2x) Duo Bureau - 160x168 cm
✓ location: Dongen, NL
✓ current_bid: No bids
✓ closing_time: 2024-12-10 16:00:00
✓ images: 2 images
1. https://media.tbauctions.com/image-media/c3f9825f-e3fd...
2. https://media.tbauctions.com/image-media/45c85ced-9c63...
✓ bid_count: 0
✓ viewing_time: 2024-12-08 09:00:00 - 2024-12-08 17:00:00
✓ pickup_date: 2024-12-11 09:00:00 - 2024-12-11 15:00:00
======================================================================
TEST SUMMARY
======================================================================
Total tests: 6
Passed: 6 ✓
Failed: 0 ✗
Success rate: 100.0%
======================================================================
```
## Test URLs
The test suite tests these specific URLs (you can modify in `test_scraper.py`):
**Auctions:**
- https://www.troostwijkauctions.com/a/online-auction-cnc-lathes-machining-centres-precision-measurement-romania-A7-39813
- https://www.troostwijkauctions.com/a/faillissement-bab-shortlease-i-ii-b-v-%E2%80%93-2024-big-ass-energieopslagsystemen-A1-39557
- https://www.troostwijkauctions.com/a/industriele-goederen-uit-diverse-bedrijfsbeeindigingen-A1-38675
**Lots:**
- https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5
- https://www.troostwijkauctions.com/l/tos-sui-50-1000-universele-draaibank-A7-39568-9
- https://www.troostwijkauctions.com/l/rolcontainer-%25282x%2529-A1-40191-101
## Adding More Test Cases
To add more test URLs, edit `test_scraper.py`:
```python
TEST_AUCTIONS = [
"https://www.troostwijkauctions.com/a/your-auction-url",
# ... add more
]
TEST_LOTS = [
"https://www.troostwijkauctions.com/l/your-lot-url",
# ... add more
]
```
Then run the main scraper to cache these URLs:
```bash
python main.py
```
Then run tests:
```bash
python test_scraper.py
```
## Troubleshooting
### "NOT IN CACHE" errors
If tests show URLs are not cached, run the main scraper first:
```bash
python main.py
```
### "Failed to decompress cache" warnings
This means you have uncompressed legacy data. Run the migration:
```bash
python migrate_compress_cache.py
```
### Tests failing with parsing errors
Check the detailed error output in the TEST SUMMARY section. It will show:
- Which field failed validation
- The actual value that was extracted
- Why it failed (empty, wrong type, invalid format)
## Cache Behavior
The test suite uses cached data with these characteristics:
- **No rate limiting** - reads from DB instantly
- **No server load** - zero HTTP requests
- **Repeatable** - same results every time
- **Fast** - all tests run in < 5 seconds
This allows you to:
- Test parsing changes without re-scraping
- Run tests repeatedly during development
- Validate changes before deploying
- Ensure data quality without server impact
## Continuous Integration
You can integrate these tests into CI/CD:
```bash
# Run migration if needed
python migrate_compress_cache.py
# Run tests
python test_scraper.py
# Exit code: 0 = success, 1 = failure
```
## Performance Benchmarks
Based on typical HTML sizes:
| Metric | Before Compression | After Compression | Improvement |
|--------|-------------------|-------------------|-------------|
| Avg page size | 800 KB | 150 KB | 81.3% |
| 1000 pages | 800 MB | 150 MB | 650 MB saved |
| 10,000 pages | 8 GB | 1.5 GB | 6.5 GB saved |
| DB read speed | ~50 ms | ~5 ms | 10x faster |
## Best Practices
1. **Always run migration after upgrading** to the compressed cache version
2. **Run VACUUM** after migration to reclaim disk space
3. **Run tests after major changes** to parsing logic
4. **Add test cases for edge cases** you encounter in production
5. **Keep test URLs diverse** - different auctions, lot types, languages
6. **Monitor cache hit rates** to ensure effective caching