# Architecture
## Overview
The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Core Components
### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
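A minimal sketch of this step, assuming async Playwright; the function name and URL are illustrative, not taken from `main.py`:
```python
# Sketch: launch headless Chromium and wait for network idle so that
# Next.js has finished rendering before the HTML is read.
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits until the page has stopped making requests.
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

if __name__ == "__main__":
    html = asyncio.run(fetch_page("https://example.com"))
    print(html[:200])
```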
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
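A sketch of the caching approach, assuming a single `pages` table; the schema is an assumption, only the method names mirror the `CacheManager` class described below:
```python
# Sketch of the SQLite cache: url, content, status code, and fetch time,
# with a 7-day (168-hour) expiry matching the auto-clean policy above.
import sqlite3
import time

class CacheManager:
    def __init__(self, db_path: str = "cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " url TEXT PRIMARY KEY, content TEXT,"
            " status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 168):
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM pages WHERE url = ? AND fetched_at >= ?",
            (url, cutoff),
        ).fetchone()
        return row[0] if row else None

    def set(self, url: str, content: str, status_code: int = 200):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: float = 168):
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()
```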
### 3. **Rate Limiter**
- Enforces a minimum delay of 0.5 seconds between requests
- Prevents server overload
- Tracks last request time globally
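A sketch of the idea, using a module-level timestamp and an async sleep; names are illustrative:
```python
# Sketch: guarantee at least 0.5 s between consecutive requests.
import asyncio
import time

_last_request = 0.0  # last request time, tracked globally
MIN_DELAY = 0.5      # seconds between requests

async def rate_limit():
    global _last_request
    wait = MIN_DELAY - (time.time() - _last_request)
    if wait > 0:
        await asyncio.sleep(wait)
    _last_request = time.time()
```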
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
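A sketch of both paths; the `__NEXT_DATA__` script tag is standard Next.js, but the fallback pattern shown is only an example:
```python
# Sketch: primary path parses the embedded __NEXT_DATA__ JSON blob,
# fallback path pattern-matches the raw HTML with regex.
import json
import re

def extract_nextjs_data(html: str):
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html, re.DOTALL,
    )
    return json.loads(match.group(1)) if match else None

def extract_title_fallback(html: str):
    # Fallback when the JSON blob is missing or malformed.
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL)
    return match.group(1).strip() if match else None
```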
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
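A sketch of the export step, assuming the `output/` directory and timestamped filename pattern from the file structure below; the exact filenames and field names are assumptions:
```python
# Sketch: write timestamped JSON (and, for final runs, CSV) into output/.
import csv
import json
import time
from pathlib import Path

OUTPUT_DIR = Path("output")

def save_results(lots: list[dict], kind: str = "final"):
    OUTPUT_DIR.mkdir(exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    json_path = OUTPUT_DIR / f"lots_{kind}_{stamp}.json"
    json_path.write_text(json.dumps(lots, indent=2, ensure_ascii=False))
    if kind == "final" and lots:
        with open(OUTPUT_DIR / f"lots_{kind}_{stamp}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=lots[0].keys())
            writer.writeheader()
            writer.writerows(lots)
```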
## Data Flow
```
1. Listing Pages → Extract lot URLs → Store in memory
2. For each lot URL → Check cache → If cached: use cached content
↓ If not: fetch with rate limit
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
4. Every 10 lots → Save progress checkpoint
5. All lots complete → Export final JSON + CSV
```
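A sketch of how these steps fit together, reusing the helpers sketched above (`rate_limit`, `save_results`, a `CacheManager` instance, an open Playwright `page`) plus a hypothetical `parse_lot` standing in for the field-extraction step:
```python
# Sketch of the per-lot loop: cache check, rate-limited fetch, parse,
# checkpoint every 10 lots, final export at the end.
async def crawl_lots(lot_urls, cache, page):
    results = []
    for i, url in enumerate(lot_urls, start=1):
        html = cache.get(url)
        if html is None:
            await rate_limit()                       # step 2: fetch with rate limit
            await page.goto(url, wait_until="networkidle")
            html = await page.content()
            cache.set(url, html)
        results.append(parse_lot(html, url))         # step 3: extract fields (hypothetical helper)
        if i % 10 == 0:
            save_results(results, kind="partial")    # step 4: progress checkpoint
    save_results(results, kind="final")              # step 5: final JSON + CSV
    return results
```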
## Key Design Decisions
### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs
### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing
### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources
## File Structure
```
troost-scraper/
├── main.py # Main scraper logic
├── requirements.txt # Python dependencies
├── README.md # Documentation
├── .gitignore # Git exclusions
└── output/ # Generated files (not in git)
├── cache.db # SQLite cache
├── *_partial_*.json # Progress checkpoints
├── *_final_*.json # Final JSON output
└── *_final_*.csv # Final CSV output
```
## Classes
### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries
### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
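A hypothetical usage sketch; the import path and constructor arguments are assumptions based on the file structure and method names above, not verified against `main.py`:
```python
# Assumed usage; TroostwijkScraper is presumed importable from main.py.
import asyncio
from main import TroostwijkScraper

async def run():
    scraper = TroostwijkScraper()
    data = await scraper.crawl_auctions(max_pages=5)
    scraper.save_final_results(data)

if __name__ == "__main__":
    asyncio.run(run())
```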
## Scalability Notes
- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O