# Architecture

## Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data, using Playwright for browser automation and SQLite for caching.

## Core Components

### 1. **Browser Automation (Playwright)**

- Launches a Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network-idle detection

### 2. **Cache Manager (SQLite)**

- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`

### 3. **Rate Limiter**

- Enforces a 0.5-second delay between requests
- Prevents server overload
- Tracks the last request time globally

### 4. **Data Extractor**

- **Primary method:** parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions

### 5. **Output Manager**

- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Uses timestamped filenames for tracking

## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
        ↓
2. For each lot URL → Check cache
        ↓ If cached: use cached content
        ↓ If not: fetch with rate limit
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
        ↓
4. Every 10 lots → Save progress checkpoint
        ↓
5. All lots complete → Export final JSON + CSV
```

## Key Design Decisions

### Why Playwright?

- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?

- The site uses Next.js with an embedded `__NEXT_DATA__` payload
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing

### Why SQLite caching?

- Persistent across runs
- Reduces load on the target server
- Enables test mode without re-fetching
- Respects website resources

## File Structure

```
troost-scraper/
├── main.py               # Main scraper logic
├── requirements.txt      # Python dependencies
├── README.md             # Documentation
├── .gitignore            # Git exclusions
└── output/               # Generated files (not in git)
    ├── cache.db              # SQLite cache
    ├── *_partial_*.json      # Progress checkpoints
    ├── *_final_*.json        # Final JSON output
    └── *_final_*.csv         # Final CSV output
```

## Classes

### `CacheManager`

- `__init__(db_path)` - Initialize the cache database
- `get(url, max_age_hours)` - Retrieve a cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries

### `TroostwijkScraper`

- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape an individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV

## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after an interruption
- **Async/await** is used throughout for non-blocking I/O
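To make the caching and rate-limiting behavior described under Core Components concrete, here is a minimal sketch. The method signatures follow the `CacheManager` interface listed under Classes, but the SQLite schema, default values, and the `RateLimiter` class are illustrative assumptions, not the project's actual implementation; the 0.5-second delay and 7-day (168-hour) expiry mirror the figures quoted above.

```python
import sqlite3
import time


class CacheManager:
    """Sketch of an SQLite page cache (assumed schema, not the real one)."""

    def __init__(self, db_path="output/cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " url TEXT PRIMARY KEY, content TEXT,"
            " status_code INTEGER, fetched_at REAL)"
        )
        self.conn.commit()

    def get(self, url, max_age_hours=168):
        # Return cached content only if it is younger than max_age_hours.
        row = self.conn.execute(
            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < max_age_hours * 3600:
            return row[0]
        return None  # missing or stale: caller must re-fetch

    def set(self, url, content, status_code=200):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours=168):
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()


class RateLimiter:
    """Sketch: enforce a global 0.5 s spacing between live requests."""

    def __init__(self, delay=0.5):
        self.delay = delay
        self._last_request = 0.0

    def wait(self):
        elapsed = time.time() - self._last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_request = time.time()
```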
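The primary-plus-fallback extraction strategy described for the Data Extractor could look roughly like the sketch below. Next.js embeds its page state in a `<script id="__NEXT_DATA__" type="application/json">` tag, which is what the regex targets; the fallback selector and the helper names are illustrative assumptions, since the real payload structure and field mapping are site-specific.

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)


def extract_nextjs_data(content: str):
    """Primary method: pull the embedded __NEXT_DATA__ JSON out of the page HTML."""
    match = NEXT_DATA_RE.search(content)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass
    return None  # signal the caller to fall back to HTML pattern matching


def extract_title_fallback(content: str):
    """Fallback method: crude regex over the HTML when the JSON payload is absent."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", content, re.DOTALL)
    return match.group(1).strip() if match else None
```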
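Finally, a sketch of how the async fetch loop, cache lookup, rate limiting, and 10-lot checkpointing from the Data Flow section might fit together. `crawl_lots`, `parse_lot_page`, and the checkpoint filename are placeholders invented for this example; only the overall flow mirrors the diagram above.

```python
import asyncio
import json
from datetime import datetime

from playwright.async_api import async_playwright


def parse_lot_page(content, url):
    """Placeholder for the real field extraction (see the __NEXT_DATA__ sketch above)."""
    return {"url": url, "content_length": len(content)}


async def crawl_lots(lot_urls, cache, rate_delay=0.5, checkpoint_every=10):
    """Fetch each lot page (cache first), parse it, and checkpoint every N lots."""
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        for i, url in enumerate(lot_urls, start=1):
            content = cache.get(url)
            if content is None:
                await asyncio.sleep(rate_delay)  # rate-limit live fetches only
                await page.goto(url, wait_until="networkidle")
                content = await page.content()
                cache.set(url, content)
            results.append(parse_lot_page(content, url))
            if i % checkpoint_every == 0:  # progress checkpoint every 10 lots
                stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                with open(f"output/lots_partial_{stamp}.json", "w") as f:
                    json.dump(results, f, indent=2)
        await browser.close()
    return results
```

Cached pages skip both the browser round-trip and the rate-limit sleep, which is why subsequent runs over unchanged pages are effectively instant.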