# Architecture

## Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Core Components

### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
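
A minimal sketch of what such a fetch can look like with Playwright's async API (the function name and options here are illustrative, not the project's actual code):

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Illustrative fetch: headless Chromium, wait for network idle."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits until the page stops issuing network requests,
        # which gives Next.js time to render its dynamic content.
        await page.goto(url, wait_until="networkidle", timeout=60_000)
        content = await page.content()
        await browser.close()
        return content

if __name__ == "__main__":
    html = asyncio.run(fetch_page("https://example.com"))
    print(len(html))
```
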
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
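
A sketch of the cache layout this implies: a single table keyed by URL holding the page content, status code, and a fetch timestamp used for the 7-day expiry (table and column names are assumptions, not the actual schema):

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url         TEXT PRIMARY KEY,
    content     TEXT,
    status_code INTEGER,
    fetched_at  REAL
)
"""

def open_cache(db_path: str = "cache.db") -> sqlite3.Connection:
    """Open (or create) the cache database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn

def get_cached(conn: sqlite3.Connection, url: str, max_age_hours: float = 168) -> Optional[str]:
    """Return cached content newer than max_age_hours (default 168 h = 7 days), else None."""
    cutoff = time.time() - max_age_hours * 3600
    row = conn.execute(
        "SELECT content FROM cache WHERE url = ? AND fetched_at >= ?",
        (url, cutoff),
    ).fetchone()
    return row[0] if row else None

def set_cached(conn: sqlite3.Connection, url: str, content: str, status_code: int = 200) -> None:
    """Insert or refresh the cache entry for a URL."""
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, status_code, fetched_at) VALUES (?, ?, ?, ?)",
        (url, content, status_code, time.time()),
    )
    conn.commit()
```
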
### 3. **Rate Limiter**
- Enforces a minimum of 0.5 seconds between requests
- Prevents server overload
- Tracks the last request time globally
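
A minimal async rate limiter in this spirit (a sketch; the class name and exact bookkeeping are assumptions):

```python
import asyncio
import time

class RateLimiter:
    """Guarantees at least `delay` seconds between consecutive requests."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self._last_request = 0.0  # monotonic timestamp of the previous request

    async def wait(self) -> None:
        """Sleep just long enough to respect the configured delay, then record the time."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.delay:
            await asyncio.sleep(self.delay - elapsed)
        self._last_request = time.monotonic()
```

Calling `await limiter.wait()` immediately before each fetch keeps the 0.5-second spacing without blocking the event loop.
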
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
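
Next.js pages embed their state in a `<script id="__NEXT_DATA__" type="application/json">` tag, so the primary path can be a straightforward JSON parse, with regex matching only as a fallback (the patterns below are illustrative):

```python
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_nextjs_data(html: str) -> Optional[dict]:
    """Primary method: parse the embedded Next.js JSON payload."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

def extract_title_fallback(html: str) -> Optional[str]:
    """Fallback method: crude HTML pattern matching when the JSON payload is absent."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None
```
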
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
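
A sketch of the export step, assuming each lot is a flat dict; the filename patterns mirror the File Structure section below:

```python
import csv
import json
from datetime import datetime
from pathlib import Path

def save_results(lots: list[dict], prefix: str = "lots", final: bool = True) -> None:
    """Write a timestamped JSON file, plus a CSV for final exports."""
    out_dir = Path("output")
    out_dir.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    stage = "final" if final else "partial"

    json_path = out_dir / f"{prefix}_{stage}_{stamp}.json"
    json_path.write_text(json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8")

    if final and lots:
        csv_path = out_dir / f"{prefix}_{stage}_{stamp}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0].keys()))
            writer.writeheader()
            writer.writerows(lots)
```
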
## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
        ↓
2. For each lot URL → Check cache → If cached: use cached content
        ↓ If not: fetch with rate limit
        ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
        ↓
4. Every 10 lots → Save progress checkpoint
        ↓
5. All lots complete → Export final JSON + CSV
```
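
In code, the per-lot portion of this flow roughly follows the shape below (a sketch with injected callables rather than the project's real method signatures):

```python
from typing import Awaitable, Callable, Optional

async def scrape_lots(
    lot_urls: list[str],
    fetch: Callable[[str], Awaitable[str]],      # rate-limited Playwright fetch
    cache_get: Callable[[str], Optional[str]],   # cached HTML or None
    cache_set: Callable[[str, str], None],
    parse: Callable[[str, str], dict],           # (html, url) -> lot fields
    checkpoint: Callable[[list[dict]], None],    # saves partial results
) -> list[dict]:
    """Cache check, rate-limited fetch, parse, and a checkpoint every 10 lots."""
    results: list[dict] = []
    for i, url in enumerate(lot_urls, start=1):
        html = cache_get(url)
        if html is None:
            html = await fetch(url)   # the rate limiter applies inside fetch
            cache_set(url, html)
        results.append(parse(html, url))
        if i % 10 == 0:               # progress checkpoint every 10 lots
            checkpoint(results)
    return results
```
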
## Key Design Decisions

### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing

### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources

## File Structure

```
troost-scraper/
├── main.py              # Main scraper logic
├── requirements.txt     # Python dependencies
├── README.md            # Documentation
├── .gitignore           # Git exclusions
└── output/              # Generated files (not in git)
    ├── cache.db         # SQLite cache
    ├── *_partial_*.json # Progress checkpoints
    ├── *_final_*.json   # Final JSON output
    └── *_final_*.csv    # Final CSV output
```

## Classes

### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries

### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
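
A hypothetical entry point, assuming the class lives in `main.py` and `crawl_auctions` is a coroutine (both assumptions based on the file structure and the async design notes):

```python
import asyncio

from main import TroostwijkScraper  # assumption: the class is defined in main.py

async def run() -> None:
    scraper = TroostwijkScraper()
    # Assumption: crawl_auctions drives the full pipeline, including
    # progress checkpoints and the final JSON/CSV export.
    await scraper.crawl_auctions(max_pages=3)

if __name__ == "__main__":
    asyncio.run(run())
```
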
## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs near-instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O
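
For instance, resuming after an interruption could amount to loading the newest partial checkpoint and skipping lots that are already present (a sketch; whether resume works exactly this way is an assumption, as is the `url` field used for de-duplication):

```python
import json
from pathlib import Path

def load_latest_checkpoint(out_dir: str = "output") -> list[dict]:
    """Return lots from the newest *_partial_*.json checkpoint, or [] if none exist."""
    # Timestamped filenames sort lexicographically, so the last one is the newest.
    checkpoints = sorted(Path(out_dir).glob("*_partial_*.json"))
    if not checkpoints:
        return []
    return json.loads(checkpoints[-1].read_text(encoding="utf-8"))

def remaining_urls(all_urls: list[str], done_lots: list[dict]) -> list[str]:
    """Skip lots already captured in the checkpoint, keyed by their URL."""
    done = {lot.get("url") for lot in done_lots}
    return [u for u in all_urls if u not in done]
```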