Architecture

Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

Core Components

1. Browser Automation (Playwright)

  • Launches Chromium browser in headless mode
  • Bypasses Cloudflare protection
  • Handles dynamic content rendering
  • Supports network idle detection
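
A minimal sketch of what this layer might look like with Playwright's async API; the function name and example URL are illustrative, not the actual main.py code:

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Illustrative fetch: launch headless Chromium and wait for network idle."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits until there are no network connections for
        # ~500 ms, so JavaScript-rendered content has settled before we read it.
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

# Example (hypothetical URL):
# html = asyncio.run(fetch_page("https://example.com/lot/123"))
```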

2. Cache Manager (SQLite)

  • Caches every fetched page
  • Prevents redundant requests
  • Stores page content, timestamps, and status codes
  • Auto-cleans entries older than 7 days
  • Database: cache.db
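
For illustration, a minimal SQLite page cache along these lines could look as follows; the table and column names are assumptions, not taken from main.py:

```python
import sqlite3
import time

class PageCache:
    """Illustrative SQLite cache: content, status code, and timestamp per URL."""

    def __init__(self, db_path: str = "cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " url TEXT PRIMARY KEY, content TEXT,"
            " status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 7 * 24):
        # Return cached content only if it is newer than the cutoff.
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM pages WHERE url = ? AND fetched_at >= ?",
            (url, cutoff),
        ).fetchone()
        return row[0] if row else None

    def set(self, url: str, content: str, status_code: int = 200):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: float = 7 * 24):
        # Auto-clean: drop entries older than the retention window (7 days by default).
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()
```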

3. Rate Limiter

  • Enforces a minimum gap of 0.5 seconds between requests
  • Prevents server overload
  • Tracks last request time globally
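
A sketch of a global minimum-interval limiter, assuming asyncio; the class and attribute names are illustrative:

```python
import asyncio
import time

class RateLimiter:
    """Illustrative limiter: at least `min_interval` seconds between requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        # Serialize callers so the globally tracked last-request time is accurate.
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self._last_request = time.monotonic()
```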

4. Data Extractor

  • Primary method: Parses __NEXT_DATA__ JSON from Next.js pages
  • Fallback method: HTML pattern matching with regex
  • Extracts: title, location, bid info, dates, images, descriptions
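
As a sketch, the primary path pulls the __NEXT_DATA__ script tag out of the HTML and parses it as JSON, with a crude regex fallback; the exact JSON structure and field paths depend on the site and are not shown here:

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html: str):
    """Return the embedded Next.js page data as a dict, or None if absent."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

def extract_title_fallback(html: str):
    """Fallback: crude HTML pattern match for the lot title (illustrative)."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL)
    return match.group(1).strip() if match else None
```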

5. Output Manager

  • Exports data in JSON and CSV formats
  • Saves progress checkpoints every 10 lots
  • Timestamped filenames for tracking
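
A hedged sketch of timestamped JSON/CSV export; the output directory matches the File Structure section below, but the exact filename prefix is an assumption:

```python
import csv
import json
from datetime import datetime
from pathlib import Path

def export_results(lots: list[dict], out_dir: str = "output", label: str = "final"):
    """Write scraped lots to timestamped JSON and CSV files and return their paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    json_path = Path(out_dir) / f"lots_{label}_{stamp}.json"
    json_path.write_text(json.dumps(lots, ensure_ascii=False, indent=2), encoding="utf-8")

    csv_path = Path(out_dir) / f"lots_{label}_{stamp}.csv"
    if lots:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0].keys()))
            writer.writeheader()
            writer.writerows(lots)
    return json_path, csv_path
```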

Data Flow

1. Listing Pages → Extract lot URLs → Store in memory
         ↓
2. For each lot URL → Check cache → If cached: use cached content
         ↓                          If not: fetch with rate limit
         ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
         ↓
4. Every 10 lots → Save progress checkpoint
         ↓
5. All lots complete → Export final JSON + CSV
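
Tying the steps together, an illustrative driver loop that reuses the helper sketches from the component sections above (not the actual main.py control flow):

```python
async def run(lot_urls, cache, limiter):
    """Illustrative driver for steps 2-5, built from the sketches above."""
    results = []
    for i, url in enumerate(lot_urls, start=1):
        html = cache.get(url)                      # step 2: check cache first
        if html is None:
            await limiter.wait()                   # respect the 0.5 s gap
            html = await fetch_page(url)           # Playwright sketch above
            cache.set(url, html)

        data = extract_next_data(html)             # step 3: parse __NEXT_DATA__
        if data is not None:
            results.append({"url": url, "data": data})

        if i % 10 == 0:                            # step 4: checkpoint every 10 lots
            export_results(results, label="partial")

    export_results(results, label="final")         # step 5: final JSON + CSV
    return results
```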

Key Design Decisions

Why Playwright?

  • Handles JavaScript-rendered content (Next.js)
  • Bypasses Cloudflare protection
  • More reliable than requests/BeautifulSoup for modern SPAs

Why JSON extraction?

  • Site uses Next.js with embedded __NEXT_DATA__
  • JSON is more reliable than HTML pattern matching
  • Avoids breaking when HTML/CSS changes
  • Faster parsing

Why SQLite caching?

  • Persistent across runs
  • Reduces load on target server
  • Enables test mode without re-fetching
  • Respects website resources

File Structure

troost-scraper/
├── main.py                    # Main scraper logic
├── requirements.txt           # Python dependencies
├── README.md                  # Documentation
├── .gitignore                 # Git exclusions
└── output/                    # Generated files (not in git)
    ├── cache.db              # SQLite cache
    ├── *_partial_*.json      # Progress checkpoints
    ├── *_final_*.json        # Final JSON output
    └── *_final_*.csv         # Final CSV output

Classes

CacheManager

  • __init__(db_path) - Initialize cache database
  • get(url, max_age_hours) - Retrieve cached page
  • set(url, content, status_code) - Cache a page
  • clear_old(max_age_hours) - Remove old entries
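
Typical usage, assuming the signatures above and the 7-day (168-hour) retention window mentioned earlier; the URL and content are placeholders:

```python
cache = CacheManager("output/cache.db")
url = "https://example.com/lot/123"          # illustrative URL
html = cache.get(url, max_age_hours=168)     # 7-day window
if html is None:
    html = "<html>...</html>"                # would come from the Playwright fetch
    cache.set(url, html, status_code=200)
cache.clear_old(max_age_hours=168)
```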

TroostwijkScraper

  • crawl_auctions(max_pages) - Main entry point
  • crawl_listing_page(page, page_num) - Extract lot URLs
  • crawl_lot(page, url) - Scrape individual lot
  • _extract_nextjs_data(content) - Parse JSON data
  • _parse_lot_page(content, url) - Extract all fields
  • save_final_results(data) - Export JSON + CSV
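
A hedged usage sketch; that the constructor takes no required arguments and that crawl_auctions() is a coroutine are assumptions based on the async design noted below:

```python
import asyncio

# Assumptions: no required constructor arguments, and crawl_auctions()
# is an async coroutine (only the signature is documented above).
scraper = TroostwijkScraper()
asyncio.run(scraper.crawl_auctions(max_pages=3))
```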

Scalability Notes

  • Rate limiting prevents IP blocks but slows execution
  • Caching avoids re-fetching unchanged pages, making subsequent runs near-instant
  • Progress checkpoints allow resuming after interruption
  • Async/await used throughout for non-blocking I/O