Architecture
Overview
The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
Core Components
1. Browser Automation (Playwright)
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
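A minimal sketch of this fetch step using Playwright's async API; the helper name and error handling are illustrative, not the project's actual code:

```python
# Sketch of the fetch step: headless Chromium + network-idle wait,
# so Next.js has finished rendering before we read the page.
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Wait until network traffic settles before grabbing the HTML
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content
```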
2. Cache Manager (SQLite)
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: cache.db
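A hypothetical look at the backing table; the actual cache.db schema is not documented here, so the column names are assumptions:

```python
# Hypothetical cache schema matching the behavior described above:
# page content, status code, and a timestamp used for the 7-day expiry.
import sqlite3
import time

conn = sqlite3.connect("output/cache.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cache (
        url         TEXT PRIMARY KEY,   -- request URL (cache key)
        content     TEXT,               -- raw page HTML
        status_code INTEGER,            -- HTTP status at fetch time
        fetched_at  REAL                -- Unix timestamp for expiry checks
    )
""")
# Auto-clean: remove entries older than 7 days
conn.execute("DELETE FROM cache WHERE fetched_at < ?",
             (time.time() - 7 * 24 * 3600,))
conn.commit()
```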
3. Rate Limiter
- Enforces a minimum delay of 0.5 seconds between requests
- Prevents server overload
- Tracks last request time globally
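A sketch of the rate-limiting idea: a module-level timestamp enforces the minimum delay before each outgoing request. Names are illustrative:

```python
# Sketch: await rate_limit() before every fetch; the first call
# proceeds immediately, later calls sleep off any remaining delay.
import asyncio
import time

REQUEST_DELAY = 0.5   # seconds between requests
_last_request = 0.0   # global last-request timestamp

async def rate_limit() -> None:
    global _last_request
    elapsed = time.monotonic() - _last_request
    if elapsed < REQUEST_DELAY:
        await asyncio.sleep(REQUEST_DELAY - elapsed)
    _last_request = time.monotonic()
```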
4. Data Extractor
- Primary method: parses __NEXT_DATA__ JSON from Next.js pages
- Fallback method: HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
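A sketch of the primary extraction path: Next.js embeds its page props as JSON in a script tag with id __NEXT_DATA__. The regex and return shape are illustrative; the real payload structure depends on the site:

```python
# Sketch: pull the embedded Next.js JSON payload out of the raw HTML.
import json
import re

def extract_nextjs_data(html: str) -> dict | None:
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    # Fall back to None so the caller can try HTML pattern matching instead
    return json.loads(match.group(1)) if match else None
```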
5. Output Manager
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
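A minimal sketch of the export step, assuming results is a list of dicts with uniform keys; the filename prefix is an assumption that follows the timestamped pattern described above:

```python
# Sketch: write the same results to timestamped JSON and CSV files.
import csv
import json
from datetime import datetime

def save_final_results(results: list[dict]) -> None:
    if not results:
        return
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    with open(f"output/lots_final_{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    with open(f"output/lots_final_{stamp}.csv", "w", encoding="utf-8",
              newline="") as f:
        writer = csv.DictWriter(f, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)
```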
Data Flow
1. Listing Pages → Extract lot URLs → Store in memory
↓
2. For each lot URL → Check cache → If cached: use cached content; if not: fetch with rate limit
   ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
↓
4. Every 10 lots → Save progress checkpoint
↓
5. All lots complete → Export final JSON + CSV
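Condensed as code, the flow above might look like the following sketch. It reuses the hypothetical helpers from the component sections and a CacheManager instance named cache (documented under Classes); parse_lot_page and save_checkpoint stand in for the extraction and checkpoint steps:

```python
# Sketch only: fetch_page, rate_limit, cache, parse_lot_page and
# save_checkpoint are the hypothetical helpers sketched elsewhere on this page.
async def crawl(lot_urls: list[str]) -> list[dict]:
    results = []
    for i, url in enumerate(lot_urls, start=1):
        content = cache.get(url, max_age_hours=168)   # step 2: cache check
        if content is None:
            await rate_limit()                        # step 2: throttled fetch
            content = await fetch_page(url)
            cache.set(url, content, status_code=200)
        results.append(parse_lot_page(content, url))  # step 3: extract fields
        if i % 10 == 0:
            save_checkpoint(results)                  # step 4: checkpoint
    return results                                    # step 5: ready for export
```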
Key Design Decisions
Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs
Why JSON extraction?
- Site uses Next.js with embedded __NEXT_DATA__
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing
Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources
File Structure
troost-scraper/
├── main.py # Main scraper logic
├── requirements.txt # Python dependencies
├── README.md # Documentation
├── .gitignore # Git exclusions
└── output/ # Generated files (not in git)
├── cache.db # SQLite cache
├── *_partial_*.json # Progress checkpoints
├── *_final_*.json # Final JSON output
└── *_final_*.csv # Final CSV output
Classes
CacheManager
- __init__(db_path) - Initialize cache database
- get(url, max_age_hours) - Retrieve cached page
- set(url, content, status_code) - Cache a page
- clear_old(max_age_hours) - Remove old entries
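A minimal sketch of this interface; the schema and method bodies are assumptions consistent with the behavior described under Cache Manager, not the project's actual code:

```python
# Sketch of the documented CacheManager interface over SQLite.
import sqlite3
import time

class CacheManager:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, content TEXT, "
            "status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: int = 168) -> str | None:
        # Only return entries newer than the age cutoff
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM cache WHERE url = ? AND fetched_at >= ?",
            (url, cutoff),
        ).fetchone()
        return row[0] if row else None

    def set(self, url: str, content: str, status_code: int = 200) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: int = 168) -> None:
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM cache WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()
```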
TroostwijkScraper
- crawl_auctions(max_pages) - Main entry point
- crawl_listing_page(page, page_num) - Extract lot URLs
- crawl_lot(page, url) - Scrape individual lot
- _extract_nextjs_data(content) - Parse JSON data
- _parse_lot_page(content, url) - Extract all fields
- save_final_results(data) - Export JSON + CSV
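Since the scraper is async throughout (see Scalability Notes), a typical invocation might look like this; the constructor arguments are an assumption:

```python
# Hypothetical entry-point usage of the main scraper class.
import asyncio

scraper = TroostwijkScraper()
asyncio.run(scraper.crawl_auctions(max_pages=3))
```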
Scalability Notes
- Rate limiting prevents IP blocks but slows execution
- Caching makes subsequent runs instant for unchanged pages
- Progress checkpoints allow resuming after interruption
- Async/await used throughout for non-blocking I/O