# Architecture

## Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Core Components

### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
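
A minimal sketch of what such a fetch can look like with Playwright's async API (the function name and options here are illustrative, not the project's actual code):

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Illustrative fetch: headless Chromium, wait for network idle."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits until the page stops issuing network requests,
        # which gives Next.js time to render its dynamic content.
        await page.goto(url, wait_until="networkidle", timeout=60_000)
        content = await page.content()
        await browser.close()
        return content

if __name__ == "__main__":
    html = asyncio.run(fetch_page("https://example.com"))
    print(len(html))
```
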
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
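
A sketch of the cache layout this implies: a single table keyed by URL holding the page content, status code, and a fetch timestamp used for the 7-day expiry (table and column names are assumptions, not the actual schema):

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url         TEXT PRIMARY KEY,
    content     TEXT,
    status_code INTEGER,
    fetched_at  REAL
)
"""

def open_cache(db_path: str = "cache.db") -> sqlite3.Connection:
    """Open (or create) the cache database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn

def get_cached(conn: sqlite3.Connection, url: str, max_age_hours: float = 168) -> Optional[str]:
    """Return cached content newer than max_age_hours (default 168 h = 7 days), else None."""
    cutoff = time.time() - max_age_hours * 3600
    row = conn.execute(
        "SELECT content FROM cache WHERE url = ? AND fetched_at >= ?",
        (url, cutoff),
    ).fetchone()
    return row[0] if row else None

def set_cached(conn: sqlite3.Connection, url: str, content: str, status_code: int = 200) -> None:
    """Insert or refresh the cache entry for a URL."""
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, status_code, fetched_at) VALUES (?, ?, ?, ?)",
        (url, content, status_code, time.time()),
    )
    conn.commit()
```
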
### 3. **Rate Limiter**
- Enforces a minimum of 0.5 seconds between requests
- Prevents server overload
- Tracks the last request time globally
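
A minimal async rate limiter in this spirit (a sketch; the class name and exact bookkeeping are assumptions):

```python
import asyncio
import time

class RateLimiter:
    """Guarantees at least `delay` seconds between consecutive requests."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self._last_request = 0.0  # monotonic timestamp of the previous request

    async def wait(self) -> None:
        """Sleep just long enough to respect the configured delay, then record the time."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.delay:
            await asyncio.sleep(self.delay - elapsed)
        self._last_request = time.monotonic()
```

Calling `await limiter.wait()` immediately before each fetch keeps the 0.5-second spacing without blocking the event loop.
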
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
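
Next.js pages embed their state in a `<script id="__NEXT_DATA__" type="application/json">` tag, so the primary path can be a straightforward JSON parse, with regex matching only as a fallback (the patterns below are illustrative):

```python
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_nextjs_data(html: str) -> Optional[dict]:
    """Primary method: parse the embedded Next.js JSON payload."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

def extract_title_fallback(html: str) -> Optional[str]:
    """Fallback method: crude HTML pattern matching when the JSON payload is absent."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None
```
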
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
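
A sketch of the export step, assuming each lot is a flat dict; the filename patterns mirror the File Structure section below:

```python
import csv
import json
from datetime import datetime
from pathlib import Path

def save_results(lots: list[dict], prefix: str = "lots", final: bool = True) -> None:
    """Write a timestamped JSON file, plus a CSV for final exports."""
    out_dir = Path("output")
    out_dir.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    stage = "final" if final else "partial"

    json_path = out_dir / f"{prefix}_{stage}_{stamp}.json"
    json_path.write_text(json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8")

    if final and lots:
        csv_path = out_dir / f"{prefix}_{stage}_{stamp}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0].keys()))
            writer.writeheader()
            writer.writerows(lots)
```
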
## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
        ↓
2. For each lot URL → Check cache → If cached: use cached content
        ↓ If not: fetch with rate limit
        ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
        ↓
4. Every 10 lots → Save progress checkpoint
        ↓
5. All lots complete → Export final JSON + CSV
```
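
In code, the per-lot portion of this flow roughly follows the shape below (a sketch with injected callables rather than the project's real method signatures):

```python
from typing import Awaitable, Callable, Optional

async def scrape_lots(
    lot_urls: list[str],
    fetch: Callable[[str], Awaitable[str]],      # rate-limited Playwright fetch
    cache_get: Callable[[str], Optional[str]],   # cached HTML or None
    cache_set: Callable[[str, str], None],
    parse: Callable[[str, str], dict],           # (html, url) -> lot fields
    checkpoint: Callable[[list[dict]], None],    # saves partial results
) -> list[dict]:
    """Cache check, rate-limited fetch, parse, and a checkpoint every 10 lots."""
    results: list[dict] = []
    for i, url in enumerate(lot_urls, start=1):
        html = cache_get(url)
        if html is None:
            html = await fetch(url)   # the rate limiter applies inside fetch
            cache_set(url, html)
        results.append(parse(html, url))
        if i % 10 == 0:               # progress checkpoint every 10 lots
            checkpoint(results)
    return results
```
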
## Key Design Decisions

### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing

### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources

## File Structure

```
troost-scraper/
├── main.py              # Main scraper logic
├── requirements.txt     # Python dependencies
├── README.md            # Documentation
├── .gitignore           # Git exclusions
└── output/              # Generated files (not in git)
    ├── cache.db         # SQLite cache
    ├── *_partial_*.json # Progress checkpoints
    ├── *_final_*.json   # Final JSON output
    └── *_final_*.csv    # Final CSV output
```

## Classes

### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries

### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
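
A hypothetical entry point, assuming the class lives in `main.py` and `crawl_auctions` is a coroutine (both assumptions based on the file structure and the async design notes):

```python
import asyncio

from main import TroostwijkScraper  # assumption: the class is defined in main.py

async def run() -> None:
    scraper = TroostwijkScraper()
    # Assumption: crawl_auctions drives the full pipeline, including
    # progress checkpoints and the final JSON/CSV export.
    await scraper.crawl_auctions(max_pages=3)

if __name__ == "__main__":
    asyncio.run(run())
```
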
## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs near-instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O
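
For instance, resuming after an interruption could amount to loading the newest partial checkpoint and skipping lots that are already present (a sketch; whether resume works exactly this way is an assumption, as is the `url` field used for de-duplication):

```python
import json
from pathlib import Path

def load_latest_checkpoint(out_dir: str = "output") -> list[dict]:
    """Return lots from the newest *_partial_*.json checkpoint, or [] if none exist."""
    # Timestamped filenames sort lexicographically, so the last one is the newest.
    checkpoints = sorted(Path(out_dir).glob("*_partial_*.json"))
    if not checkpoints:
        return []
    return json.loads(checkpoints[-1].read_text(encoding="utf-8"))

def remaining_urls(all_urls: list[str], done_lots: list[dict]) -> list[str]:
    """Skip lots already captured in the checkpoint, keyed by their URL."""
    done = {lot.get("url") for lot in done_lots}
    return [u for u in all_urls if u not in done]
```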