Architecture

Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

Core Components

1. Browser Automation (Playwright)

  • Launches Chromium browser in headless mode
  • Bypasses Cloudflare protection
  • Handles dynamic content rendering
  • Supports network idle detection
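
A minimal sketch of what this layer might look like with Playwright's async API; the function name and example URL are illustrative, not the actual main.py code:

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Illustrative fetch: launch headless Chromium and wait for network idle."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits until there are no network connections for
        # ~500 ms, so JavaScript-rendered content has settled before we read it.
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

# Example (hypothetical URL):
# html = asyncio.run(fetch_page("https://example.com/lot/123"))
```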

2. Cache Manager (SQLite)

  • Caches every fetched page
  • Prevents redundant requests
  • Stores page content, timestamps, and status codes
  • Auto-cleans entries older than 7 days
  • Database: cache.db
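
For illustration, a minimal SQLite page cache along these lines could look as follows; the table and column names are assumptions, not taken from main.py:

```python
import sqlite3
import time

class PageCache:
    """Illustrative SQLite cache: content, status code, and timestamp per URL."""

    def __init__(self, db_path: str = "cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " url TEXT PRIMARY KEY, content TEXT,"
            " status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 7 * 24):
        # Return cached content only if it is newer than the cutoff.
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM pages WHERE url = ? AND fetched_at >= ?",
            (url, cutoff),
        ).fetchone()
        return row[0] if row else None

    def set(self, url: str, content: str, status_code: int = 200):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: float = 7 * 24):
        # Auto-clean: drop entries older than the retention window (7 days by default).
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()
```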

3. Rate Limiter

  • Enforces a minimum gap of 0.5 seconds between requests
  • Prevents server overload
  • Tracks last request time globally
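
A sketch of a global minimum-interval limiter, assuming asyncio; the class and attribute names are illustrative:

```python
import asyncio
import time

class RateLimiter:
    """Illustrative limiter: at least `min_interval` seconds between requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        # Serialize callers so the globally tracked last-request time is accurate.
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self._last_request = time.monotonic()
```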

4. Data Extractor

  • Primary method: Parses __NEXT_DATA__ JSON from Next.js pages
  • Fallback method: HTML pattern matching with regex
  • Extracts: title, location, bid info, dates, images, descriptions
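
As a sketch, the primary path pulls the __NEXT_DATA__ script tag out of the HTML and parses it as JSON, with a crude regex fallback; the exact JSON structure and field paths depend on the site and are not shown here:

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html: str):
    """Return the embedded Next.js page data as a dict, or None if absent."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

def extract_title_fallback(html: str):
    """Fallback: crude HTML pattern match for the lot title (illustrative)."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL)
    return match.group(1).strip() if match else None
```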

5. Output Manager

  • Exports data in JSON and CSV formats
  • Saves progress checkpoints every 10 lots
  • Timestamped filenames for tracking
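
A hedged sketch of timestamped JSON/CSV export; the output directory matches the File Structure section below, but the exact filename prefix is an assumption:

```python
import csv
import json
from datetime import datetime
from pathlib import Path

def export_results(lots: list[dict], out_dir: str = "output", label: str = "final"):
    """Write scraped lots to timestamped JSON and CSV files and return their paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    json_path = Path(out_dir) / f"lots_{label}_{stamp}.json"
    json_path.write_text(json.dumps(lots, ensure_ascii=False, indent=2), encoding="utf-8")

    csv_path = Path(out_dir) / f"lots_{label}_{stamp}.csv"
    if lots:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0].keys()))
            writer.writeheader()
            writer.writerows(lots)
    return json_path, csv_path
```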

Data Flow

1. Listing Pages → Extract lot URLs → Store in memory
         ↓
2. For each lot URL → Check cache → If cached: use cached content
         ↓                          If not: fetch with rate limit
         ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
         ↓
4. Every 10 lots → Save progress checkpoint
         ↓
5. All lots complete → Export final JSON + CSV
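
Tying the steps together, an illustrative driver loop that reuses the helper sketches from the component sections above (not the actual main.py control flow):

```python
async def run(lot_urls, cache, limiter):
    """Illustrative driver for steps 2-5, built from the sketches above."""
    results = []
    for i, url in enumerate(lot_urls, start=1):
        html = cache.get(url)                      # step 2: check cache first
        if html is None:
            await limiter.wait()                   # respect the 0.5 s gap
            html = await fetch_page(url)           # Playwright sketch above
            cache.set(url, html)

        data = extract_next_data(html)             # step 3: parse __NEXT_DATA__
        if data is not None:
            results.append({"url": url, "data": data})

        if i % 10 == 0:                            # step 4: checkpoint every 10 lots
            export_results(results, label="partial")

    export_results(results, label="final")         # step 5: final JSON + CSV
    return results
```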

Key Design Decisions

Why Playwright?

  • Handles JavaScript-rendered content (Next.js)
  • Bypasses Cloudflare protection
  • More reliable than requests/BeautifulSoup for modern SPAs

Why JSON extraction?

  • Site uses Next.js with embedded __NEXT_DATA__
  • JSON is more reliable than HTML pattern matching
  • Avoids breaking when HTML/CSS changes
  • Faster parsing

Why SQLite caching?

  • Persistent across runs
  • Reduces load on target server
  • Enables test mode without re-fetching
  • Respects website resources

File Structure

troost-scraper/
├── main.py                    # Main scraper logic
├── requirements.txt           # Python dependencies
├── README.md                  # Documentation
├── .gitignore                 # Git exclusions
└── output/                    # Generated files (not in git)
    ├── cache.db              # SQLite cache
    ├── *_partial_*.json      # Progress checkpoints
    ├── *_final_*.json        # Final JSON output
    └── *_final_*.csv         # Final CSV output

Classes

CacheManager

  • __init__(db_path) - Initialize cache database
  • get(url, max_age_hours) - Retrieve cached page
  • set(url, content, status_code) - Cache a page
  • clear_old(max_age_hours) - Remove old entries
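
Typical usage, assuming the signatures above and the 7-day (168-hour) retention window mentioned earlier; the URL and content are placeholders:

```python
cache = CacheManager("output/cache.db")
url = "https://example.com/lot/123"          # illustrative URL
html = cache.get(url, max_age_hours=168)     # 7-day window
if html is None:
    html = "<html>...</html>"                # would come from the Playwright fetch
    cache.set(url, html, status_code=200)
cache.clear_old(max_age_hours=168)
```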

TroostwijkScraper

  • crawl_auctions(max_pages) - Main entry point
  • crawl_listing_page(page, page_num) - Extract lot URLs
  • crawl_lot(page, url) - Scrape individual lot
  • _extract_nextjs_data(content) - Parse JSON data
  • _parse_lot_page(content, url) - Extract all fields
  • save_final_results(data) - Export JSON + CSV
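
A hedged usage sketch; that the constructor takes no required arguments and that crawl_auctions() is a coroutine are assumptions based on the async design noted below:

```python
import asyncio

# Assumptions: no required constructor arguments, and crawl_auctions()
# is an async coroutine (only the signature is documented above).
scraper = TroostwijkScraper()
asyncio.run(scraper.crawl_auctions(max_pages=3))
```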

Scalability Notes

  • Rate limiting prevents IP blocks but slows execution
  • Caching avoids re-fetching unchanged pages, making subsequent runs near-instant
  • Progress checkpoints allow resuming after interruption
  • Async/await used throughout for non-blocking I/O