Getting Started
Tour edited this page 2025-12-03 12:33:06 +01:00


Prerequisites

  • Python 3.8+
  • Git
  • pip (Python package manager)

Installation

1. Clone the repository

git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper

2. Install dependencies

pip install -r requirements.txt

3. Install Playwright browsers

playwright install chromium

Configuration

Edit the configuration in main.py:

BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"           # Path to cache database
OUTPUT_DIR = "/path/to/output"            # Output directory
RATE_LIMIT_SECONDS = 0.5                  # Delay between requests
MAX_PAGES = 50                            # Number of listing pages

Windows users: escape backslashes in path strings (e.g. CACHE_DB = "C:\\output\\cache.db") or use a raw string (r"C:\output\cache.db").
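As a minimal sketch, the settings above can be sanity-checked before a run. The constant names come from main.py; the validation itself is illustrative and not part of the scraper:

```python
from pathlib import Path

# Settings mirrored from main.py (the values here are placeholders)
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"
OUTPUT_DIR = "/path/to/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50

def validate_config() -> None:
    """Fail fast on obviously bad settings (illustrative, not in main.py)."""
    assert BASE_URL.startswith("https://"), "BASE_URL must be an https URL"
    assert RATE_LIMIT_SECONDS >= 0, "RATE_LIMIT_SECONDS cannot be negative"
    assert MAX_PAGES > 0, "MAX_PAGES must be positive"
    # Path() normalizes separators, so the same check works on Windows
    assert Path(OUTPUT_DIR).name != "", "OUTPUT_DIR must not be empty"

validate_config()
```

Using pathlib.Path for CACHE_DB and OUTPUT_DIR sidesteps the backslash-escaping issue entirely.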

Usage

Basic scraping

python main.py

This will:

  1. Crawl listing pages to collect lot URLs
  2. Scrape each individual lot page
  3. Save results in JSON and CSV formats
  4. Cache all pages for future runs
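The caching in step 4 can be pictured as a simple URL-to-HTML table in SQLite. This is a sketch only; the actual schema used by cache.db may differ:

```python
import sqlite3

class PageCache:
    """Minimal URL -> HTML cache; the real schema in cache.db may differ."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
        )

    def get(self, url: str):
        # Return the cached HTML, or None on a cache miss
        row = self.conn.execute(
            "SELECT html FROM pages WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None

    def put(self, url: str, html: str) -> None:
        # Overwrite any existing entry so re-scrapes refresh the cache
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )
        self.conn.commit()

cache = PageCache()
cache.put("https://www.troostwijkauctions.com/a/lot-1", "<html>...</html>")
```

Because cached pages persist across runs, repeated invocations only fetch lots that are not yet in the database.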

Test mode

Debug extraction on a specific URL:

python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
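A --test flag like the one above is typically wired up with argparse. This is a hypothetical sketch; the real CLI lives in main.py and may define more options:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the CLI; the actual parser lives in main.py
    parser = argparse.ArgumentParser(description="Troostwijk auction scraper")
    parser.add_argument(
        "--test",
        metavar="URL",
        default=None,
        help="debug extraction on a single lot URL instead of a full crawl",
    )
    return parser.parse_args(argv)

args = parse_args(["--test", "https://www.troostwijkauctions.com/a/lot-url"])
print(args.test)
```

When --test is omitted, args.test is None and a full crawl would proceed as in basic scraping.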

Output

The scraper generates:

  • troostwijk_lots_final_YYYYMMDD_HHMMSS.json - Complete data
  • troostwijk_lots_final_YYYYMMDD_HHMMSS.csv - CSV export
  • cache.db - SQLite cache (persistent)
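The timestamped names above follow the strftime pattern %Y%m%d_%H%M%S. A small helper (illustrative; the scraper builds these names internally) shows how they are formed:

```python
from datetime import datetime

def output_names(stamp: datetime):
    """Build the timestamped JSON/CSV filenames from the pattern above."""
    ts = stamp.strftime("%Y%m%d_%H%M%S")
    return (
        f"troostwijk_lots_final_{ts}.json",
        f"troostwijk_lots_final_{ts}.csv",
    )

json_name, csv_name = output_names(datetime(2025, 12, 3, 12, 33, 6))
print(json_name)  # → troostwijk_lots_final_20251203_123306.json
```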