# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation
1. Clone the repository:

   ```bash
   git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
   cd troost-scraper
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install the Playwright browsers:

   ```bash
   playwright install chromium
   ```
## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"   # Path to the cache database
OUTPUT_DIR = "/path/to/output"   # Output directory
RATE_LIMIT_SECONDS = 0.5         # Delay between requests
MAX_PAGES = 50                   # Number of listing pages
```

Windows users: use escaped paths such as `C:\\output\\cache.db`.
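Hardcoded path separators are a frequent source of Windows breakage. As an alternative to editing raw strings, the same settings can be built with `pathlib`, which picks the right separator on every platform. This is a minimal sketch, not the project's actual code; the constant names merely mirror those above:

```python
from pathlib import Path

# pathlib.Path uses the correct separator on POSIX and Windows alike,
# so no manual escaping (C:\\...) is needed.
OUTPUT_DIR = Path("output")
CACHE_DB = OUTPUT_DIR / "cache.db"

# Create the output directory up front if it does not exist yet.
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
```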
## Usage

### Basic scraping

```bash
python main.py
```
This will:
- Crawl listing pages to collect lot URLs
- Scrape each individual lot page
- Save results in JSON and CSV formats
- Cache all pages for future runs
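The "cache all pages" step can be pictured as a URL-keyed SQLite table: look the URL up first, and only fetch over the network when it is missing. The sketch below is an assumption about how `cache.db` might work, not the scraper's real schema; the table name `pages` and both helper functions are illustrative:

```python
import sqlite3
from typing import Optional

def get_cached(conn: sqlite3.Connection, url: str) -> Optional[str]:
    """Return the cached HTML for url, or None on a cache miss."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None

def put_cached(conn: sqlite3.Connection, url: str, html: str) -> None:
    """Store (or refresh) the HTML for url."""
    conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()

# In-memory stand-in for cache.db, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")
put_cached(conn, "https://example.com/a/lot-1", "<html>lot 1</html>")
```

On a second run, `get_cached` returns the stored HTML immediately, so pages that were already seen cost no network request.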
### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
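A flag like `--test` is typically wired through `argparse`. The sketch below shows one plausible shape of the entry point, not the actual code in `main.py`; the parser description and the demo invocation are assumptions:

```python
import argparse

def parse_args(argv=None):
    """Parse CLI arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Troostwijk auction scraper")
    parser.add_argument(
        "--test",
        metavar="URL",
        help="debug extraction on a single lot URL instead of running a full crawl",
    )
    return parser.parse_args(argv)

# Demo invocation; in main.py this would read the real command line.
args = parse_args(["--test", "https://www.troostwijkauctions.com/a/lot-url"])
```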
## Output

The scraper generates:

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json`: complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv`: CSV export
- `cache.db`: SQLite cache (persistent)
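Because the JSON and CSV filenames carry a timestamp, downstream scripts usually have to locate the newest run themselves. A small sketch, assuming all results land in a single output directory; the demo file and temporary directory are fabricated for illustration:

```python
import glob
import json
import os
import tempfile

def latest_results(output_dir: str):
    """Load lots from the newest troostwijk_lots_final_*.json, or None if absent."""
    pattern = os.path.join(output_dir, "troostwijk_lots_final_*.json")
    files = glob.glob(pattern)
    if not files:
        return None
    newest = max(files, key=os.path.getmtime)  # most recently written run wins
    with open(newest, encoding="utf-8") as f:
        return json.load(f)

# Demo: write one sample results file into a temporary directory and read it back.
demo_dir = tempfile.mkdtemp()
sample = os.path.join(demo_dir, "troostwijk_lots_final_20240101_120000.json")
with open(sample, "w", encoding="utf-8") as f:
    json.dump([{"title": "Example lot"}], f)
lots = latest_results(demo_dir)
```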