Getting Started
Tour edited this page 2025-12-03 12:33:06 +01:00


Prerequisites

  • Python 3.8+
  • Git
  • pip (Python package manager)

Installation

1. Clone the repository

git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper

2. Install dependencies

pip install -r requirements.txt

3. Install Playwright browsers

playwright install chromium

Configuration

Edit the configuration in main.py:

BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"           # Path to cache database
OUTPUT_DIR = "/path/to/output"            # Output directory
RATE_LIMIT_SECONDS = 0.5                  # Delay between requests
MAX_PAGES = 50                            # Number of listing pages

Windows users: escape backslashes in path strings (e.g. CACHE_DB = "C:\\output\\cache.db") or use a raw string (r"C:\output\cache.db").
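As a minimal sketch, the settings above can be sanity-checked before a run. The constant names come from main.py; the validation itself is illustrative and not part of the scraper:

```python
from pathlib import Path

# Settings mirrored from main.py (the values here are placeholders)
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"
OUTPUT_DIR = "/path/to/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50

def validate_config() -> None:
    """Fail fast on obviously bad settings (illustrative, not in main.py)."""
    assert BASE_URL.startswith("https://"), "BASE_URL must be an https URL"
    assert RATE_LIMIT_SECONDS >= 0, "RATE_LIMIT_SECONDS cannot be negative"
    assert MAX_PAGES > 0, "MAX_PAGES must be positive"
    # Path() normalizes separators, so the same check works on Windows
    assert Path(OUTPUT_DIR).name != "", "OUTPUT_DIR must not be empty"

validate_config()
```

Using pathlib.Path for CACHE_DB and OUTPUT_DIR sidesteps the backslash-escaping issue entirely.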

Usage

Basic scraping

python main.py

This will:

  1. Crawl listing pages to collect lot URLs
  2. Scrape each individual lot page
  3. Save results in JSON and CSV formats
  4. Cache all pages for future runs
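The caching in step 4 can be pictured as a simple URL-to-HTML table in SQLite. This is a sketch only; the actual schema used by cache.db may differ:

```python
import sqlite3

class PageCache:
    """Minimal URL -> HTML cache; the real schema in cache.db may differ."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
        )

    def get(self, url: str):
        # Return the cached HTML, or None on a cache miss
        row = self.conn.execute(
            "SELECT html FROM pages WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None

    def put(self, url: str, html: str) -> None:
        # Overwrite any existing entry so re-scrapes refresh the cache
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )
        self.conn.commit()

cache = PageCache()
cache.put("https://www.troostwijkauctions.com/a/lot-1", "<html>...</html>")
```

Because cached pages persist across runs, repeated invocations only fetch lots that are not yet in the database.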

Test mode

Debug extraction on a specific URL:

python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
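A --test flag like the one above is typically wired up with argparse. This is a hypothetical sketch; the real CLI lives in main.py and may define more options:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the CLI; the actual parser lives in main.py
    parser = argparse.ArgumentParser(description="Troostwijk auction scraper")
    parser.add_argument(
        "--test",
        metavar="URL",
        default=None,
        help="debug extraction on a single lot URL instead of a full crawl",
    )
    return parser.parse_args(argv)

args = parse_args(["--test", "https://www.troostwijkauctions.com/a/lot-url"])
print(args.test)
```

When --test is omitted, args.test is None and a full crawl would proceed as in basic scraping.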

Output

The scraper generates:

  • troostwijk_lots_final_YYYYMMDD_HHMMSS.json - Complete data
  • troostwijk_lots_final_YYYYMMDD_HHMMSS.csv - CSV export
  • cache.db - SQLite cache (persistent)
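The timestamped names above follow the strftime pattern %Y%m%d_%H%M%S. A small helper (illustrative; the scraper builds these names internally) shows how they are formed:

```python
from datetime import datetime

def output_names(stamp: datetime):
    """Build the timestamped JSON/CSV filenames from the pattern above."""
    ts = stamp.strftime("%Y%m%d_%H%M%S")
    return (
        f"troostwijk_lots_final_{ts}.json",
        f"troostwijk_lots_final_{ts}.csv",
    )

json_name, csv_name = output_names(datetime(2025, 12, 3, 12, 33, 6))
print(json_name)  # → troostwijk_lots_final_20251203_123306.json
```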