# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation
1. Clone the repository:

   ```bash
   git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
   cd troost-scraper
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install the Playwright browsers:

   ```bash
   playwright install chromium
   ```
## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"   # Path to the cache database
OUTPUT_DIR = "/path/to/output"   # Output directory
RATE_LIMIT_SECONDS = 0.5         # Delay between requests
MAX_PAGES = 50                   # Number of listing pages
```

Windows users: use escaped paths such as `C:\\output\\cache.db`.
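Hardcoded path separators are a frequent source of Windows breakage. As an alternative to editing raw strings, the same settings can be built with `pathlib`, which picks the right separator on every platform. This is a minimal sketch, not the project's actual code; the constant names merely mirror those above:

```python
from pathlib import Path

# pathlib.Path uses the correct separator on POSIX and Windows alike,
# so no manual escaping (C:\\...) is needed.
OUTPUT_DIR = Path("output")
CACHE_DB = OUTPUT_DIR / "cache.db"

# Create the output directory up front if it does not exist yet.
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
```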
## Usage

### Basic scraping

```bash
python main.py
```
This will:
- Crawl listing pages to collect lot URLs
- Scrape each individual lot page
- Save results in JSON and CSV formats
- Cache all pages for future runs
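The "cache all pages" step can be pictured as a URL-keyed SQLite table: look the URL up first, and only fetch over the network when it is missing. The sketch below is an assumption about how `cache.db` might work, not the scraper's real schema; the table name `pages` and both helper functions are illustrative:

```python
import sqlite3
from typing import Optional

def get_cached(conn: sqlite3.Connection, url: str) -> Optional[str]:
    """Return the cached HTML for url, or None on a cache miss."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None

def put_cached(conn: sqlite3.Connection, url: str, html: str) -> None:
    """Store (or refresh) the HTML for url."""
    conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()

# In-memory stand-in for cache.db, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")
put_cached(conn, "https://example.com/a/lot-1", "<html>lot 1</html>")
```

On a second run, `get_cached` returns the stored HTML immediately, so pages that were already seen cost no network request.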
### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
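A flag like `--test` is typically wired through `argparse`. The sketch below shows one plausible shape of the entry point, not the actual code in `main.py`; the parser description and the demo invocation are assumptions:

```python
import argparse

def parse_args(argv=None):
    """Parse CLI arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Troostwijk auction scraper")
    parser.add_argument(
        "--test",
        metavar="URL",
        help="debug extraction on a single lot URL instead of running a full crawl",
    )
    return parser.parse_args(argv)

# Demo invocation; in main.py this would read the real command line.
args = parse_args(["--test", "https://www.troostwijkauctions.com/a/lot-url"])
```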
## Output

The scraper generates:

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json`: complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv`: CSV export
- `cache.db`: SQLite cache (persistent)
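Because the JSON and CSV filenames carry a timestamp, downstream scripts usually have to locate the newest run themselves. A small sketch, assuming all results land in a single output directory; the demo file and temporary directory are fabricated for illustration:

```python
import glob
import json
import os
import tempfile

def latest_results(output_dir: str):
    """Load lots from the newest troostwijk_lots_final_*.json, or None if absent."""
    pattern = os.path.join(output_dir, "troostwijk_lots_final_*.json")
    files = glob.glob(pattern)
    if not files:
        return None
    newest = max(files, key=os.path.getmtime)  # most recently written run wins
    with open(newest, encoding="utf-8") as f:
        return json.load(f)

# Demo: write one sample results file into a temporary directory and read it back.
demo_dir = tempfile.mkdtemp()
sample = os.path.join(demo_dir, "troostwijk_lots_final_20240101_120000.json")
with open(sample, "w", encoding="utf-8") as f:
    json.dump([{"title": "Example lot"}], f)
lots = latest_results(demo_dir)
```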