# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation

### 1. Clone the repository

```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Install Playwright browsers

```bash
playwright install chromium
```

## Configuration

Edit the configuration constants in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"    # Path to the SQLite cache database
OUTPUT_DIR = "/path/to/output"    # Directory for JSON/CSV output
RATE_LIMIT_SECONDS = 0.5          # Delay between requests
MAX_PAGES = 50                    # Maximum number of listing pages to crawl
```

**Windows users:** backslashes must be escaped in Python strings, e.g. `CACHE_DB = "C:\\output\\cache.db"`, or use a raw string (`r"C:\output\cache.db"`).

## Usage

### Basic scraping

```bash
python main.py
```

This will:

1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs

### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```

## Output

The scraper generates:

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete lot data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite page cache (persists between runs)

The sketches below show one way to inspect these files.
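### Working with the output

Not part of the scraper itself, but a quick way to sanity-check a run. This is a minimal sketch that loads the most recent JSON export; it assumes the files live in the current directory and hold a list of lot records, which isn't guaranteed by anything above.

```python
import glob
import json

# Pick the most recent export; the timestamped filename pattern
# (YYYYMMDD_HHMMSS) sorts correctly as plain text.
paths = sorted(glob.glob("troostwijk_lots_final_*.json"))
if not paths:
    raise SystemExit("No output files found - run `python main.py` first")

with open(paths[-1], encoding="utf-8") as f:
    lots = json.load(f)

# Assumes a list of dicts; adjust if the actual layout differs.
print(f"{paths[-1]}: {len(lots)} lots")
if lots:
    print("Fields:", ", ".join(sorted(lots[0])))
```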
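### Inspecting the cache

The cache schema isn't documented here, so the sketch below only lists whatever tables `cache.db` contains instead of assuming column names. Point it at the `CACHE_DB` path from your configuration.

```python
import sqlite3

# Note: connect() creates an empty database if the path doesn't
# exist yet, so run the scraper at least once first.
conn = sqlite3.connect("cache.db")  # use your CACHE_DB path

# Enumerate tables before assuming any schema.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
for (name,) in tables:
    rows = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    print(f"{name}: {rows} rows")

conn.close()
```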