# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Install Playwright browsers:**

   ```bash
   playwright install chromium
   ```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (on Windows, use paths like `C:\\output\\cache.db`).

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:

1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with default URL
python main.py --test

# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying that data is being extracted correctly.

## Output Files

The scraper generates the following files:

### During Execution

- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache

- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection

## How It Works

### Phase 1: Collect Lot URLs

The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.

### Phase 2: Scrape Individual Lots

Each lot page is visited and data is extracted from the embedded JSON (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast.
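As an illustration of the Phase 2 approach, the sketch below shows how the embedded JSON might be pulled out of a fetched page. The key paths (`props.pageProps.auction` and the field names listed under Troubleshooting) are assumptions based on common Next.js conventions and the fields named in this README; the live payload may be structured differently.

```python
# Minimal sketch: extract the embedded Next.js JSON from a page's HTML.
# The props.pageProps.auction path and field names below are assumptions,
# not a guaranteed layout of the live __NEXT_DATA__ payload.
import json
import re


def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ payload, or an empty dict if absent."""
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        return {}
    return json.loads(match.group(1))


def parse_lot(html: str) -> dict:
    """Pull a few of the fields described above out of the embedded JSON."""
    data = extract_next_data(html)
    auction = data.get("props", {}).get("pageProps", {}).get("auction", {})
    return {
        "lot_id": auction.get("displayId", ""),
        "title": auction.get("name", ""),
        "end_date": auction.get("minEndDate", ""),
    }
```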
### Caching Strategy

- Every successfully fetched page is cached in SQLite
- The cache is checked before any request is made
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying

### Rate Limiting

- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

**✓ FIXED!** The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON. The scraper now extracts data from this JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:

- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`

To verify extraction is working:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

### Issue: No lots found

- Check if the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (not recommended below 0.5s)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method in the `TroostwijkScraper` class:

   ```python
   def _extract_new_field(self, content: str) -> str:
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field to `_parse_lot_page()`:

   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:

   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```

### Testing Extraction Patterns

Use test mode to verify patterns work correctly:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** The website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days
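For reference, the fragment below sketches the cache-lookup and rate-limiting behaviour described in the Caching Strategy and Rate Limiting sections. The `pages` table layout and function names are illustrative assumptions and do not necessarily match the actual code in `main.py`.

```python
# Minimal sketch of the caching and rate-limiting behaviour described above.
# Table and function names are illustrative, not the actual main.py code.
import sqlite3
import time

RATE_LIMIT_SECONDS = 0.5
_last_request_time = 0.0


def rate_limit() -> None:
    """Sleep just long enough to keep at least 0.5 s between requests."""
    global _last_request_time
    elapsed = time.monotonic() - _last_request_time
    if elapsed < RATE_LIMIT_SECONDS:
        time.sleep(RATE_LIMIT_SECONDS - elapsed)
    _last_request_time = time.monotonic()


def get_cached(conn: sqlite3.Connection, url: str, max_age_days: int = 7):
    """Return cached page content if present and under max_age_days old."""
    row = conn.execute(
        "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < max_age_days * 86400:
        return row[0]
    return None
```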