# Troostwijk Auctions Scraper
A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.
## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages
## Requirements
- Python 3.8+
- Playwright (with Chromium browser)
## Installation

1. Clone or download this project

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install the Playwright browsers:

   ```bash
   playwright install chromium
   ```
## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (especially on Windows, where you'd use paths like `C:\output\cache.db`).
## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```
This will:
- Crawl listing pages to collect lot URLs
- Scrape each individual lot page
- Save results in both JSON and CSV formats
- Cache all pages to avoid re-fetching
### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with the default URL
python main.py --test

# Test with a specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```
This is useful for debugging extraction patterns and verifying that data is extracted correctly.
## Output Files

The scraper generates the following files:

### During Execution

- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache

- `cache.db` - SQLite database with cached page content (persistent across runs)
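For orientation, here is a minimal sketch of how the two final formats could be produced from a list of lot dicts. The function name and details are illustrative, not necessarily what `main.py` does:

```python
import csv
import json
from datetime import datetime

def save_results(lots, output_dir):
    """Write the collected lots as timestamped JSON and CSV files (sketch)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # JSON: dump the full list of records as-is.
    with open(f"{output_dir}/troostwijk_lots_final_{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(lots, f, ensure_ascii=False, indent=2)

    # CSV: use the first record's keys as the header row.
    if lots:
        with open(f"{output_dir}/troostwijk_lots_final_{stamp}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0].keys()))
            writer.writeheader()
            writer.writerows(lots)
```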
## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
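Put together, each record can be pictured as a structure like the following. This is a hypothetical dataclass for illustration only; the scraper itself may well use plain dicts:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Lot:
    url: str
    lot_id: str            # e.g. "A7-35847"
    title: str
    current_bid: str
    bid_count: int
    end_date: str
    location: str
    description: str
    category: str
    images: List[str] = field(default_factory=list)  # up to 5 image URLs
    scraped_at: str = ""   # timestamp of data collection
```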
## How It Works

### Phase 1: Collect Lot URLs

The scraper iterates through the auction listing pages (`/auctions?page=N`) and collects all lot URLs.
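A minimal sketch of this phase using Playwright's sync API. The `/a/` href pattern is an assumption inferred from the lot URLs elsewhere in this README, not verified against the live site:

```python
from playwright.sync_api import sync_playwright

def collect_lot_urls(base_url, max_pages):
    """Walk the paginated auction listing and gather lot links (sketch)."""
    urls = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for n in range(1, max_pages + 1):
            page.goto(f"{base_url}/auctions?page={n}")
            # The '/a/' href pattern is an assumption; confirm it on the site.
            for link in page.query_selector_all("a[href*='/a/']"):
                href = link.get_attribute("href")
                if href and href not in urls:
                    urls.append(href)
        browser.close()
    return urls
```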
### Phase 2: Scrape Individual Lots

Each lot page is visited, and data is extracted from the embedded `__NEXT_DATA__` JSON. The site is built with Next.js, which embeds all auction/lot data in a single JSON structure, making extraction reliable and fast.
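The core extraction idea can be sketched like this. The regex is a common way to grab the Next.js data island; the key path in the usage comment is an assumption that must be confirmed against a real page:

```python
import json
import re

def extract_next_data(html):
    """Pull the Next.js JSON data island out of a page's HTML (sketch)."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}

# Hypothetical usage -- the exact key path under 'props' varies by page type:
# lot_data = extract_next_data(html).get("props", {}).get("pageProps", {})
```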
### Caching Strategy
- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
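A minimal sketch of the get-or-fetch pattern, assuming a simple `pages(url, content, fetched_at)` table; the actual schema in `cache.db` may differ:

```python
import sqlite3
import time

CACHE_TTL = 7 * 24 * 3600  # seconds; entries older than 7 days are stale

def get_cached(db, url):
    """Return cached page content for a URL if still fresh, else None."""
    row = db.execute(
        "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < CACHE_TTL:
        return row[0]
    return None

def put_cached(db, url, content):
    """Store (or refresh) a page in the cache."""
    db.execute(
        "INSERT OR REPLACE INTO pages (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    db.commit()
```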
### Rate Limiting
- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
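The standard way to enforce such a gap is to sleep off the remainder of the interval before each request. A sketch (the class name is illustrative):

```python
import time

class RateLimiter:
    """Ensure at least `interval` seconds elapse between consecutive calls."""

    def __init__(self, interval=0.5):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever portion of the interval has not yet elapsed.
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

# limiter = RateLimiter(0.5)
# limiter.wait()  # call before every request, listing and lot pages alike
```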
## Troubleshooting

### Issue: "Huidig bod" / "Locatie" (the Dutch field labels) scraped instead of actual values

**Fixed.** The site uses Next.js with all data embedded in the `__NEXT_DATA__` JSON. The scraper now extracts data from that JSON first, falling back to HTML pattern matching only when needed.

The scraper correctly extracts:
- Title from `auction.name`
- Location from `viewingDays` or `collectionDays`
- Images from `auction.image.url`
- End date from `minEndDate`
- Lot ID from `auction.displayId`
To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```
**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
### Issue: No lots found

- Check whether the website structure has changed
- Verify that `BASE_URL` is correct
- Try clearing the cache database
### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`
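As a sketch of what such an adjustment could look like: the user-agent string and locale below are placeholder values, and it is an assumption that `crawl_auctions()` creates its browser context in roughly this way:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A realistic user agent and locale can help with Cloudflare challenges.
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        locale="en-US",
    )
    page = context.new_page()
    page.goto("https://www.troostwijkauctions.com/auctions?page=1")
    html = page.content()
    browser.close()
```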
### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5s is not recommended)
- The first run will be slower; subsequent runs use the cache
## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```
## Development

### Adding New Extraction Fields

1. Add an extraction method in the `TroostwijkScraper` class:

   ```python
   def _extract_new_field(self, content: str) -> str:
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field in `_parse_lot_page()`:

   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:

   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```
### Testing Extraction Patterns

Use test mode to verify that patterns work correctly:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```
## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and `robots.txt` when using this tool.
## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** The website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days