# Troostwijk Auctions Scraper
A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.
## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages
## Requirements
- Python 3.8+
- Playwright (with Chromium browser)
## Installation

1. Clone or download this project

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install the Playwright browsers:

   ```bash
   playwright install chromium
   ```
## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (especially on Windows, where you'd use paths like `C:\output\cache.db`).
## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```
This will:
- Crawl listing pages to collect lot URLs
- Scrape each individual lot page
- Save results in both JSON and CSV formats
- Cache all pages to avoid re-fetching
### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with the default URL
python main.py --test

# Test with a specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```
This is useful for debugging extraction patterns and verifying that data is extracted correctly.
## Output Files

The scraper generates the following files:

### During Execution

- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache

- `cache.db` - SQLite database with cached page content (persistent across runs)
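For orientation, here is a minimal sketch of how the two final formats could be produced from a list of lot dicts. The function name and details are illustrative, not necessarily what `main.py` does:

```python
import csv
import json
from datetime import datetime

def save_results(lots, output_dir):
    """Write the collected lots as timestamped JSON and CSV files (sketch)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # JSON: dump the full list of records as-is.
    with open(f"{output_dir}/troostwijk_lots_final_{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(lots, f, ensure_ascii=False, indent=2)

    # CSV: use the first record's keys as the header row.
    if lots:
        with open(f"{output_dir}/troostwijk_lots_final_{stamp}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0].keys()))
            writer.writeheader()
            writer.writerows(lots)
```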
## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
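Put together, each record can be pictured as a structure like the following. This is a hypothetical dataclass for illustration only; the scraper itself may well use plain dicts:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Lot:
    url: str
    lot_id: str            # e.g. "A7-35847"
    title: str
    current_bid: str
    bid_count: int
    end_date: str
    location: str
    description: str
    category: str
    images: List[str] = field(default_factory=list)  # up to 5 image URLs
    scraped_at: str = ""   # timestamp of data collection
```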
## How It Works

### Phase 1: Collect Lot URLs

The scraper iterates through the auction listing pages (`/auctions?page=N`) and collects all lot URLs.
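A minimal sketch of this phase using Playwright's sync API. The `/a/` href pattern is an assumption inferred from the lot URLs elsewhere in this README, not verified against the live site:

```python
from playwright.sync_api import sync_playwright

def collect_lot_urls(base_url, max_pages):
    """Walk the paginated auction listing and gather lot links (sketch)."""
    urls = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for n in range(1, max_pages + 1):
            page.goto(f"{base_url}/auctions?page={n}")
            # The '/a/' href pattern is an assumption; confirm it on the site.
            for link in page.query_selector_all("a[href*='/a/']"):
                href = link.get_attribute("href")
                if href and href not in urls:
                    urls.append(href)
        browser.close()
    return urls
```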
### Phase 2: Scrape Individual Lots

Each lot page is visited, and data is extracted from the embedded `__NEXT_DATA__` JSON. The site is built with Next.js, which embeds all auction/lot data in a single JSON structure, making extraction reliable and fast.
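The core extraction idea can be sketched like this. The regex is a common way to grab the Next.js data island; the key path in the usage comment is an assumption that must be confirmed against a real page:

```python
import json
import re

def extract_next_data(html):
    """Pull the Next.js JSON data island out of a page's HTML (sketch)."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}

# Hypothetical usage -- the exact key path under 'props' varies by page type:
# lot_data = extract_next_data(html).get("props", {}).get("pageProps", {})
```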
### Caching Strategy
- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
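A minimal sketch of the get-or-fetch pattern, assuming a simple `pages(url, content, fetched_at)` table; the actual schema in `cache.db` may differ:

```python
import sqlite3
import time

CACHE_TTL = 7 * 24 * 3600  # seconds; entries older than 7 days are stale

def get_cached(db, url):
    """Return cached page content for a URL if still fresh, else None."""
    row = db.execute(
        "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < CACHE_TTL:
        return row[0]
    return None

def put_cached(db, url, content):
    """Store (or refresh) a page in the cache."""
    db.execute(
        "INSERT OR REPLACE INTO pages (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    db.commit()
```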
### Rate Limiting
- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
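The standard way to enforce such a gap is to sleep off the remainder of the interval before each request. A sketch (the class name is illustrative):

```python
import time

class RateLimiter:
    """Ensure at least `interval` seconds elapse between consecutive calls."""

    def __init__(self, interval=0.5):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever portion of the interval has not yet elapsed.
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

# limiter = RateLimiter(0.5)
# limiter.wait()  # call before every request, listing and lot pages alike
```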
## Troubleshooting

### Issue: "Huidig bod" / "Locatie" (the Dutch field labels) scraped instead of actual values

**Fixed.** The site uses Next.js with all data embedded in the `__NEXT_DATA__` JSON. The scraper now extracts data from that JSON first, falling back to HTML pattern matching only when needed.

The scraper correctly extracts:
- Title from `auction.name`
- Location from `viewingDays` or `collectionDays`
- Images from `auction.image.url`
- End date from `minEndDate`
- Lot ID from `auction.displayId`
To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```
**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
### Issue: No lots found

- Check whether the website structure has changed
- Verify that `BASE_URL` is correct
- Try clearing the cache database
### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`
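As a sketch of what such an adjustment could look like: the user-agent string and locale below are placeholder values, and it is an assumption that `crawl_auctions()` creates its browser context in roughly this way:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A realistic user agent and locale can help with Cloudflare challenges.
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        locale="en-US",
    )
    page = context.new_page()
    page.goto("https://www.troostwijkauctions.com/auctions?page=1")
    html = page.content()
    browser.close()
```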
### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5s is not recommended)
- The first run will be slower; subsequent runs use the cache
## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```
## Development

### Adding New Extraction Fields

1. Add an extraction method in the `TroostwijkScraper` class:

   ```python
   def _extract_new_field(self, content: str) -> str:
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field in `_parse_lot_page()`:

   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:

   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```
### Testing Extraction Patterns

Use test mode to verify that patterns work correctly:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```
## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and `robots.txt` when using this tool.
## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** The website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days