Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

Features

  • Playwright-based scraping - Bypasses Cloudflare protection
  • SQLite caching - Caches every page to avoid redundant requests
  • Rate limiting - Enforces a minimum delay of 0.5 seconds between requests
  • Multi-format output - Exports data in both JSON and CSV formats
  • Progress saving - Automatically saves progress every 10 lots
  • Test mode - Debug extraction patterns on cached pages

Requirements

  • Python 3.8+
  • Playwright (with Chromium browser)

Installation

  1. Clone or download this project

  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Install Playwright browsers:

    playwright install chromium
    

Configuration

Edit the configuration variables in main.py:

BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"      # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"              # Output directory
RATE_LIMIT_SECONDS = 0.5                           # Delay between requests
MAX_PAGES = 50                                     # Number of listing pages to crawl

Note: Update the paths to match your system (on Windows, use paths like C:\output\cache.db).

Usage

Basic Scraping

Run the scraper to collect auction lot data:

python main.py

This will:

  1. Crawl listing pages to collect lot URLs
  2. Scrape each individual lot page
  3. Save results in both JSON and CSV formats
  4. Cache all pages to avoid re-fetching

Test Mode

Test extraction patterns on a specific cached URL:

# Test with default URL
python main.py --test

# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"

This is useful for debugging extraction patterns and verifying data is being extracted correctly.

Output Files

The scraper generates the following files:

During Execution

  • troostwijk_lots_partial_YYYYMMDD_HHMMSS.json - Progress checkpoints (every 10 lots)

Final Output

  • troostwijk_lots_final_YYYYMMDD_HHMMSS.json - Complete data in JSON format
  • troostwijk_lots_final_YYYYMMDD_HHMMSS.csv - Complete data in CSV format

Cache

  • cache.db - SQLite database with cached page content (persistent across runs)

Data Extracted

For each auction lot, the scraper extracts:

  • URL - Direct link to the lot
  • Lot ID - Unique identifier (e.g., A7-35847)
  • Title - Lot title/description
  • Current Bid - Current bid amount
  • Bid Count - Number of bids placed
  • End Date - Auction end time
  • Location - Physical location of the item
  • Description - Detailed description
  • Category - Auction category
  • Images - Up to 5 product images
  • Scraped At - Timestamp of data collection
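
For illustration, a single record in the JSON output might look like the following (the values are invented, and the exact key names may differ from those main.py writes):

{
    "url": "https://www.troostwijkauctions.com/a/lot-url-here",
    "lot_id": "A7-35847",
    "title": "Electric forklift, 2.5 t",
    "current_bid": "€ 1,250",
    "bid_count": 12,
    "end_date": "2025-12-10T18:00:00",
    "location": "Amsterdam, NL",
    "description": "Well-maintained electric forklift ...",
    "category": "Industrial machinery",
    "images": ["https://example.com/image1.jpg"],
    "scraped_at": "2025-12-03T11:44:11"
}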

How It Works

Phase 1: Collect Lot URLs

The scraper iterates through auction listing pages (/auctions?page=N) and collects all lot URLs.
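
A minimal sketch of this phase (the function name and the link pattern are illustrative; main.py's actual logic may differ):

import re
from playwright.sync_api import sync_playwright

BASE_URL = "https://www.troostwijkauctions.com"

def collect_lot_urls(max_pages: int) -> list:
    """Visit each listing page and harvest unique lot URLs."""
    seen, urls = set(), []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for n in range(1, max_pages + 1):
            # (cache lookup and rate limiting omitted for brevity)
            page.goto(f"{BASE_URL}/auctions?page={n}")
            html = page.content()
            # Assumed link shape: lot pages live under /a/...
            for path in re.findall(r'href="(/a/[^"]+)"', html):
                if path not in seen:
                    seen.add(path)
                    urls.append(BASE_URL + path)
        browser.close()
    return urls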

Phase 2: Scrape Individual Lots

Each lot page is visited and data is extracted from the embedded __NEXT_DATA__ JSON. The site is built with Next.js and ships all auction/lot data in that structure, which makes extraction reliable and fast.
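
A minimal sketch of pulling that structure out of a fetched page, assuming the standard Next.js script-tag markup (the props.pageProps path is standard Next.js; the keys below it are specific to this site):

import json
import re

def extract_next_data(html: str) -> dict:
    """Parse the Next.js state embedded in the page."""
    m = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html, re.DOTALL)
    if not m:
        return {}
    payload = json.loads(m.group(1))
    # Next.js nests page data under props.pageProps
    return payload.get("props", {}).get("pageProps", {})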

Caching Strategy

  • Every successfully fetched page is cached in SQLite
  • Cache is checked before making any request
  • Cache entries older than 7 days are automatically cleaned
  • Failed requests (500 errors) are also cached to avoid retrying
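
A minimal sketch of that lookup-before-fetch flow, assuming a cache table with url, content, and fetched_at columns (main.py's actual schema may differ):

import sqlite3
import time
from typing import Optional

CACHE_MAX_AGE = 7 * 24 * 3600  # seconds; matches the 7-day cleanup window

def get_cached(db: sqlite3.Connection, url: str) -> Optional[str]:
    """Return cached content for url, or None if absent."""
    # Auto-clean entries older than 7 days
    db.execute("DELETE FROM cache WHERE fetched_at < ?",
               (time.time() - CACHE_MAX_AGE,))
    row = db.execute("SELECT content FROM cache WHERE url = ?",
                     (url,)).fetchone()
    return row[0] if row else None

def put_cached(db: sqlite3.Connection, url: str, content: str) -> None:
    """Store a fetched page (including error pages) so it is not re-fetched."""
    db.execute("INSERT OR REPLACE INTO cache (url, content, fetched_at) "
               "VALUES (?, ?, ?)", (url, content, time.time()))
    db.commit()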

Rate Limiting

  • Enforces a delay of at least 0.5 seconds between ALL requests
  • Applies to both listing pages and individual lot pages
  • Prevents server overload and potential IP blocking
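
A minimal sketch of such a limiter (the class name is illustrative):

import time

class RateLimiter:
    """Keep consecutive requests at least `delay` seconds apart."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self._last = 0.0

    def wait(self) -> None:
        """Block until `delay` seconds have passed since the last call."""
        gap = time.monotonic() - self._last
        if gap < self.delay:
            time.sleep(self.delay - gap)
        self._last = time.monotonic()

Calling wait() immediately before every page navigation produces the spacing described above.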

Troubleshooting

Issue: "Huidig bod" / "Locatie" instead of actual values

✓ FIXED! The site uses Next.js with all data embedded in __NEXT_DATA__ JSON. The scraper now automatically extracts data from JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:

  • Title from auction.name
  • Location from viewingDays or collectionDays
  • Images from auction.image.url
  • End date from minEndDate
  • Lot ID from auction.displayId
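
The order of operations is roughly the following (a sketch; extract_next_data is the helper from the Phase 2 sketch above, and the HTML fallback body is elided):

def parse_lot(html: str) -> dict:
    """Prefer the embedded JSON; fall back to HTML pattern matching."""
    page_props = extract_next_data(html)  # see the Phase 2 sketch
    auction = page_props.get("auction")
    if auction:
        return {
            "title": auction.get("name", ""),
            "lot_id": auction.get("displayId", ""),
            "end_date": auction.get("minEndDate", ""),
            "images": [(auction.get("image") or {}).get("url", "")],
        }
    # JSON missing or incomplete: fall back to per-field regex
    # patterns over the raw HTML (as main.py does).
    return {}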

To verify extraction is working:

python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"

Note: Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

Issue: No lots found

  • Check if the website structure has changed
  • Verify BASE_URL is correct
  • Try clearing the cache database

Issue: Cloudflare blocking

  • Playwright should bypass this automatically
  • If issues persist, try adjusting user agent or headers in crawl_auctions()

Issue: Slow scraping

  • This is intentional due to rate limiting (0.5s between requests)
  • Adjust RATE_LIMIT_SECONDS if needed (not recommended below 0.5s)
  • First run will be slower; subsequent runs use cache

Project Structure

troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md           # This file
└── output/             # Generated output files (created automatically)
    ├── cache.db        # SQLite cache
    ├── *.json          # JSON output files
    └── *.csv           # CSV output files

Development

Adding New Extraction Fields

  1. Add extraction method in TroostwijkScraper class:

    def _extract_new_field(self, content: str) -> str:
        """Extract the new field from the raw page content."""
        pattern = r'your-regex-pattern'  # replace with a real pattern
        match = re.search(pattern, content)
        return match.group(1) if match else ""
    
  2. Add field to _parse_lot_page():

    data = {
        # ... existing fields ...
        'new_field': self._extract_new_field(content),
    }
    
  3. Add field to CSV export in save_final_results():

    fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
    

Testing Extraction Patterns

Use test mode to verify patterns work correctly:

python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"

License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

Notes

  • Be respectful: The rate limiting is intentionally conservative
  • Check legality: Ensure web scraping is permitted in your jurisdiction
  • Monitor changes: Website structure may change over time, requiring pattern updates
  • Cache management: Old cache entries are auto-cleaned after 7 days