# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Enforces a minimum delay of 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Install Playwright browsers:**
   ```bash
   playwright install chromium
   ```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"      # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"              # Output directory
RATE_LIMIT_SECONDS = 0.5                           # Delay between requests
MAX_PAGES = 50                                     # Number of listing pages to crawl
```

**Note:** Update the paths to match your system. On Windows, either escape the backslashes (`"C:\\output\\cache.db"`) or use a raw string (`r"C:\output\cache.db"`).
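
To sidestep platform-specific path strings altogether, the two paths could be derived with `pathlib`. This is only a sketch: `resolve_paths` is a hypothetical helper, not part of `main.py`, and it assumes the `output/` folder should live under a base directory you choose.

```python
from pathlib import Path
from typing import Tuple


def resolve_paths(base_dir: Path) -> Tuple[Path, Path]:
    """Return (output_dir, cache_db) under base_dir, creating the directory."""
    output_dir = base_dir / "output"
    # Create the folder up front so SQLite and the exporters can write to it.
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir, output_dir / "cache.db"


# Example: keep everything under the current working directory.
# OUTPUT_DIR, CACHE_DB = resolve_paths(Path.cwd())
```

`Path` objects work on Windows, macOS, and Linux alike; wrap them in `str()` wherever a plain string path is expected.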

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with default URL
python main.py --test

# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying data is being extracted correctly.
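
A flag that works both with and without a URL can be expressed with `argparse` using `nargs="?"`. The sketch below is an assumption about how such a flag might be parsed, not a copy of `main.py`; `DEFAULT_TEST_URL` and `parse_args` are hypothetical names.

```python
import argparse

# Hypothetical placeholder; the real default test URL is defined in main.py.
DEFAULT_TEST_URL = "https://www.troostwijkauctions.com/a/lot-url-here"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Troostwijk Auctions scraper")
    parser.add_argument(
        "--test",
        nargs="?",               # the URL after --test is optional
        const=DEFAULT_TEST_URL,  # used when --test is given without a URL
        default=None,            # used when --test is absent (normal crawl)
        metavar="URL",
        help="run extraction on a single cached URL instead of a full crawl",
    )
    return parser.parse_args(argv)
```

With this shape, `args.test is None` means a normal crawl, and any other value is the URL to test.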

## Output Files

The scraper generates the following files:

### During Execution
- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format
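
The naming scheme above can be reproduced with `datetime.strftime`. The following is a minimal sketch, assuming each lot is a flat dictionary; `save_results` is a hypothetical helper and may differ from what `main.py` actually does.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def save_results(lots, output_dir, kind="final"):
    """Write lots to a timestamped JSON file, plus a CSV for final output."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / f"troostwijk_lots_{kind}_{stamp}.json"
    json_path.write_text(
        json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    written = [json_path]

    if kind == "final" and lots:
        csv_path = json_path.with_suffix(".csv")
        # Assumption: the first lot's keys define the CSV columns.
        fieldnames = list(lots[0].keys())
        with csv_path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(lots)
        written.append(csv_path)
    return written
```

A checkpoint every 10 lots would then be a call such as `save_results(lots, OUTPUT_DIR, kind="partial")` whenever `len(lots) % 10 == 0`.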

### Cache
- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection

## How It Works

### Phase 1: Collect Lot URLs
The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.
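
As a rough illustration of this phase, the sketch below builds the listing URLs and pulls lot links out of a page's HTML. It rests on two assumptions: page numbering starts at 1, and lot links look like `href="/a/..."` (as in the test-mode URLs). `listing_urls` and `extract_lot_urls` are hypothetical names.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.troostwijkauctions.com"

# Assumption: lot links are relative hrefs starting with /a/.
LOT_HREF = re.compile(r'href="(/a/[^"]+)"')


def listing_urls(max_pages):
    """Return the listing page URLs for pages 1..max_pages."""
    return [f"{BASE_URL}/auctions?page={n}" for n in range(1, max_pages + 1)]


def extract_lot_urls(html):
    """Return absolute lot URLs in page order, without duplicates."""
    seen = {}  # dicts keep insertion order, so this de-duplicates stably
    for path in LOT_HREF.findall(html):
        seen.setdefault(urljoin(BASE_URL, path), None)
    return list(seen)
```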

### Phase 2: Scrape Individual Lots
Each lot page is visited, and its fields are read from the embedded `__NEXT_DATA__` JSON. The site is built with Next.js and includes all auction/lot data in this JSON structure, which makes extraction more reliable and faster than parsing the rendered HTML.
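
Next.js pages carry this payload in a `<script id="__NEXT_DATA__">` tag, so it can be located with a regular expression and parsed with `json.loads`. The sketch below shows one way to do that; `extract_next_data` is a hypothetical name, and the layout of the payload beyond the tag itself is not assumed.

```python
import json
import re

# Matches the script tag regardless of attribute order.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Returning `None` rather than raising lets the caller fall back to HTML pattern matching when the payload is missing.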

### Caching Strategy
- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are deleted automatically
- Failed requests (500 errors) are also cached so they are not retried on every run
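
The strategy above could be sketched with the standard `sqlite3` module as follows. The table schema, the `PageCache` class, and its method names are assumptions made for illustration; the actual cache layout in `main.py` may differ.

```python
import sqlite3
import time

CACHE_MAX_AGE = 7 * 24 * 3600  # seven days, in seconds


class PageCache:
    """Minimal URL -> (content, status) cache backed by SQLite."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, content TEXT, status INTEGER, fetched_at REAL)"
        )

    def get(self, url):
        """Return (content, status), or None on a cache miss."""
        return self.conn.execute(
            "SELECT content, status FROM cache WHERE url = ?", (url,)
        ).fetchone()

    def put(self, url, content, status, now=None):
        """Store a response; failed responses (e.g. status 500) are stored too."""
        fetched_at = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (url, content, status, fetched_at),
        )
        self.conn.commit()

    def purge_old(self, now=None):
        """Delete entries older than CACHE_MAX_AGE."""
        cutoff = (time.time() if now is None else now) - CACHE_MAX_AGE
        self.conn.execute("DELETE FROM cache WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()
```

A fetch routine would call `get()` first, hit the network only on `None`, and then `put()` the result together with its status code.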

### Rate Limiting
- Enforces a delay of at least 0.5 seconds (`RATE_LIMIT_SECONDS`) between all requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
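
A simple way to get this behaviour is to remember when the last request was made and sleep for whatever remains of the interval. The sketch below assumes a single-threaded scraper; `RateLimiter` is a hypothetical class, not necessarily the one used in `main.py`.

```python
import time


class RateLimiter:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        # Record the time after sleeping so the next interval starts here.
        self._last = time.monotonic()
```

Calling `wait()` immediately before every network request (and skipping it on cache hits) keeps cached runs fast while still spacing out real requests.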

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

**Fixed.** These strings are the Dutch field labels ("Current bid" / "Location") rather than the actual values. The site uses Next.js with all data embedded in the `__NEXT_DATA__` JSON, and the scraper now extracts data from that JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`

To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
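
The field mapping listed above can be illustrated with a small dotted-path lookup that returns a default instead of raising when a key is missing. This is a sketch under assumptions: `dig` and `map_auction_fields` are hypothetical helpers, `minEndDate` is assumed to sit on the `auction` object, and location is left out because the shape of `viewingDays` / `collectionDays` is not documented here.

```python
def dig(data, path, default=""):
    """Follow a dotted path through nested dicts; return default on any miss."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current


def map_auction_fields(data):
    """Map JSON paths to output fields; data is assumed to hold an 'auction' dict."""
    return {
        "title": dig(data, "auction.name"),
        "lot_id": dig(data, "auction.displayId"),
        "image": dig(data, "auction.image.url"),
        # Assumption: minEndDate lives under auction.
        "end_date": dig(data, "auction.minEndDate"),
    }
```

An empty string for a field is then a cue to try the HTML fallback for that field only.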

### Issue: No lots found

- Check if the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting user agent or headers in `crawl_auctions()`
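
For reference, this is roughly how a custom user agent and headers are passed to Playwright's synchronous API. It is a sketch, not the body of `crawl_auctions()`: `context_options` and `fetch_html` are hypothetical helpers, the header and locale values are illustrative, and the real scraper may use the async API instead.

```python
def context_options(user_agent=None, locale="nl-NL"):
    """Build keyword arguments for Playwright's browser.new_context()."""
    options = {
        "locale": locale,
        # Illustrative header; adjust to whatever the site expects.
        "extra_http_headers": {"Accept-Language": "nl-NL,nl;q=0.9,en;q=0.8"},
    }
    if user_agent:
        options["user_agent"] = user_agent
    return options


def fetch_html(url, user_agent=None):
    """Load a page in headless Chromium and return its HTML."""
    # Imported here so context_options() is usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**context_options(user_agent))
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()
        browser.close()
        return html
```

Whether a different user agent actually helps depends on how the site's protection is configured, so treat this as something to experiment with rather than a guaranteed fix.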

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (not recommended below 0.5s)
- First run will be slower; subsequent runs use cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add extraction method in `TroostwijkScraper` class:
   ```python
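   # Note: requires `import re` at the top of main.py.
   def _extract_new_field(self, content: str) -> str:
       """Return the first capture group of the pattern, or "" if not found."""
       pattern = r'your-regex-pattern'  # must contain one capture group
       match = re.search(pattern, content)
       return match.group(1) if match else ""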
   ```

2. Add field to `_parse_lot_page()`:
   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add field to CSV export in `save_final_results()`:
   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```

### Testing Extraction Patterns

Use test mode to verify patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** Website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days