# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Install Playwright browsers:**
   ```bash
   playwright install chromium
   ```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to the cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests (seconds)
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (especially on Windows, where paths look like `C:\\output\\cache.db`).

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns against a specific cached URL:

```bash
# Test with the default URL
python main.py --test

# Test with a specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying that data is extracted correctly.
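Both normal runs and test mode go through the same cache-first fetch path (see *How It Works* below). As a rough illustration of how the SQLite cache and the 0.5-second rate limit interact, here is a minimal sketch; the table layout, `fetched_at` column, and helper names are assumptions made for the sketch, not the actual `main.py` code:

```python
import sqlite3
import time

RATE_LIMIT_SECONDS = 0.5
_last_request = 0.0


def get_page(conn: sqlite3.Connection, url: str, fetch) -> str:
    """Return cached HTML when available; otherwise fetch politely and cache."""
    global _last_request
    # Assumed schema for the sketch; the real cache.db layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
    )
    row = conn.execute("SELECT content FROM cache WHERE url = ?", (url,)).fetchone()
    if row:
        return row[0]  # cache hit: no network traffic, no rate-limit wait

    # Rate limit: sleep out whatever remains of the 0.5 s window
    wait = RATE_LIMIT_SECONDS - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()

    content = fetch(url)  # e.g. Playwright's page.goto(url) + page.content()
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    conn.commit()
    return content
```

The point of the cache-before-rate-limit ordering is that cache hits return instantly: only real network requests pay the 0.5-second delay.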
## Output Files

The scraper generates the following files:

### During Execution
- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache
- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection

## How It Works

### Phase 1: Collect Lot URLs
The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.

### Phase 2: Scrape Individual Lots
Each lot page is visited and its data is extracted from the embedded `__NEXT_DATA__` JSON. The site is built with Next.js and ships all auction/lot data in that JSON structure, which makes extraction reliable and fast.

### Caching Strategy
- Every successfully fetched page is cached in SQLite
- The cache is checked before any request is made
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying them

### Rate Limiting
- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

**✓ FIXED!** These Dutch labels ("current bid" / "location") appeared when the scraper picked up on-page field labels instead of their values. The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON, so the scraper now extracts from the JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`

To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
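For reference, pulling the embedded JSON out of a cached page can be sketched as below. The `<script id="__NEXT_DATA__">` tag is standard Next.js, but the `props.pageProps.auction` path is an assumption based on typical Next.js layouts and the field names listed above, not a copy of the scraper's actual code; inspect the real payload in test mode before relying on it:

```python
import json
import re
from pathlib import Path


def extract_next_data(html: str) -> dict:
    """Pull the Next.js __NEXT_DATA__ payload out of a page's HTML."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}


# Hypothetical usage on a locally cached page; the key path below is an
# illustrative assumption, not the scraper's verified structure.
html = Path("cached_lot.html").read_text(encoding="utf-8")
auction = extract_next_data(html).get("props", {}).get("pageProps", {}).get("auction", {})
print(auction.get("name"), auction.get("displayId"), auction.get("minEndDate"))
```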
### Issue: No lots found

- Check whether the website structure has changed
- Verify that `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5 s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5 s is not recommended)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method to the `TroostwijkScraper` class:
   ```python
   def _extract_new_field(self, content: str) -> str:
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field to `_parse_lot_page()`:
   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:
   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```

### Testing Extraction Patterns

Use test mode to verify that patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** The website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days (see the sketch below)
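A minimal sketch of that 7-day cleanup, reusing the `cache` table layout assumed in the earlier fetch sketch (the real schema and cleanup logic in `main.py` may differ):

```python
import sqlite3
import time

MAX_AGE_SECONDS = 7 * 24 * 3600  # the 7-day retention window


def clean_cache(db_path: str) -> int:
    """Delete cache rows older than 7 days; returns how many were removed."""
    conn = sqlite3.connect(db_path)
    cutoff = time.time() - MAX_AGE_SECONDS
    removed = conn.execute(
        "DELETE FROM cache WHERE fetched_at < ?", (cutoff,)
    ).rowcount
    conn.commit()
    conn.close()
    return removed


if __name__ == "__main__":
    print(f"Removed {clean_cache('output/cache.db')} stale cache entries")
```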