# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Install Playwright browsers:**
   ```bash
   playwright install chromium
   ```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (especially on Windows, use paths like `C:\\output\\cache.db`).
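
For example, on Windows the same settings might look like this (a sketch; adjust the drive and folder as needed):

```python
# Windows example - escaped backslashes (or raw strings) keep the paths valid
CACHE_DB = "C:\\output\\cache.db"
OUTPUT_DIR = "C:\\output"
```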

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching
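
In outline, a run looks roughly like the sketch below (the helper names are illustrative, not the exact API of `main.py`):

```python
def run(scraper, max_pages: int = 50) -> None:
    """Illustrative outline of a full run; main.py's real helpers may differ."""
    # Phase 1: walk the listing pages and gather lot URLs.
    lot_urls = []
    for page in range(1, max_pages + 1):
        lot_urls.extend(scraper.collect_lot_urls(page))  # hypothetical helper

    # Phase 2: visit each lot page (cache-aware and rate-limited).
    results = []
    for i, url in enumerate(lot_urls, start=1):
        results.append(scraper.scrape_lot(url))          # hypothetical helper
        if i % 10 == 0:
            scraper.save_progress(results)               # partial JSON checkpoint

    # Final export: JSON and CSV with a shared timestamp.
    scraper.save_final_results(results)
```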

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with default URL
python main.py --test

# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying data is being extracted correctly.

## Output Files

The scraper generates the following files:

### During Execution

- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format
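
The timestamp suffix follows the usual `strftime` pattern; a minimal sketch of producing such paths (the exact helper in `main.py` may differ):

```python
from datetime import datetime
from pathlib import Path

OUTPUT_DIR = Path("/mnt/okcomputer/output")  # matches the configured output directory

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20240131_142500
json_path = OUTPUT_DIR / f"troostwijk_lots_final_{stamp}.json"
csv_path = OUTPUT_DIR / f"troostwijk_lots_final_{stamp}.csv"
```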

### Cache

- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
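
An illustrative record (values and key names are examples only; the actual JSON/CSV column names may differ):

```python
example_lot = {
    "url": "https://www.troostwijkauctions.com/a/example-lot",
    "lot_id": "A7-35847",
    "title": "Example lot title",
    "current_bid": "EUR 150",
    "bid_count": 12,
    "end_date": "2024-06-01T18:00:00",
    "location": "Amsterdam, NL",
    "description": "Detailed description of the item.",
    "category": "Industrial machinery",
    "images": ["https://example.com/image1.jpg"],  # up to 5 image URLs
    "scraped_at": "2024-05-20T14:25:00",
}
```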

## How It Works

### Phase 1: Collect Lot URLs

The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.
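
A minimal sketch of this step, assuming lot links appear as `/a/...` hrefs in the listing HTML (the real selector logic in `main.py` may differ):

```python
import re
from typing import Set

BASE_URL = "https://www.troostwijkauctions.com"

def extract_lot_urls(listing_html: str) -> Set[str]:
    """Pull lot-page links out of one listing page's HTML."""
    # Assumption: lot pages live under /a/<slug>, as in the test-mode examples above.
    hrefs = re.findall(r'href="(/a/[^"]+)"', listing_html)
    return {BASE_URL + href for href in hrefs}
```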

### Phase 2: Scrape Individual Lots

Each lot page is visited and data is extracted from the embedded JSON payload (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast.
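
Next.js pages embed this payload in a `<script id="__NEXT_DATA__">` tag, so extraction reduces to locating that tag and parsing it. A minimal sketch (the nesting under `props.pageProps` is an assumption and may differ):

```python
import json
import re

def extract_next_data(page_html: str) -> dict:
    """Return the parsed __NEXT_DATA__ payload from a rendered page."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        page_html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}

# The lot/auction details then sit somewhere under props.pageProps
# (the exact keys are an assumption here):
# lot_node = extract_next_data(html).get("props", {}).get("pageProps", {})
```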

### Caching Strategy

- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
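
A minimal sketch of this pattern (the table and column names are illustrative, not necessarily those used in `main.py`):

```python
import sqlite3
import time
from typing import Optional

SEVEN_DAYS = 7 * 24 * 60 * 60

def open_cache(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
    )
    # Auto-clean entries older than 7 days.
    conn.execute("DELETE FROM pages WHERE fetched_at < ?", (time.time() - SEVEN_DAYS,))
    conn.commit()
    return conn

def get_cached(conn: sqlite3.Connection, url: str) -> Optional[str]:
    row = conn.execute("SELECT content FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None

def store(conn: sqlite3.Connection, url: str, content: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    conn.commit()
```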

### Rate Limiting

- Enforces at least 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
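
A minimal sketch of how this spacing is typically enforced (attribute names are illustrative):

```python
import time

class RateLimiter:
    """Ensure at least `min_interval` seconds elapse between consecutive requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_request = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

# Usage: call limiter.wait() immediately before every page fetch.
```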

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

("Huidig bod" and "Locatie" are the Dutch labels for "Current bid" and "Location".)

**✓ FIXED!** The site uses Next.js with all data embedded in the `__NEXT_DATA__` JSON. The scraper now automatically extracts data from the JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`
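
As a sketch, this mapping amounts to dictionary lookups on the parsed payload (assuming the auction/lot node has already been pulled out of `__NEXT_DATA__`; the exact nesting may differ):

```python
def map_auction_fields(node: dict) -> dict:
    """Illustrative mapping from a parsed auction/lot node to output fields."""
    auction = node.get("auction", {})
    return {
        "title": auction.get("name", ""),
        "lot_id": auction.get("displayId", ""),
        "images": [(auction.get("image") or {}).get("url", "")],
        "end_date": node.get("minEndDate", ""),
        # Location lives in the viewing/collection schedule; which sub-field
        # holds the address is an assumption not confirmed by this wiki.
        "location": node.get("viewingDays") or node.get("collectionDays") or [],
    }
```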

To verify extraction is working:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

### Issue: No lots found

- Check whether the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5 s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5 s is not recommended)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method to the `TroostwijkScraper` class:
   ```python
   def _extract_new_field(self, content: str) -> str:
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field to `_parse_lot_page()`:
   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:
   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```

### Testing Extraction Patterns

Use test mode to verify patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** Website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days