# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
```bash
pip install -r requirements.txt
```

3. **Install Playwright browsers:**
```bash
playwright install chromium
```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system. On Windows, use backslash-escaped string literals such as `C:\\output\\cache.db`.

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:

1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with the default URL
python main.py --test

# Test with a specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying that data is extracted correctly.

## Output Files

The scraper generates the following files:

### During Execution

- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints, written every 10 lots (see the sketch below)
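
A minimal sketch of how such timestamped checkpoints can be produced. The filename pattern matches the one above; the function name and the shape of `lots` are illustrative, not the scraper's actual API:

```python
import json
from datetime import datetime

def save_partial(lots: list) -> None:
    """Write a progress checkpoint once every 10 collected lots."""
    if lots and len(lots) % 10 == 0:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # YYYYMMDD_HHMMSS
        path = f"troostwijk_lots_partial_{stamp}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(lots, f, ensure_ascii=False, indent=2)
```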

### Final Output

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache

- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts the following fields (an example record follows the list):

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
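
Put together, a single record in the JSON output might look like the following, shown here as the Python dict the scraper would serialize. The key names and all values are illustrative; the actual keys are defined in `_parse_lot_page()`:

```python
# Illustrative only: every value below is made up.
lot_record = {
    "url": "https://www.troostwijkauctions.com/a/example-lot",
    "lot_id": "A7-35847",
    "title": "Example lot title",
    "current_bid": "EUR 150",
    "bid_count": 4,
    "end_date": "2025-01-01T12:00:00Z",
    "location": "Amsterdam, NL",
    "description": "Example description text",
    "category": "Industrial machinery",
    "images": ["https://example.com/image1.jpg"],
    "scraped_at": "2025-01-01T10:00:00Z",
}
```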

## How It Works

### Phase 1: Collect Lot URLs

The scraper iterates through the auction listing pages (`/auctions?page=N`) and collects all lot URLs.
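
A minimal sketch of this phase, assuming a hypothetical `fetch_page` helper that stands in for the Playwright-backed (cached, rate-limited) fetcher; the `/a/...` href pattern is inferred from the lot URLs shown elsewhere in this document:

```python
import re
from typing import Callable, List

BASE_URL = "https://www.troostwijkauctions.com"

def collect_lot_urls(fetch_page: Callable[[str], str], max_pages: int = 50) -> List[str]:
    """Walk the listing pages and gather unique lot URLs in order."""
    urls: List[str] = []
    for page in range(1, max_pages + 1):
        html = fetch_page(f"{BASE_URL}/auctions?page={page}")
        for href in re.findall(r'href="(/a/[^"]+)"', html):
            url = BASE_URL + href
            if url not in urls:  # keep first occurrence, drop duplicates
                urls.append(url)
    return urls
```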

### Phase 2: Scrape Individual Lots

Each lot page is visited and data is extracted from the embedded `__NEXT_DATA__` JSON. The site is built with Next.js and embeds all auction/lot data in this JSON structure, making extraction reliable and fast.
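
A minimal sketch of that extraction step. The `<script id="__NEXT_DATA__">` tag is the standard Next.js convention; the helper name is illustrative, not the scraper's actual API:

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ payload, or {} if it is absent."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}
```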

### Caching Strategy

- Every successfully fetched page is cached in SQLite (see the sketch after this list)
- The cache is checked before any request is made
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
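
A minimal sketch of such a cache, assuming a hypothetical single-table schema (`url`, `content`, `fetched_at`); the scraper's actual table layout may differ:

```python
import sqlite3
import time
from typing import Optional

MAX_AGE = 7 * 24 * 3600  # entries older than 7 days are purged

def open_cache(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
    )
    # Clean out stale entries on startup.
    conn.execute("DELETE FROM cache WHERE fetched_at < ?", (time.time() - MAX_AGE,))
    conn.commit()
    return conn

def get_cached(conn: sqlite3.Connection, url: str) -> Optional[str]:
    row = conn.execute("SELECT content FROM cache WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None  # None means: not cached, go fetch

def put_cached(conn: sqlite3.Connection, url: str, content: str) -> None:
    # Error pages are stored too, so failed URLs are not retried.
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    conn.commit()
```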

### Rate Limiting

- Enforces exactly 0.5 seconds between ALL requests (see the sketch after this list)
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
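
A minimal sketch of a limiter that enforces this gap; the class is illustrative, and the scraper may implement the delay differently:

```python
import time

class RateLimiter:
    """Block until at least `min_interval` seconds separate two requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `wait()` immediately before every fetch, for listing and lot pages alike, gives the behavior described above.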

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" (Dutch UI labels) instead of actual values

**✓ FIXED!** The site uses Next.js with all data embedded in the `__NEXT_DATA__` JSON. The scraper now automatically extracts data from the JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts (a navigation sketch follows this list):

- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`
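
A sketch of that navigation, based on the field sources above. The nesting under `props.pageProps` and the keys inside the viewing/collection day entries are assumptions and may differ from the real payload:

```python
def parse_lot_fields(next_data: dict) -> dict:
    """Map the embedded Next.js JSON onto the output fields."""
    auction = next_data.get("props", {}).get("pageProps", {}).get("auction", {})
    days = auction.get("viewingDays") or auction.get("collectionDays") or []
    return {
        "title": auction.get("name", ""),
        "lot_id": auction.get("displayId", ""),
        "end_date": auction.get("minEndDate", ""),
        "image": (auction.get("image") or {}).get("url", ""),
        # The key holding the place name inside a day entry is assumed here.
        "location": days[0].get("city", "") if days else "",
    }
```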

To verify extraction is working:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

### Issue: No lots found

- Check if the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()` (see the sketch below)
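
A minimal sketch of setting a custom user agent and headers with Playwright's sync API. The UA string is a placeholder, and where this hooks into `crawl_auctions()` depends on the script:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        # Substitute a mainstream desktop browser UA string here.
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
    )
    page = context.new_page()
    page.goto("https://www.troostwijkauctions.com")
    html = page.content()
    browser.close()
```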

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5s is not recommended)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method in the `TroostwijkScraper` class:
```python
def _extract_new_field(self, content: str) -> str:
    pattern = r'your-regex-pattern'
    match = re.search(pattern, content)
    return match.group(1) if match else ""
```

2. Add the field to `_parse_lot_page()`:
```python
data = {
    # ... existing fields ...
    'new_field': self._extract_new_field(content),
}
```

3. Add the field to the CSV export in `save_final_results()`:
```python
fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
```

### Testing Extraction Patterns

Use test mode to verify patterns work correctly:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** The website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days