# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every fetched page to avoid redundant requests
- **Rate limiting** - Enforces a minimum 0.5-second delay between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns against cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Install Playwright browsers:**
   ```bash
   playwright install chromium
   ```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (on Windows, use paths like `C:\\output\\cache.db`).

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with default URL
python main.py --test

# Test with a specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying that data is extracted correctly.

## Output Files

The scraper generates the following files:

### During Execution
- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache
- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection

## How It Works

### Phase 1: Collect Lot URLs
The scraper iterates through the auction listing pages (`/auctions?page=N`) and collects all lot URLs.

### Phase 2: Scrape Individual Lots
Each lot page is visited and data is extracted from the embedded JSON (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast (see the extraction sketch at the end of this section).

### Caching Strategy
- Every successfully fetched page is cached in SQLite (see the fetch sketch at the end of this section)
- The cache is checked before any request is made
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying

### Rate Limiting
- Enforces at least 0.5 seconds between all requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
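To illustrate Phase 2, here is a minimal sketch of pulling the embedded JSON out of a fetched page. The `__NEXT_DATA__` script tag is standard Next.js; the `auction` key and the field names below are assumptions inferred from the fields listed under Troubleshooting, not a verbatim copy of `main.py`:

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Pull the embedded Next.js JSON blob out of a page's HTML."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}

def parse_lot(html: str) -> dict:
    data = extract_next_data(html)
    # "props.pageProps" is where Next.js places page data; the "auction"
    # key and field names are assumptions based on this README's field list.
    auction = data.get("props", {}).get("pageProps", {}).get("auction", {})
    return {
        "lot_id": auction.get("displayId", ""),
        "title": auction.get("name", ""),
        "end_date": auction.get("minEndDate", ""),
    }
```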
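The caching and rate limiting described above combine naturally into a single fetch path: return from cache when possible, otherwise wait out the remaining delay, fetch, and store. A minimal sketch, assuming an illustrative `cache` table and a `fetch_page` callable standing in for the Playwright navigation (the actual schema in `main.py` may differ):

```python
import sqlite3
import time

class CachedFetcher:
    """Cache-first page fetcher with a minimum delay between live requests."""

    def __init__(self, db_path: str, rate_limit: float = 0.5):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
        )
        self.rate_limit = rate_limit
        self._last_request = 0.0

    def get(self, url: str, fetch_page) -> str:
        row = self.conn.execute(
            "SELECT content FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row:  # cache hit: no network request, no delay
            return row[0]
        wait = self.rate_limit - (time.monotonic() - self._last_request)
        if wait > 0:  # enforce the minimum gap between live requests
            time.sleep(wait)
        content = fetch_page(url)  # e.g. Playwright: page.goto(url); page.content()
        self._last_request = time.monotonic()
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (url, content, fetched_at) "
            "VALUES (?, ?, ?)",
            (url, content, time.time()),
        )
        self.conn.commit()
        return content
```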
## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

**✓ Fixed!** The page labels "Huidig bod" (current bid) and "Locatie" (location) were being scraped in place of the values they label. The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON, so the scraper now extracts data from that JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`

To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

### Issue: No lots found

- Check whether the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5s is not recommended)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method to the `TroostwijkScraper` class:
   ```python
   def _extract_new_field(self, content: str) -> str:
       # Uses the module-level `re` import; returns the first capture
       # group, or "" when the pattern does not match.
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field to `_parse_lot_page()`:
   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:
   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```

### Testing Extraction Patterns

Use test mode to verify that patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** Website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days (see the snippet below)
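For reference, the 7-day cleanup mentioned in the last note can be as small as the following, reusing the illustrative `cache` table from the fetch sketch in How It Works:

```python
import sqlite3
import time

MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # 7 days

def clean_cache(conn: sqlite3.Connection) -> None:
    # Delete entries whose fetched_at (a Unix timestamp) is older than 7 days.
    conn.execute(
        "DELETE FROM cache WHERE fetched_at < ?",
        (time.time() - MAX_AGE_SECONDS,),
    )
    conn.commit()
```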