# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Install Playwright browsers:**
   ```bash
   playwright install chromium
   ```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to the cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests (seconds)
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (especially on Windows, where paths look like `C:\\output\\cache.db`).

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns against a specific cached URL:

```bash
# Test with the default URL
python main.py --test

# Test with a specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying that data is extracted correctly.
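Both normal runs and test mode go through the same cache-first fetch path (see *How It Works* below). As a rough illustration of how the SQLite cache and the 0.5-second rate limit interact, here is a minimal sketch; the table layout, `fetched_at` column, and helper names are assumptions made for the sketch, not the actual `main.py` code:

```python
import sqlite3
import time

RATE_LIMIT_SECONDS = 0.5
_last_request = 0.0


def get_page(conn: sqlite3.Connection, url: str, fetch) -> str:
    """Return cached HTML when available; otherwise fetch politely and cache."""
    global _last_request
    # Assumed schema for the sketch; the real cache.db layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
    )
    row = conn.execute("SELECT content FROM cache WHERE url = ?", (url,)).fetchone()
    if row:
        return row[0]  # cache hit: no network traffic, no rate-limit wait

    # Rate limit: sleep out whatever remains of the 0.5 s window
    wait = RATE_LIMIT_SECONDS - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()

    content = fetch(url)  # e.g. Playwright's page.goto(url) + page.content()
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    conn.commit()
    return content
```

The point of the cache-before-rate-limit ordering is that cache hits return instantly: only real network requests pay the 0.5-second delay.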
## Output Files

The scraper generates the following files:

### During Execution
- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache
- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection

## How It Works

### Phase 1: Collect Lot URLs
The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.

### Phase 2: Scrape Individual Lots
Each lot page is visited and its data is extracted from the embedded `__NEXT_DATA__` JSON. The site is built with Next.js and ships all auction/lot data in that JSON structure, which makes extraction reliable and fast.

### Caching Strategy
- Every successfully fetched page is cached in SQLite
- The cache is checked before any request is made
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying them

### Rate Limiting
- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

**✓ FIXED!** These Dutch labels ("current bid" / "location") appeared when the scraper picked up on-page field labels instead of their values. The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON, so the scraper now extracts from the JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`

To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
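For reference, pulling the embedded JSON out of a cached page can be sketched as below. The `<script id="__NEXT_DATA__">` tag is standard Next.js, but the `props.pageProps.auction` path is an assumption based on typical Next.js layouts and the field names listed above, not a copy of the scraper's actual code; inspect the real payload in test mode before relying on it:

```python
import json
import re
from pathlib import Path


def extract_next_data(html: str) -> dict:
    """Pull the Next.js __NEXT_DATA__ payload out of a page's HTML."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}


# Hypothetical usage on a locally cached page; the key path below is an
# illustrative assumption, not the scraper's verified structure.
html = Path("cached_lot.html").read_text(encoding="utf-8")
auction = extract_next_data(html).get("props", {}).get("pageProps", {}).get("auction", {})
print(auction.get("name"), auction.get("displayId"), auction.get("minEndDate"))
```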
### Issue: No lots found

- Check whether the website structure has changed
- Verify that `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5 s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5 s is not recommended)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method to the `TroostwijkScraper` class:
   ```python
   def _extract_new_field(self, content: str) -> str:
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field to `_parse_lot_page()`:
   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:
   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```

### Testing Extraction Patterns

Use test mode to verify that patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** The website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days (see the sketch below)
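A minimal sketch of that 7-day cleanup, reusing the `cache` table layout assumed in the earlier fetch sketch (the real schema and cleanup logic in `main.py` may differ):

```python
import sqlite3
import time

MAX_AGE_SECONDS = 7 * 24 * 3600  # the 7-day retention window


def clean_cache(db_path: str) -> int:
    """Delete cache rows older than 7 days; returns how many were removed."""
    conn = sqlite3.connect(db_path)
    cutoff = time.time() - MAX_AGE_SECONDS
    removed = conn.execute(
        "DELETE FROM cache WHERE fetched_at < ?", (cutoff,)
    ).rowcount
    conn.commit()
    conn.close()
    return removed


if __name__ == "__main__":
    print(f"Removed {clean_cache('output/cache.db')} stale cache entries")
```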