# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Install Playwright browsers:**
   ```bash
   playwright install chromium
   ```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (especially on Windows, use paths like `C:\\output\\cache.db`).
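
For example, on Windows the same settings might look like this (a sketch; adjust the drive and folder as needed):

```python
# Windows example - escaped backslashes (or raw strings) keep the paths valid
CACHE_DB = "C:\\output\\cache.db"
OUTPUT_DIR = "C:\\output"
```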

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching
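
In outline, a run looks roughly like the sketch below (the helper names are illustrative, not the exact API of `main.py`):

```python
def run(scraper, max_pages: int = 50) -> None:
    """Illustrative outline of a full run; main.py's real helpers may differ."""
    # Phase 1: walk the listing pages and gather lot URLs.
    lot_urls = []
    for page in range(1, max_pages + 1):
        lot_urls.extend(scraper.collect_lot_urls(page))  # hypothetical helper

    # Phase 2: visit each lot page (cache-aware and rate-limited).
    results = []
    for i, url in enumerate(lot_urls, start=1):
        results.append(scraper.scrape_lot(url))          # hypothetical helper
        if i % 10 == 0:
            scraper.save_progress(results)               # partial JSON checkpoint

    # Final export: JSON and CSV with a shared timestamp.
    scraper.save_final_results(results)
```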

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with default URL
python main.py --test

# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying data is being extracted correctly.

## Output Files

The scraper generates the following files:

### During Execution

- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format
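
The timestamp suffix follows the usual `strftime` pattern; a minimal sketch of producing such paths (the exact helper in `main.py` may differ):

```python
from datetime import datetime
from pathlib import Path

OUTPUT_DIR = Path("/mnt/okcomputer/output")  # matches the configured output directory

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20240131_142500
json_path = OUTPUT_DIR / f"troostwijk_lots_final_{stamp}.json"
csv_path = OUTPUT_DIR / f"troostwijk_lots_final_{stamp}.csv"
```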

### Cache

- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
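
An illustrative record (values and key names are examples only; the actual JSON/CSV column names may differ):

```python
example_lot = {
    "url": "https://www.troostwijkauctions.com/a/example-lot",
    "lot_id": "A7-35847",
    "title": "Example lot title",
    "current_bid": "EUR 150",
    "bid_count": 12,
    "end_date": "2024-06-01T18:00:00",
    "location": "Amsterdam, NL",
    "description": "Detailed description of the item.",
    "category": "Industrial machinery",
    "images": ["https://example.com/image1.jpg"],  # up to 5 image URLs
    "scraped_at": "2024-05-20T14:25:00",
}
```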

## How It Works

### Phase 1: Collect Lot URLs

The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.
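
A minimal sketch of this step, assuming lot links appear as `/a/...` hrefs in the listing HTML (the real selector logic in `main.py` may differ):

```python
import re
from typing import Set

BASE_URL = "https://www.troostwijkauctions.com"

def extract_lot_urls(listing_html: str) -> Set[str]:
    """Pull lot-page links out of one listing page's HTML."""
    # Assumption: lot pages live under /a/<slug>, as in the test-mode examples above.
    hrefs = re.findall(r'href="(/a/[^"]+)"', listing_html)
    return {BASE_URL + href for href in hrefs}
```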

### Phase 2: Scrape Individual Lots

Each lot page is visited and data is extracted from the embedded JSON payload (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast.
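
Next.js pages embed this payload in a `<script id="__NEXT_DATA__">` tag, so extraction reduces to locating that tag and parsing it. A minimal sketch (the nesting under `props.pageProps` is an assumption and may differ):

```python
import json
import re

def extract_next_data(page_html: str) -> dict:
    """Return the parsed __NEXT_DATA__ payload from a rendered page."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        page_html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}

# The lot/auction details then sit somewhere under props.pageProps
# (the exact keys are an assumption here):
# lot_node = extract_next_data(html).get("props", {}).get("pageProps", {})
```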

### Caching Strategy

- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
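
A minimal sketch of this pattern (the table and column names are illustrative, not necessarily those used in `main.py`):

```python
import sqlite3
import time
from typing import Optional

SEVEN_DAYS = 7 * 24 * 60 * 60

def open_cache(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
    )
    # Auto-clean entries older than 7 days.
    conn.execute("DELETE FROM pages WHERE fetched_at < ?", (time.time() - SEVEN_DAYS,))
    conn.commit()
    return conn

def get_cached(conn: sqlite3.Connection, url: str) -> Optional[str]:
    row = conn.execute("SELECT content FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None

def store(conn: sqlite3.Connection, url: str, content: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    conn.commit()
```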

### Rate Limiting

- Enforces at least 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
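
A minimal sketch of how this spacing is typically enforced (attribute names are illustrative):

```python
import time

class RateLimiter:
    """Ensure at least `min_interval` seconds elapse between consecutive requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_request = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

# Usage: call limiter.wait() immediately before every page fetch.
```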

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

("Huidig bod" and "Locatie" are the Dutch labels for "Current bid" and "Location".)

**✓ FIXED!** The site uses Next.js with all data embedded in the `__NEXT_DATA__` JSON. The scraper now automatically extracts data from the JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`
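
As a sketch, this mapping amounts to dictionary lookups on the parsed payload (assuming the auction/lot node has already been pulled out of `__NEXT_DATA__`; the exact nesting may differ):

```python
def map_auction_fields(node: dict) -> dict:
    """Illustrative mapping from a parsed auction/lot node to output fields."""
    auction = node.get("auction", {})
    return {
        "title": auction.get("name", ""),
        "lot_id": auction.get("displayId", ""),
        "images": [(auction.get("image") or {}).get("url", "")],
        "end_date": node.get("minEndDate", ""),
        # Location lives in the viewing/collection schedule; which sub-field
        # holds the address is an assumption not confirmed by this wiki.
        "location": node.get("viewingDays") or node.get("collectionDays") or [],
    }
```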

To verify extraction is working:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

### Issue: No lots found

- Check whether the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()`

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5 s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5 s is not recommended)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method to the `TroostwijkScraper` class:
   ```python
   def _extract_new_field(self, content: str) -> str:
       pattern = r'your-regex-pattern'
       match = re.search(pattern, content)
       return match.group(1) if match else ""
   ```

2. Add the field to `_parse_lot_page()`:
   ```python
   data = {
       # ... existing fields ...
       'new_field': self._extract_new_field(content),
   }
   ```

3. Add the field to the CSV export in `save_final_results()`:
   ```python
   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
   ```

### Testing Extraction Patterns

Use test mode to verify patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** Website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days