# Troostwijk Auctions Scraper

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.

## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)

## Installation

1. **Clone or download this project**

2. **Install dependencies:**
```bash
pip install -r requirements.txt
```

3. **Install Playwright browsers:**
```bash
playwright install chromium
```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"  # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"         # Output directory
RATE_LIMIT_SECONDS = 0.5                      # Delay between requests
MAX_PAGES = 50                                # Number of listing pages to crawl
```

**Note:** Update the paths to match your system. On Windows, use backslash-escaped string literals such as `C:\\output\\cache.db`.

## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:

1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with the default URL
python main.py --test

# Test with a specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying that data is extracted correctly.

## Output Files

The scraper generates the following files:

### During Execution

- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints, written every 10 lots (see the sketch below)
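
A minimal sketch of how such timestamped checkpoints can be produced. The filename pattern matches the one above; the function name and the shape of `lots` are illustrative, not the scraper's actual API:

```python
import json
from datetime import datetime

def save_partial(lots: list) -> None:
    """Write a progress checkpoint once every 10 collected lots."""
    if lots and len(lots) % 10 == 0:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # YYYYMMDD_HHMMSS
        path = f"troostwijk_lots_partial_{stamp}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(lots, f, ensure_ascii=False, indent=2)
```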

### Final Output

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache

- `cache.db` - SQLite database with cached page content (persistent across runs)

## Data Extracted

For each auction lot, the scraper extracts the following fields (an example record follows the list):

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
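
Put together, a single record in the JSON output might look like the following, shown here as the Python dict the scraper would serialize. The key names and all values are illustrative; the actual keys are defined in `_parse_lot_page()`:

```python
# Illustrative only: every value below is made up.
lot_record = {
    "url": "https://www.troostwijkauctions.com/a/example-lot",
    "lot_id": "A7-35847",
    "title": "Example lot title",
    "current_bid": "EUR 150",
    "bid_count": 4,
    "end_date": "2025-01-01T12:00:00Z",
    "location": "Amsterdam, NL",
    "description": "Example description text",
    "category": "Industrial machinery",
    "images": ["https://example.com/image1.jpg"],
    "scraped_at": "2025-01-01T10:00:00Z",
}
```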

## How It Works

### Phase 1: Collect Lot URLs

The scraper iterates through the auction listing pages (`/auctions?page=N`) and collects all lot URLs.
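
A minimal sketch of this phase, assuming a hypothetical `fetch_page` helper that stands in for the Playwright-backed (cached, rate-limited) fetcher; the `/a/...` href pattern is inferred from the lot URLs shown elsewhere in this document:

```python
import re
from typing import Callable, List

BASE_URL = "https://www.troostwijkauctions.com"

def collect_lot_urls(fetch_page: Callable[[str], str], max_pages: int = 50) -> List[str]:
    """Walk the listing pages and gather unique lot URLs in order."""
    urls: List[str] = []
    for page in range(1, max_pages + 1):
        html = fetch_page(f"{BASE_URL}/auctions?page={page}")
        for href in re.findall(r'href="(/a/[^"]+)"', html):
            url = BASE_URL + href
            if url not in urls:  # keep first occurrence, drop duplicates
                urls.append(url)
    return urls
```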

### Phase 2: Scrape Individual Lots

Each lot page is visited and data is extracted from the embedded `__NEXT_DATA__` JSON. The site is built with Next.js and embeds all auction/lot data in this JSON structure, making extraction reliable and fast.
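
A minimal sketch of that extraction step. The `<script id="__NEXT_DATA__">` tag is the standard Next.js convention; the helper name is illustrative, not the scraper's actual API:

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ payload, or {} if it is absent."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}
```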

### Caching Strategy

- Every successfully fetched page is cached in SQLite (see the sketch after this list)
- The cache is checked before any request is made
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
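
A minimal sketch of such a cache, assuming a hypothetical single-table schema (`url`, `content`, `fetched_at`); the scraper's actual table layout may differ:

```python
import sqlite3
import time
from typing import Optional

MAX_AGE = 7 * 24 * 3600  # entries older than 7 days are purged

def open_cache(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(url TEXT PRIMARY KEY, content TEXT, fetched_at REAL)"
    )
    # Clean out stale entries on startup.
    conn.execute("DELETE FROM cache WHERE fetched_at < ?", (time.time() - MAX_AGE,))
    conn.commit()
    return conn

def get_cached(conn: sqlite3.Connection, url: str) -> Optional[str]:
    row = conn.execute("SELECT content FROM cache WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None  # None means: not cached, go fetch

def put_cached(conn: sqlite3.Connection, url: str, content: str) -> None:
    # Error pages are stored too, so failed URLs are not retried.
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, fetched_at) VALUES (?, ?, ?)",
        (url, content, time.time()),
    )
    conn.commit()
```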

### Rate Limiting

- Enforces exactly 0.5 seconds between ALL requests (see the sketch after this list)
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
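
A minimal sketch of a limiter that enforces this gap; the class is illustrative, and the scraper may implement the delay differently:

```python
import time

class RateLimiter:
    """Block until at least `min_interval` seconds separate two requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `wait()` immediately before every fetch, for listing and lot pages alike, gives the behavior described above.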

## Troubleshooting

### Issue: "Huidig bod" / "Locatie" (Dutch UI labels) instead of actual values

**✓ FIXED!** The site uses Next.js with all data embedded in the `__NEXT_DATA__` JSON. The scraper now automatically extracts data from the JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts (a navigation sketch follows this list):

- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`
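
A sketch of that navigation, based on the field sources above. The nesting under `props.pageProps` and the keys inside the viewing/collection day entries are assumptions and may differ from the real payload:

```python
def parse_lot_fields(next_data: dict) -> dict:
    """Map the embedded Next.js JSON onto the output fields."""
    auction = next_data.get("props", {}).get("pageProps", {}).get("auction", {})
    days = auction.get("viewingDays") or auction.get("collectionDays") or []
    return {
        "title": auction.get("name", ""),
        "lot_id": auction.get("displayId", ""),
        "end_date": auction.get("minEndDate", ""),
        "image": (auction.get("image") or {}).get("url", ""),
        # The key holding the place name inside a day entry is assumed here.
        "location": days[0].get("city", "") if days else "",
    }
```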

To verify extraction is working:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

### Issue: No lots found

- Check if the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting the user agent or headers in `crawl_auctions()` (see the sketch below)
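
A minimal sketch of setting a custom user agent and headers with Playwright's sync API. The UA string is a placeholder, and where this hooks into `crawl_auctions()` depends on the script:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        # Substitute a mainstream desktop browser UA string here.
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
    )
    page = context.new_page()
    page.goto("https://www.troostwijkauctions.com")
    html = page.content()
    browser.close()
```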

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (going below 0.5s is not recommended)
- The first run will be slower; subsequent runs use the cache

## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```

## Development

### Adding New Extraction Fields

1. Add an extraction method in the `TroostwijkScraper` class:
```python
def _extract_new_field(self, content: str) -> str:
    pattern = r'your-regex-pattern'
    match = re.search(pattern, content)
    return match.group(1) if match else ""
```

2. Add the field to `_parse_lot_page()`:
```python
data = {
    # ... existing fields ...
    'new_field': self._extract_new_field(content),
}
```

3. Add the field to the CSV export in `save_final_results()`:
```python
fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
```

### Testing Extraction Patterns

Use test mode to verify patterns work correctly:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```

## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** The website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days