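# Troostwijk Scraper - Architecture & Data Flow

## System Overview

The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website: it collects auction URLs from the listing pages, extracts lot URLs from each auction page, and then scrapes the details of each lot.
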
## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                       TROOSTWIJK SCRAPER                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: COLLECT AUCTION URLs                                  │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Listing Page │────────▶│ Extract /a/  │                      │
│  │ /auctions?   │         │ auction URLs │                      │
│  │ page=1..N    │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                       [ List of Auction URLs ]                  │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Auction Page │────────▶│ Parse        │                      │
│  │ /a/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│         │                 └──────────────┘                      │
│         │                        │                              │
│         ▼                        ▼                              │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Save Auction │         │ Extract /l/  │                      │
│  │ Metadata     │         │ lot URLs     │                      │
│  │ to DB        │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                         [ List of Lot URLs ]                    │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: SCRAPE LOT DETAILS                                    │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Lot Page     │────────▶│ Parse        │                      │
│  │ /l/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│                           └──────────────┘                      │
│                                  │                              │
│         ┌────────────────────────┴─────────────────┐            │
│         ▼                                          ▼            │
│  ┌──────────────┐                          ┌──────────────┐     │
│  │ Save Lot     │                          │ Save Images  │     │
│  │ Details      │                          │ URLs to DB   │     │
│  │ to DB        │                          └──────────────┘     │
│  └──────────────┘                                  │            │
│                                                    ▼            │
│                                           [Optional Download]   │
└─────────────────────────────────────────────────────────────────┘
```

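The URL-collection part of Phases 1 and 2 can be sketched roughly as follows. This is an illustration under assumptions, not the scraper's actual code: the names `extract_links` and `crawl`, the injected `fetch` callable, and the `example.com` base URL are hypothetical. Only the `/auctions?page=N` listing path and the `/a/` and `/l/` prefixes come from the diagram. The sketch also scans `href` attributes for lot links, whereas the diagram indicates that the scraper reads them from the `__NEXT_DATA__` JSON.

```python
import re
from typing import Callable, List
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="([^"]+)"')


def extract_links(html: str, base_url: str, prefix: str) -> List[str]:
    """Return absolute, de-duplicated URLs whose href starts with `prefix`."""
    links = [urljoin(base_url, h) for h in HREF_RE.findall(html) if h.startswith(prefix)]
    return list(dict.fromkeys(links))  # keeps first-seen order


def crawl(fetch: Callable[[str], str], base_url: str, max_pages: int) -> List[str]:
    """Run Phases 1 and 2 and return the lot URLs that Phase 3 would scrape."""
    auction_urls: List[str] = []
    for page in range(1, max_pages + 1):        # Phase 1: listing pages
        html = fetch(f"{base_url}/auctions?page={page}")
        auction_urls += extract_links(html, base_url, "/a/")

    lot_urls: List[str] = []
    for url in dict.fromkeys(auction_urls):     # Phase 2: auction pages
        lot_urls += extract_links(fetch(url), base_url, "/l/")
    return list(dict.fromkeys(lot_urls))
```

Passing `fetch` in as a parameter keeps the crawl order testable without a browser; in the real pipeline that role is played by the cached Playwright fetch described under Data Flow Details.
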
## Database Schema

```
┌──────────────────────────────────────────────────────────────────┐
│  CACHE TABLE (HTML Storage with Compression)                     │
├──────────────────────────────────────────────────────────────────┤
│  cache                                                           │
│  ├── url (TEXT, PRIMARY KEY)                                     │
│  ├── content (BLOB)                  -- Compressed HTML (zlib)   │
│  ├── timestamp (REAL)                                            │
│  ├── status_code (INTEGER)                                       │
│  └── compressed (INTEGER)            -- 1=compressed, 0=plain    │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  AUCTIONS TABLE                                                  │
├──────────────────────────────────────────────────────────────────┤
│  auctions                                                        │
│  ├── auction_id (TEXT, PRIMARY KEY)  -- e.g. "A7-39813"          │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── location (TEXT)                 -- e.g. "Cluj-Napoca, RO"   │
│  ├── lots_count (INTEGER)                                        │
│  ├── first_lot_closing_time (TEXT)                               │
│  └── scraped_at (TEXT)                                           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  LOTS TABLE                                                      │
├──────────────────────────────────────────────────────────────────┤
│  lots                                                            │
│  ├── lot_id (TEXT, PRIMARY KEY)      -- e.g. "A1-28505-5"        │
│  ├── auction_id (TEXT)               -- FK to auctions           │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── current_bid (TEXT)              -- "€123.45" or "No bids"   │
│  ├── bid_count (INTEGER)                                         │
│  ├── closing_time (TEXT)                                         │
│  ├── viewing_time (TEXT)                                         │
│  ├── pickup_date (TEXT)                                          │
│  ├── location (TEXT)                 -- e.g. "Dongen, NL"        │
│  ├── description (TEXT)                                          │
│  ├── category (TEXT)                                             │
│  └── scraped_at (TEXT)                                           │
│  FOREIGN KEY (auction_id) → auctions(auction_id)                 │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  IMAGES TABLE (Image URLs & Download Status)                     │
├──────────────────────────────────────────────────────────────────┤
│  images                         ◀── THIS TABLE HOLDS IMAGE LINKS │
│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
│  ├── lot_id (TEXT)                   -- FK to lots               │
│  ├── url (TEXT)                      -- Image URL                │
│  ├── local_path (TEXT)               -- Path after download      │
│  └── downloaded (INTEGER)            -- 0=pending, 1=downloaded  │
│  FOREIGN KEY (lot_id) → lots(lot_id)                             │
└──────────────────────────────────────────────────────────────────┘
```

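For readers who want to recreate the schema locally, the tables above translate to roughly the following DDL. This is a sketch reconstructed from the diagram, not the scraper's own setup code: the `init_db` helper, the `DEFAULT 0` on `downloaded`, and the decision to switch on foreign-key enforcement are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url TEXT PRIMARY KEY,
    content BLOB,
    timestamp REAL,
    status_code INTEGER,
    compressed INTEGER
);
CREATE TABLE IF NOT EXISTS auctions (
    auction_id TEXT PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT,
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0,  -- assumed default: 0=pending
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the four tables if they are missing."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

With enforcement enabled, inserting a lot whose `auction_id` has no matching auction raises `sqlite3.IntegrityError`, so auctions must be saved before their lots.
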
## Sequence Diagram

```
User           Scraper        Playwright       Cache DB        Data Tables
  │               │                │               │                │
  │ Run           │                │               │                │
  ├──────────────▶│                │               │                │
  │               │                │               │                │
  │               │ Phase 1: Listing Pages         │                │
  │               ├───────────────▶│               │                │
  │               │     goto()     │               │                │
  │               │◀───────────────┤               │                │
  │               │      HTML      │               │                │
  │               ├───────────────────────────────▶│                │
  │               │        compress & cache        │                │
  │               │                │               │                │
  │               │ Phase 2: Auction Pages         │                │
  │               ├───────────────▶│               │                │
  │               │◀───────────────┤               │                │
  │               │      HTML      │               │                │
  │               │                │               │                │
  │               │ Parse __NEXT_DATA__ JSON       │                │
  │               ├────────────────────────────────────────────────▶│
  │               │                │               │ INSERT auctions│
  │               │                │               │                │
  │               │ Phase 3: Lot Pages             │                │
  │               ├───────────────▶│               │                │
  │               │◀───────────────┤               │                │
  │               │      HTML      │               │                │
  │               │                │               │                │
  │               │ Parse __NEXT_DATA__ JSON       │                │
  │               ├────────────────────────────────────────────────▶│
  │               │                │               │ INSERT lots    │
  │               ├────────────────────────────────────────────────▶│
  │               │                │               │ INSERT images  │
  │               │                │               │                │
  │               │ Export to CSV/JSON             │                │
  │               │◀────────────────────────────────────────────────┤
  │               │ Query all data                 │                │
  │◀──────────────┤                │               │                │
  │ Results       │                │               │                │
```

## Data Flow Details

### 1. **Page Retrieval & Caching**

```
Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │       │
    │       ├─[HIT]──▶ Decompress (if compressed=1)
    │       │              └──▶ Return HTML
    │       │
    │       └─[MISS]─▶ Fetch via Playwright
    │                     │
    │                     ├──▶ Compress HTML (zlib level 9)
    │                     │       ~70-90% size reduction
    │                     │
    │                     └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
```

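The cache lookup and store steps above could look something like this. It is a minimal sketch under assumptions: the helper names (`open_cache`, `cache_put`, `cache_get`), the `now` parameter used to make expiry testable, and the 24-hour constant (taken from the Performance Characteristics section) are illustrative; the Playwright fetch on a miss is left out.

```python
import sqlite3
import time
import zlib
from typing import Optional

CACHE_TTL_SECONDS = 24 * 3600  # 24-hour validity, as stated in this document


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, content BLOB, "
        "timestamp REAL, status_code INTEGER, compressed INTEGER)"
    )
    return conn


def cache_put(conn, url: str, html: str, status_code: int = 200,
              now: Optional[float] = None) -> None:
    """Compress the page with zlib level 9 and store it with compressed=1."""
    blob = zlib.compress(html.encode("utf-8"), 9)
    stamp = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, 1)",
        (url, blob, stamp, status_code),
    )
    conn.commit()


def cache_get(conn, url: str, now: Optional[float] = None) -> Optional[str]:
    """Return cached HTML, or None on a miss or an expired entry."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None                                   # MISS: never fetched
    content, stamp, compressed = row
    if (time.time() if now is None else now) - stamp > CACHE_TTL_SECONDS:
        return None                                   # MISS: entry too old
    if compressed:
        return zlib.decompress(content).decode("utf-8")
    return content.decode("utf-8") if isinstance(content, bytes) else content
```

A caller would try `cache_get` first and only fall through to the browser fetch followed by `cache_put` when it returns `None`.
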
### 2. **JSON Parsing Strategy**

```
HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
            │
            ├──▶ Parse JSON
            │       │
            │       ├─[has pageProps.lot]──▶ Individual LOT
            │       │       └──▶ Extract: title, bid, location, images, etc.
            │       │
            │       └─[has pageProps.auction]──▶ AUCTION
            │               │
            │               ├─[has lots[] array]──▶ Auction with lots
            │               │       └──▶ Extract: title, location, lots_count
            │               │
            │               └─[no lots[] array]──▶ Old format lot
            │                       └──▶ Parse as lot
            │
            └──▶ Fallback to HTML regex parsing (if JSON fails)
```

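The branching above can be expressed as a small classifier. This is a hedged sketch rather than the scraper's parser: the function name `classify_page` and its return labels are invented for illustration, and it assumes the usual Next.js layout in which `pageProps` sits under a top-level `props` key and the auction's lot array is stored under a `lots` key.

```python
import json
import re
from typing import Optional, Tuple

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def classify_page(html: str) -> Tuple[str, Optional[dict]]:
    """Return ("lot" | "auction" | "fallback", payload) for one page."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return "fallback", None          # no embedded JSON: use regex parsing
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return "fallback", None          # malformed JSON: use regex parsing
    if not isinstance(data, dict):
        return "fallback", None

    page_props = (data.get("props") or {}).get("pageProps") or {}
    if isinstance(page_props.get("lot"), dict):
        return "lot", page_props["lot"]
    auction = page_props.get("auction")
    if isinstance(auction, dict):
        if isinstance(auction.get("lots"), list):
            return "auction", auction    # auction page with a lots[] array
        return "lot", auction            # old-format lot stored under "auction"
    return "fallback", None
```

Returning a label alongside the payload lets the caller decide which table to write to, and the `"fallback"` label marks the pages that need the HTML regex path.
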
### 3. **Image Handling**

```
Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │       │
    │       └──▶ INSERT INTO images (lot_id, url, downloaded=0)
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
            │
            ├──▶ Download each image
            │       │
            │       ├──▶ Save to: /images/{lot_id}/001.jpg
            │       │
            │       └──▶ UPDATE images SET local_path=?, downloaded=1
            │
            └──▶ Rate limit between downloads (0.5s)
```

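The database side of this flow might be sketched as below. The helper names are hypothetical, the actual download is omitted, and the 1-based, zero-padded `.jpg` naming is inferred from the `001.jpg` example rather than confirmed by the source.

```python
import sqlite3
from pathlib import PurePosixPath
from typing import Iterable


def record_images(conn: sqlite3.Connection, lot_id: str, urls: Iterable[str]) -> None:
    """Store every image URL for a lot as pending (downloaded=0)."""
    conn.executemany(
        "INSERT INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
        [(lot_id, url) for url in urls],
    )
    conn.commit()


def local_image_path(images_dir: str, lot_id: str, index: int) -> str:
    """Build {images_dir}/{lot_id}/NNN.jpg, where index starts at 1."""
    return str(PurePosixPath(images_dir) / lot_id / f"{index:03d}.jpg")


def mark_downloaded(conn: sqlite3.Connection, image_id: int, local_path: str) -> None:
    """Record where an image was saved and flip its status to downloaded=1."""
    conn.execute(
        "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
        (local_path, image_id),
    )
    conn.commit()
```

Because URLs are recorded before any download happens, a run with `DOWNLOAD_IMAGES=False` still leaves a complete list of pending images for a later pass.
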
## Key Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests (seconds) |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |

## Output Files

```
/mnt/okcomputer/output/
├── cache.db                      # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json     # Exported auctions
├── auctions_{timestamp}.csv      # Exported auctions
├── lots_{timestamp}.json         # Exported lots
├── lots_{timestamp}.csv          # Exported lots
└── images/                       # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
```

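An export step that produces the `{table}_{timestamp}.json` and `.csv` pairs listed above could be written along these lines. The function name `export_table` and the `YYYYMMDD_HHMMSS` timestamp format are assumptions; the source does not say how the timestamp is formatted.

```python
import csv
import json
import sqlite3
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def export_table(conn: sqlite3.Connection, table: str, out_dir: str,
                 timestamp: Optional[str] = None) -> Tuple[Path, Path]:
    """Write one table to {table}_{timestamp}.json and .csv; return both paths."""
    if table not in {"auctions", "lots"}:      # whitelist: name is interpolated below
        raise ValueError(f"unexpected table: {table}")
    stamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")

    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, values)) for values in cursor.fetchall()]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{table}_{stamp}.json"
    csv_path = out / f"{table}_{stamp}.csv"

    json_path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return json_path, csv_path
```
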
## Extension Points for Integration

### 1. **Downstream Processing Pipeline**

```sql
-- Find images that have not been downloaded yet
SELECT id, lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```

### 2. **Real-time Monitoring**

```sql
-- Check for new lots every N minutes
-- (the text comparison assumes scraped_at is stored as UTC 'YYYY-MM-DD HH:MM:SS')
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```

### 3. **Analytics & Reporting**

```sql
-- Top locations
SELECT location, COUNT(*) AS lot_count
FROM lots
GROUP BY location
ORDER BY lot_count DESC;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```

### 4. **Image Processing Integration**

```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Select downloaded images (with lot context) for batch processing
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```

## Performance Characteristics

- **Compression**: ~70-90% HTML size reduction (e.g. 1 GB → ~100-300 MB)
- **Rate Limiting**: 0.5 s delay between requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: Up to ~7,200 pages/hour (3,600 s ÷ 0.5 s; an upper bound that ignores page load time)
- **Scalability**: SQLite copes well with millions of rows for a single-writer workload like this one

## Error Handling

- **Network failures**: Cached with `status_code=500`; retried after the cache entry expires
- **Parse failures**: Fall back to HTML regex patterns
- **Compression errors**: Uncompressed legacy data is detected automatically and read as plain HTML
- **Missing fields**: Default to "No bids", an empty string, or 0

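Two of these behaviours, the legacy-data fallback and the missing-field defaults, can be illustrated as follows. This is a sketch under assumptions: `read_cached_content`, `with_defaults`, and the `LOT_DEFAULTS` mapping are hypothetical names, and which field receives which default is inferred from the schema comments (for example `current_bid` showing "No bids") rather than stated by the source.

```python
import zlib
from typing import Union

# Assumed mapping of lot fields to the defaults named in this section.
LOT_DEFAULTS = {
    "title": "",
    "current_bid": "No bids",
    "bid_count": 0,
    "location": "",
    "description": "",
    "category": "",
}


def read_cached_content(content: Union[bytes, str], compressed: int) -> str:
    """Decode a cache row, tolerating legacy rows that were stored uncompressed."""
    raw = content.encode("utf-8") if isinstance(content, str) else bytes(content)
    if compressed:
        try:
            return zlib.decompress(raw).decode("utf-8")
        except zlib.error:
            pass  # flagged as compressed but actually plain: fall through
    return raw.decode("utf-8")


def with_defaults(raw: dict) -> dict:
    """Fill absent or None lot fields with their defaults."""
    return {
        key: raw[key] if raw.get(key) is not None else default
        for key, default in LOT_DEFAULTS.items()
    }
```
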
## Rate Limiting & Ethics

- **REQUIRED**: 0.5-second delay between ALL requests
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as a standard browser
- **No parallelization**: Single-threaded, sequential crawling

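One way to enforce the delay in a single-threaded crawler is a small limiter that is called before every request. The class below is an illustrative sketch, not the scraper's code: the `RateLimiter` name and the injectable `clock` and `sleep` parameters are assumptions, and it sleeps only for the part of the interval that has not already elapsed, which may differ from a fixed `sleep(0.5)` after each request.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Guarantee at least `min_interval` seconds between consecutive calls."""

    def __init__(self, min_interval: float = 0.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the interval has passed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `limiter.wait()` immediately before each page or image request caps the crawl at 3,600 / 0.5 = 7,200 requests per hour, and cache hits can skip the call entirely because they generate no traffic.
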