Scaev - Architecture & Data Flow
System Overview
The scraper follows a three-phase hierarchical crawling pattern to extract auction and lot data from the Troostwijk Auctions website.
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│ TROOSTWIJK SCRAPER │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: COLLECT AUCTION URLs │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Listing Page │────────▶│ Extract /a/ │ │
│ │ /auctions? │ │ auction URLs │ │
│ │ page=1..N │ └──────────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ [ List of Auction URLs ] │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: EXTRACT LOT URLs FROM AUCTIONS │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Auction Page │────────▶│ Parse │ │
│ │ /a/... │ │ __NEXT_DATA__│ │
│ └──────────────┘ │ JSON │ │
│ │ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Save Auction │ │ Extract /l/ │ │
│ │ Metadata │ │ lot URLs │ │
│ │ to DB │ └──────────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ [ List of Lot URLs ] │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: SCRAPE LOT DETAILS + API ENRICHMENT │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Lot Page │────────▶│ Parse │ │
│ │ /l/... │ │ __NEXT_DATA__│ │
│ └──────────────┘ │ JSON │ │
│ └──────────────┘ │
│ │ │
│ ┌─────────────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ GraphQL API │ │ Bid History │ │ Save Images │ │
│ │ (Bidding + │ │ REST API │ │ URLs to DB │ │
│ │ Enrichment) │ │ (per lot) │ └──────────────┘ │
│ └──────────────┘ └──────────────┘ │ │
│ │ │ ▼ │
│ └──────────┬────────────┘ [Optional Download │
│ ▼ Concurrent per Lot] │
│ ┌──────────────┐ │
│ │ Save to DB: │ │
│ │ - Lot data │ │
│ │ - Bid data │ │
│ │ - Enrichment │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Database Schema
┌──────────────────────────────────────────────────────────────────┐
│ CACHE TABLE (HTML Storage with Compression) │
├──────────────────────────────────────────────────────────────────┤
│ cache │
│ ├── url (TEXT, PRIMARY KEY) │
│ ├── content (BLOB) -- Compressed HTML (zlib) │
│ ├── timestamp (REAL) │
│ ├── status_code (INTEGER) │
│ └── compressed (INTEGER) -- 1=compressed, 0=plain │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ AUCTIONS TABLE │
├──────────────────────────────────────────────────────────────────┤
│ auctions │
│ ├── auction_id (TEXT, PRIMARY KEY) -- e.g. "A7-39813" │
│ ├── url (TEXT, UNIQUE) │
│ ├── title (TEXT) │
│ ├── location (TEXT) -- e.g. "Cluj-Napoca, RO" │
│ ├── lots_count (INTEGER) │
│ ├── first_lot_closing_time (TEXT) │
│ └── scraped_at (TEXT) │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ LOTS TABLE (Core + Enriched Intelligence) │
├──────────────────────────────────────────────────────────────────┤
│ lots │
│ ├── lot_id (TEXT, PRIMARY KEY) -- e.g. "A1-28505-5" │
│ ├── auction_id (TEXT) -- FK to auctions │
│ ├── url (TEXT, UNIQUE) │
│ ├── title (TEXT) │
│ │ │
│ ├─ BIDDING DATA (GraphQL API) ──────────────────────────────────┤
│ ├── current_bid (TEXT) -- Current bid amount │
│ ├── starting_bid (TEXT) -- Initial/opening bid │
│ ├── minimum_bid (TEXT) -- Next minimum bid │
│ ├── bid_count (INTEGER) -- Number of bids │
│ ├── bid_increment (REAL) -- Bid step size │
│ ├── closing_time (TEXT) -- Lot end date │
│ ├── status (TEXT) -- Minimum bid status │
│ │ │
│ ├─ BID INTELLIGENCE (Calculated from bid_history) ──────────────┤
│ ├── first_bid_time (TEXT) -- First bid timestamp │
│ ├── last_bid_time (TEXT) -- Latest bid timestamp │
│ ├── bid_velocity (REAL) -- Bids per hour │
│ │ │
│ ├─ VALUATION & ATTRIBUTES (from __NEXT_DATA__) ─────────────────┤
│ ├── brand (TEXT) -- Brand from attributes │
│ ├── model (TEXT) -- Model from attributes │
│ ├── manufacturer (TEXT) -- Manufacturer name │
│ ├── year_manufactured (INTEGER) -- Year extracted │
│ ├── condition_score (REAL) -- 0-10 condition rating │
│ ├── condition_description (TEXT) -- Condition text │
│ ├── serial_number (TEXT) -- Serial/VIN number │
│ ├── damage_description (TEXT) -- Damage notes │
│ ├── attributes_json (TEXT) -- Full attributes JSON │
│ │ │
│ ├─ LEGACY/OTHER ─────────────────────────────────────────────────┤
│ ├── viewing_time (TEXT) -- Viewing schedule │
│ ├── pickup_date (TEXT) -- Pickup schedule │
│ ├── location (TEXT) -- e.g. "Dongen, NL" │
│ ├── description (TEXT) -- Lot description │
│ ├── category (TEXT) -- Lot category │
│ ├── sale_id (INTEGER) -- Legacy field │
│ ├── type (TEXT) -- Legacy field │
│ ├── year (INTEGER) -- Legacy field │
│ ├── currency (TEXT) -- Currency code │
│ ├── closing_notified (INTEGER) -- Notification flag │
│ └── scraped_at (TEXT) -- Scrape timestamp │
│ FOREIGN KEY (auction_id) → auctions(auction_id) │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ IMAGES TABLE (Image URLs & Download Status) │
├──────────────────────────────────────────────────────────────────┤
│ images ◀── THIS TABLE HOLDS IMAGE LINKS│
│ ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT) │
│ ├── lot_id (TEXT) -- FK to lots │
│ ├── url (TEXT) -- Image URL │
│ ├── local_path (TEXT) -- Path after download │
│ └── downloaded (INTEGER) -- 0=pending, 1=downloaded │
│ FOREIGN KEY (lot_id) → lots(lot_id) │
│ UNIQUE INDEX idx_unique_lot_url ON (lot_id, url) │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ BID_HISTORY TABLE (Complete Bid Tracking for Intelligence) │
├──────────────────────────────────────────────────────────────────┤
│ bid_history ◀── REST API: /bidding-history │
│ ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT) │
│ ├── lot_id (TEXT) -- FK to lots │
│ ├── bid_amount (REAL) -- Bid in EUR │
│ ├── bid_time (TEXT) -- ISO 8601 timestamp │
│ ├── is_autobid (INTEGER) -- 0=manual, 1=autobid │
│ ├── bidder_id (TEXT) -- Anonymized bidder UUID │
│ ├── bidder_number (INTEGER) -- Bidder display number │
│ └── created_at (TEXT) -- Record creation timestamp │
│ FOREIGN KEY (lot_id) → lots(lot_id) │
│ INDEX idx_bid_history_lot ON (lot_id) │
│ INDEX idx_bid_history_time ON (bid_time) │
└──────────────────────────────────────────────────────────────────┘
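The table diagrams above can be turned into DDL directly. The sketch below shows the `bid_history` table as implied by the diagram; column names and indexes follow the diagram, while exact types and constraints are assumptions.

```python
import sqlite3

# DDL implied by the bid_history diagram above; types/constraints assumed.
DDL = """
CREATE TABLE IF NOT EXISTS bid_history (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id        TEXT,
    bid_amount    REAL,      -- Bid in EUR
    bid_time      TEXT,      -- ISO 8601 timestamp
    is_autobid    INTEGER,   -- 0=manual, 1=autobid
    bidder_id     TEXT,      -- Anonymized bidder UUID
    bidder_number INTEGER,   -- Bidder display number
    created_at    TEXT,      -- Record creation timestamp
    FOREIGN KEY (lot_id) REFERENCES lots (lot_id)
);
CREATE INDEX IF NOT EXISTS idx_bid_history_lot  ON bid_history (lot_id);
CREATE INDEX IF NOT EXISTS idx_bid_history_time ON bid_history (bid_time);
"""

def init_bid_history(conn: sqlite3.Connection) -> None:
    # executescript runs all three statements in one call
    conn.executescript(DDL)
```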
Sequence Diagram
User Scraper Playwright Cache DB Data Tables
│ │ │ │ │
│ Run │ │ │ │
├──────────────▶│ │ │ │
│ │ │ │ │
│ │ Phase 1: Listing Pages │ │
│ ├───────────────▶│ │ │
│ │ goto() │ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ ├───────────────────────────────▶│ │
│ │ compress & cache │ │
│ │ │ │ │
│ │ Phase 2: Auction Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT auctions
│ │ │ │ │
│ │ Phase 3: Lot Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT lots │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT images│
│ │ │ │ │
│ │ Export to CSV/JSON │ │
│ │◀────────────────────────────────────────────────┤
│ │ Query all data │ │
│◀──────────────┤ │ │ │
│ Results │ │ │ │
Data Flow Details
1. Page Retrieval & Caching
Request URL
│
├──▶ Check cache DB (with timestamp validation)
│ │
│ ├─[HIT]──▶ Decompress (if compressed=1)
│ │ └──▶ Return HTML
│ │
│ └─[MISS]─▶ Fetch via Playwright
│ │
│ ├──▶ Compress HTML (zlib level 9)
│ │ ~70-90% size reduction
│ │
│ └──▶ Store in cache DB (compressed=1)
│
└──▶ Return HTML for parsing
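The cache check and store steps above can be sketched as two helpers over the `cache` table. This is a minimal illustration: the function names `cache_put`/`cache_get` are hypothetical, but the schema, zlib level 9, and the 24-hour validity come from this document.

```python
import sqlite3
import time
import zlib

CACHE_TTL = 24 * 3600  # 24-hour default cache validity

def cache_put(conn: sqlite3.Connection, url: str, html: str, status: int = 200) -> None:
    # zlib level 9 gives the ~70-90% size reduction noted below
    blob = zlib.compress(html.encode("utf-8"), 9)
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), status),
    )

def cache_get(conn: sqlite3.Connection, url: str):
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?",
        (url,),
    ).fetchone()
    if row is None or time.time() - row[1] > CACHE_TTL:
        return None  # miss or stale -> caller fetches via Playwright
    content, _, compressed = row
    # Legacy rows may be stored uncompressed (compressed=0)
    return zlib.decompress(content).decode("utf-8") if compressed else content
```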
2. JSON Parsing Strategy
HTML Content
│
└──▶ Extract <script id="__NEXT_DATA__">
│
├──▶ Parse JSON
│ │
│ ├─[has pageProps.lot]──▶ Individual LOT
│ │ └──▶ Extract: title, bid, location, images, etc.
│ │
│ └─[has pageProps.auction]──▶ AUCTION
│ │
│ ├─[has lots[] array]──▶ Auction with lots
│ │ └──▶ Extract: title, location, lots_count
│ │
│ └─[no lots[] array]──▶ Old format lot
│ └──▶ Parse as lot
│
└──▶ Fallback to HTML regex parsing (if JSON fails)
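The branching above can be sketched as a small classifier over the `__NEXT_DATA__` payload. The helper name `classify_next_data` is illustrative; the decision tree (lot vs. auction vs. old-format lot, with regex fallback on failure) mirrors the diagram.

```python
import json
import re

# Pull the embedded Next.js JSON payload out of the page HTML
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

def classify_next_data(html: str):
    """Return (kind, payload): kind is "lot", "auction", "old_format_lot", or None."""
    m = NEXT_DATA_RE.search(html)
    if not m:
        return None, None  # caller falls back to HTML regex parsing
    try:
        props = json.loads(m.group(1)).get("props", {}).get("pageProps", {})
    except json.JSONDecodeError:
        return None, None
    if "lot" in props:
        return "lot", props["lot"]
    if "auction" in props:
        auction = props["auction"]
        # Auctions with a lots[] array are the current format
        kind = "auction" if auction.get("lots") else "old_format_lot"
        return kind, auction
    return None, None
```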
3. API Enrichment Flow
Lot Page Scraped (__NEXT_DATA__ parsed)
│
├──▶ Extract lot UUID from JSON
│
├──▶ GraphQL API Call (fetch_lot_bidding_data)
│ └──▶ Returns: current_bid, starting_bid, minimum_bid,
│ bid_count, closing_time, status, bid_increment
│
├──▶ [If bid_count > 0] REST API Call (fetch_bid_history)
│ │
│ ├──▶ Fetch all bid pages (paginated)
│ │
│ └──▶ Returns: Complete bid history with timestamps,
│ bidder_ids, autobid flags, amounts
│ │
│ ├──▶ INSERT INTO bid_history (multiple records)
│ │
│ └──▶ Calculate bid intelligence:
│ - first_bid_time (earliest timestamp)
│ - last_bid_time (latest timestamp)
│ - bid_velocity (bids per hour)
│
├──▶ Extract enrichment from __NEXT_DATA__:
│ - Brand, model, manufacturer (from attributes)
│ - Year (regex from title/attributes)
│ - Condition (map to 0-10 score)
│ - Serial number, damage description
│
└──▶ INSERT/UPDATE lots table with all data
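The bid-intelligence step can be sketched as a pure function over the stored `bid_time` strings. The single-bid convention below is an assumption; first/last bid time and bids-per-hour are the metrics named above.

```python
from datetime import datetime

def bid_intelligence(bid_times):
    """Derive first_bid_time, last_bid_time, and bid_velocity (bids/hour).

    `bid_times` is a list of ISO 8601 strings as stored in bid_history.bid_time.
    (Note: trailing "Z" suffixes need stripping on Python < 3.11.)
    """
    if not bid_times:
        return None
    times = sorted(datetime.fromisoformat(t) for t in bid_times)
    first, last = times[0], times[-1]
    span_hours = (last - first).total_seconds() / 3600
    # A single bid has zero span; report the bid count itself (assumed convention)
    velocity = len(times) if span_hours == 0 else len(times) / span_hours
    return {
        "first_bid_time": first.isoformat(),
        "last_bid_time": last.isoformat(),
        "bid_velocity": velocity,
    }
```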
4. Image Handling (Concurrent per Lot)
Lot Page Parsed
│
├──▶ Extract images[] from JSON
│ │
│ └──▶ INSERT OR IGNORE INTO images (lot_id, url, downloaded=0)
│ └──▶ Unique constraint prevents duplicates
│
└──▶ [If DOWNLOAD_IMAGES=True]
│
├──▶ Create concurrent download tasks (asyncio.gather)
│ │
│ ├──▶ All images for lot download in parallel
│ │ (No rate limiting between images in same lot)
│ │
│ ├──▶ Save to: /images/{lot_id}/001.jpg
│ │
│ └──▶ UPDATE images SET local_path=?, downloaded=1
│
└──▶ Rate limit only between lots (0.5s)
(Not between images within a lot)
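The per-lot concurrent download can be sketched with `asyncio.gather`. Here `fetch` is a stand-in for the real HTTP client (any async callable `url -> bytes`); the `001.jpg` naming follows the path scheme shown in Output Files below.

```python
import asyncio
from pathlib import Path

async def download_lot_images(lot_id, urls, fetch, images_dir="images"):
    """Download all images for one lot in parallel; returns saved paths.

    No rate limiting between images of the same lot; the 0.5s delay
    applies only between lots, in the caller.
    """
    async def one(index: int, url: str) -> str:
        data = await fetch(url)  # stand-in for the real downloader
        path = Path(images_dir) / lot_id / f"{index:03d}.jpg"  # e.g. .../A1-28505-5/001.jpg
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return str(path)

    # gather preserves input order, so 001.jpg maps to the first URL
    return await asyncio.gather(*(one(i, u) for i, u in enumerate(urls, 1)))
```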
Key Configuration
| Setting | Value | Purpose |
|---|---|---|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
Output Files
/mnt/okcomputer/output/
├── cache.db # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json # Exported auctions
├── auctions_{timestamp}.csv # Exported auctions
├── lots_{timestamp}.json # Exported lots
├── lots_{timestamp}.csv # Exported lots
└── images/ # Downloaded images (if enabled)
├── A1-28505-5/
│ ├── 001.jpg
│ └── 002.jpg
└── A1-28505-6/
└── 001.jpg
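The timestamped export files above can be produced with a small helper like this sketch. The function name and the timestamp format are assumptions; the `lots_{timestamp}.json` / `.csv` naming follows the tree.

```python
import csv
import json
import time
from pathlib import Path

def export_lots(rows, out_dir="/mnt/okcomputer/output"):
    """Write lots_{timestamp}.json and lots_{timestamp}.csv from row dicts."""
    ts = time.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    out = Path(out_dir)
    json_path = out / f"lots_{ts}.json"
    csv_path = out / f"lots_{ts}.csv"
    json_path.write_text(json.dumps(rows, indent=2, ensure_ascii=False))
    if rows:
        with csv_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
    return json_path, csv_path
```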
Extension Points for Integration
1. Downstream Processing Pipeline
-- Query lots without downloaded images
SELECT lot_id, url FROM images WHERE downloaded = 0;
-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
2. Real-time Monitoring
-- Check for new lots every N minutes
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');
-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
3. Analytics & Reporting
-- Top locations
SELECT location, COUNT(*) as lots_count FROM lots GROUP BY location;
-- Auction statistics
SELECT
a.auction_id,
a.title,
COUNT(l.lot_id) as actual_lots,
SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) as lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
4. Image Processing Integration
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';
-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
Performance Characteristics
- Compression: ~70-90% HTML size reduction (1GB → ~100-300MB)
- Rate Limiting: Exactly 0.5s between requests (respectful scraping)
- Caching: 24-hour default cache validity (configurable)
- Throughput: up to ~7,200 pages/hour with the 0.5s rate limit (the ceiling; actual fetch time lowers it)
- Scalability: SQLite handles millions of rows efficiently
Error Handling
- Network failures: Cached as status_code=500, retry after cache expiry
- Parse failures: Falls back to HTML regex patterns
- Compression errors: Auto-detects and handles uncompressed legacy data
- Missing fields: Defaults to "No bids", empty string, or 0
Rate Limiting & Ethics
- REQUIRED: 0.5 second delay between page requests (not between images)
- Respects cache: Avoids unnecessary re-fetching
- User-Agent: Identifies as standard browser
- No parallelization: Single-threaded sequential crawling for pages
- Image downloads: Concurrent within each lot (16x speedup)
API Integration Architecture
GraphQL API
Endpoint: https://storefront.tbauctions.com/storefront/graphql
Purpose: Real-time bidding data and lot enrichment
Key Query:
query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
lot {
currentBidAmount { cents currency }
initialAmount { cents currency }
nextMinimalBid { cents currency }
nextBidStepInCents
bidsCount
followersCount # Available - Watch count
startDate
endDate
minimumBidAmountMet
biddingStatus
condition
location { city countryCode }
categoryInformation { name path }
attributes { name value }
}
estimatedFullPrice { # Available - Estimated value
min { cents currency }
max { cents currency }
}
}
}
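Calling this endpoint can be sketched with the standard library as below. The variable values (`"en"`, `"TROOSTWIJK"`) and the cents-formatting helper are assumptions, not confirmed API details; the endpoint URL and query variables come from this document.

```python
import json
import urllib.request

GRAPHQL_URL = "https://storefront.tbauctions.com/storefront/graphql"

def build_payload(query: str, lot_display_id: str) -> dict:
    """Assemble the GraphQL request body for LotBiddingData."""
    return {
        "query": query,
        "variables": {
            "lotDisplayId": lot_display_id,
            "locale": "en",            # assumed locale value
            "platform": "TROOSTWIJK",  # assumed Platform enum value
        },
    }

def cents_to_amount(money: dict) -> str:
    # {"cents": 370000, "currency": "EUR"} -> "3700.00 EUR"
    return f"{money['cents'] / 100:.2f} {money['currency']}"

def fetch_lot_bidding_data(query: str, lot_display_id: str) -> dict:
    req = urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(build_payload(query, lot_display_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```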
Currently Captured:
- ✅ Current bid, starting bid, minimum bid
- ✅ Bid count and bid increment
- ✅ Closing time and status
- ✅ Brand, model, manufacturer (from attributes)
Available but Not Yet Captured:
- ⚠️ `followersCount` - Watch count for popularity analysis
- ⚠️ `estimatedFullPrice` - Min/max estimated values
- ⚠️ `biddingStatus` - More detailed status enum
- ⚠️ `condition` - Direct condition field
- ⚠️ `location` - City, country details
- ⚠️ `categoryInformation` - Structured category
REST API - Bid History
Endpoint: https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history
Purpose: Complete bid history for intelligence analysis
Parameters:
- `pageNumber` (starts at 1)
- `pageSize` (default: 100)
Response Example:
{
"results": [
{
"buyerId": "uuid", // Anonymized bidder ID
"buyerNumber": 4, // Display number
"currentBid": {
"cents": 370000,
"currency": "EUR"
},
"autoBid": false, // Is autobid
"negotiated": false, // Was negotiated
"createdAt": "2025-12-05T04:53:56.763033Z"
}
],
"hasNext": true,
"pageNumber": 1
}
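Following `hasNext` across pages can be sketched as a small loop. Here `get_page` is a stand-in for the real HTTP call (`(page_number, page_size) -> dict` of the shape shown above).

```python
def fetch_bid_history(get_page, page_size: int = 100) -> list:
    """Collect all bid records by paging until hasNext is false.

    `get_page` abstracts the HTTP request to /bidding-history.
    """
    bids, page = [], 1  # pageNumber starts at 1
    while True:
        data = get_page(page, page_size)
        bids.extend(data.get("results", []))
        if not data.get("hasNext"):
            return bids
        page += 1
```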
Captured Data:
- ✅ Bid amount, timestamp, bidder ID
- ✅ Autobid flag
- ⚠️ `negotiated` - Not yet captured
Calculated Intelligence:
- ✅ First bid time
- ✅ Last bid time
- ✅ Bid velocity (bids per hour)
API Integration Points
Files:
- `src/graphql_client.py` - GraphQL queries and parsing
- `src/bid_history_client.py` - REST API pagination and parsing
- `src/scraper.py` - Integration during lot scraping
Flow:
- Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
- Call GraphQL API → Get bidding data
- If bid_count > 0 → Call REST API → Get complete bid history
- Calculate bid intelligence metrics
- Save to database
Rate Limiting:
- API calls happen during lot scraping phase
- Overall 0.5s rate limit applies to page requests
- API calls are part of lot processing (not separately limited)
See API_INTELLIGENCE_FINDINGS.md for detailed field analysis and roadmap.