- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly, so it is independent of `sys.path` quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message, and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from the investigation and improved logs):
  - The 403s come from the GraphQL endpoint, not the HTML page; they are most likely WAF/CDN protections rejecting non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to absorb transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the response body for troubleshooting.
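The retry shape can be sketched as follows. This is a minimal illustration, not the exact code in `src/graphql_client.py`; the helper name, attempt count, and delays are assumptions, and the request itself is injected as an async callable so the sketch stays transport-agnostic:

```python
import asyncio

RETRY_STATUSES = {403, 429}  # transient WAF / rate-limit responses
MAX_ATTEMPTS = 3             # illustrative; the real client may differ

async def request_with_backoff(do_request, max_attempts=MAX_ATTEMPTS):
    """Call `do_request()` (an async callable returning (status, body)),
    retrying with exponential backoff on 403/429.  Returns the final
    (status, body), or None if every attempt was rejected."""
    delay = 0.5
    for attempt in range(1, max_attempts + 1):
        status, body = await do_request()
        if status not in RETRY_STATUSES:
            return status, body
        # Log a safe, truncated snippet of the body for troubleshooting.
        print(f"GraphQL API error: {status} (attempt {attempt}) - {body[:200]}")
        if attempt < max_attempts:
            await asyncio.sleep(delay)
            delay *= 2
    return None  # persistent 403/429: the caller treats this as "no data"
```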
2) Incremental/in-place logging for downloads
- Updated the image download section of `src/scraper.py` to:
  - Show in-place progress: `Downloading images: X/N`, updated live as each image finishes.
  - Print `Downloaded: K/N new images` after completion.
  - List the indexes of the images that were actually downloaded (the first 20, then `(+M more)` if applicable), so you can see exactly what was fetched for the lot.
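A sketch of how the in-place progress line and summary could be produced. `report_progress` and `summarize` are illustrative names, not the actual functions in `src/scraper.py`:

```python
import sys

def report_progress(done: int, total: int) -> None:
    """Rewrite the current terminal line in place with `\\r` (no newline)."""
    sys.stdout.write(f"\rDownloading images: {done}/{total}")
    sys.stdout.flush()

def summarize(downloaded: list, total: int, max_shown: int = 20) -> str:
    """Build the post-download summary with a capped index list."""
    shown = ", ".join(str(i) for i in downloaded[:max_shown])
    extra = len(downloaded) - max_shown
    suffix = f" (+{extra} more)" if extra > 0 else ""
    return f"Downloaded: {len(downloaded)}/{total} new images\nIndexes: {shown}{suffix}"
```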
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and a contextual Referer.
  - Added a small retry with backoff for 403/429.
  - Improved error logs to include the status, lot ID, and a short body snippet.
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
# Scaev - Architecture & Data Flow
## System Overview

The scraper follows a three-phase hierarchical crawling pattern to extract auction and lot data from the Troostwijk Auctions website.
### Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│ SCAEV SCRAPER │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: COLLECT AUCTION URLs │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Listing Page │────────▶│ Extract /a/ │ │
│ │ /auctions? │ │ auction URLs │ │
│ │ page=1..N │ └──────────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ [ List of Auction URLs ] │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: EXTRACT LOT URLs FROM AUCTIONS │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Auction Page │────────▶│ Parse │ │
│ │ /a/... │ │ __NEXT_DATA__│ │
│ └──────────────┘ │ JSON │ │
│ │ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Save Auction │ │ Extract /l/ │ │
│ │ Metadata │ │ lot URLs │ │
│ │ to DB │ └──────────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ [ List of Lot URLs ] │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: SCRAPE LOT DETAILS + API ENRICHMENT │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Lot Page │────────▶│ Parse │ │
│ │ /l/... │ │ __NEXT_DATA__│ │
│ └──────────────┘ │ JSON │ │
│ └──────────────┘ │
│ │ │
│ ┌─────────────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ GraphQL API │ │ Bid History │ │ Save Images │ │
│ │ (Bidding + │ │ REST API │ │ URLs to DB │ │
│ │ Enrichment) │ │ (per lot) │ └──────────────┘ │
│ └──────────────┘ └──────────────┘ │ │
│ │ │ ▼ │
│ └──────────┬────────────┘ [Optional Download │
│ ▼ Concurrent per Lot] │
│ ┌──────────────┐ │
│ │ Save to DB: │ │
│ │ - Lot data │ │
│ │ - Bid data │ │
│ │ - Enrichment │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
### Database Schema
┌──────────────────────────────────────────────────────────────────┐
│ CACHE TABLE (HTML Storage with Compression) │
├──────────────────────────────────────────────────────────────────┤
│ cache │
│ ├── url (TEXT, PRIMARY KEY) │
│ ├── content (BLOB) -- Compressed HTML (zlib) │
│ ├── timestamp (REAL) │
│ ├── status_code (INTEGER) │
│ └── compressed (INTEGER) -- 1=compressed, 0=plain │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ AUCTIONS TABLE │
├──────────────────────────────────────────────────────────────────┤
│ auctions │
│ ├── auction_id (TEXT, PRIMARY KEY) -- e.g. "A7-39813" │
│ ├── url (TEXT, UNIQUE) │
│ ├── title (TEXT) │
│ ├── location (TEXT) -- e.g. "Cluj-Napoca, RO" │
│ ├── lots_count (INTEGER) │
│ ├── first_lot_closing_time (TEXT) │
│ └── scraped_at (TEXT) │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ LOTS TABLE (Core + Enriched Intelligence) │
├──────────────────────────────────────────────────────────────────┤
│ lots │
│ ├── lot_id (TEXT, PRIMARY KEY) -- e.g. "A1-28505-5" │
│ ├── auction_id (TEXT) -- FK to auctions │
│ ├── url (TEXT, UNIQUE) │
│ ├── title (TEXT) │
│ │ │
│ ├─ BIDDING DATA (GraphQL API) ──────────────────────────────────┤
│ ├── current_bid (TEXT) -- Current bid amount │
│ ├── starting_bid (TEXT) -- Initial/opening bid │
│ ├── minimum_bid (TEXT) -- Next minimum bid │
│ ├── bid_count (INTEGER) -- Number of bids │
│ ├── bid_increment (REAL) -- Bid step size │
│ ├── closing_time (TEXT) -- Lot end date │
│ ├── status (TEXT) -- Minimum bid status │
│ │ │
│ ├─ BID INTELLIGENCE (Calculated from bid_history) ──────────────┤
│ ├── first_bid_time (TEXT) -- First bid timestamp │
│ ├── last_bid_time (TEXT) -- Latest bid timestamp │
│ ├── bid_velocity (REAL) -- Bids per hour │
│ │ │
│ ├─ VALUATION & ATTRIBUTES (from __NEXT_DATA__) ─────────────────┤
│ ├── brand (TEXT) -- Brand from attributes │
│ ├── model (TEXT) -- Model from attributes │
│ ├── manufacturer (TEXT) -- Manufacturer name │
│ ├── year_manufactured (INTEGER) -- Year extracted │
│ ├── condition_score (REAL) -- 0-10 condition rating │
│ ├── condition_description (TEXT) -- Condition text │
│ ├── serial_number (TEXT) -- Serial/VIN number │
│ ├── damage_description (TEXT) -- Damage notes │
│ ├── attributes_json (TEXT) -- Full attributes JSON │
│ │ │
│ ├─ LEGACY/OTHER ─────────────────────────────────────────────────┤
│ ├── viewing_time (TEXT) -- Viewing schedule │
│ ├── pickup_date (TEXT) -- Pickup schedule │
│ ├── location (TEXT) -- e.g. "Dongen, NL" │
│ ├── description (TEXT) -- Lot description │
│ ├── category (TEXT) -- Lot category │
│ ├── sale_id (INTEGER) -- Legacy field │
│ ├── type (TEXT) -- Legacy field │
│ ├── year (INTEGER) -- Legacy field │
│ ├── currency (TEXT) -- Currency code │
│ ├── closing_notified (INTEGER) -- Notification flag │
│ └── scraped_at (TEXT) -- Scrape timestamp │
│ FOREIGN KEY (auction_id) → auctions(auction_id) │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ IMAGES TABLE (Image URLs & Download Status) │
├──────────────────────────────────────────────────────────────────┤
│ images ◀── THIS TABLE HOLDS IMAGE LINKS│
│ ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT) │
│ ├── lot_id (TEXT) -- FK to lots │
│ ├── url (TEXT) -- Image URL │
│ ├── local_path (TEXT) -- Path after download │
│ └── downloaded (INTEGER) -- 0=pending, 1=downloaded │
│ FOREIGN KEY (lot_id) → lots(lot_id) │
│ UNIQUE INDEX idx_unique_lot_url ON (lot_id, url) │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ BID_HISTORY TABLE (Complete Bid Tracking for Intelligence) │
├──────────────────────────────────────────────────────────────────┤
│ bid_history ◀── REST API: /bidding-history │
│ ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT) │
│ ├── lot_id (TEXT) -- FK to lots │
│ ├── bid_amount (REAL) -- Bid in EUR │
│ ├── bid_time (TEXT) -- ISO 8601 timestamp │
│ ├── is_autobid (INTEGER) -- 0=manual, 1=autobid │
│ ├── bidder_id (TEXT) -- Anonymized bidder UUID │
│ ├── bidder_number (INTEGER) -- Bidder display number │
│ └── created_at (TEXT) -- Record creation timestamp │
│ FOREIGN KEY (lot_id) → lots(lot_id) │
│ INDEX idx_bid_history_lot ON (lot_id) │
│ INDEX idx_bid_history_time ON (bid_time) │
└──────────────────────────────────────────────────────────────────┘
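As a concrete sketch, the `bid_history` table above could be created with SQLite DDL along these lines. This DDL is reconstructed from the diagram; the actual migration in `db/migration` may differ in detail:

```python
import sqlite3

# DDL reconstructed from the schema diagram above; the real migration
# (e.g. V1__initial_schema.sql) may differ in detail.
BID_HISTORY_DDL = """
CREATE TABLE IF NOT EXISTS bid_history (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id        TEXT,     -- FK to lots
    bid_amount    REAL,     -- bid in EUR
    bid_time      TEXT,     -- ISO 8601 timestamp
    is_autobid    INTEGER,  -- 0=manual, 1=autobid
    bidder_id     TEXT,     -- anonymized bidder UUID
    bidder_number INTEGER,  -- bidder display number
    created_at    TEXT,     -- record creation timestamp
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
CREATE INDEX IF NOT EXISTS idx_bid_history_lot  ON bid_history(lot_id);
CREATE INDEX IF NOT EXISTS idx_bid_history_time ON bid_history(bid_time);
"""

def init_bid_history(conn: sqlite3.Connection) -> None:
    """Create the bid_history table and its two lookup indexes."""
    conn.executescript(BID_HISTORY_DDL)
```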
### Sequence Diagram
User Scraper Playwright Cache DB Data Tables
│ │ │ │ │
│ Run │ │ │ │
├──────────────▶│ │ │ │
│ │ │ │ │
│ │ Phase 1: Listing Pages │ │
│ ├───────────────▶│ │ │
│ │ goto() │ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ ├───────────────────────────────▶│ │
│ │ compress & cache │ │
│ │ │ │ │
│ │ Phase 2: Auction Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT auctions
│ │ │ │ │
│ │ Phase 3: Lot Pages │ │
│ ├───────────────▶│ │ │
│ │◀───────────────┤ │ │
│ │ HTML │ │ │
│ │ │ │ │
│ │ Parse __NEXT_DATA__ JSON │ │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT lots │
│ │────────────────────────────────────────────────▶│
│ │ │ │ INSERT images│
│ │ │ │ │
│ │ Export to CSV/JSON │ │
│ │◀────────────────────────────────────────────────┤
│ │ Query all data │ │
│◀──────────────┤ │ │ │
│ Results │ │ │ │
## Data Flow Details

### 1. Page Retrieval & Caching
Request URL
│
├──▶ Check cache DB (with timestamp validation)
│ │
│ ├─[HIT]──▶ Decompress (if compressed=1)
│ │ └──▶ Return HTML
│ │
│ └─[MISS]─▶ Fetch via Playwright
│ │
│ ├──▶ Compress HTML (zlib level 9)
│ │ ~70-90% size reduction
│ │
│ └──▶ Store in cache DB (compressed=1)
│
└──▶ Return HTML for parsing
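The cache path above can be sketched against the `cache` table schema. These are hypothetical helper names, not the scraper's actual functions:

```python
import sqlite3
import time
import zlib

CACHE_TTL = 24 * 3600  # 24-hour default validity, per the flow above

def cache_get(conn: sqlite3.Connection, url: str):
    """Return cached HTML for `url`, or None on a miss or expired entry."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None or time.time() - row[1] > CACHE_TTL:
        return None
    content, _, compressed = row
    if compressed:
        return zlib.decompress(content).decode()
    # Legacy uncompressed entries may be stored as plain text.
    return content.decode() if isinstance(content, bytes) else content

def cache_put(conn: sqlite3.Connection, url: str, html: str, status: int = 200) -> None:
    """Compress the HTML (zlib level 9) and store it with compressed=1."""
    blob = zlib.compress(html.encode(), 9)
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), status),
    )
```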
### 2. JSON Parsing Strategy
HTML Content
│
└──▶ Extract <script id="__NEXT_DATA__">
│
├──▶ Parse JSON
│ │
│ ├─[has pageProps.lot]──▶ Individual LOT
│ │ └──▶ Extract: title, bid, location, images, etc.
│ │
│ └─[has pageProps.auction]──▶ AUCTION
│ │
│ ├─[has lots[] array]──▶ Auction with lots
│ │ └──▶ Extract: title, location, lots_count
│ │
│ └─[no lots[] array]──▶ Old format lot
│ └──▶ Parse as lot
│
└──▶ Fallback to HTML regex parsing (if JSON fails)
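The extraction and classification step can be sketched as below. This is a simplified, hypothetical helper; the real parser also distinguishes the `lots[]` and old-format cases:

```python
import json
import re

def parse_next_data(html: str):
    """Extract the __NEXT_DATA__ payload and classify the page.
    Returns ('lot' | 'auction', payload) or (None, None) on failure,
    in which case the caller falls back to HTML regex parsing."""
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if not m:
        return None, None
    try:
        props = json.loads(m.group(1))["props"]["pageProps"]
    except (json.JSONDecodeError, KeyError):
        return None, None
    if "lot" in props:
        return "lot", props["lot"]
    if "auction" in props:
        return "auction", props["auction"]
    return None, None
```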
### 3. API Enrichment Flow
Lot Page Scraped (__NEXT_DATA__ parsed)
│
├──▶ Extract lot UUID from JSON
│
├──▶ GraphQL API Call (fetch_lot_bidding_data)
│ └──▶ Returns: current_bid, starting_bid, minimum_bid,
│ bid_count, closing_time, status, bid_increment
│
├──▶ [If bid_count > 0] REST API Call (fetch_bid_history)
│ │
│ ├──▶ Fetch all bid pages (paginated)
│ │
│ └──▶ Returns: Complete bid history with timestamps,
│ bidder_ids, autobid flags, amounts
│ │
│ ├──▶ INSERT INTO bid_history (multiple records)
│ │
│ └──▶ Calculate bid intelligence:
│ - first_bid_time (earliest timestamp)
│ - last_bid_time (latest timestamp)
│ - bid_velocity (bids per hour)
│
├──▶ Extract enrichment from __NEXT_DATA__:
│ - Brand, model, manufacturer (from attributes)
│ - Year (regex from title/attributes)
│ - Condition (map to 0-10 score)
│ - Serial number, damage description
│
└──▶ INSERT/UPDATE lots table with all data
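The bid-intelligence calculation above can be sketched as follows (illustrative helper; assumes ISO 8601 timestamps as returned by the bid-history API):

```python
from datetime import datetime

def bid_intelligence(bid_times: list) -> dict:
    """Derive first/last bid time and bid velocity (bids per hour)
    from a list of ISO 8601 timestamp strings."""
    if not bid_times:
        return {"first_bid_time": None, "last_bid_time": None, "bid_velocity": 0.0}
    parsed = sorted(datetime.fromisoformat(t.replace("Z", "+00:00")) for t in bid_times)
    first, last = parsed[0], parsed[-1]
    span_hours = (last - first).total_seconds() / 3600
    # A single bid (or all bids in the same instant) counts as the raw count.
    velocity = len(parsed) / span_hours if span_hours > 0 else float(len(parsed))
    return {
        "first_bid_time": first.isoformat(),
        "last_bid_time": last.isoformat(),
        "bid_velocity": velocity,
    }
```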
### 4. Image Handling (Concurrent per Lot)
Lot Page Parsed
│
├──▶ Extract images[] from JSON
│ │
│ └──▶ INSERT OR IGNORE INTO images (lot_id, url, downloaded=0)
│ └──▶ Unique constraint prevents duplicates
│
└──▶ [If DOWNLOAD_IMAGES=True]
│
├──▶ Create concurrent download tasks (asyncio.gather)
│ │
│ ├──▶ All images for lot download in parallel
│ │ (No rate limiting between images in same lot)
│ │
│ ├──▶ Save to: /images/{lot_id}/001.jpg
│ │
│ └──▶ UPDATE images SET local_path=?, downloaded=1
│
└──▶ Rate limit only between lots (0.5s)
(Not between images within a lot)
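The concurrent per-lot download can be sketched with `asyncio.gather`. Here `fetch` and `save` are injected placeholders rather than the scraper's real I/O functions, so the sketch stays self-contained:

```python
import asyncio

async def download_lot_images(urls, fetch, save, on_progress=None):
    """Download all images of one lot in parallel (no per-image rate
    limit), reporting progress as each one finishes.  `fetch(url)` is an
    async callable returning bytes; `save(index, data)` persists them,
    e.g. to /images/{lot_id}/001.jpg."""
    total, done = len(urls), 0

    async def one(index, url):
        nonlocal done
        data = await fetch(url)
        save(index, data)
        done += 1               # safe: asyncio is single-threaded
        if on_progress:
            on_progress(done, total)
        return index

    finished = await asyncio.gather(*(one(i, u) for i, u in enumerate(urls)))
    return sorted(finished)     # indexes actually downloaded
```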
## Key Configuration

| Setting | Value | Purpose |
|---|---|---|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
## Output Files

```
/mnt/okcomputer/output/
├── auctions_{timestamp}.json   # Exported auctions
├── auctions_{timestamp}.csv    # Exported auctions
├── lots_{timestamp}.json       # Exported lots
├── lots_{timestamp}.csv        # Exported lots
└── images/                     # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
```
## Terminal Progress per Lot (TTY)

During lot analysis, Scaev now shows a per-lot TTY progress animation with a final summary of all inputs used:
- Spinner runs while enrichment is in progress.
- Summary lists every page/API used to analyze the lot with:
  - URL/label
  - Size in bytes
  - Source state: cache | realtime | offline | db | intercepted
  - Duration in ms

Example output snippet:

```
[LOT A1-28505-5] ✓ Done in 812 ms — pages/APIs used:
  • [html] https://www.troostwijkauctions.com/l/... | 142331 B | cache | 4 ms
  • [graphql] GraphQL lotDetails | 5321 B | realtime | 142 ms
  • [rest] REST bid history | 18234 B | realtime | 236 ms
```

Notes:
- In non-TTY environments the spinner is replaced by simple log lines.
- Intercepted GraphQL responses (captured during page load) are labeled as `intercepted` with near-zero duration.
## Data Flow “Tunnel” (Simplified)

For each lot, the data “tunnels through” the following stages:
1. HTML page → parse `__NEXT_DATA__` for core lot fields and the lot UUID.
2. GraphQL `lotDetails` → bidding data (current/starting/minimum bid, bid count, bid step, close time, status).
3. Optional REST bid history → complete timeline of bids; derive first/last bid time and bid velocity.
4. Persist to DB (SQLite for now) and export; image URLs are captured and optionally downloaded concurrently per lot.

Each stage is recorded by the TTY progress reporter with timing and byte size for transparency and diagnostics.
## Migrations and ORM Roadmap

- Migrations follow a Flyway-style convention in `db/migration` (e.g., `V1__initial_schema.sql`).
- The current baseline is V1; no new migrations are required at this time.
- Raw SQL usage remains in place (SQLite) while we prepare a gradual move to SQLAlchemy 2.x targeting PostgreSQL.
- See `docs/MIGRATIONS.md` for details on naming, workflow, and the future switch to PostgreSQL.
## Extension Points for Integration

### 1. Downstream Processing Pipeline

```sql
-- Query lots without downloaded images
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images (OCR, classification, etc.), then update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```
### 2. Real-time Monitoring

```sql
-- Check for lots scraped within the last hour
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```
### 3. Analytics & Reporting

```sql
-- Top locations
SELECT location, COUNT(*) AS lots_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```
### 4. Image Processing Integration

```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process images that are downloaded but not yet processed
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```
## Performance Characteristics

- Compression: ~70-90% HTML size reduction (1 GB → ~100-300 MB)
- Rate limiting: exactly 0.5 s between page requests (respectful scraping)
- Caching: 24-hour default cache validity (configurable)
- Throughput: up to ~7,200 pages/hour at the 0.5 s rate limit
- Scalability: SQLite handles millions of rows efficiently
## Error Handling
- Network failures: Cached as status_code=500, retry after cache expiry
- Parse failures: Falls back to HTML regex patterns
- Compression errors: Auto-detects and handles uncompressed legacy data
- Missing fields: Defaults to "No bids", empty string, or 0
## Rate Limiting & Ethics
- REQUIRED: 0.5 second delay between page requests (not between images)
- Respects cache: Avoids unnecessary re-fetching
- User-Agent: Identifies as standard browser
- No parallelization: Single-threaded sequential crawling for pages
- Image downloads: Concurrent within each lot (16x speedup)
## API Integration Architecture

### GraphQL API

Endpoint: `https://storefront.tbauctions.com/storefront/graphql`
Purpose: real-time bidding data and lot enrichment

Key query:

```graphql
query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    lot {
      currentBidAmount { cents currency }
      initialAmount { cents currency }
      nextMinimalBid { cents currency }
      nextBidStepInCents
      bidsCount
      followersCount          # Available - watch count
      startDate
      endDate
      minimumBidAmountMet
      biddingStatus
      condition
      location { city countryCode }
      categoryInformation { name path }
      attributes { name value }
    }
    estimatedFullPrice {      # Available - estimated value
      min { cents currency }
      max { cents currency }
    }
  }
}
```
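As a sketch, the request body for this query can be built as below. The endpoint URL and field names come from this section; the helper names and the `platform` default value are assumptions, not confirmed values:

```python
# Endpoint from the section above; helper names and the platform
# default are assumptions for illustration.
GRAPHQL_ENDPOINT = "https://storefront.tbauctions.com/storefront/graphql"

def build_lot_bidding_payload(lot_display_id: str, locale: str = "en",
                              platform: str = "TROOSTWIJK") -> dict:
    """Build the JSON body for the LotBiddingData query (trimmed fields)."""
    query = """query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
      lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
        lot { currentBidAmount { cents currency } bidsCount endDate biddingStatus }
      }
    }"""
    return {
        "query": query,
        "variables": {
            "lotDisplayId": lot_display_id,
            "locale": locale,
            "platform": platform,
        },
    }

def cents_to_amount(money: dict) -> str:
    """Convert a {cents, currency} money value into a display string."""
    return f"{money['cents'] / 100:.2f} {money['currency']}"
```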
Currently Captured:
- ✅ Current bid, starting bid, minimum bid
- ✅ Bid count and bid increment
- ✅ Closing time and status
- ✅ Brand, model, manufacturer (from attributes)
### REST API - Bid History

Endpoint: `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
Purpose: complete bid history for intelligence analysis

Parameters:
- `pageNumber` (starts at 1)
- `pageSize` (default: 100)
Response example:

```json
{
  "results": [
    {
      "buyerId": "uuid",                          // Anonymized bidder ID
      "buyerNumber": 4,                           // Display number
      "currentBid": {
        "cents": 370000,
        "currency": "EUR"
      },
      "autoBid": false,                           // Is autobid
      "negotiated": false,                        // Was negotiated
      "createdAt": "2025-12-05T04:53:56.763033Z"
    }
  ],
  "hasNext": true,
  "pageNumber": 1
}
```
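The pagination contract above (`results`, `hasNext`, `pageNumber`) can be walked with a small loop. `fetch_page` is an injected placeholder for the actual HTTP call:

```python
async def fetch_all_bids(fetch_page):
    """Walk the paginated bidding-history endpoint until hasNext is false.
    `fetch_page(page_number)` is an async callable returning the parsed
    JSON of one page, in the shape shown above."""
    bids, page = [], 1
    while True:
        data = await fetch_page(page)
        bids.extend(data.get("results", []))
        if not data.get("hasNext"):
            return bids
        page += 1
```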
Captured Data:
- ✅ Bid amount, timestamp, bidder ID
- ✅ Autobid flag
- ⚠️ `negotiated` - not yet captured

Calculated Intelligence:
- ✅ First bid time
- ✅ Last bid time
- ✅ Bid velocity (bids per hour)
### API Integration Points

Flow:
1. Lot page scraped → extract lot UUID from `__NEXT_DATA__`
2. Call GraphQL API → get bidding data
3. If `bid_count > 0` → call REST API → fetch the complete bid history
4. Calculate bid intelligence metrics
5. Save to database

Rate Limiting:
- API calls happen during the lot scraping phase
- The overall 0.5 s rate limit applies to page requests
- API calls are part of lot processing (not separately limited)