- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
- Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
- Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
- Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
- Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
- 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
- To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to absorb transient protection triggers. When a 403 persists, we now log the status, the lot id, and a safe, truncated snippet of the response body for troubleshooting.
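The shape of that test can be sketched as follows. This is an illustrative reconstruction, not the contents of `test/test_graphql_403.py`: the real test monkeypatches `aiohttp.ClientSession`, while here stand-in fakes and a simplified fetch function demonstrate the same non-200 → `None` path.

```python
import asyncio

# Illustrative stand-ins for aiohttp's response/session objects; the real
# test monkeypatches aiohttp.ClientSession instead of defining fakes.
class FakeResponse:
    status = 403
    async def text(self):
        return "Forbidden by WAF"
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        return False

class FakeSession:
    def post(self, url, **kwargs):
        return FakeResponse()
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        return False

# Simplified version of the client's error path: non-200 -> log and return None.
async def fetch_lot_bidding_data(lot_id, session_factory=FakeSession):
    async with session_factory() as session:
        async with session.post("https://example.invalid/graphql") as resp:
            if resp.status != 200:
                body = await resp.text()
                print(f"GraphQL API error: {resp.status} (lot={lot_id}) — {body[:80]}")
                return None
            return await resp.json()

result = asyncio.run(fetch_lot_bidding_data("A1-40179-35"))
assert result is None
```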
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
- Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
- After completion, print: `Downloaded: K/N new images`.
- Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
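The in-place counter can be done with a carriage return on a TTY. A minimal sketch of the idea (not the actual `src/scraper.py` code; the index-listing cutoff at 20 follows the behavior described above):

```python
import sys

def report_progress(done, total):
    # Overwrite the same line when attached to a TTY; plain lines otherwise.
    if sys.stdout.isatty():
        sys.stdout.write(f"\rDownloading images: {done}/{total}")
        sys.stdout.flush()
    else:
        print(f"Downloading images: {done}/{total}")

total = 6
for done in range(total + 1):
    report_progress(done, total)
print()  # move off the progress line

downloaded = list(range(total))  # indexes actually fetched (illustrative)
shown = ", ".join(str(i) for i in downloaded[:20])
extra = f" (+{len(downloaded) - 20} more)" if len(downloaded) > 20 else ""
print(f"Downloaded: {len(downloaded)}/{total} new images")
print(f"Indexes: {shown}{extra}")
```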
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
- Added browser-like headers and contextual Referer.
- Added small retry with backoff for 403/429.
- Improved error logs to include status, lot id, and a short body snippet.
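A hedged sketch of the retry-with-backoff idea, with the HTTP call injected so the control flow is visible; header values and function names here are illustrative, not the exact `src/graphql_client.py` implementation:

```python
import asyncio
import random

# Browser-like headers of the kind added to the client (values illustrative).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Origin": "https://www.troostwijkauctions.com",
    "Referer": "https://www.troostwijkauctions.com/l/some-lot",
}

async def post_with_retry(post, retries=3, base_delay=0.5):
    """Retry transient 403/429 with exponential backoff; give up otherwise."""
    for attempt in range(retries):
        status, body = await post(HEADERS)
        if status not in (403, 429):
            return status, body
        if attempt < retries - 1:
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    # Persistent failure: return a truncated body snippet for the caller to log.
    return status, body[:200]

# Demo: fail with 403 twice, then succeed on the third attempt.
calls = {"n": 0}
async def flaky_post(headers):
    calls["n"] += 1
    return (403, "Forbidden") if calls["n"] < 3 else (200, "{}")

status, body = asyncio.run(post_with_retry(flaky_post, base_delay=0.01))
assert status == 200 and calls["n"] == 3
```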
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
# Scaev - Architecture & Data Flow

## System Overview

The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website.

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                         SCAEV SCRAPER                           │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: COLLECT AUCTION URLs                                   │
│   ┌──────────────┐          ┌──────────────┐                    │
│   │ Listing Page │────────▶│ Extract /a/  │                    │
│   │ /auctions?   │          │ auction URLs │                    │
│   │ page=1..N    │          └──────────────┘                    │
│   └──────────────┘                 │                            │
│                                    ▼                            │
│                      [ List of Auction URLs ]                   │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                         │
│   ┌──────────────┐          ┌──────────────┐                    │
│   │ Auction Page │────────▶│ Parse        │                    │
│   │ /a/...       │          │ __NEXT_DATA__│                    │
│   └──────────────┘          │ JSON         │                    │
│          │                  └──────────────┘                    │
│          │                         │                            │
│          ▼                         ▼                            │
│   ┌──────────────┐          ┌──────────────┐                    │
│   │ Save Auction │          │ Extract /l/  │                    │
│   │ Metadata     │          │ lot URLs     │                    │
│   │ to DB        │          └──────────────┘                    │
│   └──────────────┘                 │                            │
│                                    ▼                            │
│                        [ List of Lot URLs ]                     │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: SCRAPE LOT DETAILS + API ENRICHMENT                    │
│   ┌──────────────┐          ┌──────────────┐                    │
│   │ Lot Page     │────────▶│ Parse        │                    │
│   │ /l/...       │          │ __NEXT_DATA__│                    │
│   └──────────────┘          │ JSON         │                    │
│                             └──────────────┘                    │
│                                    │                            │
│        ┌───────────────────────────┼─────────────────┐          │
│        ▼                           ▼                 ▼          │
│ ┌──────────────┐          ┌──────────────┐   ┌──────────────┐   │
│ │ GraphQL API  │          │ Bid History  │   │ Save Images  │   │
│ │ (Bidding +   │          │ REST API     │   │ URLs to DB   │   │
│ │ Enrichment)  │          │ (per lot)    │   └──────────────┘   │
│ └──────────────┘          └──────────────┘          │           │
│        │                           │                ▼           │
│        └──────────┬────────────────┘   [Optional Download       │
│                   ▼                     Concurrent per Lot]     │
│           ┌──────────────┐                                      │
│           │ Save to DB:  │                                      │
│           │ - Lot data   │                                      │
│           │ - Bid data   │                                      │
│           │ - Enrichment │                                      │
│           └──────────────┘                                      │
└─────────────────────────────────────────────────────────────────┘
```

## Database Schema

```
┌──────────────────────────────────────────────────────────────────┐
│ CACHE TABLE (HTML Storage with Compression)                      │
├──────────────────────────────────────────────────────────────────┤
│ cache                                                            │
│ ├── url (TEXT, PRIMARY KEY)                                      │
│ ├── content (BLOB)              -- Compressed HTML (zlib)        │
│ ├── timestamp (REAL)                                             │
│ ├── status_code (INTEGER)                                        │
│ └── compressed (INTEGER)        -- 1=compressed, 0=plain         │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ AUCTIONS TABLE                                                   │
├──────────────────────────────────────────────────────────────────┤
│ auctions                                                         │
│ ├── auction_id (TEXT, PRIMARY KEY)   -- e.g. "A7-39813"          │
│ ├── url (TEXT, UNIQUE)                                           │
│ ├── title (TEXT)                                                 │
│ ├── location (TEXT)                  -- e.g. "Cluj-Napoca, RO"   │
│ ├── lots_count (INTEGER)                                         │
│ ├── first_lot_closing_time (TEXT)                                │
│ └── scraped_at (TEXT)                                            │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ LOTS TABLE (Core + Enriched Intelligence)                        │
├──────────────────────────────────────────────────────────────────┤
│ lots                                                             │
│ ├── lot_id (TEXT, PRIMARY KEY)       -- e.g. "A1-28505-5"        │
│ ├── auction_id (TEXT)                -- FK to auctions           │
│ ├── url (TEXT, UNIQUE)                                           │
│ ├── title (TEXT)                                                 │
│ │                                                                │
│ ├─ BIDDING DATA (GraphQL API) ───────────────────────────────────┤
│ ├── current_bid (TEXT)               -- Current bid amount       │
│ ├── starting_bid (TEXT)              -- Initial/opening bid      │
│ ├── minimum_bid (TEXT)               -- Next minimum bid         │
│ ├── bid_count (INTEGER)              -- Number of bids           │
│ ├── bid_increment (REAL)             -- Bid step size            │
│ ├── closing_time (TEXT)              -- Lot end date             │
│ ├── status (TEXT)                    -- Minimum bid status       │
│ │                                                                │
│ ├─ BID INTELLIGENCE (Calculated from bid_history) ───────────────┤
│ ├── first_bid_time (TEXT)            -- First bid timestamp      │
│ ├── last_bid_time (TEXT)             -- Latest bid timestamp     │
│ ├── bid_velocity (REAL)              -- Bids per hour            │
│ │                                                                │
│ ├─ VALUATION & ATTRIBUTES (from __NEXT_DATA__) ──────────────────┤
│ ├── brand (TEXT)                     -- Brand from attributes    │
│ ├── model (TEXT)                     -- Model from attributes    │
│ ├── manufacturer (TEXT)              -- Manufacturer name        │
│ ├── year_manufactured (INTEGER)      -- Year extracted           │
│ ├── condition_score (REAL)           -- 0-10 condition rating    │
│ ├── condition_description (TEXT)     -- Condition text           │
│ ├── serial_number (TEXT)             -- Serial/VIN number        │
│ ├── damage_description (TEXT)        -- Damage notes             │
│ ├── attributes_json (TEXT)           -- Full attributes JSON     │
│ │                                                                │
│ ├─ LEGACY/OTHER ─────────────────────────────────────────────────┤
│ ├── viewing_time (TEXT)              -- Viewing schedule         │
│ ├── pickup_date (TEXT)               -- Pickup schedule          │
│ ├── location (TEXT)                  -- e.g. "Dongen, NL"        │
│ ├── description (TEXT)               -- Lot description          │
│ ├── category (TEXT)                  -- Lot category             │
│ ├── sale_id (INTEGER)                -- Legacy field             │
│ ├── type (TEXT)                      -- Legacy field             │
│ ├── year (INTEGER)                   -- Legacy field             │
│ ├── currency (TEXT)                  -- Currency code            │
│ ├── closing_notified (INTEGER)       -- Notification flag        │
│ └── scraped_at (TEXT)                -- Scrape timestamp         │
│ FOREIGN KEY (auction_id) → auctions(auction_id)                  │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ IMAGES TABLE (Image URLs & Download Status)                      │
├──────────────────────────────────────────────────────────────────┤
│ images                        ◀── THIS TABLE HOLDS IMAGE LINKS   │
│ ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                      │
│ ├── lot_id (TEXT)                    -- FK to lots               │
│ ├── url (TEXT)                       -- Image URL                │
│ ├── local_path (TEXT)                -- Path after download      │
│ └── downloaded (INTEGER)             -- 0=pending, 1=downloaded  │
│ FOREIGN KEY (lot_id) → lots(lot_id)                              │
│ UNIQUE INDEX idx_unique_lot_url ON (lot_id, url)                 │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ BID_HISTORY TABLE (Complete Bid Tracking for Intelligence)       │
├──────────────────────────────────────────────────────────────────┤
│ bid_history                   ◀── REST API: /bidding-history     │
│ ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                      │
│ ├── lot_id (TEXT)                    -- FK to lots               │
│ ├── bid_amount (REAL)                -- Bid in EUR               │
│ ├── bid_time (TEXT)                  -- ISO 8601 timestamp       │
│ ├── is_autobid (INTEGER)             -- 0=manual, 1=autobid      │
│ ├── bidder_id (TEXT)                 -- Anonymized bidder UUID   │
│ ├── bidder_number (INTEGER)          -- Bidder display number    │
│ └── created_at (TEXT)                -- Record creation timestamp│
│ FOREIGN KEY (lot_id) → lots(lot_id)                              │
│ INDEX idx_bid_history_lot ON (lot_id)                            │
│ INDEX idx_bid_history_time ON (bid_time)                         │
└──────────────────────────────────────────────────────────────────┘
```

## Sequence Diagram

```
User           Scraper          Playwright       Cache DB       Data Tables
 │                │                │                │                │
 │      Run       │                │                │                │
 ├──────────────▶│                │                │                │
 │                │                │                │                │
 │                │  Phase 1: Listing Pages        │                │
 │                ├───────────────▶│                │                │
 │                │     goto()     │                │                │
 │                │◀───────────────┤                │                │
 │                │      HTML      │                │                │
 │                ├───────────────────────────────▶│                │
 │                │        compress & cache        │                │
 │                │                │                │                │
 │                │  Phase 2: Auction Pages        │                │
 │                ├───────────────▶│                │                │
 │                │◀───────────────┤                │                │
 │                │      HTML      │                │                │
 │                │                │                │                │
 │                │  Parse __NEXT_DATA__ JSON      │                │
 │                │────────────────────────────────────────────────▶│
 │                │                │                │ INSERT auctions│
 │                │                │                │                │
 │                │  Phase 3: Lot Pages            │                │
 │                ├───────────────▶│                │                │
 │                │◀───────────────┤                │                │
 │                │      HTML      │                │                │
 │                │                │                │                │
 │                │  Parse __NEXT_DATA__ JSON      │                │
 │                │────────────────────────────────────────────────▶│
 │                │                │                │   INSERT lots  │
 │                │────────────────────────────────────────────────▶│
 │                │                │                │  INSERT images │
 │                │                │                │                │
 │                │  Export to CSV/JSON            │                │
 │                │◀────────────────────────────────────────────────┤
 │                │         Query all data         │                │
 │◀──────────────┤                │                │                │
 │    Results     │                │                │                │
```

## Data Flow Details

### 1. **Page Retrieval & Caching**
```
Request URL
     │
     ├──▶ Check cache DB (with timestamp validation)
     │         │
     │         ├─[HIT]──▶ Decompress (if compressed=1)
     │         │              └──▶ Return HTML
     │         │
     │         └─[MISS]─▶ Fetch via Playwright
     │                        │
     │                        ├──▶ Compress HTML (zlib level 9)
     │                        │        ~70-90% size reduction
     │                        │
     │                        └──▶ Store in cache DB (compressed=1)
     │
     └──▶ Return HTML for parsing
```
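The HIT/MISS flow above can be sketched against an in-memory SQLite database. The schema and zlib usage mirror the cache table described earlier; the function names are illustrative, not the actual `src/` code:

```python
import sqlite3
import time
import zlib

# Minimal sketch of the cache table flow (schema mirrors the diagram above).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache (url TEXT PRIMARY KEY, content BLOB, "
    "timestamp REAL, status_code INTEGER, compressed INTEGER)"
)

def store(url, html, status=200):
    blob = zlib.compress(html.encode("utf-8"), level=9)  # ~70-90% smaller
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, 1)",
                 (url, blob, time.time(), status))

def load(url, max_age=24 * 3600):
    row = conn.execute("SELECT content, timestamp, compressed FROM cache "
                       "WHERE url = ?", (url,)).fetchone()
    if row is None or time.time() - row[1] > max_age:
        return None  # MISS (or expired): fetch via Playwright, then store()
    content, _, compressed = row
    return zlib.decompress(content).decode("utf-8") if compressed else content

store("https://example.invalid/a/1", "<html>hello</html>" * 100)
cached = load("https://example.invalid/a/1")
assert cached is not None and cached.startswith("<html>")
```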

### 2. **JSON Parsing Strategy**
```
HTML Content
     │
     └──▶ Extract <script id="__NEXT_DATA__">
               │
               ├──▶ Parse JSON
               │        │
               │        ├─[has pageProps.lot]──▶ Individual LOT
               │        │        └──▶ Extract: title, bid, location, images, etc.
               │        │
               │        └─[has pageProps.auction]──▶ AUCTION
               │                 │
               │                 ├─[has lots[] array]──▶ Auction with lots
               │                 │        └──▶ Extract: title, location, lots_count
               │                 │
               │                 └─[no lots[] array]──▶ Old format lot
               │                          └──▶ Parse as lot
               │
               └──▶ Fallback to HTML regex parsing (if JSON fails)
```
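A minimal sketch of the extraction and branching step, assuming the standard Next.js nesting (`props.pageProps`); the sample HTML and exact key paths are illustrative:

```python
import json
import re

# Illustrative page; real pages embed a much larger __NEXT_DATA__ blob.
html = """<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"lot": {"title": "Forklift", "images": []}}}}
</script></body></html>"""

match = re.search(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL
)
data = json.loads(match.group(1)) if match else None
page_props = data["props"]["pageProps"] if data else {}

# The branching from the diagram above.
if "lot" in page_props:
    kind = "lot"
elif "auction" in page_props:
    kind = "auction" if page_props["auction"].get("lots") else "old-format lot"
else:
    kind = "fallback"  # fall back to HTML regex parsing

assert kind == "lot"
```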

### 3. **API Enrichment Flow**
```
Lot Page Scraped (__NEXT_DATA__ parsed)
     │
     ├──▶ Extract lot UUID from JSON
     │
     ├──▶ GraphQL API Call (fetch_lot_bidding_data)
     │        └──▶ Returns: current_bid, starting_bid, minimum_bid,
     │                      bid_count, closing_time, status, bid_increment
     │
     ├──▶ [If bid_count > 0] REST API Call (fetch_bid_history)
     │        │
     │        ├──▶ Fetch all bid pages (paginated)
     │        │
     │        └──▶ Returns: Complete bid history with timestamps,
     │                      bidder_ids, autobid flags, amounts
     │                 │
     │                 ├──▶ INSERT INTO bid_history (multiple records)
     │                 │
     │                 └──▶ Calculate bid intelligence:
     │                          - first_bid_time (earliest timestamp)
     │                          - last_bid_time (latest timestamp)
     │                          - bid_velocity (bids per hour)
     │
     ├──▶ Extract enrichment from __NEXT_DATA__:
     │        - Brand, model, manufacturer (from attributes)
     │        - Year (regex from title/attributes)
     │        - Condition (map to 0-10 score)
     │        - Serial number, damage description
     │
     └──▶ INSERT/UPDATE lots table with all data
```
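The bid-intelligence step reduces to a small calculation over `bid_history` timestamps. A sketch with illustrative data (the zero-span guard is an assumption about how single-bid lots are handled):

```python
from datetime import datetime

# Illustrative bid timestamps (ISO 8601, as stored in bid_history.bid_time).
bid_times = [
    "2025-12-05T05:30:00+00:00",
    "2025-12-05T04:00:00+00:00",
    "2025-12-05T08:00:00+00:00",
]

stamps = sorted(datetime.fromisoformat(t) for t in bid_times)
first_bid_time, last_bid_time = stamps[0], stamps[-1]
span_hours = (last_bid_time - first_bid_time).total_seconds() / 3600

# Bids per hour; guard against a zero span when all bids share a timestamp.
bid_velocity = len(stamps) / span_hours if span_hours > 0 else float(len(stamps))

assert round(bid_velocity, 2) == 0.75  # 3 bids over 4 hours
```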

### 4. **Image Handling (Concurrent per Lot)**
```
Lot Page Parsed
     │
     ├──▶ Extract images[] from JSON
     │        │
     │        └──▶ INSERT OR IGNORE INTO images (lot_id, url, downloaded=0)
     │                 └──▶ Unique constraint prevents duplicates
     │
     └──▶ [If DOWNLOAD_IMAGES=True]
              │
              ├──▶ Create concurrent download tasks (asyncio.gather)
              │        │
              │        ├──▶ All images for lot download in parallel
              │        │        (No rate limiting between images in same lot)
              │        │
              │        ├──▶ Save to: /images/{lot_id}/001.jpg
              │        │
              │        └──▶ UPDATE images SET local_path=?, downloaded=1
              │
              └──▶ Rate limit only between lots (0.5s)
                       (Not between images within a lot)
```
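The per-lot concurrency can be sketched with `asyncio.gather`, with the HTTP fetch injected as a stand-in; paths follow the `/images/{lot_id}/001.jpg` convention above, and the function names are illustrative:

```python
import asyncio
import tempfile
from pathlib import Path

async def download_one(fetch, url, dest):
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(await fetch(url))
    return dest

async def download_lot_images(fetch, lot_id, urls, root):
    # All images for one lot download in parallel; the 0.5 s rate limit
    # applies only between lots, not between these tasks.
    tasks = [
        download_one(fetch, url, Path(root) / lot_id / f"{i + 1:03d}.jpg")
        for i, url in enumerate(urls)
    ]
    return await asyncio.gather(*tasks)

async def fake_fetch(url):  # stand-in for the real HTTP GET
    return b"\xff\xd8fake-jpeg-bytes"

root = tempfile.mkdtemp()
paths = asyncio.run(download_lot_images(
    fake_fetch, "A1-28505-5",
    ["https://example.invalid/1.jpg", "https://example.invalid/2.jpg"],
    root,
))
assert [p.name for p in paths] == ["001.jpg", "002.jpg"]
```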

## Key Configuration

| Setting              | Value                             | Purpose                          |
|----------------------|-----------------------------------|----------------------------------|
| `CACHE_DB`           | `/mnt/okcomputer/output/cache.db` | SQLite database path             |
| `IMAGES_DIR`         | `/mnt/okcomputer/output/images`   | Downloaded images storage        |
| `RATE_LIMIT_SECONDS` | `0.5`                             | Delay between requests           |
| `DOWNLOAD_IMAGES`    | `False`                           | Toggle image downloading         |
| `MAX_PAGES`          | `50`                              | Number of listing pages to crawl |
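These settings plausibly live as plain module constants; a sketch of what `src/config.py` might contain (names and values from the table above; the actual file may differ):

```python
# Sketch of src/config.py based on the configuration table (illustrative).
CACHE_DB = "/mnt/okcomputer/output/cache.db"    # SQLite database path
IMAGES_DIR = "/mnt/okcomputer/output/images"    # Downloaded images storage
RATE_LIMIT_SECONDS = 0.5                        # Delay between page requests
DOWNLOAD_IMAGES = False                         # Toggle image downloading
MAX_PAGES = 50                                  # Listing pages to crawl
```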

## Output Files

```
/mnt/okcomputer/output/
├── auctions_{timestamp}.json    # Exported auctions
├── auctions_{timestamp}.csv     # Exported auctions
├── lots_{timestamp}.json        # Exported lots
├── lots_{timestamp}.csv         # Exported lots
└── images/                      # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
```

## Terminal Progress per Lot (TTY)

During lot analysis, Scaev now shows a per-lot TTY progress animation with a final summary of all inputs used:

- Spinner runs while enrichment is in progress.
- Summary lists every page/API used to analyze the lot with:
  - URL/label
  - Size in bytes
  - Source state: cache | realtime | offline | db | intercepted
  - Duration in ms

Example output snippet:

```
[LOT A1-28505-5] ✓ Done in 812 ms — pages/APIs used:
  • [html] https://www.troostwijkauctions.com/l/... | 142331 B | cache | 4 ms
  • [graphql] GraphQL lotDetails | 5321 B | realtime | 142 ms
  • [rest] REST bid history | 18234 B | realtime | 236 ms
```

Notes:

- In non-TTY environments the spinner is replaced by simple log lines.
- Intercepted GraphQL responses (captured during page load) are labeled as `intercepted` with near-zero duration.

## Data Flow “Tunnel” (Simplified)

For each lot, the data “tunnels through” the following stages:

1. HTML page → parse `__NEXT_DATA__` for core lot fields and lot UUID.
2. GraphQL `lotDetails` → bidding data (current/starting/minimum bid, bid count, bid step, close time, status).
3. Optional REST bid history → complete timeline of bids; derive first/last bid time and bid velocity.
4. Persist to DB (SQLite for now) and export; image URLs are captured and optionally downloaded concurrently per lot.

Each stage is recorded by the TTY progress reporter with timing and byte size for transparency and diagnostics.

## Migrations and ORM Roadmap

- Migrations follow a Flyway-style convention in `db/migration` (e.g., `V1__initial_schema.sql`).
- The current baseline is V1; no new migrations are required at this time.
- Raw SQL usage remains in place (SQLite) while we prepare a gradual move to SQLAlchemy 2.x targeting PostgreSQL.
- See `docs/MIGRATIONS.md` for details on naming, workflow, and the future switch to PostgreSQL.

## Extension Points for Integration

### 1. **Downstream Processing Pipeline**
```sql
-- Query lots without downloaded images
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```

### 2. **Real-time Monitoring**
```sql
-- Check for new lots every N minutes
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```

### 3. **Analytics & Reporting**
```sql
-- Top locations
SELECT location, COUNT(*) AS lots_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```

### 4. **Image Processing Integration**
```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```

## Performance Characteristics

- **Compression**: ~70-90% HTML size reduction (1 GB → ~100-300 MB)
- **Rate Limiting**: 0.5 s delay between page requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: up to ~7,200 pages/hour at the 0.5 s rate limit (excluding fetch and parse time)
- **Scalability**: SQLite handles millions of rows efficiently

## Error Handling

- **Network failures**: Cached as status_code=500; retried after cache expiry
- **Parse failures**: Falls back to HTML regex patterns
- **Compression errors**: Auto-detects and handles uncompressed legacy data
- **Missing fields**: Defaults to "No bids", empty string, or 0

## Rate Limiting & Ethics

- **REQUIRED**: 0.5 second delay between page requests (not between images)
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as a standard browser
- **No parallelization**: Single-threaded sequential crawling for pages
- **Image downloads**: Concurrent within each lot (up to 16x speedup)

---

## API Integration Architecture

### GraphQL API

**Endpoint:** `https://storefront.tbauctions.com/storefront/graphql`

**Purpose:** Real-time bidding data and lot enrichment

**Key Query:**
```graphql
query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    lot {
      currentBidAmount { cents currency }
      initialAmount { cents currency }
      nextMinimalBid { cents currency }
      nextBidStepInCents
      bidsCount
      followersCount          # Available - Watch count
      startDate
      endDate
      minimumBidAmountMet
      biddingStatus
      condition
      location { city countryCode }
      categoryInformation { name path }
      attributes { name value }
    }
    estimatedFullPrice {      # Available - Estimated value
      min { cents currency }
      max { cents currency }
    }
  }
}
```
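Posting this query means wrapping it in a standard GraphQL JSON payload. A sketch of the payload construction only, with no network call; the default `locale` and `platform` values here are assumptions, not confirmed API values:

```python
import json

# Abbreviated form of the LotBiddingData query shown above.
LOT_BIDDING_QUERY = """query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    lot { currentBidAmount { cents currency } bidsCount endDate }
  }
}"""

def build_payload(lot_display_id, locale="en", platform="TWK"):
    # platform="TWK" is a guess at the enum value for Troostwijk.
    return json.dumps({
        "query": LOT_BIDDING_QUERY,
        "variables": {
            "lotDisplayId": lot_display_id,
            "locale": locale,
            "platform": platform,
        },
    })

payload = build_payload("A1-28505-5")
assert json.loads(payload)["variables"]["lotDisplayId"] == "A1-28505-5"
```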

**Currently Captured:**

- ✅ Current bid, starting bid, minimum bid
- ✅ Bid count and bid increment
- ✅ Closing time and status
- ✅ Brand, model, manufacturer (from attributes)

### REST API - Bid History

**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`

**Purpose:** Complete bid history for intelligence analysis

**Parameters:**

- `pageNumber` (starts at 1)
- `pageSize` (default: 100)

**Response Example:**
```json
{
  "results": [
    {
      "buyerId": "uuid",          // Anonymized bidder ID
      "buyerNumber": 4,           // Display number
      "currentBid": {
        "cents": 370000,
        "currency": "EUR"
      },
      "autoBid": false,           // Is autobid
      "negotiated": false,        // Was negotiated
      "createdAt": "2025-12-05T04:53:56.763033Z"
    }
  ],
  "hasNext": true,
  "pageNumber": 1
}
```
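Walking `hasNext`/`pageNumber` and converting `cents` to EUR can be sketched as follows, with the HTTP call injected as a stand-in (the real client makes the request; field names come from the response example above):

```python
# Page through the bid-history endpoint and normalize each bid.
def fetch_all_bids(fetch_page, page_size=100):
    bids, page = [], 1
    while True:
        data = fetch_page(pageNumber=page, pageSize=page_size)
        for bid in data["results"]:
            bids.append({
                "amount_eur": bid["currentBid"]["cents"] / 100,  # cents -> EUR
                "time": bid["createdAt"],
                "autobid": bid["autoBid"],
            })
        if not data["hasNext"]:
            return bids
        page += 1

# Two illustrative pages standing in for real HTTP responses.
pages = {
    1: {"results": [{"currentBid": {"cents": 370000, "currency": "EUR"},
                     "createdAt": "2025-12-05T04:53:56Z", "autoBid": False}],
        "hasNext": True},
    2: {"results": [], "hasNext": False},
}
bids = fetch_all_bids(lambda pageNumber, pageSize: pages[pageNumber])
assert bids[0]["amount_eur"] == 3700.0
```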

**Captured Data:**

- ✅ Bid amount, timestamp, bidder ID
- ✅ Autobid flag
- ⚠️ `negotiated` - Not yet captured

**Calculated Intelligence:**

- ✅ First bid time
- ✅ Last bid time
- ✅ Bid velocity (bids per hour)

### API Integration Points

**Flow:**

1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
2. Call GraphQL API → Get bidding data
3. If bid_count > 0 → Call REST API → Get complete bid history
4. Calculate bid intelligence metrics
5. Save to database

**Rate Limiting:**

- API calls happen during the lot scraping phase
- The overall 0.5 s rate limit applies to page requests
- API calls are part of lot processing (not separately limited)