Tour 5ea2342dbc
- Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
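
A minimal sketch of the test shape, for orientation only: it assumes the client does `async with aiohttp.ClientSession() as s:` and `async with s.post(...) as resp:`, looks up `aiohttp.ClientSession` at call time, imports the module directly instead of via `importlib`, and captures output with `capsys` rather than patching `builtins.print`:

```python
# Hypothetical sketch of the 403 test shape, not the actual test file.
import asyncio

import aiohttp


class FakeResponse:
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def json(self):
        return {}

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class FakeSession:
    def post(self, *args, **kwargs):
        # Returned object is used as `async with session.post(...) as resp:`
        return FakeResponse()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def test_graphql_403_returns_none(monkeypatch, capsys):
    import graphql_client  # stand-in for the importlib-based loading in the real test

    monkeypatch.setattr(aiohttp, "ClientSession", lambda *a, **k: FakeSession())
    result = asyncio.run(graphql_client.fetch_lot_bidding_data("A1-40179-35"))
    assert result is None
    assert "GraphQL API error: 403" in capsys.readouterr().out
```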

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
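
A minimal sketch of the in-place progress output described above (assumes a TTY; the helper names are illustrative, not the actual `src/scraper.py` functions):

```python
import sys


def report_progress(done: int, total: int) -> None:
    # \r rewrites the same line so the counter updates in place.
    sys.stdout.write(f"\r  Downloading images: {done}/{total}")
    sys.stdout.flush()


def report_summary(new_indexes: list[int], total: int) -> None:
    sys.stdout.write("\n")
    print(f"  Downloaded: {len(new_indexes)}/{total} new images")
    shown = ", ".join(str(i) for i in new_indexes[:20])
    more = f" (+{len(new_indexes) - 20} more)" if len(new_indexes) > 20 else ""
    print(f"    Indexes: {shown}{more}")
```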

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
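
A hedged sketch of the header/retry shape described above; `_build_headers` and `post_with_retry` are illustrative names, not the real `src/graphql_client.py` API, and the header values are placeholders:

```python
import asyncio

import aiohttp

GRAPHQL_URL = "https://storefront.tbauctions.com/storefront/graphql"


def _build_headers(lot_id: str) -> dict:
    # Browser-like headers plus a contextual Referer for the lot page (illustrative values).
    return {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        "Origin": "https://www.troostwijkauctions.com",
        "Referer": f"https://www.troostwijkauctions.com/l/{lot_id}",
        "Content-Type": "application/json",
    }


async def post_with_retry(session: aiohttp.ClientSession, payload: dict,
                          lot_id: str, attempts: int = 3):
    """POST the GraphQL payload, retrying 403/429 with a short backoff."""
    for attempt in range(attempts):
        async with session.post(GRAPHQL_URL, json=payload,
                                headers=_build_headers(lot_id)) as resp:
            if resp.status in (403, 429) and attempt < attempts - 1:
                await asyncio.sleep(1.5 * (attempt + 1))  # small linear backoff
                continue
            if resp.status != 200:
                body = (await resp.text())[:200]  # safe, truncated snippet
                print(f"  GraphQL API error: {resp.status} (lot={lot_id}) — {body}")
                return None
            return await resp.json()
```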

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.

Scaev - Architecture & Data Flow

System Overview

The scraper follows a 3-phase hierarchical crawling pattern to extract auction and lot data from the Troostwijk Auctions website.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
                         SCAEV SCRAPER                           
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
  PHASE 1: COLLECT AUCTION URLs                                  
  ┌──────────────┐         ┌──────────────┐                      
   Listing Page │────────▶│ Extract /a/                        
   /auctions?             auction URLs                       
   page=1..N             └──────────────┘                      
  └──────────────┘                                              
                                                                
                        [ List of Auction URLs ]                 
└─────────────────────────────────────────────────────────────────┘
                                   
                                   
┌─────────────────────────────────────────────────────────────────┐
  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        
  ┌──────────────┐         ┌──────────────┐                      
   Auction Page │────────▶│ Parse                              
   /a/...                 __NEXT_DATA__                      
  └──────────────┘          JSON                               
                          └──────────────┘                      
                                                               
                                                               
  ┌──────────────┐         ┌──────────────┐                      
   Save Auction           Extract /l/                        
   Metadata               lot URLs                           
   to DB                 └──────────────┘                      
  └──────────────┘                                              
                                                                
                          [ List of Lot URLs ]                   
└─────────────────────────────────────────────────────────────────┘
                                   
                                   
┌─────────────────────────────────────────────────────────────────┐
  PHASE 3: SCRAPE LOT DETAILS + API ENRICHMENT                   
  ┌──────────────┐         ┌──────────────┐                      
   Lot Page     │────────▶│ Parse                              
   /l/...                 __NEXT_DATA__                      
  └──────────────┘          JSON                               
                           └──────────────┘                      
                                                                
         ┌─────────────────────────┼─────────────────┐           
                                                              
  ┌──────────────┐       ┌──────────────┐    ┌──────────────┐   
   GraphQL API          Bid History       Save Images     
   (Bidding +           REST API          URLs to DB      
    Enrichment)         (per lot)        └──────────────┘   
  └──────────────┘       └──────────────┘                      
                                                              
         └──────────┬────────────┘         [Optional Download   
                                           Concurrent per Lot]  
            ┌──────────────┐                                     
             Save to DB:                                       
             - Lot data                                        
             - Bid data                                        
             - Enrichment                                      
            └──────────────┘                                     
└─────────────────────────────────────────────────────────────────┘

Database Schema

┌──────────────────────────────────────────────────────────────────┐
  CACHE TABLE (HTML Storage with Compression)                     
├──────────────────────────────────────────────────────────────────┤
  cache                                                           
  ├── url (TEXT, PRIMARY KEY)                                     
  ├── content (BLOB)              -- Compressed HTML (zlib)       │
  ├── timestamp (REAL)                                            
  ├── status_code (INTEGER)                                       
  └── compressed (INTEGER)        -- 1=compressed, 0=plain        │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  AUCTIONS TABLE                                                  
├──────────────────────────────────────────────────────────────────┤
  auctions                                                        
  ├── auction_id (TEXT, PRIMARY KEY)  -- e.g. "A7-39813"          │
  ├── url (TEXT, UNIQUE)                                          
  ├── title (TEXT)                                                
  ├── location (TEXT)                 -- e.g. "Cluj-Napoca, RO"   │
  ├── lots_count (INTEGER)                                        
  ├── first_lot_closing_time (TEXT)                               
  └── scraped_at (TEXT)                                           
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  LOTS TABLE (Core + Enriched Intelligence)                      
├──────────────────────────────────────────────────────────────────┤
  lots                                                            
  ├── lot_id (TEXT, PRIMARY KEY)      -- e.g. "A1-28505-5"        │
  ├── auction_id (TEXT)               -- FK to auctions           │
  ├── url (TEXT, UNIQUE)                                          
  ├── title (TEXT)                                                
                                                                  
  ├─ BIDDING DATA (GraphQL API) ──────────────────────────────────┤
  ├── current_bid (TEXT)              -- Current bid amount       │
  ├── starting_bid (TEXT)             -- Initial/opening bid      │
  ├── minimum_bid (TEXT)              -- Next minimum bid         │
  ├── bid_count (INTEGER)             -- Number of bids           │
  ├── bid_increment (REAL)            -- Bid step size            │
  ├── closing_time (TEXT)             -- Lot end date             │
  ├── status (TEXT)                   -- Minimum bid status       │
                                                                  
  ├─ BID INTELLIGENCE (Calculated from bid_history) ──────────────┤
  ├── first_bid_time (TEXT)           -- First bid timestamp      │
  ├── last_bid_time (TEXT)            -- Latest bid timestamp     │
  ├── bid_velocity (REAL)             -- Bids per hour            │
                                                                  
  ├─ VALUATION & ATTRIBUTES (from __NEXT_DATA__) ─────────────────┤
  ├── brand (TEXT)                    -- Brand from attributes    │
  ├── model (TEXT)                    -- Model from attributes    │
  ├── manufacturer (TEXT)             -- Manufacturer name        │
  ├── year_manufactured (INTEGER)     -- Year extracted           │
  ├── condition_score (REAL)          -- 0-10 condition rating    │
  ├── condition_description (TEXT)    -- Condition text           │
  ├── serial_number (TEXT)            -- Serial/VIN number        │
  ├── damage_description (TEXT)       -- Damage notes             │
  ├── attributes_json (TEXT)          -- Full attributes JSON     │
                                                                  
  ├─ LEGACY/OTHER ─────────────────────────────────────────────────┤
  ├── viewing_time (TEXT)             -- Viewing schedule         │
  ├── pickup_date (TEXT)              -- Pickup schedule          │
  ├── location (TEXT)                 -- e.g. "Dongen, NL"        │
  ├── description (TEXT)              -- Lot description          │
  ├── category (TEXT)                 -- Lot category             │
  ├── sale_id (INTEGER)               -- Legacy field             │
  ├── type (TEXT)                     -- Legacy field             │
  ├── year (INTEGER)                  -- Legacy field             │
  ├── currency (TEXT)                 -- Currency code            │
  ├── closing_notified (INTEGER)      -- Notification flag        │
  └── scraped_at (TEXT)               -- Scrape timestamp         │
      FOREIGN KEY (auction_id) → auctions(auction_id)             
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  IMAGES TABLE (Image URLs & Download Status)                     
├──────────────────────────────────────────────────────────────────┤
  images                          ◀── THIS TABLE HOLDS IMAGE LINKS
  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     
  ├── lot_id (TEXT)               -- FK to lots                   │
  ├── url (TEXT)                  -- Image URL                    │
  ├── local_path (TEXT)           -- Path after download          │
  └── downloaded (INTEGER)        -- 0=pending, 1=downloaded      │
      FOREIGN KEY (lot_id) → lots(lot_id)                         
      UNIQUE INDEX idx_unique_lot_url ON (lot_id, url)            
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  BID_HISTORY TABLE (Complete Bid Tracking for Intelligence)      
├──────────────────────────────────────────────────────────────────┤
  bid_history                     ◀── REST API: /bidding-history  
  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     
  ├── lot_id (TEXT)               -- FK to lots                   │
  ├── bid_amount (REAL)           -- Bid in EUR                   │
  ├── bid_time (TEXT)             -- ISO 8601 timestamp           │
  ├── is_autobid (INTEGER)        -- 0=manual, 1=autobid          │
  ├── bidder_id (TEXT)            -- Anonymized bidder UUID       │
  ├── bidder_number (INTEGER)     -- Bidder display number        │
  └── created_at (TEXT)           -- Record creation timestamp    │
      FOREIGN KEY (lot_id) → lots(lot_id)                         
      INDEX idx_bid_history_lot ON (lot_id)                       
      INDEX idx_bid_history_time ON (bid_time)                    
└──────────────────────────────────────────────────────────────────┘

Sequence Diagram

User          Scraper         Playwright      Cache DB        Data Tables
 │               │                │               │                │
 │  Run          │                │               │                │
 ├──────────────▶│                │               │                │
 │               │                │               │                │
 │               │ Phase 1: Listing Pages         │                │
 │               ├───────────────▶│               │                │
 │               │   goto()       │               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               ├───────────────────────────────▶│                │
 │               │   compress & cache             │                │
 │               │                │               │                │
 │               │ Phase 2: Auction Pages         │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT auctions
 │               │                │               │                │
 │               │ Phase 3: Lot Pages             │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT lots  │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT images│
 │               │                │               │                │
 │               │ Export to CSV/JSON             │                │
 │               │◀────────────────────────────────────────────────┤
 │               │   Query all data               │                │
 │◀──────────────┤                │               │                │
 │   Results     │                │               │                │

Data Flow Details

1. Page Retrieval & Caching

Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │    │
    │    ├─[HIT]──▶ Decompress (if compressed=1)
    │    │          └──▶ Return HTML
    │    │
    │    └─[MISS]─▶ Fetch via Playwright
    │               │
    │               ├──▶ Compress HTML (zlib level 9)
    │               │    ~70-90% size reduction
    │               │
    │               └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
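
A sketch of this flow under the cache schema above; `fetch_with_playwright` is a placeholder for the real Playwright fetcher, and cache-hit handling is simplified:

```python
import sqlite3
import time
import zlib

CACHE_TTL = 24 * 3600  # 24-hour default cache validity


def get_page(conn: sqlite3.Connection, url: str, fetch_with_playwright) -> str:
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < CACHE_TTL:
        content, _, compressed = row
        if compressed:
            return zlib.decompress(content).decode()        # cache hit, compressed=1
        return content if isinstance(content, str) else content.decode()

    html = fetch_with_playwright(url)                        # cache miss: real fetch
    blob = zlib.compress(html.encode(), level=9)             # ~70-90% size reduction
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), 200),
    )
    conn.commit()
    return html
```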

2. JSON Parsing Strategy

HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
         │
         ├──▶ Parse JSON
         │    │
         │    ├─[has pageProps.lot]──▶ Individual LOT
         │    │   └──▶ Extract: title, bid, location, images, etc.
         │    │
         │    └─[has pageProps.auction]──▶ AUCTION
         │        │
         │        ├─[has lots[] array]──▶ Auction with lots
         │        │   └──▶ Extract: title, location, lots_count
         │        │
         │        └─[no lots[] array]──▶ Old format lot
         │            └──▶ Parse as lot
         │
         └──▶ Fallback to HTML regex parsing (if JSON fails)
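
A sketch of the branching logic, assuming the standard Next.js `props.pageProps` layout; the real parser may use different key paths or a proper HTML parser instead of a regex:

```python
import json
import re

NEXT_DATA_RE = re.compile(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL)


def classify_page(html: str):
    match = NEXT_DATA_RE.search(html)
    if not match:
        return "fallback", None               # fall back to HTML regex parsing
    props = json.loads(match.group(1)).get("props", {}).get("pageProps", {})
    if "lot" in props:
        return "lot", props["lot"]            # individual lot page
    auction = props.get("auction")
    if auction is not None:
        # Auction with lots[] vs. old-format lot without a lots[] array
        kind = "auction" if auction.get("lots") else "old_format_lot"
        return kind, auction
    return "fallback", None
```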

3. API Enrichment Flow

Lot Page Scraped (__NEXT_DATA__ parsed)
    │
    ├──▶ Extract lot UUID from JSON
    │
    ├──▶ GraphQL API Call (fetch_lot_bidding_data)
    │    └──▶ Returns: current_bid, starting_bid, minimum_bid,
    │         bid_count, closing_time, status, bid_increment
    │
    ├──▶ [If bid_count > 0] REST API Call (fetch_bid_history)
    │    │
    │    ├──▶ Fetch all bid pages (paginated)
    │    │
    │    └──▶ Returns: Complete bid history with timestamps,
    │         bidder_ids, autobid flags, amounts
    │         │
    │         ├──▶ INSERT INTO bid_history (multiple records)
    │         │
    │         └──▶ Calculate bid intelligence:
    │              - first_bid_time (earliest timestamp)
    │              - last_bid_time (latest timestamp)
    │              - bid_velocity (bids per hour)
    │
    ├──▶ Extract enrichment from __NEXT_DATA__:
    │    - Brand, model, manufacturer (from attributes)
    │    - Year (regex from title/attributes)
    │    - Condition (map to 0-10 score)
    │    - Serial number, damage description
    │
    └──▶ INSERT/UPDATE lots table with all data
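
A sketch of the bid-intelligence calculation, assuming bid timestamps are ISO 8601 strings as stored in bid_history:

```python
from datetime import datetime


def bid_intelligence(bid_times: list[str]) -> dict:
    # "Z" suffix is normalized so datetime.fromisoformat accepts the timestamps.
    times = sorted(datetime.fromisoformat(t.replace("Z", "+00:00")) for t in bid_times)
    first, last = times[0], times[-1]
    span_hours = (last - first).total_seconds() / 3600
    velocity = len(times) / span_hours if span_hours > 0 else float(len(times))
    return {
        "first_bid_time": first.isoformat(),
        "last_bid_time": last.isoformat(),
        "bid_velocity": velocity,  # bids per hour
    }
```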

4. Image Handling (Concurrent per Lot)

Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │    │
    │    └──▶ INSERT OR IGNORE INTO images (lot_id, url, downloaded=0)
    │         └──▶ Unique constraint prevents duplicates
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
         │
         ├──▶ Create concurrent download tasks (asyncio.gather)
         │    │
         │    ├──▶ All images for lot download in parallel
         │    │    (No rate limiting between images in same lot)
         │    │
         │    ├──▶ Save to: /images/{lot_id}/001.jpg
         │    │
         │    └──▶ UPDATE images SET local_path=?, downloaded=1
         │
         └──▶ Rate limit only between lots (0.5s)
              (Not between images within a lot)
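
A sketch of the per-lot concurrent download step with `asyncio.gather`, assuming `aiohttp`; file naming follows the `/images/{lot_id}/001.jpg` layout shown above:

```python
import asyncio
from pathlib import Path

import aiohttp


async def download_lot_images(lot_id: str, urls: list[str], images_dir: str) -> list[Path]:
    lot_dir = Path(images_dir) / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    async def fetch_one(session: aiohttp.ClientSession, index: int, url: str) -> Path:
        target = lot_dir / f"{index + 1:03d}.jpg"
        async with session.get(url) as resp:
            resp.raise_for_status()
            target.write_bytes(await resp.read())
        return target

    async with aiohttp.ClientSession() as session:
        # No rate limiting between images of the same lot; all run in parallel.
        return await asyncio.gather(
            *(fetch_one(session, i, u) for i, u in enumerate(urls))
        )
```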

Key Configuration

Setting              Value                            Purpose
CACHE_DB             /mnt/okcomputer/output/cache.db  SQLite database path
IMAGES_DIR           /mnt/okcomputer/output/images    Downloaded images storage
RATE_LIMIT_SECONDS   0.5                              Delay between requests
DOWNLOAD_IMAGES      False                            Toggle image downloading
MAX_PAGES            50                               Number of listing pages to crawl

Output Files

/mnt/okcomputer/output/
├── auctions_{timestamp}.json             # Exported auctions
├── auctions_{timestamp}.csv              # Exported auctions
├── lots_{timestamp}.json                 # Exported lots
├── lots_{timestamp}.csv                  # Exported lots
└── images/                               # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg

Terminal Progress per Lot (TTY)

During lot analysis, Scaev now shows a per-lot TTY progress animation with a final summary of all inputs used:

  • Spinner runs while enrichment is in progress.
  • Summary lists every page/API used to analyze the lot with:
    • URL/label
    • Size in bytes
    • Source state: cache | realtime | offline | db | intercepted
    • Duration in ms

Example output snippet:

[LOT A1-28505-5] ✓ Done in 812 ms — pages/APIs used:
  • [html] https://www.troostwijkauctions.com/l/... | 142331 B | cache | 4 ms
  • [graphql] GraphQL lotDetails | 5321 B | realtime | 142 ms
  • [rest] REST bid history | 18234 B | realtime | 236 ms

Notes:

  • In non-TTY environments the spinner is replaced by simple log lines.
  • Intercepted GraphQL responses (captured during page load) are labeled as intercepted with near-zero duration.

Data Flow “Tunnel” (Simplified)

For each lot, the data “tunnels through” the following stages:

  1. HTML page → parse __NEXT_DATA__ for core lot fields and lot UUID.
  2. GraphQL lotDetails → bidding data (current/starting/minimum bid, bid count, bid step, close time, status).
  3. Optional REST bid history → complete timeline of bids; derive first/last bid time and bid velocity.
  4. Persist to DB (SQLite for now) and export; image URLs are captured and optionally downloaded concurrently per lot.

Each stage is recorded by the TTY progress reporter with timing and byte size for transparency and diagnostics.
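
A hypothetical shape for the per-stage records behind that summary; the field names mirror the summary lines above, not the real reporter class:

```python
import time
from dataclasses import dataclass, field


@dataclass
class StageRecord:
    kind: str          # "html" | "graphql" | "rest"
    label: str         # URL or API label
    size_bytes: int
    source: str        # cache | realtime | offline | db | intercepted
    duration_ms: int


@dataclass
class LotProgress:
    lot_id: str
    stages: list[StageRecord] = field(default_factory=list)

    def record(self, kind: str, label: str, payload: bytes, source: str, started: float) -> None:
        # `started` is a time.monotonic() value captured before the stage ran.
        self.stages.append(StageRecord(
            kind, label, len(payload), source, int((time.monotonic() - started) * 1000)
        ))

    def summary(self) -> str:
        total = sum(s.duration_ms for s in self.stages)
        lines = [f"[LOT {self.lot_id}] ✓ Done in {total} ms — pages/APIs used:"]
        lines += [f"  • [{s.kind}] {s.label} | {s.size_bytes} B | {s.source} | {s.duration_ms} ms"
                  for s in self.stages]
        return "\n".join(lines)
```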

Migrations and ORM Roadmap

  • Migrations follow a Flyway-style convention in db/migration (e.g., V1__initial_schema.sql).
  • Current baseline is V1; there are no new migrations required at this time.
  • Raw SQL usage remains in place (SQLite) while we prepare a gradual move to SQLAlchemy 2.x targeting PostgreSQL.
  • See docs/MIGRATIONS.md for details on naming, workflow, and the future switch to PostgreSQL.

Extension Points for Integration

1. Downstream Processing Pipeline

-- Query lots without downloaded images
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;

2. Real-time Monitoring

-- Check for new lots every N minutes
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;

3. Analytics & Reporting

-- Top locations
SELECT location, COUNT(*) as lots_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) as actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) as lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;

4. Image Processing Integration

-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
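
For orientation, a sketch of how a downstream pipeline might consume the images table with Python's sqlite3; `process_image` is a placeholder for the OCR/classification step:

```python
import sqlite3


def process_pending_images(db_path: str, process_image) -> int:
    """Run a caller-supplied processor over images not yet downloaded."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, lot_id, url FROM images WHERE downloaded = 0"
    ).fetchall()
    for image_id, lot_id, url in rows:
        local_path = process_image(lot_id, url)  # e.g. download + OCR/classification
        conn.execute(
            "UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?",
            (local_path, image_id),
        )
    conn.commit()
    conn.close()
    return len(rows)
```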

Performance Characteristics

  • Compression: ~70-90% HTML size reduction (1GB → ~100-300MB)
  • Rate Limiting: Exactly 0.5s between requests (respectful scraping)
  • Caching: 24-hour default cache validity (configurable)
  • Throughput: ~7,200 pages/hour (with 0.5s rate limit)
  • Scalability: SQLite handles millions of rows efficiently

Error Handling

  • Network failures: Cached as status_code=500, retry after cache expiry
  • Parse failures: Falls back to HTML regex patterns
  • Compression errors: Auto-detects and handles uncompressed legacy data
  • Missing fields: Defaults to "No bids", empty string, or 0

Rate Limiting & Ethics

  • REQUIRED: 0.5 second delay between page requests (not between images)
  • Respects cache: Avoids unnecessary re-fetching
  • User-Agent: Identifies as standard browser
  • No parallelization: Single-threaded sequential crawling for pages
  • Image downloads: Concurrent within each lot (16x speedup)

API Integration Architecture

GraphQL API

Endpoint: https://storefront.tbauctions.com/storefront/graphql

Purpose: Real-time bidding data and lot enrichment

Key Query:

query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    lot {
      currentBidAmount { cents currency }
      initialAmount { cents currency }
      nextMinimalBid { cents currency }
      nextBidStepInCents
      bidsCount
      followersCount                    # Available - Watch count
      startDate
      endDate
      minimumBidAmountMet
      biddingStatus
      condition
      location { city countryCode }
      categoryInformation { name path }
      attributes { name value }
    }
    estimatedFullPrice {                # Available - Estimated value
      min { cents currency }
      max { cents currency }
    }
  }
}
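
A sketch of how this query might be issued with `aiohttp`; the `locale`/`platform` values and the trimmed query are illustrative assumptions, not the client's actual request:

```python
import aiohttp

GRAPHQL_URL = "https://storefront.tbauctions.com/storefront/graphql"

# Trimmed version of the LotBiddingData query shown above.
LOT_QUERY = """
query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    lot { currentBidAmount { cents currency } bidsCount endDate biddingStatus }
  }
}
"""


async def fetch_lot_details(lot_display_id: str):
    payload = {
        "query": LOT_QUERY,
        "variables": {
            "lotDisplayId": lot_display_id,
            "locale": "en",     # assumption: locale value used by the site
            "platform": "TWK",  # assumption: placeholder Platform enum value
        },
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(GRAPHQL_URL, json=payload) as resp:
            if resp.status != 200:
                return None
            data = await resp.json()
            return data.get("data", {}).get("lotDetails")
```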

Currently Captured:

  • Current bid, starting bid, minimum bid
  • Bid count and bid increment
  • Closing time and status
  • Brand, model, manufacturer (from attributes)

REST API - Bid History

Endpoint: https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history

Purpose: Complete bid history for intelligence analysis

Parameters:

  • pageNumber (starts at 1)
  • pageSize (default: 100)

Response Example:

{
  "results": [
    {
      "buyerId": "uuid",           // Anonymized bidder ID
      "buyerNumber": 4,            // Display number
      "currentBid": {
        "cents": 370000,
        "currency": "EUR"
      },
      "autoBid": false,            // Is autobid
      "negotiated": false,         // Was negotiated
      "createdAt": "2025-12-05T04:53:56.763033Z"
    }
  ],
  "hasNext": true,
  "pageNumber": 1
}
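
A sketch of paginated fetching against this endpoint, following the pageNumber/pageSize/hasNext fields above; error handling is simplified:

```python
import aiohttp

BID_HISTORY_URL = "https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history"


async def fetch_bid_history(lot_uuid: str, page_size: int = 100) -> list:
    """Collect all bid pages by following hasNext/pageNumber."""
    bids, page = [], 1
    async with aiohttp.ClientSession() as session:
        while True:
            url = BID_HISTORY_URL.format(lot_uuid=lot_uuid)
            params = {"pageNumber": str(page), "pageSize": str(page_size)}
            async with session.get(url, params=params) as resp:
                if resp.status != 200:
                    break
                data = await resp.json()
            bids.extend(data.get("results", []))
            if not data.get("hasNext"):
                break
            page += 1
    return bids
```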

Captured Data:

  • Bid amount, timestamp, bidder ID
  • Autobid flag
  • ⚠️ negotiated - Not yet captured

Calculated Intelligence:

  • First bid time
  • Last bid time
  • Bid velocity (bids per hour)

API Integration Points

Flow:

  1. Lot page scraped → Extract lot UUID from __NEXT_DATA__
  2. Call GraphQL API → Get bidding data
  3. If bid_count > 0 → Call REST API → Get complete bid history
  4. Calculate bid intelligence metrics
  5. Save to database

Rate Limiting:

  • API calls happen during lot scraping phase
  • Overall 0.5s rate limit applies to page requests
  • API calls are part of lot processing (not separately limited)