scaev/_wiki/ARCHITECTURE.md
2025-12-07 01:59:45 +01:00

Scaev - Architecture & Data Flow

System Overview

The scraper follows a 3-phase hierarchical crawling pattern to extract auction and lot data from the Troostwijk Auctions website.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
                     TROOSTWIJK SCRAPER                          
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
  PHASE 1: COLLECT AUCTION URLs                                  
  ┌──────────────┐         ┌──────────────┐                      
   Listing Page │────────▶│ Extract /a/                        
   /auctions?             auction URLs                       
   page=1..N             └──────────────┘                      
  └──────────────┘                                              
                                                                
                        [ List of Auction URLs ]                 
└─────────────────────────────────────────────────────────────────┘
                                   
                                   
┌─────────────────────────────────────────────────────────────────┐
  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        
  ┌──────────────┐         ┌──────────────┐                      
   Auction Page │────────▶│ Parse                              
   /a/...                 __NEXT_DATA__                      
  └──────────────┘          JSON                               
                          └──────────────┘                      
                                                               
                                                               
  ┌──────────────┐         ┌──────────────┐                      
   Save Auction           Extract /l/                        
   Metadata               lot URLs                           
   to DB                 └──────────────┘                      
  └──────────────┘                                              
                                                                
                          [ List of Lot URLs ]                   
└─────────────────────────────────────────────────────────────────┘
                                   
                                   
┌─────────────────────────────────────────────────────────────────┐
  PHASE 3: SCRAPE LOT DETAILS + API ENRICHMENT                   
  ┌──────────────┐         ┌──────────────┐                      
   Lot Page     │────────▶│ Parse                              
   /l/...                 __NEXT_DATA__                      
  └──────────────┘          JSON                               
                           └──────────────┘                      
                                                                
         ┌─────────────────────────┼─────────────────┐           
                                                              
  ┌──────────────┐       ┌──────────────┐    ┌──────────────┐   
   GraphQL API          Bid History       Save Images     
   (Bidding +           REST API          URLs to DB      
    Enrichment)         (per lot)        └──────────────┘   
  └──────────────┘       └──────────────┘                      
                                                              
         └──────────┬────────────┘         [Optional Download   
                                           Concurrent per Lot]  
            ┌──────────────┐                                     
             Save to DB:                                       
             - Lot data                                        
             - Bid data                                        
             - Enrichment                                      
            └──────────────┘                                     
└─────────────────────────────────────────────────────────────────┘

Database Schema

┌──────────────────────────────────────────────────────────────────┐
  CACHE TABLE (HTML Storage with Compression)                     
├──────────────────────────────────────────────────────────────────┤
  cache                                                           
  ├── url (TEXT, PRIMARY KEY)                                     
  ├── content (BLOB)              -- Compressed HTML (zlib)       │
  ├── timestamp (REAL)                                            
  ├── status_code (INTEGER)                                       
  └── compressed (INTEGER)        -- 1=compressed, 0=plain        │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  AUCTIONS TABLE                                                  
├──────────────────────────────────────────────────────────────────┤
  auctions                                                        
  ├── auction_id (TEXT, PRIMARY KEY)  -- e.g. "A7-39813"          │
  ├── url (TEXT, UNIQUE)                                          
  ├── title (TEXT)                                                
  ├── location (TEXT)                 -- e.g. "Cluj-Napoca, RO"   │
  ├── lots_count (INTEGER)                                        
  ├── first_lot_closing_time (TEXT)                               
  └── scraped_at (TEXT)                                           
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  LOTS TABLE (Core + Enriched Intelligence)                      
├──────────────────────────────────────────────────────────────────┤
  lots                                                            
  ├── lot_id (TEXT, PRIMARY KEY)      -- e.g. "A1-28505-5"        │
  ├── auction_id (TEXT)               -- FK to auctions           │
  ├── url (TEXT, UNIQUE)                                          
  ├── title (TEXT)                                                
                                                                  
  ├─ BIDDING DATA (GraphQL API) ──────────────────────────────────┤
  ├── current_bid (TEXT)              -- Current bid amount       │
  ├── starting_bid (TEXT)             -- Initial/opening bid      │
  ├── minimum_bid (TEXT)              -- Next minimum bid         │
  ├── bid_count (INTEGER)             -- Number of bids           │
  ├── bid_increment (REAL)            -- Bid step size            │
  ├── closing_time (TEXT)             -- Lot end date             │
  ├── status (TEXT)                   -- Minimum bid status       │
                                                                  
  ├─ BID INTELLIGENCE (Calculated from bid_history) ──────────────┤
  ├── first_bid_time (TEXT)           -- First bid timestamp      │
  ├── last_bid_time (TEXT)            -- Latest bid timestamp     │
  ├── bid_velocity (REAL)             -- Bids per hour            │
                                                                  
  ├─ VALUATION & ATTRIBUTES (from __NEXT_DATA__) ─────────────────┤
  ├── brand (TEXT)                    -- Brand from attributes    │
  ├── model (TEXT)                    -- Model from attributes    │
  ├── manufacturer (TEXT)             -- Manufacturer name        │
  ├── year_manufactured (INTEGER)     -- Year extracted           │
  ├── condition_score (REAL)          -- 0-10 condition rating    │
  ├── condition_description (TEXT)    -- Condition text           │
  ├── serial_number (TEXT)            -- Serial/VIN number        │
  ├── damage_description (TEXT)       -- Damage notes             │
  ├── attributes_json (TEXT)          -- Full attributes JSON     │
                                                                  
  ├─ LEGACY/OTHER ─────────────────────────────────────────────────┤
  ├── viewing_time (TEXT)             -- Viewing schedule         │
  ├── pickup_date (TEXT)              -- Pickup schedule          │
  ├── location (TEXT)                 -- e.g. "Dongen, NL"        │
  ├── description (TEXT)              -- Lot description          │
  ├── category (TEXT)                 -- Lot category             │
  ├── sale_id (INTEGER)               -- Legacy field             │
  ├── type (TEXT)                     -- Legacy field             │
  ├── year (INTEGER)                  -- Legacy field             │
  ├── currency (TEXT)                 -- Currency code            │
  ├── closing_notified (INTEGER)      -- Notification flag        │
  └── scraped_at (TEXT)               -- Scrape timestamp         │
      FOREIGN KEY (auction_id) → auctions(auction_id)
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  IMAGES TABLE (Image URLs & Download Status)                     
├──────────────────────────────────────────────────────────────────┤
  images                          ◀── THIS TABLE HOLDS IMAGE LINKS
  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     
  ├── lot_id (TEXT)               -- FK to lots                   │
  ├── url (TEXT)                  -- Image URL                    │
  ├── local_path (TEXT)           -- Path after download          │
  └── downloaded (INTEGER)        -- 0=pending, 1=downloaded      │
      FOREIGN KEY (lot_id) → lots(lot_id)
      UNIQUE INDEX idx_unique_lot_url ON (lot_id, url)            
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
  BID_HISTORY TABLE (Complete Bid Tracking for Intelligence)      
├──────────────────────────────────────────────────────────────────┤
  bid_history                     ◀── REST API: /bidding-history  
  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     
  ├── lot_id (TEXT)               -- FK to lots                   │
  ├── bid_amount (REAL)           -- Bid in EUR                   │
  ├── bid_time (TEXT)             -- ISO 8601 timestamp           │
  ├── is_autobid (INTEGER)        -- 0=manual, 1=autobid          │
  ├── bidder_id (TEXT)            -- Anonymized bidder UUID       │
  ├── bidder_number (INTEGER)     -- Bidder display number        │
  └── created_at (TEXT)           -- Record creation timestamp    │
      FOREIGN KEY (lot_id) → lots(lot_id)
      INDEX idx_bid_history_lot ON (lot_id)                       
      INDEX idx_bid_history_time ON (bid_time)                    
└──────────────────────────────────────────────────────────────────┘
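
The images table and its unique index can be illustrated with a minimal sqlite3 sketch. The column names and the index name come from the schema above; the helper names `create_images_table` and `add_image_urls` are hypothetical, and the foreign key to lots is left out so the snippet runs on its own.

```python
import sqlite3
from typing import Iterable


def create_images_table(conn: sqlite3.Connection) -> None:
    """Create the images table and the unique (lot_id, url) index."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS images (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            lot_id TEXT,
            url TEXT,
            local_path TEXT,
            downloaded INTEGER DEFAULT 0
        );
        CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_lot_url
            ON images (lot_id, url);
        """
    )


def add_image_urls(conn: sqlite3.Connection, lot_id: str, urls: Iterable[str]) -> int:
    """Insert image URLs for a lot and return how many rows were actually added."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
        [(lot_id, url) for url in urls],
    )
    conn.commit()
    # Rows skipped by OR IGNORE are not counted in total_changes.
    return conn.total_changes - before
```

Re-running `add_image_urls` for the same lot is harmless in this sketch: the unique index turns repeated (lot_id, url) pairs into no-ops.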

Sequence Diagram

User          Scraper         Playwright      Cache DB        Data Tables
 │               │                │               │                │
 │  Run          │                │               │                │
 ├──────────────▶│                │               │                │
 │               │                │               │                │
 │               │ Phase 1: Listing Pages         │                │
 │               ├───────────────▶│               │                │
 │               │   goto()       │               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               ├───────────────────────────────▶│                │
 │               │   compress & cache             │                │
 │               │                │               │                │
 │               │ Phase 2: Auction Pages         │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT auctions
 │               │                │               │                │
 │               │ Phase 3: Lot Pages             │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT lots  │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT images│
 │               │                │               │                │
 │               │ Export to CSV/JSON             │                │
 │               │◀────────────────────────────────────────────────┤
 │               │   Query all data               │                │
 │◀──────────────┤                │               │                │
 │   Results     │                │               │                │

Data Flow Details

1. Page Retrieval & Caching

Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │    │
    │    ├─[HIT]──▶ Decompress (if compressed=1)
    │    │          └──▶ Return HTML
    │    │
    │    └─[MISS]─▶ Fetch via Playwright
    │               │
    │               ├──▶ Compress HTML (zlib level 9)
    │               │    ~70-90% size reduction
    │               │
    │               └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
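
A minimal sketch of the cache read and write path described above, using only sqlite3 and zlib. The table layout follows the cache schema; the function names `init_cache`, `store_page` and `load_page` and the `CACHE_TTL_SECONDS` constant are assumptions for illustration, not the project's actual API.

```python
import sqlite3
import time
import zlib
from typing import Optional

CACHE_TTL_SECONDS = 24 * 3600  # assumed default: 24-hour cache validity


def init_cache(conn: sqlite3.Connection) -> None:
    """Create the cache table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "url TEXT PRIMARY KEY, content BLOB, timestamp REAL, "
        "status_code INTEGER, compressed INTEGER)"
    )


def store_page(conn: sqlite3.Connection, url: str, html: str, status_code: int = 200) -> None:
    """Compress HTML at zlib level 9 and store it with compressed=1."""
    blob = zlib.compress(html.encode("utf-8"), 9)
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), status_code),
    )
    conn.commit()


def load_page(conn: sqlite3.Connection, url: str,
              max_age: float = CACHE_TTL_SECONDS) -> Optional[str]:
    """Return cached HTML, or None on a miss or an expired entry."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None  # MISS: caller fetches via Playwright
    content, timestamp, compressed = row
    if time.time() - timestamp > max_age:
        return None  # expired: treat as a miss
    if compressed:
        return zlib.decompress(content).decode("utf-8")
    # Legacy rows stored without compression (compressed=0).
    return content.decode("utf-8") if isinstance(content, bytes) else content
```

On a `None` result the caller would fetch the page and call `store_page`, which mirrors the MISS branch of the flow.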

2. JSON Parsing Strategy

HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
         │
         ├──▶ Parse JSON
         │    │
         │    ├─[has pageProps.lot]──▶ Individual LOT
         │    │   └──▶ Extract: title, bid, location, images, etc.
         │    │
         │    └─[has pageProps.auction]──▶ AUCTION
         │        │
         │        ├─[has lots[] array]──▶ Auction with lots
         │        │   └──▶ Extract: title, location, lots_count
         │        │
         │        └─[no lots[] array]──▶ Old format lot
         │            └──▶ Parse as lot
         │
         └──▶ Fallback to HTML regex parsing (if JSON fails)
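
The branching above might look roughly like the following sketch. It assumes `pageProps` sits under the top-level `props` key, as Next.js normally serializes it; the function name `classify_page` and the returned labels are hypothetical.

```python
import json
import re
from typing import Optional, Tuple

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def classify_page(html: str) -> Tuple[str, Optional[dict]]:
    """Return ("lot" | "auction" | "fallback", payload) for a page's HTML."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return "fallback", None  # no JSON blob: use HTML regex parsing
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return "fallback", None
    if not isinstance(data, dict):
        return "fallback", None
    page_props = (data.get("props") or {}).get("pageProps") or {}
    if "lot" in page_props:
        return "lot", page_props["lot"]
    auction = page_props.get("auction")
    if isinstance(auction, dict):
        if isinstance(auction.get("lots"), list):
            return "auction", auction  # auction with a lots[] array
        return "lot", auction  # old format: auction payload is really a lot
    return "fallback", None
```

Returning a label alongside the payload keeps the decision in one place, so the lot and auction extractors only need to handle their own shape.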

3. API Enrichment Flow

Lot Page Scraped (__NEXT_DATA__ parsed)
    │
    ├──▶ Extract lot UUID from JSON
    │
    ├──▶ GraphQL API Call (fetch_lot_bidding_data)
    │    └──▶ Returns: current_bid, starting_bid, minimum_bid,
    │         bid_count, closing_time, status, bid_increment
    │
    ├──▶ [If bid_count > 0] REST API Call (fetch_bid_history)
    │    │
    │    ├──▶ Fetch all bid pages (paginated)
    │    │
    │    └──▶ Returns: Complete bid history with timestamps,
    │         bidder_ids, autobid flags, amounts
    │         │
    │         ├──▶ INSERT INTO bid_history (multiple records)
    │         │
    │         └──▶ Calculate bid intelligence:
    │              - first_bid_time (earliest timestamp)
    │              - last_bid_time (latest timestamp)
    │              - bid_velocity (bids per hour)
    │
    ├──▶ Extract enrichment from __NEXT_DATA__:
    │    - Brand, model, manufacturer (from attributes)
    │    - Year (regex from title/attributes)
    │    - Condition (map to 0-10 score)
    │    - Serial number, damage description
    │
    └──▶ INSERT/UPDATE lots table with all data
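
The bid intelligence step can be sketched as below. The exact definition of bid velocity is not spelled out here, so this example assumes it is the number of bids divided by the hours elapsed between the first and last bid, with 0.0 when that span is zero; the function names are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional, Union


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is mapped to +00:00."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def bid_intelligence(bid_times: List[str]) -> Dict[str, Union[Optional[str], float]]:
    """Derive first_bid_time, last_bid_time and bid_velocity from bid timestamps."""
    if not bid_times:
        return {"first_bid_time": None, "last_bid_time": None, "bid_velocity": 0.0}
    parsed = sorted(parse_timestamp(t) for t in bid_times)
    first, last = parsed[0], parsed[-1]
    hours = (last - first).total_seconds() / 3600
    # Assumed definition: bids per hour over the first-to-last bid window.
    velocity = len(parsed) / hours if hours > 0 else 0.0
    return {
        "first_bid_time": first.isoformat(),
        "last_bid_time": last.isoformat(),
        "bid_velocity": velocity,
    }
```

Three bids spread over two hours give a velocity of 1.5 under this definition; a lot with a single bid gets 0.0 rather than a division by zero.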

4. Image Handling (Concurrent per Lot)

Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │    │
    │    └──▶ INSERT OR IGNORE INTO images (lot_id, url, downloaded=0)
    │         └──▶ Unique constraint prevents duplicates
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
         │
         ├──▶ Create concurrent download tasks (asyncio.gather)
         │    │
         │    ├──▶ All images for lot download in parallel
         │    │    (No rate limiting between images in same lot)
         │    │
         │    ├──▶ Save to: /images/{lot_id}/001.jpg
         │    │
         │    └──▶ UPDATE images SET local_path=?, downloaded=1
         │
         └──▶ Rate limit only between lots (0.5s)
              (Not between images within a lot)
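
A minimal sketch of the per-lot concurrent download using asyncio.gather. The HTTP client is deliberately abstracted away: `fetch` is a hypothetical coroutine supplied by the caller that returns the image bytes for a URL, and `download_lot_images` is an illustrative name. The database update of local_path and downloaded is left out.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, List

# Hypothetical signature: any coroutine that returns the bytes for a URL.
Fetcher = Callable[[str], Awaitable[bytes]]


async def download_lot_images(
    lot_id: str, urls: List[str], images_dir: Path, fetch: Fetcher
) -> List[Path]:
    """Download all images of one lot in parallel and return the saved paths."""
    lot_dir = images_dir / lot_id
    lot_dir.mkdir(parents=True, exist_ok=True)

    async def save_one(index: int, url: str) -> Path:
        data = await fetch(url)
        path = lot_dir / f"{index:03d}.jpg"  # 001.jpg, 002.jpg, ...
        path.write_bytes(data)
        return path

    # gather() runs the downloads concurrently and preserves input order.
    tasks = (save_one(i, url) for i, url in enumerate(urls, start=1))
    return list(await asyncio.gather(*tasks))
```

Because gather returns results in input order, the returned paths line up with the URL list, which makes the follow-up UPDATE on the images table straightforward.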

Key Configuration

Setting              Value                             Purpose
CACHE_DB             /mnt/okcomputer/output/cache.db   SQLite database path
IMAGES_DIR           /mnt/okcomputer/output/images     Downloaded images storage
RATE_LIMIT_SECONDS   0.5                               Delay between requests
DOWNLOAD_IMAGES      False                             Toggle image downloading
MAX_PAGES            50                                Number of listing pages to crawl

Output Files

/mnt/okcomputer/output/
├── cache.db                              # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json             # Exported auctions
├── auctions_{timestamp}.csv              # Exported auctions
├── lots_{timestamp}.json                 # Exported lots
├── lots_{timestamp}.csv                  # Exported lots
└── images/                               # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg

Extension Points for Integration

1. Downstream Processing Pipeline

-- Query images that have not been downloaded yet
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;

2. Real-time Monitoring

-- Count lots scraped in the last hour (poll every N minutes)
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;

3. Analytics & Reporting

-- Top locations (most lots first)
SELECT location, COUNT(*) AS lots_count FROM lots GROUP BY location ORDER BY lots_count DESC;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) as actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) as lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;

4. Image Processing Integration

-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process downloaded images together with their lot context
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;

Performance Characteristics

  • Compression: ~70-90% HTML size reduction (1GB → ~100-300MB)
  • Rate Limiting: At least 0.5s between page requests (respectful scraping)
  • Caching: 24-hour default cache validity (configurable)
  • Throughput: Up to ~7,200 pages/hour (3,600 s ÷ 0.5 s rate limit; an upper bound that ignores fetch and parse time)
  • Scalability: SQLite handles millions of rows efficiently

Error Handling

  • Network failures: Cached as status_code=500, retry after cache expiry
  • Parse failures: Falls back to HTML regex patterns
  • Compression errors: Auto-detects and handles uncompressed legacy data
  • Missing fields: Defaults to "No bids", empty string, or 0

Rate Limiting & Ethics

  • REQUIRED: 0.5 second delay between page requests (not between images)
  • Respects cache: Avoids unnecessary re-fetching
  • User-Agent: Identifies as standard browser
  • No parallelization: Single-threaded sequential crawling for pages
  • Image downloads: Concurrent within each lot (16x speedup)
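
The delay between page requests could be enforced with a small helper like the sketch below. The `RateLimiter` class is hypothetical; it guarantees a minimum interval between calls rather than a fixed sleep, so time already spent fetching and parsing counts toward the 0.5 seconds.

```python
import time
from typing import Optional


class RateLimiter:
    """Enforce a minimum interval between consecutive page requests."""

    def __init__(self, min_interval: float = 0.5) -> None:
        self.min_interval = min_interval
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep only as long as needed; return the number of seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept
```

Calling `wait()` immediately before each page fetch keeps the crawl sequential and spaced out, while image downloads within a lot can bypass the limiter.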

API Integration Architecture

GraphQL API

Endpoint: https://storefront.tbauctions.com/storefront/graphql

Purpose: Real-time bidding data and lot enrichment

Key Query:

query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    lot {
      currentBidAmount { cents currency }
      initialAmount { cents currency }
      nextMinimalBid { cents currency }
      nextBidStepInCents
      bidsCount
      followersCount                    # Available - Watch count
      startDate
      endDate
      minimumBidAmountMet
      biddingStatus
      condition
      location { city countryCode }
      categoryInformation { name path }
      attributes { name value }
    }
    estimatedFullPrice {                # Available - Estimated value
      min { cents currency }
      max { cents currency }
    }
  }
}

Currently Captured:

  • Current bid, starting bid, minimum bid
  • Bid count and bid increment
  • Closing time and status
  • Brand, model, manufacturer (from attributes)

Available but Not Yet Captured:

  • ⚠️ followersCount - Watch count for popularity analysis
  • ⚠️ estimatedFullPrice - Min/max estimated values
  • ⚠️ biddingStatus - More detailed status enum
  • ⚠️ condition - Direct condition field
  • ⚠️ location - City, country details
  • ⚠️ categoryInformation - Structured category
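
As a rough illustration of how the query result maps onto the lots columns, the sketch below parses a response body. It assumes the standard GraphQL envelope with a top-level `data` key and converts cents to major currency units; the function names and the choice to return floats are assumptions, and `followers_count` is included only to show how a not-yet-captured field could be picked up.

```python
from typing import Any, Dict, Optional


def money_to_float(amount: Optional[Dict[str, Any]]) -> Optional[float]:
    """Convert a {cents, currency} object to major units, or None if absent."""
    if not amount or amount.get("cents") is None:
        return None
    return amount["cents"] / 100


def parse_lot_bidding(response: Dict[str, Any]) -> Dict[str, Any]:
    """Map the lotDetails.lot payload onto lots-table style fields."""
    lot = response["data"]["lotDetails"]["lot"]
    step_cents = lot.get("nextBidStepInCents")
    current = lot.get("currentBidAmount") or {}
    return {
        "current_bid": money_to_float(lot.get("currentBidAmount")),
        "starting_bid": money_to_float(lot.get("initialAmount")),
        "minimum_bid": money_to_float(lot.get("nextMinimalBid")),
        "bid_increment": step_cents / 100 if step_cents is not None else None,
        "bid_count": lot.get("bidsCount") or 0,
        "closing_time": lot.get("endDate"),  # passed through unchanged
        "currency": current.get("currency"),
        "followers_count": lot.get("followersCount"),  # not yet captured today
    }
```

Missing money fields come back as `None` in this sketch; how they are rendered in the database (for example as "No bids") is left to the caller.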

REST API - Bid History

Endpoint: https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history

Purpose: Complete bid history for intelligence analysis

Parameters:

  • pageNumber (starts at 1)
  • pageSize (default: 100)

Response Example:

{
  "results": [
    {
      "buyerId": "uuid",           // Anonymized bidder ID
      "buyerNumber": 4,            // Display number
      "currentBid": {
        "cents": 370000,
        "currency": "EUR"
      },
      "autoBid": false,            // Is autobid
      "negotiated": false,         // Was negotiated
      "createdAt": "2025-12-05T04:53:56.763033Z"
    }
  ],
  "hasNext": true,
  "pageNumber": 1
}

Captured Data:

  • Bid amount, timestamp, bidder ID
  • Autobid flag
  • ⚠️ negotiated - Not yet captured

Calculated Intelligence:

  • First bid time
  • Last bid time
  • Bid velocity (bids per hour)
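
The pagination over pageNumber and hasNext can be sketched as follows. The transport is injected: `fetch_page` is a hypothetical callable that performs the HTTP request for one page and returns the decoded JSON, so the loop itself stays independent of any HTTP library. The `max_pages` guard is an added safety assumption.

```python
from typing import Any, Callable, Dict, List

# Hypothetical signature: (page_number, page_size) -> decoded JSON response.
PageFetcher = Callable[[int, int], Dict[str, Any]]


def fetch_all_bids(
    fetch_page: PageFetcher, page_size: int = 100, max_pages: int = 1000
) -> List[Dict[str, Any]]:
    """Walk all bid-history pages and return rows shaped like bid_history."""
    rows: List[Dict[str, Any]] = []
    page_number = 1  # the API's page numbering starts at 1
    while page_number <= max_pages:
        page = fetch_page(page_number, page_size)
        for item in page.get("results", []):
            rows.append({
                "bid_amount": item["currentBid"]["cents"] / 100,
                "bid_time": item["createdAt"],
                "is_autobid": 1 if item.get("autoBid") else 0,
                "bidder_id": item.get("buyerId"),
                "bidder_number": item.get("buyerNumber"),
            })
        if not page.get("hasNext"):
            break
        page_number += 1
    return rows
```

The resulting rows match the bid_history columns, and their `bid_time` values are the input for the first bid, last bid and velocity calculations.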

API Integration Points

Files:

  • src/graphql_client.py - GraphQL queries and parsing
  • src/bid_history_client.py - REST API pagination and parsing
  • src/scraper.py - Integration during lot scraping

Flow:

  1. Lot page scraped → Extract lot UUID from __NEXT_DATA__
  2. Call GraphQL API → Get bidding data
  3. If bid_count > 0 → Call REST API → Get complete bid history
  4. Calculate bid intelligence metrics
  5. Save to database

Rate Limiting:

  • API calls happen during lot scraping phase
  • Overall 0.5s rate limit applies to page requests
  • API calls are part of lot processing (not separately limited)

See API_INTELLIGENCE_FINDINGS.md for detailed field analysis and roadmap.