# Scaev - Architecture & Data Flow

## System Overview

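The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website: collect auction URLs from the listing pages, extract lot URLs from each auction page, then scrape and enrich each lot.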

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                     TROOSTWIJK SCRAPER                          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: COLLECT AUCTION URLs                                  │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Listing Page │────────▶│ Extract /a/  │                      │
│  │ /auctions?   │         │ auction URLs │                      │
│  │ page=1..N    │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                   ▼                             │
│                        [ List of Auction URLs ]                 │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Auction Page │────────▶│ Parse        │                      │
│  │ /a/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│         │                 └──────────────┘                      │
│         │                        │                              │
│         ▼                        ▼                              │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Save Auction │         │ Extract /l/  │                      │
│  │ Metadata     │         │ lot URLs     │                      │
│  │ to DB        │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                   ▼                             │
│                          [ List of Lot URLs ]                   │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: SCRAPE LOT DETAILS + API ENRICHMENT                   │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Lot Page     │────────▶│ Parse        │                      │
│  │ /l/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│                           └──────────────┘                      │
│                                   │                             │
│         ┌─────────────────────────┼─────────────────┐           │
│         ▼                         ▼                 ▼           │
│  ┌──────────────┐       ┌──────────────┐    ┌──────────────┐   │
│  │ GraphQL API  │       │ Bid History  │    │ Save Images  │   │
│  │ (Bidding +   │       │ REST API     │    │ URLs to DB   │   │
│  │  Enrichment) │       │ (per lot)    │    └──────────────┘   │
│  └──────────────┘       └──────────────┘           │           │
│         │                       │                   ▼           │
│         └──────────┬────────────┘         [Optional Download   │
│                    ▼                       Concurrent per Lot]  │
│            ┌──────────────┐                                     │
│            │ Save to DB:  │                                     │
│            │ - Lot data   │                                     │
│            │ - Bid data   │                                     │
│            │ - Enrichment │                                     │
│            └──────────────┘                                     │
└─────────────────────────────────────────────────────────────────┘
```
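
Phase 1 of the diagram above can be sketched as a small link extractor. This is a minimal illustration, not the scraper's actual code: it assumes auction links appear in the listing HTML as `href` attributes whose path starts with `/a/`, and the function name, regex, and `example.invalid` base URL are hypothetical.

```python
import re
from urllib.parse import urljoin

# Assumption: auction links look like href="/a/<slug>" in the listing HTML.
AUCTION_HREF = re.compile(r'href="(/a/[^"#?]+)')


def extract_auction_urls(listing_html: str, base_url: str) -> list[str]:
    """Return absolute auction URLs in first-seen order, without duplicates."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    for path in AUCTION_HREF.findall(listing_html):
        seen.setdefault(urljoin(base_url, path), None)
    return list(seen)


sample = """
<a href="/a/machines-A7-39813">Machines</a>
<a href="/a/machines-A7-39813">Machines (duplicate card)</a>
<a href="/l/lathe-A1-28505-5">A lot, not an auction</a>
"""
urls = extract_auction_urls(sample, "https://example.invalid")
print(urls)
```

The same approach would apply to Phase 2 with `/l/` paths, although the diagram indicates that lot URLs are taken from the `__NEXT_DATA__` JSON rather than from anchors.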

## Database Schema

```
┌──────────────────────────────────────────────────────────────────┐
│  CACHE TABLE (HTML Storage with Compression)                     │
├──────────────────────────────────────────────────────────────────┤
│  cache                                                           │
│  ├── url (TEXT, PRIMARY KEY)                                     │
│  ├── content (BLOB)              -- Compressed HTML (zlib)       │
│  ├── timestamp (REAL)                                            │
│  ├── status_code (INTEGER)                                       │
│  └── compressed (INTEGER)        -- 1=compressed, 0=plain        │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  AUCTIONS TABLE                                                  │
├──────────────────────────────────────────────────────────────────┤
│  auctions                                                        │
│  ├── auction_id (TEXT, PRIMARY KEY)  -- e.g. "A7-39813"          │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── location (TEXT)                 -- e.g. "Cluj-Napoca, RO"   │
│  ├── lots_count (INTEGER)                                        │
│  ├── first_lot_closing_time (TEXT)                               │
│  └── scraped_at (TEXT)                                           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  LOTS TABLE (Core + Enriched Intelligence)                      │
├──────────────────────────────────────────────────────────────────┤
│  lots                                                            │
│  ├── lot_id (TEXT, PRIMARY KEY)      -- e.g. "A1-28505-5"        │
│  ├── auction_id (TEXT)               -- FK to auctions           │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  │                                                                │
│  ├─ BIDDING DATA (GraphQL API) ──────────────────────────────────┤
│  ├── current_bid (TEXT)              -- Current bid amount       │
│  ├── starting_bid (TEXT)             -- Initial/opening bid      │
│  ├── minimum_bid (TEXT)              -- Next minimum bid         │
│  ├── bid_count (INTEGER)             -- Number of bids           │
│  ├── bid_increment (REAL)            -- Bid step size            │
│  ├── closing_time (TEXT)             -- Lot end date             │
│  ├── status (TEXT)                   -- Minimum bid status       │
│  │                                                                │
│  ├─ BID INTELLIGENCE (Calculated from bid_history) ──────────────┤
│  ├── first_bid_time (TEXT)           -- First bid timestamp      │
│  ├── last_bid_time (TEXT)            -- Latest bid timestamp     │
│  ├── bid_velocity (REAL)             -- Bids per hour            │
│  │                                                                │
│  ├─ VALUATION & ATTRIBUTES (from __NEXT_DATA__) ─────────────────┤
│  ├── brand (TEXT)                    -- Brand from attributes    │
│  ├── model (TEXT)                    -- Model from attributes    │
│  ├── manufacturer (TEXT)             -- Manufacturer name        │
│  ├── year_manufactured (INTEGER)     -- Year extracted           │
│  ├── condition_score (REAL)          -- 0-10 condition rating    │
│  ├── condition_description (TEXT)    -- Condition text           │
│  ├── serial_number (TEXT)            -- Serial/VIN number        │
│  ├── damage_description (TEXT)       -- Damage notes             │
│  ├── attributes_json (TEXT)          -- Full attributes JSON     │
│  │                                                                │
│  ├─ LEGACY/OTHER ─────────────────────────────────────────────────┤
│  ├── viewing_time (TEXT)             -- Viewing schedule         │
│  ├── pickup_date (TEXT)              -- Pickup schedule          │
│  ├── location (TEXT)                 -- e.g. "Dongen, NL"        │
│  ├── description (TEXT)              -- Lot description          │
│  ├── category (TEXT)                 -- Lot category             │
│  ├── sale_id (INTEGER)               -- Legacy field             │
│  ├── type (TEXT)                     -- Legacy field             │
│  ├── year (INTEGER)                  -- Legacy field             │
│  ├── currency (TEXT)                 -- Currency code            │
│  ├── closing_notified (INTEGER)      -- Notification flag        │
│  └── scraped_at (TEXT)               -- Scrape timestamp         │
│      FOREIGN KEY (auction_id) → auctions(auction_id)             │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  IMAGES TABLE (Image URLs & Download Status)                     │
├──────────────────────────────────────────────────────────────────┤
│  images                          ◀── THIS TABLE HOLDS IMAGE LINKS│
│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
│  ├── lot_id (TEXT)               -- FK to lots                   │
│  ├── url (TEXT)                  -- Image URL                    │
│  ├── local_path (TEXT)           -- Path after download          │
│  └── downloaded (INTEGER)        -- 0=pending, 1=downloaded      │
│      FOREIGN KEY (lot_id) → lots(lot_id)                         │
│      UNIQUE INDEX idx_unique_lot_url ON (lot_id, url)            │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  BID_HISTORY TABLE (Complete Bid Tracking for Intelligence)      │
├──────────────────────────────────────────────────────────────────┤
│  bid_history                     ◀── REST API: /bidding-history  │
│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
│  ├── lot_id (TEXT)               -- FK to lots                   │
│  ├── bid_amount (REAL)           -- Bid in EUR                   │
│  ├── bid_time (TEXT)             -- ISO 8601 timestamp           │
│  ├── is_autobid (INTEGER)        -- 0=manual, 1=autobid          │
│  ├── bidder_id (TEXT)            -- Anonymized bidder UUID       │
│  ├── bidder_number (INTEGER)     -- Bidder display number        │
│  └── created_at (TEXT)           -- Record creation timestamp    │
│      FOREIGN KEY (lot_id) → lots(lot_id)                         │
│      INDEX idx_bid_history_lot ON (lot_id)                       │
│      INDEX idx_bid_history_time ON (bid_time)                    │
└──────────────────────────────────────────────────────────────────┘
```
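
As an illustration, the `images` and `bid_history` boxes above could be expressed as SQLite DDL roughly as follows. This is a sketch derived from the diagram, not the project's actual migration code; the in-memory database and the sample URL are placeholders.

```python
import sqlite3

# DDL transcribed from the schema diagram; column names and indexes follow it.
DDL = """
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,
    downloaded INTEGER,
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_lot_url ON images (lot_id, url);

CREATE TABLE IF NOT EXISTS bid_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    bid_amount REAL,
    bid_time TEXT,
    is_autobid INTEGER,
    bidder_id TEXT,
    bidder_number INTEGER,
    created_at TEXT,
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
CREATE INDEX IF NOT EXISTS idx_bid_history_lot ON bid_history (lot_id);
CREATE INDEX IF NOT EXISTS idx_bid_history_time ON bid_history (bid_time);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# The unique index makes a repeated INSERT OR IGNORE a no-op.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
        ("A1-28505-5", "https://example.invalid/img/1.jpg"),
    )
image_rows = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
index_names = {
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
}
```

SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is set on the connection, which is why the sketch can insert an image row without a matching `lots` row.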

## Sequence Diagram

```
User          Scraper         Playwright      Cache DB        Data Tables
 │               │                │               │                │
 │  Run          │                │               │                │
 ├──────────────▶│                │               │                │
 │               │                │               │                │
 │               │ Phase 1: Listing Pages         │                │
 │               ├───────────────▶│               │                │
 │               │   goto()       │               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               ├───────────────────────────────▶│                │
 │               │   compress & cache             │                │
 │               │                │               │                │
 │               │ Phase 2: Auction Pages         │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT auctions
 │               │                │               │                │
 │               │ Phase 3: Lot Pages             │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT lots  │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT images│
 │               │                │               │                │
 │               │ Export to CSV/JSON             │                │
 │               │◀────────────────────────────────────────────────┤
 │               │   Query all data               │                │
 │◀──────────────┤                │               │                │
 │   Results     │                │               │                │
```

## Data Flow Details

### 1. **Page Retrieval & Caching**
```
Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │    │
    │    ├─[HIT]──▶ Decompress (if compressed=1)
    │    │          └──▶ Return HTML
    │    │
    │    └─[MISS]─▶ Fetch via Playwright
    │               │
    │               ├──▶ Compress HTML (zlib level 9)
    │               │    ~70-90% size reduction
    │               │
    │               └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
```
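
The cache flow above can be sketched with `sqlite3` and `zlib`. The function names are hypothetical, and the 24-hour validity is taken from the Performance section of this document; the real implementation may differ in detail.

```python
import sqlite3
import time
import zlib

CACHE_TTL_SECONDS = 24 * 3600  # 24-hour default validity, per this document


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "url TEXT PRIMARY KEY, content BLOB, timestamp REAL, "
        "status_code INTEGER, compressed INTEGER)"
    )
    return conn


def cache_put(conn: sqlite3.Connection, url: str, html: str, status_code: int = 200) -> None:
    """Compress the HTML at zlib level 9 and store it with compressed=1."""
    blob = zlib.compress(html.encode("utf-8"), 9)
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), status_code),
    )
    conn.commit()


def cache_get(conn: sqlite3.Connection, url: str, max_age: float = CACHE_TTL_SECONDS):
    """Return cached HTML, or None on a miss or an expired entry."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    content, timestamp, compressed = row
    if time.time() - timestamp > max_age:
        return None  # stale entry: the caller should fetch the page again
    if compressed:
        return zlib.decompress(content).decode("utf-8")
    # Legacy rows (compressed=0) hold plain HTML.
    return content.decode("utf-8") if isinstance(content, bytes) else content


cache = open_cache()
cache_put(cache, "https://example.invalid/a/1", "<html>" * 100)
```

On a `None` result the caller would fetch the page via Playwright and call `cache_put` with the response.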

### 2. **JSON Parsing Strategy**
```
HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
         │
         ├──▶ Parse JSON
         │    │
         │    ├─[has pageProps.lot]──▶ Individual LOT
         │    │   └──▶ Extract: title, bid, location, images, etc.
         │    │
         │    └─[has pageProps.auction]──▶ AUCTION
         │        │
         │        ├─[has lots[] array]──▶ Auction with lots
         │        │   └──▶ Extract: title, location, lots_count
         │        │
         │        └─[no lots[] array]──▶ Old format lot
         │            └──▶ Parse as lot
         │
         └──▶ Fallback to HTML regex parsing (if JSON fails)
```
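
The branching above might look like the following in Python. This is a sketch under assumptions: `pageProps` is taken to sit under `props`, as is usual for Next.js pages, and the function name and return convention are hypothetical.

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def classify_page(html: str):
    """Return (kind, payload); kind is 'lot', 'auction', or None for fallback."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None, None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None, None  # caller falls back to HTML regex parsing
    if not isinstance(data, dict):
        return None, None
    page_props = data.get("props", {}).get("pageProps", {})
    if "lot" in page_props:
        return "lot", page_props["lot"]
    auction = page_props.get("auction")
    if isinstance(auction, dict):
        if isinstance(auction.get("lots"), list):
            return "auction", auction
        return "lot", auction  # old format: a lot stored under the auction key
    return None, None


def make_page(payload: dict) -> str:
    """Build a tiny HTML page around a payload, for demonstration only."""
    body = json.dumps(payload)
    return f'<html><script id="__NEXT_DATA__" type="application/json">{body}</script></html>'


lot_page = make_page({"props": {"pageProps": {"lot": {"title": "Lathe"}}}})
auction_page = make_page({"props": {"pageProps": {"auction": {"title": "X", "lots": []}}}})
old_lot_page = make_page({"props": {"pageProps": {"auction": {"title": "Old"}}}})
```

A `(None, None)` result signals that the caller should fall back to the HTML regex patterns.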

### 3. **API Enrichment Flow**
```
Lot Page Scraped (__NEXT_DATA__ parsed)
    │
    ├──▶ Extract lot UUID from JSON
    │
    ├──▶ GraphQL API Call (fetch_lot_bidding_data)
    │    └──▶ Returns: current_bid, starting_bid, minimum_bid,
    │         bid_count, closing_time, status, bid_increment
    │
    ├──▶ [If bid_count > 0] REST API Call (fetch_bid_history)
    │    │
    │    ├──▶ Fetch all bid pages (paginated)
    │    │
    │    └──▶ Returns: Complete bid history with timestamps,
    │         bidder_ids, autobid flags, amounts
    │         │
    │         ├──▶ INSERT INTO bid_history (multiple records)
    │         │
    │         └──▶ Calculate bid intelligence:
    │              - first_bid_time (earliest timestamp)
    │              - last_bid_time (latest timestamp)
    │              - bid_velocity (bids per hour)
    │
    ├──▶ Extract enrichment from __NEXT_DATA__:
    │    - Brand, model, manufacturer (from attributes)
    │    - Year (regex from title/attributes)
    │    - Condition (map to 0-10 score)
    │    - Serial number, damage description
    │
    └──▶ INSERT/UPDATE lots table with all data
```
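
The bid intelligence step can be illustrated as follows. The document does not define the time window for `bid_velocity`, so this sketch assumes it is the number of bids divided by the hours between the first and last bid, and returns 0.0 when that window is empty; both choices are assumptions.

```python
from datetime import datetime


def parse_iso(ts: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def bid_intelligence(bid_times: list[str]) -> dict:
    """Derive first/last bid time and bids per hour from ISO 8601 timestamps."""
    if not bid_times:
        return {"first_bid_time": None, "last_bid_time": None, "bid_velocity": 0.0}
    parsed = sorted(parse_iso(t) for t in bid_times)
    first, last = parsed[0], parsed[-1]
    hours = (last - first).total_seconds() / 3600
    # Assumed definition: bids divided by the first-to-last window in hours.
    velocity = len(parsed) / hours if hours > 0 else 0.0
    return {
        "first_bid_time": first.isoformat(),
        "last_bid_time": last.isoformat(),
        "bid_velocity": velocity,
    }


stats = bid_intelligence([
    "2025-12-05T06:00:00Z",
    "2025-12-05T04:00:00Z",
    "2025-12-05T05:00:00Z",
    "2025-12-05T04:30:00Z",
])
print(stats)
```

Four bids spread over two hours give a velocity of 2.0 bids per hour under this definition.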

### 4. **Image Handling (Concurrent per Lot)**
```
Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │    │
    │    └──▶ INSERT OR IGNORE INTO images (lot_id, url, downloaded=0)
    │         └──▶ Unique constraint prevents duplicates
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
         │
         ├──▶ Create concurrent download tasks (asyncio.gather)
         │    │
         │    ├──▶ All images for lot download in parallel
         │    │    (No rate limiting between images in same lot)
         │    │
         │    ├──▶ Save to: /images/{lot_id}/001.jpg
         │    │
         │    └──▶ UPDATE images SET local_path=?, downloaded=1
         │
         └──▶ Rate limit only between lots (0.5s)
              (Not between images within a lot)
```
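
A minimal sketch of the concurrency pattern above: images within a lot are gathered in parallel, while lots are processed one after another with a delay in between. `fake_download` is a stand-in that writes placeholder bytes instead of performing an HTTP request, and all function names are hypothetical.

```python
import asyncio
import tempfile
from pathlib import Path


async def fake_download(url: str, dest: Path) -> Path:
    """Stand-in for a real HTTP download; writes placeholder bytes."""
    await asyncio.sleep(0.01)  # simulate network latency
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(b"placeholder for " + url.encode())
    return dest


async def download_lot_images(lot_id: str, urls: list[str], images_dir: Path) -> list[Path]:
    """Download all images of one lot concurrently, named 001.jpg, 002.jpg, ..."""
    tasks = [
        fake_download(url, images_dir / lot_id / f"{index:03d}.jpg")
        for index, url in enumerate(urls, start=1)
    ]
    return await asyncio.gather(*tasks)  # results keep the order of the tasks


async def download_all(lots: dict[str, list[str]], images_dir: Path,
                       delay_between_lots: float = 0.5) -> dict[str, list[Path]]:
    results = {}
    for lot_id, urls in lots.items():
        results[lot_id] = await download_lot_images(lot_id, urls, images_dir)
        await asyncio.sleep(delay_between_lots)  # rate limit applies between lots only
    return results


with tempfile.TemporaryDirectory() as tmp:
    saved = asyncio.run(download_all(
        {"A1-28505-5": ["https://example.invalid/a.jpg", "https://example.invalid/b.jpg"]},
        Path(tmp),
        delay_between_lots=0.0,
    ))
    names = [path.name for path in saved["A1-28505-5"]]
```

After each successful download the real pipeline would also run the `UPDATE images SET local_path=?, downloaded=1` statement shown in the flow.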

## Key Configuration

| Setting              | Value                             | Purpose                          |
|----------------------|-----------------------------------|----------------------------------|
| `CACHE_DB`           | `/mnt/okcomputer/output/cache.db` | SQLite database path             |
| `IMAGES_DIR`         | `/mnt/okcomputer/output/images`   | Downloaded images storage        |
| `RATE_LIMIT_SECONDS` | `0.5`                             | Delay between requests           |
| `DOWNLOAD_IMAGES`    | `False`                           | Toggle image downloading         |
| `MAX_PAGES`          | `50`                              | Number of listing pages to crawl |

## Output Files

```
/mnt/okcomputer/output/
├── cache.db                              # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json             # Exported auctions
├── auctions_{timestamp}.csv              # Exported auctions
├── lots_{timestamp}.json                 # Exported lots
├── lots_{timestamp}.csv                  # Exported lots
└── images/                               # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
```

## Extension Points for Integration

### 1. **Downstream Processing Pipeline**
```sqlite
-- Query images that have not been downloaded yet
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Download and process each image (OCR, classification, etc.),
-- then record the local path and mark it as downloaded
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```

### 2. **Real-time Monitoring**
```sqlite
-- Count lots scraped in the last hour (run every N minutes to detect new lots)
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```

### 3. **Analytics & Reporting**
```sqlite
-- Top locations by number of lots
SELECT location, COUNT(*) AS lots_count FROM lots GROUP BY location ORDER BY lots_count DESC;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) as actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) as lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```

### 4. **Image Processing Integration**
```sqlite
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```

## Performance Characteristics

- **Compression**: ~70-90% HTML size reduction (1GB → ~100-300MB)
- **Rate Limiting**: At least 0.5s between page requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: Up to ~7,200 pages/hour (the ceiling set by the 0.5s rate limit, before page load time is counted)
- **Scalability**: SQLite handles millions of rows efficiently
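
The throughput and compression figures above follow from simple arithmetic, sketched here. The 1.5s page load time in the second call is a hypothetical value chosen only to show how fetch time lowers the ceiling.

```python
def pages_per_hour(delay_s: float, fetch_s: float = 0.0) -> float:
    """Pages per hour when each page costs the rate-limit delay plus fetch time."""
    return 3600 / (delay_s + fetch_s)


def compressed_mb(raw_mb: float, reduction: float) -> float:
    """Size after compression, given a fractional size reduction (0.75 = 75%)."""
    return raw_mb * (1 - reduction)


ceiling = pages_per_hour(0.5)             # rate limit only
with_fetch = pages_per_hour(0.5, 1.5)     # hypothetical 1.5s page load
smaller = compressed_mb(1024, 0.75)       # 1 GB of HTML at a 75% reduction
print(ceiling, with_fetch, smaller)
```

Cache hits avoid both the delay and the fetch, so a warm cache can exceed these rates.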

## Error Handling

- **Network failures**: Cached as status_code=500, retry after cache expiry
- **Parse failures**: Falls back to HTML regex patterns
- **Compression errors**: Auto-detects and handles uncompressed legacy data
- **Missing fields**: Defaults to "No bids", empty string, or 0

## Rate Limiting & Ethics

- **REQUIRED**: 0.5 second delay between page requests (not between images)
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as a standard browser
- **No parallelization**: Single-threaded sequential crawling for pages
- **Image downloads**: Concurrent within each lot (16x speedup)
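
One way to enforce the delay between page requests is a small limiter that sleeps only for the time still owed since the previous request. This is a sketch with hypothetical names, not the scraper's actual implementation; the demo uses a 0.05s interval so that it runs quickly.

```python
import asyncio
import time


class RateLimiter:
    """Ensure at least min_interval seconds pass between consecutive waits."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous request, if any

    async def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self._last = time.monotonic()


async def crawl(urls: list[str], limiter: RateLimiter) -> list[float]:
    stamps = []
    for url in urls:  # sequential: one page at a time
        await limiter.wait()
        stamps.append(time.monotonic())  # a real crawler would fetch `url` here
    return stamps


stamps = asyncio.run(crawl(["page-1", "page-2", "page-3"], RateLimiter(0.05)))
gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Because the limiter measures elapsed time, slow page loads are not penalised with an extra full delay.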

---

## API Integration Architecture

### GraphQL API
**Endpoint:** `https://storefront.tbauctions.com/storefront/graphql`

**Purpose:** Real-time bidding data and lot enrichment

**Key Query:**
```graphql
query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    lot {
      currentBidAmount { cents currency }
      initialAmount { cents currency }
      nextMinimalBid { cents currency }
      nextBidStepInCents
      bidsCount
      followersCount                    # Available - Watch count
      startDate
      endDate
      minimumBidAmountMet
      biddingStatus
      condition
      location { city countryCode }
      categoryInformation { name path }
      attributes { name value }
    }
    estimatedFullPrice {                # Available - Estimated value
      min { cents currency }
      max { cents currency }
    }
  }
}
```

**Currently Captured:**
- ✅ Current bid, starting bid, minimum bid
- ✅ Bid count and bid increment
- ✅ Closing time and status
- ✅ Brand, model, manufacturer (from attributes)

**Available but Not Yet Captured:**
- ⚠️ `followersCount` - Watch count for popularity analysis
- ⚠️ `estimatedFullPrice` - Min/max estimated values
- ⚠️ `biddingStatus` - More detailed status enum
- ⚠️ `condition` - Direct condition field
- ⚠️ `location` - City, country details
- ⚠️ `categoryInformation` - Structured category
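
To illustrate how the captured fields might map onto the `lots` columns, here is a hedged sketch. The field-to-column mapping, the conversion from cents to currency units, and the function names are assumptions inferred from the query and the schema, not confirmed behaviour of `src/graphql_client.py`; the sample values are made up.

```python
def cents_to_amount(money):
    """Convert a {cents, currency} object to a float amount, or None if absent."""
    if not money:
        return None
    return money["cents"] / 100


def map_lot_bidding(lot: dict) -> dict:
    """Map the `lot` object of a lotDetails response to lots-table columns."""
    step_cents = lot.get("nextBidStepInCents")
    return {
        "current_bid": cents_to_amount(lot.get("currentBidAmount")),
        "starting_bid": cents_to_amount(lot.get("initialAmount")),
        "minimum_bid": cents_to_amount(lot.get("nextMinimalBid")),
        "bid_increment": step_cents / 100 if step_cents is not None else None,
        "bid_count": lot.get("bidsCount", 0),
        "closing_time": lot.get("endDate"),
    }


sample_lot = {  # made-up values in the shape requested by the query
    "currentBidAmount": {"cents": 370000, "currency": "EUR"},
    "initialAmount": {"cents": 100000, "currency": "EUR"},
    "nextMinimalBid": {"cents": 380000, "currency": "EUR"},
    "nextBidStepInCents": 10000,
    "bidsCount": 12,
    "endDate": "2025-12-10T18:00:00Z",
}
mapped = map_lot_bidding(sample_lot)
print(mapped)
```

The schema stores `current_bid`, `starting_bid`, and `minimum_bid` as TEXT, so the real code presumably formats these values before saving them.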

### REST API - Bid History
**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`

**Purpose:** Complete bid history for intelligence analysis

**Parameters:**
- `pageNumber` (starts at 1)
- `pageSize` (default: 100)

**Response Example:**
```jsonc
{
  "results": [
    {
      "buyerId": "uuid",           // Anonymized bidder ID
      "buyerNumber": 4,            // Display number
      "currentBid": {
        "cents": 370000,
        "currency": "EUR"
      },
      "autoBid": false,            // Is autobid
      "negotiated": false,         // Was negotiated
      "createdAt": "2025-12-05T04:53:56.763033Z"
    }
  ],
  "hasNext": true,
  "pageNumber": 1
}
```

**Captured Data:**
- ✅ Bid amount, timestamp, bidder ID
- ✅ Autobid flag
- ⚠️ `negotiated` - Not yet captured

**Calculated Intelligence:**
- ✅ First bid time
- ✅ Last bid time
- ✅ Bid velocity (bids per hour)
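
Pagination over `pageNumber` and `hasNext` can be sketched as a generator. The HTTP call is injected as a `fetch_page` callable so the example runs without network access; the function names and the fake pages are hypothetical, while the response shape follows the example above.

```python
from typing import Callable, Iterator


def iter_bids(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every bid, requesting pages from 1 until hasNext is false."""
    page_number = 1
    while True:
        page = fetch_page(page_number)
        yield from page.get("results", [])
        if not page.get("hasNext"):
            break
        page_number += 1


def to_bid_row(lot_id: str, bid: dict) -> dict:
    """Map one API bid object to bid_history columns."""
    return {
        "lot_id": lot_id,
        "bid_amount": bid["currentBid"]["cents"] / 100,  # cents to EUR
        "bid_time": bid["createdAt"],
        "is_autobid": int(bool(bid.get("autoBid"))),
        "bidder_id": bid.get("buyerId"),
        "bidder_number": bid.get("buyerNumber"),
    }


# Fake two-page response used in place of the real REST endpoint.
FAKE_PAGES = {
    1: {"results": [{"buyerId": "uuid-1", "buyerNumber": 4,
                     "currentBid": {"cents": 370000, "currency": "EUR"},
                     "autoBid": False, "createdAt": "2025-12-05T04:53:56.763033Z"}],
        "hasNext": True, "pageNumber": 1},
    2: {"results": [{"buyerId": "uuid-2", "buyerNumber": 7,
                     "currentBid": {"cents": 360000, "currency": "EUR"},
                     "autoBid": True, "createdAt": "2025-12-05T04:50:00.000000Z"}],
        "hasNext": False, "pageNumber": 2},
}
rows = [to_bid_row("A1-28505-5", bid) for bid in iter_bids(FAKE_PAGES.__getitem__)]
```

In real use `fetch_page` would issue a GET request to the bidding-history endpoint with the `pageNumber` and `pageSize` parameters and return the decoded JSON body.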

### API Integration Points

**Files:**
- `src/graphql_client.py` - GraphQL queries and parsing
- `src/bid_history_client.py` - REST API pagination and parsing
- `src/scraper.py` - Integration during lot scraping

**Flow:**
1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
2. Call GraphQL API → Get bidding data
3. If bid_count > 0 → Call REST API → Get complete bid history
4. Calculate bid intelligence metrics
5. Save to database

**Rate Limiting:**
- API calls happen during lot scraping phase
- Overall 0.5s rate limit applies to page requests
- API calls are part of lot processing (not separately limited)

See `API_INTELLIGENCE_FINDINGS.md` for detailed field analysis and roadmap.