scaev/wiki/ARCHITECTURE.md
2025-12-04 14:49:58 +01:00

Scaev - Architecture & Data Flow

System Overview

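The scraper follows a three-phase hierarchical crawl to extract auction and lot data from the Troostwijk Auctions website: listing pages yield auction URLs, auction pages yield lot URLs, and lot pages yield the lot details and image URLs.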

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                     TROOSTWIJK SCRAPER                          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: COLLECT AUCTION URLs                                  │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Listing Page │────────▶│ Extract /a/  │                      │
│  │ /auctions?   │         │ auction URLs │                      │
│  │ page=1..N    │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                      [ List of Auction URLs ]                   │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Auction Page │────────▶│ Parse        │                      │
│  │ /a/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│         │                 └──────────────┘                      │
│         │                        │                              │
│         ▼                        ▼                              │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Save Auction │         │ Extract /l/  │                      │
│  │ Metadata     │         │ lot URLs     │                      │
│  │ to DB        │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                        [ List of Lot URLs ]                     │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: SCRAPE LOT DETAILS                                    │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Lot Page     │────────▶│ Parse        │                      │
│  │ /l/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│                           └──────────────┘                      │
│                                   │                             │
│         ┌─────────────────────────┴─────────────────┐           │
│         ▼                                           ▼           │
│  ┌──────────────┐                          ┌──────────────┐     │
│  │ Save Lot     │                          │ Save Images  │     │
│  │ Details      │                          │ URLs to DB   │     │
│  │ to DB        │                          └──────────────┘     │
│  └──────────────┘                                  │            │
│                                                    ▼            │
│                                           [Optional Download]   │
└─────────────────────────────────────────────────────────────────┘
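
The three phases above can be sketched as a single driver loop. This is a minimal illustration, not the scraper's actual code: `crawl`, `unique`, the injected `fetch` callable, and the `href` regexes are all assumptions made for the example (the real scraper fetches through Playwright and reads lot URLs from the `__NEXT_DATA__` JSON rather than from link attributes).

```python
import re
from typing import Callable, Iterable

# Hypothetical link patterns; used here only as a stand-in for JSON parsing.
AUCTION_RE = re.compile(r'href="(/a/[^"#?]+)"')
LOT_RE = re.compile(r'href="(/l/[^"#?]+)"')


def unique(items: Iterable[str]) -> list[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def crawl(fetch: Callable[[str], str], base: str, max_pages: int) -> dict[str, list[str]]:
    """Run the three phases and return lot URLs grouped by auction URL."""
    # Phase 1: collect auction URLs from the paginated listing.
    auction_urls: list[str] = []
    for page in range(1, max_pages + 1):
        html = fetch(f"{base}/auctions?page={page}")
        auction_urls += [base + path for path in AUCTION_RE.findall(html)]
    auction_urls = unique(auction_urls)

    # Phase 2: collect lot URLs from each auction page.
    lots_by_auction: dict[str, list[str]] = {}
    for url in auction_urls:
        html = fetch(url)
        lots_by_auction[url] = unique(base + path for path in LOT_RE.findall(html))

    # Phase 3: visit each lot page (parsing and saving are omitted here).
    for lot_urls in lots_by_auction.values():
        for lot_url in lot_urls:
            fetch(lot_url)
    return lots_by_auction
```

Passing `fetch` in as a parameter keeps the phase logic testable without a browser; a cached, rate-limited Playwright fetcher would be plugged in at that point.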

Database Schema

┌──────────────────────────────────────────────────────────────────┐
│  CACHE TABLE (HTML Storage with Compression)                     │
├──────────────────────────────────────────────────────────────────┤
│  cache                                                           │
│  ├── url (TEXT, PRIMARY KEY)                                     │
│  ├── content (BLOB)              -- Compressed HTML (zlib)       │
│  ├── timestamp (REAL)                                            │
│  ├── status_code (INTEGER)                                       │
│  └── compressed (INTEGER)        -- 1=compressed, 0=plain        │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  AUCTIONS TABLE                                                  │
├──────────────────────────────────────────────────────────────────┤
│  auctions                                                        │
│  ├── auction_id (TEXT, PRIMARY KEY)  -- e.g. "A7-39813"          │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── location (TEXT)                 -- e.g. "Cluj-Napoca, RO"   │
│  ├── lots_count (INTEGER)                                        │
│  ├── first_lot_closing_time (TEXT)                               │
│  └── scraped_at (TEXT)                                           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  LOTS TABLE                                                      │
├──────────────────────────────────────────────────────────────────┤
│  lots                                                            │
│  ├── lot_id (TEXT, PRIMARY KEY)      -- e.g. "A1-28505-5"        │
│  ├── auction_id (TEXT)               -- FK to auctions           │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── current_bid (TEXT)              -- "€123.45" or "No bids"   │
│  ├── bid_count (INTEGER)                                         │
│  ├── closing_time (TEXT)                                         │
│  ├── viewing_time (TEXT)                                         │
│  ├── pickup_date (TEXT)                                          │
│  ├── location (TEXT)                 -- e.g. "Dongen, NL"        │
│  ├── description (TEXT)                                          │
│  ├── category (TEXT)                                             │
│  └── scraped_at (TEXT)                                           │
│      FOREIGN KEY (auction_id) → auctions(auction_id)             │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  IMAGES TABLE (Image URLs & Download Status)                     │
├──────────────────────────────────────────────────────────────────┤
│  images                          ◀── THIS TABLE HOLDS IMAGE LINKS│
│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
│  ├── lot_id (TEXT)               -- FK to lots                   │
│  ├── url (TEXT)                  -- Image URL                    │
│  ├── local_path (TEXT)           -- Path after download          │
│  └── downloaded (INTEGER)        -- 0=pending, 1=downloaded      │
│      FOREIGN KEY (lot_id) → lots(lot_id)                         │
└──────────────────────────────────────────────────────────────────┘
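
For reference, the four tables above could be created from Python roughly as follows. The column names and types are taken from the diagrams; the `init_db` helper, the `DEFAULT 0` on `downloaded`, and enabling the foreign-key pragma are assumptions of this sketch rather than confirmed details of the scraper.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url TEXT PRIMARY KEY,
    content BLOB,
    timestamp REAL,
    status_code INTEGER,
    compressed INTEGER
);
CREATE TABLE IF NOT EXISTS auctions (
    auction_id TEXT PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT,
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id TEXT,
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0,  -- assumed default: 0 = pending
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create any missing tables."""
    conn = sqlite3.connect(path)
    # SQLite does not enforce foreign keys unless this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```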

Sequence Diagram

User          Scraper         Playwright      Cache DB        Data Tables
 │               │                │               │                │
 │  Run          │                │               │                │
 ├──────────────▶│                │               │                │
 │               │                │               │                │
 │               │ Phase 1: Listing Pages         │                │
 │               ├───────────────▶│               │                │
 │               │   goto()       │               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               ├───────────────────────────────▶│                │
 │               │   compress & cache             │                │
 │               │                │               │                │
 │               │ Phase 2: Auction Pages         │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT auctions
 │               │                │               │                │
 │               │ Phase 3: Lot Pages             │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT lots  │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT images│
 │               │                │               │                │
 │               │ Export to CSV/JSON             │                │
 │               │◀────────────────────────────────────────────────┤
 │               │   Query all data               │                │
 │◀──────────────┤                │               │                │
 │   Results     │                │               │                │

Data Flow Details

1. Page Retrieval & Caching

Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │    │
    │    ├─[HIT]──▶ Decompress (if compressed=1)
    │    │          └──▶ Return HTML
    │    │
    │    └─[MISS]─▶ Fetch via Playwright
    │               │
    │               ├──▶ Compress HTML (zlib level 9)
    │               │    ~70-90% size reduction
    │               │
    │               └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
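
A minimal sketch of the hit/miss logic above, using only `sqlite3` and `zlib` from the standard library. The function names (`open_cache`, `cache_get`, `cache_put`, `get_page`) and the injected `fetch` callable are hypothetical; the 24-hour validity matches the default quoted under Performance Characteristics, and caching of failed requests is left out.

```python
import sqlite3
import time
import zlib
from typing import Callable, Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour cache validity


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, content BLOB, "
        "timestamp REAL, status_code INTEGER, compressed INTEGER)"
    )
    return conn


def cache_get(conn: sqlite3.Connection, url: str, now: Optional[float] = None) -> Optional[str]:
    """Return cached HTML, or None on a miss or an expired entry."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    content, stamp, compressed = row
    if now - stamp > CACHE_TTL_SECONDS:
        return None
    # Legacy rows (compressed=0) are returned as stored.
    raw = zlib.decompress(content) if compressed else content
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw


def cache_put(conn: sqlite3.Connection, url: str, html: str, status_code: int = 200) -> None:
    """Compress at zlib level 9 and store with compressed=1."""
    blob = zlib.compress(html.encode("utf-8"), 9)
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), status_code),
    )
    conn.commit()


def get_page(conn: sqlite3.Connection, url: str, fetch: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise fetch and store."""
    html = cache_get(conn, url)
    if html is None:
        html = fetch(url)
        cache_put(conn, url, html)
    return html
```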

2. JSON Parsing Strategy

HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
         │
         ├──▶ Parse JSON
         │    │
         │    ├─[has pageProps.lot]──▶ Individual LOT
         │    │   └──▶ Extract: title, bid, location, images, etc.
         │    │
         │    └─[has pageProps.auction]──▶ AUCTION
         │        │
         │        ├─[has lots[] array]──▶ Auction with lots
         │        │   └──▶ Extract: title, location, lots_count
         │        │
         │        └─[no lots[] array]──▶ Old format lot
         │            └──▶ Parse as lot
         │
         └──▶ Fallback to HTML regex parsing (if JSON fails)
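
The branching above might look like this in code. It is a sketch under two assumptions: that `pageProps` sits under the top-level `props` key, as is usual for Next.js pages, and that a `"fallback"` result tells the caller to switch to HTML regex parsing. `extract_next_data` and `classify_page` are illustrative names.

```python
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def classify_page(html: str) -> tuple[str, Optional[dict]]:
    """Return ("lot" | "auction" | "fallback", payload)."""
    data = extract_next_data(html)
    if data is None:
        return "fallback", None
    page_props = (data.get("props") or {}).get("pageProps") or {}
    if "lot" in page_props:
        return "lot", page_props["lot"]
    auction = page_props.get("auction")
    if isinstance(auction, dict):
        if isinstance(auction.get("lots"), list):
            return "auction", auction
        # Old format: a lot served under the "auction" key.
        return "lot", auction
    return "fallback", None
```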

3. Image Handling

Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │    │
    │    └──▶ INSERT INTO images (lot_id, url, downloaded=0)
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
         │
         ├──▶ Download each image
         │    │
         │    ├──▶ Save to: /images/{lot_id}/001.jpg
         │    │
         │    └──▶ UPDATE images SET local_path=?, downloaded=1
         │
         └──▶ Rate limit between downloads (0.5s)
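
The image flow above can be sketched as two small functions. The names, the injected `download` callable, and the choice to number files by the image's position within its lot are assumptions for illustration; only the table columns, the `{lot_id}/001.jpg` layout, and the 0.5 s pause come from this document.

```python
import sqlite3
import time
from pathlib import Path
from typing import Callable, Iterable


def save_image_urls(conn: sqlite3.Connection, lot_id: str, urls: Iterable[str]) -> None:
    """Record image URLs as pending (downloaded=0)."""
    conn.executemany(
        "INSERT INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
        [(lot_id, url) for url in urls],
    )
    conn.commit()


def download_pending(
    conn: sqlite3.Connection,
    images_dir: Path,
    download: Callable[[str], bytes],
    lot_id: str,
    delay: float = 0.5,
) -> int:
    """Download pending images for one lot; returns how many were saved."""
    rows = conn.execute(
        "SELECT id, url, downloaded FROM images WHERE lot_id = ? ORDER BY id",
        (lot_id,),
    ).fetchall()
    target = images_dir / lot_id
    target.mkdir(parents=True, exist_ok=True)
    saved = 0
    for index, (image_id, url, done) in enumerate(rows, start=1):
        if done:
            continue
        path = target / f"{index:03d}.jpg"  # 001.jpg, 002.jpg, ...
        path.write_bytes(download(url))
        conn.execute(
            "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
            (str(path), image_id),
        )
        saved += 1
        time.sleep(delay)  # rate limit between downloads
    conn.commit()
    return saved
```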

Key Configuration

Setting             Value                            Purpose
CACHE_DB            /mnt/okcomputer/output/cache.db  SQLite database path
IMAGES_DIR          /mnt/okcomputer/output/images    Downloaded images storage
RATE_LIMIT_SECONDS  0.5                              Delay between requests
DOWNLOAD_IMAGES     False                            Toggle image downloading
MAX_PAGES           50                               Number of listing pages to crawl

Output Files

/mnt/okcomputer/output/
├── cache.db                              # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json             # Exported auctions
├── auctions_{timestamp}.csv              # Exported auctions
├── lots_{timestamp}.json                 # Exported lots
├── lots_{timestamp}.csv                  # Exported lots
└── images/                               # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
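
The timestamped export files listed above could be produced along these lines. `export_table` is a hypothetical helper, and the `%Y%m%d_%H%M%S` timestamp format is an assumption, since the document only shows `{timestamp}`.

```python
import csv
import json
import sqlite3
from datetime import datetime
from pathlib import Path
from typing import Optional

EXPORTABLE = {"auctions", "lots"}


def export_table(
    conn: sqlite3.Connection, table: str, out_dir: Path, timestamp: Optional[str] = None
) -> tuple[Path, Path]:
    """Write <table>_<timestamp>.json and .csv; returns both paths."""
    if table not in EXPORTABLE:  # table names cannot be bound as SQL parameters
        raise ValueError(f"unsupported table: {table}")
    timestamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]

    json_path = out_dir / f"{table}_{timestamp}.json"
    json_path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")

    csv_path = out_dir / f"{table}_{timestamp}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return json_path, csv_path
```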

Extension Points for Integration

1. Downstream Processing Pipeline

-- Query images that have not been downloaded yet
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;

2. Real-time Monitoring

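-- Count lots scraped in the last hour (run this every N minutes).
-- datetime('now') is UTC in 'YYYY-MM-DD HH:MM:SS' form, so the text comparison
-- is only reliable if scraped_at is stored in that same format.
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');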

-- List lots that currently have bids (compare successive runs to detect changes)
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;

3. Analytics & Reporting

-- Top locations (most lots first)
SELECT location, COUNT(*) AS lot_count FROM lots GROUP BY location ORDER BY lot_count DESC;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) as actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) as lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id, a.title;
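
As an illustration of consuming these queries from a downstream tool, the auction statistics query can be run through Python's `sqlite3` module. `auction_stats` is a hypothetical helper; the `ORDER BY` is added here only to make the output order predictable.

```python
import sqlite3

AUCTION_STATS_SQL = """
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id, a.title
ORDER BY a.auction_id
"""


def auction_stats(conn: sqlite3.Connection) -> list[dict]:
    """Return one dict per auction, including auctions with no scraped lots."""
    cursor = conn.execute(AUCTION_STATS_SQL)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Because of the `LEFT JOIN`, an auction whose lots have not been scraped yet still appears, with both counts at 0. Comparing `actual_lots` against `auctions.lots_count` is one way to spot incomplete crawls.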

4. Image Processing Integration

-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Select downloaded images, with lot context, for batch processing
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;

Performance Characteristics

  • Compression: ~70-90% HTML size reduction (1GB → ~100-300MB)
  • Rate Limiting: Exactly 0.5s between requests (respectful scraping)
  • Caching: 24-hour default cache validity (configurable)
  • Throughput: up to ~7,200 pages/hour (3,600 s ÷ 0.5 s rate limit; an upper bound that excludes page load time)
  • Scalability: SQLite handles millions of rows efficiently
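
The throughput and compression figures above follow from simple arithmetic, sketched here. Both function names are illustrative, and the 1-second page load time used in the usage note is a made-up value, not a measurement.

```python
def max_pages_per_hour(rate_limit_seconds: float, avg_fetch_seconds: float = 0.0) -> float:
    """Upper bound on sequential throughput: one page per (delay + fetch time)."""
    return 3600.0 / (rate_limit_seconds + avg_fetch_seconds)


def compressed_size_mb(raw_mb: float, reduction: float) -> float:
    """Size after compression, given a fractional reduction such as 0.7 or 0.9."""
    return raw_mb * (1.0 - reduction)
```

With the 0.5 s delay alone, `max_pages_per_hour(0.5)` gives 7200.0. If each page also took 1 s to load, `max_pages_per_hour(0.5, 1.0)` would drop to 2400.0, which is why the quoted figure is best read as a ceiling.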

Error Handling

  • Network failures: Cached as status_code=500, retry after cache expiry
  • Parse failures: Falls back to HTML regex patterns
  • Compression errors: Auto-detects and handles uncompressed legacy data
  • Missing fields: Defaults to "No bids", empty string, or 0
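
The missing-field defaults could be applied in one normalization step before a lot is written to the database. This is a sketch: `normalize_lot` is hypothetical, and it assumes the parsed data has already been mapped to the `lots` column names.

```python
from typing import Any, Mapping

TEXT_FIELDS = ("title", "location", "description")


def normalize_lot(raw: Mapping[str, Any]) -> dict[str, Any]:
    """Fill in the documented defaults for missing or empty fields."""
    lot: dict[str, Any] = {name: raw.get(name) or "" for name in TEXT_FIELDS}
    lot["current_bid"] = raw.get("current_bid") or "No bids"
    try:
        lot["bid_count"] = int(raw.get("bid_count") or 0)
    except (TypeError, ValueError):
        lot["bid_count"] = 0  # unparseable counts fall back to 0
    return lot
```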

Rate Limiting & Ethics

  • REQUIRED: 0.5 second delay between ALL requests
  • Respects cache: Avoids unnecessary re-fetching
  • User-Agent: Identifies as standard browser
  • No parallelization: Single-threaded sequential crawling
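
One way to enforce a minimum delay between sequential requests is a small limiter that sleeps only for the time still owed. The `RateLimiter` class below is an illustrative sketch, not the scraper's own implementation; the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Ensure at least `min_interval` seconds between successive wait() calls."""

    def __init__(
        self,
        min_interval: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Block until the next request is allowed, then record its time."""
        if self._last is not None:
            remaining = self._min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling `limiter.wait()` immediately before every page fetch and image download, from a single thread, matches the sequential crawling described above.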