Tour b12f3a5ee2 a

2025-12-04 14:53:55 +01:00

Scaev - Architecture & Data Flow

System Overview

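The scraper follows a three-phase hierarchical crawling pattern to extract auction and lot data from the Troostwijk Auctions website: it collects auction URLs from the listing pages, extracts lot URLs from each auction page, and then scrapes the details of each lot.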

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                     TROOSTWIJK SCRAPER                          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: COLLECT AUCTION URLs                                  │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Listing Page │────────▶│ Extract /a/  │                      │
│  │ /auctions?   │         │ auction URLs │                      │
│  │ page=1..N    │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                   ▼                             │
│                        [ List of Auction URLs ]                 │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Auction Page │────────▶│ Parse        │                      │
│  │ /a/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│         │                 └──────────────┘                      │
│         │                        │                              │
│         ▼                        ▼                              │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Save Auction │         │ Extract /l/  │                      │
│  │ Metadata     │         │ lot URLs     │                      │
│  │ to DB        │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                   ▼                             │
│                          [ List of Lot URLs ]                   │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: SCRAPE LOT DETAILS                                    │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Lot Page     │────────▶│ Parse        │                      │
│  │ /l/...       │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│                           └──────────────┘                      │
│                                   │                             │
│         ┌─────────────────────────┴─────────────────┐           │
│         ▼                                           ▼           │
│  ┌──────────────┐                          ┌──────────────┐     │
│  │ Save Lot     │                          │ Save Images  │     │
│  │ Details      │                          │ URLs to DB   │     │
│  │ to DB        │                          └──────────────┘     │
│  └──────────────┘                                 │             │
│                                                    ▼            │
│                                          [Optional Download]    │
└─────────────────────────────────────────────────────────────────┘
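
Phase 1 of the diagram above can be sketched as follows. This is a minimal illustration, not the scraper's actual code: `BASE_URL`, `listing_urls`, `extract_auction_urls`, and the `href` regex are assumptions introduced here, and the HTML is treated as a plain string that some fetcher has already returned.

```python
import re
from urllib.parse import urljoin

# Assumption: site root used to resolve relative links.
BASE_URL = "https://www.troostwijkauctions.com"

# Matches href values that start with /a/ (auction pages); stops at ", # or ?.
AUCTION_HREF = re.compile(r'href="(/a/[^"#?]+)')


def listing_urls(max_pages: int, base_url: str = BASE_URL) -> list[str]:
    """Build the Phase 1 listing URLs for pages 1..max_pages."""
    return [f"{base_url}/auctions?page={page}" for page in range(1, max_pages + 1)]


def extract_auction_urls(html: str, base_url: str = BASE_URL) -> list[str]:
    """Return absolute /a/ URLs found in html, de-duplicated in page order."""
    absolute = (urljoin(base_url, path) for path in AUCTION_HREF.findall(html))
    return list(dict.fromkeys(absolute))
```

Using `dict.fromkeys` keeps the first occurrence of each URL, so a listing page that links to the same auction several times yields it only once.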

Database Schema

┌──────────────────────────────────────────────────────────────────┐
│  CACHE TABLE (HTML Storage with Compression)                     │
├──────────────────────────────────────────────────────────────────┤
│  cache                                                           │
│  ├── url (TEXT, PRIMARY KEY)                                     │
│  ├── content (BLOB)              -- Compressed HTML (zlib)       │
│  ├── timestamp (REAL)                                            │
│  ├── status_code (INTEGER)                                       │
│  └── compressed (INTEGER)        -- 1=compressed, 0=plain        │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  AUCTIONS TABLE                                                  │
├──────────────────────────────────────────────────────────────────┤
│  auctions                                                        │
│  ├── auction_id (TEXT, PRIMARY KEY)  -- e.g. "A7-39813"          │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── location (TEXT)                 -- e.g. "Cluj-Napoca, RO"   │
│  ├── lots_count (INTEGER)                                        │
│  ├── first_lot_closing_time (TEXT)                               │
│  └── scraped_at (TEXT)                                           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  LOTS TABLE                                                      │
├──────────────────────────────────────────────────────────────────┤
│  lots                                                            │
│  ├── lot_id (TEXT, PRIMARY KEY)      -- e.g. "A1-28505-5"        │
│  ├── auction_id (TEXT)               -- FK to auctions           │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── current_bid (TEXT)              -- "€123.45" or "No bids"   │
│  ├── bid_count (INTEGER)                                         │
│  ├── closing_time (TEXT)                                         │
│  ├── viewing_time (TEXT)                                         │
│  ├── pickup_date (TEXT)                                          │
│  ├── location (TEXT)                 -- e.g. "Dongen, NL"        │
│  ├── description (TEXT)                                          │
│  ├── category (TEXT)                                             │
│  └── scraped_at (TEXT)                                           │
│      FOREIGN KEY (auction_id) → auctions(auction_id)             │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  IMAGES TABLE (Image URLs & Download Status)                     │
├──────────────────────────────────────────────────────────────────┤
│  images                          ◀── THIS TABLE HOLDS IMAGE LINKS│
│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
│  ├── lot_id (TEXT)               -- FK to lots                   │
│  ├── url (TEXT)                  -- Image URL                    │
│  ├── local_path (TEXT)           -- Path after download          │
│  └── downloaded (INTEGER)        -- 0=pending, 1=downloaded      │
│      FOREIGN KEY (lot_id) → lots(lot_id)                         │
└──────────────────────────────────────────────────────────────────┘
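
The four tables above could be created with Python's built-in `sqlite3` module roughly as shown below. The column names and types come from the diagrams; the `init_db` helper, the `IF NOT EXISTS` clauses, the `DEFAULT` values, and the decision to enable foreign-key enforcement are assumptions made for this sketch.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url         TEXT PRIMARY KEY,
    content     BLOB,               -- compressed HTML (zlib)
    timestamp   REAL,
    status_code INTEGER,
    compressed  INTEGER DEFAULT 1   -- 1=compressed, 0=plain
);
CREATE TABLE IF NOT EXISTS auctions (
    auction_id             TEXT PRIMARY KEY,
    url                    TEXT UNIQUE,
    title                  TEXT,
    location               TEXT,
    lots_count             INTEGER,
    first_lot_closing_time TEXT,
    scraped_at             TEXT
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id       TEXT PRIMARY KEY,
    auction_id   TEXT,
    url          TEXT UNIQUE,
    title        TEXT,
    current_bid  TEXT,
    bid_count    INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date  TEXT,
    location     TEXT,
    description  TEXT,
    category     TEXT,
    scraped_at   TEXT,
    FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
);
CREATE TABLE IF NOT EXISTS images (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id     TEXT,
    url        TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0,   -- 0=pending, 1=downloaded
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create any missing tables."""
    conn = sqlite3.connect(path)
    # Assumption: SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

With enforcement enabled, inserting a lot whose `auction_id` is not yet in `auctions` raises `sqlite3.IntegrityError`, so in this sketch auctions must be saved before their lots.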

Sequence Diagram

User          Scraper         Playwright      Cache DB        Data Tables
 │               │                │               │                │
 │  Run          │                │               │                │
 ├──────────────▶│                │               │                │
 │               │                │               │                │
 │               │ Phase 1: Listing Pages         │                │
 │               ├───────────────▶│               │                │
 │               │   goto()       │               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               ├───────────────────────────────▶│                │
 │               │   compress & cache             │                │
 │               │                │               │                │
 │               │ Phase 2: Auction Pages         │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT auctions
 │               │                │               │                │
 │               │ Phase 3: Lot Pages             │                │
 │               ├───────────────▶│               │                │
 │               │◀───────────────┤               │                │
 │               │   HTML         │               │                │
 │               │                │               │                │
 │               │ Parse __NEXT_DATA__ JSON       │                │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT lots  │
 │               │────────────────────────────────────────────────▶│
 │               │                │               │   INSERT images│
 │               │                │               │                │
 │               │ Export to CSV/JSON             │                │
 │               │◀────────────────────────────────────────────────┤
 │               │   Query all data               │                │
 │◀──────────────┤                │               │                │
 │   Results     │                │               │                │

Data Flow Details

1. Page Retrieval & Caching

Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │    │
    │    ├─[HIT]──▶ Decompress (if compressed=1)
    │    │          └──▶ Return HTML
    │    │
    │    └─[MISS]─▶ Fetch via Playwright
    │               │
    │               ├──▶ Compress HTML (zlib level 9)
    │               │    ~70-90% size reduction
    │               │
    │               └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
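
The cache flow above might look like this in Python. It is a sketch under stated assumptions: the helper names (`open_cache`, `cache_get`, `cache_put`, `get_page`) are hypothetical, `fetch` stands in for whatever Playwright-based function returns the page HTML, and the 24-hour TTL is taken from the Performance Characteristics section.

```python
import sqlite3
import time
import zlib

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour default cache validity


def open_cache(path=":memory:"):
    """Open the cache database and create the cache table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, content BLOB, "
        "timestamp REAL, status_code INTEGER, compressed INTEGER)"
    )
    return conn


def cache_get(conn, url, now=None, ttl=CACHE_TTL_SECONDS):
    """Return cached HTML for url, or None on a miss or an expired entry."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    content, stored_at, compressed = row
    now = time.time() if now is None else now
    if now - stored_at > ttl:
        return None  # timestamp validation failed: treat as a miss
    raw = zlib.decompress(content) if compressed else content
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw


def cache_put(conn, url, html, status_code=200, now=None):
    """Compress html with zlib level 9 and store it with compressed=1."""
    blob = zlib.compress(html.encode("utf-8"), 9)
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time() if now is None else now, status_code),
    )
    conn.commit()


def get_page(conn, url, fetch):
    """Serve from cache when possible; otherwise fetch, store, and return."""
    html = cache_get(conn, url)
    if html is None:
        html = fetch(url)  # assumption: fetch(url) returns the page HTML as str
        cache_put(conn, url, html)
    return html
```

Passing `fetch` in as a parameter keeps the cache logic independent of the browser automation, which also makes it easy to exercise with a stub.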

2. JSON Parsing Strategy

HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
         │
         ├──▶ Parse JSON
         │    │
         │    ├─[has pageProps.lot]──▶ Individual LOT
         │    │   └──▶ Extract: title, bid, location, images, etc.
         │    │
         │    └─[has pageProps.auction]──▶ AUCTION
         │        │
         │        ├─[has lots[] array]──▶ Auction with lots
         │        │   └──▶ Extract: title, location, lots_count
         │        │
         │        └─[no lots[] array]──▶ Old format lot
         │            └──▶ Parse as lot
         │
         └──▶ Fallback to HTML regex parsing (if JSON fails)
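
The decision tree above can be expressed as a small classifier. This is a sketch only: `classify_page` and its return labels are hypothetical, and the `props.pageProps` nesting reflects the usual layout of a Next.js `__NEXT_DATA__` payload rather than anything confirmed for this site beyond the `pageProps.lot` and `pageProps.auction` keys named above.

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def classify_page(html):
    """Return (kind, payload) where kind is 'lot', 'auction', or 'fallback'."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return "fallback", None  # no JSON block: use HTML regex parsing
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return "fallback", None
    if not isinstance(data, dict):
        return "fallback", None
    page_props = (data.get("props") or {}).get("pageProps") or {}

    lot = page_props.get("lot")
    if isinstance(lot, dict):
        return "lot", lot

    auction = page_props.get("auction")
    if isinstance(auction, dict):
        if isinstance(auction.get("lots"), list):
            return "auction", auction
        return "lot", auction  # old format: an auction object without lots[]
    return "fallback", None
```

Returning a label together with the payload lets the caller dispatch to a lot parser, an auction parser, or the regex fallback without re-parsing the JSON.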

3. Image Handling

Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │    │
    │    └──▶ INSERT INTO images (lot_id, url, downloaded=0)
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
         │
         ├──▶ Download each image
         │    │
         │    ├──▶ Save to: {IMAGES_DIR}/{lot_id}/001.jpg
         │    │
         │    └──▶ UPDATE images SET local_path=?, downloaded=1
         │
         └──▶ Rate limit between downloads (0.5s)
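
A possible shape for the image step is sketched below. The function names, the injected `fetch_bytes` callable (which stands in for the real HTTP download), and the choice to number files by each image's position within the lot are assumptions; only the SQL statements and the 0.5 s delay come from the flow above.

```python
import time
from pathlib import Path


def save_image_urls(conn, lot_id, urls):
    """Record every image URL for a lot as pending (downloaded=0)."""
    conn.executemany(
        "INSERT INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
        [(lot_id, url) for url in urls],
    )
    conn.commit()


def download_pending(conn, lot_id, images_dir, fetch_bytes, delay=0.5, sleep=time.sleep):
    """Download pending images for one lot and mark each row as downloaded."""
    rows = conn.execute(
        "SELECT id, url, downloaded FROM images WHERE lot_id = ? ORDER BY id",
        (lot_id,),
    ).fetchall()
    target_dir = Path(images_dir) / lot_id
    target_dir.mkdir(parents=True, exist_ok=True)
    done = 0
    for index, (image_id, url, downloaded) in enumerate(rows, start=1):
        if downloaded:
            continue  # keep numbering stable for images fetched earlier
        path = target_dir / f"{index:03d}.jpg"
        path.write_bytes(fetch_bytes(url))  # assumption: returns raw image bytes
        conn.execute(
            "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
            (str(path), image_id),
        )
        conn.commit()
        done += 1
        sleep(delay)  # rate limit between downloads
    return done
```

Committing after each image means an interrupted run can resume from the rows still marked `downloaded = 0`.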

Key Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests (seconds) |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
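
One way to keep these settings in a single place is a small frozen dataclass. The defaults below are the values from the table; the `Config` class, the `load_config` helper, and the idea of overriding values through environment variables of the same name are hypothetical and not described in the source.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    cache_db: str = "/mnt/okcomputer/output/cache.db"
    images_dir: str = "/mnt/okcomputer/output/images"
    rate_limit_seconds: float = 0.5
    download_images: bool = False
    max_pages: int = 50


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build a Config, letting same-named environment variables override defaults."""
    defaults = Config()
    truthy = {"1", "true", "yes"}
    return Config(
        cache_db=env.get("CACHE_DB", defaults.cache_db),
        images_dir=env.get("IMAGES_DIR", defaults.images_dir),
        rate_limit_seconds=float(env.get("RATE_LIMIT_SECONDS", defaults.rate_limit_seconds)),
        download_images=str(env.get("DOWNLOAD_IMAGES", defaults.download_images)).strip().lower() in truthy,
        max_pages=int(env.get("MAX_PAGES", defaults.max_pages)),
    )
```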

Output Files

/mnt/okcomputer/output/
├── cache.db                              # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json             # Exported auctions
├── auctions_{timestamp}.csv              # Exported auctions
├── lots_{timestamp}.json                 # Exported lots
├── lots_{timestamp}.csv                  # Exported lots
└── images/                               # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
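
The paired JSON and CSV exports could be produced along these lines. The `export_table` helper and the `%Y%m%d_%H%M%S` timestamp format are assumptions; the source only shows that file names follow the `{table}_{timestamp}` pattern.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def export_table(conn, table, out_dir, timestamp=None):
    """Write every row of `table` to {table}_{timestamp}.json and .csv."""
    if table not in ("auctions", "lots"):
        raise ValueError(f"unexpected table: {table}")  # guards the f-string SQL
    timestamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{table}_{timestamp}.json"
    csv_path = out / f"{table}_{timestamp}.csv"

    json_path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return json_path, csv_path
```

Restricting `table` to a fixed set avoids building SQL from arbitrary input, since table names cannot be bound as query parameters.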

Extension Points for Integration

1. Downstream Processing Pipeline

-- Query images that have not been downloaded yet
SELECT id, lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;

2. Real-time Monitoring

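-- Count lots scraped in the last hour (run this check every N minutes)
-- Assumes scraped_at is stored as UTC text in 'YYYY-MM-DD HH:MM:SS' form
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- List lots that currently have bids (compare successive results to detect bid changes)
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;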

3. Analytics & Reporting

-- Top locations
SELECT location, COUNT(*) as lot_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) as actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) as lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;

4. Image Processing Integration

-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;

Performance Characteristics

Compression: ~70-90% HTML size reduction (1GB → ~100-300MB)
Rate Limiting: At least 0.5s between requests (respectful scraping)
Caching: 24-hour default cache validity (configurable)
Throughput: Up to ~7,200 pages/hour (3,600s ÷ 0.5s rate limit; an upper bound that ignores page load time)
Scalability: SQLite handles millions of rows efficiently
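
The throughput and compression figures above follow from simple arithmetic, sketched here. The function names are hypothetical, and the `fetch_seconds` parameter is an illustrative addition showing how real page load time lowers the 7,200 pages/hour ceiling.

```python
def max_pages_per_hour(delay_seconds, fetch_seconds=0.0):
    """Sequential requests per hour given the delay and time spent per fetch."""
    return 3600.0 / (delay_seconds + fetch_seconds)


def compressed_size_mb(original_mb, reduction):
    """Size after compression, where reduction is a fraction such as 0.7 or 0.9."""
    return original_mb * (1.0 - reduction)


print(max_pages_per_hour(0.5))                     # 7200.0, delay only
print(max_pages_per_hour(0.5, fetch_seconds=1.0))  # 2400.0, with 1s per page load
print(round(compressed_size_mb(1000, 0.9)), round(compressed_size_mb(1000, 0.7)))  # 100 300
```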

Error Handling

Network failures: Cached with status_code=500 and retried after the cache entry expires
Parse failures: Falls back to HTML regex patterns
Compression errors: Auto-detects and handles uncompressed legacy data
Missing fields: Defaults to "No bids", empty string, or 0
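
Two of the behaviours above, tolerating uncompressed legacy cache rows and filling in defaults for missing fields, might be handled as in this sketch. `decode_cached`, `with_defaults`, and the exact mapping of fields to defaults in `LOT_DEFAULTS` are assumptions; the source only states that defaults are "No bids", an empty string, or 0.

```python
import zlib

# Assumption: which field gets which default is illustrative, not confirmed.
LOT_DEFAULTS = {
    "title": "",
    "current_bid": "No bids",
    "bid_count": 0,
    "location": "",
    "description": "",
    "category": "",
}


def decode_cached(content, compressed):
    """Return HTML text from a cache row, tolerating plain legacy content."""
    raw = content if isinstance(content, bytes) else str(content).encode("utf-8")
    if compressed:
        try:
            raw = zlib.decompress(raw)
        except zlib.error:
            pass  # row was flagged compressed but holds plain data
    return raw.decode("utf-8", errors="replace")


def with_defaults(lot):
    """Fill missing or None fields of a parsed lot with the defaults."""
    merged = dict(LOT_DEFAULTS)
    merged.update({key: value for key, value in lot.items() if value is not None})
    return merged
```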

Rate Limiting & Ethics

REQUIRED: 0.5 second delay between ALL requests
Respects cache: Avoids unnecessary re-fetching
User-Agent: Identifies as a standard browser
No parallelization: Single-threaded sequential crawling
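
A minimum delay between sequential requests can be enforced with a small helper like the one below. The `RateLimiter` class is a hypothetical sketch, not the scraper's own implementation; the injectable `clock` and `sleep` parameters exist only to make the behaviour easy to verify.

```python
import time


class RateLimiter:
    """Ensure at least min_interval seconds pass between consecutive calls."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before every page or image request keeps the crawl single-threaded and spaced by at least the configured delay; time already spent parsing counts toward the interval, so no extra sleep is added when the previous request is old enough.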
