# Scaev Scraper Refactoring Summary

## Date: 2025-12-07

## Objectives Completed

### 1. Image Download Integration ✅

- **Changed**: Enabled `DOWNLOAD_IMAGES = True` in `config.py` and `docker-compose.yml`
- **Added**: Unique constraint on `images(lot_id, url)` to prevent duplicates
- **Added**: Automatic duplicate-cleanup migration in `cache.py`
- **Result**: Images are now downloaded to `/mnt/okcomputer/output/images/{lot_id}/` and marked as `downloaded=1`
- **Impact**: Eliminates 57M+ duplicate image downloads by the monitor app

### 2. Data Completeness Fix ✅

- **Problem**: 99.9% of lots were missing `closing_time`, and 100% were missing bid data
- **Root Cause**: Troostwijk loads bid/timing data dynamically via a GraphQL API; it is not present in the HTML
- **Solution**: Added a GraphQL client to fetch real-time bidding data

## Key Changes

### New Files

1. **src/graphql_client.py** - GraphQL API client for fetching lot bidding data
   - Endpoint: `https://storefront.tbauctions.com/storefront/graphql`
   - Fetches: `current_bid`, `starting_bid`, `minimum_bid`, `bid_count`, `closing_time`

### Modified Files

1. **src/config.py:22** - `DOWNLOAD_IMAGES = True`
2. **docker-compose.yml:13** - `DOWNLOAD_IMAGES: "True"`
3. **src/cache.py**
   - Added unique index on `images(lot_id, url)`
   - Added columns `starting_bid` and `minimum_bid` to the `lots` table
   - Added migration to clean duplicates and add the missing columns
4.
**src/scraper.py**
   - Integrated GraphQL API calls for each lot
   - Fetches real-time bidding data after parsing the HTML
   - Removed Unicode characters causing Windows encoding issues

## Database Schema Updates

### lots table - New Columns

```sql
ALTER TABLE lots ADD COLUMN starting_bid TEXT;
ALTER TABLE lots ADD COLUMN minimum_bid TEXT;
```

### images table - New Index

```sql
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);
```

## Data Flow (New Architecture)

```
┌────────────────────────────────────────────────────┐
│ Phase 3: Scrape Lot Page                           │
└────────────────────────────────────────────────────┘
        │
        ├─▶ Parse HTML (__NEXT_DATA__)
        │     └─▶ Extract: title, location, images, description
        │
        ├─▶ Fetch GraphQL API
        │     └─▶ Query: LotBiddingData(lot_display_id)
        │           └─▶ Returns:
        │                 - currentBidAmount (cents)
        │                 - initialAmount (starting_bid)
        │                 - nextMinimalBid (minimum_bid)
        │                 - bidsCount
        │                 - endDate (Unix timestamp)
        │                 - startDate
        │                 - biddingStatus
        │
        └─▶ Save to Database
              - lots table: complete bid & timing data
              - images table: deduplicated URLs
              - Download images immediately
```

## Testing Results

### Test Lot: A1-28505-5

```
Current Bid:   EUR 50.00 ✅
Starting Bid:  EUR 50.00 ✅
Minimum Bid:   EUR 55.00 ✅
Bid Count:     1 ✅
Closing Time:  2025-12-16 19:10:00 ✅
Images:        Downloaded 2 ✅
```

## Deployment Checklist

- [x] Enable DOWNLOAD_IMAGES in config
- [x] Update docker-compose environment
- [x] Add GraphQL client
- [x] Update scraper integration
- [x] Add database migrations
- [x] Test with live lot
- [ ] Deploy to production
- [ ] Run full scrape to populate data
- [ ] Verify monitor app sees downloaded images

## Post-Deployment Verification

### Check Data Quality

```sql
-- Bid data completeness
SELECT
    COUNT(*) as total,
    SUM(CASE WHEN closing_time != '' THEN 1 ELSE 0 END) as has_closing,
    SUM(CASE WHEN bid_count > 0 THEN 1 ELSE 0 END) as has_bids,
    SUM(CASE WHEN starting_bid IS NOT NULL THEN 1 ELSE 0 END) as has_starting_bid
FROM lots
WHERE scraped_at > datetime('now', '-1 hour');

-- Image download rate
SELECT
    COUNT(*) as total,
    SUM(downloaded) as downloaded,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
FROM images
WHERE id IN (
    SELECT i.id
    FROM images i
    JOIN lots l ON i.lot_id = l.lot_id
    WHERE l.scraped_at > datetime('now', '-1 hour')
);

-- Duplicate check (should return 0 rows)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;
```

## Notes

- The GraphQL API requires no authentication
- API rate limits: handled by the existing `RATE_LIMIT_SECONDS = 0.5`
- Currency format: changed from `€` to `EUR` for Windows compatibility
- Timestamps: the API returns Unix timestamps in seconds (not milliseconds)
- Existing data: old lots still have missing fields; a re-scrape is required to populate them
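
## Appendix: GraphQL Fetch Sketch

The per-lot GraphQL fetch and field mapping described above can be sketched as follows. This is a minimal illustration, not the actual `src/graphql_client.py`: the endpoint URL and the response field names (`currentBidAmount`, `initialAmount`, `nextMinimalBid`, `bidsCount`, `endDate`) come from this summary's data-flow diagram, but the exact query text, variable names, and response envelope (`data.lot`) are assumptions.

```python
import json
import urllib.request
from datetime import datetime, timezone

GRAPHQL_ENDPOINT = "https://storefront.tbauctions.com/storefront/graphql"

# Hypothetical query text; the real query lives in src/graphql_client.py.
LOT_BIDDING_QUERY = """
query LotBiddingData($displayId: String!) {
  lot(displayId: $displayId) {
    currentBidAmount
    initialAmount
    nextMinimalBid
    bidsCount
    endDate
    startDate
    biddingStatus
  }
}
"""


def cents_to_eur(cents):
    """Format an integer cent amount as 'EUR 50.00' (ASCII-safe on Windows)."""
    if cents is None:
        return None
    return f"EUR {cents / 100:.2f}"


def unix_to_closing_time(seconds):
    """The API returns Unix timestamps in seconds (not milliseconds)."""
    if seconds is None:
        return ""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S")


def parse_bidding_payload(lot):
    """Map GraphQL response fields onto the lots-table columns."""
    return {
        "current_bid": cents_to_eur(lot.get("currentBidAmount")),
        "starting_bid": cents_to_eur(lot.get("initialAmount")),
        "minimum_bid": cents_to_eur(lot.get("nextMinimalBid")),
        "bid_count": lot.get("bidsCount", 0),
        "closing_time": unix_to_closing_time(lot.get("endDate")),
    }


def fetch_bidding_data(display_id, timeout=10):
    """POST the query to the storefront endpoint (no authentication needed)."""
    body = json.dumps({
        "query": LOT_BIDDING_QUERY,
        "variables": {"displayId": display_id},
    }).encode()
    req = urllib.request.Request(
        GRAPHQL_ENDPOINT, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_bidding_payload(json.load(resp)["data"]["lot"])
```

Keeping the cents-to-`EUR` and seconds-to-timestamp conversions in pure helper functions makes the mapping testable without network access; the scraper can call `fetch_bidding_data()` once per lot, staying within the existing `RATE_LIMIT_SECONDS = 0.5` pacing.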