Scaev Scraper Refactoring Summary

Date: 2025-12-07

Objectives Completed

1. Image Download Integration

  • Changed: Enabled DOWNLOAD_IMAGES = True in config.py and docker-compose.yml
  • Added: Unique constraint on images(lot_id, url) to prevent duplicates
  • Added: Automatic duplicate cleanup migration in cache.py
  • Result: Images are now downloaded to /mnt/okcomputer/output/images/{lot_id}/ and marked as downloaded=1
  • Impact: Eliminates 57M+ duplicate image downloads by the monitor app

2. Data Completeness Fix

  • Problem: 99.9% of lots missing closing_time, 100% missing bid data
  • Root Cause: Troostwijk loads bid and timing data dynamically through a GraphQL API; it is not present in the page HTML
  • Solution: Added GraphQL client to fetch real-time bidding data

Key Changes

New Files

  1. src/graphql_client.py - GraphQL API client for fetching lot bidding data
    • Endpoint: https://storefront.tbauctions.com/storefront/graphql
    • Fetches: current_bid, starting_bid, minimum_bid, bid_count, closing_time
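
A minimal sketch of how a request to this endpoint could be assembled, using only the standard library. It is not the contents of src/graphql_client.py: the function names are hypothetical, and the query text and variables are left as parameters because the real query document is not reproduced in this summary. The only facts taken from above are the endpoint URL and that no authentication is needed.

```python
import json
import urllib.request

GRAPHQL_ENDPOINT = "https://storefront.tbauctions.com/storefront/graphql"


def build_graphql_request(query: str, variables: dict) -> urllib.request.Request:
    """Build a POST request carrying a GraphQL query and its variables as JSON."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def fetch_graphql(query: str, variables: dict, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response (hypothetical helper)."""
    request = build_graphql_request(query, variables)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload easy to inspect in tests without contacting the API.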

Modified Files

  1. src/config.py:22 - DOWNLOAD_IMAGES = True
  2. docker-compose.yml:13 - DOWNLOAD_IMAGES: "True"
  3. src/cache.py
    • Added unique index on images(lot_id, url)
    • Added columns starting_bid, minimum_bid to lots table
    • Added migration to clean duplicates and add missing columns
  4. src/scraper.py
    • Integrated GraphQL API calls for each lot
    • Fetches real-time bidding data after parsing HTML
    • Removed Unicode characters that caused encoding issues on Windows

Database Schema Updates

lots table - New Columns

ALTER TABLE lots ADD COLUMN starting_bid TEXT;
ALTER TABLE lots ADD COLUMN minimum_bid TEXT;

images table - New Index

CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);
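
The schema changes above can be sketched as a single re-runnable migration. This is an illustration, not the code in src/cache.py: the function name `migrate` is hypothetical, and it assumes the images table has an integer `id` column (as used in the verification queries) and that the oldest row of each duplicate group is the one to keep.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Add missing bid columns, remove duplicate image rows, then add the unique index."""
    # Add each column only if it is missing, so the migration can be run repeatedly.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(lots)")}
    for column in ("starting_bid", "minimum_bid"):
        if column not in existing:
            conn.execute(f"ALTER TABLE lots ADD COLUMN {column} TEXT")

    # Keep the lowest id per (lot_id, url) pair and delete the rest.
    # Assumption: which duplicate survives does not matter.
    conn.execute(
        """
        DELETE FROM images
        WHERE id NOT IN (SELECT MIN(id) FROM images GROUP BY lot_id, url)
        """
    )

    # The index can only be created once no duplicates remain.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_lot_url ON images(lot_id, url)"
    )
    conn.commit()
```

If duplicate rows can differ in their `downloaded` flag, keeping the lowest id may discard a row marked as downloaded; a real migration might prefer the downloaded row instead.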

Data Flow (New Architecture)

┌────────────────────────────────────────────────────┐
│ Phase 3: Scrape Lot Page                           │
└────────────────────────────────────────────────────┘
         │
         ├─▶ Parse HTML (__NEXT_DATA__)
         │    └─▶ Extract: title, location, images, description
         │
         ├─▶ Fetch GraphQL API
         │    └─▶ Query: LotBiddingData(lot_display_id)
         │         └─▶ Returns:
         │              - currentBidAmount (cents)
         │              - initialAmount (starting_bid)
         │              - nextMinimalBid (minimum_bid)
         │              - bidsCount
         │              - endDate (Unix timestamp)
         │              - startDate
         │              - biddingStatus
         │
         └─▶ Save to Database
              - lots table: complete bid & timing data
              - images table: deduplicated URLs
              - Download images immediately
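
The mapping from the GraphQL fields in the diagram to the lots columns might look like the sketch below. Several things here are assumptions rather than facts from the scraper: the function names are hypothetical, the response is treated as a flat dict, all three amounts are assumed to be integer cents (the diagram only says so for currentBidAmount), and closing times are rendered in UTC. A missing endDate becomes an empty string, matching the `closing_time != ''` check used in the verification queries.

```python
from datetime import datetime, timezone


def format_eur(cents):
    """Render an integer amount in cents as e.g. 'EUR 50.00'; None stays None."""
    if cents is None:
        return None
    return f"EUR {cents // 100}.{cents % 100:02d}"


def bidding_to_lot_fields(bidding: dict) -> dict:
    """Translate GraphQL bidding fields into values for the lots table."""
    end_date = bidding.get("endDate")  # Unix timestamp in seconds
    if end_date is None:
        closing_time = ""
    else:
        # Assumption: store closing times as UTC wall-clock strings.
        closing_time = datetime.fromtimestamp(end_date, tz=timezone.utc).strftime(
            "%Y-%m-%d %H:%M:%S"
        )
    return {
        "current_bid": format_eur(bidding.get("currentBidAmount")),
        "starting_bid": format_eur(bidding.get("initialAmount")),
        "minimum_bid": format_eur(bidding.get("nextMinimalBid")),
        "bid_count": bidding.get("bidsCount") or 0,
        "closing_time": closing_time,
    }
```

Integer arithmetic is used for the amounts so that formatting never depends on floating-point rounding.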

Testing Results

Test Lot: A1-28505-5

Current Bid:    EUR 50.00  ✅
Starting Bid:   EUR 50.00  ✅
Minimum Bid:    EUR 55.00  ✅
Bid Count:      1          ✅
Closing Time:   2025-12-16 19:10:00  ✅
Images:         Downloaded 2  ✅

Deployment Checklist

  • Enable DOWNLOAD_IMAGES in config
  • Update docker-compose environment
  • Add GraphQL client
  • Update scraper integration
  • Add database migrations
  • Test with live lot
  • Deploy to production
  • Run full scrape to populate data
  • Verify monitor app sees downloaded images

Post-Deployment Verification

Check Data Quality

-- Bid data completeness
SELECT
    COUNT(*) as total,
    SUM(CASE WHEN closing_time != '' THEN 1 ELSE 0 END) as has_closing,
    SUM(CASE WHEN bid_count > 0 THEN 1 ELSE 0 END) as has_bids,
    SUM(CASE WHEN starting_bid IS NOT NULL THEN 1 ELSE 0 END) as has_starting_bid
FROM lots
WHERE scraped_at > datetime('now', '-1 hour');

-- Image download rate
SELECT
    COUNT(*) as total,
    SUM(downloaded) as downloaded,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
FROM images
WHERE id IN (
    SELECT i.id FROM images i
    JOIN lots l ON i.lot_id = l.lot_id
    WHERE l.scraped_at > datetime('now', '-1 hour')
);

-- Duplicate check (should be 0)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;

Notes

  • GraphQL API requires no authentication
  • API rate limits: handled by existing RATE_LIMIT_SECONDS = 0.5
  • Currency format: Changed from € to EUR for Windows compatibility
  • Timestamps: API returns Unix timestamps in seconds (not milliseconds)
  • Existing data: Lots scraped before this change still lack bid and timing data; a re-scrape is required to populate them
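
As an illustration of the rate-limit note, a delay of RATE_LIMIT_SECONDS between consecutive API calls could be enforced with a small throttle like the one below. This is a sketch, not the scraper's actual mechanism: the `Throttle` class is hypothetical, and only the setting name and its value of 0.5 come from the notes above.

```python
import time

RATE_LIMIT_SECONDS = 0.5


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=RATE_LIMIT_SECONDS, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None

    def wait(self) -> float:
        """Sleep just long enough to respect min_interval; return the delay applied."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `wait()` before each GraphQL request would keep the combined HTML and API traffic under the same limit, since time already spent parsing counts toward the interval.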