4.7 KiB
4.7 KiB
Scaev Scraper Refactoring Summary
Date: 2025-12-07
Objectives Completed
1. Image Download Integration ✅
- Changed: Enabled
DOWNLOAD_IMAGES = Trueinconfig.pyanddocker-compose.yml - Added: Unique constraint on
images(lot_id, url)to prevent duplicates - Added: Automatic duplicate cleanup migration in
cache.py - Result: Images are now downloaded to
/mnt/okcomputer/output/images/{lot_id}/and marked asdownloaded=1 - Impact: Eliminates 57M+ duplicate image downloads by monitor app
2. Data Completeness Fix ✅
- Problem: 99.9% of lots missing closing_time, 100% missing bid data
- Root Cause: Troostwijk loads bid/timing data dynamically via GraphQL API, not in HTML
- Solution: Added GraphQL client to fetch real-time bidding data
Key Changes
New Files
- src/graphql_client.py - GraphQL API client for fetching lot bidding data
- Endpoint:
https://storefront.tbauctions.com/storefront/graphql - Fetches: current_bid, starting_bid, minimum_bid, bid_count, closing_time
- Endpoint:
Modified Files
- src/config.py:22 -
DOWNLOAD_IMAGES = True - docker-compose.yml:13 -
DOWNLOAD_IMAGES: "True" - src/cache.py
- Added unique index on
images(lot_id, url) - Added columns
starting_bid,minimum_bidtolotstable - Added migration to clean duplicates and add missing columns
- Added unique index on
- src/scraper.py
- Integrated GraphQL API calls for each lot
- Fetches real-time bidding data after parsing HTML
- Removed unicode characters causing Windows encoding issues
Database Schema Updates
lots table - New Columns
ALTER TABLE lots ADD COLUMN starting_bid TEXT;
ALTER TABLE lots ADD COLUMN minimum_bid TEXT;
images table - New Index
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);
Data Flow (New Architecture)
┌────────────────────────────────────────────────────┐
│ Phase 3: Scrape Lot Page │
└────────────────────────────────────────────────────┘
│
├─▶ Parse HTML (__NEXT_DATA__)
│ └─▶ Extract: title, location, images, description
│
├─▶ Fetch GraphQL API
│ └─▶ Query: LotBiddingData(lot_display_id)
│ └─▶ Returns:
│ - currentBidAmount (cents)
│ - initialAmount (starting_bid)
│ - nextMinimalBid (minimum_bid)
│ - bidsCount
│ - endDate (Unix timestamp)
│ - startDate
│ - biddingStatus
│
└─▶ Save to Database
- lots table: complete bid & timing data
- images table: deduplicated URLs
- Download images immediately
Testing Results
Test Lot: A1-28505-5
Current Bid: EUR 50.00 ✅
Starting Bid: EUR 50.00 ✅
Minimum Bid: EUR 55.00 ✅
Bid Count: 1 ✅
Closing Time: 2025-12-16 19:10:00 ✅
Images: Downloaded 2 ✅
Deployment Checklist
- Enable DOWNLOAD_IMAGES in config
- Update docker-compose environment
- Add GraphQL client
- Update scraper integration
- Add database migrations
- Test with live lot
- Deploy to production
- Run full scrape to populate data
- Verify monitor app sees downloaded images
Post-Deployment Verification
Check Data Quality
-- Bid data completeness
SELECT
COUNT(*) as total,
SUM(CASE WHEN closing_time != '' THEN 1 ELSE 0 END) as has_closing,
SUM(CASE WHEN bid_count > 0 THEN 1 ELSE 0 END) as has_bids,
SUM(CASE WHEN starting_bid IS NOT NULL THEN 1 ELSE 0 END) as has_starting_bid
FROM lots
WHERE scraped_at > datetime('now', '-1 hour');
-- Image download rate
SELECT
COUNT(*) as total,
SUM(downloaded) as downloaded,
ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
FROM images
WHERE id IN (
SELECT i.id FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE l.scraped_at > datetime('now', '-1 hour')
);
-- Duplicate check (should be 0)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;
Notes
- GraphQL API requires no authentication
- API rate limits: handled by existing
RATE_LIMIT_SECONDS = 0.5 - Currency format: Changed from € to EUR for Windows compatibility
- Timestamps: API returns Unix timestamps in seconds (not milliseconds)
- Existing data: Old lots still have missing data; re-scrape required to populate