GraphQL integrate, data correctness
This commit is contained in:
140
REFACTORING_SUMMARY.md
Normal file
140
REFACTORING_SUMMARY.md
Normal file
@@ -0,0 +1,140 @@
|
||||
# Scaev Scraper Refactoring Summary
|
||||
|
||||
## Date: 2025-12-07
|
||||
|
||||
## Objectives Completed
|
||||
|
||||
### 1. Image Download Integration ✅
|
||||
- **Changed**: Enabled `DOWNLOAD_IMAGES = True` in `config.py` and `docker-compose.yml`
|
||||
- **Added**: Unique constraint on `images(lot_id, url)` to prevent duplicates
|
||||
- **Added**: Automatic duplicate cleanup migration in `cache.py`
|
||||
- **Result**: Images are now downloaded to `/mnt/okcomputer/output/images/{lot_id}/` and marked as `downloaded=1`
|
||||
- **Impact**: Eliminates 57M+ duplicate image downloads by monitor app
|
||||
|
||||
### 2. Data Completeness Fix ✅
|
||||
- **Problem**: 99.9% of lots missing closing_time, 100% missing bid data
|
||||
- **Root Cause**: Troostwijk loads bid/timing data dynamically via GraphQL API, not in HTML
|
||||
- **Solution**: Added GraphQL client to fetch real-time bidding data
|
||||
|
||||
## Key Changes
|
||||
|
||||
### New Files
|
||||
1. **src/graphql_client.py** - GraphQL API client for fetching lot bidding data
|
||||
- Endpoint: `https://storefront.tbauctions.com/storefront/graphql`
|
||||
- Fetches: current_bid, starting_bid, minimum_bid, bid_count, closing_time
|
||||
|
||||
### Modified Files
|
||||
1. **src/config.py:22** - `DOWNLOAD_IMAGES = True`
|
||||
2. **docker-compose.yml:13** - `DOWNLOAD_IMAGES: "True"`
|
||||
3. **src/cache.py**
|
||||
- Added unique index on `images(lot_id, url)`
|
||||
- Added columns `starting_bid`, `minimum_bid` to `lots` table
|
||||
- Added migration to clean duplicates and add missing columns
|
||||
4. **src/scraper.py**
|
||||
- Integrated GraphQL API calls for each lot
|
||||
- Fetches real-time bidding data after parsing HTML
|
||||
- Removed unicode characters causing Windows encoding issues
|
||||
|
||||
## Database Schema Updates
|
||||
|
||||
### lots table - New Columns
|
||||
```sql
|
||||
ALTER TABLE lots ADD COLUMN starting_bid TEXT;
|
||||
ALTER TABLE lots ADD COLUMN minimum_bid TEXT;
|
||||
```
|
||||
|
||||
### images table - New Index
|
||||
```sql
|
||||
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);
|
||||
```
|
||||
|
||||
## Data Flow (New Architecture)
|
||||
|
||||
```
|
||||
┌────────────────────────────────────────────────────┐
|
||||
│ Phase 3: Scrape Lot Page │
|
||||
└────────────────────────────────────────────────────┘
|
||||
│
|
||||
├─▶ Parse HTML (__NEXT_DATA__)
|
||||
│ └─▶ Extract: title, location, images, description
|
||||
│
|
||||
├─▶ Fetch GraphQL API
|
||||
│ └─▶ Query: LotBiddingData(lot_display_id)
|
||||
│ └─▶ Returns:
|
||||
│ - currentBidAmount (cents)
|
||||
│ - initialAmount (starting_bid)
|
||||
│ - nextMinimalBid (minimum_bid)
|
||||
│ - bidsCount
|
||||
│ - endDate (Unix timestamp)
|
||||
│ - startDate
|
||||
│ - biddingStatus
|
||||
│
|
||||
└─▶ Save to Database
|
||||
- lots table: complete bid & timing data
|
||||
- images table: deduplicated URLs
|
||||
- Download images immediately
|
||||
```
|
||||
|
||||
## Testing Results
|
||||
|
||||
### Test Lot: A1-28505-5
|
||||
```
|
||||
Current Bid: EUR 50.00 ✅
|
||||
Starting Bid: EUR 50.00 ✅
|
||||
Minimum Bid: EUR 55.00 ✅
|
||||
Bid Count: 1 ✅
|
||||
Closing Time: 2025-12-16 19:10:00 ✅
|
||||
Images: Downloaded 2 ✅
|
||||
```
|
||||
|
||||
## Deployment Checklist
|
||||
|
||||
- [x] Enable DOWNLOAD_IMAGES in config
|
||||
- [x] Update docker-compose environment
|
||||
- [x] Add GraphQL client
|
||||
- [x] Update scraper integration
|
||||
- [x] Add database migrations
|
||||
- [x] Test with live lot
|
||||
- [ ] Deploy to production
|
||||
- [ ] Run full scrape to populate data
|
||||
- [ ] Verify monitor app sees downloaded images
|
||||
|
||||
## Post-Deployment Verification
|
||||
|
||||
### Check Data Quality
|
||||
```sql
|
||||
-- Bid data completeness
|
||||
SELECT
|
||||
COUNT(*) as total,
|
||||
SUM(CASE WHEN closing_time != '' THEN 1 ELSE 0 END) as has_closing,
|
||||
SUM(CASE WHEN bid_count > 0 THEN 1 ELSE 0 END) as has_bids,
|
||||
SUM(CASE WHEN starting_bid IS NOT NULL THEN 1 ELSE 0 END) as has_starting_bid
|
||||
FROM lots
|
||||
WHERE scraped_at > datetime('now', '-1 hour');
|
||||
|
||||
-- Image download rate
|
||||
SELECT
|
||||
COUNT(*) as total,
|
||||
SUM(downloaded) as downloaded,
|
||||
ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
|
||||
FROM images
|
||||
WHERE id IN (
|
||||
SELECT i.id FROM images i
|
||||
JOIN lots l ON i.lot_id = l.lot_id
|
||||
WHERE l.scraped_at > datetime('now', '-1 hour')
|
||||
);
|
||||
|
||||
-- Duplicate check (should be 0)
|
||||
SELECT lot_id, url, COUNT(*) as dup_count
|
||||
FROM images
|
||||
GROUP BY lot_id, url
|
||||
HAVING COUNT(*) > 1;
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- GraphQL API requires no authentication
|
||||
- API rate limits: handled by existing `RATE_LIMIT_SECONDS = 0.5`
|
||||
- Currency format: Changed from € to EUR for Windows compatibility
|
||||
- Timestamps: API returns Unix timestamps in seconds (not milliseconds)
|
||||
- Existing data: Old lots still have missing data; re-scrape required to populate
|
||||
Reference in New Issue
Block a user