enrich data

Tour
2025-12-07 01:26:48 +01:00
parent bb7f4bbe9d
commit d09ee5574f
14 changed files with 1221 additions and 7 deletions

@@ -0,0 +1,143 @@
# Comprehensive Data Enrichment Plan
## Current Status: Working Features
✅ Image downloads (concurrent)
✅ Basic bid data (current_bid, starting_bid, minimum_bid, bid_count, closing_time)
✅ Status extraction
✅ Brand/Model from attributes
✅ Attributes JSON storage
## Phase 1: Core Bidding Intelligence (HIGH PRIORITY)
### Data Sources Identified:
1. **GraphQL lot bidding API** - Already integrated
   - `currentBidAmount`, `initialAmount`, `bidsCount`
   - `startDate`, `endDate` (for first_bid_time calculation)
2. **REST bid history API** ✨ NEW DISCOVERY
   - Endpoint: `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
   - Returns: bid amounts, timestamps, autobid flags, bidder IDs
   - Pagination supported
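
A minimal sketch of paging through the bid history endpoint listed above. Only the endpoint URL comes from this plan. The query parameter names (`pageNumber`, `pageSize`) and the response keys (`results`, `hasNext`) are assumptions and must be checked against a real response before use. The HTTP call is injected as `get_json` so the paging logic can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

BID_HISTORY_URL = (
    "https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history"
)


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (plain stdlib, no retries)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_bid_history(
    lot_uuid: str,
    get_json: Callable[[str], dict] = http_get_json,
    page_size: int = 100,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield raw bid records for one lot, page by page.

    ASSUMPTIONS (hypothetical, not confirmed by the plan):
    - the query parameters are named `pageNumber` and `pageSize`
    - each page is an object with a `results` list and a `hasNext` flag
    """
    base = BID_HISTORY_URL.format(lot_uuid=urllib.parse.quote(lot_uuid, safe=""))
    # max_pages is a safety cap so a misbehaving API cannot loop forever.
    for page_number in range(1, max_pages + 1):
        query = urllib.parse.urlencode(
            {"pageNumber": page_number, "pageSize": page_size}
        )
        page = get_json(f"{base}?{query}")
        yield from page.get("results", [])
        if not page.get("hasNext"):
            break
```

Injecting `get_json` also leaves room to swap in whatever HTTP client and retry policy the scraper already uses.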
### Database Schema Changes:
```sql
-- Extend lots table with bidding intelligence
ALTER TABLE lots ADD COLUMN estimated_min DECIMAL(12,2);
ALTER TABLE lots ADD COLUMN estimated_max DECIMAL(12,2);
ALTER TABLE lots ADD COLUMN reserve_price DECIMAL(12,2);
ALTER TABLE lots ADD COLUMN reserve_met BOOLEAN DEFAULT FALSE;
ALTER TABLE lots ADD COLUMN bid_increment DECIMAL(12,2);
ALTER TABLE lots ADD COLUMN watch_count INTEGER DEFAULT 0;
ALTER TABLE lots ADD COLUMN first_bid_time TEXT;
ALTER TABLE lots ADD COLUMN last_bid_time TEXT;
ALTER TABLE lots ADD COLUMN bid_velocity DECIMAL(5,2);
-- NEW: Bid history table
CREATE TABLE IF NOT EXISTS bid_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
lot_id TEXT NOT NULL,
lot_uuid TEXT NOT NULL,
bid_amount DECIMAL(12,2) NOT NULL,
bid_time TEXT NOT NULL,
is_winning BOOLEAN DEFAULT FALSE,
is_autobid BOOLEAN DEFAULT FALSE,
bidder_id TEXT,
bidder_number INTEGER,
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
CREATE INDEX IF NOT EXISTS idx_bid_history_lot_time ON bid_history(lot_id, bid_time);
CREATE INDEX IF NOT EXISTS idx_bid_history_bidder ON bid_history(bidder_id);
```
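
The `AUTOINCREMENT` keyword suggests the target database is SQLite (an assumption). SQLite's `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form, so rerunning the statements above would fail with a duplicate-column error. One possible way to make the migration repeatable is to check `PRAGMA table_info` first. The helper name `add_column_if_missing` is hypothetical.

```python
import sqlite3


def add_column_if_missing(
    conn: sqlite3.Connection, table: str, column: str, definition: str
) -> bool:
    """Add `column` to `table` unless it already exists.

    Returns True when the column was added, False when it was already there.
    Table and column names are interpolated into the SQL text, so they must
    come from trusted constants, never from scraped data.
    """
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
    return True


# A few of the Phase 1 columns from the schema above.
PHASE1_LOT_COLUMNS = [
    ("estimated_min", "DECIMAL(12,2)"),
    ("estimated_max", "DECIMAL(12,2)"),
    ("watch_count", "INTEGER DEFAULT 0"),
    ("first_bid_time", "TEXT"),
    ("last_bid_time", "TEXT"),
]


def migrate_lots(conn: sqlite3.Connection) -> list:
    """Apply the column additions and return the names actually added."""
    added = [
        name
        for name, definition in PHASE1_LOT_COLUMNS
        if add_column_if_missing(conn, "lots", name, definition)
    ]
    conn.commit()
    return added
```

Running `migrate_lots` a second time returns an empty list instead of raising, which keeps the scraper's startup path safe to repeat.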
### Implementation:
- Add `fetch_bid_history()` function to call REST API
- Parse and store all historical bids
- Calculate bid_velocity (bids per hour)
- Extract first_bid_time, last_bid_time
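
A minimal sketch of the derived fields, assuming bid timestamps arrive as ISO 8601 strings in one consistent format. The plan does not say which time window "bids per hour" uses; this sketch assumes the span between the first and last bid, and that choice is an assumption. The function name `summarize_bid_times` is hypothetical.

```python
from datetime import datetime
from typing import Sequence


def summarize_bid_times(bid_times: Sequence[str]) -> dict:
    """Derive first_bid_time, last_bid_time and bid_velocity from timestamps.

    ASSUMPTIONS: timestamps are ISO 8601 strings that
    datetime.fromisoformat() accepts (a trailing "Z" is only accepted from
    Python 3.11 on), and velocity is bids divided by the hours between the
    first and the last bid.
    """
    if not bid_times:
        return {"first_bid_time": None, "last_bid_time": None, "bid_velocity": None}

    parsed = sorted(datetime.fromisoformat(value) for value in bid_times)
    first, last = parsed[0], parsed[-1]
    hours = (last - first).total_seconds() / 3600

    # With a single bid (or identical timestamps) the window is zero hours,
    # so a rate is undefined; store NULL rather than divide by zero.
    velocity = round(len(parsed) / hours, 2) if hours > 0 else None

    return {
        "first_bid_time": first.isoformat(),
        "last_bid_time": last.isoformat(),
        "bid_velocity": velocity,
    }
```

If velocity should instead be measured from the auction's `startDate`, only the `hours` calculation needs to change.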
## Phase 2: Valuation Intelligence
### Data Sources:
1. **Attributes array** (already in `__NEXT_DATA__`)
   - condition, year, manufacturer, model, serial_number
2. **Description field**
   - Extract year patterns, condition mentions, damage descriptions
### Database Schema:
```sql
-- Valuation fields
ALTER TABLE lots ADD COLUMN condition_score DECIMAL(4,2); -- 0-10 scale; DECIMAL(3,2) cannot hold 10.00
ALTER TABLE lots ADD COLUMN condition_description TEXT;
ALTER TABLE lots ADD COLUMN year_manufactured INTEGER;
ALTER TABLE lots ADD COLUMN serial_number TEXT;
ALTER TABLE lots ADD COLUMN manufacturer TEXT;
ALTER TABLE lots ADD COLUMN damage_description TEXT;
ALTER TABLE lots ADD COLUMN provenance TEXT;
```
### Implementation:
- Parse attributes for: Jaar (year), Conditie (condition), Serienummer (serial number), Fabrikant (manufacturer)
- Extract 4-digit years from title/description
- Map condition values to 0-10 scale
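
The three steps above could be sketched as follows. Several things here are assumptions: that each attribute is a dict with `name` and `value` keys, and in particular the Dutch condition labels and their scores in `CONDITION_SCORES`, which are placeholders until the real values are sampled from scraped lots. All function and constant names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Attribute names from the plan, mapped to the new lots columns.
ATTRIBUTE_COLUMNS = {
    "Jaar": "year_manufactured",
    "Conditie": "condition_description",
    "Serienummer": "serial_number",
    "Fabrikant": "manufacturer",
}

# ASSUMPTION: placeholder labels and scores on the 0-10 scale.
CONDITION_SCORES = {
    "nieuw": 10.0,
    "zo goed als nieuw": 9.0,
    "gebruikt": 6.0,
    "defect": 2.0,
}

YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(text: str, max_year: Optional[int] = None) -> Optional[int]:
    """Return the first plausible 4-digit year in `text`, or None."""
    limit = max_year if max_year is not None else date.today().year + 1
    for match in YEAR_PATTERN.finditer(text):
        year = int(match.group())
        if 1900 <= year <= limit:
            return year
    return None


def enrich_from_attributes(
    attributes: list, title: str = "", max_year: Optional[int] = None
) -> dict:
    """Build the Phase 2 column values from attributes, falling back to the title."""
    fields: dict = {}
    for attr in attributes:
        column = ATTRIBUTE_COLUMNS.get(str(attr.get("name", "")).strip())
        value = str(attr.get("value") or "").strip()
        if column and value:
            fields[column] = value

    # Prefer the explicit Jaar attribute; otherwise look for a year in the title.
    year_text = fields.pop("year_manufactured", "")
    fields["year_manufactured"] = extract_year(year_text, max_year) or extract_year(
        title, max_year
    )

    condition = fields.get("condition_description", "").lower()
    fields["condition_score"] = CONDITION_SCORES.get(condition)
    return fields
```

Year extraction from free text will produce false positives for model numbers such as "2000", so the attribute value is preferred whenever it exists.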
## Phase 3: Auction House Intelligence
### Data Sources:
1. **GraphQL auction query**
   - Already partially working
2. **Auction `__NEXT_DATA__`**
   - May contain buyer's premium, shipping costs
### Database Schema:
```sql
ALTER TABLE auctions ADD COLUMN buyers_premium_percent DECIMAL(5,2);
ALTER TABLE auctions ADD COLUMN shipping_available BOOLEAN;
ALTER TABLE auctions ADD COLUMN payment_methods TEXT;
```
## Viewing/Pickup Times Resolution
### Finding:
- `viewingDays` and `collectionDays` in GraphQL only return location (city, countryCode)
- Times are NOT in the GraphQL API
- Times must therefore come from the auction `__NEXT_DATA__`, or are simply not set, which is the case for many auctions
### Solution:
- Mark viewing_time/pickup_date as "location only" when times unavailable
- Store: "Nijmegen, NL" instead of full date/time string
- Accept that many auctions don't have viewing times set
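
A small sketch of the location-only fallback described above. The `city` and `countryCode` inputs correspond to the GraphQL fields named in the finding. The `time_text` parameter stands for whatever date/time string is found elsewhere, and its format is not specified in this plan. The function name is hypothetical.

```python
from typing import Optional


def viewing_or_pickup_value(
    city: Optional[str],
    country_code: Optional[str],
    time_text: Optional[str] = None,
) -> Optional[str]:
    """Return the value to store in viewing_time / pickup_date.

    Prefers a real date/time string when one was found; otherwise falls back
    to "City, CC"; returns None when nothing at all is known.
    """
    if time_text and time_text.strip():
        return time_text.strip()
    parts = [part.strip() for part in (city, country_code) if part and part.strip()]
    return ", ".join(parts) or None
```

For example, `viewing_or_pickup_value("Nijmegen", "NL")` yields the location-only string `"Nijmegen, NL"`.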
## Priority Implementation Order:
1. **BID HISTORY API** (30 min) - Highest value
   - Fetch and store all bid history
   - Calculate bid_velocity
   - Track autobid patterns
2. **ENRICHED ATTRIBUTES** (20 min) - Medium-high value
   - Extract year, condition, manufacturer from existing data
   - Parse description for damage/condition mentions
3. **VIEWING/PICKUP FIX** (10 min) - Low value (data often missing)
   - Update to store location-only when times unavailable
## Data Quality Expectations:
| Field | Coverage Expected | Source |
|-------|------------------|---------|
| bid_history | 100% (for lots with bids) | REST API |
| bid_velocity | 100% (calculated) | Derived |
| year_manufactured | ~40% | Attributes/Title |
| condition_score | ~30% | Attributes |
| manufacturer | ~60% | Attributes |
| viewing_time | ~20% | Often not set |
| buyers_premium_percent | 100% | GraphQL/Props |
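
One possible way to compare these expectations with reality after a scrape is to measure the share of non-NULL values per column. This is a sketch assuming a SQLite database. The function name `column_coverage` is hypothetical, and the denominator here is all rows, so rows such as `bid_velocity` (only meaningful for lots with bids) would need a filtered query instead.

```python
import sqlite3
from typing import Dict, Iterable


def column_coverage(
    conn: sqlite3.Connection, table: str, columns: Iterable[str]
) -> Dict[str, float]:
    """Return the percentage of rows with a non-NULL value, per column.

    Table and column names are interpolated into the SQL text, so pass
    trusted constants only. Empty strings count as filled.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    coverage = {}
    for column in columns:
        # COUNT(column) skips NULLs, unlike COUNT(*).
        filled = conn.execute(f"SELECT COUNT({column}) FROM {table}").fetchone()[0]
        coverage[column] = round(100.0 * filled / total, 1) if total else 0.0
    return coverage
```

Calling `column_coverage(conn, "lots", ["year_manufactured", "condition_score", "manufacturer"])` gives figures that can be set beside the expected percentages in the table.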
## Estimated Total Implementation Time: 60-90 minutes