enrich data
@@ -43,22 +43,29 @@ The scraper follows a **3-phase hierarchical crawling pattern** to extract aucti
                                  │
                                  ▼
 ┌─────────────────────────────────────────────────────────────────┐
-│                   PHASE 3: SCRAPE LOT DETAILS                   │
+│          PHASE 3: SCRAPE LOT DETAILS + API ENRICHMENT           │
 │  ┌──────────────┐         ┌──────────────┐                      │
 │  │   Lot Page   │────────▶│    Parse     │                      │
 │  │    /l/...    │         │ __NEXT_DATA__│                      │
 │  └──────────────┘         │     JSON     │                      │
 │                           └──────────────┘                      │
 │                                  │                              │
-│        ┌─────────────────────────┴─────────────────┐            │
-│        ▼                                           ▼            │
-│ ┌──────────────┐                            ┌──────────────┐    │
-│ │   Save Lot   │                            │ Save Images  │    │
-│ │   Details    │                            │  URLs to DB  │    │
-│ │    to DB     │                            └──────────────┘    │
-│ └──────────────┘                                   │            │
-│                                                    ▼            │
-│                                           [Optional Download]   │
+│        ┌─────────────────────────┼─────────────────┐            │
+│        ▼                         ▼                 ▼            │
+│ ┌──────────────┐          ┌──────────────┐  ┌──────────────┐    │
+│ │ GraphQL API  │          │ Bid History  │  │ Save Images  │    │
+│ │ (Bidding +   │          │   REST API   │  │  URLs to DB  │    │
+│ │ Enrichment)  │          │  (per lot)   │  └──────────────┘    │
+│ └──────────────┘          └──────────────┘         │            │
+│         │                       │                  ▼            │
+│         └──────────┬────────────┘         [Optional Download    │
+│                    ▼                       Concurrent per Lot]  │
+│             ┌──────────────┐                                    │
+│             │ Save to DB:  │                                    │
+│             │ - Lot data   │                                    │
+│             │ - Bid data   │                                    │
+│             │ - Enrichment │                                    │
+│             └──────────────┘                                    │
 └─────────────────────────────────────────────────────────────────┘
 ```

@@ -90,22 +97,51 @@ The scraper follows a **3-phase hierarchical crawling pattern** to extract aucti
 └──────────────────────────────────────────────────────────────────┘

 ┌──────────────────────────────────────────────────────────────────┐
-│                            LOTS TABLE                            │
+│            LOTS TABLE (Core + Enriched Intelligence)             │
 ├──────────────────────────────────────────────────────────────────┤
 │  lots                                                            │
 │  ├── lot_id (TEXT, PRIMARY KEY)    -- e.g. "A1-28505-5"          │
 │  ├── auction_id (TEXT)             -- FK to auctions             │
 │  ├── url (TEXT, UNIQUE)                                          │
 │  ├── title (TEXT)                                                │
-│  ├── current_bid (TEXT)            -- "€123.45" or "No bids"     │
-│  ├── bid_count (INTEGER)                                         │
-│  ├── closing_time (TEXT)                                         │
-│  ├── viewing_time (TEXT)                                         │
-│  ├── pickup_date (TEXT)                                          │
+│  │                                                               │
+│  ├─ BIDDING DATA (GraphQL API) ──────────────────────────────────┤
+│  ├── current_bid (TEXT)            -- Current bid amount         │
+│  ├── starting_bid (TEXT)           -- Initial/opening bid        │
+│  ├── minimum_bid (TEXT)            -- Next minimum bid           │
+│  ├── bid_count (INTEGER)           -- Number of bids             │
+│  ├── bid_increment (REAL)          -- Bid step size              │
+│  ├── closing_time (TEXT)           -- Lot end date               │
+│  ├── status (TEXT)                 -- Minimum bid status         │
+│  │                                                               │
+│  ├─ BID INTELLIGENCE (Calculated from bid_history) ──────────────┤
+│  ├── first_bid_time (TEXT)         -- First bid timestamp        │
+│  ├── last_bid_time (TEXT)          -- Latest bid timestamp       │
+│  ├── bid_velocity (REAL)           -- Bids per hour              │
+│  │                                                               │
+│  ├─ VALUATION & ATTRIBUTES (from __NEXT_DATA__) ─────────────────┤
+│  ├── brand (TEXT)                  -- Brand from attributes      │
+│  ├── model (TEXT)                  -- Model from attributes      │
+│  ├── manufacturer (TEXT)           -- Manufacturer name          │
+│  ├── year_manufactured (INTEGER)   -- Year extracted             │
+│  ├── condition_score (REAL)        -- 0-10 condition rating      │
+│  ├── condition_description (TEXT)  -- Condition text             │
+│  ├── serial_number (TEXT)          -- Serial/VIN number          │
+│  ├── damage_description (TEXT)     -- Damage notes               │
+│  ├── attributes_json (TEXT)        -- Full attributes JSON       │
+│  │                                                               │
+│  ├─ LEGACY/OTHER ────────────────────────────────────────────────┤
+│  ├── viewing_time (TEXT)           -- Viewing schedule           │
+│  ├── pickup_date (TEXT)            -- Pickup schedule            │
 │  ├── location (TEXT)               -- e.g. "Dongen, NL"          │
-│  ├── description (TEXT)                                          │
-│  ├── category (TEXT)                                             │
-│  └── scraped_at (TEXT)                                           │
+│  ├── description (TEXT)            -- Lot description            │
+│  ├── category (TEXT)               -- Lot category               │
+│  ├── sale_id (INTEGER)             -- Legacy field               │
+│  ├── type (TEXT)                   -- Legacy field               │
+│  ├── year (INTEGER)                -- Legacy field               │
+│  ├── currency (TEXT)               -- Currency code              │
+│  ├── closing_notified (INTEGER)    -- Notification flag          │
+│  └── scraped_at (TEXT)             -- Scrape timestamp           │
 │  FOREIGN KEY (auction_id) → auctions(auction_id)                 │
 └──────────────────────────────────────────────────────────────────┘

@@ -119,6 +155,24 @@ The scraper follows a **3-phase hierarchical crawling pattern** to extract aucti
 │  ├── local_path (TEXT)             -- Path after download        │
 │  └── downloaded (INTEGER)          -- 0=pending, 1=downloaded    │
 │  FOREIGN KEY (lot_id) → lots(lot_id)                             │
+│  UNIQUE INDEX idx_unique_lot_url ON (lot_id, url)                │
+└──────────────────────────────────────────────────────────────────┘
+
+┌──────────────────────────────────────────────────────────────────┐
+│    BID_HISTORY TABLE (Complete Bid Tracking for Intelligence)    │
+├──────────────────────────────────────────────────────────────────┤
+│  bid_history  ◀── REST API: /bidding-history                     │
+│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
+│  ├── lot_id (TEXT)                 -- FK to lots                 │
+│  ├── bid_amount (REAL)             -- Bid in EUR                 │
+│  ├── bid_time (TEXT)               -- ISO 8601 timestamp         │
+│  ├── is_autobid (INTEGER)          -- 0=manual, 1=autobid        │
+│  ├── bidder_id (TEXT)              -- Anonymized bidder UUID     │
+│  ├── bidder_number (INTEGER)       -- Bidder display number      │
+│  └── created_at (TEXT)             -- Record creation timestamp  │
+│  FOREIGN KEY (lot_id) → lots(lot_id)                             │
+│  INDEX idx_bid_history_lot ON (lot_id)                           │
+│  INDEX idx_bid_history_time ON (bid_time)                        │
 └──────────────────────────────────────────────────────────────────┘
 ```

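The `bid_history` box above can be read as SQLite DDL. The sketch below is a transcription of the diagram, not the project's actual migration code (which this commit does not show); the function name `create_bid_history_table` and the `IF NOT EXISTS` guards are assumptions made for illustration.

```python
import sqlite3

# Assumed DDL, transcribed column by column from the schema diagram.
BID_HISTORY_DDL = """
CREATE TABLE IF NOT EXISTS bid_history (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id        TEXT,     -- FK to lots
    bid_amount    REAL,     -- bid in EUR
    bid_time      TEXT,     -- ISO 8601 timestamp
    is_autobid    INTEGER,  -- 0=manual, 1=autobid
    bidder_id     TEXT,     -- anonymized bidder UUID
    bidder_number INTEGER,  -- bidder display number
    created_at    TEXT,     -- record creation timestamp
    FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
);
CREATE INDEX IF NOT EXISTS idx_bid_history_lot  ON bid_history(lot_id);
CREATE INDEX IF NOT EXISTS idx_bid_history_time ON bid_history(bid_time);
"""


def create_bid_history_table(conn: sqlite3.Connection) -> None:
    """Create the bid_history table and its two indexes (hypothetical helper)."""
    conn.executescript(BID_HISTORY_DDL)
```

Indexing `lot_id` and `bid_time` keeps per-lot lookups and time-range scans cheap, which is what the bid intelligence calculations would need.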
@@ -208,34 +262,72 @@ HTML Content
      └──▶ Fallback to HTML regex parsing (if JSON fails)
 ```

-### 3. **Image Handling**
+### 3. **API Enrichment Flow**
 ```
+Lot Page Scraped (__NEXT_DATA__ parsed)
+    │
+    ├──▶ Extract lot UUID from JSON
+    │
+    ├──▶ GraphQL API Call (fetch_lot_bidding_data)
+    │      └──▶ Returns: current_bid, starting_bid, minimum_bid,
+    │           bid_count, closing_time, status, bid_increment
+    │
+    ├──▶ [If bid_count > 0] REST API Call (fetch_bid_history)
+    │      │
+    │      ├──▶ Fetch all bid pages (paginated)
+    │      │
+    │      └──▶ Returns: Complete bid history with timestamps,
+    │           bidder_ids, autobid flags, amounts
+    │           │
+    │           ├──▶ INSERT INTO bid_history (multiple records)
+    │           │
+    │           └──▶ Calculate bid intelligence:
+    │                - first_bid_time (earliest timestamp)
+    │                - last_bid_time (latest timestamp)
+    │                - bid_velocity (bids per hour)
+    │
+    ├──▶ Extract enrichment from __NEXT_DATA__:
+    │      - Brand, model, manufacturer (from attributes)
+    │      - Year (regex from title/attributes)
+    │      - Condition (map to 0-10 score)
+    │      - Serial number, damage description
+    │
+    └──▶ INSERT/UPDATE lots table with all data
+```
+
+### 4. **Image Handling (Concurrent per Lot)**
+```
 Lot Page Parsed
     │
     ├──▶ Extract images[] from JSON
     │      │
-    │      └──▶ INSERT INTO images (lot_id, url, downloaded=0)
+    │      └──▶ INSERT OR IGNORE INTO images (lot_id, url, downloaded=0)
+    │           └──▶ Unique constraint prevents duplicates
     │
     └──▶ [If DOWNLOAD_IMAGES=True]
            │
-           ├──▶ Download each image
+           ├──▶ Create concurrent download tasks (asyncio.gather)
            │      │
+           │      ├──▶ All images for lot download in parallel
+           │      │    (No rate limiting between images in same lot)
+           │      │
            │      ├──▶ Save to: /images/{lot_id}/001.jpg
            │      │
            │      └──▶ UPDATE images SET local_path=?, downloaded=1
            │
-           └──▶ Rate limit between downloads (0.5s)
+           └──▶ Rate limit only between lots (0.5s)
+                (Not between images within a lot)
 ```

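The image flow above could look roughly like the following. This is a minimal sketch under stated assumptions: the `fetch` coroutine, both function names, and the `images_dir` parameter are hypothetical, and the `images` table is assumed to have at least the columns `lot_id`, `url`, `local_path` and `downloaded` plus the unique `(lot_id, url)` index shown in the schema.

```python
import asyncio
import sqlite3
from typing import Awaitable, Callable, List

# Hypothetical signature: an async callable that stores `url` at `path`.
Fetch = Callable[[str, str], Awaitable[None]]


def save_image_urls(conn: sqlite3.Connection, lot_id: str, urls: List[str]) -> None:
    """Queue image URLs; the unique (lot_id, url) index silently drops repeats."""
    conn.executemany(
        "INSERT OR IGNORE INTO images (lot_id, url, downloaded) VALUES (?, ?, 0)",
        [(lot_id, url) for url in urls],
    )
    conn.commit()


async def download_lot_images(
    conn: sqlite3.Connection, lot_id: str, fetch: Fetch, images_dir: str = "/images"
) -> List[str]:
    """Download all pending images of one lot concurrently, then mark them done."""
    rows = conn.execute(
        "SELECT url FROM images WHERE lot_id = ? AND downloaded = 0 ORDER BY rowid",
        (lot_id,),
    ).fetchall()
    urls = [row[0] for row in rows]
    paths = [f"{images_dir}/{lot_id}/{n:03d}.jpg" for n in range(1, len(urls) + 1)]
    # One task per image, awaited together: no delay between images of a lot.
    await asyncio.gather(*(fetch(url, path) for url, path in zip(urls, paths)))
    conn.executemany(
        "UPDATE images SET local_path = ?, downloaded = 1 WHERE lot_id = ? AND url = ?",
        [(path, lot_id, url) for url, path in zip(urls, paths)],
    )
    conn.commit()
    return paths
```

In this sketch `asyncio.gather` propagates the first failure, so no row is marked as downloaded if any image in the lot fails; a real implementation may prefer `return_exceptions=True` and per-image updates.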
 ## Key Configuration

-| Setting | Value | Purpose |
-|---------|-------|---------|
-| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
-| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
-| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
-| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
-| `MAX_PAGES` | `50` | Number of listing pages to crawl |
+| Setting              | Value                             | Purpose                          |
+|----------------------|-----------------------------------|----------------------------------|
+| `CACHE_DB`           | `/mnt/okcomputer/output/cache.db` | SQLite database path             |
+| `IMAGES_DIR`         | `/mnt/okcomputer/output/images`   | Downloaded images storage        |
+| `RATE_LIMIT_SECONDS` | `0.5`                             | Delay between requests           |
+| `DOWNLOAD_IMAGES`    | `False`                           | Toggle image downloading         |
+| `MAX_PAGES`          | `50`                              | Number of listing pages to crawl |

 ## Output Files

@@ -320,7 +412,120 @@ WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;

 ## Rate Limiting & Ethics

-- **REQUIRED**: 0.5 second delay between ALL requests
+- **REQUIRED**: 0.5 second delay between page requests (not between images)
 - **Respects cache**: Avoids unnecessary re-fetching
 - **User-Agent**: Identifies as standard browser
-- **No parallelization**: Single-threaded sequential crawling
+- **No parallelization**: Single-threaded sequential crawling for pages
+- **Image downloads**: Concurrent within each lot (16x speedup)
+
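One way to enforce the 0.5 second delay between page requests is a small limiter that sleeps only for the time still owed since the previous request. The class below is an illustrative sketch, not the scraper's actual code; the name `RateLimiter` and the injectable `clock` and `sleep` parameters are assumptions that make it testable without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between successive page requests."""

    def __init__(
        self,
        min_interval: float = 0.5,  # mirrors RATE_LIMIT_SECONDS
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` once before each page fetch, and not before individual image downloads, matches the policy listed above.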
+---
+
+## API Integration Architecture
+
+### GraphQL API
+**Endpoint:** `https://storefront.tbauctions.com/storefront/graphql`
+
+**Purpose:** Real-time bidding data and lot enrichment
+
+**Key Query:**
+```graphql
+query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
+  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
+    lot {
+      currentBidAmount { cents currency }
+      initialAmount { cents currency }
+      nextMinimalBid { cents currency }
+      nextBidStepInCents
+      bidsCount
+      followersCount              # Available - Watch count
+      startDate
+      endDate
+      minimumBidAmountMet
+      biddingStatus
+      condition
+      location { city countryCode }
+      categoryInformation { name path }
+      attributes { name value }
+    }
+    estimatedFullPrice {          # Available - Estimated value
+      min { cents currency }
+      max { cents currency }
+    }
+  }
+}
+```
+
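A GraphQL request is normally sent as a JSON body with `query` and `variables` keys, and the response mirrors the query shape under `data`. The sketch below shows how the query above might be wrapped and how the `{ cents currency }` objects could be converted; the function names and the mapping of API fields to `lots` columns are assumptions inferred from the field names, and valid `Platform` enum values are not listed in this document, so the platform is passed in by the caller.

```python
from decimal import Decimal
from typing import Any, Dict, Optional


def build_payload(query: str, lot_display_id: str, locale: str, platform: str) -> Dict[str, Any]:
    """Wrap the query and its three variables in a standard GraphQL request body."""
    return {
        "query": query,
        "variables": {
            "lotDisplayId": lot_display_id,
            "locale": locale,
            "platform": platform,  # a Platform enum value supplied by the caller
        },
    }


def cents_to_amount(money: Optional[Dict[str, Any]]) -> Optional[Decimal]:
    """Convert {"cents": 370000, "currency": "EUR"} to Decimal("3700")."""
    if not money:
        return None
    return Decimal(money["cents"]) / 100


def parse_bidding(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the bidding fields out of a LotBiddingData response (assumed mapping)."""
    lot = response["data"]["lotDetails"]["lot"]
    step = lot.get("nextBidStepInCents")
    return {
        "current_bid": cents_to_amount(lot.get("currentBidAmount")),
        "starting_bid": cents_to_amount(lot.get("initialAmount")),
        "minimum_bid": cents_to_amount(lot.get("nextMinimalBid")),
        "bid_increment": None if step is None else step / 100,
        "bid_count": lot.get("bidsCount") or 0,
        "closing_time": lot.get("endDate"),
    }
```

`Decimal` avoids float rounding on money values here; since the schema stores the bid columns as TEXT, the real code presumably formats them differently before saving.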
+**Currently Captured:**
+- ✅ Current bid, starting bid, minimum bid
+- ✅ Bid count and bid increment
+- ✅ Closing time and status
+- ✅ Brand, model, manufacturer (from attributes)
+
+**Available but Not Yet Captured:**
+- ⚠️ `followersCount` - Watch count for popularity analysis
+- ⚠️ `estimatedFullPrice` - Min/max estimated values
+- ⚠️ `biddingStatus` - More detailed status enum
+- ⚠️ `condition` - Direct condition field
+- ⚠️ `location` - City, country details
+- ⚠️ `categoryInformation` - Structured category
+
+### REST API - Bid History
+**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
+
+**Purpose:** Complete bid history for intelligence analysis
+
+**Parameters:**
+- `pageNumber` (starts at 1)
+- `pageSize` (default: 100)
+
+**Response Example:**
+```json
+{
+  "results": [
+    {
+      "buyerId": "uuid",         // Anonymized bidder ID
+      "buyerNumber": 4,          // Display number
+      "currentBid": {
+        "cents": 370000,
+        "currency": "EUR"
+      },
+      "autoBid": false,          // Is autobid
+      "negotiated": false,       // Was negotiated
+      "createdAt": "2025-12-05T04:53:56.763033Z"
+    }
+  ],
+  "hasNext": true,
+  "pageNumber": 1
+}
+```
+
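Given the `pageNumber`, `pageSize` and `hasNext` fields shown above, pagination can be sketched as a generator that keeps requesting pages until `hasNext` is false. The `fetch_page` callable is a hypothetical stand-in for the HTTP call, `max_pages` is an assumed safety cap, and the column order in `to_bid_row` follows the `bid_history` schema.

```python
from typing import Any, Callable, Dict, Iterator, Tuple

# Hypothetical: fetch_page(page_number, page_size) returns one decoded JSON page.
FetchPage = Callable[[int, int], Dict[str, Any]]


def iter_bid_records(
    fetch_page: FetchPage, page_size: int = 100, max_pages: int = 1000
) -> Iterator[Dict[str, Any]]:
    """Yield every bid record, following hasNext from page 1 onward."""
    for page_number in range(1, max_pages + 1):
        page = fetch_page(page_number, page_size)
        yield from page.get("results", [])
        if not page.get("hasNext"):
            return


def to_bid_row(lot_id: str, record: Dict[str, Any]) -> Tuple[Any, ...]:
    """Map one API record to (lot_id, bid_amount, bid_time, is_autobid,
    bidder_id, bidder_number)."""
    return (
        lot_id,
        record["currentBid"]["cents"] / 100,  # cents to EUR
        record["createdAt"],
        int(bool(record.get("autoBid"))),
        record.get("buyerId"),
        record.get("buyerNumber"),
    )
```

The resulting tuples can be passed to `executemany` with an `INSERT INTO bid_history (...)` statement; `created_at` is left for the caller to fill in.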
+**Captured Data:**
+- ✅ Bid amount, timestamp, bidder ID
+- ✅ Autobid flag
+- ⚠️ `negotiated` - Not yet captured
+
+**Calculated Intelligence:**
+- ✅ First bid time
+- ✅ Last bid time
+- ✅ Bid velocity (bids per hour)
+
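The document does not give the exact formula for bid velocity, so the sketch below assumes one plausible definition: the number of bids divided by the hours between the first and last bid, with `0.0` when that span is zero. The function name `bid_intelligence` is hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional, Union


def _parse_iso(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def bid_intelligence(bid_times: List[str]) -> Dict[str, Optional[Union[str, float]]]:
    """Derive first_bid_time, last_bid_time and bid_velocity from bid timestamps."""
    if not bid_times:
        return {"first_bid_time": None, "last_bid_time": None, "bid_velocity": 0.0}
    ordered = sorted((_parse_iso(ts), ts) for ts in bid_times)
    (first_dt, first_ts), (last_dt, last_ts) = ordered[0], ordered[-1]
    hours = (last_dt - first_dt).total_seconds() / 3600
    # Assumed definition: bids per hour over the active bidding window.
    velocity = len(bid_times) / hours if hours > 0 else 0.0
    return {"first_bid_time": first_ts, "last_bid_time": last_ts, "bid_velocity": velocity}
```

For example, three bids spread over two hours give a velocity of 1.5 under this definition.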
+### API Integration Points
+
+**Files:**
+- `src/graphql_client.py` - GraphQL queries and parsing
+- `src/bid_history_client.py` - REST API pagination and parsing
+- `src/scraper.py` - Integration during lot scraping
+
+**Flow:**
+1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
+2. Call GraphQL API → Get bidding data
+3. If bid_count > 0 → Call REST API → Get complete bid history
+4. Calculate bid intelligence metrics
+5. Save to database
+
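Steps 2 and 3 of the flow above can be sketched as a small orchestration function. The names `fetch_lot_bidding_data` and `fetch_bid_history` come from the flow diagram, but their signatures are assumptions here, so they are injected as callables; `enrich_lot` itself is a hypothetical name.

```python
from typing import Any, Callable, Dict, List


def enrich_lot(
    lot_uuid: str,
    lot_display_id: str,
    fetch_lot_bidding_data: Callable[[str], Dict[str, Any]],
    fetch_bid_history: Callable[[str], List[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Fetch bidding data, then bid history only when the lot has bids."""
    # Step 2: the GraphQL query is keyed by the lot display ID.
    bidding = fetch_lot_bidding_data(lot_display_id)
    bids: List[Dict[str, Any]] = []
    # Step 3: skip the REST call entirely for lots without bids.
    if (bidding.get("bid_count") or 0) > 0:
        bids = fetch_bid_history(lot_uuid)
    return {"bidding": bidding, "bids": bids}
```

Skipping the history call for lots with no bids saves one API round trip per such lot; steps 4 and 5 would then run on the returned dictionary.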
+**Rate Limiting:**
+- API calls happen during lot scraping phase
+- Overall 0.5s rate limit applies to page requests
+- API calls are part of lot processing (not separately limited)
+
+See `API_INTELLIGENCE_FINDINGS.md` for detailed field analysis and roadmap.