- Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
- Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
- Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
- Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
- Result: `pytest test/test_graphql_403.py -q` passes locally (the test's shape is sketched below).
- Root cause insights (from investigation and log improvements):
- 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
- To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
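A minimal sketch of the test's shape. This is illustrative, not the committed `test/test_graphql_403.py` verbatim: it assumes the client opens the session and the POST with `async with`, and it captures output via `capsys` instead of patching `builtins.print`; the real test also loads the modules via `importlib`.

```python
# Hypothetical sketch of the 403 test, not the committed file verbatim.
import asyncio


class FakeResponse:
    """Mimics the parts of aiohttp.ClientResponse the client touches."""
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class FakeSession:
    """Mimics aiohttp.ClientSession: every POST yields the 403 response."""
    def post(self, *args, **kwargs):
        return FakeResponse()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def test_fetch_returns_none_and_logs_403(monkeypatch, capsys):
    import graphql_client  # stands in for the importlib-loaded src/graphql_client.py

    # Every ClientSession the client creates is replaced by the 403-only fake.
    monkeypatch.setattr(graphql_client.aiohttp, "ClientSession",
                        lambda *args, **kwargs: FakeSession())

    result = asyncio.run(graphql_client.fetch_lot_bidding_data("A1-40179-35"))

    assert result is None                                   # graceful failure, no crash
    assert "GraphQL API error: 403" in capsys.readouterr().out
```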
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
- Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
- After completion, print: `Downloaded: K/N new images`.
- Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
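The in-place progress line is plain carriage-return printing over the same console line. A minimal sketch of the pattern, with an illustrative `fetch_one` coroutine standing in for the real per-image download in `src/scraper.py`:

```python
import asyncio

async def download_images(urls, fetch_one):
    """Download all URLs concurrently while printing live in-place progress.

    `fetch_one(url)` is an illustrative coroutine that returns True when it
    actually downloaded the image (False when it was already cached).
    """
    total = len(urls)
    done = 0
    new_indexes = []

    print(f"Downloading images: 0/{total}", end="", flush=True)

    async def worker(i, url):
        nonlocal done
        downloaded = await fetch_one(url)
        done += 1
        if downloaded:
            new_indexes.append(i)
        # "\r" rewrites the same console line instead of appending a new one.
        print(f"\rDownloading images: {done}/{total}", end="", flush=True)

    await asyncio.gather(*(worker(i, u) for i, u in enumerate(urls)))
    print()  # finish the in-place line

    print(f"Downloaded: {len(new_indexes)}/{total} new images")
    if new_indexes:
        shown = sorted(new_indexes)[:20]
        extra = len(new_indexes) - len(shown)
        suffix = f" (+{extra} more)" if extra > 0 else ""
        print("Indexes: " + ", ".join(map(str, shown)) + suffix)
```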
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
- Added browser-like headers and contextual Referer.
- Added small retry with backoff for 403/429.
- Improved error logs to include status, lot id, and a short body snippet.
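A minimal sketch of the header/retry shape, assuming an `aiohttp` POST to the GraphQL endpoint. The endpoint constant, header values, Referer path, and backoff schedule below are illustrative, not the exact values in `src/graphql_client.py`:

```python
import asyncio
import aiohttp

GRAPHQL_URL = "https://example.invalid/graphql"  # placeholder; the real endpoint lives in src/config.py
BASE_URL = "https://www.troostwijkauctions.com"


def _browser_headers(lot_id: str) -> dict:
    """Browser-like headers so WAF/CDN protections are less likely to reject the call."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Origin": BASE_URL,
        "Referer": f"{BASE_URL}/l/{lot_id}",  # contextual Referer; path format is illustrative
        "Content-Type": "application/json",
    }


async def post_graphql(session: aiohttp.ClientSession, payload: dict, lot_id: str,
                       retries: int = 2, backoff: float = 1.5):
    """POST with a small retry/backoff on 403/429; return parsed JSON or None."""
    for attempt in range(retries + 1):
        async with session.post(GRAPHQL_URL, json=payload,
                                headers=_browser_headers(lot_id)) as resp:
            if resp.status == 200:
                return await resp.json()
            if resp.status in (403, 429) and attempt < retries:
                await asyncio.sleep(backoff * (attempt + 1))  # simple growing backoff
                continue
            body = (await resp.text())[:200]  # safe, truncated snippet for the log
            print(f"GraphQL API error: {resp.status} (lot={lot_id}) - {body}")
            return None
```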
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
@@ -10,3 +10,14 @@
|
|||||||
dist/
|
dist/
|
||||||
build/
|
build/
|
||||||
out/
|
out/
|
||||||
|
# An .aiignore file follows the same syntax as a .gitignore file.
|
||||||
|
# .gitignore documentation: https://git-scm.com/docs/gitignore
|
||||||
|
|
||||||
|
# you can ignore files
|
||||||
|
# or folders
|
||||||
|
.idea
|
||||||
|
node_modules/
|
||||||
|
.vscode/
|
||||||
|
.git
|
||||||
|
.github
|
||||||
|
scripts
|
||||||
|
|||||||
@@ -333,7 +333,6 @@ Lot Page Parsed
|
|||||||
|
|
||||||
```
|
```
|
||||||
/mnt/okcomputer/output/
|
/mnt/okcomputer/output/
|
||||||
├── cache.db # SQLite database (compressed HTML + data)
|
|
||||||
├── auctions_{timestamp}.json # Exported auctions
|
├── auctions_{timestamp}.json # Exported auctions
|
||||||
├── auctions_{timestamp}.csv # Exported auctions
|
├── auctions_{timestamp}.csv # Exported auctions
|
||||||
├── lots_{timestamp}.json # Exported lots
|
├── lots_{timestamp}.json # Exported lots
|
||||||
@@ -503,13 +502,6 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
|
|||||||
- ✅ Closing time and status
|
- ✅ Closing time and status
|
||||||
- ✅ Brand, model, manufacturer (from attributes)
|
- ✅ Brand, model, manufacturer (from attributes)
|
||||||
|
|
||||||
**Available but Not Yet Captured:**
|
|
||||||
- ⚠️ `followersCount` - Watch count for popularity analysis
|
|
||||||
- ⚠️ `estimatedFullPrice` - Min/max estimated values
|
|
||||||
- ⚠️ `biddingStatus` - More detailed status enum
|
|
||||||
- ⚠️ `condition` - Direct condition field
|
|
||||||
- ⚠️ `location` - City, country details
|
|
||||||
- ⚠️ `categoryInformation` - Structured category
|
|
||||||
|
|
||||||
### REST API - Bid History
|
### REST API - Bid History
|
||||||
**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
|
**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
|
||||||
@@ -553,11 +545,6 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
|
|||||||
|
|
||||||
### API Integration Points
|
### API Integration Points
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- `src/graphql_client.py` - GraphQL queries and parsing
|
|
||||||
- `src/bid_history_client.py` - REST API pagination and parsing
|
|
||||||
- `src/scraper.py` - Integration during lot scraping
|
|
||||||
|
|
||||||
**Flow:**
|
**Flow:**
|
||||||
1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
|
1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
|
||||||
2. Call GraphQL API → Get bidding data
|
2. Call GraphQL API → Get bidding data
|
||||||
@@ -570,4 +557,3 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
|
|||||||
- Overall 0.5s rate limit applies to page requests
|
- Overall 0.5s rate limit applies to page requests
|
||||||
- API calls are part of lot processing (not separately limited)
|
- API calls are part of lot processing (not separately limited)
|
||||||
|
|
||||||
See `API_INTELLIGENCE_FINDINGS.md` for detailed field analysis and roadmap.
|
|
||||||
|
|||||||
@@ -94,12 +94,6 @@ tail -f ~/scaev/logs/monitor.log
|
|||||||
# Check Task Scheduler history
|
# Check Task Scheduler history
|
||||||
```
|
```
|
||||||
|
|
||||||
**Check database is updating:**
|
|
||||||
```bash
|
|
||||||
# Last modified time should update every 30 minutes
|
|
||||||
ls -lh C:/mnt/okcomputer/output/cache.db
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Troubleshooting
|
## Troubleshooting
|
||||||
|
|||||||
@@ -1,23 +0,0 @@
|
|||||||
✅ Routing service configured - scaev-mobile-routing.service active and working
|
|
||||||
✅ Scaev deployed - Container running with dual networks:
|
|
||||||
scaev_mobile_net (172.30.0.10) - for outbound internet via mobile
|
|
||||||
traefik_net (172.20.0.8) - for LAN access
|
|
||||||
✅ Mobile routing verified:
|
|
||||||
Host IP: 5.132.33.195 (LAN gateway)
|
|
||||||
Mobile IP: 77.63.26.140 (mobile provider)
|
|
||||||
Scaev IP: 77.63.26.140 ✅ Using mobile connection!
|
|
||||||
✅ Scraper functional - Successfully accessing troostwijkauctions.com through mobile network
|
|
||||||
Architecture:```
|
|
||||||
┌─────────────────────────────────────────┐
|
|
||||||
│ Tour Machine (192.168.1.159) │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────┐ │
|
|
||||||
│ │ Scaev Container │ │
|
|
||||||
│ │ • scaev_mobile_net: 172.30.0.10 ────┼──> Mobile Gateway (10.133.133.26)
|
|
||||||
│ │ • traefik_net: 172.20.0.8 │ │ └─> Internet (77.63.26.140)
|
|
||||||
│ │ • SQLite: shared-auction-data│ │
|
|
||||||
│ │ • Images: shared-auction-data│ │
|
|
||||||
│ └──────────────────────────────┘ │
|
|
||||||
│ │
|
|
||||||
└─────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
@@ -1,122 +0,0 @@
|
|||||||
# Deployment (Scaev)
|
|
||||||
|
|
||||||
## Prerequisites
|
|
||||||
|
|
||||||
- Python 3.8+ installed
|
|
||||||
- Access to a server (Linux/Windows)
|
|
||||||
- Playwright and dependencies installed
|
|
||||||
|
|
||||||
## Production Setup
|
|
||||||
|
|
||||||
### 1. Install on Server
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Clone repository
|
|
||||||
git clone git@git.appmodel.nl:Tour/scaev.git
|
|
||||||
cd scaev
|
|
||||||
|
|
||||||
# Create virtual environment
|
|
||||||
python -m venv .venv
|
|
||||||
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
|
||||||
|
|
||||||
# Install dependencies
|
|
||||||
pip install -r requirements.txt
|
|
||||||
playwright install chromium
|
|
||||||
playwright install-deps # Install system dependencies
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Configuration
|
|
||||||
|
|
||||||
Create a configuration file or set environment variables:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# main.py configuration
|
|
||||||
BASE_URL = "https://www.troostwijkauctions.com"
|
|
||||||
CACHE_DB = "/mnt/okcomputer/output/cache.db"
|
|
||||||
OUTPUT_DIR = "/mnt/okcomputer/output"
|
|
||||||
RATE_LIMIT_SECONDS = 0.5
|
|
||||||
MAX_PAGES = 50
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Create Output Directories
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo mkdir -p /var/scaev/output
|
|
||||||
sudo chown $USER:$USER /var/scaev
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4. Run as Cron Job
|
|
||||||
|
|
||||||
Add to crontab (`crontab -e`):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Run scraper daily at 2 AM
|
|
||||||
0 2 * * * cd /path/to/scaev && /path/to/.venv/bin/python main.py >> /var/log/scaev.log 2>&1
|
|
||||||
```
|
|
||||||
|
|
||||||
## Docker Deployment (Optional)
|
|
||||||
|
|
||||||
Create `Dockerfile`:
|
|
||||||
|
|
||||||
```dockerfile
|
|
||||||
FROM python:3.10-slim
|
|
||||||
|
|
||||||
WORKDIR /app
|
|
||||||
|
|
||||||
# Install system dependencies for Playwright
|
|
||||||
RUN apt-get update && apt-get install -y \
|
|
||||||
wget \
|
|
||||||
gnupg \
|
|
||||||
&& rm -rf /var/lib/apt/lists/*
|
|
||||||
|
|
||||||
COPY requirements.txt .
|
|
||||||
RUN pip install --no-cache-dir -r requirements.txt
|
|
||||||
RUN playwright install chromium
|
|
||||||
RUN playwright install-deps
|
|
||||||
|
|
||||||
COPY main.py .
|
|
||||||
|
|
||||||
CMD ["python", "main.py"]
|
|
||||||
```
|
|
||||||
|
|
||||||
Build and run:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker build -t scaev .
|
|
||||||
docker run -v /path/to/output:/output scaev
|
|
||||||
```
|
|
||||||
|
|
||||||
## Monitoring
|
|
||||||
|
|
||||||
### Check Logs
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tail -f /var/log/scaev.log
|
|
||||||
```
|
|
||||||
|
|
||||||
### Monitor Output
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ls -lh /var/scaev/output/
|
|
||||||
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### Playwright Browser Issues
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Reinstall browsers
|
|
||||||
playwright install --force chromium
|
|
||||||
```
|
|
||||||
|
|
||||||
### Permission Issues
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Fix permissions
|
|
||||||
sudo chown -R $USER:$USER /var/scaev
|
|
||||||
```
|
|
||||||
|
|
||||||
### Memory Issues
|
|
||||||
|
|
||||||
- Reduce `MAX_PAGES` in configuration
|
|
||||||
- Run on machine with more RAM (Playwright needs ~1GB)
|
|
||||||
@@ -1,377 +1,169 @@
|
|||||||
# Data Quality Fixes - Complete Summary
|
# Data Quality Fixes - Condensed Summary
|
||||||
|
|
||||||
## Executive Summary
|
## Executive Summary
|
||||||
|
✅ **Completed all 5 high-priority data quality tasks:**
|
||||||
|
|
||||||
Successfully completed all 5 high-priority data quality and intelligence tasks:
|
1. Fixed orphaned lots: **16,807 → 13** (99.9% resolved)
|
||||||
|
2. Bid history fetching: Script created, ready to run
|
||||||
|
3. Added followersCount extraction (watch count)
|
||||||
|
4. Added estimatedFullPrice extraction (min/max values)
|
||||||
|
5. Added direct condition field from API
|
||||||
|
|
||||||
1. ✅ **Fixed orphaned lots** (16,807 → 13 orphaned lots)
|
**Impact:** 80%+ increase in intelligence data capture for future scrapes.
|
||||||
2. ✅ **Fixed bid history fetching** (script created, ready to run)
|
|
||||||
3. ✅ **Added followersCount extraction** (watch count)
|
|
||||||
4. ✅ **Added estimatedFullPrice extraction** (min/max values)
|
|
||||||
5. ✅ **Added direct condition field** from API
|
|
||||||
|
|
||||||
**Impact:** Database now captures 80%+ more intelligence data for future scrapes.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Task 1: Fix Orphaned Lots ✅ COMPLETE
|
## Task 1: Fix Orphaned Lots ✅
|
||||||
|
|
||||||
### Problem:
|
**Problem:** 16,807 lots had no matching auction due to auction_id mismatch (UUID vs numeric vs displayId).
|
||||||
- **16,807 lots** had no matching auction (100% orphaned)
|
|
||||||
- Root cause: auction_id mismatch
|
|
||||||
- Lots table used UUID auction_id (e.g., `72928a1a-12bf-4d5d-93ac-292f057aab6e`)
|
|
||||||
- Auctions table used numeric IDs (legacy incorrect data)
|
|
||||||
- Auction pages use `displayId` (e.g., `A1-34731`)
|
|
||||||
|
|
||||||
### Solution:
|
**Solution:**
|
||||||
1. **Updated parse.py** - Modified `_parse_lot_json()` to extract auction displayId from page_props
|
- Updated `parse.py` to extract `auction.displayId` from lot pages
|
||||||
- Lot pages include full auction data
|
- Created migration scripts to rebuild auctions table and re-link lots
|
||||||
- Now extracts `auction.displayId` instead of using UUID `lot.auctionId`
|
|
||||||
|
|
||||||
2. **Created fix_orphaned_lots.py** - Migrated existing 16,793 lots
|
**Results:**
|
||||||
- Read cached lot pages
|
- Orphaned lots: **16,807 → 13** (99.9% fixed)
|
||||||
- Extracted auction displayId from embedded auction data
|
- Auctions table: **0% → 100%** complete (lots_count, first_lot_closing_time)
|
||||||
- Updated lots.auction_id from UUID to displayId
|
|
||||||
|
|
||||||
3. **Created fix_auctions_table.py** - Rebuilt auctions table
|
**Files:** `src/parse.py` | `fix_orphaned_lots.py` | `fix_auctions_table.py`
|
||||||
- Cleared incorrect auction data
|
|
||||||
- Re-extracted from 517 cached auction pages
|
|
||||||
- Inserted 509 auctions with correct displayId
|
|
||||||
|
|
||||||
### Results:
|
|
||||||
- **Orphaned lots:** 16,807 → **13** (99.9% fixed)
|
|
||||||
- **Auctions completeness:**
|
|
||||||
- lots_count: 0% → **100%**
|
|
||||||
- first_lot_closing_time: 0% → **100%**
|
|
||||||
- **All lots now properly linked to auctions**
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/parse.py` - Updated `_extract_nextjs_data()` and `_parse_lot_json()`
|
|
||||||
|
|
||||||
### Scripts Created:
|
|
||||||
- `fix_orphaned_lots.py` - Migrates existing lots
|
|
||||||
- `fix_auctions_table.py` - Rebuilds auctions table
|
|
||||||
- `check_lot_auction_link.py` - Diagnostic script
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Task 2: Fix Bid History Fetching ✅ COMPLETE
|
## Task 2: Fix Bid History Fetching ✅
|
||||||
|
|
||||||
### Problem:
|
**Problem:** 1,590 lots with bids but no bid history (0.1% coverage).
|
||||||
- **1,590 lots** with bids but no bid history (0.1% coverage)
|
|
||||||
- Bid history fetching only ran during scraping, not for existing lots
|
|
||||||
|
|
||||||
### Solution:
|
**Solution:** Created `fetch_missing_bid_history.py` to backfill bid history via REST API.
|
||||||
1. **Verified scraper logic** - src/scraper.py bid history fetching is correct
|
|
||||||
- Extracts lot UUID from __NEXT_DATA__
|
|
||||||
- Calls REST API: `https://shared-api.tbauctions.com/bidmanagement/lots/{uuid}/bidding-history`
|
|
||||||
- Calculates bid velocity, first/last bid time
|
|
||||||
- Saves to bid_history table
|
|
||||||
|
|
||||||
2. **Created fetch_missing_bid_history.py**
|
**Status:** Script ready; future scrapes will auto-capture.
|
||||||
- Builds lot_id → UUID mapping from cached pages
|
|
||||||
- Fetches bid history from REST API for all lots with bids
|
|
||||||
- Updates lots table with bid intelligence
|
|
||||||
- Saves complete bid history records
|
|
||||||
|
|
||||||
### Results:
|
**Runtime:** ~13-15 minutes for 1,590 lots (0.5s rate limit)
|
||||||
- Script created and tested
|
|
||||||
- **Limitation:** Takes ~13 minutes to process 1,590 lots (0.5s rate limit)
|
|
||||||
- **Future scrapes:** Bid history will be captured automatically
|
|
||||||
|
|
||||||
### Files Created:
|
**Files:** `fetch_missing_bid_history.py`
|
||||||
- `fetch_missing_bid_history.py` - Migration script for existing lots
|
|
||||||
|
|
||||||
### Note:
|
|
||||||
- Script is ready to run but requires ~13-15 minutes
|
|
||||||
- Future scrapes will automatically capture bid history
|
|
||||||
- No code changes needed - existing scraper logic is correct
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Task 3: Add followersCount Field ✅ COMPLETE
|
## Task 3: Add followersCount ✅
|
||||||
|
|
||||||
### Problem:
|
**Problem:** Watch count unavailable (thought missing).
|
||||||
- Watch count thought to be unavailable
|
|
||||||
- **Discovery:** `followersCount` field exists in GraphQL API!
|
|
||||||
|
|
||||||
### Solution:
|
**Solution:** Discovered in GraphQL API; implemented extraction and schema update.
|
||||||
1. **Updated database schema** (src/cache.py)
|
|
||||||
- Added `followers_count INTEGER DEFAULT 0` column
|
|
||||||
- Auto-migration on scraper startup
|
|
||||||
|
|
||||||
2. **Updated GraphQL query** (src/graphql_client.py)
|
**Value:** Predict popularity, track interest-to-bid conversion, identify "sleeper" lots.
|
||||||
- Added `followersCount` to LOT_BIDDING_QUERY
|
|
||||||
|
|
||||||
3. **Updated format_bid_data()** (src/graphql_client.py)
|
**Files:** `src/cache.py` | `src/graphql_client.py` | `enrich_existing_lots.py` (~2.3 hours runtime)
|
||||||
- Extracts and returns `followers_count`
|
|
||||||
|
|
||||||
4. **Updated save_lot()** (src/cache.py)
|
|
||||||
- Saves followers_count to database
|
|
||||||
|
|
||||||
5. **Created enrich_existing_lots.py**
|
|
||||||
- Fetches followers_count for existing 16,807 lots
|
|
||||||
- Uses GraphQL API with 0.5s rate limiting
|
|
||||||
- Takes ~2.3 hours to complete
|
|
||||||
|
|
||||||
### Intelligence Value:
|
|
||||||
- **Predict lot popularity** before bidding wars
|
|
||||||
- Calculate interest-to-bid conversion rate
|
|
||||||
- Identify "sleeper" lots (high followers, low bids)
|
|
||||||
- Alert on lots gaining sudden interest
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/cache.py` - Schema + save_lot()
|
|
||||||
- `src/graphql_client.py` - Query + format_bid_data()
|
|
||||||
|
|
||||||
### Files Created:
|
|
||||||
- `enrich_existing_lots.py` - Migration for existing lots
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Task 4: Add estimatedFullPrice Extraction ✅ COMPLETE
|
## Task 4: Add estimatedFullPrice ✅
|
||||||
|
|
||||||
### Problem:
|
**Problem:** Min/max estimates unavailable (thought missing).
|
||||||
- Estimated min/max values thought to be unavailable
|
|
||||||
- **Discovery:** `estimatedFullPrice` object with min/max exists in GraphQL API!
|
|
||||||
|
|
||||||
### Solution:
|
**Solution:** Discovered `estimatedFullPrice{min,max}` in GraphQL API; extracts cents → EUR.
|
||||||
1. **Updated database schema** (src/cache.py)
|
|
||||||
- Added `estimated_min_price REAL` column
|
|
||||||
- Added `estimated_max_price REAL` column
|
|
||||||
|
|
||||||
2. **Updated GraphQL query** (src/graphql_client.py)
|
**Value:** Detect bargains (`final < min`), overvaluation, build pricing models.
|
||||||
- Added `estimatedFullPrice { min { cents currency } max { cents currency } }`
|
|
||||||
|
|
||||||
3. **Updated format_bid_data()** (src/graphql_client.py)
|
**Files:** `src/cache.py` | `src/graphql_client.py` | `enrich_existing_lots.py`
|
||||||
- Extracts estimated_min_obj and estimated_max_obj
|
|
||||||
- Converts cents to EUR
|
|
||||||
- Returns estimated_min_price and estimated_max_price
|
|
||||||
|
|
||||||
4. **Updated save_lot()** (src/cache.py)
|
|
||||||
- Saves both estimated price fields
|
|
||||||
|
|
||||||
5. **Migration** (enrich_existing_lots.py)
|
|
||||||
- Fetches estimated prices for existing lots
|
|
||||||
|
|
||||||
### Intelligence Value:
|
|
||||||
- Compare final price vs estimate (accuracy analysis)
|
|
||||||
- Identify bargains: `final_price < estimated_min`
|
|
||||||
- Identify overvalued: `final_price > estimated_max`
|
|
||||||
- Build pricing models per category
|
|
||||||
- Investment opportunity detection
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/cache.py` - Schema + save_lot()
|
|
||||||
- `src/graphql_client.py` - Query + format_bid_data()
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Task 5: Use Direct Condition Field ✅ COMPLETE
|
## Task 5: Direct Condition Field ✅
|
||||||
|
|
||||||
### Problem:
|
**Problem:** Condition extracted from attributes (0% success rate).
|
||||||
- Condition extracted from attributes (complex, unreliable)
|
|
||||||
- 0% condition_score success rate
|
|
||||||
- **Discovery:** Direct `condition` and `appearance` fields in GraphQL API!
|
|
||||||
|
|
||||||
### Solution:
|
**Solution:** Using direct `condition` and `appearance` fields from GraphQL API.
|
||||||
1. **Updated database schema** (src/cache.py)
|
|
||||||
- Added `lot_condition TEXT` column (direct from API)
|
|
||||||
- Added `appearance TEXT` column (visual condition notes)
|
|
||||||
|
|
||||||
2. **Updated GraphQL query** (src/graphql_client.py)
|
**Value:** Reliable condition data for scoring, filtering, restoration identification.
|
||||||
- Added `condition` field
|
|
||||||
- Added `appearance` field
|
|
||||||
|
|
||||||
3. **Updated format_bid_data()** (src/graphql_client.py)
|
**Files:** `src/cache.py` | `src/graphql_client.py` | `enrich_existing_lots.py`
|
||||||
- Extracts and returns `lot_condition`
|
|
||||||
- Extracts and returns `appearance`
|
|
||||||
|
|
||||||
4. **Updated save_lot()** (src/cache.py)
|
|
||||||
- Saves both condition fields
|
|
||||||
|
|
||||||
5. **Migration** (enrich_existing_lots.py)
|
|
||||||
- Fetches condition data for existing lots
|
|
||||||
|
|
||||||
### Intelligence Value:
|
|
||||||
- **Cleaner, more reliable** condition data
|
|
||||||
- Better condition scoring potential
|
|
||||||
- Identify restoration projects
|
|
||||||
- Filter by condition category
|
|
||||||
- Combined with appearance for detailed assessment
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/cache.py` - Schema + save_lot()
|
|
||||||
- `src/graphql_client.py` - Query + format_bid_data()
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Summary of Code Changes
|
## Code Changes Summary
|
||||||
|
|
||||||
### Core Files Modified:
|
### Modified Core Files
|
||||||
|
|
||||||
#### 1. `src/parse.py`
|
**`src/parse.py`**
|
||||||
**Changes:**
|
- Extract auction displayId from lot pages
|
||||||
- `_extract_nextjs_data()`: Pass auction data to lot parser
|
- Pass auction data to lot parser
|
||||||
- `_parse_lot_json()`: Accept auction_data parameter, extract auction displayId
|
|
||||||
|
|
||||||
**Impact:** Fixes orphaned lots issue going forward
|
**`src/cache.py`**
|
||||||
|
- Added 5 columns: `followers_count`, `estimated_min_price`, `estimated_max_price`, `lot_condition`, `appearance`
|
||||||
|
- Auto-migration on startup
|
||||||
|
- Updated `save_lot()` INSERT
|
||||||
|
|
||||||
#### 2. `src/cache.py`
|
**`src/graphql_client.py`**
|
||||||
**Changes:**
|
- Enhanced `LOT_BIDDING_QUERY` with new fields
|
||||||
- Added 5 new columns to lots table schema
|
- Updated `format_bid_data()` extraction logic
|
||||||
- Updated `save_lot()` INSERT statement to include new fields
|
|
||||||
- Auto-migration logic for new columns
|
|
||||||
|
|
||||||
**New Columns:**
|
### Migration Scripts
|
||||||
- `followers_count INTEGER DEFAULT 0`
|
|
||||||
- `estimated_min_price REAL`
|
|
||||||
- `estimated_max_price REAL`
|
|
||||||
- `lot_condition TEXT`
|
|
||||||
- `appearance TEXT`
|
|
||||||
|
|
||||||
#### 3. `src/graphql_client.py`
|
| Script | Purpose | Status | Runtime |
|
||||||
**Changes:**
|
|--------|---------|--------|---------|
|
||||||
- Updated `LOT_BIDDING_QUERY` to include new fields
|
| `fix_orphaned_lots.py` | Fix auction_id mismatch | ✅ Complete | Instant |
|
||||||
- Updated `format_bid_data()` to extract and format new fields
|
| `fix_auctions_table.py` | Rebuild auctions table | ✅ Complete | ~2 min |
|
||||||
|
| `fetch_missing_bid_history.py` | Backfill bid history | ⏳ Ready | ~13-15 min |
|
||||||
**New Fields Extracted:**
|
| `enrich_existing_lots.py` | Fetch new fields | ⏳ Ready | ~2.3 hours |
|
||||||
- `followersCount`
|
|
||||||
- `estimatedFullPrice { min { cents } max { cents } }`
|
|
||||||
- `condition`
|
|
||||||
- `appearance`
|
|
||||||
|
|
||||||
### Migration Scripts Created:
|
|
||||||
|
|
||||||
1. **fix_orphaned_lots.py** - Fix auction_id mismatch (COMPLETED)
|
|
||||||
2. **fix_auctions_table.py** - Rebuild auctions table (COMPLETED)
|
|
||||||
3. **fetch_missing_bid_history.py** - Fetch bid history for existing lots (READY TO RUN)
|
|
||||||
4. **enrich_existing_lots.py** - Fetch new intelligence fields for existing lots (READY TO RUN)
|
|
||||||
|
|
||||||
### Diagnostic/Validation Scripts:
|
|
||||||
|
|
||||||
1. **check_lot_auction_link.py** - Verify lot-auction linkage
|
|
||||||
2. **validate_data.py** - Comprehensive data quality report
|
|
||||||
3. **explore_api_fields.py** - API schema introspection
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Running the Migration Scripts
|
## Validation: Before vs After
|
||||||
|
|
||||||
### Immediate (Already Complete):
|
| Metric | Before | After | Improvement |
|
||||||
```bash
|
|--------|--------|-------|-------------|
|
||||||
python fix_orphaned_lots.py # ✅ DONE - Fixed 16,793 lots
|
| Orphaned lots | 16,807 (100%) | 13 (0.08%) | **99.9%** |
|
||||||
python fix_auctions_table.py # ✅ DONE - Rebuilt 509 auctions
|
| Auction lots_count | 0% | 100% | **+100%** |
|
||||||
```
|
| Auction first_lot_closing | 0% | 100% | **+100%** |
|
||||||
|
| Bid history coverage | 0.1% | 1,590 lots ready | **—** |
|
||||||
### Optional (Time-Intensive):
|
| Intelligence fields | 0 | 5 new fields | **+80%+** |
|
||||||
```bash
|
|
||||||
# Fetch bid history for 1,590 lots (~13-15 minutes)
|
|
||||||
python fetch_missing_bid_history.py
|
|
||||||
|
|
||||||
# Enrich all 16,807 lots with new fields (~2.3 hours)
|
|
||||||
python enrich_existing_lots.py
|
|
||||||
```
|
|
||||||
|
|
||||||
**Note:** Future scrapes will automatically capture all data, so migration is optional.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Validation Results
|
|
||||||
|
|
||||||
### Before Fixes:
|
|
||||||
```
|
|
||||||
Orphaned lots: 16,807 (100%)
|
|
||||||
Auctions lots_count: 0%
|
|
||||||
Auctions first_lot_closing: 0%
|
|
||||||
Bid history coverage: 0.1% (1/1,591 lots)
|
|
||||||
```
|
|
||||||
|
|
||||||
### After Fixes:
|
|
||||||
```
|
|
||||||
Orphaned lots: 13 (0.08%)
|
|
||||||
Auctions lots_count: 100%
|
|
||||||
Auctions first_lot_closing: 100%
|
|
||||||
Bid history: Script ready (will process 1,590 lots)
|
|
||||||
New intelligence fields: Implemented and ready
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Intelligence Impact
|
## Intelligence Impact
|
||||||
|
|
||||||
### Data Completeness Improvements:
|
### New Fields & Value
|
||||||
| Field | Before | After | Improvement |
|
|
||||||
|-------|--------|-------|-------------|
|
|
||||||
| Orphaned lots | 100% | 0.08% | **99.9% fixed** |
|
|
||||||
| Auction lots_count | 0% | 100% | **+100%** |
|
|
||||||
| Auction first_lot_closing | 0% | 100% | **+100%** |
|
|
||||||
|
|
||||||
### New Intelligence Fields (Future Scrapes):
|
| Field | Intelligence Use Case |
|
||||||
| Field | Status | Intelligence Value |
|
|-------|----------------------|
|
||||||
|-------|--------|-------------------|
|
| `followers_count` | Popularity prediction, interest tracking |
|
||||||
| followers_count | ✅ Implemented | High - Popularity predictor |
|
| `estimated_min/max_price` | Bargain/overvaluation detection, pricing models |
|
||||||
| estimated_min_price | ✅ Implemented | High - Bargain detection |
|
| `lot_condition` | Reliable filtering, condition scoring |
|
||||||
| estimated_max_price | ✅ Implemented | High - Value assessment |
|
| `appearance` | Visual assessment, restoration needs |
|
||||||
| lot_condition | ✅ Implemented | Medium - Condition filtering |
|
|
||||||
| appearance | ✅ Implemented | Medium - Visual assessment |
|
|
||||||
|
|
||||||
### Estimated Intelligence Value Increase:
|
### Data Completeness
|
||||||
**80%+** - Based on addition of 5 critical fields that enable:
|
**80%+ increase** in actionable intelligence for:
|
||||||
- Popularity prediction
|
- Investment opportunity detection
|
||||||
- Value assessment
|
- Auction strategy optimization
|
||||||
- Bargain detection
|
- Predictive modeling
|
||||||
- Better condition scoring
|
- Market analysis
|
||||||
- Investment opportunity identification
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Documentation Updated
|
## Run Migrations (Optional)
|
||||||
|
|
||||||
### Created:
|
```bash
|
||||||
- `VALIDATION_SUMMARY.md` - Complete validation findings
|
# Completed
|
||||||
- `API_INTELLIGENCE_FINDINGS.md` - API field analysis
|
python fix_orphaned_lots.py
|
||||||
- `FIXES_COMPLETE.md` - This document
|
python fix_auctions_table.py
|
||||||
|
|
||||||
### Updated:
|
# Optional: Backfill existing data
|
||||||
- `_wiki/ARCHITECTURE.md` - Complete system documentation
|
python fetch_missing_bid_history.py # ~13-15 min
|
||||||
- Updated Phase 3 diagram with API enrichment
|
python enrich_existing_lots.py # ~2.3 hours
|
||||||
- Expanded lots table schema documentation
|
```
|
||||||
- Added bid_history table
|
|
||||||
- Added API Integration Architecture section
|
|
||||||
- Updated rate limiting and image download flows
|
|
||||||
|
|
||||||
---
|
**Note:** Future scrapes auto-capture all fields; migrations are optional.
|
||||||
|
|
||||||
## Next Steps (Optional)
|
|
||||||
|
|
||||||
### Immediate:
|
|
||||||
1. ✅ All high-priority fixes complete
|
|
||||||
2. ✅ Code ready for future scrapes
|
|
||||||
3. ⏳ Optional: Run migration scripts for existing data
|
|
||||||
|
|
||||||
### Future Enhancements (Low Priority):
|
|
||||||
1. Extract structured location (city, country)
|
|
||||||
2. Extract category information (structured)
|
|
||||||
3. Add VAT and buyer premium fields
|
|
||||||
4. Add video/document URL support
|
|
||||||
5. Parse viewing/pickup times from remarks text
|
|
||||||
|
|
||||||
See `API_INTELLIGENCE_FINDINGS.md` for complete roadmap.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Success Criteria
|
## Success Criteria
|
||||||
|
|
||||||
All tasks completed successfully:
|
- [x] Orphaned lots: 99.9% reduction
|
||||||
|
- [x] Bid history: Logic verified, script ready
|
||||||
|
- [x] followersCount: Fully implemented
|
||||||
|
- [x] estimatedFullPrice: Min/max extraction live
|
||||||
|
- [x] Direct condition: Fields added
|
||||||
|
- [x] Core code: parse.py, cache.py, graphql_client.py updated
|
||||||
|
- [x] Migrations: 4 scripts created
|
||||||
|
- [x] Documentation: ARCHITECTURE.md and summaries updated
|
||||||
|
|
||||||
- [x] **Orphaned lots fixed** - 99.9% reduction (16,807 → 13)
|
**Result:** Scraper now captures 80%+ more intelligence with near-perfect data quality.
|
||||||
- [x] **Bid history logic verified** - Script created, ready to run
|
|
||||||
- [x] **followersCount added** - Schema, extraction, saving implemented
|
|
||||||
- [x] **estimatedFullPrice added** - Min/max extraction implemented
|
|
||||||
- [x] **Direct condition field** - lot_condition and appearance added
|
|
||||||
- [x] **Code updated** - parse.py, cache.py, graphql_client.py
|
|
||||||
- [x] **Migrations created** - 4 scripts for data cleanup/enrichment
|
|
||||||
- [x] **Documentation complete** - ARCHITECTURE.md, summaries, findings
|
|
||||||
|
|
||||||
**Impact:** Scraper now captures 80%+ more intelligence data with higher data quality.
|
|
||||||
18
docs/Home.md
18
docs/Home.md
@@ -1,18 +0,0 @@
|
|||||||
# scaev Wiki
|
|
||||||
|
|
||||||
Welcome to the scaev documentation.
|
|
||||||
|
|
||||||
## Contents
|
|
||||||
|
|
||||||
- [Getting Started](Getting-Started)
|
|
||||||
- [Architecture](Architecture)
|
|
||||||
- [Deployment](Deployment)
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
|
|
||||||
|
|
||||||
## Quick Links
|
|
||||||
|
|
||||||
- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
|
|
||||||
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)
|
|
||||||
@@ -1,624 +1,160 @@
|
|||||||
# Intelligence Dashboard Upgrade Plan
|
# Dashboard Upgrade Plan
|
||||||
|
|
||||||
## Executive Summary
|
## Executive Summary
|
||||||
|
**5 new intelligence fields** enable advanced opportunity detection and analytics. Run migrations to activate.
|
||||||
The Troostwijk scraper now captures **5 critical new intelligence fields** that enable advanced predictive analytics and opportunity detection. This document outlines recommended dashboard upgrades to leverage the new data.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## New Intelligence Fields Available
|
## New Intelligence Fields
|
||||||
|
|
||||||
### 1. **followers_count** (Watch Count)
|
| Field | Type | Coverage | Value | Use Cases |
|
||||||
**Type:** INTEGER
|
|-------------------------|---------|--------------------------|-------|-----------------------------------------|
|
||||||
**Coverage:** Will be 100% for new scrapes, 0% for existing (requires migration)
|
| **followers_count** | INTEGER | 100% future, 0% existing | ⭐⭐⭐⭐⭐ | Popularity tracking, sleeper detection |
|
||||||
**Intelligence Value:** ⭐⭐⭐⭐⭐ CRITICAL
|
| **estimated_min_price** | REAL | 100% future, 0% existing | ⭐⭐⭐⭐⭐ | Bargain detection, value gap analysis |
|
||||||
|
| **estimated_max_price** | REAL | 100% future, 0% existing | ⭐⭐⭐⭐⭐ | Overvaluation alerts, ROI calculation |
|
||||||
|
| **lot_condition** | TEXT | ~85% future | ⭐⭐⭐ | Quality filtering, condition scoring |
|
||||||
|
| **appearance** | TEXT | ~85% future | ⭐⭐⭐ | Visual assessment, restoration projects |
|
||||||
|
|
||||||
**What it tells us:**
|
### Key Metrics Enabled
|
||||||
- How many users are watching/following each lot
|
- Interest-to-bid conversion rate
|
||||||
- Real-time popularity indicator
|
- Auction house estimation accuracy
|
||||||
- Early warning of bidding competition
|
- Bargain/overvaluation detection
|
||||||
|
- Price prediction models
|
||||||
**Dashboard Applications:**
|
|
||||||
- **Popularity Score**: Calculate interest level before bidding starts
|
|
||||||
- **Follower Trends**: Track follower growth rate (requires time-series scraping)
|
|
||||||
- **Interest-to-Bid Conversion**: Ratio of followers to actual bidders
|
|
||||||
- **Sleeper Lots Alert**: High followers + low bids = hidden opportunity
|
|
||||||
|
|
||||||
### 2. **estimated_min_price** & **estimated_max_price**
|
|
||||||
**Type:** REAL (EUR)
|
|
||||||
**Coverage:** Will be 100% for new scrapes, 0% for existing (requires migration)
|
|
||||||
**Intelligence Value:** ⭐⭐⭐⭐⭐ CRITICAL
|
|
||||||
|
|
||||||
**What it tells us:**
|
|
||||||
- Auction house's professional valuation range
|
|
||||||
- Expected market value
|
|
||||||
- Reserve price indicator (when combined with status)
|
|
||||||
|
|
||||||
**Dashboard Applications:**
|
|
||||||
- **Value Gap Analysis**: `current_bid / estimated_min_price` ratio
|
|
||||||
- **Bargain Detector**: Lots where `current_bid < estimated_min_price * 0.8`
|
|
||||||
- **Overvaluation Alert**: Lots where `current_bid > estimated_max_price * 1.2`
|
|
||||||
- **Investment ROI Calculator**: Potential profit if bought at current bid
|
|
||||||
- **Auction House Accuracy**: Track actual closing vs estimates
|
|
||||||
|
|
||||||
### 3. **lot_condition** & **appearance**
|
|
||||||
**Type:** TEXT
|
|
||||||
**Coverage:** Will be ~80-90% for new scrapes (not all lots have condition data)
|
|
||||||
**Intelligence Value:** ⭐⭐⭐ HIGH
|
|
||||||
|
|
||||||
**What it tells us:**
|
|
||||||
- Direct condition assessment from auction house
|
|
||||||
- Visual quality notes
|
|
||||||
- Cleaner than parsing from attributes
|
|
||||||
|
|
||||||
**Dashboard Applications:**
|
|
||||||
- **Condition Filtering**: Filter by condition categories
|
|
||||||
- **Restoration Projects**: Identify lots needing work
|
|
||||||
- **Quality Scoring**: Combine condition + appearance for rating
|
|
||||||
- **Condition vs Price**: Analyze price premium for better condition
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Data Quality Improvements
|
## Data Quality Fixes ✅
|
||||||
|
**Orphaned lots:** 16,807 → 13 (99.9% fixed)
|
||||||
### Orphaned Lots Issue - FIXED ✅
|
**Auction completeness:** 0% → 100% (lots_count, first_lot_closing_time)
|
||||||
**Before:** 16,807 lots (100%) had no matching auction
|
|
||||||
**After:** 13 lots (0.08%) orphaned
|
|
||||||
|
|
||||||
**Impact on Dashboard:**
|
|
||||||
- Auction-level analytics now possible
|
|
||||||
- Can group lots by auction
|
|
||||||
- Can show auction statistics
|
|
||||||
- Can track auction house performance
|
|
||||||
|
|
||||||
### Auction Data Completeness - FIXED ✅
|
|
||||||
**Before:**
|
|
||||||
- lots_count: 0%
|
|
||||||
- first_lot_closing_time: 0%
|
|
||||||
|
|
||||||
**After:**
|
|
||||||
- lots_count: 100%
|
|
||||||
- first_lot_closing_time: 100%
|
|
||||||
|
|
||||||
**Impact on Dashboard:**
|
|
||||||
- Show auction size (number of lots)
|
|
||||||
- Display auction timeline
|
|
||||||
- Calculate auction velocity (lots per hour closing)
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Recommended Dashboard Upgrades
|
## Dashboard Upgrades
|
||||||
|
|
||||||
### Priority 1: Opportunity Detection (High ROI)
|
### Priority 1: Opportunity Detection (High ROI)
|
||||||
|
|
||||||
#### 1.1 **Bargain Hunter Dashboard**
|
**1.1 Bargain Hunter Dashboard**
|
||||||
|
```sql
|
||||||
|
-- Query: Find lots 20%+ below estimate
|
||||||
|
WHERE current_bid < estimated_min_price * 0.80
|
||||||
|
AND followers_count > 3
|
||||||
|
AND closing_time > NOW()
|
||||||
```
|
```
|
||||||
╔══════════════════════════════════════════════════════════╗
|
**Alert logic:** `value_gap = estimated_min - current_bid`
|
||||||
║ BARGAIN OPPORTUNITIES ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
**1.2 Sleeper Lots**
|
||||||
║ Lot: A1-34731-107 - Ford Generator ║
|
```sql
|
||||||
║ Current Bid: €500 ║
|
-- Query: High interest, no bids, <24h left
|
||||||
║ Estimated Range: €1,200 - €1,800 ║
|
WHERE followers_count > 10
|
||||||
║ Bargain Score: 🔥🔥🔥🔥🔥 (58% below estimate) ║
|
AND bid_count = 0
|
||||||
║ Followers: 12 (High interest, low bids) ║
|
AND hours_remaining < 24
|
||||||
║ Time Left: 2h 15m ║
|
|
||||||
║ → POTENTIAL PROFIT: €700 - €1,300 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Calculations:**
|
**1.3 Value Gap Heatmap**
|
||||||
```python
|
- Great deals: <80% of estimate
|
||||||
value_gap = estimated_min_price - current_bid
|
- Fair price: 80-120% of estimate
|
||||||
bargain_score = value_gap / estimated_min_price * 100
|
- Overvalued: >120% of estimate
|
||||||
potential_profit = estimated_max_price - current_bid
|
|
||||||
|
|
||||||
# Filter criteria
|
|
||||||
if current_bid < estimated_min_price * 0.80: # 20%+ discount
|
|
||||||
if followers_count > 5: # Has interest
|
|
||||||
SHOW_AS_OPPORTUNITY
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 1.2 **Popularity vs Bidding Dashboard**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ SLEEPER LOTS (High Watch, Low Bids) ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Lot │ Followers │ Bids │ Current │ Est Min ║
|
|
||||||
║═══════════════════╪═══════════╪══════╪═════════╪═════════║
|
|
||||||
║ Laptop Dell XPS │ 47 │ 0 │ No bids│ €800 ║
|
|
||||||
║ iPhone 15 Pro │ 32 │ 1 │ €150 │ €950 ║
|
|
||||||
║ Office Chairs 10x │ 18 │ 0 │ No bids│ €450 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
**Insight:** High followers + low bids = people watching but not committing yet. Opportunity to bid early before competition heats up.
|
|
||||||
|
|
||||||
#### 1.3 **Value Gap Heatmap**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ VALUE GAP ANALYSIS ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ ║
|
|
||||||
║ Great Deals Fair Price Overvalued ║
|
|
||||||
║ (< 80% est) (80-120% est) (> 120% est) ║
|
|
||||||
║ ╔═══╗ ╔═══╗ ╔═══╗ ║
|
|
||||||
║ ║325║ ║892║ ║124║ ║
|
|
||||||
║ ╚═══╝ ╚═══╝ ╚═══╝ ║
|
|
||||||
║ 🔥 ➡ ⚠ ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
### Priority 2: Intelligence Analytics
|
### Priority 2: Intelligence Analytics
|
||||||
|
|
||||||
#### 2.1 **Lot Intelligence Card**
|
**2.1 Enhanced Lot Card**
|
||||||
Enhanced lot detail view with all new fields:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
╔══════════════════════════════════════════════════════════╗
|
Bidding: €500 current | 12 followers | 8 bids | 2.4/hr
|
||||||
║ A1-34731-107 - Ford FGT9250E Generator ║
|
Valuation: €1,200-€1,800 est | €700 value gap | €700-€1,300 potential profit
|
||||||
╠══════════════════════════════════════════════════════════╣
|
Condition: Used - Good | Normal wear
|
||||||
║ BIDDING ║
|
Timing: 2h 15m left | First: Dec 6 09:15 | Last: Dec 8 12:10
|
||||||
║ Current: €500 ║
|
|
||||||
║ Starting: €100 ║
|
|
||||||
║ Minimum: €550 ║
|
|
||||||
║ Bids: 8 (2.4 bids/hour) ║
|
|
||||||
║ Followers: 12 👁 ║
|
|
||||||
║ ║
|
|
||||||
║ VALUATION ║
|
|
||||||
║ Estimated: €1,200 - €1,800 ║
|
|
||||||
║ Value Gap: -€700 (58% below estimate) 🔥 ║
|
|
||||||
║ Potential: €700 - €1,300 profit ║
|
|
||||||
║ ║
|
|
||||||
║ CONDITION ║
|
|
||||||
║ Condition: Used - Good working order ║
|
|
||||||
║ Appearance: Normal wear, some scratches ║
|
|
||||||
║ Year: 2015 ║
|
|
||||||
║ ║
|
|
||||||
║ TIMING ║
|
|
||||||
║ Closes: 2025-12-08 14:30 ║
|
|
||||||
║ Time Left: 2h 15m ║
|
|
||||||
║ First Bid: 2025-12-06 09:15 ║
|
|
||||||
║ Last Bid: 2025-12-08 12:10 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 2.2 **Auction House Accuracy Tracker**
|
**2.2 Auction House Accuracy**
|
||||||
Track how accurate estimates are compared to final prices:
|
```sql
|
||||||
|
-- Post-auction analysis
|
||||||
```
|
SELECT category,
|
||||||
╔══════════════════════════════════════════════════════════╗
|
AVG(ABS(final - midpoint)/midpoint * 100) as accuracy,
|
||||||
║ AUCTION HOUSE ESTIMATION ACCURACY ║
|
AVG(final - midpoint) as bias
|
||||||
╠══════════════════════════════════════════════════════════╣
|
FROM lots WHERE final_price IS NOT NULL
|
||||||
║ Category │ Avg Accuracy │ Tend to Over/Under ║
|
GROUP BY category
|
||||||
║══════════════════╪══════════════╪═══════════════════════║
|
|
||||||
║ Electronics │ 92.3% │ Underestimate 5.2% ║
|
|
||||||
║ Vehicles │ 88.7% │ Overestimate 8.1% ║
|
|
||||||
║ Furniture │ 94.1% │ Accurate ±2% ║
|
|
||||||
║ Heavy Machinery │ 85.4% │ Underestimate 12.3% ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
|
|
||||||
Insight: Heavy Machinery estimates tend to be 12% low
|
|
||||||
→ Good buying opportunities in this category
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Calculation:**
|
**2.3 Interest Conversion Rate**
|
||||||
```python
|
```sql
|
||||||
# After lot closes
|
SELECT
|
||||||
actual_price = final_bid
|
COUNT(*) total,
|
||||||
estimated_mid = (estimated_min_price + estimated_max_price) / 2
|
COUNT(CASE WHEN followers > 0 THEN 1) as with_followers,
|
||||||
accuracy = abs(actual_price - estimated_mid) / estimated_mid * 100
|
COUNT(CASE WHEN bids > 0 THEN 1) as with_bids,
|
||||||
|
ROUND(with_bids / with_followers * 100, 2) as conversion_rate
|
||||||
if actual_price < estimated_mid:
|
FROM lots
|
||||||
trend = "Underestimate"
|
|
||||||
else:
|
|
||||||
trend = "Overestimate"
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 2.3 **Interest Conversion Dashboard**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ FOLLOWER → BIDDER CONVERSION ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Total Lots: 16,807 ║
|
|
||||||
║ Lots with Followers: 12,450 (74%) ║
|
|
||||||
║ Lots with Bids: 1,591 (9.5%) ║
|
|
||||||
║ ║
|
|
||||||
║ Conversion Rate: 12.8% ║
|
|
||||||
║ (Followers who bid) ║
|
|
||||||
║ ║
|
|
||||||
║ Avg Followers per Lot: 8.3 ║
|
|
||||||
║ Avg Bids when >0: 5.2 ║
|
|
||||||
║ ║
|
|
||||||
║ HIGH INTEREST CATEGORIES: ║
|
|
||||||
║ Electronics: 18.5 followers avg ║
|
|
||||||
║ Vehicles: 24.3 followers avg ║
|
|
||||||
║ Art: 31.2 followers avg ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Priority 3: Real-Time Alerts
|
### Priority 3: Real-Time Alerts
|
||||||
|
|
||||||
#### 3.1 **Opportunity Alerts**
|
|
||||||
```python
|
```python
|
||||||
# Alert conditions using new fields
|
BARGAIN: current_bid < estimated_min * 0.80
|
||||||
|
SLEEPER: followers > 10 AND bid_count == 0 AND time < 12h
|
||||||
# BARGAIN ALERT
|
HEATING: follower_growth > 5/hour AND bid_count < 3
|
||||||
if (current_bid < estimated_min_price * 0.80 and
|
OVERVALUED: current_bid > estimated_max * 1.2
|
||||||
time_remaining < 24_hours and
|
|
||||||
followers_count > 3):
|
|
||||||
|
|
||||||
send_alert("BARGAIN: {lot_id} - {value_gap}% below estimate!")
|
|
||||||
|
|
||||||
# SLEEPER LOT ALERT
|
|
||||||
if (followers_count > 10 and
|
|
||||||
bid_count == 0 and
|
|
||||||
time_remaining < 12_hours):
|
|
||||||
|
|
||||||
send_alert("SLEEPER: {lot_id} - {followers_count} watching, no bids yet!")
|
|
||||||
|
|
||||||
# HEATING UP ALERT
|
|
||||||
if (follower_growth_rate > 5_per_hour and
|
|
||||||
bid_count < 3):
|
|
||||||
|
|
||||||
send_alert("HEATING UP: {lot_id} - Interest spiking, get in early!")
|
|
||||||
|
|
||||||
# OVERVALUED WARNING
|
|
||||||
if (current_bid > estimated_max_price * 1.2):
|
|
||||||
|
|
||||||
send_alert("OVERVALUED: {lot_id} - 20%+ above high estimate!")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 3.2 **Watchlist Smart Alerts**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ YOUR WATCHLIST ALERTS ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ 🔥 MacBook Pro A1-34523 ║
|
|
||||||
║ Now €800 (€400 below estimate!) ║
|
|
||||||
║ 12 others watching - Act fast! ║
|
|
||||||
║ ║
|
|
||||||
║ 👁 iPhone 15 A1-34987 ║
|
|
||||||
║ 32 followers but no bids - Opportunity? ║
|
|
||||||
║ ║
|
|
||||||
║ ⚠ Office Desk A1-35102 ║
|
|
||||||
║ Bid at €450 but estimate €200-€300 ║
|
|
||||||
║ Consider dropping - overvalued! ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Priority 4: Advanced Analytics
|
### Priority 4: Advanced Analytics
|
||||||
|
|
||||||
#### 4.1 **Price Prediction Model**
|
**4.1 Price Prediction Model**
|
||||||
Using new fields for ML-based price prediction:
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Features for price prediction model
|
|
||||||
features = [
|
features = [
|
||||||
'followers_count', # NEW - Strong predictor
|
'followers_count',
|
||||||
'estimated_min_price', # NEW - Baseline value
|
'estimated_min_price',
|
||||||
'estimated_max_price', # NEW - Upper bound
|
'estimated_max_price',
|
||||||
'lot_condition', # NEW - Quality indicator
|
'lot_condition',
|
||||||
'appearance', # NEW - Visual quality
|
'bid_velocity',
|
||||||
'bid_velocity', # Existing
|
'category'
|
||||||
'time_to_close', # Existing
|
|
||||||
'category', # Existing
|
|
||||||
'manufacturer', # Existing
|
|
||||||
'year_manufactured', # Existing
|
|
||||||
]
|
]
|
||||||
|
predicted_price = model.predict(features)
|
||||||
predicted_final_price = model.predict(features)
|
|
||||||
confidence_interval = (predicted_low, predicted_high)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Dashboard Display:**
|
**4.2 Category Intelligence**
|
||||||
```
|
- Avg followers per category
|
||||||
╔══════════════════════════════════════════════════════════╗
|
- Bid rate vs follower rate
|
||||||
║ PRICE PREDICTION (AI) ║
|
- Bargain rate by category
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Lot: Ford Generator A1-34731-107 ║
|
|
||||||
║ ║
|
|
||||||
║ Current Bid: €500 ║
|
|
||||||
║ Estimate Range: €1,200 - €1,800 ║
|
|
||||||
║ ║
|
|
||||||
║ AI PREDICTION: €1,450 ║
|
|
||||||
║ Confidence: €1,280 - €1,620 (85% confidence) ║
|
|
||||||
║ ║
|
|
||||||
║ Factors: ║
|
|
||||||
║ ✓ 12 followers (above avg) ║
|
|
||||||
║ ✓ Good condition ║
|
|
||||||
║ ✓ 2.4 bids/hour (active) ║
|
|
||||||
║ - 2015 model (slightly old) ║
|
|
||||||
║ ║
|
|
||||||
║ Recommendation: BUY if below €1,280 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 4.2 **Category Intelligence**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ ELECTRONICS CATEGORY INTELLIGENCE ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Total Lots: 1,243 ║
|
|
||||||
║ Avg Followers: 18.5 (High Interest Category) ║
|
|
||||||
║ Avg Bids: 12.3 ║
|
|
||||||
║ Follower→Bid Rate: 15.2% (above avg 12.8%) ║
|
|
||||||
║ ║
|
|
||||||
║ PRICE ANALYSIS: ║
|
|
||||||
║ Estimate Accuracy: 92.3% ║
|
|
||||||
║ Avg Value Gap: -5.2% (tend to underestimate) ║
|
|
||||||
║ Bargains Found: 87 lots (7%) ║
|
|
||||||
║ ║
|
|
||||||
║ BEST CONDITIONS: ║
|
|
||||||
║ "New/Sealed": Avg 145% of estimate ║
|
|
||||||
║ "Like New": Avg 112% of estimate ║
|
|
||||||
║ "Used - Good": Avg 89% of estimate ║
|
|
||||||
║ "Used - Fair": Avg 62% of estimate ║
|
|
||||||
║ ║
|
|
||||||
║ 💡 INSIGHT: Electronics estimates are accurate but ║
|
|
||||||
║ tend to slightly undervalue. Good buying category. ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Implementation Priority
|
## Database Queries
|
||||||
|
|
||||||
### Phase 1: Quick Wins (1-2 days)
|
### Get Bargains
|
||||||
1. ✅ **Bargain Hunter Dashboard** - Filter lots by value gap
|
|
||||||
2. ✅ **Enhanced Lot Cards** - Show all new fields
|
|
||||||
3. ✅ **Opportunity Alerts** - Email/push notifications for bargains
|
|
||||||
|
|
||||||
### Phase 2: Analytics (3-5 days)
|
|
||||||
4. ✅ **Popularity vs Bidding Dashboard** - Follower analysis
|
|
||||||
5. ✅ **Value Gap Heatmap** - Visual overview
|
|
||||||
6. ✅ **Auction House Accuracy** - Historical tracking
|
|
||||||
|
|
||||||
### Phase 3: Advanced (1-2 weeks)
|
|
||||||
7. ✅ **Price Prediction Model** - ML-based predictions
|
|
||||||
8. ✅ **Category Intelligence** - Deep category analytics
|
|
||||||
9. ✅ **Smart Watchlist** - Personalized alerts
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Database Queries for Dashboard
|
|
||||||
|
|
||||||
### Get Bargain Opportunities
|
|
||||||
```sql
|
```sql
|
||||||
SELECT
|
SELECT lot_id, title, current_bid, estimated_min_price,
|
||||||
lot_id,
|
(estimated_min_price - current_bid)/estimated_min_price*100 as bargain_score
|
||||||
title,
|
|
||||||
current_bid,
|
|
||||||
estimated_min_price,
|
|
||||||
estimated_max_price,
|
|
||||||
followers_count,
|
|
||||||
lot_condition,
|
|
||||||
closing_time,
|
|
||||||
(estimated_min_price - CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL)) as value_gap,
|
|
||||||
((estimated_min_price - CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL)) / estimated_min_price * 100) as bargain_score
|
|
||||||
FROM lots
|
FROM lots
|
||||||
WHERE estimated_min_price IS NOT NULL
|
WHERE current_bid < estimated_min_price * 0.80
|
||||||
AND current_bid NOT LIKE '%No bids%'
|
AND LOT>$10,000 in identified opportunities
|
||||||
AND CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL) < estimated_min_price * 0.80
|
|
||||||
AND followers_count > 3
|
|
||||||
AND datetime(closing_time) > datetime('now')
|
|
||||||
ORDER BY bargain_score DESC
|
|
||||||
LIMIT 50;
|
|
||||||
```
|
|
||||||
|
|
||||||
### Get Sleeper Lots
|
|
||||||
```sql
|
|
||||||
SELECT
|
|
||||||
lot_id,
|
|
||||||
title,
|
|
||||||
followers_count,
|
|
||||||
bid_count,
|
|
||||||
current_bid,
|
|
||||||
estimated_min_price,
|
|
||||||
closing_time,
|
|
||||||
(julianday(closing_time) - julianday('now')) * 24 as hours_remaining
|
|
||||||
FROM lots
|
|
||||||
WHERE followers_count > 10
|
|
||||||
AND bid_count = 0
|
|
||||||
AND datetime(closing_time) > datetime('now')
|
|
||||||
AND (julianday(closing_time) - julianday('now')) * 24 < 24
|
|
||||||
ORDER BY followers_count DESC;
|
|
||||||
```
|
|
||||||
|
|
||||||
### Get Auction House Accuracy (Historical)
|
|
||||||
```sql
|
|
||||||
-- After lots close
|
|
||||||
SELECT
|
|
||||||
category,
|
|
||||||
COUNT(*) as total_lots,
|
|
||||||
AVG(ABS(final_price - (estimated_min_price + estimated_max_price) / 2) /
|
|
||||||
((estimated_min_price + estimated_max_price) / 2) * 100) as avg_accuracy,
|
|
||||||
AVG(final_price - (estimated_min_price + estimated_max_price) / 2) as avg_bias
|
|
||||||
FROM lots
|
|
||||||
WHERE estimated_min_price IS NOT NULL
|
|
||||||
AND final_price IS NOT NULL
|
|
||||||
AND datetime(closing_time) < datetime('now')
|
|
||||||
GROUP BY category
|
|
||||||
ORDER BY avg_accuracy DESC;
|
|
||||||
```
|
|
||||||
|
|
||||||
### Get Interest Conversion Rate
|
|
||||||
```sql
|
|
||||||
SELECT
|
|
||||||
COUNT(*) as total_lots,
|
|
||||||
COUNT(CASE WHEN followers_count > 0 THEN 1 END) as lots_with_followers,
|
|
||||||
COUNT(CASE WHEN bid_count > 0 THEN 1 END) as lots_with_bids,
|
|
||||||
ROUND(COUNT(CASE WHEN bid_count > 0 THEN 1 END) * 100.0 /
|
|
||||||
COUNT(CASE WHEN followers_count > 0 THEN 1 END), 2) as conversion_rate,
|
|
||||||
AVG(followers_count) as avg_followers,
|
|
||||||
AVG(CASE WHEN bid_count > 0 THEN bid_count END) as avg_bids_when_active
|
|
||||||
FROM lots
|
|
||||||
WHERE followers_count > 0;
|
|
||||||
```
|
|
||||||
|
|
||||||
### Get Category Intelligence
|
|
||||||
```sql
|
|
||||||
SELECT
|
|
||||||
category,
|
|
||||||
COUNT(*) as total_lots,
|
|
||||||
AVG(followers_count) as avg_followers,
|
|
||||||
AVG(bid_count) as avg_bids,
|
|
||||||
COUNT(CASE WHEN bid_count > 0 THEN 1 END) * 100.0 / COUNT(*) as bid_rate,
|
|
||||||
COUNT(CASE WHEN followers_count > 0 THEN 1 END) * 100.0 / COUNT(*) as follower_rate,
|
|
||||||
-- Bargain rate
|
|
||||||
COUNT(CASE
|
|
||||||
WHEN estimated_min_price IS NOT NULL
|
|
||||||
AND current_bid NOT LIKE '%No bids%'
|
|
||||||
AND CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL) < estimated_min_price * 0.80
|
|
||||||
THEN 1
|
|
||||||
END) as bargains_found
|
|
||||||
FROM lots
|
|
||||||
WHERE category IS NOT NULL AND category != ''
|
|
||||||
GROUP BY category
|
|
||||||
HAVING COUNT(*) > 50
|
|
||||||
ORDER BY avg_followers DESC;
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## API Requirements
|
|
||||||
|
|
||||||
### Real-Time Updates
|
|
||||||
For dashboards to stay current, implement periodic scraping:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Recommended update frequency
|
|
||||||
ACTIVE_LOTS = "Every 15 minutes" # Lots closing soon
|
|
||||||
ALL_LOTS = "Every 4 hours" # General updates
|
|
||||||
NEW_LOTS = "Every 1 hour" # Check for new listings
|
|
||||||
```
|
|
||||||
|
|
||||||
### Webhook Notifications
|
|
||||||
```python
|
|
||||||
# Alert types to implement
|
|
||||||
BARGAIN_ALERT = "Lot below 80% estimate"
|
|
||||||
SLEEPER_ALERT = "10+ followers, 0 bids, <12h remaining"
|
|
||||||
HEATING_UP = "Follower growth > 5/hour"
|
|
||||||
OVERVALUED = "Bid > 120% high estimate"
|
|
||||||
CLOSING_SOON = "Watchlist item < 1h remaining"
|
|
||||||
```

---

## Migration Scripts to Run

To populate new fields for the existing 16,807 lots:

```bash
# High priority - enriches all lots with new intelligence
python enrich_existing_lots.py
# Time: ~2.3 hours
# Benefit: Enables all dashboard features immediately

# Medium priority - adds bid history intelligence
python fetch_missing_bid_history.py
# Time: ~15 minutes
# Benefit: Bid velocity, timing analysis
```

**Note:** Future scrapes will automatically capture all fields, so migration is optional but recommended for immediate dashboard functionality.

---

## Expected Impact

### Before New Fields:
- Basic price tracking
- Simple bid monitoring
- Limited opportunity detection

### After New Fields:
- **80% more intelligence** per lot
- Advanced opportunity detection (bargains, sleepers)
- Price prediction capability
- Auction house accuracy tracking
- Category-specific insights
- Interest→Bid conversion analytics
- Real-time popularity tracking

### ROI Potential:
```
Example Scenario:
- User finds bargain: €500 current bid, €1,200-€1,800 estimate
- Buys at: €600 (after competition)
- Resells at: €1,400 (within estimate range)
- Profit: €800

Dashboard Value: Automated detection of 87 such opportunities
Potential Value: 87 × €800 = €69,600 in identified opportunities
```

---

## Monitoring & Success Metrics

Track dashboard effectiveness:

```python
# User engagement metrics
opportunities_shown = COUNT(bargain_alerts)
opportunities_acted_on = COUNT(user_bids_after_alert)
conversion_rate = opportunities_acted_on / opportunities_shown

# Accuracy metrics
predicted_bargains = COUNT(lots_flagged_as_bargain)
actual_bargains = COUNT(lots_closed_below_estimate)
prediction_accuracy = actual_bargains / predicted_bargains

# Value metrics
total_opportunity_value = SUM(estimated_min - final_price) WHERE final_price < estimated_min
avg_opportunity_value = total_opportunity_value / actual_bargains
```
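
The block above is pseudocode for the metric definitions. A small helper like this keeps the ratios safe when an alert type has not fired yet (the numbers in the example call are illustrative only):

```python
def dashboard_metrics(shown: int, acted_on: int, predicted: int, actual: int) -> dict:
    """Compute engagement and accuracy ratios, guarding against empty alert sets."""
    def ratio(numerator: int, denominator: int) -> float:
        return round(numerator / denominator, 3) if denominator else 0.0

    return {
        "conversion_rate": ratio(acted_on, shown),
        "prediction_accuracy": ratio(actual, predicted),
    }

# Illustrative numbers only: 87 bargain alerts shown, 19 acted on,
# 120 lots flagged as bargains, 87 that actually closed below estimate.
print(dashboard_metrics(shown=87, acted_on=19, predicted=120, actual=87))
```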

---

## Next Steps

-1. **Immediate (Today):**
-   - ✅ Run `enrich_existing_lots.py` to populate new fields
-   - ✅ Update dashboard to display new fields
+**Today:**
+```bash
+# Run to activate all features
+python enrich_existing_lots.py        # ~2.3 hrs
+python fetch_missing_bid_history.py   # ~15 min
+```

-2. **This Week:**
-   - Implement Bargain Hunter Dashboard
-   - Add opportunity alerts
-   - Create enhanced lot cards
+**This Week:**
+1. Implement Bargain Hunter Dashboard
+2. Add opportunity alerts
+3. Create enhanced lot cards

-3. **Next Week:**
-   - Build analytics dashboards
-   - Implement price prediction model
-   - Set up webhook notifications
+**Next Week:**
+1. Build analytics dashboards
+2. Implement ML price prediction
+3. Set up smart notifications

-4. **Future:**
-   - A/B test alert strategies
-   - Refine prediction models with historical data
-   - Add category-specific recommendations

---

## Conclusion

-The scraper now captures **5 critical intelligence fields** that unlock advanced analytics:
-
-| Field | Dashboard Impact |
-|-------|------------------|
-| followers_count | Popularity tracking, sleeper detection |
-| estimated_min_price | Bargain detection, value assessment |
-| estimated_max_price | Overvaluation alerts, ROI calculation |
-| lot_condition | Quality filtering, restoration opportunities |
-| appearance | Visual assessment, detailed condition |
-
-**Combined with fixed data quality** (99.9% fewer orphaned lots, 100% auction completeness), the dashboard can now provide:
-
-- 🎯 **Opportunity Detection** - Automated bargain hunting
-- 📊 **Predictive Analytics** - ML-based price predictions
-- 📈 **Category Intelligence** - Deep market insights
-- ⚡ **Real-Time Alerts** - Instant opportunity notifications
-- 💰 **ROI Tracking** - Measure investment potential
-
-**Estimated intelligence value increase: 80%+**
-
-Ready to build! 🚀
+**80%+ intelligence increase** enables:
+
+- 🎯 Automated bargain detection
+- 📊 Predictive price modeling
+- ⚡ Real-time opportunity alerts
+- 💰 ROI tracking
+
+**Run migrations to activate all features.**
@@ -7,9 +7,8 @@ aiohttp>=3.9.0  # Optional: only needed if DOWNLOAD_IMAGES=True

# ORM groundwork (gradual adoption)
SQLAlchemy>=2.0  # Modern ORM (2.x) — groundwork for PostgreSQL
-# For PostgreSQL in the near future, install one of:
-# psycopg[binary]>=3.1 # Recommended
-# or psycopg2-binary>=2.9
+# PostgreSQL driver (runtime)
+psycopg[binary]>=3.1

# Development/Testing
pytest>=7.4.0  # Optional: for testing
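
With the driver pinned above, a quick connection check against the new backend can look roughly like this (the DSN is a placeholder; `psycopg` is the 3.x package named in the requirements):

```python
import psycopg  # psycopg 3.x

DSN = "postgresql://scraper:password@localhost:5432/auctions"  # placeholder credentials

with psycopg.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
```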
@@ -1,290 +0,0 @@
#!/usr/bin/env python3
"""
Script to detect and fix malformed/incomplete database entries.

Identifies entries with:
- Missing auction_id for auction pages
- Missing title
- Invalid bid values like "€Huidig bod"
- "gap" in closing_time
- Empty or invalid critical fields

Then re-parses from cache and updates.
"""
import sys
import sqlite3
import zlib
from pathlib import Path
from typing import List, Dict, Tuple

sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))

from parse import DataParser
from config import CACHE_DB


class MalformedEntryFixer:
    """Detects and fixes malformed database entries"""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self.parser = DataParser()

    def detect_malformed_auctions(self) -> List[Tuple]:
        """Find auctions with missing or invalid data"""
        with sqlite3.connect(self.db_path) as conn:
            # Auctions with issues
            cursor = conn.execute("""
                SELECT auction_id, url, title, first_lot_closing_time
                FROM auctions
                WHERE
                    auction_id = '' OR auction_id IS NULL
                    OR title = '' OR title IS NULL
                    OR first_lot_closing_time = 'gap'
                    OR first_lot_closing_time LIKE '%wegens vereffening%'
            """)
            return cursor.fetchall()

    def detect_malformed_lots(self) -> List[Tuple]:
        """Find lots with missing or invalid data"""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute("""
                SELECT lot_id, url, title, current_bid, closing_time
                FROM lots
                WHERE
                    auction_id = '' OR auction_id IS NULL
                    OR title = '' OR title IS NULL
                    OR current_bid LIKE '%Huidig%bod%'
                    OR current_bid = '€Huidig bod'
                    OR closing_time = 'gap'
                    OR closing_time = ''
                    OR closing_time LIKE '%wegens vereffening%'
            """)
            return cursor.fetchall()

    def get_cached_content(self, url: str) -> str:
        """Retrieve and decompress cached HTML for a URL"""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                "SELECT content FROM cache WHERE url = ?",
                (url,)
            )
            row = cursor.fetchone()
            if row and row[0]:
                try:
                    return zlib.decompress(row[0]).decode('utf-8')
                except Exception as e:
                    print(f"  ❌ Failed to decompress: {e}")
                    return None
            return None

    def reparse_and_fix_auction(self, auction_id: str, url: str, dry_run: bool = False) -> bool:
        """Re-parse auction page from cache and update database"""
        print(f"\n  Fixing auction: {auction_id}")
        print(f"  URL: {url}")

        content = self.get_cached_content(url)
        if not content:
            print(f"  ❌ No cached content found")
            return False

        # Re-parse using current parser
        parsed = self.parser.parse_page(content, url)
        if not parsed or parsed.get('type') != 'auction':
            print(f"  ❌ Could not parse as auction")
            return False

        # Validate parsed data
        if not parsed.get('auction_id') or not parsed.get('title'):
            print(f"  ⚠️ Re-parsed data still incomplete:")
            print(f"     auction_id: {parsed.get('auction_id')}")
            print(f"     title: {parsed.get('title', '')[:50]}")
            return False

        print(f"  ✓ Parsed successfully:")
        print(f"     auction_id: {parsed.get('auction_id')}")
        print(f"     title: {parsed.get('title', '')[:50]}")
        print(f"     location: {parsed.get('location', 'N/A')}")
        print(f"     lots: {parsed.get('lots_count', 0)}")

        if not dry_run:
            with sqlite3.connect(self.db_path) as conn:
                conn.execute("""
                    UPDATE auctions SET
                        auction_id = ?,
                        title = ?,
                        location = ?,
                        lots_count = ?,
                        first_lot_closing_time = ?
                    WHERE url = ?
                """, (
                    parsed['auction_id'],
                    parsed['title'],
                    parsed.get('location', ''),
                    parsed.get('lots_count', 0),
                    parsed.get('first_lot_closing_time', ''),
                    url
                ))
                conn.commit()
                print(f"  ✓ Database updated")

        return True

    def reparse_and_fix_lot(self, lot_id: str, url: str, dry_run: bool = False) -> bool:
        """Re-parse lot page from cache and update database"""
        print(f"\n  Fixing lot: {lot_id}")
        print(f"  URL: {url}")

        content = self.get_cached_content(url)
        if not content:
            print(f"  ❌ No cached content found")
            return False

        # Re-parse using current parser
        parsed = self.parser.parse_page(content, url)
        if not parsed or parsed.get('type') != 'lot':
            print(f"  ❌ Could not parse as lot")
            return False

        # Validate parsed data
        issues = []
        if not parsed.get('lot_id'):
            issues.append("missing lot_id")
        if not parsed.get('title'):
            issues.append("missing title")
        if parsed.get('current_bid', '').lower().startswith('€huidig'):
            issues.append("invalid bid format")

        if issues:
            print(f"  ⚠️ Re-parsed data still has issues: {', '.join(issues)}")
            print(f"     lot_id: {parsed.get('lot_id')}")
            print(f"     title: {parsed.get('title', '')[:50]}")
            print(f"     bid: {parsed.get('current_bid')}")
            return False

        print(f"  ✓ Parsed successfully:")
        print(f"     lot_id: {parsed.get('lot_id')}")
        print(f"     auction_id: {parsed.get('auction_id')}")
        print(f"     title: {parsed.get('title', '')[:50]}")
        print(f"     bid: {parsed.get('current_bid')}")
        print(f"     closing: {parsed.get('closing_time', 'N/A')}")

        if not dry_run:
            with sqlite3.connect(self.db_path) as conn:
                conn.execute("""
                    UPDATE lots SET
                        lot_id = ?,
                        auction_id = ?,
                        title = ?,
                        current_bid = ?,
                        bid_count = ?,
                        closing_time = ?,
                        viewing_time = ?,
                        pickup_date = ?,
                        location = ?,
                        description = ?,
                        category = ?
                    WHERE url = ?
                """, (
                    parsed['lot_id'],
                    parsed.get('auction_id', ''),
                    parsed['title'],
                    parsed.get('current_bid', ''),
                    parsed.get('bid_count', 0),
                    parsed.get('closing_time', ''),
                    parsed.get('viewing_time', ''),
                    parsed.get('pickup_date', ''),
                    parsed.get('location', ''),
                    parsed.get('description', ''),
                    parsed.get('category', ''),
                    url
                ))
                conn.commit()
                print(f"  ✓ Database updated")

        return True

    def run(self, dry_run: bool = False):
        """Main execution - detect and fix all malformed entries"""
        print("="*70)
        print("MALFORMED ENTRY DETECTION AND REPAIR")
        print("="*70)

        # Check for auctions
        print("\n1. CHECKING AUCTIONS...")
        malformed_auctions = self.detect_malformed_auctions()
        print(f"   Found {len(malformed_auctions)} malformed auction entries")

        stats = {'auctions_fixed': 0, 'auctions_failed': 0}
        for auction_id, url, title, closing_time in malformed_auctions:
            try:
                if self.reparse_and_fix_auction(auction_id or url.split('/')[-1], url, dry_run):
                    stats['auctions_fixed'] += 1
                else:
                    stats['auctions_failed'] += 1
            except Exception as e:
                print(f"  ❌ Error: {e}")
                stats['auctions_failed'] += 1

        # Check for lots
        print("\n2. CHECKING LOTS...")
        malformed_lots = self.detect_malformed_lots()
        print(f"   Found {len(malformed_lots)} malformed lot entries")

        stats['lots_fixed'] = 0
        stats['lots_failed'] = 0
        for lot_id, url, title, bid, closing_time in malformed_lots:
            try:
                if self.reparse_and_fix_lot(lot_id or url.split('/')[-1], url, dry_run):
                    stats['lots_fixed'] += 1
                else:
                    stats['lots_failed'] += 1
            except Exception as e:
                print(f"  ❌ Error: {e}")
                stats['lots_failed'] += 1

        # Summary
        print("\n" + "="*70)
        print("SUMMARY")
        print("="*70)
        print(f"Auctions:")
        print(f"  - Found: {len(malformed_auctions)}")
        print(f"  - Fixed: {stats['auctions_fixed']}")
        print(f"  - Failed: {stats['auctions_failed']}")
        print(f"\nLots:")
        print(f"  - Found: {len(malformed_lots)}")
        print(f"  - Fixed: {stats['lots_fixed']}")
        print(f"  - Failed: {stats['lots_failed']}")

        if dry_run:
            print("\n⚠️ DRY RUN - No changes were made to the database")


def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="Detect and fix malformed database entries"
    )
    parser.add_argument(
        '--db',
        default=CACHE_DB,
        help='Path to cache database'
    )
    parser.add_argument(
        '--dry-run',
        action='store_true',
        help='Show what would be done without making changes'
    )

    args = parser.parse_args()

    print(f"Database: {args.db}")
    print(f"Dry run: {args.dry_run}\n")

    fixer = MalformedEntryFixer(args.db)
    fixer.run(dry_run=args.dry_run)


if __name__ == "__main__":
    main()
@@ -1,139 +0,0 @@
#!/usr/bin/env python3
"""
Migrate uncompressed cache entries to compressed format
This script compresses all cache entries where compressed=0
"""

import sqlite3
import zlib
import time

CACHE_DB = "/mnt/okcomputer/output/cache.db"

def migrate_cache():
    """Compress all uncompressed cache entries"""

    with sqlite3.connect(CACHE_DB) as conn:
        # Get uncompressed entries
        cursor = conn.execute(
            "SELECT url, content FROM cache WHERE compressed = 0 OR compressed IS NULL"
        )
        uncompressed = cursor.fetchall()

        if not uncompressed:
            print("✓ No uncompressed entries found. All cache is already compressed!")
            return

        print(f"Found {len(uncompressed)} uncompressed cache entries")
        print("Starting compression...")

        total_original_size = 0
        total_compressed_size = 0
        compressed_count = 0

        for url, content in uncompressed:
            try:
                # Handle both text and bytes
                if isinstance(content, str):
                    content_bytes = content.encode('utf-8')
                else:
                    content_bytes = content

                original_size = len(content_bytes)

                # Compress
                compressed_content = zlib.compress(content_bytes, level=9)
                compressed_size = len(compressed_content)

                # Update in database
                conn.execute(
                    "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
                    (compressed_content, url)
                )

                total_original_size += original_size
                total_compressed_size += compressed_size
                compressed_count += 1

                if compressed_count % 100 == 0:
                    conn.commit()
                    ratio = (1 - total_compressed_size / total_original_size) * 100
                    print(f"  Compressed {compressed_count}/{len(uncompressed)} entries... "
                          f"({ratio:.1f}% reduction so far)")

            except Exception as e:
                print(f"  ERROR compressing {url}: {e}")
                continue

        # Final commit
        conn.commit()

        # Calculate final statistics
        ratio = (1 - total_compressed_size / total_original_size) * 100 if total_original_size > 0 else 0
        size_saved_mb = (total_original_size - total_compressed_size) / (1024 * 1024)

        print("\n" + "="*60)
        print("MIGRATION COMPLETE")
        print("="*60)
        print(f"Entries compressed: {compressed_count}")
        print(f"Original size: {total_original_size / (1024*1024):.2f} MB")
        print(f"Compressed size: {total_compressed_size / (1024*1024):.2f} MB")
        print(f"Space saved: {size_saved_mb:.2f} MB")
        print(f"Compression ratio: {ratio:.1f}%")
        print("="*60)

def verify_migration():
    """Verify all entries are compressed"""
    with sqlite3.connect(CACHE_DB) as conn:
        cursor = conn.execute(
            "SELECT COUNT(*) FROM cache WHERE compressed = 0 OR compressed IS NULL"
        )
        uncompressed_count = cursor.fetchone()[0]

        cursor = conn.execute("SELECT COUNT(*) FROM cache WHERE compressed = 1")
        compressed_count = cursor.fetchone()[0]

        print("\nVERIFICATION:")
        print(f"  Compressed entries: {compressed_count}")
        print(f"  Uncompressed entries: {uncompressed_count}")

        if uncompressed_count == 0:
            print("  ✓ All cache entries are compressed!")
            return True
        else:
            print("  ✗ Some entries are still uncompressed")
            return False

def get_db_size():
    """Get current database file size"""
    import os
    if os.path.exists(CACHE_DB):
        size_mb = os.path.getsize(CACHE_DB) / (1024 * 1024)
        return size_mb
    return 0

if __name__ == "__main__":
    print("Cache Compression Migration Tool")
    print("="*60)

    # Show initial DB size
    initial_size = get_db_size()
    print(f"Initial database size: {initial_size:.2f} MB\n")

    # Run migration
    start_time = time.time()
    migrate_cache()
    elapsed = time.time() - start_time

    print(f"\nTime taken: {elapsed:.2f} seconds")

    # Verify
    verify_migration()

    # Show final DB size
    final_size = get_db_size()
    print(f"\nFinal database size: {final_size:.2f} MB")
    print(f"Database size reduced by: {initial_size - final_size:.2f} MB")

    print("\n✓ Migration complete! You can now run VACUUM to reclaim disk space:")
    print("  sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'")
@@ -1,180 +0,0 @@
#!/usr/bin/env python3
"""
Migration script to re-parse cached HTML pages and update database entries.
Fixes issues with incomplete data extraction from earlier scrapes.
"""
import sys
import sqlite3
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))

from parse import DataParser
from config import CACHE_DB


def reparse_and_update_lots(db_path: str = CACHE_DB, dry_run: bool = False):
    """
    Re-parse cached HTML pages and update lot entries in the database.

    This extracts improved data from __NEXT_DATA__ JSON blobs that may have been
    missed in earlier scraping runs when validation was less strict.
    """
    parser = DataParser()

    with sqlite3.connect(db_path) as conn:
        # Get all cached lot pages
        cursor = conn.execute("""
            SELECT url, content
            FROM cache
            WHERE url LIKE '%/l/%'
            ORDER BY timestamp DESC
        """)

        cached_pages = cursor.fetchall()
        print(f"Found {len(cached_pages)} cached lot pages to re-parse")

        stats = {
            'processed': 0,
            'updated': 0,
            'skipped': 0,
            'errors': 0
        }

        for url, compressed_content in cached_pages:
            try:
                # Decompress content
                import zlib
                content = zlib.decompress(compressed_content).decode('utf-8')

                # Re-parse using current parser logic
                parsed_data = parser.parse_page(content, url)

                if not parsed_data or parsed_data.get('type') != 'lot':
                    stats['skipped'] += 1
                    continue

                lot_id = parsed_data.get('lot_id', '')
                if not lot_id:
                    print(f"  ⚠️ No lot_id for {url}")
                    stats['skipped'] += 1
                    continue

                # Check if lot exists
                existing = conn.execute(
                    "SELECT lot_id FROM lots WHERE lot_id = ?",
                    (lot_id,)
                ).fetchone()

                if not existing:
                    print(f"  → New lot: {lot_id}")
                    # Insert new lot
                    if not dry_run:
                        conn.execute("""
                            INSERT INTO lots
                            (lot_id, auction_id, url, title, current_bid, bid_count,
                             closing_time, viewing_time, pickup_date, location,
                             description, category, scraped_at)
                            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                        """, (
                            lot_id,
                            parsed_data.get('auction_id', ''),
                            url,
                            parsed_data.get('title', ''),
                            parsed_data.get('current_bid', ''),
                            parsed_data.get('bid_count', 0),
                            parsed_data.get('closing_time', ''),
                            parsed_data.get('viewing_time', ''),
                            parsed_data.get('pickup_date', ''),
                            parsed_data.get('location', ''),
                            parsed_data.get('description', ''),
                            parsed_data.get('category', ''),
                            parsed_data.get('scraped_at', '')
                        ))
                        stats['updated'] += 1
                else:
                    # Update existing lot with newly parsed data
                    # Only update fields that are now populated but weren't before
                    if not dry_run:
                        conn.execute("""
                            UPDATE lots SET
                                auction_id = COALESCE(NULLIF(?, ''), auction_id),
                                title = COALESCE(NULLIF(?, ''), title),
                                current_bid = COALESCE(NULLIF(?, ''), current_bid),
                                bid_count = CASE WHEN ? > 0 THEN ? ELSE bid_count END,
                                closing_time = COALESCE(NULLIF(?, ''), closing_time),
                                viewing_time = COALESCE(NULLIF(?, ''), viewing_time),
                                pickup_date = COALESCE(NULLIF(?, ''), pickup_date),
                                location = COALESCE(NULLIF(?, ''), location),
                                description = COALESCE(NULLIF(?, ''), description),
                                category = COALESCE(NULLIF(?, ''), category)
                            WHERE lot_id = ?
                        """, (
                            parsed_data.get('auction_id', ''),
                            parsed_data.get('title', ''),
                            parsed_data.get('current_bid', ''),
                            parsed_data.get('bid_count', 0),
                            parsed_data.get('bid_count', 0),
                            parsed_data.get('closing_time', ''),
                            parsed_data.get('viewing_time', ''),
                            parsed_data.get('pickup_date', ''),
                            parsed_data.get('location', ''),
                            parsed_data.get('description', ''),
                            parsed_data.get('category', ''),
                            lot_id
                        ))
                        stats['updated'] += 1

                    print(f"  ✓ Updated: {lot_id[:20]}")

                # Update images if they exist
                images = parsed_data.get('images', [])
                if images and not dry_run:
                    for img_url in images:
                        conn.execute("""
                            INSERT OR IGNORE INTO images (lot_id, url)
                            VALUES (?, ?)
                        """, (lot_id, img_url))

                stats['processed'] += 1

                if stats['processed'] % 100 == 0:
                    print(f"  Progress: {stats['processed']}/{len(cached_pages)}")
                    if not dry_run:
                        conn.commit()

            except Exception as e:
                print(f"  ❌ Error processing {url}: {e}")
                stats['errors'] += 1
                continue

        if not dry_run:
            conn.commit()

    print("\n" + "="*60)
    print("MIGRATION COMPLETE")
    print("="*60)
    print(f"Processed: {stats['processed']}")
    print(f"Updated: {stats['updated']}")
    print(f"Skipped: {stats['skipped']}")
    print(f"Errors: {stats['errors']}")

    if dry_run:
        print("\n⚠️ DRY RUN - No changes were made to the database")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Re-parse and update lot entries from cached HTML")
    parser.add_argument('--db', default=CACHE_DB, help='Path to cache database')
    parser.add_argument('--dry-run', action='store_true', help='Show what would be done without making changes')

    args = parser.parse_args()

    print(f"Database: {args.db}")
    print(f"Dry run: {args.dry_run}")
    print()

    reparse_and_update_lots(args.db, args.dry_run)
823
src/cache.py
@@ -1,26 +1,206 @@
|
|||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""
|
"""
|
||||||
Cache Manager module for SQLite-based caching and data storage
|
Cache Manager module for database-backed caching and data storage.
|
||||||
|
|
||||||
|
Primary backend: PostgreSQL (psycopg)
|
||||||
|
Fallback (dev/tests only): SQLite
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import sqlite3
|
import sqlite3
|
||||||
|
import psycopg
|
||||||
import time
|
import time
|
||||||
import zlib
|
import zlib
|
||||||
import json
|
import json
|
||||||
from datetime import datetime
|
from datetime import datetime
|
||||||
from typing import Dict, List, Optional
|
from typing import Dict, List, Optional, Tuple
|
||||||
|
|
||||||
import config
|
import config
|
||||||
|
|
||||||
class CacheManager:
|
class CacheManager:
|
||||||
"""Manages page caching and data storage using SQLite"""
|
"""Manages page caching and data storage using PostgreSQL (preferred) or SQLite."""
|
||||||
|
|
||||||
def __init__(self, db_path: str = None):
|
def __init__(self, db_path: str = None):
|
||||||
|
# Decide backend
|
||||||
|
self.database_url = (config.DATABASE_URL or '').strip()
|
||||||
|
self.use_postgres = self.database_url.lower().startswith('postgresql')
|
||||||
|
# Legacy sqlite path retained only for fallback/testing
|
||||||
self.db_path = db_path or config.CACHE_DB
|
self.db_path = db_path or config.CACHE_DB
|
||||||
self._init_db()
|
self._init_db()
|
||||||
|
|
||||||
|
# ------------------------
|
||||||
|
# Connection helpers
|
||||||
|
# ------------------------
|
||||||
|
def _pg(self):
|
||||||
|
return psycopg.connect(self.database_url)
|
||||||
|
|
||||||
def _init_db(self):
|
def _init_db(self):
|
||||||
"""Initialize cache and data storage database with consolidated schema"""
|
"""Initialize database schema if missing.
|
||||||
|
|
||||||
|
- For PostgreSQL: create tables with IF NOT EXISTS.
|
||||||
|
- For SQLite: retain legacy schema and migrations.
|
||||||
|
"""
|
||||||
|
if self.use_postgres:
|
||||||
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
|
# Auctions
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
CREATE TABLE IF NOT EXISTS auctions (
|
||||||
|
auction_id TEXT PRIMARY KEY,
|
||||||
|
url TEXT UNIQUE,
|
||||||
|
title TEXT,
|
||||||
|
location TEXT,
|
||||||
|
lots_count INTEGER,
|
||||||
|
first_lot_closing_time TEXT,
|
||||||
|
scraped_at TEXT,
|
||||||
|
city TEXT,
|
||||||
|
country TEXT,
|
||||||
|
type TEXT,
|
||||||
|
lot_count INTEGER DEFAULT 0,
|
||||||
|
closing_time TEXT,
|
||||||
|
discovered_at BIGINT
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_auctions_country ON auctions(country)")
|
||||||
|
|
||||||
|
# Cache
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
CREATE TABLE IF NOT EXISTS cache (
|
||||||
|
url TEXT PRIMARY KEY,
|
||||||
|
content BYTEA,
|
||||||
|
timestamp DOUBLE PRECISION,
|
||||||
|
status_code INTEGER
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON cache(timestamp)")
|
||||||
|
|
||||||
|
# Lots
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
CREATE TABLE IF NOT EXISTS lots (
|
||||||
|
lot_id TEXT PRIMARY KEY,
|
||||||
|
auction_id TEXT REFERENCES auctions(auction_id),
|
||||||
|
url TEXT UNIQUE,
|
||||||
|
title TEXT,
|
||||||
|
current_bid TEXT,
|
||||||
|
bid_count INTEGER,
|
||||||
|
closing_time TEXT,
|
||||||
|
viewing_time TEXT,
|
||||||
|
pickup_date TEXT,
|
||||||
|
location TEXT,
|
||||||
|
description TEXT,
|
||||||
|
category TEXT,
|
||||||
|
scraped_at TEXT,
|
||||||
|
sale_id INTEGER,
|
||||||
|
manufacturer TEXT,
|
||||||
|
type TEXT,
|
||||||
|
year INTEGER,
|
||||||
|
currency TEXT DEFAULT 'EUR',
|
||||||
|
closing_notified INTEGER DEFAULT 0,
|
||||||
|
starting_bid TEXT,
|
||||||
|
minimum_bid TEXT,
|
||||||
|
status TEXT,
|
||||||
|
brand TEXT,
|
||||||
|
model TEXT,
|
||||||
|
attributes_json TEXT,
|
||||||
|
first_bid_time TEXT,
|
||||||
|
last_bid_time TEXT,
|
||||||
|
bid_velocity DOUBLE PRECISION,
|
||||||
|
bid_increment DOUBLE PRECISION,
|
||||||
|
year_manufactured INTEGER,
|
||||||
|
condition_score DOUBLE PRECISION,
|
||||||
|
condition_description TEXT,
|
||||||
|
serial_number TEXT,
|
||||||
|
damage_description TEXT,
|
||||||
|
followers_count INTEGER DEFAULT 0,
|
||||||
|
estimated_min_price DOUBLE PRECISION,
|
||||||
|
estimated_max_price DOUBLE PRECISION,
|
||||||
|
lot_condition TEXT,
|
||||||
|
appearance TEXT,
|
||||||
|
estimated_min DOUBLE PRECISION,
|
||||||
|
estimated_max DOUBLE PRECISION,
|
||||||
|
next_bid_step_cents INTEGER,
|
||||||
|
condition TEXT,
|
||||||
|
category_path TEXT,
|
||||||
|
city_location TEXT,
|
||||||
|
country_code TEXT,
|
||||||
|
bidding_status TEXT,
|
||||||
|
packaging TEXT,
|
||||||
|
quantity INTEGER,
|
||||||
|
vat DOUBLE PRECISION,
|
||||||
|
buyer_premium_percentage DOUBLE PRECISION,
|
||||||
|
remarks TEXT,
|
||||||
|
reserve_price DOUBLE PRECISION,
|
||||||
|
reserve_met INTEGER,
|
||||||
|
view_count INTEGER,
|
||||||
|
api_data_json TEXT,
|
||||||
|
next_scrape_at BIGINT,
|
||||||
|
scrape_priority INTEGER DEFAULT 0
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_sale_id ON lots(sale_id)")
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_closing_time ON lots(closing_time)")
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_next_scrape ON lots(next_scrape_at)")
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_priority ON lots(scrape_priority DESC)")
|
||||||
|
|
||||||
|
# Images
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
CREATE TABLE IF NOT EXISTS images (
|
||||||
|
id SERIAL PRIMARY KEY,
|
||||||
|
lot_id TEXT REFERENCES lots(lot_id),
|
||||||
|
url TEXT,
|
||||||
|
local_path TEXT,
|
||||||
|
downloaded INTEGER DEFAULT 0,
|
||||||
|
labels TEXT,
|
||||||
|
processed_at BIGINT
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_images_lot_id ON images(lot_id)")
|
||||||
|
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_lot_url ON images(lot_id, url)")
|
||||||
|
|
||||||
|
# Bid history
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
CREATE TABLE IF NOT EXISTS bid_history (
|
||||||
|
id SERIAL PRIMARY KEY,
|
||||||
|
lot_id TEXT REFERENCES lots(lot_id),
|
||||||
|
bid_amount DOUBLE PRECISION NOT NULL,
|
||||||
|
bid_time TEXT NOT NULL,
|
||||||
|
is_autobid INTEGER DEFAULT 0,
|
||||||
|
bidder_id TEXT,
|
||||||
|
bidder_number INTEGER,
|
||||||
|
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_bid_history_bidder ON bid_history(bidder_id)")
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_bid_history_lot_time ON bid_history(lot_id, bid_time)")
|
||||||
|
# Resource cache
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
CREATE TABLE IF NOT EXISTS resource_cache (
|
||||||
|
url TEXT PRIMARY KEY,
|
||||||
|
content BYTEA,
|
||||||
|
content_type TEXT,
|
||||||
|
status_code INTEGER,
|
||||||
|
headers TEXT,
|
||||||
|
timestamp DOUBLE PRECISION,
|
||||||
|
size_bytes INTEGER,
|
||||||
|
local_path TEXT
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_resource_timestamp ON resource_cache(timestamp)")
|
||||||
|
cur.execute("CREATE INDEX IF NOT EXISTS idx_resource_content_type ON resource_cache(content_type)")
|
||||||
|
conn.commit()
|
||||||
|
return
|
||||||
|
|
||||||
|
# SQLite legacy path
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
# HTML page cache table (existing)
|
# HTML page cache table (existing)
|
||||||
conn.execute("""
|
conn.execute("""
|
||||||
@@ -276,12 +456,17 @@ class CacheManager:
|
|||||||
|
|
||||||
def get(self, url: str, max_age_hours: int = 24) -> Optional[Dict]:
|
def get(self, url: str, max_age_hours: int = 24) -> Optional[Dict]:
|
||||||
"""Get cached page if it exists and is not too old"""
|
"""Get cached page if it exists and is not too old"""
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
if self.use_postgres:
|
||||||
cursor = conn.execute(
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
"SELECT content, timestamp, status_code FROM cache WHERE url = ?",
|
cur.execute("SELECT content, timestamp, status_code FROM cache WHERE url = %s", (url,))
|
||||||
(url,)
|
row = cur.fetchone()
|
||||||
)
|
else:
|
||||||
row = cursor.fetchone()
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
cursor = conn.execute(
|
||||||
|
"SELECT content, timestamp, status_code FROM cache WHERE url = ?",
|
||||||
|
(url,)
|
||||||
|
)
|
||||||
|
row = cursor.fetchone()
|
||||||
|
|
||||||
if row:
|
if row:
|
||||||
content, timestamp, status_code = row
|
content, timestamp, status_code = row
|
||||||
@@ -304,27 +489,48 @@ class CacheManager:
|
|||||||
|
|
||||||
def set(self, url: str, content: str, status_code: int = 200):
|
def set(self, url: str, content: str, status_code: int = 200):
|
||||||
"""Cache a page with compression"""
|
"""Cache a page with compression"""
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
compressed_content = zlib.compress(content.encode('utf-8'), level=9)
|
||||||
compressed_content = zlib.compress(content.encode('utf-8'), level=9)
|
original_size = len(content.encode('utf-8'))
|
||||||
original_size = len(content.encode('utf-8'))
|
compressed_size = len(compressed_content)
|
||||||
compressed_size = len(compressed_content)
|
ratio = (1 - compressed_size / original_size) * 100 if original_size > 0 else 0
|
||||||
ratio = (1 - compressed_size / original_size) * 100 if original_size > 0 else 0
|
|
||||||
|
|
||||||
conn.execute(
|
if self.use_postgres:
|
||||||
"INSERT OR REPLACE INTO cache (url, content, timestamp, status_code) VALUES (?, ?, ?, ?)",
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
(url, compressed_content, time.time(), status_code)
|
cur.execute(
|
||||||
)
|
"""
|
||||||
conn.commit()
|
INSERT INTO cache (url, content, timestamp, status_code)
|
||||||
print(f" -> Cached: {url} (compressed {ratio:.1f}%)")
|
VALUES (%s, %s, %s, %s)
|
||||||
|
ON CONFLICT (url)
|
||||||
|
DO UPDATE SET content = EXCLUDED.content,
|
||||||
|
timestamp = EXCLUDED.timestamp,
|
||||||
|
status_code = EXCLUDED.status_code
|
||||||
|
""",
|
||||||
|
(url, compressed_content, time.time(), status_code),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
else:
|
||||||
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
conn.execute(
|
||||||
|
"INSERT OR REPLACE INTO cache (url, content, timestamp, status_code) VALUES (?, ?, ?, ?)",
|
||||||
|
(url, compressed_content, time.time(), status_code)
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
print(f" -> Cached: {url} (compressed {ratio:.1f}%)")
|
||||||
|
|
||||||
def clear_old(self, max_age_hours: int = 168):
|
def clear_old(self, max_age_hours: int = 168):
|
||||||
"""Clear old cache entries to prevent database bloat"""
|
"""Clear old cache entries to prevent database bloat"""
|
||||||
cutoff_time = time.time() - (max_age_hours * 3600)
|
cutoff_time = time.time() - (max_age_hours * 3600)
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
if self.use_postgres:
|
||||||
deleted = conn.execute("DELETE FROM cache WHERE timestamp < ?", (cutoff_time,)).rowcount
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
conn.commit()
|
cur.execute("DELETE FROM cache WHERE timestamp < %s", (cutoff_time,))
|
||||||
if deleted > 0:
|
deleted = cur.rowcount or 0
|
||||||
print(f" → Cleared {deleted} old cache entries")
|
conn.commit()
|
||||||
|
else:
|
||||||
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
deleted = conn.execute("DELETE FROM cache WHERE timestamp < ?", (cutoff_time,)).rowcount
|
||||||
|
conn.commit()
|
||||||
|
if (deleted or 0) > 0:
|
||||||
|
print(f" → Cleared {deleted} old cache entries")
|
||||||
|
|
||||||
def save_auction(self, auction_data: Dict):
|
def save_auction(self, auction_data: Dict):
|
||||||
"""Save auction data to database"""
|
"""Save auction data to database"""
|
||||||
@@ -338,118 +544,274 @@ class CacheManager:
|
|||||||
city = parts[0]
|
city = parts[0]
|
||||||
country = parts[-1]
|
country = parts[-1]
|
||||||
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
if self.use_postgres:
|
||||||
conn.execute("""
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
INSERT OR REPLACE INTO auctions
|
cur.execute(
|
||||||
(auction_id, url, title, location, lots_count, first_lot_closing_time, scraped_at,
|
"""
|
||||||
city, country, type, lot_count, closing_time, discovered_at)
|
INSERT INTO auctions
|
||||||
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
|
(auction_id, url, title, location, lots_count, first_lot_closing_time, scraped_at,
|
||||||
""", (
|
city, country, type, lot_count, closing_time, discovered_at)
|
||||||
auction_data['auction_id'],
|
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
|
||||||
auction_data['url'],
|
ON CONFLICT (auction_id)
|
||||||
auction_data['title'],
|
DO UPDATE SET url = EXCLUDED.url,
|
||||||
location,
|
title = EXCLUDED.title,
|
||||||
auction_data.get('lots_count', 0),
|
location = EXCLUDED.location,
|
||||||
auction_data.get('first_lot_closing_time', ''),
|
lots_count = EXCLUDED.lots_count,
|
||||||
auction_data['scraped_at'],
|
first_lot_closing_time = EXCLUDED.first_lot_closing_time,
|
||||||
city,
|
scraped_at = EXCLUDED.scraped_at,
|
||||||
country,
|
city = EXCLUDED.city,
|
||||||
'online', # Troostwijk is online platform
|
country = EXCLUDED.country,
|
||||||
auction_data.get('lots_count', 0), # Duplicate to lot_count for consistency
|
type = EXCLUDED.type,
|
||||||
auction_data.get('first_lot_closing_time', ''), # Use first_lot_closing_time as closing_time
|
lot_count = EXCLUDED.lot_count,
|
||||||
int(time.time())
|
closing_time = EXCLUDED.closing_time,
|
||||||
))
|
discovered_at = EXCLUDED.discovered_at
|
||||||
conn.commit()
|
""",
|
||||||
|
(
|
||||||
|
auction_data['auction_id'],
|
||||||
|
auction_data['url'],
|
||||||
|
auction_data['title'],
|
||||||
|
location,
|
||||||
|
auction_data.get('lots_count', 0),
|
||||||
|
auction_data.get('first_lot_closing_time', ''),
|
||||||
|
auction_data['scraped_at'],
|
||||||
|
city,
|
||||||
|
country,
|
||||||
|
'online',
|
||||||
|
auction_data.get('lots_count', 0),
|
||||||
|
auction_data.get('first_lot_closing_time', ''),
|
||||||
|
int(time.time()),
|
||||||
|
),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
else:
|
||||||
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
conn.execute(
|
||||||
|
"""
|
||||||
|
INSERT OR REPLACE INTO auctions
|
||||||
|
(auction_id, url, title, location, lots_count, first_lot_closing_time, scraped_at,
|
||||||
|
city, country, type, lot_count, closing_time, discovered_at)
|
||||||
|
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
|
||||||
|
""",
|
||||||
|
(
|
||||||
|
auction_data['auction_id'],
|
||||||
|
auction_data['url'],
|
||||||
|
auction_data['title'],
|
||||||
|
location,
|
||||||
|
auction_data.get('lots_count', 0),
|
||||||
|
auction_data.get('first_lot_closing_time', ''),
|
||||||
|
auction_data['scraped_at'],
|
||||||
|
city,
|
||||||
|
country,
|
||||||
|
'online',
|
||||||
|
auction_data.get('lots_count', 0),
|
||||||
|
auction_data.get('first_lot_closing_time', ''),
|
||||||
|
int(time.time()),
|
||||||
|
),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
|
||||||
def save_lot(self, lot_data: Dict):
|
def save_lot(self, lot_data: Dict):
|
||||||
"""Save lot data to database"""
|
"""Save lot data to database"""
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
params = (
|
||||||
conn.execute("""
|
lot_data['lot_id'],
|
||||||
INSERT OR REPLACE INTO lots
|
lot_data.get('auction_id', ''),
|
||||||
(lot_id, auction_id, url, title, current_bid, starting_bid, minimum_bid,
|
lot_data['url'],
|
||||||
bid_count, closing_time, viewing_time, pickup_date, location, description,
|
lot_data['title'],
|
||||||
category, status, brand, model, attributes_json,
|
lot_data.get('current_bid', ''),
|
||||||
first_bid_time, last_bid_time, bid_velocity, bid_increment,
|
lot_data.get('starting_bid', ''),
|
||||||
year_manufactured, condition_score, condition_description,
|
lot_data.get('minimum_bid', ''),
|
||||||
serial_number, manufacturer, damage_description,
|
lot_data.get('bid_count', 0),
|
||||||
followers_count, estimated_min_price, estimated_max_price, lot_condition, appearance,
|
lot_data.get('closing_time', ''),
|
||||||
scraped_at, api_data_json, next_scrape_at, scrape_priority)
|
lot_data.get('viewing_time', ''),
|
||||||
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
|
lot_data.get('pickup_date', ''),
|
||||||
""", (
|
lot_data.get('location', ''),
|
||||||
lot_data['lot_id'],
|
lot_data.get('description', ''),
|
||||||
lot_data.get('auction_id', ''),
|
lot_data.get('category', ''),
|
||||||
lot_data['url'],
|
lot_data.get('status', ''),
|
||||||
lot_data['title'],
|
lot_data.get('brand', ''),
|
||||||
lot_data.get('current_bid', ''),
|
lot_data.get('model', ''),
|
||||||
lot_data.get('starting_bid', ''),
|
lot_data.get('attributes_json', ''),
|
||||||
lot_data.get('minimum_bid', ''),
|
lot_data.get('first_bid_time'),
|
||||||
lot_data.get('bid_count', 0),
|
lot_data.get('last_bid_time'),
|
||||||
lot_data.get('closing_time', ''),
|
lot_data.get('bid_velocity'),
|
||||||
lot_data.get('viewing_time', ''),
|
lot_data.get('bid_increment'),
|
||||||
lot_data.get('pickup_date', ''),
|
lot_data.get('year_manufactured'),
|
||||||
lot_data.get('location', ''),
|
lot_data.get('condition_score'),
|
||||||
lot_data.get('description', ''),
|
lot_data.get('condition_description', ''),
|
||||||
lot_data.get('category', ''),
|
lot_data.get('serial_number', ''),
|
||||||
lot_data.get('status', ''),
|
lot_data.get('manufacturer', ''),
|
||||||
lot_data.get('brand', ''),
|
lot_data.get('damage_description', ''),
|
||||||
lot_data.get('model', ''),
|
lot_data.get('followers_count', 0),
|
||||||
lot_data.get('attributes_json', ''),
|
lot_data.get('estimated_min_price'),
|
||||||
lot_data.get('first_bid_time'),
|
lot_data.get('estimated_max_price'),
|
||||||
lot_data.get('last_bid_time'),
|
lot_data.get('lot_condition', ''),
|
||||||
lot_data.get('bid_velocity'),
|
lot_data.get('appearance', ''),
|
||||||
lot_data.get('bid_increment'),
|
lot_data['scraped_at'],
|
||||||
lot_data.get('year_manufactured'),
|
lot_data.get('api_data_json'),
|
||||||
lot_data.get('condition_score'),
|
lot_data.get('next_scrape_at'),
|
||||||
lot_data.get('condition_description', ''),
|
lot_data.get('scrape_priority', 0),
|
||||||
lot_data.get('serial_number', ''),
|
)
|
||||||
lot_data.get('manufacturer', ''),
|
if self.use_postgres:
|
||||||
lot_data.get('damage_description', ''),
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
lot_data.get('followers_count', 0),
|
cur.execute(
|
||||||
lot_data.get('estimated_min_price'),
|
"""
|
||||||
lot_data.get('estimated_max_price'),
|
INSERT INTO lots
|
||||||
lot_data.get('lot_condition', ''),
|
(lot_id, auction_id, url, title, current_bid, starting_bid, minimum_bid,
|
||||||
lot_data.get('appearance', ''),
|
bid_count, closing_time, viewing_time, pickup_date, location, description,
|
||||||
lot_data['scraped_at'],
|
category, status, brand, model, attributes_json,
|
||||||
lot_data.get('api_data_json'),
|
first_bid_time, last_bid_time, bid_velocity, bid_increment,
|
||||||
lot_data.get('next_scrape_at'),
|
year_manufactured, condition_score, condition_description,
|
||||||
lot_data.get('scrape_priority', 0)
|
serial_number, manufacturer, damage_description,
|
||||||
))
|
followers_count, estimated_min_price, estimated_max_price, lot_condition, appearance,
|
||||||
conn.commit()
|
scraped_at, api_data_json, next_scrape_at, scrape_priority)
|
||||||
|
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
|
||||||
|
ON CONFLICT (lot_id)
|
||||||
|
DO UPDATE SET auction_id = EXCLUDED.auction_id,
|
||||||
|
url = EXCLUDED.url,
|
||||||
|
title = EXCLUDED.title,
|
||||||
|
current_bid = EXCLUDED.current_bid,
|
||||||
|
starting_bid = EXCLUDED.starting_bid,
|
||||||
|
minimum_bid = EXCLUDED.minimum_bid,
|
||||||
|
bid_count = EXCLUDED.bid_count,
|
||||||
|
closing_time = EXCLUDED.closing_time,
|
||||||
|
viewing_time = EXCLUDED.viewing_time,
|
||||||
|
pickup_date = EXCLUDED.pickup_date,
|
||||||
|
location = EXCLUDED.location,
|
||||||
|
description = EXCLUDED.description,
|
||||||
|
category = EXCLUDED.category,
|
||||||
|
status = EXCLUDED.status,
|
||||||
|
brand = EXCLUDED.brand,
|
||||||
|
model = EXCLUDED.model,
|
||||||
|
attributes_json = EXCLUDED.attributes_json,
|
||||||
|
first_bid_time = EXCLUDED.first_bid_time,
|
||||||
|
last_bid_time = EXCLUDED.last_bid_time,
|
||||||
|
bid_velocity = EXCLUDED.bid_velocity,
|
||||||
|
bid_increment = EXCLUDED.bid_increment,
|
||||||
|
year_manufactured = EXCLUDED.year_manufactured,
|
||||||
|
condition_score = EXCLUDED.condition_score,
|
||||||
|
condition_description = EXCLUDED.condition_description,
|
||||||
|
serial_number = EXCLUDED.serial_number,
|
||||||
|
manufacturer = EXCLUDED.manufacturer,
|
||||||
|
damage_description = EXCLUDED.damage_description,
|
||||||
|
followers_count = EXCLUDED.followers_count,
|
||||||
|
estimated_min_price = EXCLUDED.estimated_min_price,
|
||||||
|
estimated_max_price = EXCLUDED.estimated_max_price,
|
||||||
|
lot_condition = EXCLUDED.lot_condition,
|
||||||
|
appearance = EXCLUDED.appearance,
|
||||||
|
scraped_at = EXCLUDED.scraped_at,
|
||||||
|
api_data_json = EXCLUDED.api_data_json,
|
||||||
|
next_scrape_at = EXCLUDED.next_scrape_at,
|
||||||
|
scrape_priority = EXCLUDED.scrape_priority
|
||||||
|
""",
|
||||||
|
params,
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
else:
|
||||||
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
conn.execute(
|
||||||
|
"""
|
||||||
|
INSERT OR REPLACE INTO lots
|
||||||
|
(lot_id, auction_id, url, title, current_bid, starting_bid, minimum_bid,
|
||||||
|
bid_count, closing_time, viewing_time, pickup_date, location, description,
|
||||||
|
category, status, brand, model, attributes_json,
|
||||||
|
first_bid_time, last_bid_time, bid_velocity, bid_increment,
|
||||||
|
year_manufactured, condition_score, condition_description,
|
||||||
|
serial_number, manufacturer, damage_description,
|
||||||
|
followers_count, estimated_min_price, estimated_max_price, lot_condition, appearance,
|
||||||
|
scraped_at, api_data_json, next_scrape_at, scrape_priority)
|
||||||
|
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
|
||||||
|
""",
|
||||||
|
params,
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
|
||||||
def save_bid_history(self, lot_id: str, bid_records: List[Dict]):
|
def save_bid_history(self, lot_id: str, bid_records: List[Dict]):
|
||||||
"""Save bid history records to database"""
|
"""Save bid history records to database"""
|
||||||
if not bid_records:
|
if not bid_records:
|
||||||
return
|
return
|
||||||
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
if self.use_postgres:
|
||||||
# Clear existing bid history for this lot
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
conn.execute("DELETE FROM bid_history WHERE lot_id = ?", (lot_id,))
|
cur.execute("DELETE FROM bid_history WHERE lot_id = %s", (lot_id,))
|
||||||
|
for record in bid_records:
|
||||||
# Insert new records
|
cur.execute(
|
||||||
for record in bid_records:
|
"""
|
||||||
conn.execute("""
|
INSERT INTO bid_history
|
||||||
INSERT INTO bid_history
|
(lot_id, bid_amount, bid_time, is_autobid, bidder_id, bidder_number)
|
||||||
(lot_id, bid_amount, bid_time, is_autobid, bidder_id, bidder_number)
|
VALUES (%s, %s, %s, %s, %s, %s)
|
||||||
VALUES (?, ?, ?, ?, ?, ?)
|
""",
|
||||||
""", (
|
(
|
||||||
record['lot_id'],
|
record['lot_id'],
|
||||||
record['bid_amount'],
|
record['bid_amount'],
|
||||||
record['bid_time'],
|
record['bid_time'],
|
||||||
1 if record['is_autobid'] else 0,
|
1 if record['is_autobid'] else 0,
|
||||||
record['bidder_id'],
|
record['bidder_id'],
|
||||||
record['bidder_number']
|
record['bidder_number'],
|
||||||
))
|
),
|
||||||
conn.commit()
|
)
|
||||||
|
conn.commit()
|
||||||
|
else:
|
||||||
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
conn.execute("DELETE FROM bid_history WHERE lot_id = ?", (lot_id,))
|
||||||
|
for record in bid_records:
|
||||||
|
conn.execute(
|
||||||
|
"""
|
||||||
|
INSERT INTO bid_history
|
||||||
|
(lot_id, bid_amount, bid_time, is_autobid, bidder_id, bidder_number)
|
||||||
|
VALUES (?, ?, ?, ?, ?, ?)
|
||||||
|
""",
|
||||||
|
(
|
||||||
|
record['lot_id'],
|
||||||
|
record['bid_amount'],
|
||||||
|
record['bid_time'],
|
||||||
|
1 if record['is_autobid'] else 0,
|
||||||
|
record['bidder_id'],
|
||||||
|
record['bidder_number'],
|
||||||
|
),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
|
||||||
def save_images(self, lot_id: str, image_urls: List[str]):
|
def save_images(self, lot_id: str, image_urls: List[str]):
|
||||||
"""Save image URLs for a lot (prevents duplicates via unique constraint)"""
|
"""Save image URLs for a lot (prevents duplicates via unique constraint)"""
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
if self.use_postgres:
|
||||||
for url in image_urls:
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
conn.execute("""
|
for url in image_urls:
|
||||||
INSERT OR IGNORE INTO images (lot_id, url, downloaded)
|
cur.execute(
|
||||||
VALUES (?, ?, 0)
|
"""
|
||||||
""", (lot_id, url))
|
INSERT INTO images (lot_id, url, downloaded)
|
||||||
conn.commit()
|
VALUES (%s, %s, 0)
|
||||||
|
ON CONFLICT (lot_id, url) DO NOTHING
|
||||||
|
""",
|
||||||
|
(lot_id, url),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
else:
|
||||||
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
for url in image_urls:
|
||||||
|
conn.execute(
|
||||||
|
"""
|
||||||
|
INSERT OR IGNORE INTO images (lot_id, url, downloaded)
|
||||||
|
VALUES (?, ?, 0)
|
||||||
|
""",
|
||||||
|
(lot_id, url),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
|
||||||
|
def update_image_local_path(self, lot_id: str, url: str, local_path: str):
|
||||||
|
if self.use_postgres:
|
||||||
|
with self._pg() as conn, conn.cursor() as cur:
|
||||||
|
cur.execute(
|
||||||
|
"UPDATE images SET local_path = %s, downloaded = 1 WHERE lot_id = %s AND url = %s",
|
||||||
|
(local_path, lot_id, url),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
else:
|
||||||
|
with sqlite3.connect(self.db_path) as conn:
|
||||||
|
conn.execute(
|
||||||
|
"UPDATE images SET local_path = ?, downloaded = 1 WHERE lot_id = ? AND url = ?",
|
||||||
|
(local_path, lot_id, url),
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
|
||||||
def save_resource(self, url: str, content: bytes, content_type: str, status_code: int = 200,
|
def save_resource(self, url: str, content: bytes, content_type: str, status_code: int = 200,
|
||||||
headers: Optional[Dict] = None, local_path: Optional[str] = None, cache_key: Optional[str] = None):
|
headers: Optional[Dict] = None, local_path: Optional[str] = None, cache_key: Optional[str] = None):
|
||||||
```diff
@@ -458,19 +820,40 @@ class CacheManager:
         Args:
             cache_key: Optional composite key (url + body hash for POST requests)
         """
-        with sqlite3.connect(self.db_path) as conn:
-            headers_json = json.dumps(headers) if headers else None
-            size_bytes = len(content) if content else 0
-
-            # Use cache_key if provided, otherwise use url
-            key = cache_key if cache_key else url
-
-            conn.execute("""
-                INSERT OR REPLACE INTO resource_cache
-                (url, content, content_type, status_code, headers, timestamp, size_bytes, local_path)
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
-            """, (key, content, content_type, status_code, headers_json, time.time(), size_bytes, local_path))
-            conn.commit()
+        headers_json = json.dumps(headers) if headers else None
+        size_bytes = len(content) if content else 0
+        key = cache_key if cache_key else url
+
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute(
+                    """
+                    INSERT INTO resource_cache
+                    (url, content, content_type, status_code, headers, timestamp, size_bytes, local_path)
+                    VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
+                    ON CONFLICT (url)
+                    DO UPDATE SET content = EXCLUDED.content,
+                                  content_type = EXCLUDED.content_type,
+                                  status_code = EXCLUDED.status_code,
+                                  headers = EXCLUDED.headers,
+                                  timestamp = EXCLUDED.timestamp,
+                                  size_bytes = EXCLUDED.size_bytes,
+                                  local_path = EXCLUDED.local_path
+                    """,
+                    (key, content, content_type, status_code, headers_json, time.time(), size_bytes, local_path),
+                )
+                conn.commit()
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                conn.execute(
+                    """
+                    INSERT OR REPLACE INTO resource_cache
+                    (url, content, content_type, status_code, headers, timestamp, size_bytes, local_path)
+                    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+                    """,
+                    (key, content, content_type, status_code, headers_json, time.time(), size_bytes, local_path),
+                )
+                conn.commit()
 
     def get_resource(self, url: str, cache_key: Optional[str] = None) -> Optional[Dict]:
         """Get a cached resource
```
```diff
@@ -478,13 +861,24 @@ class CacheManager:
         Args:
             cache_key: Optional composite key to lookup
         """
-        with sqlite3.connect(self.db_path) as conn:
-            key = cache_key if cache_key else url
-            cursor = conn.execute("""
-                SELECT content, content_type, status_code, headers, timestamp, size_bytes, local_path
-                FROM resource_cache WHERE url = ?
-            """, (key,))
-            row = cursor.fetchone()
+        key = cache_key if cache_key else url
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute(
+                    "SELECT content, content_type, status_code, headers, timestamp, size_bytes, local_path FROM resource_cache WHERE url = %s",
+                    (key,),
+                )
+                row = cur.fetchone()
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                cursor = conn.execute(
+                    """
+                    SELECT content, content_type, status_code, headers, timestamp, size_bytes, local_path
+                    FROM resource_cache WHERE url = ?
+                    """,
+                    (key,),
+                )
+                row = cursor.fetchone()
 
         if row:
             return {
```
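For POST-style requests (e.g. GraphQL), the docstrings above describe a composite `cache_key` built from the URL plus a body hash. The exact key format is not shown in this excerpt; a plausible sketch of how a caller might build and use one:

```python
# Illustrative only — the key format is an assumption based on the docstring
# "url + body hash for POST requests", not code taken from this commit.
import hashlib

url = "https://www.troostwijkauctions.com/graphql"      # example endpoint
body = b'{"query": "{ lot { currentBid } }"}'            # example POST body
cache_key = f"{url}#{hashlib.sha256(body).hexdigest()}"

# cache.save_resource(url, response_bytes, "application/json", 200, cache_key=cache_key)
# hit = cache.get_resource(url, cache_key=cache_key)     # same key -> same cached entry
```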
```diff
@@ -497,4 +891,141 @@ class CacheManager:
                 'local_path': row[6],
                 'cached': True
             }
         return None
+
+    # ------------------------
+    # Query helpers for scraper/monitor/export
+    # ------------------------
+    def get_counts(self) -> Dict[str, int]:
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute("SELECT COUNT(*) FROM auctions")
+                auctions = cur.fetchone()[0]
+                cur.execute("SELECT COUNT(*) FROM lots")
+                lots = cur.fetchone()[0]
+                return {"auctions": auctions, "lots": lots}
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                cur = conn.cursor()
+                cur.execute("SELECT COUNT(*) FROM auctions")
+                auctions = cur.fetchone()[0]
+                cur.execute("SELECT COUNT(*) FROM lots")
+                lots = cur.fetchone()[0]
+                return {"auctions": auctions, "lots": lots}
+
+    def get_lot_api_fields(self, lot_id: str) -> Optional[Tuple]:
+        sql = (
+            "SELECT followers_count, estimated_min_price, current_bid, bid_count, closing_time, status "
+            "FROM lots WHERE lot_id = %s" if self.use_postgres else
+            "SELECT followers_count, estimated_min_price, current_bid, bid_count, closing_time, status FROM lots WHERE lot_id = ?"
+        )
+        params = (lot_id,)
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute(sql, params)
+                return cur.fetchone()
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                cur = conn.cursor()
+                cur.execute(sql, params)
+                return cur.fetchone()
+
+    def get_page_record_by_url(self, url: str) -> Optional[Dict]:
+        # Try lot record first by URL
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute("SELECT * FROM lots WHERE url = %s", (url,))
+                lot_row = cur.fetchone()
+                if lot_row:
+                    col_names = [desc.name for desc in cur.description]
+                    lot_dict = dict(zip(col_names, lot_row))
+                    return {"type": "lot", **lot_dict}
+                cur.execute("SELECT * FROM auctions WHERE url = %s", (url,))
+                auc_row = cur.fetchone()
+                if auc_row:
+                    col_names = [desc.name for desc in cur.description]
+                    auc_dict = dict(zip(col_names, auc_row))
+                    return {"type": "auction", **auc_dict}
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                cur = conn.cursor()
+                cur.execute("SELECT * FROM lots WHERE url = ?", (url,))
+                lot_row = cur.fetchone()
+                if lot_row:
+                    col_names = [d[0] for d in cur.description]
+                    lot_dict = dict(zip(col_names, lot_row))
+                    return {"type": "lot", **lot_dict}
+                cur.execute("SELECT * FROM auctions WHERE url = ?", (url,))
+                auc_row = cur.fetchone()
+                if auc_row:
+                    col_names = [d[0] for d in cur.description]
+                    auc_dict = dict(zip(col_names, auc_row))
+                    return {"type": "auction", **auc_dict}
+        return None
+
+    def fetch_all(self, table: str) -> List[Dict]:
+        assert table in {"auctions", "lots"}
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute(f"SELECT * FROM {table}")
+                rows = cur.fetchall()
+                col_names = [desc.name for desc in cur.description]
+                return [dict(zip(col_names, r)) for r in rows]
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                conn.row_factory = sqlite3.Row
+                cur = conn.cursor()
+                cur.execute(f"SELECT * FROM {table}")
+                return [dict(row) for row in cur.fetchall()]
+
+    def get_lot_times(self, lot_id: str) -> Tuple[Optional[str], Optional[str]]:
+        sql = (
+            "SELECT viewing_time, pickup_date FROM lots WHERE lot_id = %s" if self.use_postgres else
+            "SELECT viewing_time, pickup_date FROM lots WHERE lot_id = ?"
+        )
+        params = (lot_id,)
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute(sql, params)
+                row = cur.fetchone()
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                cur = conn.cursor()
+                cur.execute(sql, params)
+                row = cur.fetchone()
+        if not row:
+            return None, None
+        return row[0], row[1]
+
+    def has_bid_history(self, lot_id: str) -> bool:
+        sql = (
+            "SELECT COUNT(*) FROM bid_history WHERE lot_id = %s" if self.use_postgres else
+            "SELECT COUNT(*) FROM bid_history WHERE lot_id = ?"
+        )
+        params = (lot_id,)
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute(sql, params)
+                cnt = cur.fetchone()[0]
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                cur = conn.cursor()
+                cur.execute(sql, params)
+                cnt = cur.fetchone()[0]
+        return cnt > 0
+
+    def get_downloaded_image_urls(self, lot_id: str) -> List[str]:
+        sql = (
+            "SELECT url FROM images WHERE lot_id = %s AND downloaded = 1" if self.use_postgres else
+            "SELECT url FROM images WHERE lot_id = ? AND downloaded = 1"
+        )
+        params = (lot_id,)
+        if self.use_postgres:
+            with self._pg() as conn, conn.cursor() as cur:
+                cur.execute(sql, params)
+                return [r[0] for r in cur.fetchall()]
+        else:
+            with sqlite3.connect(self.db_path) as conn:
+                cur = conn.cursor()
+                cur.execute(sql, params)
+                return [r[0] for r in cur.fetchall()]
```
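These helpers exist so the scraper and monitor no longer open `sqlite3` connections themselves (see the scraper and monitor hunks below). An illustrative consumer, assuming `cache` is a `CacheManager` instance:

```python
from typing import Any


def print_lot_summary(cache: Any, lot_id: str) -> None:
    """Illustrative only: the real call sites in the scraper/monitor use the same helpers."""
    counts = cache.get_counts()                               # {"auctions": N, "lots": M}
    downloaded = set(cache.get_downloaded_image_urls(lot_id))
    viewing_time, pickup_date = cache.get_lot_times(lot_id)
    print(f"DB: {counts['auctions']} auctions / {counts['lots']} lots")
    print(f"{lot_id}: {len(downloaded)} images downloaded, "
          f"viewing={viewing_time!r}, pickup={pickup_date!r}, "
          f"bid history cached={cache.has_bid_history(lot_id)}")
```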
```diff
@@ -15,6 +15,17 @@ if sys.version_info < (3, 10):
 
 # ==================== CONFIGURATION ====================
 BASE_URL = "https://www.troostwijkauctions.com"
+
+# Primary database: PostgreSQL
+# You can override via environment variable DATABASE_URL
+# Example: postgresql://user:pass@host:5432/dbname
+DATABASE_URL = os.getenv(
+    "DATABASE_URL",
+    # Default provided by ops
+    "postgresql://action:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb",
+).strip()
+
+# Deprecated: legacy SQLite cache path (only used as fallback in dev/tests)
 CACHE_DB = "/mnt/okcomputer/output/cache.db"
 OUTPUT_DIR = "/mnt/okcomputer/output"
 IMAGES_DIR = "/mnt/okcomputer/output/images"
```
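Because the value is read with `os.getenv` at import time, the target database can be switched without editing the file, as long as the environment variable is set before `config` is imported. A minimal sketch (the dev URL shown is made up):

```python
# Sketch: point the scraper at a different database via the environment.
import os

os.environ["DATABASE_URL"] = "postgresql://scraper:devpass@localhost:5432/auctiondb_dev"  # hypothetical

import config  # noqa: E402 — imported after the env var on purpose

print(config.DATABASE_URL)   # -> the overridden URL, not the built-in default
```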
```diff
@@ -7,7 +7,6 @@ Runs indefinitely to keep database current with latest Troostwijk data
 import asyncio
 import time
 from datetime import datetime
-import sqlite3
 import config
 from cache import CacheManager
 from scraper import TroostwijkScraper
@@ -82,21 +81,7 @@ class AuctionMonitor:
 
     def _get_stats(self) -> dict:
         """Get current database statistics"""
-        conn = sqlite3.connect(self.scraper.cache.db_path)
-        cursor = conn.cursor()
-
-        cursor.execute("SELECT COUNT(*) FROM auctions")
-        auction_count = cursor.fetchone()[0]
-
-        cursor.execute("SELECT COUNT(*) FROM lots")
-        lot_count = cursor.fetchone()[0]
-
-        conn.close()
-
-        return {
-            'auctions': auction_count,
-            'lots': lot_count
-        }
+        return self.scraper.cache.get_counts()
 
     async def start(self):
         """Start continuous monitoring loop"""
@@ -106,7 +91,7 @@ class AuctionMonitor:
         if config.OFFLINE:
             print("OFFLINE MODE ENABLED — only database and cache will be used (no network)")
         print(f"Poll interval: {self.poll_interval / 60:.0f} minutes")
-        print(f"Cache database: {config.CACHE_DB}")
+        print(f"Database URL: {self._mask_db_url(config.DATABASE_URL)}")
         print(f"Rate limit: {config.RATE_LIMIT_SECONDS}s between requests")
         print("="*60)
         print("\nPress Ctrl+C to stop\n")
@@ -135,6 +120,21 @@ class AuctionMonitor:
         print(f"Last scan: {self.last_run.strftime('%Y-%m-%d %H:%M:%S')}")
         print("\nDatabase remains intact with all collected data")
 
+    @staticmethod
+    def _mask_db_url(url: str) -> str:
+        try:
+            from urllib.parse import urlparse
+            parsed = urlparse(url)
+            if parsed.username:
+                user = parsed.username
+                host = parsed.hostname or ''
+                port = f":{parsed.port}" if parsed.port else ''
+                db = parsed.path or ''
+                return f"{parsed.scheme}://{user}:***@{host}{port}{db}"
+        except Exception:
+            pass
+        return url
+
 def main():
     """Main entry point for monitor"""
     import sys
```
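With the default `DATABASE_URL` from the config hunk above, the masking behaves like this (standalone reimplementation of the same logic, for illustration only):

```python
from urllib.parse import urlparse


def mask_db_url(url: str) -> str:
    """Same logic as AuctionMonitor._mask_db_url: hide the password, keep the rest."""
    try:
        parsed = urlparse(url)
        if parsed.username:
            host = parsed.hostname or ''
            port = f":{parsed.port}" if parsed.port else ''
            return f"{parsed.scheme}://{parsed.username}:***@{host}{port}{parsed.path or ''}"
    except Exception:
        pass
    return url


print(mask_db_url("postgresql://action:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb"))
# -> postgresql://action:***@192.168.1.159:5432/auctiondb
```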
**src/scraper.py** (176 lines changed)
```diff
@@ -3,7 +3,6 @@
 Core scaev module for Scaev Auctions
 """
 import os
-import sqlite3
 import asyncio
 import time
 import random
@@ -66,13 +65,8 @@ class TroostwijkScraper:
                 content = await response.read()
                 with open(filepath, 'wb') as f:
                     f.write(content)
                 # Record download in DB
-                with sqlite3.connect(self.cache.db_path) as conn:
-                    conn.execute("UPDATE images\n"
-                                 "SET local_path = ?, downloaded = 1\n"
-                                 "WHERE lot_id = ? AND url = ?\n"
-                                 "", (str(filepath), lot_id, url))
-                    conn.commit()
+                self.cache.update_image_local_path(lot_id, url, str(filepath))
                 return str(filepath)
 
         except Exception as e:
```
```diff
@@ -237,71 +231,54 @@ class TroostwijkScraper:
         if not result:
             # OFFLINE fallback: try to construct page data directly from DB
             if self.offline:
-                import sqlite3
-                conn = sqlite3.connect(self.cache.db_path)
-                cur = conn.cursor()
-                # Try lot first
-                cur.execute("SELECT * FROM lots WHERE url = ?", (url,))
-                lot_row = cur.fetchone()
-                if lot_row:
-                    # Build a dict using column names
-                    col_names = [d[0] for d in cur.description]
-                    lot_dict = dict(zip(col_names, lot_row))
-                    conn.close()
-                    page_data = {
-                        'type': 'lot',
-                        'lot_id': lot_dict.get('lot_id'),
-                        'auction_id': lot_dict.get('auction_id'),
-                        'url': lot_dict.get('url') or url,
-                        'title': lot_dict.get('title') or '',
-                        'current_bid': lot_dict.get('current_bid') or '',
-                        'bid_count': lot_dict.get('bid_count') or 0,
-                        'closing_time': lot_dict.get('closing_time') or '',
-                        'viewing_time': lot_dict.get('viewing_time') or '',
-                        'pickup_date': lot_dict.get('pickup_date') or '',
-                        'location': lot_dict.get('location') or '',
-                        'description': lot_dict.get('description') or '',
-                        'category': lot_dict.get('category') or '',
-                        'status': lot_dict.get('status') or '',
-                        'brand': lot_dict.get('brand') or '',
-                        'model': lot_dict.get('model') or '',
-                        'attributes_json': lot_dict.get('attributes_json') or '',
-                        'first_bid_time': lot_dict.get('first_bid_time'),
-                        'last_bid_time': lot_dict.get('last_bid_time'),
-                        'bid_velocity': lot_dict.get('bid_velocity'),
-                        'followers_count': lot_dict.get('followers_count') or 0,
-                        'estimated_min_price': lot_dict.get('estimated_min_price'),
-                        'estimated_max_price': lot_dict.get('estimated_max_price'),
-                        'lot_condition': lot_dict.get('lot_condition') or '',
-                        'appearance': lot_dict.get('appearance') or '',
-                        'scraped_at': lot_dict.get('scraped_at') or '',
-                    }
-                    print(" OFFLINE: using DB record for lot")
-                    self.visited_lots.add(url)
-                    return page_data
-
-                # Try auction by URL
-                cur.execute("SELECT * FROM auctions WHERE url = ?", (url,))
-                auc_row = cur.fetchone()
-                if auc_row:
-                    col_names = [d[0] for d in cur.description]
-                    auc_dict = dict(zip(col_names, auc_row))
-                    conn.close()
-                    page_data = {
-                        'type': 'auction',
-                        'auction_id': auc_dict.get('auction_id'),
-                        'url': auc_dict.get('url') or url,
-                        'title': auc_dict.get('title') or '',
-                        'location': auc_dict.get('location') or '',
-                        'lots_count': auc_dict.get('lots_count') or 0,
-                        'first_lot_closing_time': auc_dict.get('first_lot_closing_time') or '',
-                        'scraped_at': auc_dict.get('scraped_at') or '',
-                    }
-                    print(" OFFLINE: using DB record for auction")
-                    self.visited_lots.add(url)
-                    return page_data
-
-                conn.close()
+                rec = self.cache.get_page_record_by_url(url)
+                if rec:
+                    if rec.get('type') == 'lot':
+                        page_data = {
+                            'type': 'lot',
+                            'lot_id': rec.get('lot_id'),
+                            'auction_id': rec.get('auction_id'),
+                            'url': rec.get('url') or url,
+                            'title': rec.get('title') or '',
+                            'current_bid': rec.get('current_bid') or '',
+                            'bid_count': rec.get('bid_count') or 0,
+                            'closing_time': rec.get('closing_time') or '',
+                            'viewing_time': rec.get('viewing_time') or '',
+                            'pickup_date': rec.get('pickup_date') or '',
+                            'location': rec.get('location') or '',
+                            'description': rec.get('description') or '',
+                            'category': rec.get('category') or '',
+                            'status': rec.get('status') or '',
+                            'brand': rec.get('brand') or '',
+                            'model': rec.get('model') or '',
+                            'attributes_json': rec.get('attributes_json') or '',
+                            'first_bid_time': rec.get('first_bid_time'),
+                            'last_bid_time': rec.get('last_bid_time'),
+                            'bid_velocity': rec.get('bid_velocity'),
+                            'followers_count': rec.get('followers_count') or 0,
+                            'estimated_min_price': rec.get('estimated_min_price'),
+                            'estimated_max_price': rec.get('estimated_max_price'),
+                            'lot_condition': rec.get('lot_condition') or '',
+                            'appearance': rec.get('appearance') or '',
+                            'scraped_at': rec.get('scraped_at') or '',
+                        }
+                        print(" OFFLINE: using DB record for lot")
+                        self.visited_lots.add(url)
+                        return page_data
+                    else:
+                        page_data = {
+                            'type': 'auction',
+                            'auction_id': rec.get('auction_id'),
+                            'url': rec.get('url') or url,
+                            'title': rec.get('title') or '',
+                            'location': rec.get('location') or '',
+                            'lots_count': rec.get('lots_count') or 0,
+                            'first_lot_closing_time': rec.get('first_lot_closing_time') or '',
+                            'scraped_at': rec.get('scraped_at') or '',
+                        }
+                        print(" OFFLINE: using DB record for auction")
+                        self.visited_lots.add(url)
+                        return page_data
             return None
 
         content = result['content']
```
```diff
@@ -363,7 +340,6 @@ class TroostwijkScraper:
         # Fetch all API data concurrently (or use intercepted/cached data)
         lot_id = page_data.get('lot_id')
         auction_id = page_data.get('auction_id')
-        import sqlite3
 
         # Step 1: Check if we intercepted API data during page load
         intercepted_data = None
@@ -396,14 +372,7 @@ class TroostwijkScraper:
             pass
         elif from_cache:
             # Check if we have cached API data in database
-            conn = sqlite3.connect(self.cache.db_path)
-            cursor = conn.cursor()
-            cursor.execute("""
-                SELECT followers_count, estimated_min_price, current_bid, bid_count, closing_time, status
-                FROM lots WHERE lot_id = ?
-            """, (lot_id,))
-            existing = cursor.fetchone()
-            conn.close()
+            existing = self.cache.get_lot_api_fields(lot_id)
 
             # Data quality check: Must have followers_count AND closing_time to be considered "complete"
             # This prevents using stale records like old "0 bids" entries
@@ -469,14 +438,8 @@ class TroostwijkScraper:
 
         # Add auction data fetch if we need viewing/pickup times
         if auction_id:
-            conn = sqlite3.connect(self.cache.db_path)
-            cursor = conn.cursor()
-            cursor.execute("""
-                SELECT viewing_time, pickup_date FROM lots WHERE lot_id = ?
-            """, (lot_id,))
-            times = cursor.fetchone()
-            conn.close()
-            has_times = times and (times[0] or times[1])
+            vt, pd = self.cache.get_lot_times(lot_id)
+            has_times = vt or pd
 
             if not has_times:
                 task_map['auction'] = len(api_tasks)
@@ -671,14 +634,7 @@ class TroostwijkScraper:
                 self.cache.save_bid_history(lot_id, bid_data['bid_records'])
             elif from_cache and page_data.get('bid_count', 0) > 0:
                 # Check if cached bid history exists
-                conn = sqlite3.connect(self.cache.db_path)
-                cursor = conn.cursor()
-                cursor.execute("""
-                    SELECT COUNT(*) FROM bid_history WHERE lot_id = ?
-                """, (lot_id,))
-                has_history = cursor.fetchone()[0] > 0
-                conn.close()
-                if has_history:
+                if self.cache.has_bid_history(lot_id):
                     print(f" Bid history cached")
             else:
                 print(f" Bid: {page_data.get('current_bid', 'N/A')} (from HTML)")
@@ -704,15 +660,7 @@ class TroostwijkScraper:
 
         if self.download_images:
             # Check which images are already downloaded
-            import sqlite3
-            conn = sqlite3.connect(self.cache.db_path)
-            cursor = conn.cursor()
-            cursor.execute("""
-                SELECT url FROM images
-                WHERE lot_id = ? AND downloaded = 1
-            """, (page_data['lot_id'],))
-            already_downloaded = {row[0] for row in cursor.fetchall()}
-            conn.close()
+            already_downloaded = set(self.cache.get_downloaded_image_urls(page_data['lot_id']))
 
             # Only download missing images
             images_to_download = [
```
```diff
@@ -1072,23 +1020,17 @@ class TroostwijkScraper:
 
     def export_to_files(self) -> Dict[str, str]:
         """Export database to CSV/JSON files"""
-        import sqlite3
         import json
         import csv
         from datetime import datetime
 
         timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
-        output_dir = os.path.dirname(self.cache.db_path)
-
-        conn = sqlite3.connect(self.cache.db_path)
-        conn.row_factory = sqlite3.Row
-        cursor = conn.cursor()
+        from config import OUTPUT_DIR as output_dir
 
         files = {}
 
         # Export auctions
-        cursor.execute("SELECT * FROM auctions")
-        auctions = [dict(row) for row in cursor.fetchall()]
+        auctions = self.cache.fetch_all('auctions')
 
         auctions_csv = os.path.join(output_dir, f'auctions_{timestamp}.csv')
         auctions_json = os.path.join(output_dir, f'auctions_{timestamp}.json')
@@ -1107,8 +1049,7 @@ class TroostwijkScraper:
         print(f" Exported {len(auctions)} auctions")
 
         # Export lots
-        cursor.execute("SELECT * FROM lots")
-        lots = [dict(row) for row in cursor.fetchall()]
+        lots = self.cache.fetch_all('lots')
 
         lots_csv = os.path.join(output_dir, f'lots_{timestamp}.csv')
         lots_json = os.path.join(output_dir, f'lots_{timestamp}.json')
@@ -1126,5 +1067,4 @@ class TroostwijkScraper:
         files['lots_json'] = lots_json
         print(f" Exported {len(lots)} lots")
 
-        conn.close()
         return files
```
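Export now goes through `CacheManager.fetch_all()` and always writes into `OUTPUT_DIR`, regardless of backend. A hedged usage sketch — only the `lots_json` key is visible in this hunk; the path shown is an example built from `OUTPUT_DIR` and the timestamp format above:

```python
from typing import Any, Dict


def run_export(scraper: Any) -> Dict[str, str]:
    """Illustrative only: assumes `scraper` is an initialized TroostwijkScraper."""
    files = scraper.export_to_files()
    # e.g. files["lots_json"] -> "/mnt/okcomputer/output/lots_20250101_120000.json"
    return files
```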