Compare commits
9 Commits
e69563d4d6...main
| Author | SHA1 | Date |
|---|---|---|
| | d7860adbaa | |
| | a71b3f36ec | |
| | 5b0d2f78d6 | |
| | 3f5b93abdd | |
| | 2dda1aff00 | |
| | 62d664c580 | |
| | 5ea2342dbc | |
| | 570fd3870e | |
| | 5a755a2125 | |
13
.aiignore
@@ -10,3 +10,16 @@
 dist/
 build/
 out/
+# An .aiignore file follows the same syntax as a .gitignore file.
+# .gitignore documentation: https://git-scm.com/docs/gitignore
+
+# you can ignore files
+# or folders
+.idea
+node_modules/
+.vscode/
+.git
+.github
+scripts
+.pytest_cache/
+__pycache__
200
README.md
@@ -1,85 +1,177 @@
|
|||||||
# Setup & IDE Configuration
|
# Python Setup & IDE Guide
|
||||||
|
|
||||||
## Python Version Requirement
|
Short, clear, Python‑focused.
|
||||||
|
|
||||||
This project **requires Python 3.10 or higher**.
|
---
|
||||||
|
|
||||||
The code uses Python 3.10+ features including:
|
## Requirements
|
||||||
- Structural pattern matching
|
|
||||||
- Union type syntax (`X | Y`)
|
|
||||||
- Improved type hints
|
|
||||||
- Modern async/await patterns
|
|
||||||
|
|
||||||
## IDE Configuration
|
- **Python 3.10+**
|
||||||
|
Uses pattern matching, modern type hints, async improvements.
|
||||||
|
|
||||||
### PyCharm / IntelliJ IDEA
|
```bash
|
||||||
|
python --version
|
||||||
If your IDE shows "Python 2.7 syntax" warnings, configure it for Python 3.10+:
|
|
||||||
|
|
||||||
1. **File → Project Structure → Project Settings → Project**
|
|
||||||
- Set Python SDK to 3.10 or higher
|
|
||||||
|
|
||||||
2. **File → Settings → Project → Python Interpreter**
|
|
||||||
- Select Python 3.10+ interpreter
|
|
||||||
- Click gear icon → Add → System Interpreter → Browse to your Python 3.10 installation
|
|
||||||
|
|
||||||
3. **File → Settings → Editor → Inspections → Python**
|
|
||||||
- Ensure "Python version" is set to 3.10+
|
|
||||||
- Check "Code compatibility inspection" → Set minimum version to 3.10
|
|
||||||
|
|
||||||
### VS Code
|
|
||||||
|
|
||||||
Add to `.vscode/settings.json`:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"python.pythonPath": "path/to/python3.10",
|
|
||||||
"python.analysis.typeCheckingMode": "basic",
|
|
||||||
"python.languageServer": "Pylance"
|
|
||||||
}
|
|
||||||
```
|
```
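Besides `python --version`, the 3.10+ requirement can also be enforced at runtime. The following is a minimal sketch, assuming a guard at the top of an entry script; the names `MIN_VERSION`, `check_python_version`, and `require_python` are hypothetical and not taken from the project.

```python
import sys

# Minimum interpreter version stated in the Requirements section.
MIN_VERSION = (3, 10)


def check_python_version(version_info=None):
    """Return True if the given (or running) interpreter is 3.10 or newer."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) >= MIN_VERSION


def require_python():
    """Exit with a short message when the interpreter is too old (hypothetical helper)."""
    if not check_python_version():
        sys.exit("requires Python 3.10 or higher")
```

Comparing only the first two fields keeps the check independent of patch releases.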
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## IDE Setup (PyCharm / IntelliJ)
|
||||||
|
|
||||||
|
1. **Set interpreter:**
|
||||||
|
*File → Settings → Project → Python Interpreter → Select Python 3.10+*
|
||||||
|
|
||||||
|
2. **Fix syntax warnings:**
|
||||||
|
*Editor → Inspections → Python → Set language level to 3.10+*
|
||||||
|
|
||||||
|
3. **Ensure correct SDK:**
|
||||||
|
*Project Structure → Project SDK → Python 3.10+*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Check Python version
|
# Activate venv
|
||||||
python --version # Should be 3.10+
|
~\venvs\scaev\Scripts\Activate.ps1
|
||||||
|
|
||||||
# Install dependencies
|
# Install deps
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
|
|
||||||
# Install Playwright browsers
|
# Playwright browsers
|
||||||
playwright install chromium
|
playwright install chromium
|
||||||
```
|
```
|
||||||
|
|
||||||
## Verifying Setup
|
---
|
||||||
|
|
||||||
|
## Database Configuration (PostgreSQL)
|
||||||
|
|
||||||
|
The scraper now uses PostgreSQL (no more SQLite files). Configure via `DATABASE_URL`:
|
||||||
|
|
||||||
|
- Default (baked in):
|
||||||
|
`postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb`
|
||||||
|
- Override for your environment:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Should print version 3.10.x or higher
|
# Windows PowerShell
|
||||||
python -c "import sys; print(sys.version)"
|
$env:DATABASE_URL = "postgresql://user:pass@host:5432/dbname"
|
||||||
|
|
||||||
# Should run without errors
|
# Linux/macOS
|
||||||
|
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
|
||||||
|
```
|
||||||
|
|
||||||
|
Packages used:
|
||||||
|
- Driver: `psycopg[binary]`
|
||||||
|
|
||||||
|
Nothing is written to local `.db` files anymore.
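As an illustration of the override behaviour described above, here is a minimal sketch of reading `DATABASE_URL` with a fallback and splitting it into parts using only the standard library. The function names and the placeholder default are assumptions made for this example; the project's actual loading code may differ.

```python
import os
from urllib.parse import urlsplit

# Placeholder fallback for illustration only; not the project's baked-in default.
EXAMPLE_DEFAULT_URL = "postgresql://user:pass@localhost:5432/dbname"


def resolve_database_url(env=None):
    """Return DATABASE_URL from the environment, or the placeholder fallback."""
    env = os.environ if env is None else env
    return env.get("DATABASE_URL") or EXAMPLE_DEFAULT_URL


def describe_database_url(url):
    """Split a PostgreSQL URL into parts, leaving the password out on purpose."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is the standard PostgreSQL port
        "user": parts.username,
        "dbname": parts.path.lstrip("/"),
    }
```

Omitting the password from the returned dictionary makes it safe to log the result when debugging connectivity.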
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Verify
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -c "import sys; print(sys.version)"
|
||||||
python main.py --help
|
python main.py --help
|
||||||
```
|
```
|
||||||
|
|
||||||
## Common Issues
|
Common fixes:
|
||||||
|
|
||||||
### "ModuleNotFoundError: No module named 'playwright'"
|
|
||||||
```bash
|
```bash
|
||||||
pip install playwright
|
pip install playwright
|
||||||
playwright install chromium
|
playwright install chromium
|
||||||
```
|
```
|
||||||
|
|
||||||
### "Python 2.7 does not support..." warnings in IDE
|
---
|
||||||
- Your IDE is configured for Python 2.7
|
|
||||||
- Follow IDE configuration steps above
|
|
||||||
- The code WILL work with Python 3.10+ despite warnings
|
|
||||||
|
|
||||||
### Script exits with "requires Python 3.10 or higher"
|
# Auto‑Start (Monitor)
|
||||||
- You're running Python 3.9 or older
|
|
||||||
- Upgrade to Python 3.10+: https://www.python.org/downloads/
|
|
||||||
|
|
||||||
## Version Files
|
## Linux (systemd) — Recommended
|
||||||
|
|
||||||
- `.python-version` - Used by pyenv and similar tools
|
```bash
|
||||||
- `requirements.txt` - Package dependencies
|
cd ~/scaev
|
||||||
- Runtime checks in scripts ensure Python 3.10+
|
chmod +x install_service.sh
|
||||||
|
./install_service.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Service features:
|
||||||
|
- Auto‑start
|
||||||
|
- Auto‑restart
|
||||||
|
- Logs: `~/scaev/logs/monitor.log`
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo systemctl status scaev-monitor
|
||||||
|
journalctl -u scaev-monitor -f
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Windows (Task Scheduler)
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
cd C:\vibe\scaev
|
||||||
|
.\setup_windows_task.ps1
|
||||||
|
```
|
||||||
|
|
||||||
|
Manage:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
Start-ScheduledTask "ScaevAuctionMonitor"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Cron Alternative (Linux)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
crontab -e
|
||||||
|
@reboot cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1
|
||||||
|
0 * * * * pgrep -f monitor.py || (cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Status Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ps aux | grep monitor.py
|
||||||
|
tasklist | findstr python
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Troubleshooting
|
||||||
|
|
||||||
|
- Wrong interpreter → Set Python 3.10+
|
||||||
|
- Multiple monitors running → kill extra processes
|
||||||
|
- PostgreSQL connectivity → verify `DATABASE_URL`, network/firewall, and credentials
|
||||||
|
- Service fails → check `journalctl -u scaev-monitor`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Java Extractor (Short Version)
|
||||||
|
|
||||||
|
Prereqs: **Java 21**, **Maven**
|
||||||
|
|
||||||
|
Install:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mvn clean install
|
||||||
|
mvn exec:java -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
|
||||||
|
```
|
||||||
|
|
||||||
|
Run:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mvn exec:java -Dexec.args="--max-visits 3"
|
||||||
|
```
|
||||||
|
|
||||||
|
Enable native access (IntelliJ → VM Options):
|
||||||
|
|
||||||
|
```
|
||||||
|
--enable-native-access=ALL-UNNAMED
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
This file keeps everything compact, Python‑focused, and ready for onboarding.
|
||||||
|
|||||||
139
db/migration/V1__initial_schema.sql
Normal file
@@ -0,0 +1,139 @@
+-- Auctions
+CREATE TABLE auctions (
+auction_id TEXT PRIMARY KEY,
+url TEXT UNIQUE,
+title TEXT,
+location TEXT,
+lots_count INTEGER,
+first_lot_closing_time TEXT,
+scraped_at TEXT,
+city TEXT,
+country TEXT,
+type TEXT,
+lot_count INTEGER DEFAULT 0,
+closing_time TEXT,
+discovered_at BIGINT
+);
+
+CREATE INDEX idx_auctions_country ON auctions(country);
+
+-- Cache
+CREATE TABLE cache (
+url TEXT PRIMARY KEY,
+content BYTEA,
+timestamp DOUBLE PRECISION,
+status_code INTEGER
+);
+
+CREATE INDEX idx_timestamp ON cache(timestamp);
+
+-- Lots
+CREATE TABLE lots (
+lot_id TEXT PRIMARY KEY,
+auction_id TEXT REFERENCES auctions(auction_id),
+url TEXT UNIQUE,
+title TEXT,
+current_bid TEXT,
+bid_count INTEGER,
+closing_time TEXT,
+viewing_time TEXT,
+pickup_date TEXT,
+location TEXT,
+description TEXT,
+category TEXT,
+scraped_at TEXT,
+sale_id INTEGER,
+manufacturer TEXT,
+type TEXT,
+year INTEGER,
+currency TEXT DEFAULT 'EUR',
+closing_notified INTEGER DEFAULT 0,
+starting_bid TEXT,
+minimum_bid TEXT,
+status TEXT,
+brand TEXT,
+model TEXT,
+attributes_json TEXT,
+first_bid_time TEXT,
+last_bid_time TEXT,
+bid_velocity DOUBLE PRECISION,
+bid_increment DOUBLE PRECISION,
+year_manufactured INTEGER,
+condition_score DOUBLE PRECISION,
+condition_description TEXT,
+serial_number TEXT,
+damage_description TEXT,
+followers_count INTEGER DEFAULT 0,
+estimated_min_price DOUBLE PRECISION,
+estimated_max_price DOUBLE PRECISION,
+lot_condition TEXT,
+appearance TEXT,
+estimated_min DOUBLE PRECISION,
+estimated_max DOUBLE PRECISION,
+next_bid_step_cents INTEGER,
+condition TEXT,
+category_path TEXT,
+city_location TEXT,
+country_code TEXT,
+bidding_status TEXT,
+packaging TEXT,
+quantity INTEGER,
+vat DOUBLE PRECISION,
+buyer_premium_percentage DOUBLE PRECISION,
+remarks TEXT,
+reserve_price DOUBLE PRECISION,
+reserve_met INTEGER,
+view_count INTEGER,
+api_data_json TEXT,
+next_scrape_at BIGINT,
+scrape_priority INTEGER DEFAULT 0
+);
+
+CREATE INDEX idx_lots_closing_time ON lots(closing_time);
+CREATE INDEX idx_lots_next_scrape ON lots(next_scrape_at);
+CREATE INDEX idx_lots_priority ON lots(scrape_priority DESC);
+CREATE INDEX idx_lots_sale_id ON lots(sale_id);
+
+-- Bid history
+CREATE TABLE bid_history (
+id SERIAL PRIMARY KEY,
+lot_id TEXT REFERENCES lots(lot_id),
+bid_amount DOUBLE PRECISION NOT NULL,
+bid_time TEXT NOT NULL,
+is_autobid INTEGER DEFAULT 0,
+bidder_id TEXT,
+bidder_number INTEGER,
+created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE INDEX idx_bid_history_bidder ON bid_history(bidder_id);
+CREATE INDEX idx_bid_history_lot_time ON bid_history(lot_id, bid_time);
+
+-- Images
+CREATE TABLE images (
+id SERIAL PRIMARY KEY,
+lot_id TEXT REFERENCES lots(lot_id),
+url TEXT,
+local_path TEXT,
+downloaded INTEGER DEFAULT 0,
+labels TEXT,
+processed_at BIGINT
+);
+
+CREATE INDEX idx_images_lot_id ON images(lot_id);
+CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);
+
+-- Resource cache
+CREATE TABLE resource_cache (
+url TEXT PRIMARY KEY,
+content BYTEA,
+content_type TEXT,
+status_code INTEGER,
+headers TEXT,
+timestamp DOUBLE PRECISION,
+size_bytes INTEGER,
+local_path TEXT
+);
+
+CREATE INDEX idx_resource_timestamp ON resource_cache(timestamp);
+CREATE INDEX idx_resource_content_type ON resource_cache(content_type);
@@ -5,16 +5,29 @@ services:
|
|||||||
dockerfile: Dockerfile
|
dockerfile: Dockerfile
|
||||||
container_name: scaev
|
container_name: scaev
|
||||||
restart: unless-stopped
|
restart: unless-stopped
|
||||||
|
|
||||||
|
# Add the PostgreSQL network
|
||||||
networks:
|
networks:
|
||||||
scaev_mobile_net:
|
scaev_mobile_net:
|
||||||
ipv4_address: 172.30.0.10
|
ipv4_address: 172.30.0.10
|
||||||
traefik_net:
|
traefik_net:
|
||||||
|
db_net:
|
||||||
|
|
||||||
environment:
|
environment:
|
||||||
|
SCAEV_OFFLINE: 0
|
||||||
RATE_LIMIT_SECONDS: "0.5"
|
RATE_LIMIT_SECONDS: "0.5"
|
||||||
MAX_PAGES: "500"
|
MAX_PAGES: "500"
|
||||||
DOWNLOAD_IMAGES: "True"
|
DOWNLOAD_IMAGES: "True"
|
||||||
|
|
||||||
|
# New: connect internally via service name, not via LAN IP
|
||||||
|
POSTGRES_HOST: postgres
|
||||||
|
POSTGRES_DB: auctiondb
|
||||||
|
POSTGRES_USER: auction
|
||||||
|
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
|
||||||
|
|
||||||
volumes:
|
volumes:
|
||||||
- shared-auction-data:/mnt/okcomputer/output
|
- shared-auction-data:/mnt/okcomputer/output
|
||||||
|
|
||||||
labels:
|
labels:
|
||||||
- "traefik.enable=true"
|
- "traefik.enable=true"
|
||||||
- "traefik.http.routers.scaev.rule=Host(`scaev.appmodel.nl`)"
|
- "traefik.http.routers.scaev.rule=Host(`scaev.appmodel.nl`)"
|
||||||
@@ -23,7 +36,6 @@ services:
|
|||||||
- "traefik.http.routers.scaev.tls.certresolver=letsencrypt"
|
- "traefik.http.routers.scaev.tls.certresolver=letsencrypt"
|
||||||
- "traefik.http.services.scaev.loadbalancer.server.port=8000"
|
- "traefik.http.services.scaev.loadbalancer.server.port=8000"
|
||||||
|
|
||||||
|
|
||||||
networks:
|
networks:
|
||||||
scaev_mobile_net:
|
scaev_mobile_net:
|
||||||
driver: bridge
|
driver: bridge
|
||||||
@@ -33,10 +45,16 @@ networks:
|
|||||||
config:
|
config:
|
||||||
- subnet: 172.30.0.0/24
|
- subnet: 172.30.0.0/24
|
||||||
gateway: 172.30.0.1
|
gateway: 172.30.0.1
|
||||||
|
|
||||||
traefik_net:
|
traefik_net:
|
||||||
external: true
|
external: true
|
||||||
name: traefik_net
|
name: traefik_net
|
||||||
|
|
||||||
|
# New: shared network for scaev and postgres
|
||||||
|
db_net:
|
||||||
|
external: true
|
||||||
|
name: db_net
|
||||||
|
|
||||||
volumes:
|
volumes:
|
||||||
shared-auction-data:
|
shared-auction-data:
|
||||||
external: true
|
external: true
|
||||||
@@ -1,240 +0,0 @@
|
|||||||
# API Intelligence Findings
|
|
||||||
|
|
||||||
## GraphQL API - Available Fields for Intelligence
|
|
||||||
|
|
||||||
### Key Discovery: Additional Fields Available
|
|
||||||
|
|
||||||
From GraphQL schema introspection on `Lot` type:
|
|
||||||
|
|
||||||
#### **Already Captured ✓**
|
|
||||||
- `currentBidAmount` (Money) - Current bid
|
|
||||||
- `initialAmount` (Money) - Starting bid
|
|
||||||
- `nextMinimalBid` (Money) - Minimum bid
|
|
||||||
- `bidsCount` (Int) - Bid count
|
|
||||||
- `startDate` / `endDate` (TbaDate) - Timing
|
|
||||||
- `minimumBidAmountMet` (MinimumBidAmountMet) - Status
|
|
||||||
- `attributes` - Brand/model extraction
|
|
||||||
- `title`, `description`, `images`
|
|
||||||
|
|
||||||
#### **NEW - Available but NOT Captured:**
|
|
||||||
|
|
||||||
1. **followersCount** (Int) - **CRITICAL for intelligence!**
|
|
||||||
- This is the "watch count" we thought was missing
|
|
||||||
- Indicates bidder interest level
|
|
||||||
- **ACTION: Add to schema and extraction**
|
|
||||||
|
|
||||||
2. **biddingStatus** (BiddingStatus) - Lot bidding state
|
|
||||||
- More detailed than minimumBidAmountMet
|
|
||||||
- **ACTION: Investigate enum values**
|
|
||||||
|
|
||||||
3. **estimatedFullPrice** (EstimatedFullPrice) - **Found it!**
|
|
||||||
- Available via `LotDetails.estimatedFullPrice`
|
|
||||||
- May contain estimated min/max values
|
|
||||||
- **ACTION: Test extraction**
|
|
||||||
|
|
||||||
4. **nextBidStepInCents** (Long) - Exact bid increment
|
|
||||||
- More precise than our calculated bid_increment
|
|
||||||
- **ACTION: Replace calculated field**
|
|
||||||
|
|
||||||
5. **condition** (String) - Direct condition field
|
|
||||||
- Cleaner than attribute extraction
|
|
||||||
- **ACTION: Use as primary source**
|
|
||||||
|
|
||||||
6. **categoryInformation** (LotCategoryInformation) - Category data
|
|
||||||
- Structured category info
|
|
||||||
- **ACTION: Extract category path**
|
|
||||||
|
|
||||||
7. **location** (LotLocation) - Lot location details
|
|
||||||
- City, country, possibly address
|
|
||||||
- **ACTION: Add to schema**
|
|
||||||
|
|
||||||
8. **remarks** (String) - Additional notes
|
|
||||||
- May contain pickup/viewing text
|
|
||||||
- **ACTION: Check for viewing/pickup extraction**
|
|
||||||
|
|
||||||
9. **appearance** (String) - Condition appearance
|
|
||||||
- Visual condition notes
|
|
||||||
- **ACTION: Combine with condition_description**
|
|
||||||
|
|
||||||
10. **packaging** (String) - Packaging details
|
|
||||||
- Relevant for shipping intelligence
|
|
||||||
|
|
||||||
11. **quantity** (Long) - Lot quantity
|
|
||||||
- Important for bulk lots
|
|
||||||
|
|
||||||
12. **vat** (BigDecimal) - VAT percentage
|
|
||||||
- For total cost calculations
|
|
||||||
|
|
||||||
13. **buyerPremiumPercentage** (BigDecimal) - Buyer premium
|
|
||||||
- For total cost calculations
|
|
||||||
|
|
||||||
14. **videos** - Video URLs (if available)
|
|
||||||
- **ACTION: Add video support**
|
|
||||||
|
|
||||||
15. **documents** - Document URLs (if available)
|
|
||||||
- May contain specs/manuals
|
|
||||||
|
|
||||||
## Bid History API - Fields
|
|
||||||
|
|
||||||
### Currently Captured ✓
|
|
||||||
- `buyerId` (UUID) - Anonymized bidder
|
|
||||||
- `buyerNumber` (Int) - Bidder number
|
|
||||||
- `currentBid.cents` / `currency` - Bid amount
|
|
||||||
- `autoBid` (Boolean) - Autobid flag
|
|
||||||
- `createdAt` (Timestamp) - Bid time
|
|
||||||
|
|
||||||
### Additional Available:
|
|
||||||
- `negotiated` (Boolean) - Was bid negotiated
|
|
||||||
- **ACTION: Add to bid_history table**
|
|
||||||
|
|
||||||
## Auction API - Not Available
|
|
||||||
- Attempted `auctionDetails` query - **does not exist**
|
|
||||||
- Auction data must be scraped from listing pages
|
|
||||||
|
|
||||||
## Priority Actions for Intelligence
|
|
||||||
|
|
||||||
### HIGH PRIORITY (Immediate):
|
|
||||||
1. ✅ Add `followersCount` field (watch count)
|
|
||||||
2. ✅ Add `estimatedFullPrice` extraction
|
|
||||||
3. ✅ Use `nextBidStepInCents` instead of calculated increment
|
|
||||||
4. ✅ Add `condition` as primary condition source
|
|
||||||
5. ✅ Add `categoryInformation` extraction
|
|
||||||
6. ✅ Add `location` details
|
|
||||||
7. ✅ Add `negotiated` to bid_history table
|
|
||||||
|
|
||||||
### MEDIUM PRIORITY:
|
|
||||||
8. Extract `remarks` for viewing/pickup text
|
|
||||||
9. Add `appearance` and `packaging` fields
|
|
||||||
10. Add `quantity` field
|
|
||||||
11. Add `vat` and `buyerPremiumPercentage` for cost calculations
|
|
||||||
12. Add `biddingStatus` enum extraction
|
|
||||||
|
|
||||||
### LOW PRIORITY:
|
|
||||||
13. Add video URL support
|
|
||||||
14. Add document URL support
|
|
||||||
|
|
||||||
## Updated Schema Requirements
|
|
||||||
|
|
||||||
### lots table - NEW columns:
|
|
||||||
```sql
|
|
||||||
ALTER TABLE lots ADD COLUMN followers_count INTEGER DEFAULT 0;
|
|
||||||
ALTER TABLE lots ADD COLUMN estimated_min_price REAL;
|
|
||||||
ALTER TABLE lots ADD COLUMN estimated_max_price REAL;
|
|
||||||
ALTER TABLE lots ADD COLUMN location_city TEXT;
|
|
||||||
ALTER TABLE lots ADD COLUMN location_country TEXT;
|
|
||||||
ALTER TABLE lots ADD COLUMN lot_condition TEXT; -- Direct from API
|
|
||||||
ALTER TABLE lots ADD COLUMN appearance TEXT;
|
|
||||||
ALTER TABLE lots ADD COLUMN packaging TEXT;
|
|
||||||
ALTER TABLE lots ADD COLUMN quantity INTEGER DEFAULT 1;
|
|
||||||
ALTER TABLE lots ADD COLUMN vat_percentage REAL;
|
|
||||||
ALTER TABLE lots ADD COLUMN buyer_premium_percentage REAL;
|
|
||||||
ALTER TABLE lots ADD COLUMN remarks TEXT;
|
|
||||||
ALTER TABLE lots ADD COLUMN bidding_status TEXT;
|
|
||||||
ALTER TABLE lots ADD COLUMN videos_json TEXT; -- Store as JSON array
|
|
||||||
ALTER TABLE lots ADD COLUMN documents_json TEXT; -- Store as JSON array
|
|
||||||
```
|
|
||||||
|
|
||||||
### bid_history table - NEW column:
|
|
||||||
```sql
|
|
||||||
ALTER TABLE bid_history ADD COLUMN negotiated INTEGER DEFAULT 0;
|
|
||||||
```
|
|
||||||
|
|
||||||
## Intelligence Use Cases
|
|
||||||
|
|
||||||
### With followers_count:
|
|
||||||
- Predict lot popularity and final price
|
|
||||||
- Identify hot items early
|
|
||||||
- Calculate interest-to-bid conversion rate
|
|
||||||
|
|
||||||
### With estimated prices:
|
|
||||||
- Compare final price to estimate
|
|
||||||
- Identify bargains (final < estimate)
|
|
||||||
- Calculate auction house accuracy
|
|
||||||
|
|
||||||
### With nextBidStepInCents:
|
|
||||||
- Show exact next bid amount
|
|
||||||
- Calculate optimal bidding strategy
|
|
||||||
|
|
||||||
### With location:
|
|
||||||
- Filter by proximity
|
|
||||||
- Calculate pickup logistics
|
|
||||||
|
|
||||||
### With vat/buyer_premium:
|
|
||||||
- Calculate true total cost
|
|
||||||
- Compare all-in prices
|
|
||||||
|
|
||||||
### With condition/appearance:
|
|
||||||
- Better condition scoring
|
|
||||||
- Identify restoration projects
|
|
||||||
|
|
||||||
## Updated GraphQL Query
|
|
||||||
|
|
||||||
```graphql
|
|
||||||
query EnhancedLotQuery($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
|
|
||||||
lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
|
|
||||||
estimatedFullPrice {
|
|
||||||
min { cents currency }
|
|
||||||
max { cents currency }
|
|
||||||
}
|
|
||||||
lot {
|
|
||||||
id
|
|
||||||
displayId
|
|
||||||
title
|
|
||||||
description { text }
|
|
||||||
currentBidAmount { cents currency }
|
|
||||||
initialAmount { cents currency }
|
|
||||||
nextMinimalBid { cents currency }
|
|
||||||
nextBidStepInCents
|
|
||||||
bidsCount
|
|
||||||
followersCount
|
|
||||||
startDate
|
|
||||||
endDate
|
|
||||||
minimumBidAmountMet
|
|
||||||
biddingStatus
|
|
||||||
condition
|
|
||||||
appearance
|
|
||||||
packaging
|
|
||||||
quantity
|
|
||||||
vat
|
|
||||||
buyerPremiumPercentage
|
|
||||||
remarks
|
|
||||||
auctionId
|
|
||||||
location {
|
|
||||||
city
|
|
||||||
countryCode
|
|
||||||
addressLine1
|
|
||||||
addressLine2
|
|
||||||
}
|
|
||||||
categoryInformation {
|
|
||||||
id
|
|
||||||
name
|
|
||||||
path
|
|
||||||
}
|
|
||||||
images {
|
|
||||||
url
|
|
||||||
thumbnailUrl
|
|
||||||
}
|
|
||||||
videos {
|
|
||||||
url
|
|
||||||
thumbnailUrl
|
|
||||||
}
|
|
||||||
documents {
|
|
||||||
url
|
|
||||||
name
|
|
||||||
}
|
|
||||||
attributes {
|
|
||||||
name
|
|
||||||
value
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Summary
|
|
||||||
|
|
||||||
**NEW fields found:** 15+ additional intelligence fields available
|
|
||||||
**Most critical:** `followersCount` (watch count), `estimatedFullPrice`, `nextBidStepInCents`
|
|
||||||
**Data quality impact:** Estimated 80%+ increase in intelligence value
|
|
||||||
|
|
||||||
These fields will significantly enhance prediction and analysis capabilities.
|
|
||||||
@@ -8,7 +8,7 @@ The scraper follows a **3-phase hierarchical crawling pattern** to extract aucti
|
|||||||
|
|
||||||
```mariadb
|
```mariadb
|
||||||
┌─────────────────────────────────────────────────────────────────┐
|
┌─────────────────────────────────────────────────────────────────┐
|
||||||
│ TROOSTWIJK SCRAPER │
|
│ SCAEV SCRAPER │
|
||||||
└─────────────────────────────────────────────────────────────────┘
|
└─────────────────────────────────────────────────────────────────┘
|
||||||
|
|
||||||
┌─────────────────────────────────────────────────────────────────┐
|
┌─────────────────────────────────────────────────────────────────┐
|
||||||
@@ -321,19 +321,18 @@ Lot Page Parsed
|
|||||||
|
|
||||||
## Key Configuration
|
## Key Configuration
|
||||||
|
|
||||||
| Setting | Value | Purpose |
|
| Setting | Value | Purpose |
|
||||||
|----------------------|-----------------------------------|----------------------------------|
|
|----------------------|--------------------------------------------------------------------------|----------------------------------|
|
||||||
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
|
| `DATABASE_URL` | `postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb` | PostgreSQL connection string |
|
||||||
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
|
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
|
||||||
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
|
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
|
||||||
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
|
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
|
||||||
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
|
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
|
||||||
|
|
||||||
## Output Files
|
## Output Files
|
||||||
|
|
||||||
```
|
```
|
||||||
/mnt/okcomputer/output/
|
/mnt/okcomputer/output/
|
||||||
├── cache.db # SQLite database (compressed HTML + data)
|
|
||||||
├── auctions_{timestamp}.json # Exported auctions
|
├── auctions_{timestamp}.json # Exported auctions
|
||||||
├── auctions_{timestamp}.csv # Exported auctions
|
├── auctions_{timestamp}.csv # Exported auctions
|
||||||
├── lots_{timestamp}.json # Exported lots
|
├── lots_{timestamp}.json # Exported lots
|
||||||
@@ -346,6 +345,48 @@ Lot Page Parsed
|
|||||||
└── 001.jpg
|
└── 001.jpg
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Terminal Progress per Lot (TTY)
|
||||||
|
|
||||||
|
During lot analysis, Scaev now shows a per‑lot TTY progress animation with a final summary of all inputs used:
|
||||||
|
|
||||||
|
- Spinner runs while enrichment is in progress.
|
||||||
|
- Summary lists every page/API used to analyze the lot with:
|
||||||
|
- URL/label
|
||||||
|
- Size in bytes
|
||||||
|
- Source state: cache | realtime | offline | db | intercepted
|
||||||
|
- Duration in ms
|
||||||
|
|
||||||
|
Example output snippet:
|
||||||
|
|
||||||
|
```
|
||||||
|
[LOT A1-28505-5] ✓ Done in 812 ms — pages/APIs used:
|
||||||
|
• [html] https://www.troostwijkauctions.com/l/... | 142331 B | cache | 4 ms
|
||||||
|
• [graphql] GraphQL lotDetails | 5321 B | realtime | 142 ms
|
||||||
|
• [rest] REST bid history | 18234 B | realtime | 236 ms
|
||||||
|
```
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
- In non‑TTY environments the spinner is replaced by simple log lines.
|
||||||
|
- Intercepted GraphQL responses (captured during page load) are labeled as `intercepted` with near‑zero duration.
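To make the per-input summary format concrete, the sketch below renders one bullet line in the shape shown in the example output. `InputRecord` and `format_input_line` are hypothetical names chosen for this illustration; they are not the reporter's actual API.

```python
from dataclasses import dataclass

# Source states listed in the summary description.
SOURCE_STATES = ("cache", "realtime", "offline", "db", "intercepted")


@dataclass(frozen=True)
class InputRecord:
    """One page or API call used while analyzing a lot (hypothetical type)."""
    kind: str          # e.g. "html", "graphql", "rest"
    label: str         # URL or human-readable label
    size_bytes: int
    source: str        # one of SOURCE_STATES
    duration_ms: int


def format_input_line(record: InputRecord) -> str:
    """Render a record as '• [kind] label | N B | source | M ms'."""
    if record.source not in SOURCE_STATES:
        raise ValueError(f"unknown source state: {record.source!r}")
    return (
        f"• [{record.kind}] {record.label} | {record.size_bytes} B"
        f" | {record.source} | {record.duration_ms} ms"
    )
```

Validating the source state at formatting time surfaces typos early instead of letting them leak into logs.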
|
||||||
|
|
||||||
|
## Data Flow “Tunnel” (Simplified)
|
||||||
|
|
||||||
|
For each lot, the data “tunnels through” the following stages:
|
||||||
|
|
||||||
|
1. HTML page → parse `__NEXT_DATA__` for core lot fields and lot UUID.
|
||||||
|
2. GraphQL `lotDetails` → bidding data (current/starting/minimum bid, bid count, bid step, close time, status).
|
||||||
|
3. Optional REST bid history → complete timeline of bids; derive first/last bid time and bid velocity.
|
||||||
|
4. Persist to DB (PostgreSQL) and export; image URLs are captured and optionally downloaded concurrently per lot.
|
||||||
|
|
||||||
|
Each stage is recorded by the TTY progress reporter with timing and byte size for transparency and diagnostics.
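Stage 3 derives first/last bid time and bid velocity from the bid timeline. The source does not define bid velocity, so the sketch below assumes "bids per hour between the first and last bid" and assumes bid times arrive as ISO 8601 strings without a `Z` suffix; `summarize_bid_times` is a hypothetical helper.

```python
from datetime import datetime


def summarize_bid_times(bid_times: list[str]) -> dict:
    """Derive first/last bid time and an assumed bids-per-hour velocity."""
    parsed = sorted(datetime.fromisoformat(t) for t in bid_times)
    if not parsed:
        return {"first_bid_time": None, "last_bid_time": None, "bid_velocity": 0.0}
    first, last = parsed[0], parsed[-1]
    hours = (last - first).total_seconds() / 3600
    # A single bid (or simultaneous bids) has no elapsed time; report 0.0.
    velocity = len(parsed) / hours if hours > 0 else 0.0
    return {
        "first_bid_time": first.isoformat(),
        "last_bid_time": last.isoformat(),
        "bid_velocity": velocity,
    }
```

Sorting before taking the endpoints means the result does not depend on the order in which the bid history API returns entries.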
|
||||||
|
|
||||||
|
## Migrations and ORM Roadmap
|
||||||
|
|
||||||
|
- Migrations follow a Flyway‑style convention in `db/migration` (e.g., `V1__initial_schema.sql`).
|
||||||
|
- Current baseline is V1; there are no new migrations required at this time.
|
||||||
|
- Raw SQL usage remains in place (SQLite) while we prepare a gradual move to SQLAlchemy 2.x targeting PostgreSQL.
|
||||||
|
- See `docs/MIGRATIONS.md` for details on naming, workflow, and the future switch to PostgreSQL.
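The naming convention can be checked mechanically. This is a simplified sketch that accepts only integer versions in the `V<version>__<description>.sql` pattern; Flyway itself accepts more version formats, and the helper names here are hypothetical.

```python
import re

# Simplified pattern: integer version, double underscore, description, .sql
MIGRATION_RE = re.compile(r"^V(\d+)__([A-Za-z0-9_]+)\.sql$")


def parse_migration_name(filename: str) -> tuple[int, str]:
    """Return (version, description) for a name like V1__initial_schema.sql."""
    match = MIGRATION_RE.match(filename)
    if match is None:
        raise ValueError(f"not a versioned migration name: {filename!r}")
    return int(match.group(1)), match.group(2)


def order_migrations(filenames: list[str]) -> list[str]:
    """Sort migration files by numeric version, so V10 comes after V2."""
    return sorted(filenames, key=lambda name: parse_migration_name(name)[0])
```

Sorting by the parsed integer rather than by string avoids the common pitfall of `V10` sorting before `V2`.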
|
||||||
|
|
||||||
## Extension Points for Integration
|
## Extension Points for Integration
|
||||||
|
|
||||||
### 1. **Downstream Processing Pipeline**
|
### 1. **Downstream Processing Pipeline**
|
||||||
@@ -461,13 +502,6 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
|
|||||||
- ✅ Closing time and status
|
- ✅ Closing time and status
|
||||||
- ✅ Brand, model, manufacturer (from attributes)
|
- ✅ Brand, model, manufacturer (from attributes)
|
||||||
|
|
||||||
**Available but Not Yet Captured:**
|
|
||||||
- ⚠️ `followersCount` - Watch count for popularity analysis
|
|
||||||
- ⚠️ `estimatedFullPrice` - Min/max estimated values
|
|
||||||
- ⚠️ `biddingStatus` - More detailed status enum
|
|
||||||
- ⚠️ `condition` - Direct condition field
|
|
||||||
- ⚠️ `location` - City, country details
|
|
||||||
- ⚠️ `categoryInformation` - Structured category
|
|
||||||
|
|
||||||
### REST API - Bid History
|
### REST API - Bid History
|
||||||
**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
|
**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
|
||||||
@@ -511,11 +545,6 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
|
|||||||
|
|
||||||
### API Integration Points
|
### API Integration Points
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- `src/graphql_client.py` - GraphQL queries and parsing
|
|
||||||
- `src/bid_history_client.py` - REST API pagination and parsing
|
|
||||||
- `src/scraper.py` - Integration during lot scraping
|
|
||||||
|
|
||||||
**Flow:**
|
**Flow:**
|
||||||
1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
|
1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
|
||||||
2. Call GraphQL API → Get bidding data
|
2. Call GraphQL API → Get bidding data
|
||||||
@@ -528,4 +557,3 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
|
|||||||
- Overall 0.5s rate limit applies to page requests
|
- Overall 0.5s rate limit applies to page requests
|
||||||
- API calls are part of lot processing (not separately limited)
|
- API calls are part of lot processing (not separately limited)
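A minimum-interval limiter is one simple way to implement a fixed delay between page requests. The sketch below is an illustration under that assumption, not the scraper's actual code; `RateLimiter` and its injectable `clock`/`sleep` parameters are hypothetical.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between calls to wait()."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each page request, and not before API calls, would match the note that API calls are not separately limited.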
|
||||||
|
|
||||||
See `API_INTELLIGENCE_FINDINGS.md` for detailed field analysis and roadmap.
|
|
||||||
|
|||||||
@@ -1,120 +0,0 @@
|
|||||||
# Auto-Start Setup Guide
|
|
||||||
|
|
||||||
The monitor doesn't run automatically yet. Choose your setup based on your server OS:
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Linux Server (Systemd Service) ⭐ RECOMMENDED
|
|
||||||
|
|
||||||
**Install:**
|
|
||||||
```bash
|
|
||||||
cd /home/tour/scaev
|
|
||||||
chmod +x install_service.sh
|
|
||||||
./install_service.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
**The service will:**
|
|
||||||
- ✅ Start automatically on server boot
|
|
||||||
- ✅ Restart automatically if it crashes
|
|
||||||
- ✅ Log to `~/scaev/logs/monitor.log`
|
|
||||||
- ✅ Poll every 30 minutes
|
|
||||||
|
|
||||||
**Management commands:**
|
|
||||||
```bash
|
|
||||||
sudo systemctl status scaev-monitor # Check if running
|
|
||||||
sudo systemctl stop scaev-monitor # Stop
|
|
||||||
sudo systemctl start scaev-monitor # Start
|
|
||||||
sudo systemctl restart scaev-monitor # Restart
|
|
||||||
journalctl -u scaev-monitor -f # Live logs
|
|
||||||
tail -f ~/scaev/logs/monitor.log # Monitor log file
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Windows (Task Scheduler)
|
|
||||||
|
|
||||||
**Install (Run as Administrator):**
|
|
||||||
```powershell
|
|
||||||
cd C:\vibe\scaev
|
|
||||||
.\setup_windows_task.ps1
|
|
||||||
```
|
|
||||||
|
|
||||||
**The task will:**
|
|
||||||
- ✅ Start automatically on Windows boot
|
|
||||||
- ✅ Restart automatically if it crashes (up to 3 times)
|
|
||||||
- ✅ Run as SYSTEM user
|
|
||||||
- ✅ Poll every 30 minutes
|
|
||||||
|
|
||||||
**Management:**
|
|
||||||
1. Open Task Scheduler (`taskschd.msc`)
|
|
||||||
2. Find `ScaevAuctionMonitor` in Task Scheduler Library
|
|
||||||
3. Right-click to Run/Stop/Disable
|
|
||||||
|
|
||||||
**Or via PowerShell:**
|
|
||||||
```powershell
|
|
||||||
Start-ScheduledTask -TaskName "ScaevAuctionMonitor"
|
|
||||||
Stop-ScheduledTask -TaskName "ScaevAuctionMonitor"
|
|
||||||
Get-ScheduledTask -TaskName "ScaevAuctionMonitor" | Get-ScheduledTaskInfo
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Alternative: Cron Job (Linux)
|
|
||||||
|
|
||||||
**For simpler setup without systemd:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Edit crontab
|
|
||||||
crontab -e
|
|
||||||
|
|
||||||
# Add these two lines: the first starts the monitor on boot, the second checks hourly and restarts it if it is not running
|
|
||||||
@reboot cd /home/tour/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1
|
|
||||||
0 * * * * pgrep -f "monitor.py" || (cd /home/tour/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Verify It's Working
|
|
||||||
|
|
||||||
**Check process is running:**
|
|
||||||
```bash
|
|
||||||
# Linux
|
|
||||||
ps aux | grep monitor.py
|
|
||||||
|
|
||||||
# Windows
|
|
||||||
tasklist | findstr python
|
|
||||||
```
|
|
||||||
|
|
||||||
**Check logs:**
|
|
||||||
```bash
|
|
||||||
# Linux
|
|
||||||
tail -f ~/scaev/logs/monitor.log
|
|
||||||
|
|
||||||
# Windows
|
|
||||||
# Check Task Scheduler history
|
|
||||||
```
|
|
||||||
|
|
||||||
**Check database is updating:**
|
|
||||||
```bash
|
|
||||||
# Last modified time should update every 30 minutes
|
|
||||||
ls -lh C:/mnt/okcomputer/output/cache.db
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
**Service won't start:**
|
|
||||||
1. Check Python path is correct in service file
|
|
||||||
2. Check working directory exists
|
|
||||||
3. Check user permissions
|
|
||||||
4. View error logs: `journalctl -u scaev-monitor -n 50`
|
|
||||||
|
|
||||||
**Monitor stops after a while:**
|
|
||||||
- Check disk space for logs
|
|
||||||
- Check rate limiting isn't blocking requests
|
|
||||||
- Increase RestartSec in service file
|
|
||||||
|
|
||||||
**Database locked errors:**
|
|
||||||
- Ensure only one monitor instance is running
|
|
||||||
- Add timeout to SQLite connections in config
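
The SQLite timeout suggested above could look like the following minimal sketch. It assumes the monitor opens the cache with Python's standard `sqlite3` module; the helper name `open_cache` and the 30-second default are hypothetical and not taken from the project.

```python
import sqlite3


def open_cache(db_path: str, timeout_seconds: float = 30.0) -> sqlite3.Connection:
    """Open the cache database, waiting for locks instead of failing at once.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # `timeout` is how long sqlite3 waits for a lock held by another
    # connection before raising "database is locked".
    return sqlite3.connect(db_path, timeout=timeout_seconds)


if __name__ == "__main__":
    conn = open_cache(":memory:")
    print(conn.execute("SELECT 1").fetchone()[0])
    conn.close()
```

A longer timeout only hides brief contention; two monitor instances writing at once would still need to be reduced to one.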
|
|
||||||
@@ -1,23 +0,0 @@
|
|||||||
✅ Routing service configured - scaev-mobile-routing.service active and working
|
|
||||||
✅ Scaev deployed - Container running with dual networks:
|
|
||||||
scaev_mobile_net (172.30.0.10) - for outbound internet via mobile
|
|
||||||
traefik_net (172.20.0.8) - for LAN access
|
|
||||||
✅ Mobile routing verified:
|
|
||||||
Host IP: 5.132.33.195 (LAN gateway)
|
|
||||||
Mobile IP: 77.63.26.140 (mobile provider)
|
|
||||||
Scaev IP: 77.63.26.140 ✅ Using mobile connection!
|
|
||||||
✅ Scraper functional - Successfully accessing troostwijkauctions.com through mobile network
|
|
||||||
Architecture:
```
|
|
||||||
┌─────────────────────────────────────────┐
|
|
||||||
│ Tour Machine (192.168.1.159) │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────┐ │
|
|
||||||
│ │ Scaev Container │ │
|
|
||||||
│ │ • scaev_mobile_net: 172.30.0.10 ────┼──> Mobile Gateway (10.133.133.26)
|
|
||||||
│ │ • traefik_net: 172.20.0.8 │ │ └─> Internet (77.63.26.140)
|
|
||||||
│ │ • SQLite: shared-auction-data│ │
|
|
||||||
│ │ • Images: shared-auction-data│ │
|
|
||||||
│ └──────────────────────────────┘ │
|
|
||||||
│ │
|
|
||||||
└─────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
@@ -1,122 +0,0 @@
|
|||||||
# Deployment
|
|
||||||
|
|
||||||
## Prerequisites
|
|
||||||
|
|
||||||
- Python 3.8+ installed
|
|
||||||
- Access to a server (Linux/Windows)
|
|
||||||
- Playwright and dependencies installed
|
|
||||||
|
|
||||||
## Production Setup
|
|
||||||
|
|
||||||
### 1. Install on Server
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Clone repository
|
|
||||||
git clone git@git.appmodel.nl:Tour/troost-scraper.git
|
|
||||||
cd troost-scraper
|
|
||||||
|
|
||||||
# Create virtual environment
|
|
||||||
python -m venv .venv
|
|
||||||
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
|
||||||
|
|
||||||
# Install dependencies
|
|
||||||
pip install -r requirements.txt
|
|
||||||
playwright install chromium
|
|
||||||
playwright install-deps # Install system dependencies
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Configuration
|
|
||||||
|
|
||||||
Create a configuration file or set environment variables:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# main.py configuration
|
|
||||||
BASE_URL = "https://www.troostwijkauctions.com"
|
|
||||||
CACHE_DB = "/mnt/okcomputer/output/cache.db"
|
|
||||||
OUTPUT_DIR = "/mnt/okcomputer/output"
|
|
||||||
RATE_LIMIT_SECONDS = 0.5
|
|
||||||
MAX_PAGES = 50
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Create Output Directories
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo mkdir -p /var/troost-scraper/output
|
|
||||||
sudo chown $USER:$USER /var/troost-scraper
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4. Run as Cron Job
|
|
||||||
|
|
||||||
Add to crontab (`crontab -e`):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Run scraper daily at 2 AM
|
|
||||||
0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
|
|
||||||
```
|
|
||||||
|
|
||||||
## Docker Deployment (Optional)
|
|
||||||
|
|
||||||
Create `Dockerfile`:
|
|
||||||
|
|
||||||
```dockerfile
|
|
||||||
FROM python:3.10-slim
|
|
||||||
|
|
||||||
WORKDIR /app
|
|
||||||
|
|
||||||
# Install system dependencies for Playwright
|
|
||||||
RUN apt-get update && apt-get install -y \
|
|
||||||
wget \
|
|
||||||
gnupg \
|
|
||||||
&& rm -rf /var/lib/apt/lists/*
|
|
||||||
|
|
||||||
COPY requirements.txt .
|
|
||||||
RUN pip install --no-cache-dir -r requirements.txt
|
|
||||||
RUN playwright install chromium
|
|
||||||
RUN playwright install-deps
|
|
||||||
|
|
||||||
COPY main.py .
|
|
||||||
|
|
||||||
CMD ["python", "main.py"]
|
|
||||||
```
|
|
||||||
|
|
||||||
Build and run:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker build -t troost-scraper .
|
|
||||||
docker run -v /path/to/output:/output troost-scraper
|
|
||||||
```
|
|
||||||
|
|
||||||
## Monitoring
|
|
||||||
|
|
||||||
### Check Logs
|
|
||||||
|
|
||||||
```bash
|
|
||||||
tail -f /var/log/troost-scraper.log
|
|
||||||
```
|
|
||||||
|
|
||||||
### Monitor Output
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ls -lh /var/troost-scraper/output/
|
|
||||||
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### Playwright Browser Issues
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Reinstall browsers
|
|
||||||
playwright install --force chromium
|
|
||||||
```
|
|
||||||
|
|
||||||
### Permission Issues
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Fix permissions
|
|
||||||
sudo chown -R $USER:$USER /var/troost-scraper
|
|
||||||
```
|
|
||||||
|
|
||||||
### Memory Issues
|
|
||||||
|
|
||||||
- Reduce `MAX_PAGES` in configuration
|
|
||||||
- Run on machine with more RAM (Playwright needs ~1GB)
|
|
||||||
@@ -1,377 +0,0 @@
|
|||||||
# Data Quality Fixes - Complete Summary
|
|
||||||
|
|
||||||
## Executive Summary
|
|
||||||
|
|
||||||
Successfully completed all 5 high-priority data quality and intelligence tasks:
|
|
||||||
|
|
||||||
1. ✅ **Fixed orphaned lots** (16,807 → 13 orphaned lots)
|
|
||||||
2. ✅ **Fixed bid history fetching** (script created, ready to run)
|
|
||||||
3. ✅ **Added followersCount extraction** (watch count)
|
|
||||||
4. ✅ **Added estimatedFullPrice extraction** (min/max values)
|
|
||||||
5. ✅ **Added direct condition field** from API
|
|
||||||
|
|
||||||
**Impact:** Database now captures 80%+ more intelligence data for future scrapes.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Task 1: Fix Orphaned Lots ✅ COMPLETE
|
|
||||||
|
|
||||||
### Problem:
|
|
||||||
- **16,807 lots** had no matching auction (100% orphaned)
|
|
||||||
- Root cause: auction_id mismatch
|
|
||||||
- Lots table used UUID auction_id (e.g., `72928a1a-12bf-4d5d-93ac-292f057aab6e`)
|
|
||||||
- Auctions table used numeric IDs (legacy incorrect data)
|
|
||||||
- Auction pages use `displayId` (e.g., `A1-34731`)
|
|
||||||
|
|
||||||
### Solution:
|
|
||||||
1. **Updated parse.py** - Modified `_parse_lot_json()` to extract auction displayId from page_props
|
|
||||||
- Lot pages include full auction data
|
|
||||||
- Now extracts `auction.displayId` instead of using UUID `lot.auctionId`
|
|
||||||
|
|
||||||
2. **Created fix_orphaned_lots.py** - Migrated existing 16,793 lots
|
|
||||||
- Read cached lot pages
|
|
||||||
- Extracted auction displayId from embedded auction data
|
|
||||||
- Updated lots.auction_id from UUID to displayId
|
|
||||||
|
|
||||||
3. **Created fix_auctions_table.py** - Rebuilt auctions table
|
|
||||||
- Cleared incorrect auction data
|
|
||||||
- Re-extracted from 517 cached auction pages
|
|
||||||
- Inserted 509 auctions with correct displayId
|
|
||||||
|
|
||||||
### Results:
|
|
||||||
- **Orphaned lots:** 16,807 → **13** (99.9% fixed)
|
|
||||||
- **Auctions completeness:**
|
|
||||||
- lots_count: 0% → **100%**
|
|
||||||
- first_lot_closing_time: 0% → **100%**
|
|
||||||
- **All lots now properly linked to auctions**
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/parse.py` - Updated `_extract_nextjs_data()` and `_parse_lot_json()`
|
|
||||||
|
|
||||||
### Scripts Created:
|
|
||||||
- `fix_orphaned_lots.py` - Migrates existing lots
|
|
||||||
- `fix_auctions_table.py` - Rebuilds auctions table
|
|
||||||
- `check_lot_auction_link.py` - Diagnostic script
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Task 2: Fix Bid History Fetching ✅ COMPLETE
|
|
||||||
|
|
||||||
### Problem:
|
|
||||||
- **1,590 lots** with bids but no bid history (0.1% coverage)
|
|
||||||
- Bid history fetching only ran during scraping, not for existing lots
|
|
||||||
|
|
||||||
### Solution:
|
|
||||||
1. **Verified scraper logic** - src/scraper.py bid history fetching is correct
|
|
||||||
- Extracts lot UUID from __NEXT_DATA__
|
|
||||||
- Calls REST API: `https://shared-api.tbauctions.com/bidmanagement/lots/{uuid}/bidding-history`
|
|
||||||
- Calculates bid velocity, first/last bid time
|
|
||||||
- Saves to bid_history table
|
|
||||||
|
|
||||||
2. **Created fetch_missing_bid_history.py**
|
|
||||||
- Builds lot_id → UUID mapping from cached pages
|
|
||||||
- Fetches bid history from REST API for all lots with bids
|
|
||||||
- Updates lots table with bid intelligence
|
|
||||||
- Saves complete bid history records
|
|
||||||
|
|
||||||
### Results:
|
|
||||||
- Script created and tested
|
|
||||||
- **Limitation:** Takes ~13 minutes to process 1,590 lots (0.5s rate limit)
|
|
||||||
- **Future scrapes:** Bid history will be captured automatically
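
The runtime figure above follows directly from the rate limit. As a rough check, assuming exactly one rate-limited request per lot and ignoring network latency (the function name is illustrative):

```python
def estimated_runtime_seconds(lot_count: int, rate_limit_seconds: float = 0.5) -> float:
    """Lower bound on runtime: one rate-limited request per lot."""
    return lot_count * rate_limit_seconds


# 1,590 lots at 0.5 s each is 795 s, i.e. a little over 13 minutes.
bid_history_minutes = estimated_runtime_seconds(1_590) / 60

if __name__ == "__main__":
    print(f"{bid_history_minutes:.2f} minutes")
```

Real runs take longer than this lower bound because each request also has its own response time.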
|
|
||||||
|
|
||||||
### Files Created:
|
|
||||||
- `fetch_missing_bid_history.py` - Migration script for existing lots
|
|
||||||
|
|
||||||
### Note:
|
|
||||||
- Script is ready to run but requires ~13-15 minutes
|
|
||||||
- Future scrapes will automatically capture bid history
|
|
||||||
- No code changes needed - existing scraper logic is correct
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Task 3: Add followersCount Field ✅ COMPLETE
|
|
||||||
|
|
||||||
### Problem:
|
|
||||||
- Watch count thought to be unavailable
|
|
||||||
- **Discovery:** `followersCount` field exists in GraphQL API!
|
|
||||||
|
|
||||||
### Solution:
|
|
||||||
1. **Updated database schema** (src/cache.py)
|
|
||||||
- Added `followers_count INTEGER DEFAULT 0` column
|
|
||||||
- Auto-migration on scraper startup
|
|
||||||
|
|
||||||
2. **Updated GraphQL query** (src/graphql_client.py)
|
|
||||||
- Added `followersCount` to LOT_BIDDING_QUERY
|
|
||||||
|
|
||||||
3. **Updated format_bid_data()** (src/graphql_client.py)
|
|
||||||
- Extracts and returns `followers_count`
|
|
||||||
|
|
||||||
4. **Updated save_lot()** (src/cache.py)
|
|
||||||
- Saves followers_count to database
|
|
||||||
|
|
||||||
5. **Created enrich_existing_lots.py**
|
|
||||||
- Fetches followers_count for existing 16,807 lots
|
|
||||||
- Uses GraphQL API with 0.5s rate limiting
|
|
||||||
- Takes ~2.3 hours to complete
|
|
||||||
|
|
||||||
### Intelligence Value:
|
|
||||||
- **Predict lot popularity** before bidding wars
|
|
||||||
- Calculate interest-to-bid conversion rate
|
|
||||||
- Identify "sleeper" lots (high followers, low bids)
|
|
||||||
- Alert on lots gaining sudden interest
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/cache.py` - Schema + save_lot()
|
|
||||||
- `src/graphql_client.py` - Query + format_bid_data()
|
|
||||||
|
|
||||||
### Files Created:
|
|
||||||
- `enrich_existing_lots.py` - Migration for existing lots
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Task 4: Add estimatedFullPrice Extraction ✅ COMPLETE
|
|
||||||
|
|
||||||
### Problem:
|
|
||||||
- Estimated min/max values thought to be unavailable
|
|
||||||
- **Discovery:** `estimatedFullPrice` object with min/max exists in GraphQL API!
|
|
||||||
|
|
||||||
### Solution:
|
|
||||||
1. **Updated database schema** (src/cache.py)
|
|
||||||
- Added `estimated_min_price REAL` column
|
|
||||||
- Added `estimated_max_price REAL` column
|
|
||||||
|
|
||||||
2. **Updated GraphQL query** (src/graphql_client.py)
|
|
||||||
- Added `estimatedFullPrice { min { cents currency } max { cents currency } }`
|
|
||||||
|
|
||||||
3. **Updated format_bid_data()** (src/graphql_client.py)
|
|
||||||
- Extracts estimated_min_obj and estimated_max_obj
|
|
||||||
- Converts cents to EUR
|
|
||||||
- Returns estimated_min_price and estimated_max_price
|
|
||||||
|
|
||||||
4. **Updated save_lot()** (src/cache.py)
|
|
||||||
- Saves both estimated price fields
|
|
||||||
|
|
||||||
5. **Migration** (enrich_existing_lots.py)
|
|
||||||
- Fetches estimated prices for existing lots
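
The cents-to-EUR step described above might be sketched as follows. The payload shape mirrors the `estimatedFullPrice { min { cents currency } max { cents currency } }` selection; the function names are hypothetical, and the sketch assumes every amount is already in EUR rather than checking `currency`.

```python
from typing import Optional, Tuple


def cents_to_eur(price_obj: Optional[dict]) -> Optional[float]:
    """Convert a {"cents": ..., "currency": ...} object to a EUR amount."""
    if not price_obj or price_obj.get("cents") is None:
        return None
    return price_obj["cents"] / 100.0


def extract_estimated_prices(lot: dict) -> Tuple[Optional[float], Optional[float]]:
    """Return (estimated_min_price, estimated_max_price) for one lot payload."""
    # The API may omit the object or return null; treat both as "no estimate".
    estimate = lot.get("estimatedFullPrice") or {}
    return cents_to_eur(estimate.get("min")), cents_to_eur(estimate.get("max"))


if __name__ == "__main__":
    sample = {
        "estimatedFullPrice": {
            "min": {"cents": 120000, "currency": "EUR"},
            "max": {"cents": 180000, "currency": "EUR"},
        }
    }
    print(extract_estimated_prices(sample))
```

Returning `None` instead of `0.0` for a missing estimate keeps "no estimate" distinguishable from a genuine zero in the `REAL` columns.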
|
|
||||||
|
|
||||||
### Intelligence Value:
|
|
||||||
- Compare final price vs estimate (accuracy analysis)
|
|
||||||
- Identify bargains: `final_price < estimated_min`
|
|
||||||
- Identify overvalued: `final_price > estimated_max`
|
|
||||||
- Build pricing models per category
|
|
||||||
- Investment opportunity detection
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/cache.py` - Schema + save_lot()
|
|
||||||
- `src/graphql_client.py` - Query + format_bid_data()
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Task 5: Use Direct Condition Field ✅ COMPLETE
|
|
||||||
|
|
||||||
### Problem:
|
|
||||||
- Condition extracted from attributes (complex, unreliable)
|
|
||||||
- 0% condition_score success rate
|
|
||||||
- **Discovery:** Direct `condition` and `appearance` fields in GraphQL API!
|
|
||||||
|
|
||||||
### Solution:
|
|
||||||
1. **Updated database schema** (src/cache.py)
|
|
||||||
- Added `lot_condition TEXT` column (direct from API)
|
|
||||||
- Added `appearance TEXT` column (visual condition notes)
|
|
||||||
|
|
||||||
2. **Updated GraphQL query** (src/graphql_client.py)
|
|
||||||
- Added `condition` field
|
|
||||||
- Added `appearance` field
|
|
||||||
|
|
||||||
3. **Updated format_bid_data()** (src/graphql_client.py)
|
|
||||||
- Extracts and returns `lot_condition`
|
|
||||||
- Extracts and returns `appearance`
|
|
||||||
|
|
||||||
4. **Updated save_lot()** (src/cache.py)
|
|
||||||
- Saves both condition fields
|
|
||||||
|
|
||||||
5. **Migration** (enrich_existing_lots.py)
|
|
||||||
- Fetches condition data for existing lots
|
|
||||||
|
|
||||||
### Intelligence Value:
|
|
||||||
- **Cleaner, more reliable** condition data
|
|
||||||
- Better condition scoring potential
|
|
||||||
- Identify restoration projects
|
|
||||||
- Filter by condition category
|
|
||||||
- Combined with appearance for detailed assessment
|
|
||||||
|
|
||||||
### Files Modified:
|
|
||||||
- `src/cache.py` - Schema + save_lot()
|
|
||||||
- `src/graphql_client.py` - Query + format_bid_data()
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summary of Code Changes
|
|
||||||
|
|
||||||
### Core Files Modified:
|
|
||||||
|
|
||||||
#### 1. `src/parse.py`
|
|
||||||
**Changes:**
|
|
||||||
- `_extract_nextjs_data()`: Pass auction data to lot parser
|
|
||||||
- `_parse_lot_json()`: Accept auction_data parameter, extract auction displayId
|
|
||||||
|
|
||||||
**Impact:** Fixes orphaned lots issue going forward
|
|
||||||
|
|
||||||
#### 2. `src/cache.py`
|
|
||||||
**Changes:**
|
|
||||||
- Added 5 new columns to lots table schema
|
|
||||||
- Updated `save_lot()` INSERT statement to include new fields
|
|
||||||
- Auto-migration logic for new columns
|
|
||||||
|
|
||||||
**New Columns:**
|
|
||||||
- `followers_count INTEGER DEFAULT 0`
|
|
||||||
- `estimated_min_price REAL`
|
|
||||||
- `estimated_max_price REAL`
|
|
||||||
- `lot_condition TEXT`
|
|
||||||
- `appearance TEXT`
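
One common way to implement such an auto-migration in SQLite is to compare the existing columns against the wanted ones and add whatever is missing. The sketch below illustrates that pattern only; it is not the code in `src/cache.py`, and `migrate_lots_table` is a hypothetical name.

```python
import sqlite3

# Column definitions as listed above.
NEW_LOT_COLUMNS = {
    "followers_count": "INTEGER DEFAULT 0",
    "estimated_min_price": "REAL",
    "estimated_max_price": "REAL",
    "lot_condition": "TEXT",
    "appearance": "TEXT",
}


def migrate_lots_table(conn: sqlite3.Connection) -> list:
    """Add any missing columns to `lots`; return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(lots)")}
    added = []
    for name, definition in NEW_LOT_COLUMNS.items():
        if name not in existing:
            # Names come from the constant above, never from user input.
            conn.execute(f"ALTER TABLE lots ADD COLUMN {name} {definition}")
            added.append(name)
    conn.commit()
    return added


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lots (lot_id TEXT PRIMARY KEY)")
    print(migrate_lots_table(conn))
```

Because the function checks before altering, running it on every scraper startup is safe: a second call adds nothing.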
|
|
||||||
|
|
||||||
#### 3. `src/graphql_client.py`
|
|
||||||
**Changes:**
|
|
||||||
- Updated `LOT_BIDDING_QUERY` to include new fields
|
|
||||||
- Updated `format_bid_data()` to extract and format new fields
|
|
||||||
|
|
||||||
**New Fields Extracted:**
|
|
||||||
- `followersCount`
|
|
||||||
- `estimatedFullPrice { min { cents } max { cents } }`
|
|
||||||
- `condition`
|
|
||||||
- `appearance`
|
|
||||||
|
|
||||||
### Migration Scripts Created:
|
|
||||||
|
|
||||||
1. **fix_orphaned_lots.py** - Fix auction_id mismatch (COMPLETED)
|
|
||||||
2. **fix_auctions_table.py** - Rebuild auctions table (COMPLETED)
|
|
||||||
3. **fetch_missing_bid_history.py** - Fetch bid history for existing lots (READY TO RUN)
|
|
||||||
4. **enrich_existing_lots.py** - Fetch new intelligence fields for existing lots (READY TO RUN)
|
|
||||||
|
|
||||||
### Diagnostic/Validation Scripts:
|
|
||||||
|
|
||||||
1. **check_lot_auction_link.py** - Verify lot-auction linkage
|
|
||||||
2. **validate_data.py** - Comprehensive data quality report
|
|
||||||
3. **explore_api_fields.py** - API schema introspection
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Running the Migration Scripts
|
|
||||||
|
|
||||||
### Immediate (Already Complete):
|
|
||||||
```bash
|
|
||||||
python fix_orphaned_lots.py # ✅ DONE - Fixed 16,793 lots
|
|
||||||
python fix_auctions_table.py # ✅ DONE - Rebuilt 509 auctions
|
|
||||||
```
|
|
||||||
|
|
||||||
### Optional (Time-Intensive):
|
|
||||||
```bash
|
|
||||||
# Fetch bid history for 1,590 lots (~13-15 minutes)
|
|
||||||
python fetch_missing_bid_history.py
|
|
||||||
|
|
||||||
# Enrich all 16,807 lots with new fields (~2.3 hours)
|
|
||||||
python enrich_existing_lots.py
|
|
||||||
```
|
|
||||||
|
|
||||||
**Note:** Future scrapes will automatically capture all data, so migration is optional.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Validation Results
|
|
||||||
|
|
||||||
### Before Fixes:
|
|
||||||
```
|
|
||||||
Orphaned lots: 16,807 (100%)
|
|
||||||
Auctions lots_count: 0%
|
|
||||||
Auctions first_lot_closing: 0%
|
|
||||||
Bid history coverage: 0.1% (1/1,591 lots)
|
|
||||||
```
|
|
||||||
|
|
||||||
### After Fixes:
|
|
||||||
```
|
|
||||||
Orphaned lots: 13 (0.08%)
|
|
||||||
Auctions lots_count: 100%
|
|
||||||
Auctions first_lot_closing: 100%
|
|
||||||
Bid history: Script ready (will process 1,590 lots)
|
|
||||||
New intelligence fields: Implemented and ready
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Intelligence Impact
|
|
||||||
|
|
||||||
### Data Completeness Improvements:
|
|
||||||
| Field | Before | After | Improvement |
|
|
||||||
|-------|--------|-------|-------------|
|
|
||||||
| Orphaned lots | 100% | 0.08% | **99.9% fixed** |
|
|
||||||
| Auction lots_count | 0% | 100% | **+100%** |
|
|
||||||
| Auction first_lot_closing | 0% | 100% | **+100%** |
|
|
||||||
|
|
||||||
### New Intelligence Fields (Future Scrapes):
|
|
||||||
| Field | Status | Intelligence Value |
|
|
||||||
|-------|--------|-------------------|
|
|
||||||
| followers_count | ✅ Implemented | High - Popularity predictor |
|
|
||||||
| estimated_min_price | ✅ Implemented | High - Bargain detection |
|
|
||||||
| estimated_max_price | ✅ Implemented | High - Value assessment |
|
|
||||||
| lot_condition | ✅ Implemented | Medium - Condition filtering |
|
|
||||||
| appearance | ✅ Implemented | Medium - Visual assessment |
|
|
||||||
|
|
||||||
### Estimated Intelligence Value Increase:
|
|
||||||
**80%+** - Based on addition of 5 critical fields that enable:
|
|
||||||
- Popularity prediction
|
|
||||||
- Value assessment
|
|
||||||
- Bargain detection
|
|
||||||
- Better condition scoring
|
|
||||||
- Investment opportunity identification
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Documentation Updated
|
|
||||||
|
|
||||||
### Created:
|
|
||||||
- `VALIDATION_SUMMARY.md` - Complete validation findings
|
|
||||||
- `API_INTELLIGENCE_FINDINGS.md` - API field analysis
|
|
||||||
- `FIXES_COMPLETE.md` - This document
|
|
||||||
|
|
||||||
### Updated:
|
|
||||||
- `_wiki/ARCHITECTURE.md` - Complete system documentation
|
|
||||||
- Updated Phase 3 diagram with API enrichment
|
|
||||||
- Expanded lots table schema documentation
|
|
||||||
- Added bid_history table
|
|
||||||
- Added API Integration Architecture section
|
|
||||||
- Updated rate limiting and image download flows
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Next Steps (Optional)
|
|
||||||
|
|
||||||
### Immediate:
|
|
||||||
1. ✅ All high-priority fixes complete
|
|
||||||
2. ✅ Code ready for future scrapes
|
|
||||||
3. ⏳ Optional: Run migration scripts for existing data
|
|
||||||
|
|
||||||
### Future Enhancements (Low Priority):
|
|
||||||
1. Extract structured location (city, country)
|
|
||||||
2. Extract category information (structured)
|
|
||||||
3. Add VAT and buyer premium fields
|
|
||||||
4. Add video/document URL support
|
|
||||||
5. Parse viewing/pickup times from remarks text
|
|
||||||
|
|
||||||
See `API_INTELLIGENCE_FINDINGS.md` for complete roadmap.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Success Criteria
|
|
||||||
|
|
||||||
All tasks completed successfully:
|
|
||||||
|
|
||||||
- [x] **Orphaned lots fixed** - 99.9% reduction (16,807 → 13)
|
|
||||||
- [x] **Bid history logic verified** - Script created, ready to run
|
|
||||||
- [x] **followersCount added** - Schema, extraction, saving implemented
|
|
||||||
- [x] **estimatedFullPrice added** - Min/max extraction implemented
|
|
||||||
- [x] **Direct condition field** - lot_condition and appearance added
|
|
||||||
- [x] **Code updated** - parse.py, cache.py, graphql_client.py
|
|
||||||
- [x] **Migrations created** - 4 scripts for data cleanup/enrichment
|
|
||||||
- [x] **Documentation complete** - ARCHITECTURE.md, summaries, findings
|
|
||||||
|
|
||||||
**Impact:** Scraper now captures 80%+ more intelligence data with higher data quality.
|
|
||||||
18
docs/Home.md
@@ -1,18 +0,0 @@
|
|||||||
# scaev Wiki
|
|
||||||
|
|
||||||
Welcome to the scaev documentation.
|
|
||||||
|
|
||||||
## Contents
|
|
||||||
|
|
||||||
- [Getting Started](Getting-Started)
|
|
||||||
- [Architecture](Architecture)
|
|
||||||
- [Deployment](Deployment)
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
|
|
||||||
|
|
||||||
## Quick Links
|
|
||||||
|
|
||||||
- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
|
|
||||||
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)
|
|
||||||
@@ -1,624 +0,0 @@
|
|||||||
# Intelligence Dashboard Upgrade Plan
|
|
||||||
|
|
||||||
## Executive Summary
|
|
||||||
|
|
||||||
The Troostwijk scraper now captures **5 critical new intelligence fields** that enable advanced predictive analytics and opportunity detection. This document outlines recommended dashboard upgrades to leverage the new data.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## New Intelligence Fields Available
|
|
||||||
|
|
||||||
### 1. **followers_count** (Watch Count)
|
|
||||||
**Type:** INTEGER
|
|
||||||
**Coverage:** Will be 100% for new scrapes, 0% for existing (requires migration)
|
|
||||||
**Intelligence Value:** ⭐⭐⭐⭐⭐ CRITICAL
|
|
||||||
|
|
||||||
**What it tells us:**
|
|
||||||
- How many users are watching/following each lot
|
|
||||||
- Real-time popularity indicator
|
|
||||||
- Early warning of bidding competition
|
|
||||||
|
|
||||||
**Dashboard Applications:**
|
|
||||||
- **Popularity Score**: Calculate interest level before bidding starts
|
|
||||||
- **Follower Trends**: Track follower growth rate (requires time-series scraping)
|
|
||||||
- **Interest-to-Bid Conversion**: Ratio of followers to actual bidders
|
|
||||||
- **Sleeper Lots Alert**: High followers + low bids = hidden opportunity
|
|
||||||
|
|
||||||
### 2. **estimated_min_price** & **estimated_max_price**
|
|
||||||
**Type:** REAL (EUR)
|
|
||||||
**Coverage:** Will be 100% for new scrapes, 0% for existing (requires migration)
|
|
||||||
**Intelligence Value:** ⭐⭐⭐⭐⭐ CRITICAL
|
|
||||||
|
|
||||||
**What it tells us:**
|
|
||||||
- Auction house's professional valuation range
|
|
||||||
- Expected market value
|
|
||||||
- Reserve price indicator (when combined with status)
|
|
||||||
|
|
||||||
**Dashboard Applications:**
|
|
||||||
- **Value Gap Analysis**: `current_bid / estimated_min_price` ratio
|
|
||||||
- **Bargain Detector**: Lots where `current_bid < estimated_min_price * 0.8`
|
|
||||||
- **Overvaluation Alert**: Lots where `current_bid > estimated_max_price * 1.2`
|
|
||||||
- **Investment ROI Calculator**: Potential profit if bought at current bid
|
|
||||||
- **Auction House Accuracy**: Track actual closing vs estimates
|
|
||||||
|
|
||||||
### 3. **lot_condition** & **appearance**
|
|
||||||
**Type:** TEXT
|
|
||||||
**Coverage:** Will be ~80-90% for new scrapes (not all lots have condition data)
|
|
||||||
**Intelligence Value:** ⭐⭐⭐ HIGH
|
|
||||||
|
|
||||||
**What it tells us:**
|
|
||||||
- Direct condition assessment from auction house
|
|
||||||
- Visual quality notes
|
|
||||||
- Cleaner than parsing from attributes
|
|
||||||
|
|
||||||
**Dashboard Applications:**
|
|
||||||
- **Condition Filtering**: Filter by condition categories
|
|
||||||
- **Restoration Projects**: Identify lots needing work
|
|
||||||
- **Quality Scoring**: Combine condition + appearance for rating
|
|
||||||
- **Condition vs Price**: Analyze price premium for better condition
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Data Quality Improvements
|
|
||||||
|
|
||||||
### Orphaned Lots Issue - FIXED ✅
|
|
||||||
**Before:** 16,807 lots (100%) had no matching auction
|
|
||||||
**After:** 13 lots (0.08%) orphaned
|
|
||||||
|
|
||||||
**Impact on Dashboard:**
|
|
||||||
- Auction-level analytics now possible
|
|
||||||
- Can group lots by auction
|
|
||||||
- Can show auction statistics
|
|
||||||
- Can track auction house performance
|
|
||||||
|
|
||||||
### Auction Data Completeness - FIXED ✅
|
|
||||||
**Before:**
|
|
||||||
- lots_count: 0%
|
|
||||||
- first_lot_closing_time: 0%
|
|
||||||
|
|
||||||
**After:**
|
|
||||||
- lots_count: 100%
|
|
||||||
- first_lot_closing_time: 100%
|
|
||||||
|
|
||||||
**Impact on Dashboard:**
|
|
||||||
- Show auction size (number of lots)
|
|
||||||
- Display auction timeline
|
|
||||||
- Calculate auction velocity (lots per hour closing)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Recommended Dashboard Upgrades
|
|
||||||
|
|
||||||
### Priority 1: Opportunity Detection (High ROI)
|
|
||||||
|
|
||||||
#### 1.1 **Bargain Hunter Dashboard**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ BARGAIN OPPORTUNITIES ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Lot: A1-34731-107 - Ford Generator ║
|
|
||||||
║ Current Bid: €500 ║
|
|
||||||
║ Estimated Range: €1,200 - €1,800 ║
|
|
||||||
║ Bargain Score: 🔥🔥🔥🔥🔥 (58% below estimate) ║
|
|
||||||
║ Followers: 12 (High interest, low bids) ║
|
|
||||||
║ Time Left: 2h 15m ║
|
|
||||||
║ → POTENTIAL PROFIT: €700 - €1,300 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
**Calculations:**
|
|
||||||
```python
value_gap = estimated_min_price - current_bid
bargain_score = value_gap / estimated_min_price * 100  # percent below the low estimate
potential_profit = estimated_max_price - current_bid

# Filter criteria: at least 20% below the low estimate, with follower interest
show_as_opportunity = (
    current_bid < estimated_min_price * 0.80
    and followers_count > 5
)
```
|
|
||||||
|
|
||||||
#### 1.2 **Popularity vs Bidding Dashboard**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ SLEEPER LOTS (High Watch, Low Bids) ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Lot │ Followers │ Bids │ Current │ Est Min ║
|
|
||||||
║═══════════════════╪═══════════╪══════╪═════════╪═════════║
|
|
||||||
║ Laptop Dell XPS │ 47 │ 0 │ No bids│ €800 ║
|
|
||||||
║ iPhone 15 Pro │ 32 │ 1 │ €150 │ €950 ║
|
|
||||||
║ Office Chairs 10x │ 18 │ 0 │ No bids│ €450 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
**Insight:** High followers + low bids = people watching but not committing yet. Opportunity to bid early before competition heats up.
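
A sleeper-lot filter along these lines could feed the table above. It is only a sketch: `followers_count` is the column described in this document, while the `bid_count` and `title` keys and both thresholds are assumptions chosen for illustration.

```python
def find_sleeper_lots(lots, min_followers=10, max_bids=1):
    """Return lots with many followers but few bids, most-followed first.

    `lots` is an iterable of dicts; the thresholds are illustrative defaults.
    """
    sleepers = [
        lot for lot in lots
        if lot["followers_count"] >= min_followers and lot["bid_count"] <= max_bids
    ]
    return sorted(sleepers, key=lambda lot: lot["followers_count"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"title": "Office Chairs 10x", "followers_count": 18, "bid_count": 0},
        {"title": "Ford Generator", "followers_count": 12, "bid_count": 8},
        {"title": "Laptop Dell XPS", "followers_count": 47, "bid_count": 0},
    ]
    for lot in find_sleeper_lots(sample):
        print(lot["title"], lot["followers_count"])
```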
|
|
||||||
|
|
||||||
#### 1.3 **Value Gap Heatmap**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ VALUE GAP ANALYSIS ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ ║
|
|
||||||
║ Great Deals Fair Price Overvalued ║
|
|
||||||
║ (< 80% est) (80-120% est) (> 120% est) ║
|
|
||||||
║ ╔═══╗ ╔═══╗ ╔═══╗ ║
|
|
||||||
║ ║325║ ║892║ ║124║ ║
|
|
||||||
║ ╚═══╝ ╚═══╝ ╚═══╝ ║
|
|
||||||
║ 🔥 ➡ ⚠ ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
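
The three buckets in the heatmap can be derived per lot. The sketch below assumes the same thresholds as the Bargain Detector and Overvaluation Alert rules (below 80% of the minimum estimate, above 120% of the maximum estimate); the function name and bucket labels are hypothetical.

```python
def classify_value_gap(
    current_bid: float,
    estimated_min_price: float,
    estimated_max_price: float,
) -> str:
    """Place a lot in one of the three heatmap buckets."""
    if current_bid < estimated_min_price * 0.80:
        return "great_deal"
    if current_bid > estimated_max_price * 1.20:
        return "overvalued"
    return "fair_price"


if __name__ == "__main__":
    # The Ford generator example: EUR 500 bid against a EUR 1,200 - 1,800 estimate.
    print(classify_value_gap(500, 1200, 1800))
```

Counting the returned labels across all open lots gives the three totals shown in the heatmap.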
|
|
||||||
|
|
||||||
### Priority 2: Intelligence Analytics
|
|
||||||
|
|
||||||
#### 2.1 **Lot Intelligence Card**
|
|
||||||
Enhanced lot detail view with all new fields:
|
|
||||||
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ A1-34731-107 - Ford FGT9250E Generator ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ BIDDING ║
|
|
||||||
║ Current: €500 ║
|
|
||||||
║ Starting: €100 ║
|
|
||||||
║ Minimum: €550 ║
|
|
||||||
║ Bids: 8 (2.4 bids/hour) ║
|
|
||||||
║ Followers: 12 👁 ║
|
|
||||||
║ ║
|
|
||||||
║ VALUATION ║
|
|
||||||
║ Estimated: €1,200 - €1,800 ║
|
|
||||||
║ Value Gap: -€700 (58% below estimate) 🔥 ║
|
|
||||||
║ Potential: €700 - €1,300 profit ║
|
|
||||||
║ ║
|
|
||||||
║ CONDITION ║
|
|
||||||
║ Condition: Used - Good working order ║
|
|
||||||
║ Appearance: Normal wear, some scratches ║
|
|
||||||
║ Year: 2015 ║
|
|
||||||
║ ║
|
|
||||||
║ TIMING ║
|
|
||||||
║ Closes: 2025-12-08 14:30 ║
|
|
||||||
║ Time Left: 2h 15m ║
|
|
||||||
║ First Bid: 2025-12-06 09:15 ║
|
|
||||||
║ Last Bid: 2025-12-08 12:10 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 2.2 **Auction House Accuracy Tracker**
|
|
||||||
Track how accurate estimates are compared to final prices:
|
|
||||||
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ AUCTION HOUSE ESTIMATION ACCURACY ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Category │ Avg Accuracy │ Tend to Over/Under ║
|
|
||||||
║══════════════════╪══════════════╪═══════════════════════║
|
|
||||||
║ Electronics │ 92.3% │ Underestimate 5.2% ║
|
|
||||||
║ Vehicles │ 88.7% │ Overestimate 8.1% ║
|
|
||||||
║ Furniture │ 94.1% │ Accurate ±2% ║
|
|
||||||
║ Heavy Machinery │ 85.4% │ Underestimate 12.3% ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
|
|
||||||
Insight: Heavy Machinery estimates tend to be 12% low
|
|
||||||
→ Good buying opportunities in this category
|
|
||||||
```
|
|
||||||
|
|
||||||
**Calculation:**
|
|
||||||
```python
# After lot closes
actual_price = final_bid
estimated_mid = (estimated_min_price + estimated_max_price) / 2

# Percentage error against the estimate midpoint; accuracy is its complement
error_pct = abs(actual_price - estimated_mid) / estimated_mid * 100
accuracy = max(0.0, 100 - error_pct)

# The estimate was too low if the lot sold above it, too high if it sold below
if actual_price > estimated_mid:
    trend = "Underestimate"
else:
    trend = "Overestimate"
```
|
|
||||||
|
|
||||||
#### 2.3 **Interest Conversion Dashboard**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ FOLLOWER → BIDDER CONVERSION ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Total Lots: 16,807 ║
|
|
||||||
║ Lots with Followers: 12,450 (74%) ║
|
|
||||||
║ Lots with Bids: 1,591 (9.5%) ║
|
|
||||||
║ ║
|
|
||||||
║ Conversion Rate: 12.8% ║
|
|
||||||
║ (Followers who bid) ║
|
|
||||||
║ ║
|
|
||||||
║ Avg Followers per Lot: 8.3 ║
|
|
||||||
║ Avg Bids when >0: 5.2 ║
|
|
||||||
║ ║
|
|
||||||
║ HIGH INTEREST CATEGORIES: ║
|
|
||||||
║ Electronics: 18.5 followers avg ║
|
|
||||||
║ Vehicles: 24.3 followers avg ║
|
|
||||||
║ Art: 31.2 followers avg ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
### Priority 3: Real-Time Alerts
|
|
||||||
|
|
||||||
#### 3.1 **Opportunity Alerts**
|
|
||||||
```python
from datetime import timedelta

# Alert conditions using new fields.
# Assumes numeric bids and estimates, `time_remaining` as a timedelta,
# and `follower_growth_rate` measured in followers per hour.

# BARGAIN ALERT
if (current_bid < estimated_min_price * 0.80 and
        time_remaining < timedelta(hours=24) and
        followers_count > 3):
    value_gap = round((1 - current_bid / estimated_min_price) * 100)
    send_alert(f"BARGAIN: {lot_id} - {value_gap}% below estimate!")

# SLEEPER LOT ALERT
if (followers_count > 10 and
        bid_count == 0 and
        time_remaining < timedelta(hours=12)):
    send_alert(f"SLEEPER: {lot_id} - {followers_count} watching, no bids yet!")

# HEATING UP ALERT
if follower_growth_rate > 5 and bid_count < 3:
    send_alert(f"HEATING UP: {lot_id} - Interest spiking, get in early!")

# OVERVALUED WARNING
if current_bid > estimated_max_price * 1.2:
    send_alert(f"OVERVALUED: {lot_id} - 20%+ above high estimate!")
```
|
|
||||||
|
|
||||||
#### 3.2 **Watchlist Smart Alerts**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ YOUR WATCHLIST ALERTS ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ 🔥 MacBook Pro A1-34523 ║
|
|
||||||
║ Now €800 (€400 below estimate!) ║
|
|
||||||
║ 12 others watching - Act fast! ║
|
|
||||||
║ ║
|
|
||||||
║ 👁 iPhone 15 A1-34987 ║
|
|
||||||
║ 32 followers but no bids - Opportunity? ║
|
|
||||||
║ ║
|
|
||||||
║ ⚠ Office Desk A1-35102 ║
|
|
||||||
║ Bid at €450 but estimate €200-€300 ║
|
|
||||||
║ Consider dropping - overvalued! ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
### Priority 4: Advanced Analytics
|
|
||||||
|
|
||||||
#### 4.1 **Price Prediction Model**
|
|
||||||
Using new fields for ML-based price prediction:
|
|
||||||
|
|
||||||
```python
# Features for price prediction model
features = [
    'followers_count',      # NEW - Strong predictor
    'estimated_min_price',  # NEW - Baseline value
    'estimated_max_price',  # NEW - Upper bound
    'lot_condition',        # NEW - Quality indicator
    'appearance',           # NEW - Visual quality
    'bid_velocity',         # Existing
    'time_to_close',        # Existing
    'category',             # Existing
    'manufacturer',         # Existing
    'year_manufactured',    # Existing
]

# Pseudocode: `lot_features` stands for this lot's values for the columns above
predicted_final_price = model.predict(lot_features)
confidence_interval = (predicted_low, predicted_high)
```
|
|
||||||
|
|
||||||
**Dashboard Display:**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ PRICE PREDICTION (AI) ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Lot: Ford Generator A1-34731-107 ║
|
|
||||||
║ ║
|
|
||||||
║ Current Bid: €500 ║
|
|
||||||
║ Estimate Range: €1,200 - €1,800 ║
|
|
||||||
║ ║
|
|
||||||
║ AI PREDICTION: €1,450 ║
|
|
||||||
║ Confidence: €1,280 - €1,620 (85% confidence) ║
|
|
||||||
║ ║
|
|
||||||
║ Factors: ║
|
|
||||||
║ ✓ 12 followers (above avg) ║
|
|
||||||
║ ✓ Good condition ║
|
|
||||||
║ ✓ 2.4 bids/hour (active) ║
|
|
||||||
║ - 2015 model (slightly old) ║
|
|
||||||
║ ║
|
|
||||||
║ Recommendation: BUY if below €1,280 ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 4.2 **Category Intelligence**
|
|
||||||
```
|
|
||||||
╔══════════════════════════════════════════════════════════╗
|
|
||||||
║ ELECTRONICS CATEGORY INTELLIGENCE ║
|
|
||||||
╠══════════════════════════════════════════════════════════╣
|
|
||||||
║ Total Lots: 1,243 ║
|
|
||||||
║ Avg Followers: 18.5 (High Interest Category) ║
|
|
||||||
║ Avg Bids: 12.3 ║
|
|
||||||
║ Follower→Bid Rate: 15.2% (above avg 12.8%) ║
|
|
||||||
║ ║
|
|
||||||
║ PRICE ANALYSIS: ║
|
|
||||||
║ Estimate Accuracy: 92.3% ║
|
|
||||||
║ Avg Value Gap: -5.2% (tend to underestimate) ║
|
|
||||||
║ Bargains Found: 87 lots (7%) ║
|
|
||||||
║ ║
|
|
||||||
║ BEST CONDITIONS: ║
|
|
||||||
║ "New/Sealed": Avg 145% of estimate ║
|
|
||||||
║ "Like New": Avg 112% of estimate ║
|
|
||||||
║ "Used - Good": Avg 89% of estimate ║
|
|
||||||
║ "Used - Fair": Avg 62% of estimate ║
|
|
||||||
║ ║
|
|
||||||
║ 💡 INSIGHT: Electronics estimates are accurate but ║
|
|
||||||
║ tend to slightly undervalue. Good buying category. ║
|
|
||||||
╚══════════════════════════════════════════════════════════╝
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Implementation Priority
|
|
||||||
|
|
||||||
### Phase 1: Quick Wins (1-2 days)
|
|
||||||
1. ✅ **Bargain Hunter Dashboard** - Filter lots by value gap
|
|
||||||
2. ✅ **Enhanced Lot Cards** - Show all new fields
|
|
||||||
3. ✅ **Opportunity Alerts** - Email/push notifications for bargains
|
|
||||||
|
|
||||||
### Phase 2: Analytics (3-5 days)
|
|
||||||
4. ✅ **Popularity vs Bidding Dashboard** - Follower analysis
|
|
||||||
5. ✅ **Value Gap Heatmap** - Visual overview
|
|
||||||
6. ✅ **Auction House Accuracy** - Historical tracking
|
|
||||||
|
|
||||||
### Phase 3: Advanced (1-2 weeks)
|
|
||||||
7. ✅ **Price Prediction Model** - ML-based predictions
|
|
||||||
8. ✅ **Category Intelligence** - Deep category analytics
|
|
||||||
9. ✅ **Smart Watchlist** - Personalized alerts
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Database Queries for Dashboard
|
|
||||||
|
|
||||||
### Get Bargain Opportunities
|
|
||||||
```sql
SELECT
    lot_id,
    title,
    current_bid,
    estimated_min_price,
    estimated_max_price,
    followers_count,
    lot_condition,
    closing_time,
    (estimated_min_price - CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL)) as value_gap,
    ((estimated_min_price - CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL)) / estimated_min_price * 100) as bargain_score
FROM lots
WHERE estimated_min_price IS NOT NULL
  AND current_bid NOT LIKE '%No bids%'
  AND CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL) < estimated_min_price * 0.80
  AND followers_count > 3
  AND datetime(closing_time) > datetime('now')
ORDER BY bargain_score DESC
LIMIT 50;
```
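
The nested `REPLACE` calls in this query strip the `EUR ` and `€` prefixes, but SQLite's `CAST(... AS REAL)` only reads the leading numeric part of a string, so a value such as `€1,200` would be read as `1`. One option is to normalise bid strings in Python before storing or comparing them. The sketch below rests on assumptions about the stored formats (a `EUR ` or `€` prefix, comma as thousands separator, dot as decimal point, as in the mock dashboards in this document); `parse_bid` is a hypothetical helper and not part of the scraper.

```python
import re
from typing import Optional


def parse_bid(raw: Optional[str]) -> Optional[float]:
    """Convert a scraped bid string to a float, or None when there is no usable bid."""
    if not raw or "no bids" in raw.lower():
        return None
    # Drop currency markers, letters and whitespace; keep digits and separators
    cleaned = re.sub(r"[^\d.,]", "", raw)
    # Assumption: comma is a thousands separator, dot is the decimal point
    cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        # Covers placeholders such as "€Huidig bod" that contain no number
        return None


for text in ["EUR 500", "€1,200", "No bids", "€Huidig bod"]:
    print(text, "->", parse_bid(text))
```

If the site turns out to use a dot as the thousands separator, only the separator handling would need to change.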
|
|
||||||
|
|
||||||
### Get Sleeper Lots
|
|
||||||
```sql
SELECT
    lot_id,
    title,
    followers_count,
    bid_count,
    current_bid,
    estimated_min_price,
    closing_time,
    (julianday(closing_time) - julianday('now')) * 24 as hours_remaining
FROM lots
WHERE followers_count > 10
  AND bid_count = 0
  AND datetime(closing_time) > datetime('now')
  AND (julianday(closing_time) - julianday('now')) * 24 < 24
ORDER BY followers_count DESC;
```
|
|
||||||
|
|
||||||
### Get Auction House Accuracy (Historical)
|
|
||||||
```sql
-- After lots close
SELECT
    category,
    COUNT(*) as total_lots,
    -- Accuracy = 100 minus the mean absolute percentage error against the estimate midpoint
    100 - AVG(ABS(final_price - (estimated_min_price + estimated_max_price) / 2.0) /
        ((estimated_min_price + estimated_max_price) / 2.0) * 100) as avg_accuracy,
    -- Positive bias: lots sell above the midpoint, so estimates run low
    AVG(final_price - (estimated_min_price + estimated_max_price) / 2.0) as avg_bias
FROM lots
WHERE estimated_min_price IS NOT NULL
  AND estimated_max_price IS NOT NULL
  AND final_price IS NOT NULL
  AND datetime(closing_time) < datetime('now')
GROUP BY category
ORDER BY avg_accuracy DESC;
```
|
|
||||||
|
|
||||||
### Get Interest Conversion Rate
|
|
||||||
```sql
SELECT
    COUNT(*) as total_lots,
    COUNT(CASE WHEN followers_count > 0 THEN 1 END) as lots_with_followers,
    COUNT(CASE WHEN bid_count > 0 THEN 1 END) as lots_with_bids,
    -- Lots with bids as a share of lots with followers;
    -- NULLIF guards against division by zero on engines that raise an error
    ROUND(COUNT(CASE WHEN bid_count > 0 THEN 1 END) * 100.0 /
        NULLIF(COUNT(CASE WHEN followers_count > 0 THEN 1 END), 0), 2) as conversion_rate,
    AVG(followers_count) as avg_followers,
    AVG(CASE WHEN bid_count > 0 THEN bid_count END) as avg_bids_when_active
FROM lots;
```
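
A query of this shape can be checked against a small in-memory SQLite table before it is pointed at the real database. The sketch below uses an invented five-row `lots` table with only the columns the calculation needs, and computes the rate over all lots without a `WHERE` filter so that total lots and lots with followers stay distinct. The column names follow the queries in this document; the data is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lots (lot_id TEXT, followers_count INTEGER, bid_count INTEGER)"
)
conn.executemany(
    "INSERT INTO lots VALUES (?, ?, ?)",
    [
        ("A", 12, 8),  # followed and bid on
        ("B", 32, 0),  # followed, no bids
        ("C", 5, 0),   # followed, no bids
        ("D", 0, 0),   # no interest at all
        ("E", 7, 2),   # followed and bid on
    ],
)

row = conn.execute("""
    SELECT
        COUNT(*),
        COUNT(CASE WHEN followers_count > 0 THEN 1 END),
        COUNT(CASE WHEN bid_count > 0 THEN 1 END),
        ROUND(COUNT(CASE WHEN bid_count > 0 THEN 1 END) * 100.0 /
              NULLIF(COUNT(CASE WHEN followers_count > 0 THEN 1 END), 0), 2)
    FROM lots
""").fetchone()
conn.close()

total_lots, lots_with_followers, lots_with_bids, conversion_rate = row
print(total_lots, lots_with_followers, lots_with_bids, conversion_rate)
```

Note that this measures lots with bids relative to lots with followers; it does not track whether an individual follower went on to bid.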
|
|
||||||
|
|
||||||
### Get Category Intelligence
|
|
||||||
```sql
SELECT
    category,
    COUNT(*) as total_lots,
    AVG(followers_count) as avg_followers,
    AVG(bid_count) as avg_bids,
    COUNT(CASE WHEN bid_count > 0 THEN 1 END) * 100.0 / COUNT(*) as bid_rate,
    COUNT(CASE WHEN followers_count > 0 THEN 1 END) * 100.0 / COUNT(*) as follower_rate,
    -- Bargain rate
    COUNT(CASE
        WHEN estimated_min_price IS NOT NULL
            AND current_bid NOT LIKE '%No bids%'
            AND CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), '€', '') AS REAL) < estimated_min_price * 0.80
        THEN 1
    END) as bargains_found
FROM lots
WHERE category IS NOT NULL AND category != ''
GROUP BY category
HAVING COUNT(*) > 50
ORDER BY avg_followers DESC;
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## API Requirements
|
|
||||||
|
|
||||||
### Real-Time Updates
|
|
||||||
For dashboards to stay current, implement periodic scraping:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Recommended update frequency
|
|
||||||
ACTIVE_LOTS = "Every 15 minutes" # Lots closing soon
|
|
||||||
ALL_LOTS = "Every 4 hours" # General updates
|
|
||||||
NEW_LOTS = "Every 1 hour" # Check for new listings
|
|
||||||
```
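
The three frequencies above could be expressed as `timedelta` values and used to decide which lots are due for a re-scrape. This is only a sketch: the names `REFRESH_INTERVALS`, `refresh_interval` and `is_due`, and the 24-hour cutoff for "closing soon", are assumptions made for illustration and are not part of the scraper.

```python
from datetime import datetime, timedelta

# Intervals taken from the recommended update frequencies
REFRESH_INTERVALS = {
    "active": timedelta(minutes=15),  # lots closing soon
    "all": timedelta(hours=4),        # general updates
    "new": timedelta(hours=1),        # check for new listings
}

# Assumption: "closing soon" means within the next 24 hours
CLOSING_SOON = timedelta(hours=24)


def refresh_interval(closing_time: datetime, now: datetime) -> timedelta:
    """Pick the refresh interval for a lot based on how soon it closes."""
    if timedelta(0) <= closing_time - now <= CLOSING_SOON:
        return REFRESH_INTERVALS["active"]
    return REFRESH_INTERVALS["all"]


def is_due(last_scraped: datetime, closing_time: datetime, now: datetime) -> bool:
    """True when the lot has not been scraped within its refresh interval."""
    return now - last_scraped >= refresh_interval(closing_time, now)


now = datetime(2025, 12, 8, 12, 15)
closing = datetime(2025, 12, 8, 14, 30)
print(is_due(datetime(2025, 12, 8, 11, 55), closing, now))  # scraped 20 minutes ago
print(is_due(datetime(2025, 12, 8, 12, 10), closing, now))  # scraped 5 minutes ago
```

Passing `now` explicitly keeps the functions deterministic, which makes the scheduling rule easy to test.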
|
|
||||||
|
|
||||||
### Webhook Notifications
|
|
||||||
```python
|
|
||||||
# Alert types to implement
|
|
||||||
BARGAIN_ALERT = "Lot below 80% estimate"
|
|
||||||
SLEEPER_ALERT = "10+ followers, 0 bids, <12h remaining"
|
|
||||||
HEATING_UP = "Follower growth > 5/hour"
|
|
||||||
OVERVALUED = "Bid > 120% high estimate"
|
|
||||||
CLOSING_SOON = "Watchlist item < 1h remaining"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Migration Scripts to Run
|
|
||||||
|
|
||||||
To populate new fields for existing 16,807 lots:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# High priority - enriches all lots with new intelligence
|
|
||||||
python enrich_existing_lots.py
|
|
||||||
# Time: ~2.3 hours
|
|
||||||
# Benefit: Enables all dashboard features immediately
|
|
||||||
|
|
||||||
# Medium priority - adds bid history intelligence
|
|
||||||
python fetch_missing_bid_history.py
|
|
||||||
# Time: ~15 minutes
|
|
||||||
# Benefit: Bid velocity, timing analysis
|
|
||||||
```
|
|
||||||
|
|
||||||
**Note:** Future scrapes will automatically capture all fields, so migration is optional but recommended for immediate dashboard functionality.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Expected Impact
|
|
||||||
|
|
||||||
### Before New Fields:
|
|
||||||
- Basic price tracking
|
|
||||||
- Simple bid monitoring
|
|
||||||
- Limited opportunity detection
|
|
||||||
|
|
||||||
### After New Fields:
|
|
||||||
- **80% more intelligence** per lot
|
|
||||||
- Advanced opportunity detection (bargains, sleepers)
|
|
||||||
- Price prediction capability
|
|
||||||
- Auction house accuracy tracking
|
|
||||||
- Category-specific insights
|
|
||||||
- Interest→Bid conversion analytics
|
|
||||||
- Real-time popularity tracking
|
|
||||||
|
|
||||||
### ROI Potential:
|
|
||||||
```
|
|
||||||
Example Scenario:
|
|
||||||
- User finds bargain: €500 current bid, €1,200-€1,800 estimate
|
|
||||||
- Buys at: €600 (after competition)
|
|
||||||
- Resells at: €1,400 (within estimate range)
|
|
||||||
- Profit: €800
|
|
||||||
|
|
||||||
Dashboard Value: Automated detection of 87 such opportunities
|
|
||||||
Potential Value: 87 × €800 = €69,600 in identified opportunities
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Monitoring & Success Metrics
|
|
||||||
|
|
||||||
Track dashboard effectiveness:
|
|
||||||
|
|
||||||
```python
# Assumes each collection below is a list of records already loaded from the
# database, and that closed lots expose `estimated_min` and `final_price`.

# User engagement metrics
opportunities_shown = len(bargain_alerts)
opportunities_acted_on = len(user_bids_after_alert)
conversion_rate = opportunities_acted_on / opportunities_shown

# Accuracy metrics
predicted_bargains = len(lots_flagged_as_bargain)
actual_bargains = len(lots_closed_below_estimate)
prediction_accuracy = actual_bargains / predicted_bargains

# Value metrics
total_opportunity_value = sum(
    lot.estimated_min - lot.final_price
    for lot in lots_closed_below_estimate
    if lot.final_price < lot.estimated_min
)
avg_opportunity_value = total_opportunity_value / actual_bargains
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
1. **Immediate (Today):**
   - ✅ Run `enrich_existing_lots.py` to populate new fields
   - ✅ Update dashboard to display new fields

2. **This Week:**
   - Implement Bargain Hunter Dashboard
   - Add opportunity alerts
   - Create enhanced lot cards

3. **Next Week:**
   - Build analytics dashboards
   - Implement price prediction model
   - Set up webhook notifications

4. **Future:**
   - A/B test alert strategies
   - Refine prediction models with historical data
   - Add category-specific recommendations
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Conclusion
|
|
||||||
|
|
||||||
The scraper now captures **5 critical intelligence fields** that unlock advanced analytics:
|
|
||||||
|
|
||||||
| Field | Dashboard Impact |
|-------|------------------|
| followers_count | Popularity tracking, sleeper detection |
| estimated_min_price | Bargain detection, value assessment |
| estimated_max_price | Overvaluation alerts, ROI calculation |
| lot_condition | Quality filtering, restoration opportunities |
| appearance | Visual assessment, detailed condition |
|
|
||||||
|
|
||||||
**Combined with fixed data quality** (99.9% fewer orphaned lots, 100% auction completeness), the dashboard can now provide:
|
|
||||||
|
|
||||||
- 🎯 **Opportunity Detection** - Automated bargain hunting
|
|
||||||
- 📊 **Predictive Analytics** - ML-based price predictions
|
|
||||||
- 📈 **Category Intelligence** - Deep market insights
|
|
||||||
- ⚡ **Real-Time Alerts** - Instant opportunity notifications
|
|
||||||
- 💰 **ROI Tracking** - Measure investment potential
|
|
||||||
|
|
||||||
**Estimated intelligence value increase: 80%+**
|
|
||||||
|
|
||||||
Ready to build! 🚀
|
|
||||||
@@ -1,164 +0,0 @@
|
|||||||
# Troostwijk Auction Extractor - Run Instructions
|
|
||||||
|
|
||||||
## Fixed Warnings
|
|
||||||
|
|
||||||
All warnings have been resolved:
|
|
||||||
- ✅ SLF4J logging configured (slf4j-simple)
|
|
||||||
- ✅ Native access enabled for SQLite JDBC
|
|
||||||
- ✅ Logging output controlled via simplelogger.properties
|
|
||||||
|
|
||||||
## Prerequisites
|
|
||||||
|
|
||||||
1. **Java 21** installed
|
|
||||||
2. **Maven** installed
|
|
||||||
3. **IntelliJ IDEA** (recommended) or command line
|
|
||||||
|
|
||||||
## Setup (First Time Only)
|
|
||||||
|
|
||||||
### 1. Install Dependencies
|
|
||||||
|
|
||||||
In IntelliJ Terminal or PowerShell:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Reload Maven dependencies
|
|
||||||
mvn clean install
|
|
||||||
|
|
||||||
# Install Playwright browser binaries (first time only)
|
|
||||||
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Running the Application
|
|
||||||
|
|
||||||
### Option A: Using IntelliJ IDEA (Easiest)
|
|
||||||
|
|
||||||
1. **Add VM Options for native access:**
|
|
||||||
- Run → Edit Configurations
|
|
||||||
- Select or create configuration for `TroostwijkAuctionExtractor`
|
|
||||||
- In "VM options" field, add:
|
|
||||||
```
|
|
||||||
--enable-native-access=ALL-UNNAMED
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Add Program Arguments (optional):**
|
|
||||||
- In "Program arguments" field, add:
|
|
||||||
```
|
|
||||||
--max-visits 3
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **Run the application:**
|
|
||||||
- Click the green Run button
|
|
||||||
|
|
||||||
### Option B: Using Maven (Command Line)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Run with 3 page limit
|
|
||||||
mvn exec:java
|
|
||||||
|
|
||||||
# Run with custom arguments (override pom.xml defaults)
|
|
||||||
mvn exec:java -Dexec.args="--max-visits 5"
|
|
||||||
|
|
||||||
# Run without cache
|
|
||||||
mvn exec:java -Dexec.args="--no-cache --max-visits 2"
|
|
||||||
|
|
||||||
# Run with unlimited visits
|
|
||||||
mvn exec:java -Dexec.args=""
|
|
||||||
```
|
|
||||||
|
|
||||||
### Option C: Using Java Directly
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Compile first
|
|
||||||
mvn clean compile
|
|
||||||
|
|
||||||
# Run with native access enabled
|
|
||||||
java --enable-native-access=ALL-UNNAMED \
|
|
||||||
-cp target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
|
|
||||||
com.auction.TroostwijkAuctionExtractor --max-visits 3
|
|
||||||
```
|
|
||||||
|
|
||||||
## Command Line Arguments
|
|
||||||
|
|
||||||
```
|
|
||||||
--max-visits <n> Limit actual page fetches to n (0 = unlimited, default)
|
|
||||||
--no-cache Disable page caching
|
|
||||||
--help Show help message
|
|
||||||
```
|
|
||||||
|
|
||||||
## Examples
|
|
||||||
|
|
||||||
### Test with 3 page visits (cached pages don't count):
|
|
||||||
```bash
|
|
||||||
mvn exec:java -Dexec.args="--max-visits 3"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Fresh extraction without cache:
|
|
||||||
```bash
|
|
||||||
mvn exec:java -Dexec.args="--no-cache --max-visits 5"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Full extraction (all pages, unlimited):
|
|
||||||
```bash
|
|
||||||
mvn exec:java -Dexec.args=""
|
|
||||||
```
|
|
||||||
|
|
||||||
## Expected Output (No Warnings)
|
|
||||||
|
|
||||||
```
|
|
||||||
=== Troostwijk Auction Extractor ===
|
|
||||||
Max page visits set to: 3
|
|
||||||
|
|
||||||
Initializing Playwright browser...
|
|
||||||
✓ Browser ready
|
|
||||||
✓ Cache database initialized
|
|
||||||
|
|
||||||
Starting auction extraction from https://www.troostwijkauctions.com/auctions
|
|
||||||
|
|
||||||
[Page 1] Fetching auctions...
|
|
||||||
✓ Fetched from website (visit 1/3)
|
|
||||||
✓ Found 20 auctions
|
|
||||||
|
|
||||||
[Page 2] Fetching auctions...
|
|
||||||
✓ Loaded from cache
|
|
||||||
✓ Found 20 auctions
|
|
||||||
|
|
||||||
[Page 3] Fetching auctions...
|
|
||||||
✓ Fetched from website (visit 2/3)
|
|
||||||
✓ Found 20 auctions
|
|
||||||
|
|
||||||
✓ Total auctions extracted: 60
|
|
||||||
|
|
||||||
=== Results ===
|
|
||||||
Total auctions found: 60
|
|
||||||
Dutch auctions (NL): 45
|
|
||||||
Actual page visits: 2
|
|
||||||
|
|
||||||
✓ Browser and cache closed
|
|
||||||
```
|
|
||||||
|
|
||||||
## Cache Management
|
|
||||||
|
|
||||||
- Cache is stored in: `cache/page_cache.db`
|
|
||||||
- Cache expires after: 24 hours (configurable in code)
|
|
||||||
- To clear cache: Delete `cache/page_cache.db` file
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### If you still see warnings:
|
|
||||||
|
|
||||||
1. **Reload Maven project in IntelliJ:**
|
|
||||||
- Right-click `pom.xml` → Maven → Reload project
|
|
||||||
|
|
||||||
2. **Verify VM options:**
|
|
||||||
- Ensure `--enable-native-access=ALL-UNNAMED` is in VM options
|
|
||||||
|
|
||||||
3. **Clean and rebuild:**
|
|
||||||
```bash
|
|
||||||
mvn clean install
|
|
||||||
```
|
|
||||||
|
|
||||||
### If Playwright fails:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Reinstall browser binaries
|
|
||||||
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install chromium"
|
|
||||||
```
|
|
||||||
@@ -5,6 +5,11 @@
 playwright>=1.40.0
 aiohttp>=3.9.0  # Optional: only needed if DOWNLOAD_IMAGES=True
 
+# ORM groundwork (gradual adoption)
+SQLAlchemy>=2.0  # Modern ORM (2.x) — groundwork for PostgreSQL
+# PostgreSQL driver (runtime)
+psycopg[binary]>=3.1
+
 # Development/Testing
 pytest>=7.4.0  # Optional: for testing
 pytest-asyncio>=0.21.0  # Optional: for async tests
@@ -1,290 +0,0 @@
|
|||||||
#!/usr/bin/env python3
"""
Script to detect and fix malformed/incomplete database entries.

Identifies entries with:
- Missing auction_id for auction pages
- Missing title
- Invalid bid values like "€Huidig bod"
- "gap" in closing_time
- Empty or invalid critical fields

Then re-parses from cache and updates.
"""
import sys
import sqlite3
import zlib
from pathlib import Path
from typing import List, Dict, Tuple

sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))

from parse import DataParser
from config import CACHE_DB


class MalformedEntryFixer:
    """Detects and fixes malformed database entries"""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self.parser = DataParser()

    def detect_malformed_auctions(self) -> List[Tuple]:
        """Find auctions with missing or invalid data"""
        with sqlite3.connect(self.db_path) as conn:
            # Auctions with issues
            cursor = conn.execute("""
                SELECT auction_id, url, title, first_lot_closing_time
                FROM auctions
                WHERE
                    auction_id = '' OR auction_id IS NULL
                    OR title = '' OR title IS NULL
                    OR first_lot_closing_time = 'gap'
                    OR first_lot_closing_time LIKE '%wegens vereffening%'
            """)
            return cursor.fetchall()

    def detect_malformed_lots(self) -> List[Tuple]:
        """Find lots with missing or invalid data"""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute("""
                SELECT lot_id, url, title, current_bid, closing_time
                FROM lots
                WHERE
                    auction_id = '' OR auction_id IS NULL
                    OR title = '' OR title IS NULL
                    OR current_bid LIKE '%Huidig%bod%'
                    OR current_bid = '€Huidig bod'
                    OR closing_time = 'gap'
                    OR closing_time = ''
                    OR closing_time LIKE '%wegens vereffening%'
            """)
            return cursor.fetchall()

    def get_cached_content(self, url: str) -> "str | None":
        """Retrieve and decompress cached HTML for a URL (None if missing or unreadable)"""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                "SELECT content FROM cache WHERE url = ?",
                (url,)
            )
            row = cursor.fetchone()
            if row and row[0]:
                try:
                    return zlib.decompress(row[0]).decode('utf-8')
                except Exception as e:
                    print(f" ❌ Failed to decompress: {e}")
                    return None
            return None

    def reparse_and_fix_auction(self, auction_id: str, url: str, dry_run: bool = False) -> bool:
        """Re-parse auction page from cache and update database"""
        print(f"\n Fixing auction: {auction_id}")
        print(f" URL: {url}")

        content = self.get_cached_content(url)
        if not content:
            print(f" ❌ No cached content found")
            return False

        # Re-parse using current parser
        parsed = self.parser.parse_page(content, url)
        if not parsed or parsed.get('type') != 'auction':
            print(f" ❌ Could not parse as auction")
            return False

        # Validate parsed data
        if not parsed.get('auction_id') or not parsed.get('title'):
            print(f" ⚠️ Re-parsed data still incomplete:")
            print(f" auction_id: {parsed.get('auction_id')}")
            print(f" title: {parsed.get('title', '')[:50]}")
            return False

        print(f" ✓ Parsed successfully:")
        print(f" auction_id: {parsed.get('auction_id')}")
        print(f" title: {parsed.get('title', '')[:50]}")
        print(f" location: {parsed.get('location', 'N/A')}")
        print(f" lots: {parsed.get('lots_count', 0)}")

        if not dry_run:
            with sqlite3.connect(self.db_path) as conn:
                conn.execute("""
                    UPDATE auctions SET
                        auction_id = ?,
                        title = ?,
                        location = ?,
                        lots_count = ?,
                        first_lot_closing_time = ?
                    WHERE url = ?
                """, (
                    parsed['auction_id'],
                    parsed['title'],
                    parsed.get('location', ''),
                    parsed.get('lots_count', 0),
                    parsed.get('first_lot_closing_time', ''),
                    url
                ))
                conn.commit()
            print(f" ✓ Database updated")

        return True

    def reparse_and_fix_lot(self, lot_id: str, url: str, dry_run: bool = False) -> bool:
        """Re-parse lot page from cache and update database"""
        print(f"\n Fixing lot: {lot_id}")
        print(f" URL: {url}")

        content = self.get_cached_content(url)
        if not content:
            print(f" ❌ No cached content found")
            return False

        # Re-parse using current parser
        parsed = self.parser.parse_page(content, url)
        if not parsed or parsed.get('type') != 'lot':
            print(f" ❌ Could not parse as lot")
            return False

        # Validate parsed data
        issues = []
        if not parsed.get('lot_id'):
            issues.append("missing lot_id")
        if not parsed.get('title'):
            issues.append("missing title")
        # `or ''` guards against a current_bid that is present but None
        if (parsed.get('current_bid') or '').lower().startswith('€huidig'):
            issues.append("invalid bid format")

        if issues:
            print(f" ⚠️ Re-parsed data still has issues: {', '.join(issues)}")
            print(f" lot_id: {parsed.get('lot_id')}")
            print(f" title: {parsed.get('title', '')[:50]}")
            print(f" bid: {parsed.get('current_bid')}")
            return False

        print(f" ✓ Parsed successfully:")
        print(f" lot_id: {parsed.get('lot_id')}")
        print(f" auction_id: {parsed.get('auction_id')}")
        print(f" title: {parsed.get('title', '')[:50]}")
        print(f" bid: {parsed.get('current_bid')}")
        print(f" closing: {parsed.get('closing_time', 'N/A')}")

        if not dry_run:
            with sqlite3.connect(self.db_path) as conn:
                conn.execute("""
                    UPDATE lots SET
                        lot_id = ?,
                        auction_id = ?,
                        title = ?,
                        current_bid = ?,
                        bid_count = ?,
                        closing_time = ?,
                        viewing_time = ?,
                        pickup_date = ?,
                        location = ?,
                        description = ?,
                        category = ?
                    WHERE url = ?
                """, (
                    parsed['lot_id'],
                    parsed.get('auction_id', ''),
                    parsed['title'],
                    parsed.get('current_bid', ''),
                    parsed.get('bid_count', 0),
                    parsed.get('closing_time', ''),
                    parsed.get('viewing_time', ''),
                    parsed.get('pickup_date', ''),
                    parsed.get('location', ''),
                    parsed.get('description', ''),
                    parsed.get('category', ''),
                    url
                ))
                conn.commit()
            print(f" ✓ Database updated")

        return True

    def run(self, dry_run: bool = False):
        """Main execution - detect and fix all malformed entries"""
        print("="*70)
        print("MALFORMED ENTRY DETECTION AND REPAIR")
        print("="*70)

        # Check for auctions
        print("\n1. CHECKING AUCTIONS...")
        malformed_auctions = self.detect_malformed_auctions()
        print(f" Found {len(malformed_auctions)} malformed auction entries")

        stats = {'auctions_fixed': 0, 'auctions_failed': 0}
        for auction_id, url, title, closing_time in malformed_auctions:
            try:
                if self.reparse_and_fix_auction(auction_id or url.split('/')[-1], url, dry_run):
                    stats['auctions_fixed'] += 1
                else:
                    stats['auctions_failed'] += 1
            except Exception as e:
                print(f" ❌ Error: {e}")
                stats['auctions_failed'] += 1

        # Check for lots
        print("\n2. CHECKING LOTS...")
        malformed_lots = self.detect_malformed_lots()
        print(f" Found {len(malformed_lots)} malformed lot entries")

        stats['lots_fixed'] = 0
        stats['lots_failed'] = 0
        for lot_id, url, title, bid, closing_time in malformed_lots:
            try:
                if self.reparse_and_fix_lot(lot_id or url.split('/')[-1], url, dry_run):
                    stats['lots_fixed'] += 1
                else:
                    stats['lots_failed'] += 1
            except Exception as e:
                print(f" ❌ Error: {e}")
                stats['lots_failed'] += 1

        # Summary
        print("\n" + "="*70)
        print("SUMMARY")
        print("="*70)
        print(f"Auctions:")
        print(f" - Found: {len(malformed_auctions)}")
        print(f" - Fixed: {stats['auctions_fixed']}")
        print(f" - Failed: {stats['auctions_failed']}")
        print(f"\nLots:")
        print(f" - Found: {len(malformed_lots)}")
        print(f" - Fixed: {stats['lots_fixed']}")
        print(f" - Failed: {stats['lots_failed']}")

        if dry_run:
            print("\n⚠️ DRY RUN - No changes were made to the database")


def main():
|
|
||||||
import argparse
|
|
||||||
|
|
||||||
parser = argparse.ArgumentParser(
|
|
||||||
description="Detect and fix malformed database entries"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
'--db',
|
|
||||||
default=CACHE_DB,
|
|
||||||
help='Path to cache database'
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
'--dry-run',
|
|
||||||
action='store_true',
|
|
||||||
help='Show what would be done without making changes'
|
|
||||||
)
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
print(f"Database: {args.db}")
|
|
||||||
print(f"Dry run: {args.dry_run}\n")
|
|
||||||
|
|
||||||
fixer = MalformedEntryFixer(args.db)
|
|
||||||
fixer.run(dry_run=args.dry_run)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@@ -1,139 +0,0 @@
-#!/usr/bin/env python3
-"""
-Migrate uncompressed cache entries to compressed format
-This script compresses all cache entries where compressed=0
-"""
-
-import sqlite3
-import zlib
-import time
-
-CACHE_DB = "/mnt/okcomputer/output/cache.db"
-
-def migrate_cache():
-    """Compress all uncompressed cache entries"""
-
-    with sqlite3.connect(CACHE_DB) as conn:
-        # Get uncompressed entries
-        cursor = conn.execute(
-            "SELECT url, content FROM cache WHERE compressed = 0 OR compressed IS NULL"
-        )
-        uncompressed = cursor.fetchall()
-
-        if not uncompressed:
-            print("✓ No uncompressed entries found. All cache is already compressed!")
-            return
-
-        print(f"Found {len(uncompressed)} uncompressed cache entries")
-        print("Starting compression...")
-
-        total_original_size = 0
-        total_compressed_size = 0
-        compressed_count = 0
-
-        for url, content in uncompressed:
-            try:
-                # Handle both text and bytes
-                if isinstance(content, str):
-                    content_bytes = content.encode('utf-8')
-                else:
-                    content_bytes = content
-
-                original_size = len(content_bytes)
-
-                # Compress
-                compressed_content = zlib.compress(content_bytes, level=9)
-                compressed_size = len(compressed_content)
-
-                # Update in database
-                conn.execute(
-                    "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
-                    (compressed_content, url)
-                )
-
-                total_original_size += original_size
-                total_compressed_size += compressed_size
-                compressed_count += 1
-
-                if compressed_count % 100 == 0:
-                    conn.commit()
-                    ratio = (1 - total_compressed_size / total_original_size) * 100
-                    print(f" Compressed {compressed_count}/{len(uncompressed)} entries... "
-                          f"({ratio:.1f}% reduction so far)")
-
-            except Exception as e:
-                print(f" ERROR compressing {url}: {e}")
-                continue
-
-        # Final commit
-        conn.commit()
-
-        # Calculate final statistics
-        ratio = (1 - total_compressed_size / total_original_size) * 100 if total_original_size > 0 else 0
-        size_saved_mb = (total_original_size - total_compressed_size) / (1024 * 1024)
-
-        print("\n" + "="*60)
-        print("MIGRATION COMPLETE")
-        print("="*60)
-        print(f"Entries compressed: {compressed_count}")
-        print(f"Original size: {total_original_size / (1024*1024):.2f} MB")
-        print(f"Compressed size: {total_compressed_size / (1024*1024):.2f} MB")
-        print(f"Space saved: {size_saved_mb:.2f} MB")
-        print(f"Compression ratio: {ratio:.1f}%")
-        print("="*60)
-
-def verify_migration():
-    """Verify all entries are compressed"""
-    with sqlite3.connect(CACHE_DB) as conn:
-        cursor = conn.execute(
-            "SELECT COUNT(*) FROM cache WHERE compressed = 0 OR compressed IS NULL"
-        )
-        uncompressed_count = cursor.fetchone()[0]
-
-        cursor = conn.execute("SELECT COUNT(*) FROM cache WHERE compressed = 1")
-        compressed_count = cursor.fetchone()[0]
-
-        print("\nVERIFICATION:")
-        print(f" Compressed entries: {compressed_count}")
-        print(f" Uncompressed entries: {uncompressed_count}")
-
-        if uncompressed_count == 0:
-            print(" ✓ All cache entries are compressed!")
-            return True
-        else:
-            print(" ✗ Some entries are still uncompressed")
-            return False
-
-def get_db_size():
-    """Get current database file size"""
-    import os
-    if os.path.exists(CACHE_DB):
-        size_mb = os.path.getsize(CACHE_DB) / (1024 * 1024)
-        return size_mb
-    return 0
-
-if __name__ == "__main__":
-    print("Cache Compression Migration Tool")
-    print("="*60)
-
-    # Show initial DB size
-    initial_size = get_db_size()
-    print(f"Initial database size: {initial_size:.2f} MB\n")
-
-    # Run migration
-    start_time = time.time()
-    migrate_cache()
-    elapsed = time.time() - start_time
-
-    print(f"\nTime taken: {elapsed:.2f} seconds")
-
-    # Verify
-    verify_migration()
-
-    # Show final DB size
-    final_size = get_db_size()
-    print(f"\nFinal database size: {final_size:.2f} MB")
-    print(f"Database size reduced by: {initial_size - final_size:.2f} MB")
-
-    print("\n✓ Migration complete! You can now run VACUUM to reclaim disk space:")
-    print(" sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'")
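Review note: the removed migration script computes the running reduction percentage inside the loop without the zero-size guard that its final statistics use. The following is a minimal sketch of the same compress-and-flag step, assuming an in-memory SQLite table with the `url`, `content`, and `compressed` columns the script queries. The helper names `compression_ratio` and `compress_pending` and the example URLs are hypothetical, not part of the repository.

```python
import sqlite3
import zlib


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Percent reduction; 0.0 when there is nothing to compare against."""
    if original_size <= 0:
        return 0.0
    return (1 - compressed_size / original_size) * 100


def compress_pending(conn: sqlite3.Connection) -> tuple[int, int]:
    """Compress rows flagged as uncompressed; return (original, compressed) byte totals."""
    rows = conn.execute(
        "SELECT url, content FROM cache WHERE compressed = 0 OR compressed IS NULL"
    ).fetchall()
    total_in = total_out = 0
    for url, content in rows:
        # Cached content may be stored as text or as bytes
        raw = content.encode("utf-8") if isinstance(content, str) else content
        packed = zlib.compress(raw, level=9)
        conn.execute(
            "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
            (packed, url),
        )
        total_in += len(raw)
        total_out += len(packed)
    conn.commit()
    return total_in, total_out


# Hypothetical fixture: one normal page and one empty page
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (url TEXT PRIMARY KEY, content BLOB, compressed INTEGER)")
page = "<html>" + "a" * 1000 + "</html>"
conn.execute("INSERT INTO cache VALUES (?, ?, 0)", ("https://example.test/l/1", page))
conn.execute("INSERT INTO cache VALUES (?, ?, NULL)", ("https://example.test/l/2", ""))
sizes = compress_pending(conn)
print(f"{compression_ratio(*sizes):.1f}% reduction")
```

Keeping the ratio in one guarded helper means the progress message and the final summary cannot disagree about the empty-input case.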
@@ -1,180 +0,0 @@
-#!/usr/bin/env python3
-"""
-Migration script to re-parse cached HTML pages and update database entries.
-Fixes issues with incomplete data extraction from earlier scrapes.
-"""
-import sys
-import sqlite3
-from pathlib import Path
-
-# Add src to path
-sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
-
-from parse import DataParser
-from config import CACHE_DB
-
-
-def reparse_and_update_lots(db_path: str = CACHE_DB, dry_run: bool = False):
-    """
-    Re-parse cached HTML pages and update lot entries in the database.
-
-    This extracts improved data from __NEXT_DATA__ JSON blobs that may have been
-    missed in earlier scraping runs when validation was less strict.
-    """
-    parser = DataParser()
-
-    with sqlite3.connect(db_path) as conn:
-        # Get all cached lot pages
-        cursor = conn.execute("""
-            SELECT url, content
-            FROM cache
-            WHERE url LIKE '%/l/%'
-            ORDER BY timestamp DESC
-        """)
-
-        cached_pages = cursor.fetchall()
-        print(f"Found {len(cached_pages)} cached lot pages to re-parse")
-
-        stats = {
-            'processed': 0,
-            'updated': 0,
-            'skipped': 0,
-            'errors': 0
-        }
-
-        for url, compressed_content in cached_pages:
-            try:
-                # Decompress content
-                import zlib
-                content = zlib.decompress(compressed_content).decode('utf-8')
-
-                # Re-parse using current parser logic
-                parsed_data = parser.parse_page(content, url)
-
-                if not parsed_data or parsed_data.get('type') != 'lot':
-                    stats['skipped'] += 1
-                    continue
-
-                lot_id = parsed_data.get('lot_id', '')
-                if not lot_id:
-                    print(f" ⚠️ No lot_id for {url}")
-                    stats['skipped'] += 1
-                    continue
-
-                # Check if lot exists
-                existing = conn.execute(
-                    "SELECT lot_id FROM lots WHERE lot_id = ?",
-                    (lot_id,)
-                ).fetchone()
-
-                if not existing:
-                    print(f" → New lot: {lot_id}")
-                    # Insert new lot
-                    if not dry_run:
-                        conn.execute("""
-                            INSERT INTO lots
-                            (lot_id, auction_id, url, title, current_bid, bid_count,
-                            closing_time, viewing_time, pickup_date, location,
-                            description, category, scraped_at)
-                            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
-                        """, (
-                            lot_id,
-                            parsed_data.get('auction_id', ''),
-                            url,
-                            parsed_data.get('title', ''),
-                            parsed_data.get('current_bid', ''),
-                            parsed_data.get('bid_count', 0),
-                            parsed_data.get('closing_time', ''),
-                            parsed_data.get('viewing_time', ''),
-                            parsed_data.get('pickup_date', ''),
-                            parsed_data.get('location', ''),
-                            parsed_data.get('description', ''),
-                            parsed_data.get('category', ''),
-                            parsed_data.get('scraped_at', '')
-                        ))
-                    stats['updated'] += 1
-                else:
-                    # Update existing lot with newly parsed data
-                    # Only update fields that are now populated but weren't before
-                    if not dry_run:
-                        conn.execute("""
-                            UPDATE lots SET
-                                auction_id = COALESCE(NULLIF(?, ''), auction_id),
-                                title = COALESCE(NULLIF(?, ''), title),
-                                current_bid = COALESCE(NULLIF(?, ''), current_bid),
-                                bid_count = CASE WHEN ? > 0 THEN ? ELSE bid_count END,
-                                closing_time = COALESCE(NULLIF(?, ''), closing_time),
-                                viewing_time = COALESCE(NULLIF(?, ''), viewing_time),
-                                pickup_date = COALESCE(NULLIF(?, ''), pickup_date),
-                                location = COALESCE(NULLIF(?, ''), location),
-                                description = COALESCE(NULLIF(?, ''), description),
-                                category = COALESCE(NULLIF(?, ''), category)
-                            WHERE lot_id = ?
-                        """, (
-                            parsed_data.get('auction_id', ''),
-                            parsed_data.get('title', ''),
-                            parsed_data.get('current_bid', ''),
-                            parsed_data.get('bid_count', 0),
-                            parsed_data.get('bid_count', 0),
-                            parsed_data.get('closing_time', ''),
-                            parsed_data.get('viewing_time', ''),
-                            parsed_data.get('pickup_date', ''),
-                            parsed_data.get('location', ''),
-                            parsed_data.get('description', ''),
-                            parsed_data.get('category', ''),
-                            lot_id
-                        ))
-                    stats['updated'] += 1
-
-                    print(f" ✓ Updated: {lot_id[:20]}")
-
-                # Update images if they exist
-                images = parsed_data.get('images', [])
-                if images and not dry_run:
-                    for img_url in images:
-                        conn.execute("""
-                            INSERT OR IGNORE INTO images (lot_id, url)
-                            VALUES (?, ?)
-                        """, (lot_id, img_url))
-
-                stats['processed'] += 1
-
-                if stats['processed'] % 100 == 0:
-                    print(f" Progress: {stats['processed']}/{len(cached_pages)}")
-                    if not dry_run:
-                        conn.commit()
-
-            except Exception as e:
-                print(f" ❌ Error processing {url}: {e}")
-                stats['errors'] += 1
-                continue
-
-        if not dry_run:
-            conn.commit()
-
-        print("\n" + "="*60)
-        print("MIGRATION COMPLETE")
-        print("="*60)
-        print(f"Processed: {stats['processed']}")
-        print(f"Updated: {stats['updated']}")
-        print(f"Skipped: {stats['skipped']}")
-        print(f"Errors: {stats['errors']}")
-
-        if dry_run:
-            print("\n⚠️ DRY RUN - No changes were made to the database")
-
-
-if __name__ == "__main__":
-    import argparse
-
-    parser = argparse.ArgumentParser(description="Re-parse and update lot entries from cached HTML")
-    parser.add_argument('--db', default=CACHE_DB, help='Path to cache database')
-    parser.add_argument('--dry-run', action='store_true', help='Show what would be done without making changes')
-
-    args = parser.parse_args()
-
-    print(f"Database: {args.db}")
-    print(f"Dry run: {args.dry_run}")
-    print()
-
-    reparse_and_update_lots(args.db, args.dry_run)
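Review note: the removed re-parse script relies on `COALESCE(NULLIF(?, ''), column)` so that an empty freshly parsed value never overwrites a populated column, and on a `CASE` guard so that a zero bid count does the same. A minimal sketch of that merge rule against an in-memory SQLite database follows. The reduced `lots` schema, the `merge_lot` helper, and the sample values are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced version of the lots table
conn.execute(
    "CREATE TABLE lots (lot_id TEXT PRIMARY KEY, title TEXT, location TEXT, bid_count INTEGER)"
)
conn.execute("INSERT INTO lots VALUES ('A1-1', 'Old title', 'Amsterdam', 3)")


def merge_lot(conn: sqlite3.Connection, lot_id: str, parsed: dict) -> None:
    """Overwrite a column only when the parsed value is non-empty (or > 0 for counts)."""
    bids = parsed.get("bid_count", 0)
    conn.execute(
        """
        UPDATE lots SET
            title = COALESCE(NULLIF(?, ''), title),
            location = COALESCE(NULLIF(?, ''), location),
            bid_count = CASE WHEN ? > 0 THEN ? ELSE bid_count END
        WHERE lot_id = ?
        """,
        (parsed.get("title", ""), parsed.get("location", ""), bids, bids, lot_id),
    )


# New title is applied; empty location and zero bid count keep the stored values
merge_lot(conn, "A1-1", {"title": "New title", "location": "", "bid_count": 0})
result = conn.execute(
    "SELECT title, location, bid_count FROM lots WHERE lot_id = 'A1-1'"
).fetchone()
print(result)
```

The bid count is bound twice because positional `?` placeholders cannot be reused, which is why the original parameter tuple repeats `bid_count`.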
931
src/cache.py
File diff suppressed because it is too large
@@ -15,7 +15,36 @@ if sys.version_info < (3, 10):
 
 # ==================== CONFIGURATION ====================
 BASE_URL = "https://www.troostwijkauctions.com"
-CACHE_DB = "/mnt/okcomputer/output/cache.db"
+POSTGRES_HOST = os.getenv("POSTGRES_HOST", "postgres")
+POSTGRES_DB = os.getenv("POSTGRES_DB", "auctiondb")
+POSTGRES_USER = os.getenv("POSTGRES_USER", "auction")
+POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD", "heel-goed-wachtwoord")
+# Full DSN
+DATABASE_URL = os.getenv(
+    "DATABASE_URL",
+    f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:5432/{POSTGRES_DB}"
+).strip()
+
+# Primary database: PostgreSQL only
+# Override via environment variable DATABASE_URL
+# Example: postgresql://user:pass@host:5432/dbname
+# DATABASE_URL = os.getenv(
+# "DATABASE_URL",
+# # Default provided by ops
+# "postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb",
+# ).strip()
+
+# Database connection pool controls (to avoid creating too many short-lived TCP connections)
+# Environment overrides: SCAEV_DB_POOL_MIN, SCAEV_DB_POOL_MAX, SCAEV_DB_POOL_TIMEOUT
+def _int_env(name: str, default: int) -> int:
+    try:
+        return int(os.getenv(name, str(default)))
+    except Exception:
+        return default
+
+DB_POOL_MIN = _int_env("SCAEV_DB_POOL_MIN", 1)
+DB_POOL_MAX = _int_env("SCAEV_DB_POOL_MAX", 6)
+DB_POOL_TIMEOUT = _int_env("SCAEV_DB_POOL_TIMEOUT", 30) # seconds to wait for a pooled connection
 OUTPUT_DIR = "/mnt/okcomputer/output"
 IMAGES_DIR = "/mnt/okcomputer/output/images"
 RATE_LIMIT_SECONDS = 0.5 # EXACTLY 0.5 seconds between requests
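Review note: the pool settings added above fall back to their defaults when an environment variable is unset or not an integer. The sketch below shows that behaviour in isolation. It assumes the same `SCAEV_DB_POOL_*` variable names as the hunk, and it narrows the caught exception to `ValueError`, which is a variation on the diff's broader `except Exception`, not what the commit does.

```python
import os


def _int_env(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to `default`."""
    try:
        return int(os.getenv(name, str(default)))
    except ValueError:  # the diff catches Exception; narrowed here for illustration
        return default


# Simulated environment: one valid override, one invalid value, one unset
os.environ["SCAEV_DB_POOL_MAX"] = "12"
os.environ["SCAEV_DB_POOL_TIMEOUT"] = "not-a-number"
os.environ.pop("SCAEV_DB_POOL_MIN", None)

pool_min = _int_env("SCAEV_DB_POOL_MIN", 1)
pool_max = _int_env("SCAEV_DB_POOL_MAX", 6)
pool_timeout = _int_env("SCAEV_DB_POOL_TIMEOUT", 30)
print(pool_min, pool_max, pool_timeout)
```

A silent fallback keeps the scraper starting with a bad override, at the cost of hiding the typo; logging the rejected value would be one way to surface it.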
54
src/db.py
Normal file
@@ -0,0 +1,54 @@
+#!/usr/bin/env python3
+"""
+Database scaffolding for future SQLAlchemy 2.x usage.
+
+Notes:
+- The application now uses PostgreSQL exclusively via `config.DATABASE_URL`.
+- This module prepares an engine/session bound to `DATABASE_URL`.
+- Example URL: `postgresql+psycopg://user:pass@host:5432/scaev`
+
+No runtime dependency from the scraper currently imports or uses this module.
+It is present to bootstrap a possible future move to SQLAlchemy 2.x.
+"""
+
+from __future__ import annotations
+
+import os
+from typing import Optional
+
+
+def get_database_url() -> str:
+    url = os.getenv("DATABASE_URL")
+    if not url or not url.strip():
+        raise RuntimeError("DATABASE_URL must be set for PostgreSQL connection")
+    return url.strip()
+
+
+def create_engine_and_session(database_url: str):
+    try:
+        from sqlalchemy import create_engine
+        from sqlalchemy.orm import sessionmaker
+    except Exception as e:
+        raise RuntimeError(
+            "SQLAlchemy is not installed. Add it to requirements.txt to use this module."
+        ) from e
+
+    # Engine tuned for simple use; callers may override
+    engine = create_engine(database_url, pool_pre_ping=True, future=True)
+    SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False, future=True)
+    return engine, SessionLocal
+
+
+def get_sa(session_cached: dict):
+    """Helper to lazily create and cache SQLAlchemy engine/session factory.
+
+    session_cached: dict — a mutable dict, e.g., module-level {}, to store engine and factory
+    """
+    if 'engine' in session_cached and 'SessionLocal' in session_cached:
+        return session_cached['engine'], session_cached['SessionLocal']
+
+    url = get_database_url()
+    engine, SessionLocal = create_engine_and_session(url)
+    session_cached['engine'] = engine
+    session_cached['SessionLocal'] = SessionLocal
+    return engine, SessionLocal
14
src/main.py
@@ -8,7 +8,6 @@ import sys
 import asyncio
 import json
 import csv
-import sqlite3
 from datetime import datetime
 from pathlib import Path
 
@@ -16,6 +15,17 @@ import config
 from cache import CacheManager
 from scraper import TroostwijkScraper
 
+def mask_db_url(url: str) -> str:
+    try:
+        from urllib.parse import urlparse
+        p = urlparse(url)
+        user = p.username or ''
+        host = p.hostname or ''
+        port = f":{p.port}" if p.port else ''
+        return f"{p.scheme}://{user}:***@{host}{port}{p.path or ''}"
+    except Exception:
+        return url
+
 def main():
     """Main execution"""
     # Check for test mode
@@ -34,7 +44,7 @@ def main():
     if config.OFFLINE:
         print("OFFLINE MODE ENABLED — only database and cache will be used (no network)")
     print(f"Rate limit: {config.RATE_LIMIT_SECONDS} seconds BETWEEN EVERY REQUEST")
-    print(f"Cache database: {config.CACHE_DB}")
+    print(f"Database URL: {mask_db_url(config.DATABASE_URL)}")
     print(f"Output directory: {config.OUTPUT_DIR}")
     print(f"Max listing pages: {config.MAX_PAGES}")
     print("=" * 60)
@@ -7,7 +7,6 @@ Runs indefinitely to keep database current with latest Troostwijk data
 import asyncio
 import time
 from datetime import datetime
-import sqlite3
 import config
 from cache import CacheManager
 from scraper import TroostwijkScraper
@@ -82,21 +81,7 @@ class AuctionMonitor:
 
     def _get_stats(self) -> dict:
         """Get current database statistics"""
-        conn = sqlite3.connect(self.scraper.cache.db_path)
-        cursor = conn.cursor()
-
-        cursor.execute("SELECT COUNT(*) FROM auctions")
-        auction_count = cursor.fetchone()[0]
-
-        cursor.execute("SELECT COUNT(*) FROM lots")
-        lot_count = cursor.fetchone()[0]
-
-        conn.close()
-
-        return {
-            'auctions': auction_count,
-            'lots': lot_count
-        }
+        return self.scraper.cache.get_counts()
 
     async def start(self):
         """Start continuous monitoring loop"""
@@ -106,7 +91,7 @@ class AuctionMonitor:
         if config.OFFLINE:
             print("OFFLINE MODE ENABLED — only database and cache will be used (no network)")
         print(f"Poll interval: {self.poll_interval / 60:.0f} minutes")
-        print(f"Cache database: {config.CACHE_DB}")
+        print(f"Database URL: {self._mask_db_url(config.DATABASE_URL)}")
         print(f"Rate limit: {config.RATE_LIMIT_SECONDS}s between requests")
         print("="*60)
         print("\nPress Ctrl+C to stop\n")
@@ -135,6 +120,21 @@ class AuctionMonitor:
             print(f"Last scan: {self.last_run.strftime('%Y-%m-%d %H:%M:%S')}")
             print("\nDatabase remains intact with all collected data")
 
+    @staticmethod
+    def _mask_db_url(url: str) -> str:
+        try:
+            from urllib.parse import urlparse
+            parsed = urlparse(url)
+            if parsed.username:
+                user = parsed.username
+                host = parsed.hostname or ''
+                port = f":{parsed.port}" if parsed.port else ''
+                db = parsed.path or ''
+                return f"{parsed.scheme}://{user}:***@{host}{port}{db}"
+        except Exception:
+            pass
+        return url
+
 def main():
     """Main entry point for monitor"""
     import sys
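Review note: this compare adds two separate URL-masking helpers, `mask_db_url` in `src/main.py` and `AuctionMonitor._mask_db_url` above, and they differ when the URL has no user (the `main.py` variant still emits `:***@`). Both also return the raw URL, password included, if parsing fails. A single shared helper could look like the sketch below. This is an illustration under assumptions, not code from the repository: the placeholder string returned on parse failure is hypothetical and deliberately changes that fallback.

```python
from urllib.parse import urlparse


def mask_db_url(url: str) -> str:
    """Return the URL with any password replaced by ***."""
    try:
        parsed = urlparse(url)
        # Accessing .port raises ValueError when the port is not numeric
        port = f":{parsed.port}" if parsed.port else ""
    except ValueError:
        # Assumption: never echo an unparseable URL, it may contain a password
        return "<unparseable database URL>"
    if not parsed.username and not parsed.password:
        return url  # nothing to mask
    user = parsed.username or ""
    host = parsed.hostname or ""
    return f"{parsed.scheme}://{user}:***@{host}{port}{parsed.path}"


masked = mask_db_url("postgresql://auction:secret@postgres:5432/auctiondb")
unchanged = mask_db_url("postgresql://localhost/auctiondb")
broken = mask_db_url("postgresql://u:p@host:notaport/db")
print(masked)
```

Like both helpers in the diff, this drops any query string from the printed URL, which is acceptable for a startup banner.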
105
src/progress.py
Normal file
@@ -0,0 +1,105 @@
+#!/usr/bin/env python3
+"""
+Lightweight TTY progress reporter for per-lot scraping.
+
+It shows a spinner while work is in progress and records all page/API
+fetches that contributed to the lot analysis, including:
+- URL or source label
+- size in bytes (when available)
+- cache status (cached/real-time/offline/db/intercepted)
+- duration in milliseconds
+
+Intentionally dependency-free and safe to use in async code.
+"""
+
+from __future__ import annotations
+
+import sys
+import time
+import threading
+from dataclasses import dataclass, field
+from typing import List, Optional
+
+
+SPINNER_FRAMES = ["⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"]
+
+
+@dataclass
+class ProgressEvent:
+    kind: str # html | graphql | rest | image | cache | db | intercepted | other
+    label: str # url or description
+    size_bytes: Optional[int]
+    cached: str # "cache", "realtime", "offline", "db", "intercepted"
+    duration_ms: Optional[int]
+
+
+@dataclass
+class ProgressReporter:
+    lot_id: str
+    title: str = ""
+    _events: List[ProgressEvent] = field(default_factory=list)
+    _start_ts: float = field(default_factory=time.time)
+    _stop_ts: Optional[float] = None
+    _spinner_thread: Optional[threading.Thread] = None
+    _stop_flag: bool = False
+    _is_tty: bool = field(default_factory=lambda: sys.stdout.isatty())
+
+    def start(self) -> None:
+        if not self._is_tty:
+            print(f"[LOT {self.lot_id}] ⏳ Analyzing… {self.title[:60]}")
+            return
+
+        def run_spinner():
+            idx = 0
+            while not self._stop_flag:
+                frame = SPINNER_FRAMES[idx % len(SPINNER_FRAMES)]
+                idx += 1
+                summary = f"{len(self._events)} events"
+                line = f"[LOT {self.lot_id}] {frame} {self.title[:60]} · {summary}"
+                # CR without newline to animate
+                sys.stdout.write("\r" + line)
+                sys.stdout.flush()
+                time.sleep(0.09)
+            # Clear the spinner line
+            sys.stdout.write("\r" + " " * 120 + "\r")
+            sys.stdout.flush()
+
+        self._spinner_thread = threading.Thread(target=run_spinner, daemon=True)
+        self._spinner_thread.start()
+
+    def add_event(
+        self,
+        *,
+        kind: str,
+        label: str,
+        size_bytes: Optional[int] = None,
+        cached: str = "realtime",
+        duration_ms: Optional[float] = None,
+    ) -> None:
+        self._events.append(
+            ProgressEvent(
+                kind=kind,
+                label=label,
+                size_bytes=int(size_bytes) if size_bytes is not None else None,
+                cached=cached,
+                duration_ms=int(duration_ms) if duration_ms is not None else None,
+            )
+        )
+
+    def stop(self) -> None:
+        self._stop_ts = time.time()
+        self._stop_flag = True
+        if self._spinner_thread and self._spinner_thread.is_alive():
+            self._spinner_thread.join(timeout=1.0)
+
+        total_ms = int((self._stop_ts - self._start_ts) * 1000)
+        print(f"[LOT {self.lot_id}] ✓ Done in {total_ms} ms — pages/APIs used:")
+        if not self._events:
+            print(" • (none)")
+            return
+
+        # Print events as a compact list
+        for ev in self._events:
+            size = f"{ev.size_bytes} B" if ev.size_bytes is not None else "?"
+            dur = f"{ev.duration_ms} ms" if ev.duration_ms is not None else "?"
+            print(f" • [{ev.kind}] {ev.label} | {size} | {ev.cached} | {dur}")
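Review note: the new reporter stops its spinner thread through a plain boolean that the loop polls between `time.sleep` calls. One common alternative is a `threading.Event`, whose `wait` doubles as the sleep and returns as soon as the event is set. The sketch below illustrates that variant only. The `Spinner` class, its ASCII frames, and the injectable `stream` argument are hypothetical and are not part of `src/progress.py`.

```python
import io
import threading
import time

FRAMES = ["|", "/", "-", "\\"]


class Spinner:
    """Minimal spinner that writes frames to a stream until stopped."""

    def __init__(self, label: str, stream, interval: float = 0.01) -> None:
        self.label = label
        self.stream = stream
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        idx = 0
        while True:
            # Carriage return without newline redraws the same line
            self.stream.write("\r" + FRAMES[idx % len(FRAMES)] + " " + self.label)
            idx += 1
            # wait() sleeps for `interval` but returns True at once when stop() is called
            if self._stop.wait(self.interval):
                break
        self.stream.write("\r")

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join(timeout=1.0)


out = io.StringIO()  # stands in for sys.stdout so the output can be inspected
spinner = Spinner("LOT A1-1", out)
spinner.start()
time.sleep(0.05)
spinner.stop()
```

Passing the stream in also makes the non-TTY path easy to exercise in tests, where the reporter in the diff checks `sys.stdout.isatty()` directly.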
358
src/scraper.py
@@ -3,7 +3,6 @@
 Core scaev module for Scaev Auctions
 """
 import os
-import sqlite3
 import asyncio
 import time
 import random
@@ -29,6 +28,7 @@ from graphql_client import (
 )
 from bid_history_client import fetch_bid_history, parse_bid_history
 from priority import calculate_priority, parse_closing_time
+from progress import ProgressReporter
 
 class TroostwijkScraper:
     """Main scraper class for Troostwijk Auctions"""
@@ -65,13 +65,8 @@ class TroostwijkScraper:
                     content = await response.read()
                     with open(filepath, 'wb') as f:
                         f.write(content)
-
-                    with sqlite3.connect(self.cache.db_path) as conn:
-                        conn.execute("UPDATE images\n"
-                                     "SET local_path = ?, downloaded = 1\n"
-                                     "WHERE lot_id = ? AND url = ?\n"
-                                     "", (str(filepath), lot_id, url))
-                        conn.commit()
+                    # Record download in DB
+                    self.cache.update_image_local_path(lot_id, url, str(filepath))
                     return str(filepath)
 
         except Exception as e:
@@ -96,7 +91,7 @@ class TroostwijkScraper:
             (useful for auction listing pages where we just need HTML structure)
 
         Returns:
-            Dict with 'content' and 'from_cache' keys
+            Dict with: 'content', 'from_cache', 'duration_ms', 'bytes', 'url'
         """
         if use_cache:
             cache_start = time.time()
@@ -104,7 +99,17 @@ class TroostwijkScraper:
             if cached:
                 cache_time = (time.time() - cache_start) * 1000
                 print(f" CACHE HIT: {url} ({cache_time:.0f}ms)")
-                return {'content': cached['content'], 'from_cache': True}
+                try:
+                    byte_len = len(cached['content'].encode('utf-8'))
+                except Exception:
+                    byte_len = None
+                return {
+                    'content': cached['content'],
+                    'from_cache': True,
+                    'duration_ms': int(cache_time),
+                    'bytes': byte_len,
+                    'url': url
+                }
 
         # In OFFLINE mode we never fetch from network
         if self.offline:
@@ -130,7 +135,17 @@ class TroostwijkScraper:
             total_time = time.time() - fetch_start
             self.cache.set(url, content, 200)
             print(f" [Timing: goto={goto_time:.2f}s, total={total_time:.2f}s, mode={wait_strategy}]")
-            return {'content': content, 'from_cache': False}
+            try:
+                byte_len = len(content.encode('utf-8'))
+            except Exception:
+                byte_len = None
+            return {
+                'content': content,
+                'from_cache': False,
+                'duration_ms': int(total_time * 1000),
+                'bytes': byte_len,
+                'url': url
+            }
 
         except Exception as e:
             print(f" ERROR: {e}")
@@ -216,71 +231,54 @@ class TroostwijkScraper:
|
|||||||
if not result:
|
if not result:
|
||||||
# OFFLINE fallback: try to construct page data directly from DB
|
# OFFLINE fallback: try to construct page data directly from DB
|
||||||
if self.offline:
|
if self.offline:
|
||||||
import sqlite3
|
rec = self.cache.get_page_record_by_url(url)
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
if rec:
|
||||||
cur = conn.cursor()
|
if rec.get('type') == 'lot':
|
||||||
# Try lot first
|
page_data = {
|
||||||
cur.execute("SELECT * FROM lots WHERE url = ?", (url,))
|
'type': 'lot',
|
||||||
lot_row = cur.fetchone()
|
'lot_id': rec.get('lot_id'),
|
||||||
if lot_row:
|
'auction_id': rec.get('auction_id'),
|
||||||
# Build a dict using column names
|
'url': rec.get('url') or url,
|
||||||
col_names = [d[0] for d in cur.description]
|
'title': rec.get('title') or '',
|
||||||
lot_dict = dict(zip(col_names, lot_row))
|
'current_bid': rec.get('current_bid') or '',
|
||||||
conn.close()
|
'bid_count': rec.get('bid_count') or 0,
|
||||||
page_data = {
|
'closing_time': rec.get('closing_time') or '',
|
||||||
'type': 'lot',
|
'viewing_time': rec.get('viewing_time') or '',
|
||||||
'lot_id': lot_dict.get('lot_id'),
|
'pickup_date': rec.get('pickup_date') or '',
|
||||||
'auction_id': lot_dict.get('auction_id'),
|
'location': rec.get('location') or '',
|
||||||
'url': lot_dict.get('url') or url,
|
'description': rec.get('description') or '',
|
||||||
'title': lot_dict.get('title') or '',
|
'category': rec.get('category') or '',
|
||||||
'current_bid': lot_dict.get('current_bid') or '',
|
'status': rec.get('status') or '',
|
||||||
'bid_count': lot_dict.get('bid_count') or 0,
|
'brand': rec.get('brand') or '',
|
||||||
'closing_time': lot_dict.get('closing_time') or '',
|
'model': rec.get('model') or '',
|
||||||
'viewing_time': lot_dict.get('viewing_time') or '',
|
'attributes_json': rec.get('attributes_json') or '',
|
||||||
'pickup_date': lot_dict.get('pickup_date') or '',
|
'first_bid_time': rec.get('first_bid_time'),
|
||||||
'location': lot_dict.get('location') or '',
|
'last_bid_time': rec.get('last_bid_time'),
|
||||||
'description': lot_dict.get('description') or '',
|
'bid_velocity': rec.get('bid_velocity'),
|
||||||
'category': lot_dict.get('category') or '',
|
'followers_count': rec.get('followers_count') or 0,
|
||||||
'status': lot_dict.get('status') or '',
|
'estimated_min_price': rec.get('estimated_min_price'),
|
||||||
'brand': lot_dict.get('brand') or '',
|
'estimated_max_price': rec.get('estimated_max_price'),
|
||||||
'model': lot_dict.get('model') or '',
|
'lot_condition': rec.get('lot_condition') or '',
|
||||||
'attributes_json': lot_dict.get('attributes_json') or '',
|
'appearance': rec.get('appearance') or '',
|
||||||
'first_bid_time': lot_dict.get('first_bid_time'),
|
'scraped_at': rec.get('scraped_at') or '',
|
||||||
'last_bid_time': lot_dict.get('last_bid_time'),
|
}
|
||||||
'bid_velocity': lot_dict.get('bid_velocity'),
|
print(" OFFLINE: using DB record for lot")
|
||||||
'followers_count': lot_dict.get('followers_count') or 0,
|
self.visited_lots.add(url)
|
||||||
'estimated_min_price': lot_dict.get('estimated_min_price'),
|
return page_data
|
||||||
'estimated_max_price': lot_dict.get('estimated_max_price'),
|
else:
|
||||||
'lot_condition': lot_dict.get('lot_condition') or '',
|
page_data = {
|
||||||
'appearance': lot_dict.get('appearance') or '',
|
'type': 'auction',
|
||||||
'scraped_at': lot_dict.get('scraped_at') or '',
|
'auction_id': rec.get('auction_id'),
|
||||||
}
|
'url': rec.get('url') or url,
|
||||||
print(" OFFLINE: using DB record for lot")
|
'title': rec.get('title') or '',
|
||||||
self.visited_lots.add(url)
|
'location': rec.get('location') or '',
|
||||||
return page_data
|
'lots_count': rec.get('lots_count') or 0,
|
||||||
|
'first_lot_closing_time': rec.get('first_lot_closing_time') or '',
|
||||||
# Try auction by URL
|
'scraped_at': rec.get('scraped_at') or '',
|
||||||
cur.execute("SELECT * FROM auctions WHERE url = ?", (url,))
|
}
|
||||||
auc_row = cur.fetchone()
|
print(" OFFLINE: using DB record for auction")
|
||||||
if auc_row:
|
self.visited_lots.add(url)
|
||||||
col_names = [d[0] for d in cur.description]
|
return page_data
|
||||||
auc_dict = dict(zip(col_names, auc_row))
|
|
||||||
conn.close()
|
|
||||||
page_data = {
|
|
||||||
'type': 'auction',
|
|
||||||
'auction_id': auc_dict.get('auction_id'),
|
|
||||||
'url': auc_dict.get('url') or url,
|
|
||||||
'title': auc_dict.get('title') or '',
|
|
||||||
'location': auc_dict.get('location') or '',
|
|
||||||
'lots_count': auc_dict.get('lots_count') or 0,
|
|
||||||
'first_lot_closing_time': auc_dict.get('first_lot_closing_time') or '',
|
|
||||||
'scraped_at': auc_dict.get('scraped_at') or '',
|
|
||||||
}
|
|
||||||
print(" OFFLINE: using DB record for auction")
|
|
||||||
self.visited_lots.add(url)
|
|
||||||
return page_data
|
|
||||||
|
|
||||||
conn.close()
|
|
||||||
return None
|
return None
|
||||||
|
|
||||||
content = result['content']
|
content = result['content']
|
||||||
@@ -302,6 +300,18 @@ class TroostwijkScraper:
|
|||||||
print(f" Type: LOT")
|
print(f" Type: LOT")
|
||||||
print(f" Title: {page_data.get('title', 'N/A')[:60]}...")
|
print(f" Title: {page_data.get('title', 'N/A')[:60]}...")
|
||||||
|
|
||||||
|
# TTY progress reporter per lot
|
||||||
|
lot_progress = ProgressReporter(lot_id=page_data.get('lot_id', ''), title=page_data.get('title', ''))
|
||||||
|
lot_progress.start()
|
||||||
|
# Record HTML page fetch
|
||||||
|
lot_progress.add_event(
|
||||||
|
kind='html',
|
||||||
|
label=result.get('url', url),
|
||||||
|
size_bytes=result.get('bytes'),
|
||||||
|
cached='cache' if from_cache else 'realtime',
|
||||||
|
duration_ms=result.get('duration_ms')
|
||||||
|
)
|
||||||
|
|
||||||
# Extract ALL data from __NEXT_DATA__ lot object
|
# Extract ALL data from __NEXT_DATA__ lot object
|
||||||
import json
|
import json
|
||||||
import re
|
import re
|
||||||
@@ -330,7 +340,6 @@ class TroostwijkScraper:
|
|||||||
# Fetch all API data concurrently (or use intercepted/cached data)
|
# Fetch all API data concurrently (or use intercepted/cached data)
|
||||||
lot_id = page_data.get('lot_id')
|
lot_id = page_data.get('lot_id')
|
||||||
auction_id = page_data.get('auction_id')
|
auction_id = page_data.get('auction_id')
|
||||||
import sqlite3
|
|
||||||
|
|
||||||
# Step 1: Check if we intercepted API data during page load
|
# Step 1: Check if we intercepted API data during page load
|
||||||
intercepted_data = None
|
intercepted_data = None
|
||||||
@@ -339,6 +348,13 @@ class TroostwijkScraper:
|
|||||||
try:
|
try:
|
||||||
intercepted_json = self.intercepted_api_data[lot_id]
|
intercepted_json = self.intercepted_api_data[lot_id]
|
||||||
intercepted_data = json.loads(intercepted_json)
|
intercepted_data = json.loads(intercepted_json)
|
||||||
|
lot_progress.add_event(
|
||||||
|
kind='intercepted',
|
||||||
|
label='GraphQL (intercepted)',
|
||||||
|
size_bytes=len(intercepted_json.encode('utf-8')),
|
||||||
|
cached='intercepted',
|
||||||
|
duration_ms=0
|
||||||
|
)
|
||||||
# Store the raw JSON for future offline use
|
# Store the raw JSON for future offline use
|
||||||
page_data['api_data_json'] = intercepted_json
|
page_data['api_data_json'] = intercepted_json
|
||||||
# Extract lot data from intercepted response
|
# Extract lot data from intercepted response
|
||||||
@@ -356,14 +372,7 @@ class TroostwijkScraper:
|
|||||||
pass
|
pass
|
||||||
elif from_cache:
|
elif from_cache:
|
||||||
# Check if we have cached API data in database
|
# Check if we have cached API data in database
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
existing = self.cache.get_lot_api_fields(lot_id)
|
||||||
cursor = conn.cursor()
|
|
||||||
cursor.execute("""
|
|
||||||
SELECT followers_count, estimated_min_price, current_bid, bid_count, closing_time, status
|
|
||||||
FROM lots WHERE lot_id = ?
|
|
||||||
""", (lot_id,))
|
|
||||||
existing = cursor.fetchone()
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
# Data quality check: Must have followers_count AND closing_time to be considered "complete"
|
# Data quality check: Must have followers_count AND closing_time to be considered "complete"
|
||||||
# This prevents using stale records like old "0 bids" entries
|
# This prevents using stale records like old "0 bids" entries
|
||||||
@@ -374,6 +383,13 @@ class TroostwijkScraper:
|
|||||||
|
|
||||||
if is_complete:
|
if is_complete:
|
||||||
print(f" Using cached API data")
|
print(f" Using cached API data")
|
||||||
|
lot_progress.add_event(
|
||||||
|
kind='db',
|
||||||
|
label='lots table (cached api fields)',
|
||||||
|
size_bytes=None,
|
||||||
|
cached='db',
|
||||||
|
duration_ms=0
|
||||||
|
)
|
||||||
page_data['followers_count'] = existing[0]
|
page_data['followers_count'] = existing[0]
|
||||||
page_data['estimated_min_price'] = existing[1]
|
page_data['estimated_min_price'] = existing[1]
|
||||||
page_data['current_bid'] = existing[2] or page_data.get('current_bid', 'No bids')
|
page_data['current_bid'] = existing[2] or page_data.get('current_bid', 'No bids')
|
||||||
@@ -385,9 +401,31 @@ class TroostwijkScraper:
|
|||||||
else:
|
else:
|
||||||
print(f" Fetching lot data from API (concurrent)...")
|
print(f" Fetching lot data from API (concurrent)...")
|
||||||
# Make concurrent API calls
|
# Make concurrent API calls
|
||||||
api_tasks = [fetch_lot_bidding_data(lot_id)]
|
api_tasks = []
|
||||||
|
# Wrap each API call to capture duration and size
|
||||||
|
async def _timed_fetch(name, coro_func, *args, **kwargs):
|
||||||
|
t0 = time.time()
|
||||||
|
data = await coro_func(*args, **kwargs)
|
||||||
|
dt = int((time.time() - t0) * 1000)
|
||||||
|
size_b = None
|
||||||
|
try:
|
||||||
|
if data is not None:
|
||||||
|
import json as _json
|
||||||
|
size_b = len(_json.dumps(data).encode('utf-8'))
|
||||||
|
except Exception:
|
||||||
|
size_b = None
|
||||||
|
lot_progress.add_event(
|
||||||
|
kind='graphql',
|
||||||
|
label=name,
|
||||||
|
size_bytes=size_b,
|
||||||
|
cached='realtime',
|
||||||
|
duration_ms=dt
|
||||||
|
)
|
||||||
|
return data
|
||||||
|
|
||||||
|
api_tasks.append(_timed_fetch('GraphQL lotDetails', fetch_lot_bidding_data, lot_id))
|
||||||
if auction_id:
|
if auction_id:
|
||||||
api_tasks.append(fetch_auction_data(auction_id))
|
api_tasks.append(_timed_fetch('GraphQL auction', fetch_auction_data, auction_id))
|
||||||
results = await asyncio.gather(*api_tasks, return_exceptions=True)
|
results = await asyncio.gather(*api_tasks, return_exceptions=True)
|
||||||
bidding_data = results[0] if results and not isinstance(results[0], Exception) else None
|
bidding_data = results[0] if results and not isinstance(results[0], Exception) else None
|
||||||
bid_history_data = None # Will fetch after we have lot_uuid
|
bid_history_data = None # Will fetch after we have lot_uuid
|
||||||
@@ -395,32 +433,90 @@ class TroostwijkScraper:
|
|||||||
# Fresh page fetch - make concurrent API calls for all data
|
# Fresh page fetch - make concurrent API calls for all data
|
||||||
if not self.offline:
|
if not self.offline:
|
||||||
print(f" Fetching lot data from API (concurrent)...")
|
print(f" Fetching lot data from API (concurrent)...")
|
||||||
api_tasks = [fetch_lot_bidding_data(lot_id)]
|
api_tasks = []
|
||||||
task_map = {'bidding': 0} # Track which index corresponds to which task
|
task_map = {'bidding': 0} # Track which index corresponds to which task
|
||||||
|
|
||||||
# Add auction data fetch if we need viewing/pickup times
|
# Add auction data fetch if we need viewing/pickup times
|
||||||
if auction_id:
|
if auction_id:
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
vt, pd = self.cache.get_lot_times(lot_id)
|
||||||
cursor = conn.cursor()
|
has_times = vt or pd
|
||||||
cursor.execute("""
|
|
||||||
SELECT viewing_time, pickup_date FROM lots WHERE lot_id = ?
|
|
||||||
""", (lot_id,))
|
|
||||||
times = cursor.fetchone()
|
|
||||||
conn.close()
|
|
||||||
has_times = times and (times[0] or times[1])
|
|
||||||
|
|
||||||
if not has_times:
|
if not has_times:
|
||||||
task_map['auction'] = len(api_tasks)
|
task_map['auction'] = len(api_tasks)
|
||||||
api_tasks.append(fetch_auction_data(auction_id))
|
async def fetch_auction_wrapped():
|
||||||
|
t0 = time.time()
|
||||||
|
data = await fetch_auction_data(auction_id)
|
||||||
|
dt = int((time.time() - t0) * 1000)
|
||||||
|
size_b = None
|
||||||
|
try:
|
||||||
|
if data is not None:
|
||||||
|
import json as _json
|
||||||
|
size_b = len(_json.dumps(data).encode('utf-8'))
|
||||||
|
except Exception:
|
||||||
|
size_b = None
|
||||||
|
lot_progress.add_event(
|
||||||
|
kind='graphql',
|
||||||
|
label='GraphQL auction',
|
||||||
|
size_bytes=size_b,
|
||||||
|
cached='realtime',
|
||||||
|
duration_ms=dt
|
||||||
|
)
|
||||||
|
return data
|
||||||
|
api_tasks.append(fetch_auction_wrapped())
|
||||||
|
|
||||||
# Add bid history fetch if we have lot_uuid and expect bids
|
# Add bid history fetch if we have lot_uuid and expect bids
|
||||||
if lot_uuid:
|
if lot_uuid:
|
||||||
task_map['bid_history'] = len(api_tasks)
|
task_map['bid_history'] = len(api_tasks)
|
||||||
api_tasks.append(fetch_bid_history(lot_uuid))
|
async def fetch_bid_history_wrapped():
|
||||||
|
t0 = time.time()
|
||||||
|
data = await fetch_bid_history(lot_uuid)
|
||||||
|
dt = int((time.time() - t0) * 1000)
|
||||||
|
size_b = None
|
||||||
|
try:
|
||||||
|
if data is not None:
|
||||||
|
import json as _json
|
||||||
|
size_b = len(_json.dumps(data).encode('utf-8'))
|
||||||
|
except Exception:
|
||||||
|
size_b = None
|
||||||
|
lot_progress.add_event(
|
||||||
|
kind='rest',
|
||||||
|
label='REST bid history',
|
||||||
|
size_bytes=size_b,
|
||||||
|
cached='realtime',
|
||||||
|
duration_ms=dt
|
||||||
|
)
|
||||||
|
return data
|
||||||
|
api_tasks.append(fetch_bid_history_wrapped())
|
||||||
|
|
||||||
# Execute all API calls concurrently
|
# Execute all API calls concurrently
|
||||||
|
# Always include the bidding data as first task
|
||||||
|
async def fetch_bidding_wrapped():
|
||||||
|
t0 = time.time()
|
||||||
|
data = await fetch_lot_bidding_data(lot_id)
|
||||||
|
dt = int((time.time() - t0) * 1000)
|
||||||
|
size_b = None
|
||||||
|
try:
|
||||||
|
if data is not None:
|
||||||
|
import json as _json
|
||||||
|
size_b = len(_json.dumps(data).encode('utf-8'))
|
||||||
|
except Exception:
|
||||||
|
size_b = None
|
||||||
|
lot_progress.add_event(
|
||||||
|
kind='graphql',
|
||||||
|
label='GraphQL lotDetails',
|
||||||
|
size_bytes=size_b,
|
||||||
|
cached='realtime',
|
||||||
|
duration_ms=dt
|
||||||
|
)
|
||||||
|
return data
|
||||||
|
|
||||||
|
api_tasks.insert(0, fetch_bidding_wrapped())
|
||||||
|
# Adjust task_map indexes
|
||||||
|
for k in list(task_map.keys()):
|
||||||
|
task_map[k] += 1 if k != 'bidding' else 0
|
||||||
|
|
||||||
results = await asyncio.gather(*api_tasks, return_exceptions=True)
|
results = await asyncio.gather(*api_tasks, return_exceptions=True)
|
||||||
bidding_data = results[task_map['bidding']] if results and not isinstance(results[task_map['bidding']], Exception) else None
|
bidding_data = results[0] if results and not isinstance(results[0], Exception) else None
|
||||||
|
|
||||||
# Store raw API JSON for offline replay
|
# Store raw API JSON for offline replay
|
||||||
if bidding_data:
|
if bidding_data:
|
||||||
@@ -538,14 +634,7 @@ class TroostwijkScraper:
|
|||||||
self.cache.save_bid_history(lot_id, bid_data['bid_records'])
|
self.cache.save_bid_history(lot_id, bid_data['bid_records'])
|
||||||
elif from_cache and page_data.get('bid_count', 0) > 0:
|
elif from_cache and page_data.get('bid_count', 0) > 0:
|
||||||
# Check if cached bid history exists
|
# Check if cached bid history exists
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
if self.cache.has_bid_history(lot_id):
|
||||||
cursor = conn.cursor()
|
|
||||||
cursor.execute("""
|
|
||||||
SELECT COUNT(*) FROM bid_history WHERE lot_id = ?
|
|
||||||
""", (lot_id,))
|
|
||||||
has_history = cursor.fetchone()[0] > 0
|
|
||||||
conn.close()
|
|
||||||
if has_history:
|
|
||||||
print(f" Bid history cached")
|
print(f" Bid history cached")
|
||||||
else:
|
else:
|
||||||
print(f" Bid: {page_data.get('current_bid', 'N/A')} (from HTML)")
|
print(f" Bid: {page_data.get('current_bid', 'N/A')} (from HTML)")
|
||||||
@@ -571,15 +660,7 @@ class TroostwijkScraper:
|
|||||||
|
|
||||||
if self.download_images:
|
if self.download_images:
|
||||||
# Check which images are already downloaded
|
# Check which images are already downloaded
|
||||||
import sqlite3
|
already_downloaded = set(self.cache.get_downloaded_image_urls(page_data['lot_id']))
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
|
||||||
cursor = conn.cursor()
|
|
||||||
cursor.execute("""
|
|
||||||
SELECT url FROM images
|
|
||||||
WHERE lot_id = ? AND downloaded = 1
|
|
||||||
""", (page_data['lot_id'],))
|
|
||||||
already_downloaded = {row[0] for row in cursor.fetchall()}
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
# Only download missing images
|
# Only download missing images
|
||||||
images_to_download = [
|
images_to_download = [
|
||||||
@@ -628,6 +709,12 @@ class TroostwijkScraper:
|
|||||||
else:
|
else:
|
||||||
print(f" All {len(images)} images already cached")
|
print(f" All {len(images)} images already cached")
|
||||||
|
|
||||||
|
# Stop and print progress summary for the lot
|
||||||
|
try:
|
||||||
|
lot_progress.stop()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
return page_data
|
return page_data
|
||||||
|
|
||||||
def _prioritize_lots(self, lot_urls: List[str]) -> List[Tuple[int, str, str]]:
|
def _prioritize_lots(self, lot_urls: List[str]) -> List[Tuple[int, str, str]]:
|
||||||
@@ -636,25 +723,15 @@ class TroostwijkScraper:
|
|||||||
|
|
||||||
Returns list of (priority, url, description) tuples sorted by priority (highest first)
|
Returns list of (priority, url, description) tuples sorted by priority (highest first)
|
||||||
"""
|
"""
|
||||||
import sqlite3
|
|
||||||
|
|
||||||
prioritized = []
|
prioritized = []
|
||||||
current_time = int(time.time())
|
current_time = int(time.time())
|
||||||
|
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
|
||||||
cursor = conn.cursor()
|
|
||||||
|
|
||||||
for url in lot_urls:
|
for url in lot_urls:
|
||||||
# Extract lot_id from URL
|
# Extract lot_id from URL
|
||||||
lot_id = self.parser.extract_lot_id(url)
|
lot_id = self.parser.extract_lot_id(url)
|
||||||
|
|
||||||
# Try to get existing data from database
|
# Try to get existing data from database
|
||||||
cursor.execute("""
|
row = self.cache.get_lot_priority_info(lot_id, url)
|
||||||
SELECT closing_time, scraped_at, scrape_priority, next_scrape_at
|
|
||||||
FROM lots WHERE lot_id = ? OR url = ?
|
|
||||||
""", (lot_id, url))
|
|
||||||
|
|
||||||
row = cursor.fetchone()
|
|
||||||
|
|
||||||
if row:
|
if row:
|
||||||
closing_time, scraped_at, existing_priority, next_scrape_at = row
|
closing_time, scraped_at, existing_priority, next_scrape_at = row
|
||||||
@@ -694,8 +771,6 @@ class TroostwijkScraper:
|
|||||||
|
|
||||||
prioritized.append((priority, url, desc))
|
prioritized.append((priority, url, desc))
|
||||||
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
# Sort by priority (highest first)
|
# Sort by priority (highest first)
|
||||||
prioritized.sort(key=lambda x: x[0], reverse=True)
|
prioritized.sort(key=lambda x: x[0], reverse=True)
|
||||||
|
|
||||||
@@ -706,14 +781,9 @@ class TroostwijkScraper:
|
|||||||
if self.offline:
|
if self.offline:
|
||||||
print("Launching OFFLINE crawl (no network requests)")
|
print("Launching OFFLINE crawl (no network requests)")
|
||||||
# Gather URLs from database
|
# Gather URLs from database
|
||||||
import sqlite3
|
urls = self.cache.get_distinct_urls()
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
auction_urls = urls['auctions']
|
||||||
cur = conn.cursor()
|
lot_urls = urls['lots']
|
||||||
cur.execute("SELECT DISTINCT url FROM auctions")
|
|
||||||
auction_urls = [r[0] for r in cur.fetchall() if r and r[0]]
|
|
||||||
cur.execute("SELECT DISTINCT url FROM lots")
|
|
||||||
lot_urls = [r[0] for r in cur.fetchall() if r and r[0]]
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
print(f" OFFLINE: {len(auction_urls)} auctions and {len(lot_urls)} lots in DB")
|
print(f" OFFLINE: {len(auction_urls)} auctions and {len(lot_urls)} lots in DB")
|
||||||
|
|
||||||
@@ -933,23 +1003,17 @@ class TroostwijkScraper:
|
|||||||
|
|
||||||
def export_to_files(self) -> Dict[str, str]:
|
def export_to_files(self) -> Dict[str, str]:
|
||||||
"""Export database to CSV/JSON files"""
|
"""Export database to CSV/JSON files"""
|
||||||
import sqlite3
|
|
||||||
import json
|
import json
|
||||||
import csv
|
import csv
|
||||||
from datetime import datetime
|
from datetime import datetime
|
||||||
|
|
||||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
||||||
output_dir = os.path.dirname(self.cache.db_path)
|
from config import OUTPUT_DIR as output_dir
|
||||||
|
|
||||||
conn = sqlite3.connect(self.cache.db_path)
|
|
||||||
conn.row_factory = sqlite3.Row
|
|
||||||
cursor = conn.cursor()
|
|
||||||
|
|
||||||
files = {}
|
files = {}
|
||||||
|
|
||||||
# Export auctions
|
# Export auctions
|
||||||
cursor.execute("SELECT * FROM auctions")
|
auctions = self.cache.fetch_all('auctions')
|
||||||
auctions = [dict(row) for row in cursor.fetchall()]
|
|
||||||
|
|
||||||
auctions_csv = os.path.join(output_dir, f'auctions_{timestamp}.csv')
|
auctions_csv = os.path.join(output_dir, f'auctions_{timestamp}.csv')
|
||||||
auctions_json = os.path.join(output_dir, f'auctions_{timestamp}.json')
|
auctions_json = os.path.join(output_dir, f'auctions_{timestamp}.json')
|
||||||
@@ -968,8 +1032,7 @@ class TroostwijkScraper:
|
|||||||
print(f" Exported {len(auctions)} auctions")
|
print(f" Exported {len(auctions)} auctions")
|
||||||
|
|
||||||
# Export lots
|
# Export lots
|
||||||
cursor.execute("SELECT * FROM lots")
|
lots = self.cache.fetch_all('lots')
|
||||||
lots = [dict(row) for row in cursor.fetchall()]
|
|
||||||
|
|
||||||
lots_csv = os.path.join(output_dir, f'lots_{timestamp}.csv')
|
lots_csv = os.path.join(output_dir, f'lots_{timestamp}.csv')
|
||||||
lots_json = os.path.join(output_dir, f'lots_{timestamp}.json')
|
lots_json = os.path.join(output_dir, f'lots_{timestamp}.json')
|
||||||
@@ -987,5 +1050,4 @@ class TroostwijkScraper:
|
|||||||
files['lots_json'] = lots_json
|
files['lots_json'] = lots_json
|
||||||
print(f" Exported {len(lots)} lots")
|
print(f" Exported {len(lots)} lots")
|
||||||
|
|
||||||
conn.close()
|
|
||||||
return files
|
return files
|
||||||
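Review note on the scraper hunks above: `_timed_fetch`, `fetch_auction_wrapped`, `fetch_bid_history_wrapped` and `fetch_bidding_wrapped` all repeat the same measure, size and report steps. The sketch below shows how one shared helper could cover all four. It is an illustration under assumptions, not the code in this commit: `RecordingReporter`, `fake_fetch` and `demo` are hypothetical stand-ins, only the `add_event` keyword names are taken from the diff, and `time.perf_counter()` is swapped in for the diff's `time.time()` because it is monotonic.

```python
import asyncio
import json
import time


class RecordingReporter:
    """Hypothetical stand-in for ProgressReporter; it only stores events."""

    def __init__(self):
        self.events = []

    def add_event(self, kind, label, size_bytes, cached, duration_ms):
        self.events.append({
            'kind': kind,
            'label': label,
            'size_bytes': size_bytes,
            'cached': cached,
            'duration_ms': duration_ms,
        })


async def timed_fetch(reporter, kind, label, coro_func, *args, **kwargs):
    """Await coro_func(*args, **kwargs), then report its duration and JSON size."""
    t0 = time.perf_counter()
    data = await coro_func(*args, **kwargs)
    duration_ms = int((time.perf_counter() - t0) * 1000)

    # Size is the UTF-8 length of the JSON dump; None if the data is
    # missing or cannot be serialized.
    size_bytes = None
    if data is not None:
        try:
            size_bytes = len(json.dumps(data).encode('utf-8'))
        except (TypeError, ValueError):
            size_bytes = None

    reporter.add_event(
        kind=kind,
        label=label,
        size_bytes=size_bytes,
        cached='realtime',
        duration_ms=duration_ms,
    )
    return data


async def fake_fetch(lot_id):
    """Hypothetical fetcher standing in for the real API calls."""
    await asyncio.sleep(0)
    return {'lot_id': lot_id, 'bids': 3}


async def demo():
    reporter = RecordingReporter()
    results = await asyncio.gather(
        timed_fetch(reporter, 'graphql', 'GraphQL lotDetails', fake_fetch, 'A1-1'),
        timed_fetch(reporter, 'rest', 'REST bid history', fake_fetch, 'A1-2'),
        return_exceptions=True,
    )
    return results, reporter.events
```

Calling `asyncio.run(demo())` returns the two payloads plus one recorded event per call. Passing `kind` and `label` as arguments is what would let the REST and GraphQL call sites share the helper.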
10
src/test.py
@@ -4,7 +4,6 @@ Test module for debugging extraction patterns
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
import sys
|
import sys
|
||||||
import sqlite3
|
|
||||||
import time
|
import time
|
||||||
import re
|
import re
|
||||||
import json
|
import json
|
||||||
@@ -27,10 +26,11 @@ def test_extraction(
|
|||||||
if not cached:
|
if not cached:
|
||||||
print(f"ERROR: URL not found in cache: {test_url}")
|
print(f"ERROR: URL not found in cache: {test_url}")
|
||||||
print(f"\nAvailable cached URLs:")
|
print(f"\nAvailable cached URLs:")
|
||||||
with sqlite3.connect(config.CACHE_DB) as conn:
|
try:
|
||||||
cursor = conn.execute("SELECT url FROM cache ORDER BY timestamp DESC LIMIT 10")
|
for url in scraper.cache.get_recent_cached_urls(limit=10):
|
||||||
for row in cursor.fetchall():
|
print(f" - {url}")
|
||||||
print(f" - {row[0]}")
|
except Exception as e:
|
||||||
|
print(f" (failed to list recent cached URLs: {e})")
|
||||||
return
|
return
|
||||||
|
|
||||||
content = cached['content']
|
content = cached['content']
|
||||||
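Review note on the `src/test.py` hunk above: the inline `sqlite3` query is replaced by `scraper.cache.get_recent_cached_urls(limit=10)`, but the body of that method is not part of this diff. A minimal sketch of what such a helper might look like follows. It assumes a `cache` table with `url` and `timestamp` columns, as the removed query implies. `CacheSketch`, `demo`, the `content` column and the example URLs are hypothetical.

```python
import os
import sqlite3
import tempfile
from typing import List


class CacheSketch:
    """Hypothetical, minimal stand-in for the project's CacheManager."""

    def __init__(self, db_path: str):
        self.db_path = db_path

    def get_recent_cached_urls(self, limit: int = 10) -> List[str]:
        """Return up to `limit` cached URLs, most recently stored first."""
        conn = sqlite3.connect(self.db_path)
        try:
            rows = conn.execute(
                "SELECT url FROM cache ORDER BY timestamp DESC LIMIT ?",
                (limit,),
            ).fetchall()
        finally:
            # Close explicitly: sqlite3's context manager only commits or
            # rolls back, it does not close the connection.
            conn.close()
        return [row[0] for row in rows]


def demo() -> List[str]:
    """Build a throwaway database and list its two newest URLs."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "cache.db")
        conn = sqlite3.connect(db_path)
        # Assumed schema; only `url` and `timestamp` are evidenced by the diff.
        conn.execute(
            "CREATE TABLE cache (url TEXT PRIMARY KEY, content TEXT, timestamp REAL)"
        )
        conn.executemany(
            "INSERT INTO cache VALUES (?, ?, ?)",
            [
                ("https://example.com/a", "<html>a</html>", 1.0),
                ("https://example.com/b", "<html>b</html>", 2.0),
                ("https://example.com/c", "<html>c</html>", 3.0),
            ],
        )
        conn.commit()
        conn.close()
        return CacheSketch(db_path).get_recent_cached_urls(limit=2)
```

Keeping the SQL behind a cache method, as the commit does, means callers such as `test_extraction` no longer need to import `sqlite3` or know the table layout.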
|
|||||||
@@ -1,303 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Test cache behavior - verify page is only fetched once and data persists offline
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
import asyncio
|
|
||||||
import sqlite3
|
|
||||||
import time
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# Add src to path
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
|
|
||||||
|
|
||||||
from cache import CacheManager
|
|
||||||
from scraper import TroostwijkScraper
|
|
||||||
import config
|
|
||||||
|
|
||||||
|
|
||||||
class TestCacheBehavior:
|
|
||||||
"""Test suite for cache and offline functionality"""
|
|
||||||
|
|
||||||
def __init__(self):
|
|
||||||
self.test_db = "test_cache.db"
|
|
||||||
self.original_db = config.CACHE_DB
|
|
||||||
self.cache = None
|
|
||||||
self.scraper = None
|
|
||||||
|
|
||||||
def setup(self):
|
|
||||||
"""Setup test environment"""
|
|
||||||
print("\n" + "="*60)
|
|
||||||
print("TEST SETUP")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
# Use test database
|
|
||||||
config.CACHE_DB = self.test_db
|
|
||||||
|
|
||||||
# Ensure offline mode is disabled for tests
|
|
||||||
config.OFFLINE = False
|
|
||||||
|
|
||||||
# Clean up old test database
|
|
||||||
if os.path.exists(self.test_db):
|
|
||||||
os.remove(self.test_db)
|
|
||||||
print(f" * Removed old test database")
|
|
||||||
|
|
||||||
# Initialize cache and scraper
|
|
||||||
self.cache = CacheManager()
|
|
||||||
self.scraper = TroostwijkScraper()
|
|
||||||
self.scraper.offline = False # Explicitly disable offline mode
|
|
||||||
|
|
||||||
print(f" * Created test database: {self.test_db}")
|
|
||||||
print(f" * Initialized cache and scraper")
|
|
||||||
print(f" * Offline mode: DISABLED")
|
|
||||||
|
|
||||||
def teardown(self):
|
|
||||||
"""Cleanup test environment"""
|
|
||||||
print("\n" + "="*60)
|
|
||||||
print("TEST TEARDOWN")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
# Restore original database path
|
|
||||||
config.CACHE_DB = self.original_db
|
|
||||||
|
|
||||||
# Keep test database for inspection
|
|
||||||
print(f" * Test database preserved: {self.test_db}")
|
|
||||||
print(f" * Restored original database path")
|
|
||||||
|
|
||||||
async def test_page_fetched_once(self):
|
|
||||||
"""Test that a page is only fetched from network once"""
|
|
||||||
print("\n" + "="*60)
|
|
||||||
print("TEST 1: Page Fetched Only Once")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
# Pick a real lot URL to test with
|
|
||||||
test_url = "https://www.troostwijkauctions.com/l/bmw-x5-xdrive40d-high-executive-m-sport-a8-286pk-2019-A1-26955-7"
|
|
||||||
|
|
||||||
print(f"\nTest URL: {test_url}")
|
|
||||||
|
|
||||||
# First visit - should fetch from network
|
|
||||||
print("\n--- FIRST VISIT (should fetch from network) ---")
|
|
||||||
start_time = time.time()
|
|
||||||
|
|
||||||
async with asyncio.timeout(60): # 60 second timeout
|
|
||||||
page_data_1 = await self._scrape_single_page(test_url)
|
|
||||||
|
|
||||||
first_visit_time = time.time() - start_time
|
|
||||||
|
|
||||||
if not page_data_1:
|
|
||||||
print(" [FAIL] First visit returned no data")
|
|
||||||
return False
|
|
||||||
|
|
||||||
print(f" [OK] First visit completed in {first_visit_time:.2f}s")
|
|
||||||
print(f" [OK] Got lot data: {page_data_1.get('title', 'N/A')[:60]}...")
|
|
||||||
|
|
||||||
# Check closing time was captured
|
|
||||||
closing_time_1 = page_data_1.get('closing_time')
|
|
||||||
print(f" [OK] Closing time: {closing_time_1}")
|
|
||||||
|
|
||||||
# Second visit - should use cache
|
|
||||||
print("\n--- SECOND VISIT (should use cache) ---")
|
|
||||||
start_time = time.time()
|
|
||||||
|
|
||||||
async with asyncio.timeout(30): # Should be much faster
|
|
||||||
page_data_2 = await self._scrape_single_page(test_url)
|
|
||||||
|
|
||||||
second_visit_time = time.time() - start_time
|
|
||||||
|
|
||||||
if not page_data_2:
|
|
||||||
print(" [FAIL] Second visit returned no data")
|
|
||||||
return False
|
|
||||||
|
|
||||||
print(f" [OK] Second visit completed in {second_visit_time:.2f}s")
|
|
||||||
|
|
||||||
# Verify data matches
|
|
||||||
if page_data_1.get('lot_id') != page_data_2.get('lot_id'):
|
|
||||||
print(f" [FAIL] Lot IDs don't match")
|
|
||||||
return False
|
|
||||||
|
|
||||||
closing_time_2 = page_data_2.get('closing_time')
|
|
||||||
print(f" [OK] Closing time: {closing_time_2}")
|
|
||||||
|
|
||||||
if closing_time_1 != closing_time_2:
|
|
||||||
print(f" [FAIL] Closing times don't match!")
|
|
||||||
print(f" First: {closing_time_1}")
|
|
||||||
print(f" Second: {closing_time_2}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Verify second visit was significantly faster (used cache)
|
|
||||||
if second_visit_time >= first_visit_time * 0.5:
|
|
||||||
print(f" [WARN] Second visit not significantly faster")
|
|
||||||
print(f" First: {first_visit_time:.2f}s")
|
|
||||||
print(f" Second: {second_visit_time:.2f}s")
|
|
||||||
else:
|
|
||||||
print(f" [OK] Second visit was {(first_visit_time / second_visit_time):.1f}x faster (cache working!)")
|
|
||||||
|
|
||||||
# Verify resource cache has entries
|
|
||||||
conn = sqlite3.connect(self.test_db)
|
|
||||||
cursor = conn.execute("SELECT COUNT(*) FROM resource_cache")
|
|
||||||
resource_count = cursor.fetchone()[0]
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
print(f" [OK] Cached {resource_count} resources")
|
|
||||||
|
|
||||||
print("\n[PASS] TEST 1 PASSED: Page fetched only once, data persists")
|
|
||||||
return True
|
|
||||||
|
|
||||||
async def test_offline_mode(self):
|
|
||||||
"""Test that offline mode works with cached data"""
|
|
||||||
print("\n" + "="*60)
|
|
||||||
print("TEST 2: Offline Mode with Cached Data")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
# Use the same URL from test 1 (should be cached)
|
|
||||||
test_url = "https://www.troostwijkauctions.com/l/bmw-x5-xdrive40d-high-executive-m-sport-a8-286pk-2019-A1-26955-7"
|
|
||||||
|
|
||||||
# Enable offline mode
|
|
||||||
original_offline = config.OFFLINE
|
|
||||||
config.OFFLINE = True
|
|
||||||
self.scraper.offline = True
|
|
||||||
|
|
||||||
print(f"\nTest URL: {test_url}")
|
|
||||||
print(" * Offline mode: ENABLED")
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Try to scrape in offline mode
|
|
||||||
print("\n--- OFFLINE SCRAPE (should use DB/cache only) ---")
|
|
||||||
start_time = time.time()
|
|
||||||
|
|
||||||
async with asyncio.timeout(30):
|
|
||||||
page_data = await self._scrape_single_page(test_url)
|
|
||||||
|
|
||||||
            offline_time = time.time() - start_time

            if not page_data:
                print(" [FAIL] Offline mode returned no data")
                return False

            print(f" [OK] Offline scrape completed in {offline_time:.2f}s")
            print(f" [OK] Got lot data: {page_data.get('title', 'N/A')[:60]}...")

            # Check closing time is available
            closing_time = page_data.get('closing_time')
            if not closing_time:
                print(f" [FAIL] No closing time in offline mode")
                return False

            print(f" [OK] Closing time preserved: {closing_time}")

            # Verify essential fields are present
            essential_fields = ['lot_id', 'title', 'url', 'location']
            missing_fields = [f for f in essential_fields if not page_data.get(f)]

            if missing_fields:
                print(f" [FAIL] Missing essential fields: {missing_fields}")
                return False

            print(f" [OK] All essential fields present")

            # Check database has the lot
            conn = sqlite3.connect(self.test_db)
            cursor = conn.execute("SELECT closing_time FROM lots WHERE url = ?", (test_url,))
            row = cursor.fetchone()
            conn.close()

            if not row:
                print(f" [FAIL] Lot not found in database")
                return False

            db_closing_time = row[0]
            print(f" [OK] Database has closing time: {db_closing_time}")

            if db_closing_time != closing_time:
                print(f" [FAIL] Closing time mismatch")
                print(f" Scraped: {closing_time}")
                print(f" Database: {db_closing_time}")
                return False

            print("\n[PASS] TEST 2 PASSED: Offline mode works, closing time preserved")
            return True

        finally:
            # Restore offline mode
            config.OFFLINE = original_offline
            self.scraper.offline = original_offline

    async def _scrape_single_page(self, url):
        """Helper to scrape a single page"""
        from playwright.async_api import async_playwright

        if config.OFFLINE or self.scraper.offline:
            # Offline mode - use crawl_page directly
            return await self.scraper.crawl_page(page=None, url=url)

        # Online mode - need browser
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()

            try:
                result = await self.scraper.crawl_page(page, url)
                return result
            finally:
                await browser.close()

    async def run_all_tests(self):
        """Run all tests"""
        print("\n" + "="*70)
        print("CACHE BEHAVIOR TEST SUITE")
        print("="*70)

        self.setup()

        results = []

        try:
            # Test 1: Page fetched once
            result1 = await self.test_page_fetched_once()
            results.append(("Page Fetched Once", result1))

            # Test 2: Offline mode
            result2 = await self.test_offline_mode()
            results.append(("Offline Mode", result2))

        except Exception as e:
            print(f"\n[ERROR] TEST SUITE ERROR: {e}")
            import traceback
            traceback.print_exc()

        finally:
            self.teardown()

        # Print summary
        print("\n" + "="*70)
        print("TEST SUMMARY")
        print("="*70)

        all_passed = True
        for test_name, passed in results:
            status = "[PASS]" if passed else "[FAIL]"
            print(f" {status}: {test_name}")
            if not passed:
                all_passed = False

        print("="*70)

        if all_passed:
            print("\n*** ALL TESTS PASSED! ***")
            return 0
        else:
            print("\n*** SOME TESTS FAILED ***")
            return 1


async def main():
    """Run tests"""
    tester = TestCacheBehavior()
    exit_code = await tester.run_all_tests()
    sys.exit(exit_code)


if __name__ == "__main__":
    asyncio.run(main())
@@ -1,51 +0,0 @@
#!/usr/bin/env python3
import sys
import os
parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
sys.path.insert(0, parent_dir)
sys.path.insert(0, os.path.join(parent_dir, 'src'))

import asyncio
from scraper import TroostwijkScraper
import config
import os

async def test():
    # Force online mode
    os.environ['SCAEV_OFFLINE'] = '0'
    config.OFFLINE = False

    scraper = TroostwijkScraper()
    scraper.offline = False

    from playwright.async_api import async_playwright
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        url = "https://www.troostwijkauctions.com/l/used-dometic-seastar-tfxchx8641p-top-mount-engine-control-liver-A1-39684-12"

        # Add debug logging to parser
        original_parse = scraper.parser.parse_page
        def debug_parse(content, url):
            result = original_parse(content, url)
            if result:
                print(f"PARSER OUTPUT:")
                print(f" description: {result.get('description', 'NONE')[:100] if result.get('description') else 'EMPTY'}")
                print(f" closing_time: {result.get('closing_time', 'NONE')}")
                print(f" bid_count: {result.get('bid_count', 'NONE')}")
            return result
        scraper.parser.parse_page = debug_parse

        page_data = await scraper.crawl_page(page, url)

        await browser.close()

    print(f"\nFINAL page_data:")
    print(f" description: {page_data.get('description', 'NONE')[:100] if page_data and page_data.get('description') else 'EMPTY'}")
    print(f" closing_time: {page_data.get('closing_time', 'NONE') if page_data else 'NONE'}")
    print(f" bid_count: {page_data.get('bid_count', 'NONE') if page_data else 'NONE'}")
    print(f" status: {page_data.get('status', 'NONE') if page_data else 'NONE'}")

asyncio.run(test())
@@ -1,85 +0,0 @@
import asyncio
import types
import sys
from pathlib import Path
import pytest


@pytest.mark.asyncio
async def test_fetch_lot_bidding_data_403(monkeypatch):
    """
    Simulate a 403 from the GraphQL endpoint and verify:
    - Function returns None (graceful handling)
    - It attempts a retry and logs a clear 403 message
    """
    # Load modules directly from src using importlib to avoid path issues
    project_root = Path(__file__).resolve().parents[1]
    src_path = project_root / 'src'
    import importlib.util

    def _load_module(name, file_path):
        spec = importlib.util.spec_from_file_location(name, str(file_path))
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        spec.loader.exec_module(module) # type: ignore
        return module

    # Load config first because graphql_client imports it by module name
    config = _load_module('config', src_path / 'config.py')
    graphql_client = _load_module('graphql_client', src_path / 'graphql_client.py')
    monkeypatch.setattr(config, "OFFLINE", False, raising=False)

    log_messages = []

    def fake_print(*args, **kwargs):
        msg = " ".join(str(a) for a in args)
        log_messages.append(msg)

    import builtins
    monkeypatch.setattr(builtins, "print", fake_print)

    class MockResponse:
        def __init__(self, status=403, text_body="Forbidden"):
            self.status = status
            self._text_body = text_body

        async def json(self):
            return {}

        async def text(self):
            return self._text_body

        async def __aenter__(self):
            return self

        async def __aexit__(self, exc_type, exc, tb):
            return False

    class MockSession:
        def __init__(self, *args, **kwargs):
            pass

        def post(self, *args, **kwargs):
            # Always return 403
            return MockResponse(403, "Forbidden by WAF")

        async def __aenter__(self):
            return self

        async def __aexit__(self, exc_type, exc, tb):
            return False

    # Patch aiohttp.ClientSession to our mock
    import types as _types
    dummy_aiohttp = _types.SimpleNamespace()
    dummy_aiohttp.ClientSession = MockSession
    # Ensure that an `import aiohttp` inside the function resolves to our dummy
    monkeypatch.setitem(sys.modules, 'aiohttp', dummy_aiohttp)

    result = await graphql_client.fetch_lot_bidding_data("A1-40179-35")

    # Should gracefully return None
    assert result is None

    # Should have logged a 403 at least once
    assert any("GraphQL API error: 403" in m for m in log_messages)
@@ -1,208 +0,0 @@
#!/usr/bin/env python3
"""
Test to validate that all expected fields are populated after scraping
"""
import sys
import os
import asyncio
import sqlite3

# Add parent and src directory to path
parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
sys.path.insert(0, parent_dir)
sys.path.insert(0, os.path.join(parent_dir, 'src'))

# Force online mode before importing
os.environ['SCAEV_OFFLINE'] = '0'

from scraper import TroostwijkScraper
import config


async def test_lot_has_all_fields():
    """Test that a lot page has all expected fields populated"""

    print("\n" + "="*60)
    print("TEST: Lot has all required fields")
    print("="*60)

    # Use the example lot from user
    test_url = "https://www.troostwijkauctions.com/l/radaway-idea-black-dwj-doucheopstelling-A1-39956-18"

    # Ensure we're not in offline mode
    config.OFFLINE = False

    scraper = TroostwijkScraper()
    scraper.offline = False

    print(f"\n[1] Scraping: {test_url}")

    # Start playwright and scrape
    from playwright.async_api import async_playwright
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        page_data = await scraper.crawl_page(page, test_url)

        await browser.close()

    if not page_data:
        print(" [FAIL] No data returned")
        return False

    print(f"\n[2] Validating fields...")

    # Fields that MUST have values (critical for auction functionality)
    required_fields = {
        'closing_time': 'Closing time',
        'current_bid': 'Current bid',
        'bid_count': 'Bid count',
        'status': 'Status',
    }

    # Fields that SHOULD have values but may legitimately be empty
    optional_fields = {
        'description': 'Description',
    }

    missing_fields = []
    empty_fields = []
    optional_missing = []

    # Check required fields
    for field, label in required_fields.items():
        value = page_data.get(field)

        if value is None:
            missing_fields.append(label)
            print(f" [FAIL] {label}: MISSING (None)")
        elif value == '' or value == 0 or value == 'No bids':
            # Special case: 'No bids' is only acceptable if bid_count is 0
            if field == 'current_bid' and page_data.get('bid_count', 0) == 0:
                print(f" [PASS] {label}: '{value}' (acceptable - no bids)")
            else:
                empty_fields.append(label)
                print(f" [FAIL] {label}: EMPTY ('{value}')")
        else:
            print(f" [PASS] {label}: {value}")

    # Check optional fields (warn but don't fail)
    for field, label in optional_fields.items():
        value = page_data.get(field)
        if value is None or value == '':
            optional_missing.append(label)
            print(f" [WARN] {label}: EMPTY (may be legitimate)")
        else:
            print(f" [PASS] {label}: {value[:50]}...")

    # Check database
    print(f"\n[3] Checking database entry...")
    conn = sqlite3.connect(scraper.cache.db_path)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT closing_time, current_bid, bid_count, description, status
        FROM lots WHERE url = ?
    """, (test_url,))
    row = cursor.fetchone()
    conn.close()

    if row:
        db_closing, db_bid, db_count, db_desc, db_status = row
        print(f" DB closing_time: {db_closing or 'EMPTY'}")
        print(f" DB current_bid: {db_bid or 'EMPTY'}")
        print(f" DB bid_count: {db_count}")
        print(f" DB description: {db_desc[:50] if db_desc else 'EMPTY'}...")
        print(f" DB status: {db_status or 'EMPTY'}")

        # Verify DB matches page_data
        if db_closing != page_data.get('closing_time'):
            print(f" [WARN] DB closing_time doesn't match page_data")
        if db_count != page_data.get('bid_count'):
            print(f" [WARN] DB bid_count doesn't match page_data")
    else:
        print(f" [WARN] No database entry found")

    print(f"\n" + "="*60)
    if missing_fields or empty_fields:
        print(f"[FAIL] Missing fields: {', '.join(missing_fields)}")
        print(f"[FAIL] Empty fields: {', '.join(empty_fields)}")
        if optional_missing:
            print(f"[WARN] Optional missing: {', '.join(optional_missing)}")
        return False
    else:
        print("[PASS] All required fields are populated")
        if optional_missing:
            print(f"[WARN] Optional missing: {', '.join(optional_missing)}")
        return True


async def test_lot_with_description():
    """Test that a lot with description preserves it"""

    print("\n" + "="*60)
    print("TEST: Lot with description")
    print("="*60)

    # Use a lot known to have description
    test_url = "https://www.troostwijkauctions.com/l/used-dometic-seastar-tfxchx8641p-top-mount-engine-control-liver-A1-39684-12"

    config.OFFLINE = False

    scraper = TroostwijkScraper()
    scraper.offline = False

    print(f"\n[1] Scraping: {test_url}")

    from playwright.async_api import async_playwright
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        page_data = await scraper.crawl_page(page, test_url)

        await browser.close()

    if not page_data:
        print(" [FAIL] No data returned")
        return False

    print(f"\n[2] Checking description...")
    description = page_data.get('description', '')

    if not description or description == '':
        print(f" [FAIL] Description is empty")
        return False
    else:
        print(f" [PASS] Description: {description[:100]}...")
        return True


async def main():
    """Run all tests"""
    print("\n" + "="*60)
    print("MISSING FIELDS TEST SUITE")
    print("="*60)

    test1 = await test_lot_has_all_fields()
    test2 = await test_lot_with_description()

    print("\n" + "="*60)
    if test1 and test2:
        print("ALL TESTS PASSED")
    else:
        print("SOME TESTS FAILED")
        if not test1:
            print(" - test_lot_has_all_fields FAILED")
        if not test2:
            print(" - test_lot_with_description FAILED")
    print("="*60 + "\n")

    return 0 if (test1 and test2) else 1


if __name__ == '__main__':
    exit_code = asyncio.run(main())
    sys.exit(exit_code)
@@ -1,335 +0,0 @@
#!/usr/bin/env python3
"""
Test suite for Troostwijk Scraper
Tests both auction and lot parsing with cached data

Requires Python 3.10+
"""

import sys

# Require Python 3.10+
if sys.version_info < (3, 10):
    print("ERROR: This script requires Python 3.10 or higher")
    print(f"Current version: {sys.version}")
    sys.exit(1)

import asyncio
import json
import sqlite3
from datetime import datetime
from pathlib import Path

# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent))

from main import TroostwijkScraper, CacheManager, CACHE_DB

# Test URLs - these will use cached data to avoid overloading the server
TEST_AUCTIONS = [
    "https://www.troostwijkauctions.com/a/online-auction-cnc-lathes-machining-centres-precision-measurement-romania-A7-39813",
    "https://www.troostwijkauctions.com/a/faillissement-bab-shortlease-i-ii-b-v-%E2%80%93-2024-big-ass-energieopslagsystemen-A1-39557",
    "https://www.troostwijkauctions.com/a/industriele-goederen-uit-diverse-bedrijfsbeeindigingen-A1-38675",
]

TEST_LOTS = [
    "https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5",
    "https://www.troostwijkauctions.com/l/tos-sui-50-1000-universele-draaibank-A7-39568-9",
    "https://www.troostwijkauctions.com/l/rolcontainer-%25282x%2529-A1-40191-101",
]

class TestResult:
    def __init__(self, url, success, message, data=None):
        self.url = url
        self.success = success
        self.message = message
        self.data = data

class ScraperTester:
    def __init__(self):
        self.scraper = TroostwijkScraper()
        self.results = []

    def check_cache_exists(self, url):
        """Check if URL is cached"""
        cached = self.scraper.cache.get(url, max_age_hours=999999) # Get even old cache
        return cached is not None

    def test_auction_parsing(self, url):
        """Test auction page parsing"""
        print(f"\n{'='*70}")
        print(f"Testing Auction: {url}")
        print('='*70)

        # Check cache
        if not self.check_cache_exists(url):
            return TestResult(
                url,
                False,
                "❌ NOT IN CACHE - Please run scraper first to cache this URL",
                None
            )

        # Get cached content
        cached = self.scraper.cache.get(url, max_age_hours=999999)
        content = cached['content']

        print(f"✓ Cache hit (age: {(datetime.now().timestamp() - cached['timestamp']) / 3600:.1f} hours)")

        # Parse
        try:
            data = self.scraper._parse_page(content, url)

            if not data:
                return TestResult(url, False, "❌ Parsing returned None", None)

            if data.get('type') != 'auction':
                return TestResult(
                    url,
                    False,
                    f"❌ Expected type='auction', got '{data.get('type')}'",
                    data
                )

            # Validate required fields
            issues = []
            required_fields = {
                'auction_id': str,
                'title': str,
                'location': str,
                'lots_count': int,
                'first_lot_closing_time': str,
            }

            for field, expected_type in required_fields.items():
                value = data.get(field)
                if value is None or value == '':
                    issues.append(f" ❌ {field}: MISSING or EMPTY")
                elif not isinstance(value, expected_type):
                    issues.append(f" ❌ {field}: Wrong type (expected {expected_type.__name__}, got {type(value).__name__})")
                else:
                    # Pretty print value
                    display_value = str(value)[:60]
                    print(f" ✓ {field}: {display_value}")

            if issues:
                return TestResult(url, False, "\n".join(issues), data)

            print(f" ✓ lots_count: {data.get('lots_count')}")

            return TestResult(url, True, "✅ All auction fields validated successfully", data)

        except Exception as e:
            return TestResult(url, False, f"❌ Exception during parsing: {e}", None)

    def test_lot_parsing(self, url):
        """Test lot page parsing"""
        print(f"\n{'='*70}")
        print(f"Testing Lot: {url}")
        print('='*70)

        # Check cache
        if not self.check_cache_exists(url):
            return TestResult(
                url,
                False,
                "❌ NOT IN CACHE - Please run scraper first to cache this URL",
                None
            )

        # Get cached content
        cached = self.scraper.cache.get(url, max_age_hours=999999)
        content = cached['content']

        print(f"✓ Cache hit (age: {(datetime.now().timestamp() - cached['timestamp']) / 3600:.1f} hours)")

        # Parse
        try:
            data = self.scraper._parse_page(content, url)

            if not data:
                return TestResult(url, False, "❌ Parsing returned None", None)

            if data.get('type') != 'lot':
                return TestResult(
                    url,
                    False,
                    f"❌ Expected type='lot', got '{data.get('type')}'",
                    data
                )

            # Validate required fields
            issues = []
            required_fields = {
                'lot_id': (str, lambda x: x and len(x) > 0),
                'title': (str, lambda x: x and len(x) > 3 and x not in ['...', 'N/A']),
                'location': (str, lambda x: x and len(x) > 2 and x not in ['Locatie', 'Location']),
                'current_bid': (str, lambda x: x and x not in ['€Huidig bod', 'Huidig bod']),
                'closing_time': (str, lambda x: True), # Can be empty
                'images': (list, lambda x: True), # Can be empty list
            }

            for field, (expected_type, validator) in required_fields.items():
                value = data.get(field)

                if value is None:
                    issues.append(f" ❌ {field}: MISSING (None)")
                elif not isinstance(value, expected_type):
                    issues.append(f" ❌ {field}: Wrong type (expected {expected_type.__name__}, got {type(value).__name__})")
                elif not validator(value):
                    issues.append(f" ❌ {field}: Invalid value: '{value}'")
                else:
                    # Pretty print value
                    if field == 'images':
                        print(f" ✓ {field}: {len(value)} images")
                        for i, img in enumerate(value[:3], 1):
                            print(f" {i}. {img[:60]}...")
                    else:
                        display_value = str(value)[:60]
                        print(f" ✓ {field}: {display_value}")

            # Additional checks
            if data.get('bid_count') is not None:
                print(f" ✓ bid_count: {data.get('bid_count')}")

            if data.get('viewing_time'):
                print(f" ✓ viewing_time: {data.get('viewing_time')}")

            if data.get('pickup_date'):
                print(f" ✓ pickup_date: {data.get('pickup_date')}")

            if issues:
                return TestResult(url, False, "\n".join(issues), data)

            return TestResult(url, True, "✅ All lot fields validated successfully", data)

        except Exception as e:
            import traceback
            return TestResult(url, False, f"❌ Exception during parsing: {e}\n{traceback.format_exc()}", None)

    def run_all_tests(self):
        """Run all tests"""
        print("\n" + "="*70)
        print("TROOSTWIJK SCRAPER TEST SUITE")
        print("="*70)
        print("\nThis test suite uses CACHED data only - no live requests to server")
        print("="*70)

        # Test auctions
        print("\n" + "="*70)
        print("TESTING AUCTIONS")
        print("="*70)

        for url in TEST_AUCTIONS:
            result = self.test_auction_parsing(url)
            self.results.append(result)

        # Test lots
        print("\n" + "="*70)
        print("TESTING LOTS")
        print("="*70)

        for url in TEST_LOTS:
            result = self.test_lot_parsing(url)
            self.results.append(result)

        # Summary
        return self.print_summary()

    def print_summary(self):
        """Print test summary"""
        print("\n" + "="*70)
        print("TEST SUMMARY")
        print("="*70)

        passed = sum(1 for r in self.results if r.success)
        failed = sum(1 for r in self.results if not r.success)
        total = len(self.results)

        print(f"\nTotal tests: {total}")
        print(f"Passed: {passed} ✓")
        print(f"Failed: {failed} ✗")
        print(f"Success rate: {passed/total*100:.1f}%")

        if failed > 0:
            print("\n" + "="*70)
            print("FAILED TESTS:")
            print("="*70)
            for result in self.results:
                if not result.success:
                    print(f"\n{result.url}")
                    print(result.message)
                    if result.data:
                        print("\nParsed data:")
                        for key, value in result.data.items():
                            if key != 'lots': # Don't print full lots array
                                print(f" {key}: {str(value)[:80]}")

        print("\n" + "="*70)

        return failed == 0

def check_cache_status():
    """Check cache compression status"""
    print("\n" + "="*70)
    print("CACHE STATUS CHECK")
    print("="*70)

    try:
        with sqlite3.connect(CACHE_DB) as conn:
            # Total entries
            cursor = conn.execute("SELECT COUNT(*) FROM cache")
            total = cursor.fetchone()[0]

            # Compressed vs uncompressed
            cursor = conn.execute("SELECT COUNT(*) FROM cache WHERE compressed = 1")
            compressed = cursor.fetchone()[0]

            cursor = conn.execute("SELECT COUNT(*) FROM cache WHERE compressed = 0 OR compressed IS NULL")
            uncompressed = cursor.fetchone()[0]

            print(f"Total cache entries: {total}")
            print(f"Compressed: {compressed} ({compressed/total*100:.1f}%)")
            print(f"Uncompressed: {uncompressed} ({uncompressed/total*100:.1f}%)")

            if uncompressed > 0:
                print(f"\n⚠️ Warning: {uncompressed} entries are still uncompressed")
                print(" Run: python migrate_compress_cache.py")
            else:
                print("\n✓ All cache entries are compressed!")

            # Check test URLs
            print(f"\n{'='*70}")
            print("TEST URL CACHE STATUS:")
            print('='*70)

            all_test_urls = TEST_AUCTIONS + TEST_LOTS
            cached_count = 0

            for url in all_test_urls:
                cursor = conn.execute("SELECT url FROM cache WHERE url = ?", (url,))
                if cursor.fetchone():
                    print(f"✓ {url[:60]}...")
                    cached_count += 1
                else:
                    print(f"✗ {url[:60]}... (NOT CACHED)")

            print(f"\n{cached_count}/{len(all_test_urls)} test URLs are cached")

            if cached_count < len(all_test_urls):
                print("\n⚠️ Some test URLs are not cached. Tests for those URLs will fail.")
                print(" Run the main scraper to cache these URLs first.")

    except Exception as e:
        print(f"Error checking cache status: {e}")

if __name__ == "__main__":
    # Check cache status first
    check_cache_status()

    # Run tests
    tester = ScraperTester()
    success = tester.run_all_tests()

    # Exit with appropriate code
    sys.exit(0 if success else 1)