Compare commits

...

3 Commits

Author SHA1 Message Date
Tour
2dda1aff00 - Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
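For reference, a minimal sketch of the mocking pattern the new test relies on. The `importlib` loading of `src/config.py` is omitted, and the exact (async) signature of `fetch_lot_bidding_data` is an assumption based on the notes above, not the literal test file:

```python
# Hedged sketch of test/test_graphql_403.py's approach; names and the async
# signature of fetch_lot_bidding_data are assumptions based on the notes above.
import asyncio
import importlib.util


class FakeResponse:
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class FakeSession:
    def __init__(self, *args, **kwargs):
        pass

    def post(self, *args, **kwargs):
        # aiohttp's session.post() is used as "async with", so the fake
        # response doubles as the async context manager.
        return FakeResponse()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def _load(name, path):
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def test_graphql_403_is_handled(monkeypatch):
    gql = _load("graphql_client", "src/graphql_client.py")
    monkeypatch.setattr("aiohttp.ClientSession", FakeSession)

    lines = []
    monkeypatch.setattr("builtins.print", lambda *a, **k: lines.append(" ".join(map(str, a))))

    result = asyncio.run(gql.fetch_lot_bidding_data("A1-40179-35"))

    assert result is None
    assert any("GraphQL API error: 403" in line for line in lines)
```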

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
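A small, hypothetical sketch of how the in-place counter and summary can be produced; the real logic lives in `src/scraper.py` and is tied to its concurrent download loop:

```python
import sys

def report_progress(done: int, total: int) -> None:
    # "\r" rewrites the same terminal line, so the counter updates in place.
    sys.stdout.write(f"\r  Downloading images: {done}/{total}")
    sys.stdout.flush()

def report_summary(downloaded_indexes: list[int], total: int) -> None:
    sys.stdout.write("\n")
    print(f"  Downloaded: {len(downloaded_indexes)}/{total} new images")
    if downloaded_indexes:
        shown = downloaded_indexes[:20]
        extra = len(downloaded_indexes) - len(shown)
        suffix = f" (+{extra} more)" if extra else ""
        print(f"    Indexes: {', '.join(map(str, shown))}{suffix}")
```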

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
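Roughly, the hardened request path looks like the sketch below; the header values, retry counts, and endpoint handling are illustrative rather than the exact contents of `src/graphql_client.py`:

```python
import asyncio
import aiohttp

BROWSER_LIKE_HEADERS = {
    # Illustrative values; the real client derives these from its config.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Origin": "https://www.troostwijkauctions.com",
    "Referer": "https://www.troostwijkauctions.com/",
    "Content-Type": "application/json",
}

async def post_with_retry(session: aiohttp.ClientSession, url: str, payload: dict,
                          lot_id: str, retries: int = 2, backoff: float = 1.5):
    for attempt in range(retries + 1):
        async with session.post(url, json=payload, headers=BROWSER_LIKE_HEADERS) as resp:
            if resp.status in (403, 429) and attempt < retries:
                await asyncio.sleep(backoff * (attempt + 1))  # small linear backoff
                continue
            if resp.status != 200:
                snippet = (await resp.text())[:200]  # safe, truncated body snippet
                print(f"  GraphQL API error: {resp.status} (lot={lot_id}) — {snippet}")
                return None
            return await resp.json()
```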

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
2025-12-09 22:56:10 +01:00
Tour
62d664c580 - Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
2025-12-09 20:53:54 +01:00
Tour
5ea2342dbc - Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
2025-12-09 19:53:31 +01:00
27 changed files with 834 additions and 3836 deletions

View File

@@ -10,3 +10,16 @@
dist/
build/
out/
# An .aiignore file follows the same syntax as a .gitignore file.
# .gitignore documentation: https://git-scm.com/docs/gitignore
# you can ignore files
# or folders
.idea
node_modules/
.vscode/
.git
.github
scripts
.pytest_cache/
__pycache__

README.md
View File

@@ -1,75 +1,177 @@
# Setup & IDE Configuration
# Python Setup & IDE Guide
## Python Version Requirement
Short, clear, Python-focused.
This project **requires Python 3.10 or higher**.
---
The code uses Python 3.10+ features including:
- Structural pattern matching
- Union type syntax (`X | Y`)
- Improved type hints
- Modern async/await patterns
## Requirements
## IDE Configuration
- **Python 3.10+**
Uses pattern matching, modern type hints, async improvements.
### PyCharm / IntelliJ IDEA
```bash
python --version
```
If your IDE shows "Python 2.7 syntax" warnings, configure it for Python 3.10+:
---
1. **File → Project Structure → Project Settings → Project**
- Set Python SDK to 3.10 or higher
## IDE Setup (PyCharm / IntelliJ)
2. **File → Settings → Project → Python Interpreter**
- Select Python 3.10+ interpreter
- Click gear icon → Add → System Interpreter → Browse to your Python 3.10 installation
1. **Set interpreter:**
*File → Settings → Project → Python Interpreter → Select Python 3.10+*
3. **File → Settings → Editor → Inspections → Python**
- Ensure "Python version" is set to 3.10+
- Check "Code compatibility inspection" → Set minimum version to 3.10
2. **Fix syntax warnings:**
*Editor → Inspections → Python → Set language level to 3.10+*
3. **Ensure correct SDK:**
*Project Structure → Project SDK → Python 3.10+*
---
## Installation
```bash
# Check Python version
python --version # Should be 3.10+
# Activate venv
~\venvs\scaev\Scripts\Activate.ps1
# Install dependencies
# Install deps
pip install -r requirements.txt
# Install Playwright browsers
# Playwright browsers
playwright install chromium
```
## Verifying Setup
---
## Database Configuration (PostgreSQL)
The scraper now uses PostgreSQL (no more SQLite files). Configure via `DATABASE_URL`:
- Default (baked in):
`postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb`
- Override for your environment:
```bash
# Should print version 3.10.x or higher
python -c "import sys; print(sys.version)"
# Windows PowerShell
$env:DATABASE_URL = "postgresql://user:pass@host:5432/dbname"
# Should run without errors
# Linux/macOS
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
```
Packages used:
- Driver: `psycopg[binary]`
Nothing is written to local `.db` files anymore.
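A minimal sketch of how code typically picks this setting up with `psycopg` (illustrative only, not the project's actual configuration module):

```python
import os
import psycopg

# Falls back to the baked-in default documented above; override via DATABASE_URL.
DEFAULT_URL = "postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb"

def get_connection() -> psycopg.Connection:
    return psycopg.connect(os.environ.get("DATABASE_URL", DEFAULT_URL))

if __name__ == "__main__":
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT version()")
        print(cur.fetchone()[0])
```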
---
## Verify
```bash
python -c "import sys; print(sys.version)"
python main.py --help
```
## Common Issues
Common fixes:
### "ModuleNotFoundError: No module named 'playwright'"
```bash
pip install playwright
playwright install chromium
```
### "Python 2.7 does not support..." warnings in IDE
- Your IDE is configured for Python 2.7
- Follow IDE configuration steps above
- The code WILL work with Python 3.10+ despite warnings
---
### Script exits with "requires Python 3.10 or higher"
- You're running Python 3.9 or older
- Upgrade to Python 3.10+: https://www.python.org/downloads/
# Auto-Start (Monitor)
## Version Files
## Linux (systemd) — Recommended
- `.python-version` - Used by pyenv and similar tools
- `requirements.txt` - Package dependencies
- Runtime checks in scripts ensure Python 3.10+
```bash
cd ~/scaev
chmod +x install_service.sh
./install_service.sh
```
Service features:
- Auto-start
- Auto-restart
- Logs: `~/scaev/logs/monitor.log`
```bash
sudo systemctl status scaev-monitor
journalctl -u scaev-monitor -f
```
---
## Windows (Task Scheduler)
```powershell
cd C:\vibe\scaev
.\setup_windows_task.ps1
```
Manage:
```powershell
Start-ScheduledTask "ScaevAuctionMonitor"
```
---
# Cron Alternative (Linux)
```bash
crontab -e
@reboot cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1
0 * * * * pgrep -f monitor.py || (cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)
```
---
# Status Checks
```bash
ps aux | grep monitor.py
tasklist | findstr python
```
---
# Troubleshooting
- Wrong interpreter → Set Python 3.10+
- Multiple monitors running → kill extra processes
- PostgreSQL connectivity → verify `DATABASE_URL`, network/firewall, and credentials
- Service fails → check `journalctl -u scaev-monitor`
---
# Java Extractor (Short Version)
Prereqs: **Java 21**, **Maven**
Install:
```bash
mvn clean install
mvn exec:java -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```
Run:
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```
Enable native access (IntelliJ → VM Options):
```
--enable-native-access=ALL-UNNAMED
```
---
---
This file keeps everything compact, Python-focused, and ready for onboarding.

View File

@@ -1,240 +0,0 @@
# API Intelligence Findings
## GraphQL API - Available Fields for Intelligence
### Key Discovery: Additional Fields Available
From GraphQL schema introspection on `Lot` type:
#### **Already Captured ✓**
- `currentBidAmount` (Money) - Current bid
- `initialAmount` (Money) - Starting bid
- `nextMinimalBid` (Money) - Minimum bid
- `bidsCount` (Int) - Bid count
- `startDate` / `endDate` (TbaDate) - Timing
- `minimumBidAmountMet` (MinimumBidAmountMet) - Status
- `attributes` - Brand/model extraction
- `title`, `description`, `images`
#### **NEW - Available but NOT Captured:**
1. **followersCount** (Int) - **CRITICAL for intelligence!**
- This is the "watch count" we thought was missing
- Indicates bidder interest level
- **ACTION: Add to schema and extraction**
2. **biddingStatus** (BiddingStatus) - Lot bidding state
- More detailed than minimumBidAmountMet
- **ACTION: Investigate enum values**
3. **estimatedFullPrice** (EstimatedFullPrice) - **Found it!**
- Available via `LotDetails.estimatedFullPrice`
- May contain estimated min/max values
- **ACTION: Test extraction**
4. **nextBidStepInCents** (Long) - Exact bid increment
- More precise than our calculated bid_increment
- **ACTION: Replace calculated field**
5. **condition** (String) - Direct condition field
- Cleaner than attribute extraction
- **ACTION: Use as primary source**
6. **categoryInformation** (LotCategoryInformation) - Category data
- Structured category info
- **ACTION: Extract category path**
7. **location** (LotLocation) - Lot location details
- City, country, possibly address
- **ACTION: Add to schema**
8. **remarks** (String) - Additional notes
- May contain pickup/viewing text
- **ACTION: Check for viewing/pickup extraction**
9. **appearance** (String) - Condition appearance
- Visual condition notes
- **ACTION: Combine with condition_description**
10. **packaging** (String) - Packaging details
- Relevant for shipping intelligence
11. **quantity** (Long) - Lot quantity
- Important for bulk lots
12. **vat** (BigDecimal) - VAT percentage
- For total cost calculations
13. **buyerPremiumPercentage** (BigDecimal) - Buyer premium
- For total cost calculations
14. **videos** - Video URLs (if available)
- **ACTION: Add video support**
15. **documents** - Document URLs (if available)
- May contain specs/manuals
## Bid History API - Fields
### Currently Captured ✓
- `buyerId` (UUID) - Anonymized bidder
- `buyerNumber` (Int) - Bidder number
- `currentBid.cents` / `currency` - Bid amount
- `autoBid` (Boolean) - Autobid flag
- `createdAt` (Timestamp) - Bid time
### Additional Available:
- `negotiated` (Boolean) - Was bid negotiated
- **ACTION: Add to bid_history table**
## Auction API - Not Available
- Attempted `auctionDetails` query - **does not exist**
- Auction data must be scraped from listing pages
## Priority Actions for Intelligence
### HIGH PRIORITY (Immediate):
1. ✅ Add `followersCount` field (watch count)
2. ✅ Add `estimatedFullPrice` extraction
3. ✅ Use `nextBidStepInCents` instead of calculated increment
4. ✅ Add `condition` as primary condition source
5. ✅ Add `categoryInformation` extraction
6. ✅ Add `location` details
7. ✅ Add `negotiated` to bid_history table
### MEDIUM PRIORITY:
8. Extract `remarks` for viewing/pickup text
9. Add `appearance` and `packaging` fields
10. Add `quantity` field
11. Add `vat` and `buyerPremiumPercentage` for cost calculations
12. Add `biddingStatus` enum extraction
### LOW PRIORITY:
13. Add video URL support
14. Add document URL support
## Updated Schema Requirements
### lots table - NEW columns:
```sql
ALTER TABLE lots ADD COLUMN followers_count INTEGER DEFAULT 0;
ALTER TABLE lots ADD COLUMN estimated_min_price REAL;
ALTER TABLE lots ADD COLUMN estimated_max_price REAL;
ALTER TABLE lots ADD COLUMN location_city TEXT;
ALTER TABLE lots ADD COLUMN location_country TEXT;
ALTER TABLE lots ADD COLUMN lot_condition TEXT; -- Direct from API
ALTER TABLE lots ADD COLUMN appearance TEXT;
ALTER TABLE lots ADD COLUMN packaging TEXT;
ALTER TABLE lots ADD COLUMN quantity INTEGER DEFAULT 1;
ALTER TABLE lots ADD COLUMN vat_percentage REAL;
ALTER TABLE lots ADD COLUMN buyer_premium_percentage REAL;
ALTER TABLE lots ADD COLUMN remarks TEXT;
ALTER TABLE lots ADD COLUMN bidding_status TEXT;
ALTER TABLE lots ADD COLUMN videos_json TEXT; -- Store as JSON array
ALTER TABLE lots ADD COLUMN documents_json TEXT; -- Store as JSON array
```
### bid_history table - NEW column:
```sql
ALTER TABLE bid_history ADD COLUMN negotiated INTEGER DEFAULT 0;
```
## Intelligence Use Cases
### With followers_count:
- Predict lot popularity and final price
- Identify hot items early
- Calculate interest-to-bid conversion rate
### With estimated prices:
- Compare final price to estimate
- Identify bargains (final < estimate)
- Calculate auction house accuracy
### With nextBidStepInCents:
- Show exact next bid amount
- Calculate optimal bidding strategy
### With location:
- Filter by proximity
- Calculate pickup logistics
### With vat/buyer_premium:
- Calculate true total cost
- Compare all-in prices
### With condition/appearance:
- Better condition scoring
- Identify restoration projects
## Updated GraphQL Query
```graphql
query EnhancedLotQuery($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    estimatedFullPrice {
      min { cents currency }
      max { cents currency }
    }
    lot {
      id
      displayId
      title
      description { text }
      currentBidAmount { cents currency }
      initialAmount { cents currency }
      nextMinimalBid { cents currency }
      nextBidStepInCents
      bidsCount
      followersCount
      startDate
      endDate
      minimumBidAmountMet
      biddingStatus
      condition
      appearance
      packaging
      quantity
      vat
      buyerPremiumPercentage
      remarks
      auctionId
      location {
        city
        countryCode
        addressLine1
        addressLine2
      }
      categoryInformation {
        id
        name
        path
      }
      images {
        url
        thumbnailUrl
      }
      videos {
        url
        thumbnailUrl
      }
      documents {
        url
        name
      }
      attributes {
        name
        value
      }
    }
  }
}
```
## Summary
**NEW fields found:** 15+ additional intelligence fields available
**Most critical:** `followersCount` (watch count), `estimatedFullPrice`, `nextBidStepInCents`
**Data quality impact:** Estimated 80%+ increase in intelligence value
These fields will significantly enhance prediction and analysis capabilities.

View File

@@ -321,19 +321,18 @@ Lot Page Parsed
## Key Configuration
| Setting | Value | Purpose |
|----------------------|-----------------------------------|----------------------------------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
| Setting | Value | Purpose |
|----------------------|--------------------------------------------------------------------------|----------------------------------|
| `DATABASE_URL` | `postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb` | PostgreSQL connection string |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |
## Output Files
```
/mnt/okcomputer/output/
├── cache.db # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json # Exported auctions
├── auctions_{timestamp}.csv # Exported auctions
├── lots_{timestamp}.json # Exported lots
@@ -377,7 +376,7 @@ For each lot, the data “tunnels through” the following stages:
1. HTML page → parse `__NEXT_DATA__` for core lot fields and lot UUID.
2. GraphQL `lotDetails` → bidding data (current/starting/minimum bid, bid count, bid step, close time, status).
3. Optional REST bid history → complete timeline of bids; derive first/last bid time and bid velocity.
4. Persist to DB (SQLite for now) and export; image URLs are captured and optionally downloaded concurrently per lot.
4. Persist to DB (PostgreSQL) and export; image URLs are captured and optionally downloaded concurrently per lot.
Each stage is recorded by the TTY progress reporter with timing and byte size for transparency and diagnostics.
@@ -503,13 +502,6 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
- ✅ Closing time and status
- ✅ Brand, model, manufacturer (from attributes)
**Available but Not Yet Captured:**
- ⚠️ `followersCount` - Watch count for popularity analysis
- ⚠️ `estimatedFullPrice` - Min/max estimated values
- ⚠️ `biddingStatus` - More detailed status enum
- ⚠️ `condition` - Direct condition field
- ⚠️ `location` - City, country details
- ⚠️ `categoryInformation` - Structured category
### REST API - Bid History
**Endpoint:** `https://shared-api.tbauctions.com/bidmanagement/lots/{lot_uuid}/bidding-history`
@@ -553,11 +545,6 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
### API Integration Points
**Files:**
- `src/graphql_client.py` - GraphQL queries and parsing
- `src/bid_history_client.py` - REST API pagination and parsing
- `src/scraper.py` - Integration during lot scraping
**Flow:**
1. Lot page scraped → Extract lot UUID from `__NEXT_DATA__`
2. Call GraphQL API → Get bidding data
@@ -570,4 +557,3 @@ query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platfo
- Overall 0.5s rate limit applies to page requests
- API calls are part of lot processing (not separately limited)
See `API_INTELLIGENCE_FINDINGS.md` for detailed field analysis and roadmap.

View File

@@ -1,120 +0,0 @@
# Auto-Start Setup Guide
The monitor doesn't run automatically yet. Choose your setup based on your server OS:
---
## Linux Server (Systemd Service) ⭐ RECOMMENDED
**Install:**
```bash
cd /home/tour/scaev
chmod +x install_service.sh
./install_service.sh
```
**The service will:**
- ✅ Start automatically on server boot
- ✅ Restart automatically if it crashes
- ✅ Log to `~/scaev/logs/monitor.log`
- ✅ Poll every 30 minutes
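The single CLI argument is the polling interval in minutes (`python3 src/monitor.py 30`). A hypothetical sketch of such a loop, not the actual `src/monitor.py`:

```python
import sys
import time

def run_cycle() -> None:
    # Placeholder for one scrape/refresh pass; the real monitor invokes the scraper.
    print("polling auctions...")

def main() -> None:
    interval_minutes = int(sys.argv[1]) if len(sys.argv) > 1 else 30
    while True:
        run_cycle()
        time.sleep(interval_minutes * 60)

if __name__ == "__main__":
    main()
```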
**Management commands:**
```bash
sudo systemctl status scaev-monitor # Check if running
sudo systemctl stop scaev-monitor # Stop
sudo systemctl start scaev-monitor # Start
sudo systemctl restart scaev-monitor # Restart
journalctl -u scaev-monitor -f # Live logs
tail -f ~/scaev/logs/monitor.log # Monitor log file
```
---
## Windows (Task Scheduler)
**Install (Run as Administrator):**
```powershell
cd C:\vibe\scaev
.\setup_windows_task.ps1
```
**The task will:**
- ✅ Start automatically on Windows boot
- ✅ Restart automatically if it crashes (up to 3 times)
- ✅ Run as SYSTEM user
- ✅ Poll every 30 minutes
**Management:**
1. Open Task Scheduler (`taskschd.msc`)
2. Find `ScaevAuctionMonitor` in Task Scheduler Library
3. Right-click to Run/Stop/Disable
**Or via PowerShell:**
```powershell
Start-ScheduledTask -TaskName "ScaevAuctionMonitor"
Stop-ScheduledTask -TaskName "ScaevAuctionMonitor"
Get-ScheduledTask -TaskName "ScaevAuctionMonitor" | Get-ScheduledTaskInfo
```
---
## Alternative: Cron Job (Linux)
**For simpler setup without systemd:**
```bash
# Edit crontab
crontab -e
# Add this line (runs on boot and restarts every hour if not running)
@reboot cd /home/tour/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1
0 * * * * pgrep -f "monitor.py" || (cd /home/tour/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)
```
---
## Verify It's Working
**Check process is running:**
```bash
# Linux
ps aux | grep monitor.py
# Windows
tasklist | findstr python
```
**Check logs:**
```bash
# Linux
tail -f ~/scaev/logs/monitor.log
# Windows
# Check Task Scheduler history
```
**Check database is updating:**
```bash
# Last modified time should update every 30 minutes
ls -lh C:/mnt/okcomputer/output/cache.db
```
---
## Troubleshooting
**Service won't start:**
1. Check Python path is correct in service file
2. Check working directory exists
3. Check user permissions
4. View error logs: `journalctl -u scaev-monitor -n 50`
**Monitor stops after a while:**
- Check disk space for logs
- Check rate limiting isn't blocking requests
- Increase RestartSec in service file
**Database locked errors:**
- Ensure only one monitor instance is running
- Add timeout to SQLite connections in config

View File

@@ -1,23 +0,0 @@
✅ Routing service configured - scaev-mobile-routing.service active and working
✅ Scaev deployed - Container running with dual networks:
scaev_mobile_net (172.30.0.10) - for outbound internet via mobile
traefik_net (172.20.0.8) - for LAN access
✅ Mobile routing verified:
Host IP: 5.132.33.195 (LAN gateway)
Mobile IP: 77.63.26.140 (mobile provider)
Scaev IP: 77.63.26.140 ✅ Using mobile connection!
✅ Scraper functional - Successfully accessing troostwijkauctions.com through mobile network
Architecture:
```
┌─────────────────────────────────────────┐
│ Tour Machine (192.168.1.159) │
│ │
│ ┌──────────────────────────────┐ │
│ │ Scaev Container │ │
│ │ • scaev_mobile_net: 172.30.0.10 ────┼──> Mobile Gateway (10.133.133.26)
│ │ • traefik_net: 172.20.0.8 │ │ └─> Internet (77.63.26.140)
│ │ • SQLite: shared-auction-data│ │
│ │ • Images: shared-auction-data│ │
│ └──────────────────────────────┘ │
│ │
└─────────────────────────────────────────┘
```

View File

@@ -1,122 +0,0 @@
# Deployment (Scaev)
## Prerequisites
- Python 3.8+ installed
- Access to a server (Linux/Windows)
- Playwright and dependencies installed
## Production Setup
### 1. Install on Server
```bash
# Clone repository
git clone git@git.appmodel.nl:Tour/scaev.git
cd scaev
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
playwright install chromium
playwright install-deps # Install system dependencies
```
### 2. Configuration
Create a configuration file or set environment variables:
```python
# main.py configuration
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"
OUTPUT_DIR = "/mnt/okcomputer/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```
### 3. Create Output Directories
```bash
sudo mkdir -p /var/scaev/output
sudo chown $USER:$USER /var/scaev
```
### 4. Run as Cron Job
Add to crontab (`crontab -e`):
```bash
# Run scraper daily at 2 AM
0 2 * * * cd /path/to/scaev && /path/to/.venv/bin/python main.py >> /var/log/scaev.log 2>&1
```
## Docker Deployment (Optional)
Create `Dockerfile`:
```dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
wget \
gnupg \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps
COPY main.py .
CMD ["python", "main.py"]
```
Build and run:
```bash
docker build -t scaev .
docker run -v /path/to/output:/output scaev
```
## Monitoring
### Check Logs
```bash
tail -f /var/log/scaev.log
```
### Monitor Output
```bash
ls -lh /var/scaev/output/
```
## Troubleshooting
### Playwright Browser Issues
```bash
# Reinstall browsers
playwright install --force chromium
```
### Permission Issues
```bash
# Fix permissions
sudo chown -R $USER:$USER /var/scaev
```
### Memory Issues
- Reduce `MAX_PAGES` in configuration
- Run on machine with more RAM (Playwright needs ~1GB)

View File

@@ -1,377 +0,0 @@
# Data Quality Fixes - Complete Summary
## Executive Summary
Successfully completed all 5 high-priority data quality and intelligence tasks:
1. **Fixed orphaned lots** (16,807 → 13 orphaned lots)
2. **Fixed bid history fetching** (script created, ready to run)
3. **Added followersCount extraction** (watch count)
4. **Added estimatedFullPrice extraction** (min/max values)
5. **Added direct condition field** from API
**Impact:** Database now captures 80%+ more intelligence data for future scrapes.
---
## Task 1: Fix Orphaned Lots ✅ COMPLETE
### Problem:
- **16,807 lots** had no matching auction (100% orphaned)
- Root cause: auction_id mismatch
- Lots table used UUID auction_id (e.g., `72928a1a-12bf-4d5d-93ac-292f057aab6e`)
- Auctions table used numeric IDs (legacy incorrect data)
- Auction pages use `displayId` (e.g., `A1-34731`)
### Solution:
1. **Updated parse.py** - Modified `_parse_lot_json()` to extract auction displayId from page_props
- Lot pages include full auction data
- Now extracts `auction.displayId` instead of using UUID `lot.auctionId`
2. **Created fix_orphaned_lots.py** - Migrated existing 16,793 lots
- Read cached lot pages
- Extracted auction displayId from embedded auction data
- Updated lots.auction_id from UUID to displayId
3. **Created fix_auctions_table.py** - Rebuilt auctions table
- Cleared incorrect auction data
- Re-extracted from 517 cached auction pages
- Inserted 509 auctions with correct displayId
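A minimal sketch of the id-resolution rule described above; the real `page_props` structure handled in `src/parse.py` is richer, so the key paths here are simplified assumptions:

```python
def resolve_auction_id(page_props: dict) -> str | None:
    # Prefer the human-readable auction displayId (e.g. "A1-34731") embedded in the
    # lot page over the UUID in lot["auctionId"], so lots join to the auctions table.
    auction = page_props.get("auction") or {}
    lot = page_props.get("lot") or {}
    return auction.get("displayId") or lot.get("auctionId")
```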
### Results:
- **Orphaned lots:** 16,807 → **13** (99.9% fixed)
- **Auctions completeness:**
- lots_count: 0% → **100%**
- first_lot_closing_time: 0% → **100%**
- **All lots now properly linked to auctions**
### Files Modified:
- `src/parse.py` - Updated `_extract_nextjs_data()` and `_parse_lot_json()`
### Scripts Created:
- `fix_orphaned_lots.py` - Migrates existing lots
- `fix_auctions_table.py` - Rebuilds auctions table
- `check_lot_auction_link.py` - Diagnostic script
---
## Task 2: Fix Bid History Fetching ✅ COMPLETE
### Problem:
- **1,590 lots** with bids but no bid history (0.1% coverage)
- Bid history fetching only ran during scraping, not for existing lots
### Solution:
1. **Verified scraper logic** - src/scraper.py bid history fetching is correct
- Extracts lot UUID from __NEXT_DATA__
- Calls REST API: `https://shared-api.tbauctions.com/bidmanagement/lots/{uuid}/bidding-history`
- Calculates bid velocity, first/last bid time
- Saves to bid_history table
2. **Created fetch_missing_bid_history.py**
- Builds lot_id → UUID mapping from cached pages
- Fetches bid history from REST API for all lots with bids
- Updates lots table with bid intelligence
- Saves complete bid history records
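A hedged sketch of the fetch-and-derive step; the endpoint is the one documented above, while the response shape, pagination, and timestamp format are assumptions:

```python
import aiohttp
from datetime import datetime

BID_HISTORY_URL = "https://shared-api.tbauctions.com/bidmanagement/lots/{uuid}/bidding-history"

async def fetch_bid_intelligence(session: aiohttp.ClientSession, lot_uuid: str) -> dict | None:
    # Single-request sketch; the real client paginates and honours the 0.5 s rate limit.
    async with session.get(BID_HISTORY_URL.format(uuid=lot_uuid)) as resp:
        if resp.status != 200:
            return None
        payload = await resp.json()
    bids = payload.get("bids", [])  # response key is an assumption
    if not bids:
        return None
    times = sorted(
        datetime.fromisoformat(b["createdAt"].replace("Z", "+00:00")) for b in bids
    )
    hours = max((times[-1] - times[0]).total_seconds() / 3600, 1e-9)
    return {
        "bid_count": len(bids),
        "first_bid_time": times[0].isoformat(),
        "last_bid_time": times[-1].isoformat(),
        "bid_velocity": len(bids) / hours,  # bids per hour
    }
```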
### Results:
- Script created and tested
- **Limitation:** Takes ~13 minutes to process 1,590 lots (0.5s rate limit)
- **Future scrapes:** Bid history will be captured automatically
### Files Created:
- `fetch_missing_bid_history.py` - Migration script for existing lots
### Note:
- Script is ready to run but requires ~13-15 minutes
- Future scrapes will automatically capture bid history
- No code changes needed - existing scraper logic is correct
---
## Task 3: Add followersCount Field ✅ COMPLETE
### Problem:
- Watch count thought to be unavailable
- **Discovery:** `followersCount` field exists in GraphQL API!
### Solution:
1. **Updated database schema** (src/cache.py)
- Added `followers_count INTEGER DEFAULT 0` column
- Auto-migration on scraper startup
2. **Updated GraphQL query** (src/graphql_client.py)
- Added `followersCount` to LOT_BIDDING_QUERY
3. **Updated format_bid_data()** (src/graphql_client.py)
- Extracts and returns `followers_count`
4. **Updated save_lot()** (src/cache.py)
- Saves followers_count to database
5. **Created enrich_existing_lots.py**
- Fetches followers_count for existing 16,807 lots
- Uses GraphQL API with 0.5s rate limiting
- Takes ~2.3 hours to complete
### Intelligence Value:
- **Predict lot popularity** before bidding wars
- Calculate interest-to-bid conversion rate
- Identify "sleeper" lots (high followers, low bids)
- Alert on lots gaining sudden interest
### Files Modified:
- `src/cache.py` - Schema + save_lot()
- `src/graphql_client.py` - Query + format_bid_data()
### Files Created:
- `enrich_existing_lots.py` - Migration for existing lots
---
## Task 4: Add estimatedFullPrice Extraction ✅ COMPLETE
### Problem:
- Estimated min/max values thought to be unavailable
- **Discovery:** `estimatedFullPrice` object with min/max exists in GraphQL API!
### Solution:
1. **Updated database schema** (src/cache.py)
- Added `estimated_min_price REAL` column
- Added `estimated_max_price REAL` column
2. **Updated GraphQL query** (src/graphql_client.py)
- Added `estimatedFullPrice { min { cents currency } max { cents currency } }`
3. **Updated format_bid_data()** (src/graphql_client.py)
- Extracts estimated_min_obj and estimated_max_obj
- Converts cents to EUR
- Returns estimated_min_price and estimated_max_price
4. **Updated save_lot()** (src/cache.py)
- Saves both estimated price fields
5. **Migration** (enrich_existing_lots.py)
- Fetches estimated prices for existing lots
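A small sketch of the cents-to-EUR conversion performed in `format_bid_data()`; the response shape mirrors the `estimatedFullPrice { min { cents } max { cents } }` selection above, everything else is illustrative:

```python
def extract_estimated_prices(lot_details: dict) -> tuple[float | None, float | None]:
    est = (lot_details or {}).get("estimatedFullPrice") or {}

    def to_eur(money: dict | None) -> float | None:
        # GraphQL returns Money in cents; convert to EUR for the lots table.
        if money and money.get("cents") is not None:
            return money["cents"] / 100.0
        return None

    return to_eur(est.get("min")), to_eur(est.get("max"))
```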
### Intelligence Value:
- Compare final price vs estimate (accuracy analysis)
- Identify bargains: `final_price < estimated_min`
- Identify overvalued: `final_price > estimated_max`
- Build pricing models per category
- Investment opportunity detection
### Files Modified:
- `src/cache.py` - Schema + save_lot()
- `src/graphql_client.py` - Query + format_bid_data()
---
## Task 5: Use Direct Condition Field ✅ COMPLETE
### Problem:
- Condition extracted from attributes (complex, unreliable)
- 0% condition_score success rate
- **Discovery:** Direct `condition` and `appearance` fields in GraphQL API!
### Solution:
1. **Updated database schema** (src/cache.py)
- Added `lot_condition TEXT` column (direct from API)
- Added `appearance TEXT` column (visual condition notes)
2. **Updated GraphQL query** (src/graphql_client.py)
- Added `condition` field
- Added `appearance` field
3. **Updated format_bid_data()** (src/graphql_client.py)
- Extracts and returns `lot_condition`
- Extracts and returns `appearance`
4. **Updated save_lot()** (src/cache.py)
- Saves both condition fields
5. **Migration** (enrich_existing_lots.py)
- Fetches condition data for existing lots
### Intelligence Value:
- **Cleaner, more reliable** condition data
- Better condition scoring potential
- Identify restoration projects
- Filter by condition category
- Combined with appearance for detailed assessment
### Files Modified:
- `src/cache.py` - Schema + save_lot()
- `src/graphql_client.py` - Query + format_bid_data()
---
## Summary of Code Changes
### Core Files Modified:
#### 1. `src/parse.py`
**Changes:**
- `_extract_nextjs_data()`: Pass auction data to lot parser
- `_parse_lot_json()`: Accept auction_data parameter, extract auction displayId
**Impact:** Fixes orphaned lots issue going forward
#### 2. `src/cache.py`
**Changes:**
- Added 5 new columns to lots table schema
- Updated `save_lot()` INSERT statement to include new fields
- Auto-migration logic for new columns
**New Columns:**
- `followers_count INTEGER DEFAULT 0`
- `estimated_min_price REAL`
- `estimated_max_price REAL`
- `lot_condition TEXT`
- `appearance TEXT`
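A condensed sketch of what the column auto-migration can look like for the SQLite cache this document describes; the column names match the list above, the rest is illustrative:

```python
import sqlite3

NEW_LOT_COLUMNS = {
    "followers_count": "INTEGER DEFAULT 0",
    "estimated_min_price": "REAL",
    "estimated_max_price": "REAL",
    "lot_condition": "TEXT",
    "appearance": "TEXT",
}

def migrate_lots_table(conn: sqlite3.Connection) -> None:
    # Add any missing columns in place; existing databases keep their data.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(lots)")}
    for name, ddl in NEW_LOT_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE lots ADD COLUMN {name} {ddl}")
    conn.commit()
```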
#### 3. `src/graphql_client.py`
**Changes:**
- Updated `LOT_BIDDING_QUERY` to include new fields
- Updated `format_bid_data()` to extract and format new fields
**New Fields Extracted:**
- `followersCount`
- `estimatedFullPrice { min { cents } max { cents } }`
- `condition`
- `appearance`
### Migration Scripts Created:
1. **fix_orphaned_lots.py** - Fix auction_id mismatch (COMPLETED)
2. **fix_auctions_table.py** - Rebuild auctions table (COMPLETED)
3. **fetch_missing_bid_history.py** - Fetch bid history for existing lots (READY TO RUN)
4. **enrich_existing_lots.py** - Fetch new intelligence fields for existing lots (READY TO RUN)
### Diagnostic/Validation Scripts:
1. **check_lot_auction_link.py** - Verify lot-auction linkage
2. **validate_data.py** - Comprehensive data quality report
3. **explore_api_fields.py** - API schema introspection
---
## Running the Migration Scripts
### Immediate (Already Complete):
```bash
python fix_orphaned_lots.py # ✅ DONE - Fixed 16,793 lots
python fix_auctions_table.py # ✅ DONE - Rebuilt 509 auctions
```
### Optional (Time-Intensive):
```bash
# Fetch bid history for 1,590 lots (~13-15 minutes)
python fetch_missing_bid_history.py
# Enrich all 16,807 lots with new fields (~2.3 hours)
python enrich_existing_lots.py
```
**Note:** Future scrapes will automatically capture all data, so migration is optional.
---
## Validation Results
### Before Fixes:
```
Orphaned lots: 16,807 (100%)
Auctions lots_count: 0%
Auctions first_lot_closing: 0%
Bid history coverage: 0.1% (1/1,591 lots)
```
### After Fixes:
```
Orphaned lots: 13 (0.08%)
Auctions lots_count: 100%
Auctions first_lot_closing: 100%
Bid history: Script ready (will process 1,590 lots)
New intelligence fields: Implemented and ready
```
---
## Intelligence Impact
### Data Completeness Improvements:
| Field | Before | After | Improvement |
|-------|--------|-------|-------------|
| Orphaned lots | 100% | 0.08% | **99.9% fixed** |
| Auction lots_count | 0% | 100% | **+100%** |
| Auction first_lot_closing | 0% | 100% | **+100%** |
### New Intelligence Fields (Future Scrapes):
| Field | Status | Intelligence Value |
|-------|--------|-------------------|
| followers_count | ✅ Implemented | High - Popularity predictor |
| estimated_min_price | ✅ Implemented | High - Bargain detection |
| estimated_max_price | ✅ Implemented | High - Value assessment |
| lot_condition | ✅ Implemented | Medium - Condition filtering |
| appearance | ✅ Implemented | Medium - Visual assessment |
### Estimated Intelligence Value Increase:
**80%+** - Based on addition of 5 critical fields that enable:
- Popularity prediction
- Value assessment
- Bargain detection
- Better condition scoring
- Investment opportunity identification
---
## Documentation Updated
### Created:
- `VALIDATION_SUMMARY.md` - Complete validation findings
- `API_INTELLIGENCE_FINDINGS.md` - API field analysis
- `FIXES_COMPLETE.md` - This document
### Updated:
- `_wiki/ARCHITECTURE.md` - Complete system documentation
- Updated Phase 3 diagram with API enrichment
- Expanded lots table schema documentation
- Added bid_history table
- Added API Integration Architecture section
- Updated rate limiting and image download flows
---
## Next Steps (Optional)
### Immediate:
1. ✅ All high-priority fixes complete
2. ✅ Code ready for future scrapes
3. ⏳ Optional: Run migration scripts for existing data
### Future Enhancements (Low Priority):
1. Extract structured location (city, country)
2. Extract category information (structured)
3. Add VAT and buyer premium fields
4. Add video/document URL support
5. Parse viewing/pickup times from remarks text
See `API_INTELLIGENCE_FINDINGS.md` for complete roadmap.
---
## Success Criteria
All tasks completed successfully:
- [x] **Orphaned lots fixed** - 99.9% reduction (16,807 → 13)
- [x] **Bid history logic verified** - Script created, ready to run
- [x] **followersCount added** - Schema, extraction, saving implemented
- [x] **estimatedFullPrice added** - Min/max extraction implemented
- [x] **Direct condition field** - lot_condition and appearance added
- [x] **Code updated** - parse.py, cache.py, graphql_client.py
- [x] **Migrations created** - 4 scripts for data cleanup/enrichment
- [x] **Documentation complete** - ARCHITECTURE.md, summaries, findings
**Impact:** Scraper now captures 80%+ more intelligence data with higher data quality.

View File

@@ -1,18 +0,0 @@
# scaev Wiki
Welcome to the scaev documentation.
## Contents
- [Getting Started](Getting-Started)
- [Architecture](Architecture)
- [Deployment](Deployment)
## Overview
Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Quick Links
- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)

View File

@@ -1,624 +0,0 @@
# Intelligence Dashboard Upgrade Plan
## Executive Summary
The Troostwijk scraper now captures **5 critical new intelligence fields** that enable advanced predictive analytics and opportunity detection. This document outlines recommended dashboard upgrades to leverage the new data.
---
## New Intelligence Fields Available
### 1. **followers_count** (Watch Count)
**Type:** INTEGER
**Coverage:** Will be 100% for new scrapes, 0% for existing (requires migration)
**Intelligence Value:** ⭐⭐⭐⭐⭐ CRITICAL
**What it tells us:**
- How many users are watching/following each lot
- Real-time popularity indicator
- Early warning of bidding competition
**Dashboard Applications:**
- **Popularity Score**: Calculate interest level before bidding starts
- **Follower Trends**: Track follower growth rate (requires time-series scraping)
- **Interest-to-Bid Conversion**: Ratio of followers to actual bidders
- **Sleeper Lots Alert**: High followers + low bids = hidden opportunity
### 2. **estimated_min_price** & **estimated_max_price**
**Type:** REAL (EUR)
**Coverage:** Will be 100% for new scrapes, 0% for existing (requires migration)
**Intelligence Value:** ⭐⭐⭐⭐⭐ CRITICAL
**What it tells us:**
- Auction house's professional valuation range
- Expected market value
- Reserve price indicator (when combined with status)
**Dashboard Applications:**
- **Value Gap Analysis**: `current_bid / estimated_min_price` ratio
- **Bargain Detector**: Lots where `current_bid < estimated_min_price * 0.8`
- **Overvaluation Alert**: Lots where `current_bid > estimated_max_price * 1.2`
- **Investment ROI Calculator**: Potential profit if bought at current bid
- **Auction House Accuracy**: Track actual closing vs estimates
### 3. **lot_condition** & **appearance**
**Type:** TEXT
**Coverage:** Will be ~80-90% for new scrapes (not all lots have condition data)
**Intelligence Value:** ⭐⭐⭐ HIGH
**What it tells us:**
- Direct condition assessment from auction house
- Visual quality notes
- Cleaner than parsing from attributes
**Dashboard Applications:**
- **Condition Filtering**: Filter by condition categories
- **Restoration Projects**: Identify lots needing work
- **Quality Scoring**: Combine condition + appearance for rating
- **Condition vs Price**: Analyze price premium for better condition
---
## Data Quality Improvements
### Orphaned Lots Issue - FIXED ✅
**Before:** 16,807 lots (100%) had no matching auction
**After:** 13 lots (0.08%) orphaned
**Impact on Dashboard:**
- Auction-level analytics now possible
- Can group lots by auction
- Can show auction statistics
- Can track auction house performance
### Auction Data Completeness - FIXED ✅
**Before:**
- lots_count: 0%
- first_lot_closing_time: 0%
**After:**
- lots_count: 100%
- first_lot_closing_time: 100%
**Impact on Dashboard:**
- Show auction size (number of lots)
- Display auction timeline
- Calculate auction velocity (lots per hour closing)
---
## Recommended Dashboard Upgrades
### Priority 1: Opportunity Detection (High ROI)
#### 1.1 **Bargain Hunter Dashboard**
```
╔══════════════════════════════════════════════════════════╗
║ BARGAIN OPPORTUNITIES ║
╠══════════════════════════════════════════════════════════╣
║ Lot: A1-34731-107 - Ford Generator ║
║ Current Bid: €500 ║
║ Estimated Range: €1,200 - €1,800 ║
║ Bargain Score: 🔥🔥🔥🔥🔥 (58% below estimate) ║
║ Followers: 12 (High interest, low bids) ║
║ Time Left: 2h 15m ║
║ → POTENTIAL PROFIT: €700 - €1,300 ║
╚══════════════════════════════════════════════════════════╝
```
**Calculations:**
```python
value_gap = estimated_min_price - current_bid
bargain_score = value_gap / estimated_min_price * 100
potential_profit = estimated_max_price - current_bid
# Filter criteria
if current_bid < estimated_min_price * 0.80:  # 20%+ discount
    if followers_count > 5:                   # Has interest
        SHOW_AS_OPPORTUNITY
```
#### 1.2 **Popularity vs Bidding Dashboard**
```
╔══════════════════════════════════════════════════════════╗
║ SLEEPER LOTS (High Watch, Low Bids) ║
╠══════════════════════════════════════════════════════════╣
║ Lot │ Followers │ Bids │ Current │ Est Min ║
║═══════════════════╪═══════════╪══════╪═════════╪═════════║
║ Laptop Dell XPS │ 47 │ 0 │ No bids│ €800 ║
║ iPhone 15 Pro │ 32 │ 1 │ €150 │ €950 ║
║ Office Chairs 10x │ 18 │ 0 │ No bids│ €450 ║
╚══════════════════════════════════════════════════════════╝
```
**Insight:** High followers + low bids = people watching but not committing yet. Opportunity to bid early before competition heats up.
#### 1.3 **Value Gap Heatmap**
```
╔══════════════════════════════════════════════════════════╗
║ VALUE GAP ANALYSIS ║
╠══════════════════════════════════════════════════════════╣
║ ║
║ Great Deals Fair Price Overvalued ║
║ (< 80% est) (80-120% est) (> 120% est) ║
║ ╔═══╗ ╔═══╗ ╔═══╗ ║
║ ║325║ ║892║ ║124║ ║
║ ╚═══╝ ╚═══╝ ╚═══╝ ║
║ 🔥 ➡ ⚠ ║
╚══════════════════════════════════════════════════════════╝
```
### Priority 2: Intelligence Analytics
#### 2.1 **Lot Intelligence Card**
Enhanced lot detail view with all new fields:
```
╔══════════════════════════════════════════════════════════╗
║ A1-34731-107 - Ford FGT9250E Generator ║
╠══════════════════════════════════════════════════════════╣
║ BIDDING ║
║ Current: €500 ║
║ Starting: €100 ║
║ Minimum: €550 ║
║ Bids: 8 (2.4 bids/hour) ║
║ Followers: 12 👁 ║
║ ║
║ VALUATION ║
║ Estimated: €1,200 - €1,800 ║
║ Value Gap: -€700 (58% below estimate) 🔥 ║
║ Potential: €700 - €1,300 profit ║
║ ║
║ CONDITION ║
║ Condition: Used - Good working order ║
║ Appearance: Normal wear, some scratches ║
║ Year: 2015 ║
║ ║
║ TIMING ║
║ Closes: 2025-12-08 14:30 ║
║ Time Left: 2h 15m ║
║ First Bid: 2025-12-06 09:15 ║
║ Last Bid: 2025-12-08 12:10 ║
╚══════════════════════════════════════════════════════════╝
```
#### 2.2 **Auction House Accuracy Tracker**
Track how accurate estimates are compared to final prices:
```
╔══════════════════════════════════════════════════════════╗
║ AUCTION HOUSE ESTIMATION ACCURACY ║
╠══════════════════════════════════════════════════════════╣
║ Category │ Avg Accuracy │ Tend to Over/Under ║
║══════════════════╪══════════════╪═══════════════════════║
║ Electronics │ 92.3% │ Underestimate 5.2% ║
║ Vehicles │ 88.7% │ Overestimate 8.1% ║
║ Furniture │ 94.1% │ Accurate ±2% ║
║ Heavy Machinery │ 85.4% │ Underestimate 12.3% ║
╚══════════════════════════════════════════════════════════╝
Insight: Heavy Machinery estimates tend to be 12% low
→ Good buying opportunities in this category
```
**Calculation:**
```python
# After lot closes
actual_price = final_bid
estimated_mid = (estimated_min_price + estimated_max_price) / 2
accuracy = abs(actual_price - estimated_mid) / estimated_mid * 100
if actual_price < estimated_mid:
    trend = "Underestimate"
else:
    trend = "Overestimate"
```
#### 2.3 **Interest Conversion Dashboard**
```
╔══════════════════════════════════════════════════════════╗
║ FOLLOWER → BIDDER CONVERSION ║
╠══════════════════════════════════════════════════════════╣
║ Total Lots: 16,807 ║
║ Lots with Followers: 12,450 (74%) ║
║ Lots with Bids: 1,591 (9.5%) ║
║ ║
║ Conversion Rate: 12.8% ║
║ (Followers who bid) ║
║ ║
║ Avg Followers per Lot: 8.3 ║
║ Avg Bids when >0: 5.2 ║
║ ║
║ HIGH INTEREST CATEGORIES: ║
║ Electronics: 18.5 followers avg ║
║ Vehicles: 24.3 followers avg ║
║ Art: 31.2 followers avg ║
╚══════════════════════════════════════════════════════════╝
```
### Priority 3: Real-Time Alerts
#### 3.1 **Opportunity Alerts**
```python
# Alert conditions using new fields
# BARGAIN ALERT
if (current_bid < estimated_min_price * 0.80 and
        time_remaining_hours < 24 and
        followers_count > 3):
    send_alert(f"BARGAIN: {lot_id} - {value_gap}% below estimate!")

# SLEEPER LOT ALERT
if (followers_count > 10 and
        bid_count == 0 and
        time_remaining_hours < 12):
    send_alert(f"SLEEPER: {lot_id} - {followers_count} watching, no bids yet!")

# HEATING UP ALERT
if (follower_growth_rate_per_hour > 5 and
        bid_count < 3):
    send_alert(f"HEATING UP: {lot_id} - Interest spiking, get in early!")

# OVERVALUED WARNING
if current_bid > estimated_max_price * 1.2:
    send_alert(f"OVERVALUED: {lot_id} - 20%+ above high estimate!")
```
#### 3.2 **Watchlist Smart Alerts**
```
╔══════════════════════════════════════════════════════════╗
║ YOUR WATCHLIST ALERTS ║
╠══════════════════════════════════════════════════════════╣
║ 🔥 MacBook Pro A1-34523 ║
║ Now €800 (€400 below estimate!) ║
║ 12 others watching - Act fast! ║
║ ║
║ 👁 iPhone 15 A1-34987 ║
║ 32 followers but no bids - Opportunity? ║
║ ║
║ ⚠ Office Desk A1-35102 ║
║ Bid at €450 but estimate €200-€300 ║
║ Consider dropping - overvalued! ║
╚══════════════════════════════════════════════════════════╝
```
### Priority 4: Advanced Analytics
#### 4.1 **Price Prediction Model**
Using new fields for ML-based price prediction:
```python
# Features for price prediction model
features = [
    'followers_count',      # NEW - Strong predictor
    'estimated_min_price',  # NEW - Baseline value
    'estimated_max_price',  # NEW - Upper bound
    'lot_condition',        # NEW - Quality indicator
    'appearance',           # NEW - Visual quality
    'bid_velocity',         # Existing
    'time_to_close',        # Existing
    'category',             # Existing
    'manufacturer',         # Existing
    'year_manufactured',    # Existing
]
predicted_final_price = model.predict(features)
confidence_interval = (predicted_low, predicted_high)
```
**Dashboard Display:**
```
╔══════════════════════════════════════════════════════════╗
║ PRICE PREDICTION (AI) ║
╠══════════════════════════════════════════════════════════╣
║ Lot: Ford Generator A1-34731-107 ║
║ ║
║ Current Bid: €500 ║
║ Estimate Range: €1,200 - €1,800 ║
║ ║
║ AI PREDICTION: €1,450 ║
║ Confidence: €1,280 - €1,620 (85% confidence) ║
║ ║
║ Factors: ║
║ ✓ 12 followers (above avg) ║
║ ✓ Good condition ║
║ ✓ 2.4 bids/hour (active) ║
║ - 2015 model (slightly old) ║
║ ║
║ Recommendation: BUY if below €1,280 ║
╚══════════════════════════════════════════════════════════╝
```
#### 4.2 **Category Intelligence**
```
╔══════════════════════════════════════════════════════════╗
║ ELECTRONICS CATEGORY INTELLIGENCE ║
╠══════════════════════════════════════════════════════════╣
║ Total Lots: 1,243 ║
║ Avg Followers: 18.5 (High Interest Category) ║
║ Avg Bids: 12.3 ║
║ Follower→Bid Rate: 15.2% (above avg 12.8%) ║
║ ║
║ PRICE ANALYSIS: ║
║ Estimate Accuracy: 92.3% ║
║ Avg Value Gap: -5.2% (tend to underestimate) ║
║ Bargains Found: 87 lots (7%) ║
║ ║
║ BEST CONDITIONS: ║
║ "New/Sealed": Avg 145% of estimate ║
║ "Like New": Avg 112% of estimate ║
║ "Used - Good": Avg 89% of estimate ║
║ "Used - Fair": Avg 62% of estimate ║
║ ║
║ 💡 INSIGHT: Electronics estimates are accurate but ║
║ tend to slightly undervalue. Good buying category. ║
╚══════════════════════════════════════════════════════════╝
```
---
## Implementation Priority
### Phase 1: Quick Wins (1-2 days)
1. **Bargain Hunter Dashboard** - Filter lots by value gap
2. **Enhanced Lot Cards** - Show all new fields
3. **Opportunity Alerts** - Email/push notifications for bargains
### Phase 2: Analytics (3-5 days)
4. **Popularity vs Bidding Dashboard** - Follower analysis
5. **Value Gap Heatmap** - Visual overview
6. **Auction House Accuracy** - Historical tracking
### Phase 3: Advanced (1-2 weeks)
7. **Price Prediction Model** - ML-based predictions
8. **Category Intelligence** - Deep category analytics
9. **Smart Watchlist** - Personalized alerts
---
## Database Queries for Dashboard
### Get Bargain Opportunities
```sql
SELECT
lot_id,
title,
current_bid,
estimated_min_price,
estimated_max_price,
followers_count,
lot_condition,
closing_time,
(estimated_min_price - CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), ',', '') AS REAL)) as value_gap,
((estimated_min_price - CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), ',', '') AS REAL)) / estimated_min_price * 100) as bargain_score
FROM lots
WHERE estimated_min_price IS NOT NULL
AND current_bid NOT LIKE '%No bids%'
AND CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), ',', '') AS REAL) < estimated_min_price * 0.80
AND followers_count > 3
AND datetime(closing_time) > datetime('now')
ORDER BY bargain_score DESC
LIMIT 50;
```
### Get Sleeper Lots
```sql
SELECT
lot_id,
title,
followers_count,
bid_count,
current_bid,
estimated_min_price,
closing_time,
(julianday(closing_time) - julianday('now')) * 24 as hours_remaining
FROM lots
WHERE followers_count > 10
AND bid_count = 0
AND datetime(closing_time) > datetime('now')
AND (julianday(closing_time) - julianday('now')) * 24 < 24
ORDER BY followers_count DESC;
```
### Get Auction House Accuracy (Historical)
```sql
-- After lots close
SELECT
category,
COUNT(*) as total_lots,
AVG(ABS(final_price - (estimated_min_price + estimated_max_price) / 2) /
((estimated_min_price + estimated_max_price) / 2) * 100) as avg_accuracy,
AVG(final_price - (estimated_min_price + estimated_max_price) / 2) as avg_bias
FROM lots
WHERE estimated_min_price IS NOT NULL
AND final_price IS NOT NULL
AND datetime(closing_time) < datetime('now')
GROUP BY category
ORDER BY avg_accuracy DESC;
```
### Get Interest Conversion Rate
```sql
SELECT
COUNT(*) as total_lots,
COUNT(CASE WHEN followers_count > 0 THEN 1 END) as lots_with_followers,
COUNT(CASE WHEN bid_count > 0 THEN 1 END) as lots_with_bids,
ROUND(COUNT(CASE WHEN bid_count > 0 THEN 1 END) * 100.0 /
COUNT(CASE WHEN followers_count > 0 THEN 1 END), 2) as conversion_rate,
AVG(followers_count) as avg_followers,
AVG(CASE WHEN bid_count > 0 THEN bid_count END) as avg_bids_when_active
FROM lots
WHERE followers_count > 0;
```
### Get Category Intelligence
```sql
SELECT
category,
COUNT(*) as total_lots,
AVG(followers_count) as avg_followers,
AVG(bid_count) as avg_bids,
COUNT(CASE WHEN bid_count > 0 THEN 1 END) * 100.0 / COUNT(*) as bid_rate,
COUNT(CASE WHEN followers_count > 0 THEN 1 END) * 100.0 / COUNT(*) as follower_rate,
-- Bargain rate
COUNT(CASE
WHEN estimated_min_price IS NOT NULL
AND current_bid NOT LIKE '%No bids%'
AND CAST(REPLACE(REPLACE(current_bid, 'EUR ', ''), ',', '') AS REAL) < estimated_min_price * 0.80
THEN 1
END) as bargains_found
FROM lots
WHERE category IS NOT NULL AND category != ''
GROUP BY category
HAVING COUNT(*) > 50
ORDER BY avg_followers DESC;
```
---
## API Requirements
### Real-Time Updates
For dashboards to stay current, implement periodic scraping:
```python
# Recommended update frequency
ACTIVE_LOTS = "Every 15 minutes" # Lots closing soon
ALL_LOTS = "Every 4 hours" # General updates
NEW_LOTS = "Every 1 hour" # Check for new listings
```
### Webhook Notifications
```python
# Alert types to implement
BARGAIN_ALERT = "Lot below 80% estimate"
SLEEPER_ALERT = "10+ followers, 0 bids, <12h remaining"
HEATING_UP = "Follower growth > 5/hour"
OVERVALUED = "Bid > 120% high estimate"
CLOSING_SOON = "Watchlist item < 1h remaining"
```
---
## Migration Scripts to Run
To populate new fields for existing 16,807 lots:
```bash
# High priority - enriches all lots with new intelligence
python enrich_existing_lots.py
# Time: ~2.3 hours
# Benefit: Enables all dashboard features immediately
# Medium priority - adds bid history intelligence
python fetch_missing_bid_history.py
# Time: ~15 minutes
# Benefit: Bid velocity, timing analysis
```
**Note:** Future scrapes will automatically capture all fields, so migration is optional but recommended for immediate dashboard functionality.
---
## Expected Impact
### Before New Fields:
- Basic price tracking
- Simple bid monitoring
- Limited opportunity detection
### After New Fields:
- **80% more intelligence** per lot
- Advanced opportunity detection (bargains, sleepers)
- Price prediction capability
- Auction house accuracy tracking
- Category-specific insights
- Interest→Bid conversion analytics
- Real-time popularity tracking
### ROI Potential:
```
Example Scenario:
- User finds bargain: €500 current bid, €1,200-€1,800 estimate
- Buys at: €600 (after competition)
- Resells at: €1,400 (within estimate range)
- Profit: €800
Dashboard Value: Automated detection of 87 such opportunities
Potential Value: 87 × €800 = €69,600 in identified opportunities
```
---
## Monitoring & Success Metrics
Track dashboard effectiveness:
```python
# User engagement metrics
opportunities_shown = COUNT(bargain_alerts)
opportunities_acted_on = COUNT(user_bids_after_alert)
conversion_rate = opportunities_acted_on / opportunities_shown
# Accuracy metrics
predicted_bargains = COUNT(lots_flagged_as_bargain)
actual_bargains = COUNT(lots_closed_below_estimate)
prediction_accuracy = actual_bargains / predicted_bargains
# Value metrics
total_opportunity_value = SUM(estimated_min - final_price) WHERE final_price < estimated_min
avg_opportunity_value = total_opportunity_value / actual_bargains
```
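The block above is a set of metric definitions rather than runnable code; the accuracy/value side could be computed from closed lots roughly like this (a sketch that assumes a numeric `final_price` column, as used in the estimate-accuracy query earlier):
```python
import sqlite3

DB_PATH = "/mnt/okcomputer/output/cache.db"  # same path used by the migration scripts

with sqlite3.connect(DB_PATH) as conn:
    actual_bargains, total_value = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(estimated_min_price - final_price), 0)
        FROM lots
        WHERE final_price IS NOT NULL
          AND estimated_min_price IS NOT NULL
          AND final_price < estimated_min_price
          AND datetime(closing_time) < datetime('now')
        """
    ).fetchone()

avg_opportunity_value = total_value / actual_bargains if actual_bargains else 0.0
print(f"Closed below estimate: {actual_bargains} lots, "
      f"total €{total_value:,.0f}, avg €{avg_opportunity_value:,.0f}")
```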
---
## Next Steps
1. **Immediate (Today):**
- ✅ Run `enrich_existing_lots.py` to populate new fields
- ✅ Update dashboard to display new fields
2. **This Week:**
- Implement Bargain Hunter Dashboard
- Add opportunity alerts
- Create enhanced lot cards
3. **Next Week:**
- Build analytics dashboards
- Implement price prediction model
- Set up webhook notifications
4. **Future:**
- A/B test alert strategies
- Refine prediction models with historical data
- Add category-specific recommendations
---
## Conclusion
The scraper now captures **5 critical intelligence fields** that unlock advanced analytics:
| Field | Dashboard Impact |
|-------|------------------|
| followers_count | Popularity tracking, sleeper detection |
| estimated_min_price | Bargain detection, value assessment |
| estimated_max_price | Overvaluation alerts, ROI calculation |
| lot_condition | Quality filtering, restoration opportunities |
| appearance | Visual assessment, detailed condition |
**Combined with fixed data quality** (99.9% fewer orphaned lots, 100% auction completeness), the dashboard can now provide:
- 🎯 **Opportunity Detection** - Automated bargain hunting
- 📊 **Predictive Analytics** - ML-based price predictions
- 📈 **Category Intelligence** - Deep market insights
- **Real-Time Alerts** - Instant opportunity notifications
- 💰 **ROI Tracking** - Measure investment potential
**Estimated intelligence value increase: 80%+**
Ready to build! 🚀

View File

@@ -1,164 +0,0 @@
# Troostwijk Auction Extractor - Run Instructions
## Fixed Warnings
All warnings have been resolved:
- ✅ SLF4J logging configured (slf4j-simple)
- ✅ Native access enabled for SQLite JDBC
- ✅ Logging output controlled via simplelogger.properties
## Prerequisites
1. **Java 21** installed
2. **Maven** installed
3. **IntelliJ IDEA** (recommended) or command line
## Setup (First Time Only)
### 1. Install Dependencies
In IntelliJ Terminal or PowerShell:
```bash
# Reload Maven dependencies
mvn clean install
# Install Playwright browser binaries (first time only)
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```
## Running the Application
### Option A: Using IntelliJ IDEA (Easiest)
1. **Add VM Options for native access:**
- Run → Edit Configurations
- Select or create configuration for `TroostwijkAuctionExtractor`
- In "VM options" field, add:
```
--enable-native-access=ALL-UNNAMED
```
2. **Add Program Arguments (optional):**
- In "Program arguments" field, add:
```
--max-visits 3
```
3. **Run the application:**
- Click the green Run button
### Option B: Using Maven (Command Line)
```bash
# Run with pom.xml default arguments (3 page visit limit)
mvn exec:java
# Run with custom arguments (override pom.xml defaults)
mvn exec:java -Dexec.args="--max-visits 5"
# Run without cache
mvn exec:java -Dexec.args="--no-cache --max-visits 2"
# Run with unlimited visits
mvn exec:java -Dexec.args=""
```
### Option C: Using Java Directly
```bash
# Compile first
mvn clean compile
# Run with native access enabled
java --enable-native-access=ALL-UNNAMED \
-cp target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
com.auction.TroostwijkAuctionExtractor --max-visits 3
```
## Command Line Arguments
```
--max-visits <n> Limit actual page fetches to n (0 = unlimited, default)
--no-cache Disable page caching
--help Show help message
```
## Examples
### Test with 3 page visits (cached pages don't count):
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```
### Fresh extraction without cache:
```bash
mvn exec:java -Dexec.args="--no-cache --max-visits 5"
```
### Full extraction (all pages, unlimited):
```bash
mvn exec:java -Dexec.args=""
```
## Expected Output (No Warnings)
```
=== Troostwijk Auction Extractor ===
Max page visits set to: 3
Initializing Playwright browser...
✓ Browser ready
✓ Cache database initialized
Starting auction extraction from https://www.troostwijkauctions.com/auctions
[Page 1] Fetching auctions...
✓ Fetched from website (visit 1/3)
✓ Found 20 auctions
[Page 2] Fetching auctions...
✓ Loaded from cache
✓ Found 20 auctions
[Page 3] Fetching auctions...
✓ Fetched from website (visit 2/3)
✓ Found 20 auctions
✓ Total auctions extracted: 60
=== Results ===
Total auctions found: 60
Dutch auctions (NL): 45
Actual page visits: 2
✓ Browser and cache closed
```
## Cache Management
- Cache is stored in: `cache/page_cache.db`
- Cache expires after: 24 hours (configurable in code)
- To clear cache: Delete `cache/page_cache.db` file
## Troubleshooting
### If you still see warnings:
1. **Reload Maven project in IntelliJ:**
- Right-click `pom.xml` → Maven → Reload project
2. **Verify VM options:**
- Ensure `--enable-native-access=ALL-UNNAMED` is in VM options
3. **Clean and rebuild:**
```bash
mvn clean install
```
### If Playwright fails:
```bash
# Reinstall browser binaries
mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install chromium"
```

View File

@@ -7,9 +7,8 @@ aiohttp>=3.9.0 # Optional: only needed if DOWNLOAD_IMAGES=True
# ORM groundwork (gradual adoption)
SQLAlchemy>=2.0 # Modern ORM (2.x) — groundwork for PostgreSQL
# For PostgreSQL in the near future, install one of:
# psycopg[binary]>=3.1 # Recommended
# or psycopg2-binary>=2.9
# PostgreSQL driver (runtime)
psycopg[binary]>=3.1
# Development/Testing
pytest>=7.4.0 # Optional: for testing

View File

@@ -1,290 +0,0 @@
#!/usr/bin/env python3
"""
Script to detect and fix malformed/incomplete database entries.
Identifies entries with:
- Missing auction_id for auction pages
- Missing title
- Invalid bid values like "€Huidig bod"
- "gap" in closing_time
- Empty or invalid critical fields
Then re-parses from cache and updates.
"""
import sys
import sqlite3
import zlib
from pathlib import Path
from typing import List, Dict, Tuple
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from parse import DataParser
from config import CACHE_DB
class MalformedEntryFixer:
"""Detects and fixes malformed database entries"""
def __init__(self, db_path: str):
self.db_path = db_path
self.parser = DataParser()
def detect_malformed_auctions(self) -> List[Tuple]:
"""Find auctions with missing or invalid data"""
with sqlite3.connect(self.db_path) as conn:
# Auctions with issues
cursor = conn.execute("""
SELECT auction_id, url, title, first_lot_closing_time
FROM auctions
WHERE
auction_id = '' OR auction_id IS NULL
OR title = '' OR title IS NULL
OR first_lot_closing_time = 'gap'
OR first_lot_closing_time LIKE '%wegens vereffening%'
""")
return cursor.fetchall()
def detect_malformed_lots(self) -> List[Tuple]:
"""Find lots with missing or invalid data"""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
SELECT lot_id, url, title, current_bid, closing_time
FROM lots
WHERE
auction_id = '' OR auction_id IS NULL
OR title = '' OR title IS NULL
OR current_bid LIKE '%Huidig%bod%'
OR current_bid = '€Huidig bod'
OR closing_time = 'gap'
OR closing_time = ''
OR closing_time LIKE '%wegens vereffening%'
""")
return cursor.fetchall()
def get_cached_content(self, url: str) -> str:
"""Retrieve and decompress cached HTML for a URL"""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute(
"SELECT content FROM cache WHERE url = ?",
(url,)
)
row = cursor.fetchone()
if row and row[0]:
try:
return zlib.decompress(row[0]).decode('utf-8')
except Exception as e:
print(f" ❌ Failed to decompress: {e}")
return None
return None
def reparse_and_fix_auction(self, auction_id: str, url: str, dry_run: bool = False) -> bool:
"""Re-parse auction page from cache and update database"""
print(f"\n Fixing auction: {auction_id}")
print(f" URL: {url}")
content = self.get_cached_content(url)
if not content:
print(f" ❌ No cached content found")
return False
# Re-parse using current parser
parsed = self.parser.parse_page(content, url)
if not parsed or parsed.get('type') != 'auction':
print(f" ❌ Could not parse as auction")
return False
# Validate parsed data
if not parsed.get('auction_id') or not parsed.get('title'):
print(f" ⚠️ Re-parsed data still incomplete:")
print(f" auction_id: {parsed.get('auction_id')}")
print(f" title: {parsed.get('title', '')[:50]}")
return False
print(f" ✓ Parsed successfully:")
print(f" auction_id: {parsed.get('auction_id')}")
print(f" title: {parsed.get('title', '')[:50]}")
print(f" location: {parsed.get('location', 'N/A')}")
print(f" lots: {parsed.get('lots_count', 0)}")
if not dry_run:
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
UPDATE auctions SET
auction_id = ?,
title = ?,
location = ?,
lots_count = ?,
first_lot_closing_time = ?
WHERE url = ?
""", (
parsed['auction_id'],
parsed['title'],
parsed.get('location', ''),
parsed.get('lots_count', 0),
parsed.get('first_lot_closing_time', ''),
url
))
conn.commit()
print(f" ✓ Database updated")
return True
def reparse_and_fix_lot(self, lot_id: str, url: str, dry_run: bool = False) -> bool:
"""Re-parse lot page from cache and update database"""
print(f"\n Fixing lot: {lot_id}")
print(f" URL: {url}")
content = self.get_cached_content(url)
if not content:
print(f" ❌ No cached content found")
return False
# Re-parse using current parser
parsed = self.parser.parse_page(content, url)
if not parsed or parsed.get('type') != 'lot':
print(f" ❌ Could not parse as lot")
return False
# Validate parsed data
issues = []
if not parsed.get('lot_id'):
issues.append("missing lot_id")
if not parsed.get('title'):
issues.append("missing title")
if parsed.get('current_bid', '').lower().startswith('€huidig'):
issues.append("invalid bid format")
if issues:
print(f" ⚠️ Re-parsed data still has issues: {', '.join(issues)}")
print(f" lot_id: {parsed.get('lot_id')}")
print(f" title: {parsed.get('title', '')[:50]}")
print(f" bid: {parsed.get('current_bid')}")
return False
print(f" ✓ Parsed successfully:")
print(f" lot_id: {parsed.get('lot_id')}")
print(f" auction_id: {parsed.get('auction_id')}")
print(f" title: {parsed.get('title', '')[:50]}")
print(f" bid: {parsed.get('current_bid')}")
print(f" closing: {parsed.get('closing_time', 'N/A')}")
if not dry_run:
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
UPDATE lots SET
lot_id = ?,
auction_id = ?,
title = ?,
current_bid = ?,
bid_count = ?,
closing_time = ?,
viewing_time = ?,
pickup_date = ?,
location = ?,
description = ?,
category = ?
WHERE url = ?
""", (
parsed['lot_id'],
parsed.get('auction_id', ''),
parsed['title'],
parsed.get('current_bid', ''),
parsed.get('bid_count', 0),
parsed.get('closing_time', ''),
parsed.get('viewing_time', ''),
parsed.get('pickup_date', ''),
parsed.get('location', ''),
parsed.get('description', ''),
parsed.get('category', ''),
url
))
conn.commit()
print(f" ✓ Database updated")
return True
def run(self, dry_run: bool = False):
"""Main execution - detect and fix all malformed entries"""
print("="*70)
print("MALFORMED ENTRY DETECTION AND REPAIR")
print("="*70)
# Check for auctions
print("\n1. CHECKING AUCTIONS...")
malformed_auctions = self.detect_malformed_auctions()
print(f" Found {len(malformed_auctions)} malformed auction entries")
stats = {'auctions_fixed': 0, 'auctions_failed': 0}
for auction_id, url, title, closing_time in malformed_auctions:
try:
if self.reparse_and_fix_auction(auction_id or url.split('/')[-1], url, dry_run):
stats['auctions_fixed'] += 1
else:
stats['auctions_failed'] += 1
except Exception as e:
print(f" ❌ Error: {e}")
stats['auctions_failed'] += 1
# Check for lots
print("\n2. CHECKING LOTS...")
malformed_lots = self.detect_malformed_lots()
print(f" Found {len(malformed_lots)} malformed lot entries")
stats['lots_fixed'] = 0
stats['lots_failed'] = 0
for lot_id, url, title, bid, closing_time in malformed_lots:
try:
if self.reparse_and_fix_lot(lot_id or url.split('/')[-1], url, dry_run):
stats['lots_fixed'] += 1
else:
stats['lots_failed'] += 1
except Exception as e:
print(f" ❌ Error: {e}")
stats['lots_failed'] += 1
# Summary
print("\n" + "="*70)
print("SUMMARY")
print("="*70)
print(f"Auctions:")
print(f" - Found: {len(malformed_auctions)}")
print(f" - Fixed: {stats['auctions_fixed']}")
print(f" - Failed: {stats['auctions_failed']}")
print(f"\nLots:")
print(f" - Found: {len(malformed_lots)}")
print(f" - Fixed: {stats['lots_fixed']}")
print(f" - Failed: {stats['lots_failed']}")
if dry_run:
print("\n⚠️ DRY RUN - No changes were made to the database")
def main():
import argparse
parser = argparse.ArgumentParser(
description="Detect and fix malformed database entries"
)
parser.add_argument(
'--db',
default=CACHE_DB,
help='Path to cache database'
)
parser.add_argument(
'--dry-run',
action='store_true',
help='Show what would be done without making changes'
)
args = parser.parse_args()
print(f"Database: {args.db}")
print(f"Dry run: {args.dry_run}\n")
fixer = MalformedEntryFixer(args.db)
fixer.run(dry_run=args.dry_run)
if __name__ == "__main__":
main()

View File

@@ -1,139 +0,0 @@
#!/usr/bin/env python3
"""
Migrate uncompressed cache entries to compressed format
This script compresses all cache entries where compressed=0
"""
import sqlite3
import zlib
import time
CACHE_DB = "/mnt/okcomputer/output/cache.db"
def migrate_cache():
"""Compress all uncompressed cache entries"""
with sqlite3.connect(CACHE_DB) as conn:
# Get uncompressed entries
cursor = conn.execute(
"SELECT url, content FROM cache WHERE compressed = 0 OR compressed IS NULL"
)
uncompressed = cursor.fetchall()
if not uncompressed:
print("✓ No uncompressed entries found. All cache is already compressed!")
return
print(f"Found {len(uncompressed)} uncompressed cache entries")
print("Starting compression...")
total_original_size = 0
total_compressed_size = 0
compressed_count = 0
for url, content in uncompressed:
try:
# Handle both text and bytes
if isinstance(content, str):
content_bytes = content.encode('utf-8')
else:
content_bytes = content
original_size = len(content_bytes)
# Compress
compressed_content = zlib.compress(content_bytes, level=9)
compressed_size = len(compressed_content)
# Update in database
conn.execute(
"UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
(compressed_content, url)
)
total_original_size += original_size
total_compressed_size += compressed_size
compressed_count += 1
if compressed_count % 100 == 0:
conn.commit()
ratio = (1 - total_compressed_size / total_original_size) * 100
print(f" Compressed {compressed_count}/{len(uncompressed)} entries... "
f"({ratio:.1f}% reduction so far)")
except Exception as e:
print(f" ERROR compressing {url}: {e}")
continue
# Final commit
conn.commit()
# Calculate final statistics
ratio = (1 - total_compressed_size / total_original_size) * 100 if total_original_size > 0 else 0
size_saved_mb = (total_original_size - total_compressed_size) / (1024 * 1024)
print("\n" + "="*60)
print("MIGRATION COMPLETE")
print("="*60)
print(f"Entries compressed: {compressed_count}")
print(f"Original size: {total_original_size / (1024*1024):.2f} MB")
print(f"Compressed size: {total_compressed_size / (1024*1024):.2f} MB")
print(f"Space saved: {size_saved_mb:.2f} MB")
print(f"Compression ratio: {ratio:.1f}%")
print("="*60)
def verify_migration():
"""Verify all entries are compressed"""
with sqlite3.connect(CACHE_DB) as conn:
cursor = conn.execute(
"SELECT COUNT(*) FROM cache WHERE compressed = 0 OR compressed IS NULL"
)
uncompressed_count = cursor.fetchone()[0]
cursor = conn.execute("SELECT COUNT(*) FROM cache WHERE compressed = 1")
compressed_count = cursor.fetchone()[0]
print("\nVERIFICATION:")
print(f" Compressed entries: {compressed_count}")
print(f" Uncompressed entries: {uncompressed_count}")
if uncompressed_count == 0:
print(" ✓ All cache entries are compressed!")
return True
else:
print(" ✗ Some entries are still uncompressed")
return False
def get_db_size():
"""Get current database file size"""
import os
if os.path.exists(CACHE_DB):
size_mb = os.path.getsize(CACHE_DB) / (1024 * 1024)
return size_mb
return 0
if __name__ == "__main__":
print("Cache Compression Migration Tool")
print("="*60)
# Show initial DB size
initial_size = get_db_size()
print(f"Initial database size: {initial_size:.2f} MB\n")
# Run migration
start_time = time.time()
migrate_cache()
elapsed = time.time() - start_time
print(f"\nTime taken: {elapsed:.2f} seconds")
# Verify
verify_migration()
# Show final DB size
final_size = get_db_size()
print(f"\nFinal database size: {final_size:.2f} MB")
print(f"Database size reduced by: {initial_size - final_size:.2f} MB")
print("\n✓ Migration complete! You can now run VACUUM to reclaim disk space:")
print(" sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'")

View File

@@ -1,180 +0,0 @@
#!/usr/bin/env python3
"""
Migration script to re-parse cached HTML pages and update database entries.
Fixes issues with incomplete data extraction from earlier scrapes.
"""
import sys
import sqlite3
from pathlib import Path
# Add src to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from parse import DataParser
from config import CACHE_DB
def reparse_and_update_lots(db_path: str = CACHE_DB, dry_run: bool = False):
"""
Re-parse cached HTML pages and update lot entries in the database.
This extracts improved data from __NEXT_DATA__ JSON blobs that may have been
missed in earlier scraping runs when validation was less strict.
"""
parser = DataParser()
with sqlite3.connect(db_path) as conn:
# Get all cached lot pages
cursor = conn.execute("""
SELECT url, content
FROM cache
WHERE url LIKE '%/l/%'
ORDER BY timestamp DESC
""")
cached_pages = cursor.fetchall()
print(f"Found {len(cached_pages)} cached lot pages to re-parse")
stats = {
'processed': 0,
'updated': 0,
'skipped': 0,
'errors': 0
}
for url, compressed_content in cached_pages:
try:
# Decompress content
import zlib
content = zlib.decompress(compressed_content).decode('utf-8')
# Re-parse using current parser logic
parsed_data = parser.parse_page(content, url)
if not parsed_data or parsed_data.get('type') != 'lot':
stats['skipped'] += 1
continue
lot_id = parsed_data.get('lot_id', '')
if not lot_id:
print(f" ⚠️ No lot_id for {url}")
stats['skipped'] += 1
continue
# Check if lot exists
existing = conn.execute(
"SELECT lot_id FROM lots WHERE lot_id = ?",
(lot_id,)
).fetchone()
if not existing:
print(f" → New lot: {lot_id}")
# Insert new lot
if not dry_run:
conn.execute("""
INSERT INTO lots
(lot_id, auction_id, url, title, current_bid, bid_count,
closing_time, viewing_time, pickup_date, location,
description, category, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
lot_id,
parsed_data.get('auction_id', ''),
url,
parsed_data.get('title', ''),
parsed_data.get('current_bid', ''),
parsed_data.get('bid_count', 0),
parsed_data.get('closing_time', ''),
parsed_data.get('viewing_time', ''),
parsed_data.get('pickup_date', ''),
parsed_data.get('location', ''),
parsed_data.get('description', ''),
parsed_data.get('category', ''),
parsed_data.get('scraped_at', '')
))
stats['updated'] += 1
else:
# Update existing lot with newly parsed data
# Only update fields that are now populated but weren't before
if not dry_run:
conn.execute("""
UPDATE lots SET
auction_id = COALESCE(NULLIF(?, ''), auction_id),
title = COALESCE(NULLIF(?, ''), title),
current_bid = COALESCE(NULLIF(?, ''), current_bid),
bid_count = CASE WHEN ? > 0 THEN ? ELSE bid_count END,
closing_time = COALESCE(NULLIF(?, ''), closing_time),
viewing_time = COALESCE(NULLIF(?, ''), viewing_time),
pickup_date = COALESCE(NULLIF(?, ''), pickup_date),
location = COALESCE(NULLIF(?, ''), location),
description = COALESCE(NULLIF(?, ''), description),
category = COALESCE(NULLIF(?, ''), category)
WHERE lot_id = ?
""", (
parsed_data.get('auction_id', ''),
parsed_data.get('title', ''),
parsed_data.get('current_bid', ''),
parsed_data.get('bid_count', 0),
parsed_data.get('bid_count', 0),
parsed_data.get('closing_time', ''),
parsed_data.get('viewing_time', ''),
parsed_data.get('pickup_date', ''),
parsed_data.get('location', ''),
parsed_data.get('description', ''),
parsed_data.get('category', ''),
lot_id
))
stats['updated'] += 1
print(f" ✓ Updated: {lot_id[:20]}")
# Update images if they exist
images = parsed_data.get('images', [])
if images and not dry_run:
for img_url in images:
conn.execute("""
INSERT OR IGNORE INTO images (lot_id, url)
VALUES (?, ?)
""", (lot_id, img_url))
stats['processed'] += 1
if stats['processed'] % 100 == 0:
print(f" Progress: {stats['processed']}/{len(cached_pages)}")
if not dry_run:
conn.commit()
except Exception as e:
print(f" ❌ Error processing {url}: {e}")
stats['errors'] += 1
continue
if not dry_run:
conn.commit()
print("\n" + "="*60)
print("MIGRATION COMPLETE")
print("="*60)
print(f"Processed: {stats['processed']}")
print(f"Updated: {stats['updated']}")
print(f"Skipped: {stats['skipped']}")
print(f"Errors: {stats['errors']}")
if dry_run:
print("\n⚠️ DRY RUN - No changes were made to the database")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Re-parse and update lot entries from cached HTML")
parser.add_argument('--db', default=CACHE_DB, help='Path to cache database')
parser.add_argument('--dry-run', action='store_true', help='Show what would be done without making changes')
args = parser.parse_args()
print(f"Database: {args.db}")
print(f"Dry run: {args.dry_run}")
print()
reparse_and_update_lots(args.db, args.dry_run)

View File

@@ -1,62 +1,165 @@
#!/usr/bin/env python3
"""
Cache Manager module for SQLite-based caching and data storage
Cache Manager module for database-backed caching and data storage.
Backend: PostgreSQL (psycopg)
"""
import sqlite3
import psycopg
import time
import threading
from contextlib import contextmanager
from typing import Dict, List, Optional, Tuple
import zlib
import json
from datetime import datetime
from typing import Dict, List, Optional
import config
class CacheManager:
"""Manages page caching and data storage using SQLite"""
def __init__(self, db_path: str = None):
self.db_path = db_path or config.CACHE_DB
class _ConnectionPool:
"""Very small, thread-safe connection pool for psycopg (sync) connections.
Avoids creating a new TCP connection for every DB access, which on Windows
can quickly exhaust ephemeral ports and cause WSAEADDRINUSE (10048).
"""
def __init__(self, dsn: str, min_size: int = 1, max_size: int = 6, connect_fn=None, timeout: int = 30):
self._dsn = dsn
self._min = max(0, int(min_size))
self._max = max(1, int(max_size))
self._timeout = max(1, int(timeout))
self._connect = connect_fn or psycopg.connect
self._lock = threading.Lock()
self._cond = threading.Condition(self._lock)
self._idle: list = []
self._created = 0
# Pre-warm pool
for _ in range(self._min):
conn = self._new_connection_with_retry()
self._idle.append(conn)
self._created += 1
def _new_connection_with_retry(self):
last_exc = None
backoffs = [0.05, 0.1, 0.2, 0.4, 0.8]
for delay in backoffs:
try:
return self._connect(self._dsn)
except Exception as e:
last_exc = e
time.sleep(delay)
# Final attempt without sleeping after loop
try:
return self._connect(self._dsn)
except Exception as e:
last_exc = e
raise last_exc
def acquire(self, timeout: Optional[float] = None):
deadline = time.time() + (timeout if timeout is not None else self._timeout)
with self._cond:
while True:
# Reuse idle
while self._idle:
conn = self._idle.pop()
try:
if getattr(conn, "closed", False):
self._created -= 1
continue
return conn
except Exception:
# Consider it broken
self._created -= 1
continue
# Create new if capacity
if self._created < self._max:
conn = self._new_connection_with_retry()
self._created += 1
return conn
# Wait for release
remaining = deadline - time.time()
if remaining <= 0:
raise TimeoutError("Timed out waiting for database connection from pool")
self._cond.wait(remaining)
def release(self, conn):
try:
try:
conn.rollback()
except Exception:
pass
if getattr(conn, "closed", False):
with self._cond:
self._created -= 1
self._cond.notify()
return
with self._cond:
self._idle.append(conn)
self._cond.notify()
except Exception:
# Drop silently on unexpected errors
with self._cond:
try:
self._created -= 1
except Exception:
pass
self._cond.notify()
@contextmanager
def connection(self, timeout: Optional[float] = None):
conn = self.acquire(timeout)
try:
yield conn
finally:
self.release(conn)
def closeall(self):
with self._cond:
for c in self._idle:
try:
c.close()
except Exception:
pass
self._idle.clear()
self._created = 0
class CacheManager:
"""Manages page caching and data storage using PostgreSQL."""
def __init__(self):
self.database_url = (config.DATABASE_URL or '').strip()
if not self.database_url.lower().startswith('postgresql'):
raise RuntimeError("DATABASE_URL must be a PostgreSQL URL, e.g., postgresql://user:pass@host:5432/db")
# Initialize a small connection pool to prevent excessive short-lived TCP connections
self._pool = _ConnectionPool(
self.database_url,
min_size=getattr(config, 'DB_POOL_MIN', 1),
max_size=getattr(config, 'DB_POOL_MAX', 6),
timeout=getattr(config, 'DB_POOL_TIMEOUT', 30),
)
self._init_db()
# ------------------------
# Connection helpers
# ------------------------
def _pg(self):
# Return a context manager yielding a pooled connection
return self._pool.connection()
def _init_db(self):
"""Initialize cache and data storage database with consolidated schema"""
with sqlite3.connect(self.db_path) as conn:
# HTML page cache table (existing)
conn.execute("""
CREATE TABLE IF NOT EXISTS cache (
url TEXT PRIMARY KEY,
content BLOB,
timestamp REAL,
status_code INTEGER
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_timestamp ON cache(timestamp)
""")
"""Initialize database schema if missing.
# Resource cache table (NEW: for ALL web resources - JS, CSS, images, fonts, etc.)
conn.execute("""
CREATE TABLE IF NOT EXISTS resource_cache (
url TEXT PRIMARY KEY,
content BLOB,
content_type TEXT,
status_code INTEGER,
headers TEXT,
timestamp REAL,
size_bytes INTEGER,
local_path TEXT
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_resource_timestamp ON resource_cache(timestamp)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_resource_content_type ON resource_cache(content_type)
""")
# Auctions table - consolidated schema
conn.execute("""
- Create tables with IF NOT EXISTS for PostgreSQL.
"""
with self._pg() as conn, conn.cursor() as cur:
# Auctions
cur.execute(
"""
CREATE TABLE IF NOT EXISTS auctions (
auction_id TEXT PRIMARY KEY,
url TEXT UNIQUE,
@@ -70,16 +173,31 @@ class CacheManager:
type TEXT,
lot_count INTEGER DEFAULT 0,
closing_time TEXT,
discovered_at INTEGER
discovered_at BIGINT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_auctions_country ON auctions(country)")
"""
)
cur.execute("CREATE INDEX IF NOT EXISTS idx_auctions_country ON auctions(country)")
# Lots table - consolidated schema with all fields from working database
conn.execute("""
# Cache
cur.execute(
"""
CREATE TABLE IF NOT EXISTS cache (
url TEXT PRIMARY KEY,
content BYTEA,
timestamp DOUBLE PRECISION,
status_code INTEGER
)
"""
)
cur.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON cache(timestamp)")
# Lots
cur.execute(
"""
CREATE TABLE IF NOT EXISTS lots (
lot_id TEXT PRIMARY KEY,
auction_id TEXT,
auction_id TEXT REFERENCES auctions(auction_id),
url TEXT UNIQUE,
title TEXT,
current_bid TEXT,
@@ -105,20 +223,20 @@ class CacheManager:
attributes_json TEXT,
first_bid_time TEXT,
last_bid_time TEXT,
bid_velocity REAL,
bid_increment REAL,
bid_velocity DOUBLE PRECISION,
bid_increment DOUBLE PRECISION,
year_manufactured INTEGER,
condition_score REAL,
condition_score DOUBLE PRECISION,
condition_description TEXT,
serial_number TEXT,
damage_description TEXT,
followers_count INTEGER DEFAULT 0,
estimated_min_price REAL,
estimated_max_price REAL,
estimated_min_price DOUBLE PRECISION,
estimated_max_price DOUBLE PRECISION,
lot_condition TEXT,
appearance TEXT,
estimated_min REAL,
estimated_max REAL,
estimated_min DOUBLE PRECISION,
estimated_max DOUBLE PRECISION,
next_bid_step_cents INTEGER,
condition TEXT,
category_path TEXT,
@@ -127,161 +245,84 @@ class CacheManager:
bidding_status TEXT,
packaging TEXT,
quantity INTEGER,
vat REAL,
buyer_premium_percentage REAL,
vat DOUBLE PRECISION,
buyer_premium_percentage DOUBLE PRECISION,
remarks TEXT,
reserve_price REAL,
reserve_price DOUBLE PRECISION,
reserve_met INTEGER,
view_count INTEGER,
api_data_json TEXT,
next_scrape_at INTEGER,
scrape_priority INTEGER DEFAULT 0,
FOREIGN KEY (auction_id) REFERENCES auctions(auction_id)
next_scrape_at BIGINT,
scrape_priority INTEGER DEFAULT 0
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_lots_sale_id ON lots(sale_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_lots_closing_time ON lots(closing_time)")
"""
)
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_sale_id ON lots(sale_id)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_closing_time ON lots(closing_time)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_next_scrape ON lots(next_scrape_at)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_lots_priority ON lots(scrape_priority DESC)")
# Images table
conn.execute("""
# Images
cur.execute(
"""
CREATE TABLE IF NOT EXISTS images (
id INTEGER PRIMARY KEY AUTOINCREMENT,
lot_id TEXT,
id SERIAL PRIMARY KEY,
lot_id TEXT REFERENCES lots(lot_id),
url TEXT,
local_path TEXT,
downloaded INTEGER DEFAULT 0,
labels TEXT,
processed_at INTEGER,
FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
processed_at BIGINT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_images_lot_id ON images(lot_id)")
"""
)
cur.execute("CREATE INDEX IF NOT EXISTS idx_images_lot_id ON images(lot_id)")
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_lot_url ON images(lot_id, url)")
# Remove duplicates before creating unique index
conn.execute("""
DELETE FROM images
WHERE id NOT IN (
SELECT MIN(id)
FROM images
GROUP BY lot_id, url
)
""")
conn.execute("""
CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_lot_url
ON images(lot_id, url)
""")
# Bid history table
conn.execute("""
# Bid history
cur.execute(
"""
CREATE TABLE IF NOT EXISTS bid_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
lot_id TEXT NOT NULL,
bid_amount REAL NOT NULL,
id SERIAL PRIMARY KEY,
lot_id TEXT REFERENCES lots(lot_id),
bid_amount DOUBLE PRECISION NOT NULL,
bid_time TEXT NOT NULL,
is_autobid INTEGER DEFAULT 0,
bidder_id TEXT,
bidder_number INTEGER,
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_bid_history_lot_time
ON bid_history(lot_id, bid_time)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_bid_history_bidder
ON bid_history(bidder_id)
""")
# MIGRATIONS: Add new columns to existing tables
self._run_migrations(conn)
"""
)
cur.execute("CREATE INDEX IF NOT EXISTS idx_bid_history_bidder ON bid_history(bidder_id)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_bid_history_lot_time ON bid_history(lot_id, bid_time)")
# Resource cache
cur.execute(
"""
CREATE TABLE IF NOT EXISTS resource_cache (
url TEXT PRIMARY KEY,
content BYTEA,
content_type TEXT,
status_code INTEGER,
headers TEXT,
timestamp DOUBLE PRECISION,
size_bytes INTEGER,
local_path TEXT
)
"""
)
cur.execute("CREATE INDEX IF NOT EXISTS idx_resource_timestamp ON resource_cache(timestamp)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_resource_content_type ON resource_cache(content_type)")
conn.commit()
return
def _run_migrations(self, conn):
"""Run database migrations to add new columns to existing tables"""
print("Checking for database migrations...")
# Check and add new columns to lots table
cursor = conn.execute("PRAGMA table_info(lots)")
lots_columns = {row[1] for row in cursor.fetchall()}
migrations_applied = False
if 'api_data_json' not in lots_columns:
print(" > Adding api_data_json column to lots table...")
conn.execute("ALTER TABLE lots ADD COLUMN api_data_json TEXT")
migrations_applied = True
if 'next_scrape_at' not in lots_columns:
print(" > Adding next_scrape_at column to lots table...")
conn.execute("ALTER TABLE lots ADD COLUMN next_scrape_at INTEGER")
migrations_applied = True
if 'scrape_priority' not in lots_columns:
print(" > Adding scrape_priority column to lots table...")
conn.execute("ALTER TABLE lots ADD COLUMN scrape_priority INTEGER DEFAULT 0")
migrations_applied = True
# Check resource_cache table structure
cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='resource_cache'")
resource_cache_exists = cursor.fetchone() is not None
if resource_cache_exists:
# Check if table has correct structure
cursor = conn.execute("PRAGMA table_info(resource_cache)")
resource_columns = {row[1] for row in cursor.fetchall()}
# Expected columns
expected_columns = {'url', 'content', 'content_type', 'status_code', 'headers', 'timestamp', 'size_bytes', 'local_path'}
if resource_columns != expected_columns:
print(" > Rebuilding resource_cache table with correct schema...")
# Backup old data count
cursor = conn.execute("SELECT COUNT(*) FROM resource_cache")
old_count = cursor.fetchone()[0]
print(f" (Preserving {old_count} cached resources)")
# Drop and recreate with correct schema
conn.execute("DROP TABLE IF EXISTS resource_cache")
conn.execute("""
CREATE TABLE resource_cache (
url TEXT PRIMARY KEY,
content BLOB,
content_type TEXT,
status_code INTEGER,
headers TEXT,
timestamp REAL,
size_bytes INTEGER,
local_path TEXT
)
""")
conn.execute("CREATE INDEX idx_resource_timestamp ON resource_cache(timestamp)")
conn.execute("CREATE INDEX idx_resource_content_type ON resource_cache(content_type)")
migrations_applied = True
print(" * resource_cache table rebuilt")
# Create indexes after migrations (when columns exist)
try:
conn.execute("CREATE INDEX IF NOT EXISTS idx_lots_priority ON lots(scrape_priority DESC)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_lots_next_scrape ON lots(next_scrape_at)")
except:
pass # Indexes might already exist
if migrations_applied:
print(" * Migrations complete")
else:
print(" * Database schema is up to date")
# SQLite migrations removed; PostgreSQL uses IF NOT EXISTS DDL above
def get(self, url: str, max_age_hours: int = 24) -> Optional[Dict]:
"""Get cached page if it exists and is not too old"""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute(
"SELECT content, timestamp, status_code FROM cache WHERE url = ?",
(url,)
)
row = cursor.fetchone()
with self._pg() as conn, conn.cursor() as cur:
cur.execute("SELECT content, timestamp, status_code FROM cache WHERE url = %s", (url,))
row = cur.fetchone()
if row:
content, timestamp, status_code = row
@@ -304,27 +345,35 @@ class CacheManager:
def set(self, url: str, content: str, status_code: int = 200):
"""Cache a page with compression"""
with sqlite3.connect(self.db_path) as conn:
compressed_content = zlib.compress(content.encode('utf-8'), level=9)
original_size = len(content.encode('utf-8'))
compressed_size = len(compressed_content)
ratio = (1 - compressed_size / original_size) * 100 if original_size > 0 else 0
compressed_content = zlib.compress(content.encode('utf-8'), level=9)
original_size = len(content.encode('utf-8'))
compressed_size = len(compressed_content)
ratio = (1 - compressed_size / original_size) * 100 if original_size > 0 else 0
conn.execute(
"INSERT OR REPLACE INTO cache (url, content, timestamp, status_code) VALUES (?, ?, ?, ?)",
(url, compressed_content, time.time(), status_code)
with self._pg() as conn, conn.cursor() as cur:
cur.execute(
"""
INSERT INTO cache (url, content, timestamp, status_code)
VALUES (%s, %s, %s, %s)
ON CONFLICT (url)
DO UPDATE SET content = EXCLUDED.content,
timestamp = EXCLUDED.timestamp,
status_code = EXCLUDED.status_code
""",
(url, compressed_content, time.time(), status_code),
)
conn.commit()
print(f" -> Cached: {url} (compressed {ratio:.1f}%)")
print(f" -> Cached: {url} (compressed {ratio:.1f}%)")
def clear_old(self, max_age_hours: int = 168):
"""Clear old cache entries to prevent database bloat"""
cutoff_time = time.time() - (max_age_hours * 3600)
with sqlite3.connect(self.db_path) as conn:
deleted = conn.execute("DELETE FROM cache WHERE timestamp < ?", (cutoff_time,)).rowcount
with self._pg() as conn, conn.cursor() as cur:
cur.execute("DELETE FROM cache WHERE timestamp < %s", (cutoff_time,))
deleted = cur.rowcount or 0
conn.commit()
if deleted > 0:
print(f" → Cleared {deleted} old cache entries")
if (deleted or 0) > 0:
print(f" → Cleared {deleted} old cache entries")
def save_auction(self, auction_data: Dict):
"""Save auction data to database"""
@@ -338,117 +387,186 @@ class CacheManager:
city = parts[0]
country = parts[-1]
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
INSERT OR REPLACE INTO auctions
(auction_id, url, title, location, lots_count, first_lot_closing_time, scraped_at,
city, country, type, lot_count, closing_time, discovered_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
auction_data['auction_id'],
auction_data['url'],
auction_data['title'],
location,
auction_data.get('lots_count', 0),
auction_data.get('first_lot_closing_time', ''),
auction_data['scraped_at'],
city,
country,
'online', # Troostwijk is online platform
auction_data.get('lots_count', 0), # Duplicate to lot_count for consistency
auction_data.get('first_lot_closing_time', ''), # Use first_lot_closing_time as closing_time
int(time.time())
))
conn.commit()
with self._pg() as conn, conn.cursor() as cur:
cur.execute(
"""
INSERT INTO auctions
(auction_id, url, title, location, lots_count, first_lot_closing_time, scraped_at,
city, country, type, lot_count, closing_time, discovered_at)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (auction_id)
DO UPDATE SET url = EXCLUDED.url,
title = EXCLUDED.title,
location = EXCLUDED.location,
lots_count = EXCLUDED.lots_count,
first_lot_closing_time = EXCLUDED.first_lot_closing_time,
scraped_at = EXCLUDED.scraped_at,
city = EXCLUDED.city,
country = EXCLUDED.country,
type = EXCLUDED.type,
lot_count = EXCLUDED.lot_count,
closing_time = EXCLUDED.closing_time,
discovered_at = EXCLUDED.discovered_at
""",
(
auction_data['auction_id'],
auction_data['url'],
auction_data['title'],
location,
auction_data.get('lots_count', 0),
auction_data.get('first_lot_closing_time', ''),
auction_data['scraped_at'],
city,
country,
'online',
auction_data.get('lots_count', 0),
auction_data.get('first_lot_closing_time', ''),
int(time.time()),
),
)
conn.commit()
def save_lot(self, lot_data: Dict):
"""Save lot data to database"""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
INSERT OR REPLACE INTO lots
(lot_id, auction_id, url, title, current_bid, starting_bid, minimum_bid,
bid_count, closing_time, viewing_time, pickup_date, location, description,
category, status, brand, model, attributes_json,
first_bid_time, last_bid_time, bid_velocity, bid_increment,
year_manufactured, condition_score, condition_description,
serial_number, manufacturer, damage_description,
followers_count, estimated_min_price, estimated_max_price, lot_condition, appearance,
scraped_at, api_data_json, next_scrape_at, scrape_priority)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
lot_data['lot_id'],
lot_data.get('auction_id', ''),
lot_data['url'],
lot_data['title'],
lot_data.get('current_bid', ''),
lot_data.get('starting_bid', ''),
lot_data.get('minimum_bid', ''),
lot_data.get('bid_count', 0),
lot_data.get('closing_time', ''),
lot_data.get('viewing_time', ''),
lot_data.get('pickup_date', ''),
lot_data.get('location', ''),
lot_data.get('description', ''),
lot_data.get('category', ''),
lot_data.get('status', ''),
lot_data.get('brand', ''),
lot_data.get('model', ''),
lot_data.get('attributes_json', ''),
lot_data.get('first_bid_time'),
lot_data.get('last_bid_time'),
lot_data.get('bid_velocity'),
lot_data.get('bid_increment'),
lot_data.get('year_manufactured'),
lot_data.get('condition_score'),
lot_data.get('condition_description', ''),
lot_data.get('serial_number', ''),
lot_data.get('manufacturer', ''),
lot_data.get('damage_description', ''),
lot_data.get('followers_count', 0),
lot_data.get('estimated_min_price'),
lot_data.get('estimated_max_price'),
lot_data.get('lot_condition', ''),
lot_data.get('appearance', ''),
lot_data['scraped_at'],
lot_data.get('api_data_json'),
lot_data.get('next_scrape_at'),
lot_data.get('scrape_priority', 0)
))
conn.commit()
params = (
lot_data['lot_id'],
lot_data.get('auction_id', ''),
lot_data['url'],
lot_data['title'],
lot_data.get('current_bid', ''),
lot_data.get('starting_bid', ''),
lot_data.get('minimum_bid', ''),
lot_data.get('bid_count', 0),
lot_data.get('closing_time', ''),
lot_data.get('viewing_time', ''),
lot_data.get('pickup_date', ''),
lot_data.get('location', ''),
lot_data.get('description', ''),
lot_data.get('category', ''),
lot_data.get('status', ''),
lot_data.get('brand', ''),
lot_data.get('model', ''),
lot_data.get('attributes_json', ''),
lot_data.get('first_bid_time'),
lot_data.get('last_bid_time'),
lot_data.get('bid_velocity'),
lot_data.get('bid_increment'),
lot_data.get('year_manufactured'),
lot_data.get('condition_score'),
lot_data.get('condition_description', ''),
lot_data.get('serial_number', ''),
lot_data.get('manufacturer', ''),
lot_data.get('damage_description', ''),
lot_data.get('followers_count', 0),
lot_data.get('estimated_min_price'),
lot_data.get('estimated_max_price'),
lot_data.get('lot_condition', ''),
lot_data.get('appearance', ''),
lot_data['scraped_at'],
lot_data.get('api_data_json'),
lot_data.get('next_scrape_at'),
lot_data.get('scrape_priority', 0),
)
with self._pg() as conn, conn.cursor() as cur:
cur.execute(
"""
INSERT INTO lots
(lot_id, auction_id, url, title, current_bid, starting_bid, minimum_bid,
bid_count, closing_time, viewing_time, pickup_date, location, description,
category, status, brand, model, attributes_json,
first_bid_time, last_bid_time, bid_velocity, bid_increment,
year_manufactured, condition_score, condition_description,
serial_number, manufacturer, damage_description,
followers_count, estimated_min_price, estimated_max_price, lot_condition, appearance,
scraped_at, api_data_json, next_scrape_at, scrape_priority)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (lot_id)
DO UPDATE SET auction_id = EXCLUDED.auction_id,
url = EXCLUDED.url,
title = EXCLUDED.title,
current_bid = EXCLUDED.current_bid,
starting_bid = EXCLUDED.starting_bid,
minimum_bid = EXCLUDED.minimum_bid,
bid_count = EXCLUDED.bid_count,
closing_time = EXCLUDED.closing_time,
viewing_time = EXCLUDED.viewing_time,
pickup_date = EXCLUDED.pickup_date,
location = EXCLUDED.location,
description = EXCLUDED.description,
category = EXCLUDED.category,
status = EXCLUDED.status,
brand = EXCLUDED.brand,
model = EXCLUDED.model,
attributes_json = EXCLUDED.attributes_json,
first_bid_time = EXCLUDED.first_bid_time,
last_bid_time = EXCLUDED.last_bid_time,
bid_velocity = EXCLUDED.bid_velocity,
bid_increment = EXCLUDED.bid_increment,
year_manufactured = EXCLUDED.year_manufactured,
condition_score = EXCLUDED.condition_score,
condition_description = EXCLUDED.condition_description,
serial_number = EXCLUDED.serial_number,
manufacturer = EXCLUDED.manufacturer,
damage_description = EXCLUDED.damage_description,
followers_count = EXCLUDED.followers_count,
estimated_min_price = EXCLUDED.estimated_min_price,
estimated_max_price = EXCLUDED.estimated_max_price,
lot_condition = EXCLUDED.lot_condition,
appearance = EXCLUDED.appearance,
scraped_at = EXCLUDED.scraped_at,
api_data_json = EXCLUDED.api_data_json,
next_scrape_at = EXCLUDED.next_scrape_at,
scrape_priority = EXCLUDED.scrape_priority
""",
params,
)
conn.commit()
def save_bid_history(self, lot_id: str, bid_records: List[Dict]):
"""Save bid history records to database"""
if not bid_records:
return
with sqlite3.connect(self.db_path) as conn:
# Clear existing bid history for this lot
conn.execute("DELETE FROM bid_history WHERE lot_id = ?", (lot_id,))
# Insert new records
with self._pg() as conn, conn.cursor() as cur:
cur.execute("DELETE FROM bid_history WHERE lot_id = %s", (lot_id,))
for record in bid_records:
conn.execute("""
cur.execute(
"""
INSERT INTO bid_history
(lot_id, bid_amount, bid_time, is_autobid, bidder_id, bidder_number)
VALUES (?, ?, ?, ?, ?, ?)
""", (
record['lot_id'],
record['bid_amount'],
record['bid_time'],
1 if record['is_autobid'] else 0,
record['bidder_id'],
record['bidder_number']
))
VALUES (%s, %s, %s, %s, %s, %s)
""",
(
record['lot_id'],
record['bid_amount'],
record['bid_time'],
1 if record['is_autobid'] else 0,
record['bidder_id'],
record['bidder_number'],
),
)
conn.commit()
def save_images(self, lot_id: str, image_urls: List[str]):
"""Save image URLs for a lot (prevents duplicates via unique constraint)"""
with sqlite3.connect(self.db_path) as conn:
with self._pg() as conn, conn.cursor() as cur:
for url in image_urls:
conn.execute("""
INSERT OR IGNORE INTO images (lot_id, url, downloaded)
VALUES (?, ?, 0)
""", (lot_id, url))
cur.execute(
"""
INSERT INTO images (lot_id, url, downloaded)
VALUES (%s, %s, 0)
ON CONFLICT (lot_id, url) DO NOTHING
""",
(lot_id, url),
)
conn.commit()
def update_image_local_path(self, lot_id: str, url: str, local_path: str):
with self._pg() as conn, conn.cursor() as cur:
cur.execute(
"UPDATE images SET local_path = %s, downloaded = 1 WHERE lot_id = %s AND url = %s",
(local_path, lot_id, url),
)
conn.commit()
def save_resource(self, url: str, content: bytes, content_type: str, status_code: int = 200,
@@ -458,18 +576,27 @@ class CacheManager:
Args:
cache_key: Optional composite key (url + body hash for POST requests)
"""
with sqlite3.connect(self.db_path) as conn:
headers_json = json.dumps(headers) if headers else None
size_bytes = len(content) if content else 0
headers_json = json.dumps(headers) if headers else None
size_bytes = len(content) if content else 0
key = cache_key if cache_key else url
# Use cache_key if provided, otherwise use url
key = cache_key if cache_key else url
conn.execute("""
INSERT OR REPLACE INTO resource_cache
with self._pg() as conn, conn.cursor() as cur:
cur.execute(
"""
INSERT INTO resource_cache
(url, content, content_type, status_code, headers, timestamp, size_bytes, local_path)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (key, content, content_type, status_code, headers_json, time.time(), size_bytes, local_path))
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (url)
DO UPDATE SET content = EXCLUDED.content,
content_type = EXCLUDED.content_type,
status_code = EXCLUDED.status_code,
headers = EXCLUDED.headers,
timestamp = EXCLUDED.timestamp,
size_bytes = EXCLUDED.size_bytes,
local_path = EXCLUDED.local_path
""",
(key, content, content_type, status_code, headers_json, time.time(), size_bytes, local_path),
)
conn.commit()
def get_resource(self, url: str, cache_key: Optional[str] = None) -> Optional[Dict]:
@@ -478,13 +605,13 @@ class CacheManager:
Args:
cache_key: Optional composite key to lookup
"""
with sqlite3.connect(self.db_path) as conn:
key = cache_key if cache_key else url
cursor = conn.execute("""
SELECT content, content_type, status_code, headers, timestamp, size_bytes, local_path
FROM resource_cache WHERE url = ?
""", (key,))
row = cursor.fetchone()
key = cache_key if cache_key else url
with self._pg() as conn, conn.cursor() as cur:
cur.execute(
"SELECT content, content_type, status_code, headers, timestamp, size_bytes, local_path FROM resource_cache WHERE url = %s",
(key,),
)
row = cur.fetchone()
if row:
return {
@@ -497,4 +624,107 @@ class CacheManager:
'local_path': row[6],
'cached': True
}
return None
return None
# ------------------------
# Query helpers for scraper/monitor/export
# ------------------------
def get_counts(self) -> Dict[str, int]:
with self._pg() as conn, conn.cursor() as cur:
cur.execute("SELECT COUNT(*) FROM auctions")
auctions = cur.fetchone()[0]
cur.execute("SELECT COUNT(*) FROM lots")
lots = cur.fetchone()[0]
return {"auctions": auctions, "lots": lots}
def get_lot_api_fields(self, lot_id: str) -> Optional[Tuple]:
sql = (
"SELECT followers_count, estimated_min_price, current_bid, bid_count, closing_time, status "
"FROM lots WHERE lot_id = %s"
)
params = (lot_id,)
with self._pg() as conn, conn.cursor() as cur:
cur.execute(sql, params)
return cur.fetchone()
def get_page_record_by_url(self, url: str) -> Optional[Dict]:
# Try lot record first by URL
with self._pg() as conn, conn.cursor() as cur:
cur.execute("SELECT * FROM lots WHERE url = %s", (url,))
lot_row = cur.fetchone()
if lot_row:
col_names = [desc.name for desc in cur.description]
lot_dict = dict(zip(col_names, lot_row))
return {"type": "lot", **lot_dict}
cur.execute("SELECT * FROM auctions WHERE url = %s", (url,))
auc_row = cur.fetchone()
if auc_row:
col_names = [desc.name for desc in cur.description]
auc_dict = dict(zip(col_names, auc_row))
return {"type": "auction", **auc_dict}
return None
def fetch_all(self, table: str) -> List[Dict]:
assert table in {"auctions", "lots"}
with self._pg() as conn, conn.cursor() as cur:
cur.execute(f"SELECT * FROM {table}")
rows = cur.fetchall()
col_names = [desc.name for desc in cur.description]
return [dict(zip(col_names, r)) for r in rows]
def get_lot_times(self, lot_id: str) -> Tuple[Optional[str], Optional[str]]:
sql = (
"SELECT viewing_time, pickup_date FROM lots WHERE lot_id = %s"
)
params = (lot_id,)
with self._pg() as conn, conn.cursor() as cur:
cur.execute(sql, params)
row = cur.fetchone()
if not row:
return None, None
return row[0], row[1]
def has_bid_history(self, lot_id: str) -> bool:
sql = ("SELECT COUNT(*) FROM bid_history WHERE lot_id = %s")
params = (lot_id,)
with self._pg() as conn, conn.cursor() as cur:
cur.execute(sql, params)
cnt = cur.fetchone()[0]
return cnt > 0
def get_downloaded_image_urls(self, lot_id: str) -> List[str]:
sql = ("SELECT url FROM images WHERE lot_id = %s AND downloaded = 1")
params = (lot_id,)
with self._pg() as conn, conn.cursor() as cur:
cur.execute(sql, params)
return [r[0] for r in cur.fetchall()]
# ------------------------
# Aggregation helpers for scraper
# ------------------------
def get_distinct_urls(self) -> Dict[str, List[str]]:
with self._pg() as conn, conn.cursor() as cur:
cur.execute("SELECT DISTINCT url FROM auctions")
auctions = [r[0] for r in cur.fetchall() if r and r[0]]
cur.execute("SELECT DISTINCT url FROM lots")
lots = [r[0] for r in cur.fetchall() if r and r[0]]
return {"auctions": auctions, "lots": lots}
def get_lot_priority_info(self, lot_id: str, url: str) -> Tuple[Optional[str], Optional[str], Optional[int], Optional[int]]:
with self._pg() as conn, conn.cursor() as cur:
cur.execute(
"""
SELECT closing_time, scraped_at, scrape_priority, next_scrape_at
FROM lots WHERE lot_id = %s OR url = %s
""",
(lot_id, url),
)
row = cur.fetchone()
if not row:
return None, None, None, None
return row[0], row[1], row[2], row[3]
def get_recent_cached_urls(self, limit: int = 10) -> List[str]:
with self._pg() as conn, conn.cursor() as cur:
cur.execute("SELECT url FROM cache ORDER BY timestamp DESC LIMIT %s", (limit,))
return [r[0] for r in cur.fetchall()]

View File

@@ -15,7 +15,27 @@ if sys.version_info < (3, 10):
# ==================== CONFIGURATION ====================
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"
# Primary database: PostgreSQL only
# Override via environment variable DATABASE_URL
# Example: postgresql://user:pass@host:5432/dbname
DATABASE_URL = os.getenv(
"DATABASE_URL",
# Default provided by ops
"postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb",
).strip()
# Database connection pool controls (to avoid creating too many short-lived TCP connections)
# Environment overrides: SCAEV_DB_POOL_MIN, SCAEV_DB_POOL_MAX, SCAEV_DB_POOL_TIMEOUT
def _int_env(name: str, default: int) -> int:
try:
return int(os.getenv(name, str(default)))
except Exception:
return default
DB_POOL_MIN = _int_env("SCAEV_DB_POOL_MIN", 1)
DB_POOL_MAX = _int_env("SCAEV_DB_POOL_MAX", 6)
DB_POOL_TIMEOUT = _int_env("SCAEV_DB_POOL_TIMEOUT", 30) # seconds to wait for a pooled connection
OUTPUT_DIR = "/mnt/okcomputer/output"
IMAGES_DIR = "/mnt/okcomputer/output/images"
RATE_LIMIT_SECONDS = 0.5 # EXACTLY 0.5 seconds between requests

View File

@@ -3,14 +3,12 @@
Database scaffolding for future SQLAlchemy 2.x usage.
Notes:
- We keep using the current SQLite + raw SQL for operational code.
- This module prepares an engine/session bound to DATABASE_URL, defaulting to
SQLite file in config.CACHE_DB path (for local dev only).
- PostgreSQL can be enabled by setting DATABASE_URL, e.g.:
DATABASE_URL=postgresql+psycopg://user:pass@localhost:5432/scaev
- The application now uses PostgreSQL exclusively via `config.DATABASE_URL`.
- This module prepares an engine/session bound to `DATABASE_URL`.
- Example URL: `postgresql+psycopg://user:pass@host:5432/scaev`
No runtime dependency from the scraper currently imports or uses this module.
It is present to bootstrap the gradual migration to SQLAlchemy 2.x.
It is present to bootstrap a possible future move to SQLAlchemy 2.x.
"""
from __future__ import annotations
@@ -19,14 +17,11 @@ import os
from typing import Optional
def get_database_url(sqlite_fallback_path: str) -> str:
def get_database_url() -> str:
url = os.getenv("DATABASE_URL")
if url and url.strip():
return url.strip()
# SQLite fallback
# Use a separate sqlite file when DATABASE_URL is not set; this does not
# alter the existing cache.db usage by raw SQL — it's just a dev convenience.
return f"sqlite:///{sqlite_fallback_path}"
if not url or not url.strip():
raise RuntimeError("DATABASE_URL must be set for PostgreSQL connection")
return url.strip()
def create_engine_and_session(database_url: str):
@@ -44,16 +39,15 @@ def create_engine_and_session(database_url: str):
return engine, SessionLocal
def get_sa(session_cached: dict, sqlite_fallback_path: str):
def get_sa(session_cached: dict):
"""Helper to lazily create and cache SQLAlchemy engine/session factory.
session_cached: dict — a mutable dict, e.g., module-level {}, to store engine and factory
sqlite_fallback_path: path to a sqlite file for local development
"""
if 'engine' in session_cached and 'SessionLocal' in session_cached:
return session_cached['engine'], session_cached['SessionLocal']
url = get_database_url(sqlite_fallback_path)
url = get_database_url()
engine, SessionLocal = create_engine_and_session(url)
session_cached['engine'] = engine
session_cached['SessionLocal'] = SessionLocal

View File

@@ -8,7 +8,6 @@ import sys
import asyncio
import json
import csv
import sqlite3
from datetime import datetime
from pathlib import Path
@@ -16,6 +15,17 @@ import config
from cache import CacheManager
from scraper import TroostwijkScraper
def mask_db_url(url: str) -> str:
try:
from urllib.parse import urlparse
p = urlparse(url)
user = p.username or ''
host = p.hostname or ''
port = f":{p.port}" if p.port else ''
return f"{p.scheme}://{user}:***@{host}{port}{p.path or ''}"
except Exception:
return url
def main():
"""Main execution"""
# Check for test mode
@@ -34,7 +44,7 @@ def main():
if config.OFFLINE:
print("OFFLINE MODE ENABLED — only database and cache will be used (no network)")
print(f"Rate limit: {config.RATE_LIMIT_SECONDS} seconds BETWEEN EVERY REQUEST")
print(f"Cache database: {config.CACHE_DB}")
print(f"Database URL: {mask_db_url(config.DATABASE_URL)}")
print(f"Output directory: {config.OUTPUT_DIR}")
print(f"Max listing pages: {config.MAX_PAGES}")
print("=" * 60)

View File

@@ -7,7 +7,6 @@ Runs indefinitely to keep database current with latest Troostwijk data
import asyncio
import time
from datetime import datetime
import sqlite3
import config
from cache import CacheManager
from scraper import TroostwijkScraper
@@ -82,21 +81,7 @@ class AuctionMonitor:
def _get_stats(self) -> dict:
"""Get current database statistics"""
conn = sqlite3.connect(self.scraper.cache.db_path)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM auctions")
auction_count = cursor.fetchone()[0]
cursor.execute("SELECT COUNT(*) FROM lots")
lot_count = cursor.fetchone()[0]
conn.close()
return {
'auctions': auction_count,
'lots': lot_count
}
return self.scraper.cache.get_counts()
async def start(self):
"""Start continuous monitoring loop"""
@@ -106,7 +91,7 @@ class AuctionMonitor:
if config.OFFLINE:
print("OFFLINE MODE ENABLED — only database and cache will be used (no network)")
print(f"Poll interval: {self.poll_interval / 60:.0f} minutes")
print(f"Cache database: {config.CACHE_DB}")
print(f"Database URL: {self._mask_db_url(config.DATABASE_URL)}")
print(f"Rate limit: {config.RATE_LIMIT_SECONDS}s between requests")
print("="*60)
print("\nPress Ctrl+C to stop\n")
@@ -135,6 +120,21 @@ class AuctionMonitor:
print(f"Last scan: {self.last_run.strftime('%Y-%m-%d %H:%M:%S')}")
print("\nDatabase remains intact with all collected data")
@staticmethod
def _mask_db_url(url: str) -> str:
try:
from urllib.parse import urlparse
parsed = urlparse(url)
if parsed.username:
user = parsed.username
host = parsed.hostname or ''
port = f":{parsed.port}" if parsed.port else ''
db = parsed.path or ''
return f"{parsed.scheme}://{user}:***@{host}{port}{db}"
except Exception:
pass
return url
def main():
"""Main entry point for monitor"""
import sys
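`_get_stats` now delegates to the cache layer instead of opening its own sqlite connection. A hypothetical sketch of what `CacheManager.get_counts()` could look like on the SQLAlchemy backend; the engine attribute and table names are inferred from the replaced SQL and may not match `src/cache.py` exactly.

```python
from sqlalchemy import text

class CacheManager:  # trimmed to the relevant method
    def __init__(self, engine):
        self.engine = engine  # assumption: the manager holds a SQLAlchemy engine

    def get_counts(self) -> dict:
        """Return row counts for the auctions and lots tables."""
        with self.engine.connect() as conn:
            auctions = conn.execute(text("SELECT COUNT(*) FROM auctions")).scalar_one()
            lots = conn.execute(text("SELECT COUNT(*) FROM lots")).scalar_one()
        return {'auctions': auctions, 'lots': lots}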

View File

@@ -3,7 +3,6 @@
Core scaev module for Scaev Auctions
"""
import os
import sqlite3
import asyncio
import time
import random
@@ -66,13 +65,8 @@ class TroostwijkScraper:
content = await response.read()
with open(filepath, 'wb') as f:
f.write(content)
with sqlite3.connect(self.cache.db_path) as conn:
conn.execute("UPDATE images\n"
"SET local_path = ?, downloaded = 1\n"
"WHERE lot_id = ? AND url = ?\n"
"", (str(filepath), lot_id, url))
conn.commit()
# Record download in DB
self.cache.update_image_local_path(lot_id, url, str(filepath))
return str(filepath)
except Exception as e:
@@ -237,71 +231,54 @@ class TroostwijkScraper:
if not result:
# OFFLINE fallback: try to construct page data directly from DB
if self.offline:
import sqlite3
conn = sqlite3.connect(self.cache.db_path)
cur = conn.cursor()
# Try lot first
cur.execute("SELECT * FROM lots WHERE url = ?", (url,))
lot_row = cur.fetchone()
if lot_row:
# Build a dict using column names
col_names = [d[0] for d in cur.description]
lot_dict = dict(zip(col_names, lot_row))
conn.close()
page_data = {
'type': 'lot',
'lot_id': lot_dict.get('lot_id'),
'auction_id': lot_dict.get('auction_id'),
'url': lot_dict.get('url') or url,
'title': lot_dict.get('title') or '',
'current_bid': lot_dict.get('current_bid') or '',
'bid_count': lot_dict.get('bid_count') or 0,
'closing_time': lot_dict.get('closing_time') or '',
'viewing_time': lot_dict.get('viewing_time') or '',
'pickup_date': lot_dict.get('pickup_date') or '',
'location': lot_dict.get('location') or '',
'description': lot_dict.get('description') or '',
'category': lot_dict.get('category') or '',
'status': lot_dict.get('status') or '',
'brand': lot_dict.get('brand') or '',
'model': lot_dict.get('model') or '',
'attributes_json': lot_dict.get('attributes_json') or '',
'first_bid_time': lot_dict.get('first_bid_time'),
'last_bid_time': lot_dict.get('last_bid_time'),
'bid_velocity': lot_dict.get('bid_velocity'),
'followers_count': lot_dict.get('followers_count') or 0,
'estimated_min_price': lot_dict.get('estimated_min_price'),
'estimated_max_price': lot_dict.get('estimated_max_price'),
'lot_condition': lot_dict.get('lot_condition') or '',
'appearance': lot_dict.get('appearance') or '',
'scraped_at': lot_dict.get('scraped_at') or '',
}
print(" OFFLINE: using DB record for lot")
self.visited_lots.add(url)
return page_data
# Try auction by URL
cur.execute("SELECT * FROM auctions WHERE url = ?", (url,))
auc_row = cur.fetchone()
if auc_row:
col_names = [d[0] for d in cur.description]
auc_dict = dict(zip(col_names, auc_row))
conn.close()
page_data = {
'type': 'auction',
'auction_id': auc_dict.get('auction_id'),
'url': auc_dict.get('url') or url,
'title': auc_dict.get('title') or '',
'location': auc_dict.get('location') or '',
'lots_count': auc_dict.get('lots_count') or 0,
'first_lot_closing_time': auc_dict.get('first_lot_closing_time') or '',
'scraped_at': auc_dict.get('scraped_at') or '',
}
print(" OFFLINE: using DB record for auction")
self.visited_lots.add(url)
return page_data
conn.close()
rec = self.cache.get_page_record_by_url(url)
if rec:
if rec.get('type') == 'lot':
page_data = {
'type': 'lot',
'lot_id': rec.get('lot_id'),
'auction_id': rec.get('auction_id'),
'url': rec.get('url') or url,
'title': rec.get('title') or '',
'current_bid': rec.get('current_bid') or '',
'bid_count': rec.get('bid_count') or 0,
'closing_time': rec.get('closing_time') or '',
'viewing_time': rec.get('viewing_time') or '',
'pickup_date': rec.get('pickup_date') or '',
'location': rec.get('location') or '',
'description': rec.get('description') or '',
'category': rec.get('category') or '',
'status': rec.get('status') or '',
'brand': rec.get('brand') or '',
'model': rec.get('model') or '',
'attributes_json': rec.get('attributes_json') or '',
'first_bid_time': rec.get('first_bid_time'),
'last_bid_time': rec.get('last_bid_time'),
'bid_velocity': rec.get('bid_velocity'),
'followers_count': rec.get('followers_count') or 0,
'estimated_min_price': rec.get('estimated_min_price'),
'estimated_max_price': rec.get('estimated_max_price'),
'lot_condition': rec.get('lot_condition') or '',
'appearance': rec.get('appearance') or '',
'scraped_at': rec.get('scraped_at') or '',
}
print(" OFFLINE: using DB record for lot")
self.visited_lots.add(url)
return page_data
else:
page_data = {
'type': 'auction',
'auction_id': rec.get('auction_id'),
'url': rec.get('url') or url,
'title': rec.get('title') or '',
'location': rec.get('location') or '',
'lots_count': rec.get('lots_count') or 0,
'first_lot_closing_time': rec.get('first_lot_closing_time') or '',
'scraped_at': rec.get('scraped_at') or '',
}
print(" OFFLINE: using DB record for auction")
self.visited_lots.add(url)
return page_data
return None
content = result['content']
@@ -363,7 +340,6 @@ class TroostwijkScraper:
# Fetch all API data concurrently (or use intercepted/cached data)
lot_id = page_data.get('lot_id')
auction_id = page_data.get('auction_id')
import sqlite3
# Step 1: Check if we intercepted API data during page load
intercepted_data = None
@@ -396,14 +372,7 @@ class TroostwijkScraper:
pass
elif from_cache:
# Check if we have cached API data in database
conn = sqlite3.connect(self.cache.db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT followers_count, estimated_min_price, current_bid, bid_count, closing_time, status
FROM lots WHERE lot_id = ?
""", (lot_id,))
existing = cursor.fetchone()
conn.close()
existing = self.cache.get_lot_api_fields(lot_id)
# Data quality check: Must have followers_count AND closing_time to be considered "complete"
# This prevents using stale records like old "0 bids" entries
@@ -469,14 +438,8 @@ class TroostwijkScraper:
# Add auction data fetch if we need viewing/pickup times
if auction_id:
conn = sqlite3.connect(self.cache.db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT viewing_time, pickup_date FROM lots WHERE lot_id = ?
""", (lot_id,))
times = cursor.fetchone()
conn.close()
has_times = times and (times[0] or times[1])
vt, pd = self.cache.get_lot_times(lot_id)
has_times = vt or pd
if not has_times:
task_map['auction'] = len(api_tasks)
@@ -671,14 +634,7 @@ class TroostwijkScraper:
self.cache.save_bid_history(lot_id, bid_data['bid_records'])
elif from_cache and page_data.get('bid_count', 0) > 0:
# Check if cached bid history exists
conn = sqlite3.connect(self.cache.db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT COUNT(*) FROM bid_history WHERE lot_id = ?
""", (lot_id,))
has_history = cursor.fetchone()[0] > 0
conn.close()
if has_history:
if self.cache.has_bid_history(lot_id):
print(f" Bid history cached")
else:
print(f" Bid: {page_data.get('current_bid', 'N/A')} (from HTML)")
@@ -704,15 +660,7 @@ class TroostwijkScraper:
if self.download_images:
# Check which images are already downloaded
import sqlite3
conn = sqlite3.connect(self.cache.db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT url FROM images
WHERE lot_id = ? AND downloaded = 1
""", (page_data['lot_id'],))
already_downloaded = {row[0] for row in cursor.fetchall()}
conn.close()
already_downloaded = set(self.cache.get_downloaded_image_urls(page_data['lot_id']))
# Only download missing images
images_to_download = [
@@ -775,25 +723,15 @@ class TroostwijkScraper:
Returns list of (priority, url, description) tuples sorted by priority (highest first)
"""
import sqlite3
prioritized = []
current_time = int(time.time())
conn = sqlite3.connect(self.cache.db_path)
cursor = conn.cursor()
for url in lot_urls:
# Extract lot_id from URL
lot_id = self.parser.extract_lot_id(url)
# Try to get existing data from database
cursor.execute("""
SELECT closing_time, scraped_at, scrape_priority, next_scrape_at
FROM lots WHERE lot_id = ? OR url = ?
""", (lot_id, url))
row = cursor.fetchone()
row = self.cache.get_lot_priority_info(lot_id, url)
if row:
closing_time, scraped_at, existing_priority, next_scrape_at = row
@@ -833,8 +771,6 @@ class TroostwijkScraper:
prioritized.append((priority, url, desc))
conn.close()
# Sort by priority (highest first)
prioritized.sort(key=lambda x: x[0], reverse=True)
@@ -845,14 +781,9 @@ class TroostwijkScraper:
if self.offline:
print("Launching OFFLINE crawl (no network requests)")
# Gather URLs from database
import sqlite3
conn = sqlite3.connect(self.cache.db_path)
cur = conn.cursor()
cur.execute("SELECT DISTINCT url FROM auctions")
auction_urls = [r[0] for r in cur.fetchall() if r and r[0]]
cur.execute("SELECT DISTINCT url FROM lots")
lot_urls = [r[0] for r in cur.fetchall() if r and r[0]]
conn.close()
urls = self.cache.get_distinct_urls()
auction_urls = urls['auctions']
lot_urls = urls['lots']
print(f" OFFLINE: {len(auction_urls)} auctions and {len(lot_urls)} lots in DB")
@@ -1072,23 +1003,17 @@ class TroostwijkScraper:
def export_to_files(self) -> Dict[str, str]:
"""Export database to CSV/JSON files"""
import sqlite3
import json
import csv
from datetime import datetime
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = os.path.dirname(self.cache.db_path)
conn = sqlite3.connect(self.cache.db_path)
conn.row_factory = sqlite3.Row
cursor = conn.cursor()
from config import OUTPUT_DIR as output_dir
files = {}
# Export auctions
cursor.execute("SELECT * FROM auctions")
auctions = [dict(row) for row in cursor.fetchall()]
auctions = self.cache.fetch_all('auctions')
auctions_csv = os.path.join(output_dir, f'auctions_{timestamp}.csv')
auctions_json = os.path.join(output_dir, f'auctions_{timestamp}.json')
@@ -1107,8 +1032,7 @@ class TroostwijkScraper:
print(f" Exported {len(auctions)} auctions")
# Export lots
cursor.execute("SELECT * FROM lots")
lots = [dict(row) for row in cursor.fetchall()]
lots = self.cache.fetch_all('lots')
lots_csv = os.path.join(output_dir, f'lots_{timestamp}.csv')
lots_json = os.path.join(output_dir, f'lots_{timestamp}.json')
@@ -1126,5 +1050,4 @@ class TroostwijkScraper:
files['lots_json'] = lots_json
print(f" Exported {len(lots)} lots")
conn.close()
return files
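The scraper-side refactor replaces the inline `sqlite3.connect` blocks with named `CacheManager` helpers. Hypothetical shapes for three of them, based on the SQL they replace; the real implementations live in `src/cache.py` and may differ in signature and return types.

```python
from typing import List, Optional
from sqlalchemy import text

class CacheManager:  # trimmed to the relevant methods
    def __init__(self, engine):
        self.engine = engine  # assumption: the manager holds a SQLAlchemy engine

    def get_page_record_by_url(self, url: str) -> Optional[dict]:
        """Used by the OFFLINE fallback: return a lot or auction row for a URL, tagged with 'type'."""
        with self.engine.connect() as conn:
            row = conn.execute(text("SELECT * FROM lots WHERE url = :u"), {"u": url}).mappings().first()
            if row:
                rec = dict(row)
                rec['type'] = 'lot'
                return rec
            row = conn.execute(text("SELECT * FROM auctions WHERE url = :u"), {"u": url}).mappings().first()
            if row:
                rec = dict(row)
                rec['type'] = 'auction'
                return rec
        return None

    def get_downloaded_image_urls(self, lot_id: str) -> List[str]:
        """URLs of images already marked as downloaded for a lot."""
        with self.engine.connect() as conn:
            rows = conn.execute(
                text("SELECT url FROM images WHERE lot_id = :lot_id AND downloaded = 1"),
                {"lot_id": lot_id},
            )
            return [r[0] for r in rows]

    def has_bid_history(self, lot_id: str) -> bool:
        """True if any bid_history rows exist for the lot."""
        with self.engine.connect() as conn:
            count = conn.execute(
                text("SELECT COUNT(*) FROM bid_history WHERE lot_id = :lot_id"),
                {"lot_id": lot_id},
            ).scalar_one()
        return count > 0
```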

View File

@@ -4,7 +4,6 @@ Test module for debugging extraction patterns
"""
import sys
import sqlite3
import time
import re
import json
@@ -27,10 +26,11 @@ def test_extraction(
if not cached:
print(f"ERROR: URL not found in cache: {test_url}")
print(f"\nAvailable cached URLs:")
with sqlite3.connect(config.CACHE_DB) as conn:
cursor = conn.execute("SELECT url FROM cache ORDER BY timestamp DESC LIMIT 10")
for row in cursor.fetchall():
print(f" - {row[0]}")
try:
for url in scraper.cache.get_recent_cached_urls(limit=10):
print(f" - {url}")
except Exception as e:
print(f" (failed to list recent cached URLs: {e})")
return
content = cached['content']
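The fallback listing of recent cache entries now goes through `get_recent_cached_urls`. A minimal sketch under the assumption that the `cache` table keeps a `timestamp` column, as in the raw SQL it replaces; the helper name and signature come from the hunk above.

```python
from typing import List
from sqlalchemy import text

def get_recent_cached_urls(engine, limit: int = 10) -> List[str]:
    """Most recently cached page URLs, newest first."""
    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT url FROM cache ORDER BY timestamp DESC LIMIT :limit"),
            {"limit": limit},
        )
        return [r[0] for r in rows]
```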

View File

@@ -1,303 +0,0 @@
#!/usr/bin/env python3
"""
Test cache behavior - verify page is only fetched once and data persists offline
"""
import sys
import os
import asyncio
import sqlite3
import time
from pathlib import Path
# Add src to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from cache import CacheManager
from scraper import TroostwijkScraper
import config
class TestCacheBehavior:
"""Test suite for cache and offline functionality"""
def __init__(self):
self.test_db = "test_cache.db"
self.original_db = config.CACHE_DB
self.cache = None
self.scraper = None
def setup(self):
"""Setup test environment"""
print("\n" + "="*60)
print("TEST SETUP")
print("="*60)
# Use test database
config.CACHE_DB = self.test_db
# Ensure offline mode is disabled for tests
config.OFFLINE = False
# Clean up old test database
if os.path.exists(self.test_db):
os.remove(self.test_db)
print(f" * Removed old test database")
# Initialize cache and scraper
self.cache = CacheManager()
self.scraper = TroostwijkScraper()
self.scraper.offline = False # Explicitly disable offline mode
print(f" * Created test database: {self.test_db}")
print(f" * Initialized cache and scraper")
print(f" * Offline mode: DISABLED")
def teardown(self):
"""Cleanup test environment"""
print("\n" + "="*60)
print("TEST TEARDOWN")
print("="*60)
# Restore original database path
config.CACHE_DB = self.original_db
# Keep test database for inspection
print(f" * Test database preserved: {self.test_db}")
print(f" * Restored original database path")
async def test_page_fetched_once(self):
"""Test that a page is only fetched from network once"""
print("\n" + "="*60)
print("TEST 1: Page Fetched Only Once")
print("="*60)
# Pick a real lot URL to test with
test_url = "https://www.troostwijkauctions.com/l/bmw-x5-xdrive40d-high-executive-m-sport-a8-286pk-2019-A1-26955-7"
print(f"\nTest URL: {test_url}")
# First visit - should fetch from network
print("\n--- FIRST VISIT (should fetch from network) ---")
start_time = time.time()
async with asyncio.timeout(60): # 60 second timeout
page_data_1 = await self._scrape_single_page(test_url)
first_visit_time = time.time() - start_time
if not page_data_1:
print(" [FAIL] First visit returned no data")
return False
print(f" [OK] First visit completed in {first_visit_time:.2f}s")
print(f" [OK] Got lot data: {page_data_1.get('title', 'N/A')[:60]}...")
# Check closing time was captured
closing_time_1 = page_data_1.get('closing_time')
print(f" [OK] Closing time: {closing_time_1}")
# Second visit - should use cache
print("\n--- SECOND VISIT (should use cache) ---")
start_time = time.time()
async with asyncio.timeout(30): # Should be much faster
page_data_2 = await self._scrape_single_page(test_url)
second_visit_time = time.time() - start_time
if not page_data_2:
print(" [FAIL] Second visit returned no data")
return False
print(f" [OK] Second visit completed in {second_visit_time:.2f}s")
# Verify data matches
if page_data_1.get('lot_id') != page_data_2.get('lot_id'):
print(f" [FAIL] Lot IDs don't match")
return False
closing_time_2 = page_data_2.get('closing_time')
print(f" [OK] Closing time: {closing_time_2}")
if closing_time_1 != closing_time_2:
print(f" [FAIL] Closing times don't match!")
print(f" First: {closing_time_1}")
print(f" Second: {closing_time_2}")
return False
# Verify second visit was significantly faster (used cache)
if second_visit_time >= first_visit_time * 0.5:
print(f" [WARN] Second visit not significantly faster")
print(f" First: {first_visit_time:.2f}s")
print(f" Second: {second_visit_time:.2f}s")
else:
print(f" [OK] Second visit was {(first_visit_time / second_visit_time):.1f}x faster (cache working!)")
# Verify resource cache has entries
conn = sqlite3.connect(self.test_db)
cursor = conn.execute("SELECT COUNT(*) FROM resource_cache")
resource_count = cursor.fetchone()[0]
conn.close()
print(f" [OK] Cached {resource_count} resources")
print("\n[PASS] TEST 1 PASSED: Page fetched only once, data persists")
return True
async def test_offline_mode(self):
"""Test that offline mode works with cached data"""
print("\n" + "="*60)
print("TEST 2: Offline Mode with Cached Data")
print("="*60)
# Use the same URL from test 1 (should be cached)
test_url = "https://www.troostwijkauctions.com/l/bmw-x5-xdrive40d-high-executive-m-sport-a8-286pk-2019-A1-26955-7"
# Enable offline mode
original_offline = config.OFFLINE
config.OFFLINE = True
self.scraper.offline = True
print(f"\nTest URL: {test_url}")
print(" * Offline mode: ENABLED")
try:
# Try to scrape in offline mode
print("\n--- OFFLINE SCRAPE (should use DB/cache only) ---")
start_time = time.time()
async with asyncio.timeout(30):
page_data = await self._scrape_single_page(test_url)
offline_time = time.time() - start_time
if not page_data:
print(" [FAIL] Offline mode returned no data")
return False
print(f" [OK] Offline scrape completed in {offline_time:.2f}s")
print(f" [OK] Got lot data: {page_data.get('title', 'N/A')[:60]}...")
# Check closing time is available
closing_time = page_data.get('closing_time')
if not closing_time:
print(f" [FAIL] No closing time in offline mode")
return False
print(f" [OK] Closing time preserved: {closing_time}")
# Verify essential fields are present
essential_fields = ['lot_id', 'title', 'url', 'location']
missing_fields = [f for f in essential_fields if not page_data.get(f)]
if missing_fields:
print(f" [FAIL] Missing essential fields: {missing_fields}")
return False
print(f" [OK] All essential fields present")
# Check database has the lot
conn = sqlite3.connect(self.test_db)
cursor = conn.execute("SELECT closing_time FROM lots WHERE url = ?", (test_url,))
row = cursor.fetchone()
conn.close()
if not row:
print(f" [FAIL] Lot not found in database")
return False
db_closing_time = row[0]
print(f" [OK] Database has closing time: {db_closing_time}")
if db_closing_time != closing_time:
print(f" [FAIL] Closing time mismatch")
print(f" Scraped: {closing_time}")
print(f" Database: {db_closing_time}")
return False
print("\n[PASS] TEST 2 PASSED: Offline mode works, closing time preserved")
return True
finally:
# Restore offline mode
config.OFFLINE = original_offline
self.scraper.offline = original_offline
async def _scrape_single_page(self, url):
"""Helper to scrape a single page"""
from playwright.async_api import async_playwright
if config.OFFLINE or self.scraper.offline:
# Offline mode - use crawl_page directly
return await self.scraper.crawl_page(page=None, url=url)
# Online mode - need browser
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
try:
result = await self.scraper.crawl_page(page, url)
return result
finally:
await browser.close()
async def run_all_tests(self):
"""Run all tests"""
print("\n" + "="*70)
print("CACHE BEHAVIOR TEST SUITE")
print("="*70)
self.setup()
results = []
try:
# Test 1: Page fetched once
result1 = await self.test_page_fetched_once()
results.append(("Page Fetched Once", result1))
# Test 2: Offline mode
result2 = await self.test_offline_mode()
results.append(("Offline Mode", result2))
except Exception as e:
print(f"\n[ERROR] TEST SUITE ERROR: {e}")
import traceback
traceback.print_exc()
finally:
self.teardown()
# Print summary
print("\n" + "="*70)
print("TEST SUMMARY")
print("="*70)
all_passed = True
for test_name, passed in results:
status = "[PASS]" if passed else "[FAIL]"
print(f" {status}: {test_name}")
if not passed:
all_passed = False
print("="*70)
if all_passed:
print("\n*** ALL TESTS PASSED! ***")
return 0
else:
print("\n*** SOME TESTS FAILED ***")
return 1
async def main():
"""Run tests"""
tester = TestCacheBehavior()
exit_code = await tester.run_all_tests()
sys.exit(exit_code)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,51 +0,0 @@
#!/usr/bin/env python3
import sys
import os
parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
sys.path.insert(0, parent_dir)
sys.path.insert(0, os.path.join(parent_dir, 'src'))
import asyncio
from scraper import TroostwijkScraper
import config
import os
async def test():
# Force online mode
os.environ['SCAEV_OFFLINE'] = '0'
config.OFFLINE = False
scraper = TroostwijkScraper()
scraper.offline = False
from playwright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
url = "https://www.troostwijkauctions.com/l/used-dometic-seastar-tfxchx8641p-top-mount-engine-control-liver-A1-39684-12"
# Add debug logging to parser
original_parse = scraper.parser.parse_page
def debug_parse(content, url):
result = original_parse(content, url)
if result:
print(f"PARSER OUTPUT:")
print(f" description: {result.get('description', 'NONE')[:100] if result.get('description') else 'EMPTY'}")
print(f" closing_time: {result.get('closing_time', 'NONE')}")
print(f" bid_count: {result.get('bid_count', 'NONE')}")
return result
scraper.parser.parse_page = debug_parse
page_data = await scraper.crawl_page(page, url)
await browser.close()
print(f"\nFINAL page_data:")
print(f" description: {page_data.get('description', 'NONE')[:100] if page_data and page_data.get('description') else 'EMPTY'}")
print(f" closing_time: {page_data.get('closing_time', 'NONE') if page_data else 'NONE'}")
print(f" bid_count: {page_data.get('bid_count', 'NONE') if page_data else 'NONE'}")
print(f" status: {page_data.get('status', 'NONE') if page_data else 'NONE'}")
asyncio.run(test())

View File

@@ -1,85 +0,0 @@
import asyncio
import types
import sys
from pathlib import Path
import pytest
@pytest.mark.asyncio
async def test_fetch_lot_bidding_data_403(monkeypatch):
"""
Simulate a 403 from the GraphQL endpoint and verify:
- Function returns None (graceful handling)
- It attempts a retry and logs a clear 403 message
"""
# Load modules directly from src using importlib to avoid path issues
project_root = Path(__file__).resolve().parents[1]
src_path = project_root / 'src'
import importlib.util
def _load_module(name, file_path):
spec = importlib.util.spec_from_file_location(name, str(file_path))
module = importlib.util.module_from_spec(spec)
sys.modules[name] = module
spec.loader.exec_module(module) # type: ignore
return module
# Load config first because graphql_client imports it by module name
config = _load_module('config', src_path / 'config.py')
graphql_client = _load_module('graphql_client', src_path / 'graphql_client.py')
monkeypatch.setattr(config, "OFFLINE", False, raising=False)
log_messages = []
def fake_print(*args, **kwargs):
msg = " ".join(str(a) for a in args)
log_messages.append(msg)
import builtins
monkeypatch.setattr(builtins, "print", fake_print)
class MockResponse:
def __init__(self, status=403, text_body="Forbidden"):
self.status = status
self._text_body = text_body
async def json(self):
return {}
async def text(self):
return self._text_body
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc, tb):
return False
class MockSession:
def __init__(self, *args, **kwargs):
pass
def post(self, *args, **kwargs):
# Always return 403
return MockResponse(403, "Forbidden by WAF")
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc, tb):
return False
# Patch aiohttp.ClientSession to our mock
import types as _types
dummy_aiohttp = _types.SimpleNamespace()
dummy_aiohttp.ClientSession = MockSession
# Ensure that an `import aiohttp` inside the function resolves to our dummy
monkeypatch.setitem(sys.modules, 'aiohttp', dummy_aiohttp)
result = await graphql_client.fetch_lot_bidding_data("A1-40179-35")
# Should gracefully return None
assert result is None
# Should have logged a 403 at least once
assert any("GraphQL API error: 403" in m for m in log_messages)

View File

@@ -1,208 +0,0 @@
#!/usr/bin/env python3
"""
Test to validate that all expected fields are populated after scraping
"""
import sys
import os
import asyncio
import sqlite3
# Add parent and src directory to path
parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
sys.path.insert(0, parent_dir)
sys.path.insert(0, os.path.join(parent_dir, 'src'))
# Force online mode before importing
os.environ['SCAEV_OFFLINE'] = '0'
from scraper import TroostwijkScraper
import config
async def test_lot_has_all_fields():
"""Test that a lot page has all expected fields populated"""
print("\n" + "="*60)
print("TEST: Lot has all required fields")
print("="*60)
# Use the example lot from user
test_url = "https://www.troostwijkauctions.com/l/radaway-idea-black-dwj-doucheopstelling-A1-39956-18"
# Ensure we're not in offline mode
config.OFFLINE = False
scraper = TroostwijkScraper()
scraper.offline = False
print(f"\n[1] Scraping: {test_url}")
# Start playwright and scrape
from playwright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
page_data = await scraper.crawl_page(page, test_url)
await browser.close()
if not page_data:
print(" [FAIL] No data returned")
return False
print(f"\n[2] Validating fields...")
# Fields that MUST have values (critical for auction functionality)
required_fields = {
'closing_time': 'Closing time',
'current_bid': 'Current bid',
'bid_count': 'Bid count',
'status': 'Status',
}
# Fields that SHOULD have values but may legitimately be empty
optional_fields = {
'description': 'Description',
}
missing_fields = []
empty_fields = []
optional_missing = []
# Check required fields
for field, label in required_fields.items():
value = page_data.get(field)
if value is None:
missing_fields.append(label)
print(f" [FAIL] {label}: MISSING (None)")
elif value == '' or value == 0 or value == 'No bids':
# Special case: 'No bids' is only acceptable if bid_count is 0
if field == 'current_bid' and page_data.get('bid_count', 0) == 0:
print(f" [PASS] {label}: '{value}' (acceptable - no bids)")
else:
empty_fields.append(label)
print(f" [FAIL] {label}: EMPTY ('{value}')")
else:
print(f" [PASS] {label}: {value}")
# Check optional fields (warn but don't fail)
for field, label in optional_fields.items():
value = page_data.get(field)
if value is None or value == '':
optional_missing.append(label)
print(f" [WARN] {label}: EMPTY (may be legitimate)")
else:
print(f" [PASS] {label}: {value[:50]}...")
# Check database
print(f"\n[3] Checking database entry...")
conn = sqlite3.connect(scraper.cache.db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT closing_time, current_bid, bid_count, description, status
FROM lots WHERE url = ?
""", (test_url,))
row = cursor.fetchone()
conn.close()
if row:
db_closing, db_bid, db_count, db_desc, db_status = row
print(f" DB closing_time: {db_closing or 'EMPTY'}")
print(f" DB current_bid: {db_bid or 'EMPTY'}")
print(f" DB bid_count: {db_count}")
print(f" DB description: {db_desc[:50] if db_desc else 'EMPTY'}...")
print(f" DB status: {db_status or 'EMPTY'}")
# Verify DB matches page_data
if db_closing != page_data.get('closing_time'):
print(f" [WARN] DB closing_time doesn't match page_data")
if db_count != page_data.get('bid_count'):
print(f" [WARN] DB bid_count doesn't match page_data")
else:
print(f" [WARN] No database entry found")
print(f"\n" + "="*60)
if missing_fields or empty_fields:
print(f"[FAIL] Missing fields: {', '.join(missing_fields)}")
print(f"[FAIL] Empty fields: {', '.join(empty_fields)}")
if optional_missing:
print(f"[WARN] Optional missing: {', '.join(optional_missing)}")
return False
else:
print("[PASS] All required fields are populated")
if optional_missing:
print(f"[WARN] Optional missing: {', '.join(optional_missing)}")
return True
async def test_lot_with_description():
"""Test that a lot with description preserves it"""
print("\n" + "="*60)
print("TEST: Lot with description")
print("="*60)
# Use a lot known to have description
test_url = "https://www.troostwijkauctions.com/l/used-dometic-seastar-tfxchx8641p-top-mount-engine-control-liver-A1-39684-12"
config.OFFLINE = False
scraper = TroostwijkScraper()
scraper.offline = False
print(f"\n[1] Scraping: {test_url}")
from playwright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
page_data = await scraper.crawl_page(page, test_url)
await browser.close()
if not page_data:
print(" [FAIL] No data returned")
return False
print(f"\n[2] Checking description...")
description = page_data.get('description', '')
if not description or description == '':
print(f" [FAIL] Description is empty")
return False
else:
print(f" [PASS] Description: {description[:100]}...")
return True
async def main():
"""Run all tests"""
print("\n" + "="*60)
print("MISSING FIELDS TEST SUITE")
print("="*60)
test1 = await test_lot_has_all_fields()
test2 = await test_lot_with_description()
print("\n" + "="*60)
if test1 and test2:
print("ALL TESTS PASSED")
else:
print("SOME TESTS FAILED")
if not test1:
print(" - test_lot_has_all_fields FAILED")
if not test2:
print(" - test_lot_with_description FAILED")
print("="*60 + "\n")
return 0 if (test1 and test2) else 1
if __name__ == '__main__':
exit_code = asyncio.run(main())
sys.exit(exit_code)

View File

@@ -1,335 +0,0 @@
#!/usr/bin/env python3
"""
Test suite for Troostwijk Scraper
Tests both auction and lot parsing with cached data
Requires Python 3.10+
"""
import sys
# Require Python 3.10+
if sys.version_info < (3, 10):
print("ERROR: This script requires Python 3.10 or higher")
print(f"Current version: {sys.version}")
sys.exit(1)
import asyncio
import json
import sqlite3
from datetime import datetime
from pathlib import Path
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent))
from main import TroostwijkScraper, CacheManager, CACHE_DB
# Test URLs - these will use cached data to avoid overloading the server
TEST_AUCTIONS = [
"https://www.troostwijkauctions.com/a/online-auction-cnc-lathes-machining-centres-precision-measurement-romania-A7-39813",
"https://www.troostwijkauctions.com/a/faillissement-bab-shortlease-i-ii-b-v-%E2%80%93-2024-big-ass-energieopslagsystemen-A1-39557",
"https://www.troostwijkauctions.com/a/industriele-goederen-uit-diverse-bedrijfsbeeindigingen-A1-38675",
]
TEST_LOTS = [
"https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5",
"https://www.troostwijkauctions.com/l/tos-sui-50-1000-universele-draaibank-A7-39568-9",
"https://www.troostwijkauctions.com/l/rolcontainer-%25282x%2529-A1-40191-101",
]
class TestResult:
def __init__(self, url, success, message, data=None):
self.url = url
self.success = success
self.message = message
self.data = data
class ScraperTester:
def __init__(self):
self.scraper = TroostwijkScraper()
self.results = []
def check_cache_exists(self, url):
"""Check if URL is cached"""
cached = self.scraper.cache.get(url, max_age_hours=999999) # Get even old cache
return cached is not None
def test_auction_parsing(self, url):
"""Test auction page parsing"""
print(f"\n{'='*70}")
print(f"Testing Auction: {url}")
print('='*70)
# Check cache
if not self.check_cache_exists(url):
return TestResult(
url,
False,
"❌ NOT IN CACHE - Please run scraper first to cache this URL",
None
)
# Get cached content
cached = self.scraper.cache.get(url, max_age_hours=999999)
content = cached['content']
print(f"✓ Cache hit (age: {(datetime.now().timestamp() - cached['timestamp']) / 3600:.1f} hours)")
# Parse
try:
data = self.scraper._parse_page(content, url)
if not data:
return TestResult(url, False, "❌ Parsing returned None", None)
if data.get('type') != 'auction':
return TestResult(
url,
False,
f"❌ Expected type='auction', got '{data.get('type')}'",
data
)
# Validate required fields
issues = []
required_fields = {
'auction_id': str,
'title': str,
'location': str,
'lots_count': int,
'first_lot_closing_time': str,
}
for field, expected_type in required_fields.items():
value = data.get(field)
if value is None or value == '':
issues.append(f"{field}: MISSING or EMPTY")
elif not isinstance(value, expected_type):
issues.append(f"{field}: Wrong type (expected {expected_type.__name__}, got {type(value).__name__})")
else:
# Pretty print value
display_value = str(value)[:60]
print(f"{field}: {display_value}")
if issues:
return TestResult(url, False, "\n".join(issues), data)
print(f" ✓ lots_count: {data.get('lots_count')}")
return TestResult(url, True, "✅ All auction fields validated successfully", data)
except Exception as e:
return TestResult(url, False, f"❌ Exception during parsing: {e}", None)
def test_lot_parsing(self, url):
"""Test lot page parsing"""
print(f"\n{'='*70}")
print(f"Testing Lot: {url}")
print('='*70)
# Check cache
if not self.check_cache_exists(url):
return TestResult(
url,
False,
"❌ NOT IN CACHE - Please run scraper first to cache this URL",
None
)
# Get cached content
cached = self.scraper.cache.get(url, max_age_hours=999999)
content = cached['content']
print(f"✓ Cache hit (age: {(datetime.now().timestamp() - cached['timestamp']) / 3600:.1f} hours)")
# Parse
try:
data = self.scraper._parse_page(content, url)
if not data:
return TestResult(url, False, "❌ Parsing returned None", None)
if data.get('type') != 'lot':
return TestResult(
url,
False,
f"❌ Expected type='lot', got '{data.get('type')}'",
data
)
# Validate required fields
issues = []
required_fields = {
'lot_id': (str, lambda x: x and len(x) > 0),
'title': (str, lambda x: x and len(x) > 3 and x not in ['...', 'N/A']),
'location': (str, lambda x: x and len(x) > 2 and x not in ['Locatie', 'Location']),
'current_bid': (str, lambda x: x and x not in ['€Huidig bod', 'Huidig bod']),
'closing_time': (str, lambda x: True), # Can be empty
'images': (list, lambda x: True), # Can be empty list
}
for field, (expected_type, validator) in required_fields.items():
value = data.get(field)
if value is None:
issues.append(f"{field}: MISSING (None)")
elif not isinstance(value, expected_type):
issues.append(f"{field}: Wrong type (expected {expected_type.__name__}, got {type(value).__name__})")
elif not validator(value):
issues.append(f"{field}: Invalid value: '{value}'")
else:
# Pretty print value
if field == 'images':
print(f"{field}: {len(value)} images")
for i, img in enumerate(value[:3], 1):
print(f" {i}. {img[:60]}...")
else:
display_value = str(value)[:60]
print(f"{field}: {display_value}")
# Additional checks
if data.get('bid_count') is not None:
print(f" ✓ bid_count: {data.get('bid_count')}")
if data.get('viewing_time'):
print(f" ✓ viewing_time: {data.get('viewing_time')}")
if data.get('pickup_date'):
print(f" ✓ pickup_date: {data.get('pickup_date')}")
if issues:
return TestResult(url, False, "\n".join(issues), data)
return TestResult(url, True, "✅ All lot fields validated successfully", data)
except Exception as e:
import traceback
return TestResult(url, False, f"❌ Exception during parsing: {e}\n{traceback.format_exc()}", None)
def run_all_tests(self):
"""Run all tests"""
print("\n" + "="*70)
print("TROOSTWIJK SCRAPER TEST SUITE")
print("="*70)
print("\nThis test suite uses CACHED data only - no live requests to server")
print("="*70)
# Test auctions
print("\n" + "="*70)
print("TESTING AUCTIONS")
print("="*70)
for url in TEST_AUCTIONS:
result = self.test_auction_parsing(url)
self.results.append(result)
# Test lots
print("\n" + "="*70)
print("TESTING LOTS")
print("="*70)
for url in TEST_LOTS:
result = self.test_lot_parsing(url)
self.results.append(result)
# Summary
self.print_summary()
def print_summary(self):
"""Print test summary"""
print("\n" + "="*70)
print("TEST SUMMARY")
print("="*70)
passed = sum(1 for r in self.results if r.success)
failed = sum(1 for r in self.results if not r.success)
total = len(self.results)
print(f"\nTotal tests: {total}")
print(f"Passed: {passed}")
print(f"Failed: {failed}")
print(f"Success rate: {passed/total*100:.1f}%")
if failed > 0:
print("\n" + "="*70)
print("FAILED TESTS:")
print("="*70)
for result in self.results:
if not result.success:
print(f"\n{result.url}")
print(result.message)
if result.data:
print("\nParsed data:")
for key, value in result.data.items():
if key != 'lots': # Don't print full lots array
print(f" {key}: {str(value)[:80]}")
print("\n" + "="*70)
return failed == 0
def check_cache_status():
"""Check cache compression status"""
print("\n" + "="*70)
print("CACHE STATUS CHECK")
print("="*70)
try:
with sqlite3.connect(CACHE_DB) as conn:
# Total entries
cursor = conn.execute("SELECT COUNT(*) FROM cache")
total = cursor.fetchone()[0]
# Compressed vs uncompressed
cursor = conn.execute("SELECT COUNT(*) FROM cache WHERE compressed = 1")
compressed = cursor.fetchone()[0]
cursor = conn.execute("SELECT COUNT(*) FROM cache WHERE compressed = 0 OR compressed IS NULL")
uncompressed = cursor.fetchone()[0]
print(f"Total cache entries: {total}")
print(f"Compressed: {compressed} ({compressed/total*100:.1f}%)")
print(f"Uncompressed: {uncompressed} ({uncompressed/total*100:.1f}%)")
if uncompressed > 0:
print(f"\n⚠️ Warning: {uncompressed} entries are still uncompressed")
print(" Run: python migrate_compress_cache.py")
else:
print("\n✓ All cache entries are compressed!")
# Check test URLs
print(f"\n{'='*70}")
print("TEST URL CACHE STATUS:")
print('='*70)
all_test_urls = TEST_AUCTIONS + TEST_LOTS
cached_count = 0
for url in all_test_urls:
cursor = conn.execute("SELECT url FROM cache WHERE url = ?", (url,))
if cursor.fetchone():
print(f"{url[:60]}...")
cached_count += 1
else:
print(f"{url[:60]}... (NOT CACHED)")
print(f"\n{cached_count}/{len(all_test_urls)} test URLs are cached")
if cached_count < len(all_test_urls):
print("\n⚠️ Some test URLs are not cached. Tests for those URLs will fail.")
print(" Run the main scraper to cache these URLs first.")
except Exception as e:
print(f"Error checking cache status: {e}")
if __name__ == "__main__":
# Check cache status first
check_cache_status()
# Run tests
tester = ScraperTester()
success = tester.run_all_tests()
# Exit with appropriate code
sys.exit(0 if success else 1)