- Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
   - New test file: `test/test_graphql_403.py`.
   - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly, so it is independent of `sys.path` quirks.
   - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs (a minimal sketch of this setup follows this list).
   - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
   - Result: `pytest test/test_graphql_403.py -q` passes locally.
   - Root cause insights (from the investigation and the improved logging):
     - The 403s come from the GraphQL endpoint, not the HTML page; they are most likely WAF/CDN protections rejecting non-browser-like requests or rate spikes.
     - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to absorb transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the response body for troubleshooting.
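For reference, a minimal sketch of how such a test can force the 403 path. The `_load` helper, the fake session/response classes, and the exact signature of `fetch_lot_bidding_data` are illustrative assumptions, not the repository's actual code; where the mock needs to be patched also depends on how the client imports aiohttp.

```python
import asyncio
import builtins
import importlib.util
import sys

import aiohttp


def _load(name: str, path: str):
    # Load a module straight from src/ so the test does not depend on sys.path setup.
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    sys.modules[name] = mod              # lets sibling imports such as `import config` resolve
    spec.loader.exec_module(mod)
    return mod


class _FakeResponse:
    status = 403

    async def text(self) -> str:
        return "Forbidden"               # short body, stands in for the WAF message

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class _FakeSession:
    def __init__(self, *args, **kwargs):
        pass

    def post(self, *args, **kwargs):     # used as `async with session.post(...) as resp:`
        return _FakeResponse()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def test_graphql_403_returns_none(monkeypatch):
    _load("config", "src/config.py")
    graphql_client = _load("graphql_client", "src/graphql_client.py")

    # Every GraphQL POST now ends in a 403.
    monkeypatch.setattr(aiohttp, "ClientSession", _FakeSession)

    # Capture log lines by monkeypatching builtins.print, as described above.
    logs: list[str] = []
    monkeypatch.setattr(builtins, "print", lambda *a, **kw: logs.append(" ".join(map(str, a))))

    result = asyncio.run(graphql_client.fetch_lot_bidding_data("A1-40179-35"))

    assert result is None
    assert any("GraphQL API error: 403" in line for line in logs)
```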
2) Incremental/in-place logging for downloads
   - Updated the image download section in `src/scraper.py` (a minimal sketch of the progress pattern follows this list) to:
     - Show in-place progress: `Downloading images: X/N`, updated live as each image finishes.
     - Print `Downloaded: K/N new images` after completion.
     - List the indexes of the images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
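A minimal sketch of the in-place progress pattern, for illustration only: `download_one` and the URL list are placeholders, not the scraper's real API, and the real code downloads images concurrently per lot with its own bookkeeping.

```python
import asyncio
import sys
from typing import Iterable


async def download_one(index: int, url: str) -> bool:
    """Placeholder for the real per-image fetch; returns True when the image was newly downloaded."""
    await asyncio.sleep(0.01)
    return True


async def download_images(urls: Iterable[str]) -> None:
    urls = list(urls)
    total = len(urls)
    done = 0
    new_indexes: list[int] = []

    print(f"Images: {total}")
    sys.stdout.write(f"Downloading images: 0/{total}")
    sys.stdout.flush()

    async def _worker(index: int, url: str) -> None:
        nonlocal done
        if await download_one(index, url):
            new_indexes.append(index)
        done += 1
        # '\r' moves the cursor back to column 0, so the same line is rewritten in place.
        sys.stdout.write(f"\rDownloading images: {done}/{total}")
        sys.stdout.flush()

    await asyncio.gather(*(_worker(i, u) for i, u in enumerate(urls)))

    sys.stdout.write("\n")
    print(f"Downloaded: {len(new_indexes)}/{total} new images")
    shown = sorted(new_indexes)[:20]
    extra = len(new_indexes) - len(shown)
    print("Indexes: " + ", ".join(map(str, shown)) + (f" (+{extra} more)" if extra else ""))


if __name__ == "__main__":
    asyncio.run(download_images([f"https://example.com/img_{i}.jpg" for i in range(6)]))
```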
3) GraphQL client improvements
   - Updated `src/graphql_client.py` (a hedged sketch of the approach follows this list):
     - Added browser-like headers and a contextual Referer.
     - Added a small retry with backoff for 403/429.
     - Improved error logs to include the status, lot id, and a short body snippet.
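A rough sketch of the hardened request path, assuming an aiohttp-based client. The endpoint URL, header values, GraphQL query, Referer path, and retry/backoff numbers are illustrative placeholders, not the actual values in `src/graphql_client.py`; only the error-log format mirrors the example shown below.

```python
import asyncio
from typing import Any, Optional

import aiohttp

GRAPHQL_URL = "https://www.troostwijkauctions.com/graphql"   # assumed endpoint
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Origin": "https://www.troostwijkauctions.com",
    "Accept": "application/json",
}


async def fetch_lot_bidding_data(lot_id: str, retries: int = 3) -> Optional[dict[str, Any]]:
    headers = dict(BROWSER_HEADERS)
    # Contextual Referer pointing at the lot page (URL path is an assumption).
    headers["Referer"] = f"https://www.troostwijkauctions.com/l/{lot_id}"
    payload = {
        "query": "query($id: String!) { lotDetails(id: $id) { currentBid } }",  # placeholder query
        "variables": {"id": lot_id},
    }

    async with aiohttp.ClientSession(headers=headers) as session:
        for attempt in range(retries):
            async with session.post(GRAPHQL_URL, json=payload) as resp:
                if resp.status == 200:
                    return await resp.json()
                snippet = (await resp.text())[:200]              # safe, truncated body snippet
                if resp.status in (403, 429) and attempt < retries - 1:
                    await asyncio.sleep(0.5 * (2 ** attempt))    # small backoff: 0.5s, 1s, ...
                    continue
                print(f"GraphQL API error: {resp.status} (lot={lot_id}) — {snippet}")
                return None
    return None
```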
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes); the targeted 403 test passes and validates the error-handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
### Files changed
- Ignore rules (hunk at `@@ -21,3 +21,5 @@ node_modules/`): `.pytest_cache/` and `__pycache__` are added alongside the existing `.git`, `.github`, and `scripts` entries.
- `README.md` (30 lines changed):
  - Added a "Database Configuration (PostgreSQL)" section after the `playwright install chromium` step: the scraper now uses PostgreSQL (no more SQLite files), configured via `DATABASE_URL`. Default (baked in): `postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb`. Override for your environment with `$env:DATABASE_URL = "postgresql://user:pass@host:5432/dbname"` (Windows PowerShell) or `export DATABASE_URL="postgresql://user:pass@host:5432/dbname"` (Linux/macOS). Driver package: `psycopg[binary]`. Nothing is written to local `.db` files anymore. The existing "Verify" section follows unchanged.
  - Troubleshooting: "SQLite locked → ensure one instance only" is replaced by "PostgreSQL connectivity → verify `DATABASE_URL`, network/firewall, and credentials"; the other items (wrong interpreter → set Python 3.10+, multiple monitors running → kill extra processes, service fails → check `journalctl -u scaev-monitor`) are unchanged.
  - Removed the "Cache" section (path `cache/page_cache.db`, cleared by deleting the file); the closing note that the file keeps everything compact, Python‑focused, and ready for onboarding stays.
- Architecture/data-flow documentation (Key Configuration table and the per-lot pipeline stages):
  - In the "Key Configuration" table, the `CACHE_DB` row (`/mnt/okcomputer/output/cache.db`, "SQLite database path") is replaced by a `DATABASE_URL` row (`postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb`, "PostgreSQL connection string"); the `IMAGES_DIR` (`/mnt/okcomputer/output/images`), `RATE_LIMIT_SECONDS` (`0.5`), `DOWNLOAD_IMAGES` (`False`), and `MAX_PAGES` (`50`) rows are unchanged.
  - In the per-lot "tunnels through" stages (parse `__NEXT_DATA__` → GraphQL `lotDetails` bidding data → optional REST bid history), stage 4 now reads "Persist to DB (PostgreSQL) and export" instead of "Persist to DB (SQLite for now) and export"; each stage is still recorded by the TTY progress reporter with timing and byte size.
- `src/cache.py` (1033 lines changed): diff suppressed because it is too large.
- Configuration constants (`BASE_URL`, `DATABASE_URL`, pool settings):
  - The database comment block now reads "Primary database: PostgreSQL only / Override via environment variable DATABASE_URL / Example: postgresql://user:pass@host:5432/dbname"; `DATABASE_URL` is still read via `os.getenv("DATABASE_URL", "postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb").strip()`.
  - The deprecated legacy SQLite cache path (`CACHE_DB = "/mnt/okcomputer/output/cache.db"`, previously noted as "only used as fallback in dev/tests") is removed.
  - New database connection pool controls avoid creating too many short-lived TCP connections: an `_int_env(name, default)` helper returns `int(os.getenv(name, str(default)))` and falls back to the default on any error; `DB_POOL_MIN` (env `SCAEV_DB_POOL_MIN`, default 1), `DB_POOL_MAX` (env `SCAEV_DB_POOL_MAX`, default 6), and `DB_POOL_TIMEOUT` (env `SCAEV_DB_POOL_TIMEOUT`, default 30, seconds to wait for a pooled connection) are defined from it. `OUTPUT_DIR` (`/mnt/okcomputer/output`), `IMAGES_DIR` (`/mnt/okcomputer/output/images`), and `RATE_LIMIT_SECONDS` (`0.5`, exactly 0.5 seconds between requests) are unchanged.
- `src/db.py` (26 lines changed):
  - The module docstring now states that the application uses PostgreSQL exclusively via `config.DATABASE_URL`, that this module prepares an engine/session bound to `DATABASE_URL` (example: `postgresql+psycopg://user:pass@host:5432/scaev`), and that it is present to bootstrap a possible future move to SQLAlchemy 2.x; the earlier notes about keeping SQLite + raw SQL for operational code, defaulting to a SQLite file at `config.CACHE_DB`, and the "gradual migration" wording are gone. No runtime dependency from the scraper currently imports or uses this module.
  - `get_database_url()` drops its `sqlite_fallback_path` parameter and the `sqlite:///{sqlite_fallback_path}` fallback; it now raises `RuntimeError("DATABASE_URL must be set for PostgreSQL connection")` when `DATABASE_URL` is missing or blank, and otherwise returns the stripped URL.
  - `get_sa(session_cached)` likewise drops `sqlite_fallback_path` from its signature and docstring and calls `get_database_url()` with no arguments; the lazy creation and caching of the SQLAlchemy engine/session factory via `create_engine_and_session(url)` is unchanged.
- `src/main.py` (14 lines changed):
  - The `import sqlite3` line is removed.
  - New helper `mask_db_url(url)` uses `urllib.parse.urlparse` to rebuild the URL as `scheme://user:***@host:port/path`, hiding the password, and returns the raw URL if parsing fails.
  - In `main()`, the startup banner prints `Database URL: {mask_db_url(config.DATABASE_URL)}` instead of `Cache database: {config.CACHE_DB}`; the offline-mode notice, rate-limit, output-directory, and max-listing-pages lines are unchanged.
- `class TroostwijkScraper` (lot prioritization and offline crawl):
  - The method that builds the list of `(priority, url, description)` tuples (sorted by priority, highest first) no longer opens `sqlite3.connect(self.cache.db_path)` and runs `SELECT closing_time, scraped_at, scrape_priority, next_scrape_at FROM lots WHERE lot_id = ? OR url = ?` per URL; it now calls `row = self.cache.get_lot_priority_info(lot_id, url)` (with `lot_id` still extracted via `self.parser.extract_lot_id(url)`) and unpacks `closing_time, scraped_at, existing_priority, next_scrape_at` as before. The trailing `conn.close()` is gone.
  - The offline crawl ("Launching OFFLINE crawl (no network requests)") no longer runs `SELECT DISTINCT url FROM auctions` / `SELECT DISTINCT url FROM lots` against SQLite directly; it now calls `urls = self.cache.get_distinct_urls()` and uses `urls['auctions']` and `urls['lots']`, then prints the auction/lot counts as before.
- `src/test.py` (10 lines changed):
  - The `import sqlite3` line is removed.
  - In `test_extraction`, when the test URL is not in the cache, the list of recent cached URLs now comes from `scraper.cache.get_recent_cached_urls(limit=10)` (wrapped in a try/except that prints `(failed to list recent cached URLs: ...)` on error) instead of querying `config.CACHE_DB` directly with `SELECT url FROM cache ORDER BY timestamp DESC LIMIT 10`; the subsequent use of `cached['content']` is unchanged.