### Summary
- Added a targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they do occur.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
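The test's shape can be sketched as follows. `FakeResponse`/`FakeSession` and the inlined `fetch_lot_bidding_data` are stand-ins for illustration; the real test loads `src/graphql_client.py` via `importlib` and mocks `aiohttp.ClientSession` instead:

```python
import asyncio

class FakeResponse:
    """Always-403 response exposing only the surface the client touches."""
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False

class FakeSession:
    def post(self, url, **kwargs):
        return FakeResponse()

logs = []

# Stand-in for the real client function: on a non-200 status it logs a
# clear error line and returns None instead of raising.
async def fetch_lot_bidding_data(session, lot_id):
    async with session.post("https://example.invalid/graphql",
                            json={"lotId": lot_id}) as resp:
        if resp.status != 200:
            snippet = (await resp.text())[:200]
            logs.append(f"GraphQL API error: {resp.status} (lot={lot_id}) — {snippet}")
            return None
        return await resp.json()

result = asyncio.run(fetch_lot_bidding_data(FakeSession(), "A1-40179-35"))
```

The assertions then check exactly the two guarantees listed above: `result is None` and a `GraphQL API error: 403` line in the captured logs.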

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
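The retry policy can be sketched like this. Delay values, attempt count, and header contents are illustrative assumptions, not the exact values in `src/graphql_client.py`:

```python
import asyncio
import random

# Browser-like headers of the kind WAF/CDN protections expect (values assumed).
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Origin": "https://www.example-auction-site.test",   # hypothetical host
    "Referer": "https://www.example-auction-site.test/l/A1-40179-35",
}
RETRYABLE = {403, 429}  # transient protection / rate-limit rejections

async def post_with_retry(do_post, attempts=3, base_delay=0.5):
    """Call do_post() -> (status, body); retry 403/429 with exponential backoff."""
    for attempt in range(attempts):
        status, body = await do_post()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            # exponential backoff with a little jitter
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    return status, body  # still 403/429 after all attempts: caller logs it

# Demo: the first call is blocked, the retry succeeds.
responses = iter([(403, "Forbidden"), (200, '{"data": {}}')])

async def fake_post():
    return next(responses)

status, body = asyncio.run(post_with_retry(fake_post, base_delay=0.01))
```

Returning the final status instead of raising keeps the "log and return `None`" behavior above intact when the protection does not relent.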

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
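The logging described above can be sketched as follows. The strings mirror the example output later in this description; the function names are illustrative, and the real code lives in `src/scraper.py`:

```python
import sys

def report_progress(done, total):
    # \r rewrites the same terminal line instead of emitting a new one,
    # giving the live "Downloading images: X/N" counter.
    sys.stdout.write(f"\r  Downloading images: {done}/{total}")
    sys.stdout.flush()

def summary_lines(downloaded_indexes, total, max_listed=20):
    """Build the post-download summary: count, then the first 20 indexes."""
    lines = [f"  Downloaded: {len(downloaded_indexes)}/{total} new images"]
    shown = downloaded_indexes[:max_listed]
    extra = len(downloaded_indexes) - len(shown)
    index_line = "    Indexes: " + ", ".join(map(str, shown))
    if extra > 0:
        index_line += f" (+{extra} more)"
    lines.append(index_line)
    return lines

for i in range(6):
    report_progress(i + 1, 6)
print()  # terminate the in-place progress line
summary = summary_lines(list(range(6)), 6)
print("\n".join(summary))
```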

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
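The shape of the improved error line can be sketched as a small formatter (a stand-in, not the client's actual helper; the truncation length is an assumption):

```python
def format_graphql_error(status, lot_id, body, max_snippet=120):
    # Collapse whitespace and truncate so raw HTML/WAF error pages stay readable
    snippet = " ".join(str(body).split())[:max_snippet]
    return f"  GraphQL API error: {status} (lot={lot_id}) — {snippet}"

line = format_graphql_error(403, "A1-40179-35", "Forbidden by WAF")
```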

### What the logs look like now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
---
Commit: `5a755a2125` (parent `e69563d4d6`)
Author: Tour
Date: 2025-12-09 09:15:49 +01:00
Changes: 8 changed files with 512 additions and 31 deletions

New file (139 lines): initial PostgreSQL schema.
```sql
-- Auctions
CREATE TABLE auctions (
    auction_id TEXT PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT,
    city TEXT,
    country TEXT,
    type TEXT,
    lot_count INTEGER DEFAULT 0,
    closing_time TEXT,
    discovered_at BIGINT
);
CREATE INDEX idx_auctions_country ON auctions(country);

-- Cache
CREATE TABLE cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    timestamp DOUBLE PRECISION,
    status_code INTEGER
);
CREATE INDEX idx_timestamp ON cache(timestamp);

-- Lots
CREATE TABLE lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT REFERENCES auctions(auction_id),
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT,
    sale_id INTEGER,
    manufacturer TEXT,
    type TEXT,
    year INTEGER,
    currency TEXT DEFAULT 'EUR',
    closing_notified INTEGER DEFAULT 0,
    starting_bid TEXT,
    minimum_bid TEXT,
    status TEXT,
    brand TEXT,
    model TEXT,
    attributes_json TEXT,
    first_bid_time TEXT,
    last_bid_time TEXT,
    bid_velocity DOUBLE PRECISION,
    bid_increment DOUBLE PRECISION,
    year_manufactured INTEGER,
    condition_score DOUBLE PRECISION,
    condition_description TEXT,
    serial_number TEXT,
    damage_description TEXT,
    followers_count INTEGER DEFAULT 0,
    estimated_min_price DOUBLE PRECISION,
    estimated_max_price DOUBLE PRECISION,
    lot_condition TEXT,
    appearance TEXT,
    estimated_min DOUBLE PRECISION,
    estimated_max DOUBLE PRECISION,
    next_bid_step_cents INTEGER,
    condition TEXT,
    category_path TEXT,
    city_location TEXT,
    country_code TEXT,
    bidding_status TEXT,
    packaging TEXT,
    quantity INTEGER,
    vat DOUBLE PRECISION,
    buyer_premium_percentage DOUBLE PRECISION,
    remarks TEXT,
    reserve_price DOUBLE PRECISION,
    reserve_met INTEGER,
    view_count INTEGER,
    api_data_json TEXT,
    next_scrape_at BIGINT,
    scrape_priority INTEGER DEFAULT 0
);
CREATE INDEX idx_lots_closing_time ON lots(closing_time);
CREATE INDEX idx_lots_next_scrape ON lots(next_scrape_at);
CREATE INDEX idx_lots_priority ON lots(scrape_priority DESC);
CREATE INDEX idx_lots_sale_id ON lots(sale_id);

-- Bid history
CREATE TABLE bid_history (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    bid_amount DOUBLE PRECISION NOT NULL,
    bid_time TEXT NOT NULL,
    is_autobid INTEGER DEFAULT 0,
    bidder_id TEXT,
    bidder_number INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_bid_history_bidder ON bid_history(bidder_id);
CREATE INDEX idx_bid_history_lot_time ON bid_history(lot_id, bid_time);

-- Images
CREATE TABLE images (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0,
    labels TEXT,
    processed_at BIGINT
);
CREATE INDEX idx_images_lot_id ON images(lot_id);
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);

-- Resource cache
CREATE TABLE resource_cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    content_type TEXT,
    status_code INTEGER,
    headers TEXT,
    timestamp DOUBLE PRECISION,
    size_bytes INTEGER,
    local_path TEXT
);
CREATE INDEX idx_resource_timestamp ON resource_cache(timestamp);
CREATE INDEX idx_resource_content_type ON resource_cache(content_type);
```