### Summary
- Added a targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they do occur.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
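The test's shape can be sketched as follows. `FakeResponse`/`FakeSession` and the inlined `fetch_lot_bidding_data` are stand-ins for illustration; the real test loads `src/graphql_client.py` via `importlib` and mocks `aiohttp.ClientSession` instead:

```python
import asyncio

class FakeResponse:
    """Always-403 response exposing only the surface the client touches."""
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False

class FakeSession:
    def post(self, url, **kwargs):
        return FakeResponse()

logs = []

# Stand-in for the real client function: on a non-200 status it logs a
# clear error line and returns None instead of raising.
async def fetch_lot_bidding_data(session, lot_id):
    async with session.post("https://example.invalid/graphql",
                            json={"lotId": lot_id}) as resp:
        if resp.status != 200:
            snippet = (await resp.text())[:200]
            logs.append(f"GraphQL API error: {resp.status} (lot={lot_id}) — {snippet}")
            return None
        return await resp.json()

result = asyncio.run(fetch_lot_bidding_data(FakeSession(), "A1-40179-35"))
```

The assertions then check exactly the two guarantees listed above: `result is None` and a `GraphQL API error: 403` line in the captured logs.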

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
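The retry policy can be sketched like this. Delay values, attempt count, and header contents are illustrative assumptions, not the exact values in `src/graphql_client.py`:

```python
import asyncio
import random

# Browser-like headers of the kind WAF/CDN protections expect (values assumed).
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Origin": "https://www.example-auction-site.test",   # hypothetical host
    "Referer": "https://www.example-auction-site.test/l/A1-40179-35",
}
RETRYABLE = {403, 429}  # transient protection / rate-limit rejections

async def post_with_retry(do_post, attempts=3, base_delay=0.5):
    """Call do_post() -> (status, body); retry 403/429 with exponential backoff."""
    for attempt in range(attempts):
        status, body = await do_post()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            # exponential backoff with a little jitter
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    return status, body  # still 403/429 after all attempts: caller logs it

# Demo: the first call is blocked, the retry succeeds.
responses = iter([(403, "Forbidden"), (200, '{"data": {}}')])

async def fake_post():
    return next(responses)

status, body = asyncio.run(post_with_retry(fake_post, base_delay=0.01))
```

Returning the final status instead of raising keeps the "log and return `None`" behavior above intact when the protection does not relent.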

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
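The logging described above can be sketched as follows. The strings mirror the example output later in this description; the function names are illustrative, and the real code lives in `src/scraper.py`:

```python
import sys

def report_progress(done, total):
    # \r rewrites the same terminal line instead of emitting a new one,
    # giving the live "Downloading images: X/N" counter.
    sys.stdout.write(f"\r  Downloading images: {done}/{total}")
    sys.stdout.flush()

def summary_lines(downloaded_indexes, total, max_listed=20):
    """Build the post-download summary: count, then the first 20 indexes."""
    lines = [f"  Downloaded: {len(downloaded_indexes)}/{total} new images"]
    shown = downloaded_indexes[:max_listed]
    extra = len(downloaded_indexes) - len(shown)
    index_line = "    Indexes: " + ", ".join(map(str, shown))
    if extra > 0:
        index_line += f" (+{extra} more)"
    lines.append(index_line)
    return lines

for i in range(6):
    report_progress(i + 1, 6)
print()  # terminate the in-place progress line
summary = summary_lines(list(range(6)), 6)
print("\n".join(summary))
```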

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
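The shape of the improved error line can be sketched as a small formatter (a stand-in, not the client's actual helper; the truncation length is an assumption):

```python
def format_graphql_error(status, lot_id, body, max_snippet=120):
    # Collapse whitespace and truncate so raw HTML/WAF error pages stay readable
    snippet = " ".join(str(body).split())[:max_snippet]
    return f"  GraphQL API error: {status} (lot={lot_id}) — {snippet}"

line = format_graphql_error(403, "A1-40179-35", "Forbidden by WAF")
```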

### What the logs look like now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
---
Commit: `5a755a2125` (parent `e69563d4d6`)
Author: Tour
Date: 2025-12-09 09:15:49 +01:00
Changes: 8 changed files with 512 additions and 31 deletions

New file (139 lines): initial PostgreSQL schema.
```sql
-- Auctions
CREATE TABLE auctions (
    auction_id TEXT PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT,
    city TEXT,
    country TEXT,
    type TEXT,
    lot_count INTEGER DEFAULT 0,
    closing_time TEXT,
    discovered_at BIGINT
);
CREATE INDEX idx_auctions_country ON auctions(country);

-- Cache
CREATE TABLE cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    timestamp DOUBLE PRECISION,
    status_code INTEGER
);
CREATE INDEX idx_timestamp ON cache(timestamp);

-- Lots
CREATE TABLE lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT REFERENCES auctions(auction_id),
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT,
    sale_id INTEGER,
    manufacturer TEXT,
    type TEXT,
    year INTEGER,
    currency TEXT DEFAULT 'EUR',
    closing_notified INTEGER DEFAULT 0,
    starting_bid TEXT,
    minimum_bid TEXT,
    status TEXT,
    brand TEXT,
    model TEXT,
    attributes_json TEXT,
    first_bid_time TEXT,
    last_bid_time TEXT,
    bid_velocity DOUBLE PRECISION,
    bid_increment DOUBLE PRECISION,
    year_manufactured INTEGER,
    condition_score DOUBLE PRECISION,
    condition_description TEXT,
    serial_number TEXT,
    damage_description TEXT,
    followers_count INTEGER DEFAULT 0,
    estimated_min_price DOUBLE PRECISION,
    estimated_max_price DOUBLE PRECISION,
    lot_condition TEXT,
    appearance TEXT,
    estimated_min DOUBLE PRECISION,
    estimated_max DOUBLE PRECISION,
    next_bid_step_cents INTEGER,
    condition TEXT,
    category_path TEXT,
    city_location TEXT,
    country_code TEXT,
    bidding_status TEXT,
    packaging TEXT,
    quantity INTEGER,
    vat DOUBLE PRECISION,
    buyer_premium_percentage DOUBLE PRECISION,
    remarks TEXT,
    reserve_price DOUBLE PRECISION,
    reserve_met INTEGER,
    view_count INTEGER,
    api_data_json TEXT,
    next_scrape_at BIGINT,
    scrape_priority INTEGER DEFAULT 0
);
CREATE INDEX idx_lots_closing_time ON lots(closing_time);
CREATE INDEX idx_lots_next_scrape ON lots(next_scrape_at);
CREATE INDEX idx_lots_priority ON lots(scrape_priority DESC);
CREATE INDEX idx_lots_sale_id ON lots(sale_id);

-- Bid history
CREATE TABLE bid_history (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    bid_amount DOUBLE PRECISION NOT NULL,
    bid_time TEXT NOT NULL,
    is_autobid INTEGER DEFAULT 0,
    bidder_id TEXT,
    bidder_number INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_bid_history_bidder ON bid_history(bidder_id);
CREATE INDEX idx_bid_history_lot_time ON bid_history(lot_id, bid_time);

-- Images
CREATE TABLE images (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0,
    labels TEXT,
    processed_at BIGINT
);
CREATE INDEX idx_images_lot_id ON images(lot_id);
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);

-- Resource cache
CREATE TABLE resource_cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    content_type TEXT,
    status_code INTEGER,
    headers TEXT,
    timestamp DOUBLE PRECISION,
    size_bytes INTEGER,
    local_path TEXT
);
CREATE INDEX idx_resource_timestamp ON resource_cache(timestamp);
CREATE INDEX idx_resource_content_type ON resource_cache(content_type);
```