- Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
- Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly, so the test is independent of `sys.path` quirks.
- Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
- Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
- Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root-cause insights (from the investigation and improved logging):
  - The 403s come from the GraphQL endpoint (not the HTML page), most likely WAF/CDN protections rejecting non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff on 403/429 to absorb transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the response body for troubleshooting.
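The test's shape can be sketched without the real modules. This is a minimal, self-contained stand-in (the fake session and response classes are hypothetical; the real test mocks `aiohttp.ClientSession` and loads the actual client via `importlib`): the client must return `None` on a 403 and emit a diagnosable log line rather than raise.

```python
import asyncio

# Hypothetical stand-ins for aiohttp's response/session async context managers.
class Fake403Response:
    status = 403
    async def text(self):
        return "Forbidden by WAF"
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        return False

class FakeSession:
    def post(self, url, **kwargs):
        return Fake403Response()

logs = []  # the real test monkeypatches builtins.print to capture these

# Simplified version of the error path under test.
async def fetch_lot_bidding_data(lot_id, session):
    async with session.post("https://example.invalid/graphql", json={}) as resp:
        if resp.status != 200:
            body = (await resp.text())[:200]  # safe, truncated snippet
            logs.append(f"GraphQL API error: {resp.status} (lot={lot_id}) — {body}")
            return None  # degrade gracefully instead of crashing
        return await resp.json()

result = asyncio.run(fetch_lot_bidding_data("A1-40179-35", FakeSession()))
assert result is None
assert any("GraphQL API error: 403" in line for line in logs)
```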
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
- Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
- After completion, print: `Downloaded: K/N new images`.
- Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
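The progress display boils down to rewriting one terminal line with `\r` and collecting the indexes of images that were actually fetched. A simplified, synchronous sketch (the real code in `src/scraper.py` does this from an async download loop; `fetch` here is a hypothetical callback returning `True` when a new image was written):

```python
import sys

def download_images(urls, fetch):
    total = len(urls)
    downloaded = []  # indexes actually fetched (cache hits are skipped)
    for i, url in enumerate(urls):
        if fetch(url):
            downloaded.append(i)
        # \r rewrites the same terminal line instead of scrolling
        sys.stdout.write(f"\rDownloading images: {i + 1}/{total}")
        sys.stdout.flush()
    sys.stdout.write("\n")
    print(f"Downloaded: {len(downloaded)}/{total} new images")
    if downloaded:
        # Cap the listing at 20 indexes, then summarize the rest.
        shown = ", ".join(map(str, downloaded[:20]))
        extra = len(downloaded) - 20
        suffix = f" (+{extra} more)" if extra > 0 else ""
        print(f"Indexes: {shown}{suffix}")
    return downloaded

got = download_images([f"u{i}" for i in range(6)], lambda u: True)
```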
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
- Added browser-like headers and contextual Referer.
- Added small retry with backoff for 403/429.
- Improved error logs to include status, lot id, and a short body snippet.
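The retry shape is the standard exponential-backoff pattern. A sketch under stated assumptions (the header values and the `do_post` transport callback are illustrative, not the actual client code; the real implementation wraps aiohttp):

```python
import asyncio

BROWSER_HEADERS = {
    # Browser-like headers reduce the chance of WAF/CDN rejection.
    # Origin/Referer values below are placeholders, not the real site.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Origin": "https://www.example.com",
    "Referer": "https://www.example.com/l/",
}

async def post_with_retry(do_post, retries=2, base_delay=0.5):
    """Retry transient 403/429 responses with exponential backoff."""
    for attempt in range(retries + 1):
        status, body = await do_post(BROWSER_HEADERS)
        if status not in (403, 429):
            return status, body
        if attempt < retries:
            await asyncio.sleep(base_delay * (2 ** attempt))
    # Persistent failure: return a truncated body snippet for the error log.
    return status, body[:200]

# Fake transport: first call is blocked by the WAF, second succeeds.
calls = []
async def fake_post(headers):
    calls.append(headers["User-Agent"])
    return (403, "Forbidden by WAF") if len(calls) == 1 else (200, "{}")

status, body = asyncio.run(post_with_retry(fake_post, base_delay=0.01))
assert status == 200 and len(calls) == 2
```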
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
### New file: `db/migration/V1__initial_schema.sql` (139 lines)

```sql
-- Auctions
CREATE TABLE auctions (
    auction_id TEXT PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT,
    city TEXT,
    country TEXT,
    type TEXT,
    lot_count INTEGER DEFAULT 0,
    closing_time TEXT,
    discovered_at BIGINT
);

CREATE INDEX idx_auctions_country ON auctions(country);

-- Cache
CREATE TABLE cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    timestamp DOUBLE PRECISION,
    status_code INTEGER
);

CREATE INDEX idx_timestamp ON cache(timestamp);

-- Lots
CREATE TABLE lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT REFERENCES auctions(auction_id),
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT,
    sale_id INTEGER,
    manufacturer TEXT,
    type TEXT,
    year INTEGER,
    currency TEXT DEFAULT 'EUR',
    closing_notified INTEGER DEFAULT 0,
    starting_bid TEXT,
    minimum_bid TEXT,
    status TEXT,
    brand TEXT,
    model TEXT,
    attributes_json TEXT,
    first_bid_time TEXT,
    last_bid_time TEXT,
    bid_velocity DOUBLE PRECISION,
    bid_increment DOUBLE PRECISION,
    year_manufactured INTEGER,
    condition_score DOUBLE PRECISION,
    condition_description TEXT,
    serial_number TEXT,
    damage_description TEXT,
    followers_count INTEGER DEFAULT 0,
    estimated_min_price DOUBLE PRECISION,
    estimated_max_price DOUBLE PRECISION,
    lot_condition TEXT,
    appearance TEXT,
    estimated_min DOUBLE PRECISION,
    estimated_max DOUBLE PRECISION,
    next_bid_step_cents INTEGER,
    condition TEXT,
    category_path TEXT,
    city_location TEXT,
    country_code TEXT,
    bidding_status TEXT,
    packaging TEXT,
    quantity INTEGER,
    vat DOUBLE PRECISION,
    buyer_premium_percentage DOUBLE PRECISION,
    remarks TEXT,
    reserve_price DOUBLE PRECISION,
    reserve_met INTEGER,
    view_count INTEGER,
    api_data_json TEXT,
    next_scrape_at BIGINT,
    scrape_priority INTEGER DEFAULT 0
);

CREATE INDEX idx_lots_closing_time ON lots(closing_time);
CREATE INDEX idx_lots_next_scrape ON lots(next_scrape_at);
CREATE INDEX idx_lots_priority ON lots(scrape_priority DESC);
CREATE INDEX idx_lots_sale_id ON lots(sale_id);

-- Bid history
CREATE TABLE bid_history (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    bid_amount DOUBLE PRECISION NOT NULL,
    bid_time TEXT NOT NULL,
    is_autobid INTEGER DEFAULT 0,
    bidder_id TEXT,
    bidder_number INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_bid_history_bidder ON bid_history(bidder_id);
CREATE INDEX idx_bid_history_lot_time ON bid_history(lot_id, bid_time);

-- Images
CREATE TABLE images (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0,
    labels TEXT,
    processed_at BIGINT
);

CREATE INDEX idx_images_lot_id ON images(lot_id);
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);

-- Resource cache
CREATE TABLE resource_cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    content_type TEXT,
    status_code INTEGER,
    headers TEXT,
    timestamp DOUBLE PRECISION,
    size_bytes INTEGER,
    local_path TEXT
);

CREATE INDEX idx_resource_timestamp ON resource_cache(timestamp);
CREATE INDEX idx_resource_content_type ON resource_cache(content_type);
```