- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
   - New test file: `test/test_graphql_403.py`.
     - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly, so it is independent of `sys.path` quirks.
     - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message, and monkeypatches `builtins.print` to capture logs.
     - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
     - Result: `pytest test/test_graphql_403.py -q` passes locally.
   - Root cause insights (from investigation and log improvements):
     - The 403s come from the GraphQL endpoint, not the HTML page. They are likely caused by WAF/CDN protections that reject non-browser-like requests or rate spikes.
     - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to handle transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
2) Incremental/in-place logging for downloads
   - Updated `src/scraper.py` image download section to:
     - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
     - After completion, print: `Downloaded: K/N new images`.
     - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
3) GraphQL client improvements
   - Updated `src/graphql_client.py`:
     - Added browser-like headers and contextual Referer.
     - Added small retry with backoff for 403/429.
     - Improved error logs to include status, lot id, and a short body snippet.
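
For readers who want a concrete picture of the header and retry changes, here is a minimal sketch. It is not the code in `src/graphql_client.py`. The names `browser_like_headers` and `send_with_retry`, the attempt count, the delays, and the header values are all assumptions chosen for illustration. The HTTP call is passed in as a coroutine so the retry logic can be tried without a network.

```python
import asyncio
import random
from urllib.parse import urlsplit

RETRY_STATUSES = {403, 429}  # the statuses the summary says are retried


def browser_like_headers(lot_url, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Build request headers with a Referer tied to the lot page.

    Hypothetical helper: the header values here are illustrative only.
    """
    parts = urlsplit(lot_url)
    return {
        "User-Agent": user_agent,
        "Origin": f"{parts.scheme}://{parts.netloc}",
        "Referer": lot_url,
        "Content-Type": "application/json",
    }


async def send_with_retry(send, max_attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Await `send()` until it returns a non-retryable status or attempts run out.

    `send` is an assumed zero-argument coroutine function returning
    (status, body). The last (status, body) is returned either way, so the
    caller can still log a persistent 403.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = await send()
        if status not in RETRY_STATUSES or attempt == max_attempts:
            return status, body
        # Exponential backoff plus a little jitter before the next attempt.
        delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
        await sleep(delay)
```

Injecting `sleep` keeps tests fast: a no-op coroutine can stand in for `asyncio.sleep` so that no real waiting happens.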
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
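
A log line of this shape could be assembled roughly as follows. This is a sketch only: `format_graphql_error` and the 120-character limit are assumed for illustration and are not taken from `src/graphql_client.py`.

```python
SEPARATOR = " \u2014 "  # the dash separator that appears in the sample log line


def format_graphql_error(status, lot_id, body, max_len=120):
    """Return a one-line error message with a truncated body snippet.

    Hypothetical helper: whitespace is collapsed so a multi-line HTML error
    page cannot flood the log, and the snippet is cut at `max_len` characters.
    """
    snippet = " ".join((body or "").split())
    if len(snippet) > max_len:
        snippet = snippet[:max_len] + "..."
    line = f"GraphQL API error: {status} (lot={lot_id})"
    return line + SEPARATOR + snippet if snippet else line


print(format_graphql_error(403, "A1-40179-35", "Forbidden by WAF"))
```

Truncating before logging keeps the line readable and avoids dumping a full challenge page into the output.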
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
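
The in-place progress line and the index summary can be produced with a carriage return and a capped join. The sketch below illustrates that technique and is not the code in `src/scraper.py`. The function names and the `stream` parameter are assumptions, and how the all-cached case interacts with the progress line is a guess.

```python
import sys


def summarize_indexes(indexes, limit=20):
    """Join the first `limit` indexes and note how many were left out."""
    shown = ", ".join(str(i) for i in indexes[:limit])
    extra = len(indexes) - limit
    return f"{shown} (+{extra} more)" if extra > 0 else shown


def report_progress(done, total, stream=None):
    """Rewrite the current terminal line with the latest count."""
    stream = sys.stdout if stream is None else stream
    # "\r" moves the cursor to the start of the line, so each write
    # overwrites the previous count instead of adding a new line.
    stream.write(f"\rDownloading images: {done}/{total}")
    stream.flush()


def report_summary(downloaded_indexes, total, stream=None):
    """Print the final summary once all downloads have finished."""
    stream = sys.stdout if stream is None else stream
    if not downloaded_indexes:
        stream.write(f"All {total} images already cached\n")
        return
    stream.write(f"\nDownloaded: {len(downloaded_indexes)}/{total} new images\n")
    stream.write(f"Indexes: {summarize_indexes(downloaded_indexes)}\n")
```

Passing a `stream` makes the output easy to capture in a test, for example with `io.StringIO`.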
### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
```sql
-- Auctions
CREATE TABLE auctions (
    auction_id TEXT PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    location TEXT,
    lots_count INTEGER,
    first_lot_closing_time TEXT,
    scraped_at TEXT,
    city TEXT,
    country TEXT,
    type TEXT,
    lot_count INTEGER DEFAULT 0,
    closing_time TEXT,
    discovered_at BIGINT
);

CREATE INDEX idx_auctions_country ON auctions(country);

-- Cache
CREATE TABLE cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    timestamp DOUBLE PRECISION,
    status_code INTEGER
);

CREATE INDEX idx_timestamp ON cache(timestamp);

-- Lots
CREATE TABLE lots (
    lot_id TEXT PRIMARY KEY,
    auction_id TEXT REFERENCES auctions(auction_id),
    url TEXT UNIQUE,
    title TEXT,
    current_bid TEXT,
    bid_count INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date TEXT,
    location TEXT,
    description TEXT,
    category TEXT,
    scraped_at TEXT,
    sale_id INTEGER,
    manufacturer TEXT,
    type TEXT,
    year INTEGER,
    currency TEXT DEFAULT 'EUR',
    closing_notified INTEGER DEFAULT 0,
    starting_bid TEXT,
    minimum_bid TEXT,
    status TEXT,
    brand TEXT,
    model TEXT,
    attributes_json TEXT,
    first_bid_time TEXT,
    last_bid_time TEXT,
    bid_velocity DOUBLE PRECISION,
    bid_increment DOUBLE PRECISION,
    year_manufactured INTEGER,
    condition_score DOUBLE PRECISION,
    condition_description TEXT,
    serial_number TEXT,
    damage_description TEXT,
    followers_count INTEGER DEFAULT 0,
    estimated_min_price DOUBLE PRECISION,
    estimated_max_price DOUBLE PRECISION,
    lot_condition TEXT,
    appearance TEXT,
    estimated_min DOUBLE PRECISION,
    estimated_max DOUBLE PRECISION,
    next_bid_step_cents INTEGER,
    condition TEXT,
    category_path TEXT,
    city_location TEXT,
    country_code TEXT,
    bidding_status TEXT,
    packaging TEXT,
    quantity INTEGER,
    vat DOUBLE PRECISION,
    buyer_premium_percentage DOUBLE PRECISION,
    remarks TEXT,
    reserve_price DOUBLE PRECISION,
    reserve_met INTEGER,
    view_count INTEGER,
    api_data_json TEXT,
    next_scrape_at BIGINT,
    scrape_priority INTEGER DEFAULT 0
);

CREATE INDEX idx_lots_closing_time ON lots(closing_time);
CREATE INDEX idx_lots_next_scrape ON lots(next_scrape_at);
CREATE INDEX idx_lots_priority ON lots(scrape_priority DESC);
CREATE INDEX idx_lots_sale_id ON lots(sale_id);

-- Bid history
CREATE TABLE bid_history (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    bid_amount DOUBLE PRECISION NOT NULL,
    bid_time TEXT NOT NULL,
    is_autobid INTEGER DEFAULT 0,
    bidder_id TEXT,
    bidder_number INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_bid_history_bidder ON bid_history(bidder_id);
CREATE INDEX idx_bid_history_lot_time ON bid_history(lot_id, bid_time);

-- Images
CREATE TABLE images (
    id SERIAL PRIMARY KEY,
    lot_id TEXT REFERENCES lots(lot_id),
    url TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0,
    labels TEXT,
    processed_at BIGINT
);

CREATE INDEX idx_images_lot_id ON images(lot_id);
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);

-- Resource cache
CREATE TABLE resource_cache (
    url TEXT PRIMARY KEY,
    content BYTEA,
    content_type TEXT,
    status_code INTEGER,
    headers TEXT,
    timestamp DOUBLE PRECISION,
    size_bytes INTEGER,
    local_path TEXT
);

CREATE INDEX idx_resource_timestamp ON resource_cache(timestamp);
CREATE INDEX idx_resource_content_type ON resource_cache(content_type);
```