Commit Graph

40 Commits

Author SHA1 Message Date
Tour
d18f08aa36 - Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to ride out transient protection triggers. When a 403 persists, the client now logs the status and a safe, truncated snippet of the body for troubleshooting.
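
A minimal sketch of the test approach described above, using only the standard library. `FakeResponse`, `FakeSession`, the URL, and the stand-in `fetch_lot_bidding_data` are illustrative assumptions; the real test mocks `aiohttp.ClientSession` and loads the real function from `src/graphql_client.py`.

```python
import asyncio
import builtins
from unittest import mock


class FakeResponse:
    """Hypothetical stand-in for an HTTP response that always reports 403."""
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class FakeSession:
    """Hypothetical stand-in for a client session used via `async with`."""

    def post(self, url, **kwargs):
        return FakeResponse()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


async def fetch_lot_bidding_data(lot_id, session_factory=FakeSession):
    # Stand-in for the real function; only the error path is modelled here.
    async with session_factory() as session:
        async with session.post("https://example.invalid/graphql", json={"lot": lot_id}) as resp:
            body = await resp.text()
            if resp.status != 200:
                # Log a truncated body snippet and return None instead of raising.
                print(f"  GraphQL API error: {resp.status} (lot={lot_id}) - {body[:200]}")
                return None
            return body


def run_403_check():
    """Capture print output while the fetch runs against the always-403 session."""
    captured = []

    def fake_print(*args, **kwargs):
        captured.append(" ".join(str(a) for a in args))

    with mock.patch.object(builtins, "print", fake_print):
        result = asyncio.run(fetch_lot_bidding_data("A1-40179-35"))
    return result, captured
```

`run_403_check()` returns the fetch result together with the captured log lines, so a test can assert both that nothing crashed and that the 403 line was logged.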

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
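
The logging behaviour listed above could be sketched roughly as follows. This is not the code in `src/scraper.py`: `format_indexes`, `report_downloads`, and the `results` list of booleans are hypothetical names chosen for illustration, and the sketch reports on already-finished downloads rather than performing them.

```python
import io
import sys


def format_indexes(indexes, limit=20):
    """Join the first `limit` indexes, then append '(+M more)' if any remain."""
    shown = ", ".join(str(i) for i in indexes[:limit])
    extra = len(indexes) - limit
    return f"{shown} (+{extra} more)" if extra > 0 else shown


def report_downloads(results, out=sys.stdout):
    """Print in-place progress and a summary.

    `results[i]` is True when image i was newly downloaded and False when it
    was already cached (an assumed representation).
    """
    total = len(results)
    out.write(f"  Downloading images: 0/{total}")
    downloaded = []
    for idx, is_new in enumerate(results):
        if is_new:
            downloaded.append(idx)
        # '\r' returns to the start of the line so the counter updates in place.
        out.write(f"\r  Downloading images: {idx + 1}/{total}")
        out.flush()
    out.write("\n")
    if downloaded:
        out.write(f"  Downloaded: {len(downloaded)}/{total} new images\n")
        out.write(f"    Indexes: {format_indexes(downloaded)}\n")
    else:
        out.write(f"  All {total} images already cached\n")
    return downloaded


if __name__ == "__main__":
    report_downloads([True, False, True, True, False, True])
```

Writing with `\r` and no trailing newline keeps the counter on one terminal line; the final newline is emitted only once all images are accounted for.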

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
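
A rough sketch of the retry-with-backoff idea, under stated assumptions: the header values, the `https://example.invalid` base URL, the `/l/<lot id>` Referer path, the retry count, and the delays are all illustrative, and `send` is a hypothetical injected coroutine standing in for the actual HTTP call in `src/graphql_client.py`.

```python
import asyncio

RETRY_STATUSES = {403, 429}


def build_headers(lot_id, base_url="https://example.invalid"):
    """Browser-like headers with a Referer tied to the lot (values are assumptions)."""
    return {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Origin": base_url,
        "Referer": f"{base_url}/l/{lot_id}",
        "Content-Type": "application/json",
    }


async def post_with_retry(send, lot_id, retries=2, base_delay=0.5, sleep=asyncio.sleep):
    """Call `send(headers)` -> (status, body), retrying 403/429 with exponential backoff.

    Returns the body on HTTP 200, or None after logging when the error persists.
    """
    headers = build_headers(lot_id)
    for attempt in range(retries + 1):
        status, body = await send(headers)
        if status == 200:
            return body
        if status in RETRY_STATUSES and attempt < retries:
            # Delay doubles on each retry: base_delay, 2 * base_delay, ...
            await sleep(base_delay * 2 ** attempt)
            continue
        # Non-retryable status, or retries exhausted: log status, lot id, short snippet.
        print(f"  GraphQL API error: {status} (lot={lot_id}) - {body[:200]}")
        return None
    return None
```

Passing `send` and `sleep` as parameters keeps the retry policy separate from the HTTP library, which also makes it easy to test without real network calls or real delays.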

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling and logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
2025-12-09 07:57:22 +01:00
Tour
8a2b005d4a move.venv 2025-12-09 07:11:09 +01:00
Tour
b0ee52b686 enrich data 2025-12-09 02:05:46 +01:00
Tour
06f63732b1 enrich data 2025-12-09 01:39:36 +01:00
Tour
83d0fc1329 enrich data 2025-12-09 01:19:55 +01:00
Tour
999c5609b6 fix-sqlite3 2025-12-08 23:27:52 +01:00
Tour
207916c1fe remove-field-scraped_at_timestamp 2025-12-08 13:07:45 +01:00
Tour
d67cb15748 remove-field-scraped_at_timestamp 2025-12-07 16:57:54 +01:00
Tour
ce2fa60ee9 upgrade-speed-auctions 2025-12-07 16:53:44 +01:00
Tour
b1905164bd enrich data 2025-12-07 16:26:30 +01:00
Tour
fd69faebcc it 2025-12-07 16:00:38 +01:00
Tour
1b336c49ba gogo 2025-12-07 12:32:39 +01:00
Tour
07d58cf59c gogo 2025-12-07 12:26:41 +01:00
Tour
17af27ee99 gogo 2025-12-07 12:20:51 +01:00
Tour
450ec33101 fix-os-error-after-completion 2025-12-07 10:21:47 +01:00
Tour
7e6629641f performance 2025-12-07 08:56:08 +01:00
Tour
c56d63d6fa go 2025-12-07 08:47:24 +01:00
Tour
bf6dc39a95 Add mobile network routing for scaev container 2025-12-07 08:32:56 +01:00
Tour
5bf1b31183 enrich data 2025-12-07 07:09:16 +01:00
Tour
3a77c8b0cd enrich data 2025-12-07 06:40:32 +01:00
Tour
b5ef8029ce enrich data 2025-12-07 06:09:45 +01:00
Tour
765361d582 enrichment 2025-12-07 02:20:14 +01:00
Tour
08bf112c3f enrich data 2025-12-07 01:59:45 +01:00
Tour
d09ee5574f enrich data 2025-12-07 01:26:48 +01:00
Tour
bb7f4bbe9d GraphQL integrate, data correctness 2025-12-07 00:36:57 +01:00
Tour
71567fd965 GraphQL integrate, data correctness 2025-12-07 00:25:25 +01:00
Tour
8c5f6016ec go 2025-12-06 21:27:11 +01:00
Tour
21e97ada0d new-style 2025-12-05 20:11:39 +01:00
Tour
19653290a6 all 2025-12-05 09:33:51 +01:00
Tour
d8a53bacb6 all 2025-12-05 09:02:42 +01:00
Tour
6107b80ad1 all 2025-12-05 08:49:43 +01:00
Tour
aea188699f integrating with monitor app 2025-12-05 06:48:08 +01:00
Tour
72afdf772b a 2025-12-04 15:35:26 +01:00
Tour
ee13f68740 a 2025-12-04 15:30:46 +01:00
Tour
dd398862e3 a 2025-12-04 15:29:07 +01:00
Tour
021a75396e a 2025-12-04 15:26:33 +01:00
Tour
05b5e63762 a 2025-12-04 15:21:24 +01:00
Tour
bb3135e7ad a 2025-12-04 14:58:24 +01:00
Tour
b12f3a5ee2 a 2025-12-04 14:53:55 +01:00
Tour
79e14be37a first 2025-12-04 14:49:58 +01:00