- Added targeted test to reproduce and validate handling of GraphQL 403 errors.

- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
This commit is contained in:
Tour
2025-12-09 07:57:22 +01:00
parent 8a2b005d4a
commit d18f08aa36
17 changed files with 170 additions and 2676 deletions

32
.gitignore vendored
View File

@@ -28,8 +28,6 @@ share/python-wheels/
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
@@ -83,31 +81,6 @@ target/
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
@@ -155,11 +128,6 @@ dmypy.json
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
# Project specific - Scaev