- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
- Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
- Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
- Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
- Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
- 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
- To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
- Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
- After completion, print: `Downloaded: K/N new images`.
- Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
- Added browser-like headers and contextual Referer.
- Added small retry with backoff for 403/429.
- Improved error logs to include status, lot id, and a short body snippet.
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
145 lines
1.5 KiB
Plaintext
145 lines
1.5 KiB
Plaintext
### Python template
|
|
# Byte-compiled / optimized / DLL files
|
|
__pycache__/
|
|
*.py[cod]
|
|
*$py.class
|
|
|
|
# C extensions
|
|
*.so
|
|
|
|
# Distribution / packaging
|
|
.Python
|
|
build/
|
|
develop-eggs/
|
|
dist/
|
|
downloads/
|
|
eggs/
|
|
.eggs/
|
|
lib/
|
|
lib64/
|
|
parts/
|
|
sdist/
|
|
var/
|
|
wheels/
|
|
share/python-wheels/
|
|
*.egg-info/
|
|
.installed.cfg
|
|
*.egg
|
|
MANIFEST
|
|
|
|
# PyInstaller
|
|
*.manifest
|
|
*.spec
|
|
|
|
# Installer logs
|
|
pip-log.txt
|
|
pip-delete-this-directory.txt
|
|
|
|
# Unit test / coverage reports
|
|
htmlcov/
|
|
.tox/
|
|
.nox/
|
|
.coverage
|
|
.coverage.*
|
|
.cache
|
|
nosetests.xml
|
|
coverage.xml
|
|
*.cover
|
|
*.py,cover
|
|
.hypothesis/
|
|
.pytest_cache/
|
|
cover/
|
|
|
|
# Translations
|
|
*.mo
|
|
*.pot
|
|
|
|
# Django stuff:
|
|
*.log
|
|
local_settings.py
|
|
db.sqlite3
|
|
db.sqlite3-journal
|
|
|
|
# Flask stuff:
|
|
instance/
|
|
.webassets-cache
|
|
|
|
# Scrapy stuff:
|
|
.scrapy
|
|
|
|
# Sphinx documentation
|
|
docs/_build/
|
|
|
|
# PyBuilder
|
|
.pybuilder/
|
|
target/
|
|
|
|
# Jupyter Notebook
|
|
.ipynb_checkpoints
|
|
|
|
# IPython
|
|
profile_default/
|
|
ipython_config.py
|
|
|
|
.pdm.toml
|
|
.pdm-python
|
|
.pdm-build/
|
|
|
|
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
|
|
__pypackages__/
|
|
|
|
# Celery stuff
|
|
celerybeat-schedule
|
|
celerybeat.pid
|
|
|
|
# SageMath parsed files
|
|
*.sage.py
|
|
|
|
# Environments
|
|
.env
|
|
.venv
|
|
env/
|
|
venv/
|
|
ENV/
|
|
env.bak/
|
|
venv.bak/
|
|
|
|
# Spyder project settings
|
|
.spyderproject
|
|
.spyproject
|
|
|
|
# Rope project settings
|
|
.ropeproject
|
|
|
|
# mkdocs documentation
|
|
/site
|
|
|
|
# mypy
|
|
.mypy_cache/
|
|
.dmypy.json
|
|
dmypy.json
|
|
|
|
# Pyre type checker
|
|
.pyre/
|
|
|
|
# pytype static type analyzer
|
|
.pytype/
|
|
|
|
# Cython debug symbols
|
|
cython_debug/
|
|
|
|
.idea/
|
|
|
|
# Project specific - Scaev
|
|
output/
|
|
*.db
|
|
*.csv
|
|
*.json
|
|
!requirements.txt
|
|
|
|
# Playwright
|
|
.playwright/
|
|
|
|
# macOS
|
|
.DS_Store
|