- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
- Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
- Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
- Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
- Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
- 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
- To mitigate, I added realistic headers (User-Agent, Origin, Referer) plus a small retry with backoff for 403/429 to absorb transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the response body for troubleshooting.
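The browser-like headers can be sketched roughly as follows. The User-Agent string, site origin, and the `headers_for_lot` helper are illustrative placeholders, not the exact values in `src/graphql_client.py`:

```python
# Hypothetical sketch of the browser-like headers sent with the GraphQL POST.
# The UA string and origin below are placeholders, not the real values.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Origin": "https://auction-site.example",  # assumed site origin
    "Content-Type": "application/json",
}


def headers_for_lot(lot_id: str) -> dict:
    """Attach a contextual Referer pointing at the lot's HTML page."""
    return {**BROWSER_HEADERS, "Referer": f"https://auction-site.example/l/{lot_id}"}
```

The contextual Referer matters because WAF rules often check that API calls look like they originated from a page view of the same resource.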
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
- Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
- After completion, print: `Downloaded: K/N new images`.
- Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
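The in-place progress line and summary could be implemented along these lines. The helper names are hypothetical; the real logic lives inline in `src/scraper.py`:

```python
import sys


def report_progress(done: int, total: int) -> None:
    """Rewrite the same terminal line as each image download completes."""
    sys.stdout.write(f"\rDownloading images: {done}/{total}")
    sys.stdout.flush()


def summarize(downloaded_indexes: list, total: int, max_listed: int = 20) -> list:
    """Build the summary lines: count of new images, then the first 20
    downloaded indexes with a '(+M more)' suffix if the list is longer."""
    lines = [f"Downloaded: {len(downloaded_indexes)}/{total} new images"]
    if downloaded_indexes:
        shown = ", ".join(str(i) for i in downloaded_indexes[:max_listed])
        extra = len(downloaded_indexes) - max_listed
        if extra > 0:
            shown += f" (+{extra} more)"
        lines.append(f"Indexes: {shown}")
    return lines
```

Writing `\r` without a newline keeps the progress on one line; the summary is printed normally once all downloads finish.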
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
- Added browser-like headers and contextual Referer.
- Added small retry with backoff for 403/429.
- Improved error logs to include status, lot id, and a short body snippet.
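The retry-with-backoff behavior described above can be sketched as follows. The function name, attempt count, and delays are illustrative assumptions, not the exact implementation in `src/graphql_client.py`:

```python
import asyncio

RETRY_STATUSES = {403, 429}


async def post_with_backoff(session, url, payload, lot_id, attempts=3, base_delay=0.5):
    """Minimal sketch: retry 403/429 with exponential backoff.

    On final failure, log the status, lot id, and a truncated body
    snippet, then return None instead of raising.
    """
    for attempt in range(attempts):
        async with session.post(url, json=payload) as resp:
            if resp.status not in RETRY_STATUSES:
                return await resp.json()
            body = (await resp.text())[:200]  # safe, truncated snippet
        if attempt < attempts - 1:
            await asyncio.sleep(base_delay * (2 ** attempt))
    print(f"GraphQL API error: {resp.status} (lot={lot_id}) — {body}")
    return None
```

Returning `None` on persistent 403s keeps the caller's "no data for this lot" path unchanged, which is exactly what the new test asserts.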
### How the logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
### Appendix: `test/test_graphql_403.py`

```python
import sys
from pathlib import Path

import pytest


@pytest.mark.asyncio
async def test_fetch_lot_bidding_data_403(monkeypatch):
    """
    Simulate a 403 from the GraphQL endpoint and verify:
    - Function returns None (graceful handling)
    - It attempts a retry and logs a clear 403 message
    """
    # Load modules directly from src using importlib to avoid path issues
    project_root = Path(__file__).resolve().parents[1]
    src_path = project_root / 'src'
    import importlib.util

    def _load_module(name, file_path):
        spec = importlib.util.spec_from_file_location(name, str(file_path))
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        spec.loader.exec_module(module)  # type: ignore
        return module

    # Load config first because graphql_client imports it by module name
    config = _load_module('config', src_path / 'config.py')
    graphql_client = _load_module('graphql_client', src_path / 'graphql_client.py')
    monkeypatch.setattr(config, "OFFLINE", False, raising=False)

    log_messages = []

    def fake_print(*args, **kwargs):
        msg = " ".join(str(a) for a in args)
        log_messages.append(msg)

    import builtins
    monkeypatch.setattr(builtins, "print", fake_print)

    class MockResponse:
        def __init__(self, status=403, text_body="Forbidden"):
            self.status = status
            self._text_body = text_body

        async def json(self):
            return {}

        async def text(self):
            return self._text_body

        async def __aenter__(self):
            return self

        async def __aexit__(self, exc_type, exc, tb):
            return False

    class MockSession:
        def __init__(self, *args, **kwargs):
            pass

        def post(self, *args, **kwargs):
            # Always return 403
            return MockResponse(403, "Forbidden by WAF")

        async def __aenter__(self):
            return self

        async def __aexit__(self, exc_type, exc, tb):
            return False

    # Patch aiohttp.ClientSession to our mock, so that an `import aiohttp`
    # inside the client resolves to this dummy module
    import types
    dummy_aiohttp = types.SimpleNamespace()
    dummy_aiohttp.ClientSession = MockSession
    monkeypatch.setitem(sys.modules, 'aiohttp', dummy_aiohttp)

    result = await graphql_client.fetch_lot_bidding_data("A1-40179-35")

    # Should gracefully return None
    assert result is None

    # Should have logged a 403 at least once
    assert any("GraphQL API error: 403" in m for m in log_messages)
```