- Added a targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly, so it is independent of `sys.path` quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message, and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
  - The 403s come from the GraphQL endpoint, not the HTML page. They are likely caused by WAF/CDN protections that reject non-browser-like requests or spikes in request rate.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to handle transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
2) Incremental/in-place logging for downloads
- Updated the image download section of `src/scraper.py` to:
  - Show in-place progress: `Downloading images: X/N`, updated live as each image finishes.
  - After completion, print `Downloaded: K/N new images`.
  - List the indexes of the images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and a contextual Referer.
  - Added a small retry with backoff for 403/429.
  - Improved error logs to include the status, lot id, and a short body snippet.
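
For readers who want the shape of the retry logic without opening `src/graphql_client.py`, here is a minimal sketch of the behaviour described above. It is not the code from this commit: `build_headers`, `body_snippet`, `post_with_retry`, the `send` callable, the `example.com` URLs, the retry count, and the delays are all hypothetical stand-ins, and the log wording only approximates the real message.

```python
import asyncio

# All names below are hypothetical; the real client may be structured differently.
RETRY_STATUSES = {403, 429}


def build_headers(lot_id, site="https://example.com"):
    """Browser-like headers with a per-lot Referer (the URL layout is an assumption)."""
    return {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Origin": site,
        "Referer": f"{site}/l/{lot_id}",
        "Content-Type": "application/json",
    }


def body_snippet(body, limit=200):
    """Collapse whitespace and truncate so the log line stays short."""
    flat = " ".join(body.split())
    return flat if len(flat) <= limit else flat[:limit] + "..."


async def post_with_retry(send, lot_id, attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Call `send(headers)` -> (status, body); retry 403/429 with exponential backoff.

    Returns the body on HTTP 200, otherwise logs the failure and returns None.
    """
    for attempt in range(attempts):
        status, body = await send(build_headers(lot_id))
        if status == 200:
            return body
        if status in RETRY_STATUSES and attempt < attempts - 1:
            # Wait 0.5s, 1s, 2s, ... before the next try.
            await sleep(base_delay * 2 ** attempt)
            continue
        print(f"GraphQL API error: {status} (lot={lot_id}) - {body_snippet(body)}")
        return None
    return None
```

Passing the transport in as `send` keeps the sketch independent of `aiohttp`; in a real client that callable would wrap the session's `post` call and read the response status and text.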
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
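
The output above could be produced along the following lines. This is a sketch under assumptions, not the code in `src/scraper.py`: the helper names `report_progress`, `format_index_summary`, and `summarize` are invented for illustration, and the in-place update is assumed to use a carriage return.

```python
import sys


def report_progress(done, total, out=sys.stdout):
    """Rewrite the current terminal line instead of printing a new one."""
    # "\r" moves the cursor back to column 0, so the next write overwrites the line.
    out.write(f"\rDownloading images: {done}/{total}")
    out.flush()


def format_index_summary(indexes, limit=20):
    """List the first `limit` indexes, then '(+M more)' if any were left out."""
    shown = list(indexes)
    head = ", ".join(str(i) for i in shown[:limit])
    extra = len(shown) - limit
    if extra > 0:
        return f"Indexes: {head} (+{extra} more)"
    return f"Indexes: {head}"


def summarize(downloaded, total):
    """Return the summary lines printed once all downloads have finished."""
    if not downloaded:
        return [f"All {total} images already cached"]
    return [
        f"Downloaded: {len(downloaded)}/{total} new images",
        format_index_summary(downloaded),
    ]
```

A caller would invoke `report_progress` after each finished image, print a newline once the loop ends, and then print each line returned by `summarize`.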
### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
85
test/test_graphql_403.py
Normal file
@@ -0,0 +1,85 @@
import asyncio
import types
import sys
from pathlib import Path
import pytest


@pytest.mark.asyncio
async def test_fetch_lot_bidding_data_403(monkeypatch):
    """
    Simulate a 403 from the GraphQL endpoint and verify:
    - Function returns None (graceful handling)
    - It attempts a retry and logs a clear 403 message
    """
    # Load modules directly from src using importlib to avoid path issues
    project_root = Path(__file__).resolve().parents[1]
    src_path = project_root / 'src'
    import importlib.util

    def _load_module(name, file_path):
        spec = importlib.util.spec_from_file_location(name, str(file_path))
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        spec.loader.exec_module(module)  # type: ignore
        return module

    # Load config first because graphql_client imports it by module name
    config = _load_module('config', src_path / 'config.py')
    graphql_client = _load_module('graphql_client', src_path / 'graphql_client.py')
    monkeypatch.setattr(config, "OFFLINE", False, raising=False)

    log_messages = []

    def fake_print(*args, **kwargs):
        msg = " ".join(str(a) for a in args)
        log_messages.append(msg)

    import builtins
    monkeypatch.setattr(builtins, "print", fake_print)

    class MockResponse:
        def __init__(self, status=403, text_body="Forbidden"):
            self.status = status
            self._text_body = text_body

        async def json(self):
            return {}

        async def text(self):
            return self._text_body

        async def __aenter__(self):
            return self

        async def __aexit__(self, exc_type, exc, tb):
            return False

    class MockSession:
        def __init__(self, *args, **kwargs):
            pass

        def post(self, *args, **kwargs):
            # Always return 403
            return MockResponse(403, "Forbidden by WAF")

        async def __aenter__(self):
            return self

        async def __aexit__(self, exc_type, exc, tb):
            return False

    # Patch aiohttp.ClientSession to our mock
    import types as _types
    dummy_aiohttp = _types.SimpleNamespace()
    dummy_aiohttp.ClientSession = MockSession
    # Ensure that an `import aiohttp` inside the function resolves to our dummy
    monkeypatch.setitem(sys.modules, 'aiohttp', dummy_aiohttp)

    result = await graphql_client.fetch_lot_bidding_data("A1-40179-35")

    # Should gracefully return None
    assert result is None

    # Should have logged a 403 at least once
    assert any("GraphQL API error: 403" in m for m in log_messages)