- Added a targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
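The two test techniques named above (loading a module by file path with `importlib`, and capturing `print` output) can be sketched roughly as follows. This is a minimal illustration, not the contents of `test/test_graphql_403.py`: the helpers `load_module_from_path` and `run_and_capture` and the throwaway `demo_client` module are hypothetical, and the stand-in `fetch` is synchronous, whereas the real client is mocked at the `aiohttp.ClientSession` level.

```python
import builtins
import importlib.util
import tempfile
from pathlib import Path
from unittest import mock


def load_module_from_path(name, path):
    """Load a module straight from a file, without relying on sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_and_capture(func, *args):
    """Call func while recording every line it passes to print()."""
    captured = []

    def fake_print(*parts, **_kwargs):
        captured.append(" ".join(str(p) for p in parts))

    with mock.patch.object(builtins, "print", fake_print):
        result = func(*args)
    return result, captured


# Hypothetical stand-in module, written to a temp dir purely for this demo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_client.py"
    demo_path.write_text(
        "def fetch(lot_id):\n"
        "    print(f'GraphQL API error: 403 (lot={lot_id})')\n"
        "    return None\n"
    )
    demo = load_module_from_path("demo_client", demo_path)
    result, lines = run_and_capture(demo.fetch, "A1-40179-35")
```

Here `result` is `None` and `lines` holds the single logged error line, which mirrors the two assertions the commit message describes.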
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
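As a rough sketch of how this kind of output can be produced (not the actual code in `src/scraper.py`), the in-place line relies on a carriage return, and the index list is truncated after a limit. The function names `report_progress` and `format_index_summary` are assumptions made for this example.

```python
import io
import sys


def format_index_summary(indexes, limit=20):
    """Show the first `limit` indexes, then '(+M more)' when the list is longer."""
    shown = ", ".join(str(i) for i in indexes[:limit])
    hidden = len(indexes) - limit
    if hidden > 0:
        return f"Indexes: {shown} (+{hidden} more)"
    return f"Indexes: {shown}"


def report_progress(done, total, stream=sys.stdout):
    # "\r" moves the cursor back to column 0, so each write overwrites the last.
    stream.write(f"\rDownloading images: {done}/{total}")
    stream.flush()


# Demo against an in-memory stream instead of a real terminal.
buffer = io.StringIO()
for finished in range(0, 7):
    report_progress(finished, 6, stream=buffer)
buffer.write("\n")  # finish the progress line before printing the summary
```

On a real terminal only the final `Downloading images: 6/6` stays visible; in the `StringIO` buffer every intermediate write is retained, which makes the behaviour easy to inspect in a test.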
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
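The retry-and-log behaviour could look something like the sketch below. It is an illustration under stated assumptions rather than the code in `src/graphql_client.py`: `send` is a hypothetical callable standing in for the real `aiohttp` request, the retry count, delays, and snippet length are invented, the `Origin` value is inferred from the site URL, and the User-Agent string is only an example.

```python
import time

RETRYABLE_STATUSES = {403, 429}


def build_headers(referer):
    """Browser-like headers; all values here are example assumptions."""
    return {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Origin": "https://www.troostwijkauctions.com",
        "Referer": referer,  # contextual: the lot page the request belongs to
    }


def post_with_retry(send, lot_id, retries=2, base_delay=0.5,
                    sleep=time.sleep, log=print):
    """Call send() -> (status, body); return body on 200, else None.

    403/429 are retried with exponential backoff; a persistent failure is
    logged with the status, the lot id, and a truncated body snippet.
    """
    for attempt in range(retries + 1):
        status, body = send()
        if status == 200:
            return body
        if status in RETRYABLE_STATUSES and attempt < retries:
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1.0s, ...
            continue
        snippet = body[:200]  # keep logs short and safe to paste
        log(f"GraphQL API error: {status} (lot={lot_id}) body={snippet!r}")
        return None


# Demo: one transient 429 followed by success, then a persistent 403.
logs = []
responses = iter([(429, "slow down"), (200, "ok")])
recovered = post_with_retry(lambda: next(responses), "A1-40179-35",
                            sleep=lambda _s: None, log=logs.append)
blocked = post_with_retry(lambda: (403, "Forbidden by WAF"), "A1-40179-35",
                          sleep=lambda _s: None, log=logs.append)
```

Injecting `sleep` and `log` keeps the sketch testable without real delays or patched builtins; the actual client may be structured differently.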
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
@@ -8,7 +8,7 @@ The scraper follows a **3-phase hierarchical crawling pattern** to extract aucti
```mariadb
┌─────────────────────────────────────────────────────────────────┐
-│ TROOSTWIJK SCRAPER │
+│ SCAEV SCRAPER │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
@@ -346,6 +346,48 @@ Lot Page Parsed
└── 001.jpg
```

## Terminal Progress per Lot (TTY)

During lot analysis, Scaev now shows a per‑lot TTY progress animation with a final summary of all inputs used:

- Spinner runs while enrichment is in progress.
- Summary lists every page/API used to analyze the lot with:
  - URL/label
  - Size in bytes
  - Source state: cache | realtime | offline | db | intercepted
  - Duration in ms

Example output snippet:

```
[LOT A1-28505-5] ✓ Done in 812 ms — pages/APIs used:
• [html] https://www.troostwijkauctions.com/l/... | 142331 B | cache | 4 ms
• [graphql] GraphQL lotDetails | 5321 B | realtime | 142 ms
• [rest] REST bid history | 18234 B | realtime | 236 ms
```

Notes:

- In non‑TTY environments the spinner is replaced by simple log lines.
- Intercepted GraphQL responses (captured during page load) are labeled as `intercepted` with near‑zero duration.
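A minimal sketch of the per-source summary line and the TTY check follows, assuming a record type and function names (`SourceRecord`, `format_record`, `wants_spinner`) that are invented for this example and may not match the reporter's real code.

```python
import io
import sys
from dataclasses import dataclass


@dataclass
class SourceRecord:
    kind: str         # e.g. html | graphql | rest
    label: str        # URL or human-readable label
    size_bytes: int
    state: str        # cache | realtime | offline | db | intercepted
    duration_ms: int


def format_record(rec):
    """Render one summary bullet in the layout shown in the example output."""
    return (f"• [{rec.kind}] {rec.label} | {rec.size_bytes} B | "
            f"{rec.state} | {rec.duration_ms} ms")


def wants_spinner(stream=sys.stdout):
    """Animate only on a real terminal; otherwise fall back to plain log lines."""
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


record = SourceRecord("graphql", "GraphQL lotDetails", 5321, "realtime", 142)
line = format_record(record)
spinner_in_pipe = wants_spinner(io.StringIO())  # StringIO is never a TTY
```

Checking `isatty()` on the output stream is the usual way to decide between an animation and simple log lines, which matches the non‑TTY note above.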

## Data Flow “Tunnel” (Simplified)

For each lot, the data “tunnels through” the following stages:

1. HTML page → parse `__NEXT_DATA__` for core lot fields and lot UUID.
2. GraphQL `lotDetails` → bidding data (current/starting/minimum bid, bid count, bid step, close time, status).
3. Optional REST bid history → complete timeline of bids; derive first/last bid time and bid velocity.
4. Persist to DB (SQLite for now) and export; image URLs are captured and optionally downloaded concurrently per lot.

Each stage is recorded by the TTY progress reporter with timing and byte size for transparency and diagnostics.
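Stage 1 can be illustrated with a small sketch. Next.js pages embed their data as JSON in a `<script id="__NEXT_DATA__">` tag, so extraction amounts to locating that tag and decoding it. The helper name `extract_next_data` and the sample payload, including the `lot` key and its `id` value, are made up for this example; the real page structure is not shown in this document.

```python
import json
import re

# Matches the JSON payload inside the Next.js data script tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html):
    """Return the decoded __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical page fragment; field names below are illustrative only.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"lot": {"id": "demo-uuid"}}}}'
    "</script></body></html>"
)
payload = extract_next_data(sample_html)
lot_id = payload["props"]["pageProps"]["lot"]["id"]
```

A regex is enough for a sketch because the tag has a fixed `id`; a production parser might prefer an HTML library to tolerate attribute-order and quoting variations.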

## Migrations and ORM Roadmap

- Migrations follow a Flyway‑style convention in `db/migration` (e.g., `V1__initial_schema.sql`).
- Current baseline is V1; there are no new migrations required at this time.
- Raw SQL usage remains in place (SQLite) while we prepare a gradual move to SQLAlchemy 2.x targeting PostgreSQL.
- See `docs/MIGRATIONS.md` for details on naming, workflow, and the future switch to PostgreSQL.
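To make the naming convention concrete, here is a hedged sketch of how a Flyway‑style filename such as `V1__initial_schema.sql` splits into a version and a description. It is deliberately simplified: it accepts only integer versions, and `parse_migration_name` is a hypothetical helper, not something the project is known to contain. `docs/MIGRATIONS.md` remains the authority on the actual rules.

```python
import re

# V<version>__<description>.sql, with a double underscore as the separator.
MIGRATION_RE = re.compile(r"^V(\d+)__(\w+)\.sql$")


def parse_migration_name(filename):
    """Return (version, description) for a versioned migration, else None."""
    match = MIGRATION_RE.match(filename)
    if match is None:
        return None
    version = int(match.group(1))
    description = match.group(2).replace("_", " ")
    return version, description


def sort_migrations(filenames):
    """Order valid migration files by numeric version, ignoring other files."""
    parsed = [(parse_migration_name(name), name) for name in filenames]
    return [name for key, name in sorted(p for p in parsed if p[0] is not None)]


baseline = parse_migration_name("V1__initial_schema.sql")
# The V2 and V10 names below are invented to show numeric (not lexical) ordering.
ordered = sort_migrations(["V10__later.sql", "README.md", "V2__next.sql"])
```

Sorting on the parsed integer avoids the classic pitfall where `V10` sorts before `V2` as plain text.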

## Extension Points for Integration

### 1. **Downstream Processing Pipeline**