- Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details

1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
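
The retry behaviour described in item 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/graphql_client.py`: the names `browser_like_headers` and `post_with_retry`, the attempt count, the delays, the User-Agent string, and the 200-character snippet limit are all hypothetical. The HTTP call is passed in as a `send` coroutine so the sketch runs without `aiohttp`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple
from urllib.parse import urlsplit

RETRY_STATUSES = {403, 429}  # the statuses the commit message says are retried


def browser_like_headers(referer: str) -> Dict[str, str]:
    """Build browser-style headers; Origin is derived from the Referer URL.

    The User-Agent value is an assumed placeholder, not the one used in the repo.
    """
    parts = urlsplit(referer)
    return {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Origin": f"{parts.scheme}://{parts.netloc}",
        "Referer": referer,
        "Content-Type": "application/json",
    }


async def post_with_retry(
    send: Callable[[], Awaitable[Tuple[int, str]]],
    lot_id: str,
    attempts: int = 3,           # assumed default
    base_delay: float = 0.5,     # assumed default, in seconds
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    log: Callable[[str], None] = print,
) -> Optional[str]:
    """Call send() up to `attempts` times; return the body on HTTP 200, else None."""
    status, body = 0, ""
    for attempt in range(attempts):
        status, body = await send()
        if status == 200:
            return body
        if status not in RETRY_STATUSES:
            break  # other errors are not retried in this sketch
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            await sleep(base_delay * 2 ** attempt)
    # Collapse whitespace and truncate so the log stays short and single-line.
    snippet = " ".join(body.split())[:200]
    log(f"GraphQL API error: {status} (lot={lot_id}) - {snippet}")
    return None
```

In a real client, `send` would wrap the session's POST call and return `(status, text)`; injecting `sleep` and `log` keeps the helper easy to test without waiting or patching `builtins.print`.
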
### How your example logs will look now

For a lot where GraphQL returns 403:

```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:

```
Images: 6
Downloading images: 0/6 ... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```

(When all cached: `All 6 images already cached`)

### Notes

- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
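
For reference, the download logging from item 2 could be produced along these lines. This is a minimal sketch, not the code in `src/scraper.py`: the function names are hypothetical, and the plain-line fallback for non-TTY streams is an assumption borrowed from how the docs describe the spinner.

```python
import sys
from typing import Iterable, List, TextIO


def progress_line(done: int, total: int, stream: TextIO = sys.stdout) -> None:
    """Rewrite the current terminal line on a TTY; emit plain lines otherwise."""
    text = f"Downloading images: {done}/{total}"
    if stream.isatty():
        # A carriage return moves the cursor to column 0 so the next write
        # overwrites the previous counter instead of adding a new line.
        stream.write("\r" + text)
        if done == total:
            stream.write("\n")
    else:
        stream.write(text + "\n")  # assumed fallback for pipes and log files
    stream.flush()


def format_indexes(indexes: Iterable[int], limit: int = 20) -> str:
    """List the first `limit` indexes, then '(+M more)' for the remainder."""
    items = sorted(indexes)
    shown = ", ".join(str(i) for i in items[:limit])
    extra = len(items) - limit
    return f"{shown} (+{extra} more)" if extra > 0 else shown


def download_summary(downloaded: List[int], total: int) -> List[str]:
    """Return the summary lines printed once all downloads for a lot finish."""
    if not downloaded:
        return [f"All {total} images already cached"]
    return [
        f"Downloaded: {len(downloaded)}/{total} new images",
        f"Indexes: {format_indexes(downloaded)}",
    ]
```

Calling `progress_line(i, n)` after each finished image and printing the lines from `download_summary(...)` at the end gives output in the shape shown above.
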
2025-12-09 09:15:49 +01:00
parent e69563d4d6
commit 5a755a2125
8 changed files with 512 additions and 31 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -8,7 +8,7 @@ The scraper follows a **3-phase hierarchical crawling pattern** to extract aucti

 ```mariadb
 ┌─────────────────────────────────────────────────────────────────┐
-│                     TROOSTWIJK SCRAPER                          │
+│                         SCAEV SCRAPER                           │
 └─────────────────────────────────────────────────────────────────┘

 ┌─────────────────────────────────────────────────────────────────┐
@@ -346,6 +346,48 @@ Lot Page Parsed
        └── 001.jpg
 ```

+## Terminal Progress per Lot (TTY)
+
+During lot analysis, Scaev now shows a per‑lot TTY progress animation with a final summary of all inputs used:
+
+- Spinner runs while enrichment is in progress.
+- Summary lists every page/API used to analyze the lot with:
+  - URL/label
+  - Size in bytes
+  - Source state: cache | realtime | offline | db | intercepted
+  - Duration in ms
+
+Example output snippet:
+
+```
+[LOT A1-28505-5] ✓ Done in 812 ms — pages/APIs used:
+  • [html] https://www.troostwijkauctions.com/l/... | 142331 B | cache | 4 ms
+  • [graphql] GraphQL lotDetails | 5321 B | realtime | 142 ms
+  • [rest] REST bid history | 18234 B | realtime | 236 ms
+```
+
+Notes:
+- In non‑TTY environments the spinner is replaced by simple log lines.
+- Intercepted GraphQL responses (captured during page load) are labeled as `intercepted` with near‑zero duration.
+
+## Data Flow “Tunnel” (Simplified)
+
+For each lot, the data “tunnels through” the following stages:
+
+1. HTML page → parse `__NEXT_DATA__` for core lot fields and lot UUID.
+2. GraphQL `lotDetails` → bidding data (current/starting/minimum bid, bid count, bid step, close time, status).
+3. Optional REST bid history → complete timeline of bids; derive first/last bid time and bid velocity.
+4. Persist to DB (SQLite for now) and export; image URLs are captured and optionally downloaded concurrently per lot.
+
+Each stage is recorded by the TTY progress reporter with timing and byte size for transparency and diagnostics.
+
+## Migrations and ORM Roadmap
+
+- Migrations follow a Flyway‑style convention in `db/migration` (e.g., `V1__initial_schema.sql`).
+- Current baseline is V1; there are no new migrations required at this time.
+- Raw SQL usage remains in place (SQLite) while we prepare a gradual move to SQLAlchemy 2.x targeting PostgreSQL.
+- See `docs/MIGRATIONS.md` for details on naming, workflow, and the future switch to PostgreSQL.
+
 ## Extension Points for Integration

 ### 1. **Downstream Processing Pipeline**
--- a/docs/Deployment.md
+++ b/docs/Deployment.md
@@ -1,4 +1,4 @@
-# Deployment
+# Deployment (Scaev)

 ## Prerequisites

@@ -12,8 +12,8 @@

 ```bash
 # Clone repository
-git clone git@git.appmodel.nl:Tour/troost-scraper.git
-cd troost-scraper
+git clone git@git.appmodel.nl:Tour/scaev.git
+cd scaev

 # Create virtual environment
 python -m venv .venv
@@ -41,8 +41,8 @@ MAX_PAGES = 50
 ### 3. Create Output Directories

 ```bash
-sudo mkdir -p /var/troost-scraper/output
-sudo chown $USER:$USER /var/troost-scraper
+sudo mkdir -p /var/scaev/output
+sudo chown $USER:$USER /var/scaev
 ```

 ### 4. Run as Cron Job
@@ -51,7 +51,7 @@ Add to crontab (`crontab -e`):

 ```bash
 # Run scraper daily at 2 AM
-0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
+0 2 * * * cd /path/to/scaev && /path/to/.venv/bin/python main.py >> /var/log/scaev.log 2>&1
 ```

 ## Docker Deployment (Optional)
@@ -82,8 +82,8 @@ CMD ["python", "main.py"]
 Build and run:

 ```bash
-docker build -t troost-scraper .
-docker run -v /path/to/output:/output troost-scraper
+docker build -t scaev .
+docker run -v /path/to/output:/output scaev
 ```

 ## Monitoring
@@ -91,13 +91,13 @@ docker run -v /path/to/output:/output troost-scraper
 ### Check Logs

 ```bash
-tail -f /var/log/troost-scraper.log
+tail -f /var/log/scaev.log
 ```

 ### Monitor Output

 ```bash
-ls -lh /var/troost-scraper/output/
+ls -lh /var/scaev/output/
 ```

 ## Troubleshooting
@@ -113,7 +113,7 @@ playwright install --force chromium

 ```bash
 # Fix permissions
-sudo chown -R $USER:$USER /var/troost-scraper
+sudo chown -R $USER:$USER /var/scaev
 ```

 ### Memory Issues