- Added targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly from their files, so the test is independent of sys.path quirks (sketched after this list).
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
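
A minimal sketch of that loading-plus-mocking approach. The fake session shape and the call pattern are assumptions; the function and file names come from the test description above:

```
import asyncio
import importlib.util
from unittest.mock import patch

def load_module(name: str, path: str):
    # Load a module straight from its file, bypassing sys.path entirely.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

graphql_client = load_module("graphql_client", "src/graphql_client.py")

class Fake403Response:
    status = 403
    async def text(self):
        return "Forbidden by WAF"
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        return False

class FakeSession:
    # Stands in for aiohttp.ClientSession; every POST yields a 403.
    def post(self, *args, **kwargs):
        return Fake403Response()
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        return False

with patch("aiohttp.ClientSession", return_value=FakeSession()):
    result = asyncio.run(graphql_client.fetch_lot_bidding_data("A1-40179-35"))
assert result is None
```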

- Root cause insights (from the investigation and improved logging):
  - The 403s come from the GraphQL endpoint, not the HTML page. They are most likely WAF/CDN protections rejecting non-browser-like requests or rate spikes.
  - To mitigate, I added realistic browser headers (User-Agent, Origin, Referer) and a short retry with backoff for 403/429 to ride out transient protection triggers. When a 403 persists, we now log the status plus a safe, truncated snippet of the response body for troubleshooting (see the client sketch under item 3).

2) Incremental/in-place logging for downloads
- Updated the image download section of `src/scraper.py` to (progress sketch below):
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
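
A minimal sketch of the in-place counter (the function name is illustrative; the real code lives in the `src/scraper.py` download loop):

```
import sys

def report_progress(done: int, total: int) -> None:
    # \r returns to the start of the line, so the counter overwrites
    # itself instead of printing one line per finished image.
    sys.stdout.write(f"\r  Downloading images: {done}/{total}")
    sys.stdout.flush()
    if done == total:
        sys.stdout.write("\n")  # move off the progress line when finished
```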

3) GraphQL client improvements
- Updated `src/graphql_client.py` (sketch below):
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
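
A minimal sketch of the hardened request path, assuming aiohttp; the endpoint, header values, and backoff schedule are placeholders, not the exact client code:

```
import asyncio
import aiohttp

GRAPHQL_URL = "https://example.com/graphql"  # placeholder endpoint
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Origin": "https://example.com",
    "Referer": "https://example.com/lots/",  # contextual per-lot Referer
}

async def post_with_retry(session, payload, lot_id, retries=3):
    delay = 1.0
    for attempt in range(retries):
        async with session.post(GRAPHQL_URL, json=payload, headers=HEADERS) as resp:
            if resp.status in (403, 429) and attempt < retries - 1:
                await asyncio.sleep(delay)  # back off before retrying
                delay *= 2
                continue
            if resp.status != 200:
                body = (await resp.text())[:200]  # safe, truncated snippet
                print(f"  GraphQL API error: {resp.status} (lot={lot_id}) — {body}")
                return None
            return await resp.json()
```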

### How the example logs look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.

Auto-Start Setup Guide

The monitor doesn't run automatically yet. Choose your setup based on your server OS:


Linux (systemd)

Install:

cd /home/tour/scaev
chmod +x install_service.sh
./install_service.sh

The service will (example unit file below):

  • Start automatically on server boot
  • Restart automatically if it crashes
  • Log to ~/scaev/logs/monitor.log
  • Poll every 30 minutes
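
For reference, the unit file behind those properties would look roughly like this. The paths, user, and Restart values are assumptions; install_service.sh presumably writes the real file:

[Unit]
Description=Scaev auction monitor
After=network-online.target

[Service]
User=tour
WorkingDirectory=/home/tour/scaev
ExecStart=/usr/bin/python3 src/monitor.py 30
StandardOutput=append:/home/tour/scaev/logs/monitor.log
StandardError=append:/home/tour/scaev/logs/monitor.log
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target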

Management commands:

sudo systemctl status scaev-monitor     # Check if running
sudo systemctl stop scaev-monitor       # Stop
sudo systemctl start scaev-monitor      # Start
sudo systemctl restart scaev-monitor    # Restart
journalctl -u scaev-monitor -f          # Live logs
tail -f ~/scaev/logs/monitor.log        # Monitor log file

Windows (Task Scheduler)

Install (Run as Administrator):

cd C:\vibe\scaev
.\setup_windows_task.ps1

The task will:

  • Start automatically on Windows boot
  • Restart automatically if it crashes (up to 3 times)
  • Run as SYSTEM user
  • Poll every 30 minutes

Management:

  1. Open Task Scheduler (taskschd.msc)
  2. Find ScaevAuctionMonitor in Task Scheduler Library
  3. Right-click to Run/Stop/Disable

Or via PowerShell:

Start-ScheduledTask -TaskName "ScaevAuctionMonitor"
Stop-ScheduledTask -TaskName "ScaevAuctionMonitor"
Get-ScheduledTask -TaskName "ScaevAuctionMonitor" | Get-ScheduledTaskInfo

Alternative: Cron Job (Linux)

For simpler setup without systemd:

# Edit crontab
crontab -e

# Add these lines (start on boot; the hourly entry restarts the monitor if it's not running)
@reboot cd /home/tour/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1
0 * * * * pgrep -f "monitor.py" || (cd /home/tour/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)

Verify It's Working

Check process is running:

# Linux
ps aux | grep monitor.py

# Windows
tasklist | findstr python

Check logs:

# Linux
tail -f ~/scaev/logs/monitor.log

# Windows
# Check Task Scheduler history

Check database is updating:

# Last modified time should update every 30 minutes
ls -lh C:/mnt/okcomputer/output/cache.db

Troubleshooting

Service won't start:

  1. Check Python path is correct in service file
  2. Check working directory exists
  3. Check user permissions
  4. View error logs: journalctl -u scaev-monitor -n 50

Monitor stops after a while:

  • Check disk space for logs
  • Check rate limiting isn't blocking requests
  • Increase RestartSec in service file

Database locked errors:

  • Ensure only one monitor instance is running
  • Add a timeout to SQLite connections in config (example below)
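
A one-line example of that timeout, assuming the standard sqlite3 module; the 30-second value is illustrative:

import sqlite3
conn = sqlite3.connect("C:/mnt/okcomputer/output/cache.db", timeout=30)  # wait up to 30s on a locked database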