### Summary (commit 2dda1aff00)
- Added a targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
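A minimal sketch of that test's shape (the fake-response details, the loading of only `src/graphql_client.py`, and the assumption that the client creates its session via `aiohttp.ClientSession` are illustrative, not the exact contents of `test/test_graphql_403.py`):

```python
# Sketch only; the real test/test_graphql_403.py may differ in details.
import asyncio
import importlib.util


def load_module(name, path):
    # Load a source file directly so the test does not depend on sys.path.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


class Fake403Response:
    status = 403

    async def text(self):
        return "Forbidden"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class FakeSession:
    def post(self, *args, **kwargs):
        return Fake403Response()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def test_fetch_lot_bidding_data_handles_403(monkeypatch):
    gql = load_module("graphql_client", "src/graphql_client.py")

    logged = []
    monkeypatch.setattr("builtins.print",
                        lambda *args, **kw: logged.append(" ".join(map(str, args))))
    # Assumption: the client builds its session via aiohttp.ClientSession(...)
    monkeypatch.setattr(gql.aiohttp, "ClientSession", lambda *a, **k: FakeSession())

    result = asyncio.run(gql.fetch_lot_bidding_data("A1-40179-35"))

    assert result is None
    assert any("GraphQL API error: 403" in line for line in logged)
```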

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
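The in-place counter boils down to rewriting the same terminal line with a carriage return. A small sketch of that pattern (function names are illustrative; the real logic sits inside the download loop in `src/scraper.py`):

```python
import sys


def show_progress(done: int, total: int) -> None:
    # "\r" returns to the start of the line so the counter updates in place.
    sys.stdout.write(f"\r  Downloading images: {done}/{total}")
    sys.stdout.flush()


def show_summary(downloaded_indexes: list[int], total: int) -> None:
    sys.stdout.write("\n")
    print(f"  Downloaded: {len(downloaded_indexes)}/{total} new images")
    if downloaded_indexes:
        shown = downloaded_indexes[:20]
        extra = len(downloaded_indexes) - len(shown)
        more = f" (+{extra} more)" if extra else ""
        print(f"    Indexes: {', '.join(map(str, shown))}{more}")
```

Each completed download calls `show_progress`; `show_summary` runs once after the last image finishes.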

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
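A sketch of that retry/backoff shape (the endpoint URL, header values, retry count, and truncation length are illustrative placeholders, not the exact values in `src/graphql_client.py`):

```python
import asyncio

import aiohttp

GRAPHQL_URL = "https://example.com/graphql"  # placeholder endpoint

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Origin": "https://example.com",
    "Referer": "https://example.com/lots/",
}


async def post_with_retry(session: aiohttp.ClientSession, payload: dict,
                          lot_id: str, retries: int = 2) -> dict | None:
    for attempt in range(retries + 1):
        async with session.post(GRAPHQL_URL, json=payload,
                                headers=BROWSER_HEADERS) as resp:
            if resp.status in (403, 429) and attempt < retries:
                await asyncio.sleep(1.5 * (attempt + 1))  # small backoff, then retry
                continue
            if resp.status != 200:
                body = (await resp.text())[:120]  # safe, truncated snippet
                print(f"  GraphQL API error: {resp.status} (lot={lot_id}) — {body}")
                return None
            return await resp.json()
    return None
```

Keeping the retry small (a couple of attempts with a short sleep) is enough to ride out transient protection triggers without hammering the endpoint.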

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.

# Python Setup & IDE Guide
Short, clear, Python-focused.
---
## Requirements
- **Python 3.10+**
  Uses pattern matching, modern type hints, and async improvements (illustrated below).
```bash
python --version
```
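As a quick illustration (not code from this repo), the 3.10+ requirement covers features like `match` statements and `X | Y` union hints:

```python
# Illustrative only: fails to parse on Python < 3.10.
import sys

if sys.version_info < (3, 10):
    raise SystemExit("Python 3.10+ required")


def describe(value: int | str) -> str:  # PEP 604 union syntax (3.10+)
    match value:                        # structural pattern matching (3.10+)
        case int():
            return "an int"
        case str():
            return "a string"
        case _:
            return "something else"


print(describe(42))  # -> "an int"
```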
---
## IDE Setup (PyCharm / IntelliJ)
1. **Set interpreter:**
*File → Settings → Project → Python Interpreter → Select Python 3.10+*
2. **Fix syntax warnings:**
*Editor → Inspections → Python → Set language level to 3.10+*
3. **Ensure correct SDK:**
*Project Structure → Project SDK → Python 3.10+*
---
## Installation
```bash
# Activate venv (Windows PowerShell example)
~\venvs\scaev\Scripts\Activate.ps1
# Install deps
pip install -r requirements.txt
# Playwright browsers
playwright install chromium
```
---
## Database Configuration (PostgreSQL)
The scraper now uses PostgreSQL (no more SQLite files). Configure via `DATABASE_URL`:
- Default (baked in):
  `postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb`
- Override for your environment:
```bash
# Windows PowerShell
$env:DATABASE_URL = "postgresql://user:pass@host:5432/dbname"
# Linux/macOS
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
```
Packages used:
- Driver: `psycopg[binary]`
Nothing is written to local `.db` files anymore.
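A minimal sketch of how the default and the override can be combined (illustrative; the actual `src/config.py` may differ, and only the default URL and the `psycopg` driver come from this section):

```python
import os

import psycopg

# Default from this guide; any DATABASE_URL in the environment takes precedence.
DEFAULT_DATABASE_URL = (
    "postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb"
)
DATABASE_URL = os.environ.get("DATABASE_URL", DEFAULT_DATABASE_URL)


def get_connection() -> psycopg.Connection:
    # psycopg 3 accepts a libpq-style connection URL directly.
    return psycopg.connect(DATABASE_URL)
```

Setting `DATABASE_URL` in the environment (as shown above) then overrides the baked-in default.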
---
## Verify
```bash
python -c "import sys; print(sys.version)"
python main.py --help
```
Common fixes:
```bash
pip install playwright
playwright install chromium
```
---
# AutoStart (Monitor)
## Linux (systemd) — Recommended
```bash
cd ~/scaev
chmod +x install_service.sh
./install_service.sh
```
Service features:
- Autostart
- Autorestart
- Logs: `~/scaev/logs/monitor.log`
```bash
sudo systemctl status scaev-monitor
journalctl -u scaev-monitor -f
```
---
## Windows (Task Scheduler)
```powershell
cd C:\vibe\scaev
.\setup_windows_task.ps1
```
Manage:
```powershell
Start-ScheduledTask "ScaevAuctionMonitor"
```
---
# Cron Alternative (Linux)
```bash
crontab -e
@reboot cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1
0 * * * * pgrep -f monitor.py || (cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)
```
---
# Status Checks
```bash
# Linux
ps aux | grep monitor.py
# Windows
tasklist | findstr python
```
---
# Troubleshooting
- Wrong interpreter → Set Python 3.10+
- Multiple monitors running → kill extra processes
- PostgreSQL connectivity → verify `DATABASE_URL`, network/firewall, and credentials (quick check below)
- Service fails → check `journalctl -u scaev-monitor`
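For the `DATABASE_URL` case, a quick one-off connectivity probe (illustrative; uses the `psycopg` driver listed above):

```python
import os

import psycopg

url = os.environ.get(
    "DATABASE_URL",
    "postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb",
)
try:
    with psycopg.connect(url, connect_timeout=5) as conn:
        print("Connected:", conn.execute("SELECT version()").fetchone()[0])
except psycopg.OperationalError as exc:
    print("Connection failed:", exc)
```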
---
# Java Extractor (Short Version)
Prereqs: **Java 21**, **Maven**
Install:
```bash
mvn clean install
mvn exec:java -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```
Run:
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```
Enable native access (IntelliJ → VM Options):
```
--enable-native-access=ALL-UNNAMED
```
---
This file keeps everything compact, Python-focused, and ready for onboarding.