- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly, so it is independent of `sys.path` quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message, and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
  - The 403s come from the GraphQL endpoint, not the HTML page. They are likely caused by WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to handle transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
2) Incremental/in-place logging for downloads
- Updated the image download section of `src/scraper.py` to:
  - Show in-place progress: `Downloading images: X/N`, updated live as each image finishes.
  - After completion, print `Downloaded: K/N new images`.
  - List the indexes of the images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and a contextual Referer.
  - Added a small retry with backoff for 403/429.
  - Improved error logs to include the status, lot id, and a short body snippet.
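
The header and retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/graphql_client.py`: `build_headers`, `snippet`, `fetch_with_retry`, the `send` callable, the `https://example.com` site, and the Referer path are all hypothetical names chosen for the example, and the transport is abstracted away so the sketch runs without `aiohttp`.

```python
import asyncio
from typing import Awaitable, Callable, Optional

RETRY_STATUSES = {403, 429}  # statuses treated as possibly transient

# Hypothetical transport: takes headers, returns (status, body text).
Send = Callable[[dict[str, str]], Awaitable[tuple[int, str]]]


def build_headers(lot_id: str, site: str = "https://example.com") -> dict[str, str]:
    """Browser-like headers; `site` and the Referer path are placeholders."""
    return {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Origin": site,
        "Referer": f"{site}/l/{lot_id}",
        "Content-Type": "application/json",
    }


def snippet(body: str, limit: int = 200) -> str:
    """Collapse whitespace and truncate so the log line stays short."""
    flat = " ".join(body.split())
    return flat if len(flat) <= limit else flat[:limit] + "..."


async def fetch_with_retry(send: Send, lot_id: str, attempts: int = 3,
                           base_delay: float = 0.5) -> Optional[str]:
    """Return the body on HTTP 200, or None after logging the failure."""
    for attempt in range(attempts):
        status, body = await send(build_headers(lot_id))
        if status == 200:
            return body
        if status in RETRY_STATUSES and attempt < attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
            continue
        print(f"GraphQL API error: {status} (lot={lot_id}) - {snippet(body)}")
        return None
    return None
```

With a `send` that always answers 403, the function retries, logs a single error line, and returns `None` instead of raising, which is the path the new test exercises.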
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
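
One way to produce output like this is to rewrite a single line with a carriage return and then print the summary. The sketch below assumes a plain text stream; `report_progress`, `format_indexes`, and `summarize` are illustrative names, not functions from `src/scraper.py`, and the "nothing downloaded means everything was cached" rule is an assumption made for the example.

```python
import sys
from typing import Sequence, TextIO


def format_indexes(indexes: Sequence[int], limit: int = 20) -> str:
    """Join the first `limit` indexes and note how many were left out."""
    shown = ", ".join(str(i) for i in indexes[:limit])
    extra = len(indexes) - limit
    return f"{shown} (+{extra} more)" if extra > 0 else shown


def report_progress(done: int, total: int, out: TextIO = sys.stdout) -> None:
    """Overwrite the current line: '\\r' returns the cursor to column 0."""
    out.write(f"\rDownloading images: {done}/{total}")
    out.flush()


def summarize(downloaded: Sequence[int], total: int, out: TextIO = sys.stdout) -> None:
    """Finish the progress line, then print what was actually fetched."""
    out.write("\n")
    if not downloaded:
        # Assumption for this sketch: no new downloads means all were cached.
        out.write(f"All {total} images already cached\n")
        return
    out.write(f"Downloaded: {len(downloaded)}/{total} new images\n")
    out.write(f"Indexes: {format_indexes(downloaded)}\n")
```

Calling `report_progress(i, n)` after each finished image and `summarize(new_indexes, n)` once at the end gives the in-place counter followed by the two summary lines.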
### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling and logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
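
For reference, the two techniques the 403 test relies on (loading a module by file path with `importlib`, and capturing `print` output) can be sketched as below. This is an illustration under assumptions, not the contents of `test/test_graphql_403.py`: `load_module_from_path` and `capture_prints` are hypothetical helper names.

```python
import builtins
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(name: str, path: Path) -> ModuleType:
    """Import a file directly, without relying on sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def capture_prints(func, *args, **kwargs):
    """Run func while collecting every print() call as one joined string."""
    lines: list[str] = []
    original = builtins.print
    builtins.print = lambda *a, **k: lines.append(" ".join(str(x) for x in a))
    try:
        result = func(*args, **kwargs)
    finally:
        builtins.print = original  # always restore the real print
    return result, lines
```

Inside pytest, `monkeypatch.setattr(builtins, "print", ...)` achieves the same capture and restores the original automatically when the test ends.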
# Python Setup & IDE Guide

Short, clear, Python‑focused.

---

## Requirements

- **Python 3.10+**
  Uses pattern matching, modern type hints, async improvements.

```bash
python --version
```

---

## IDE Setup (PyCharm / IntelliJ)

1. **Set interpreter:**
   *File → Settings → Project → Python Interpreter → Select Python 3.10+*

2. **Fix syntax warnings:**
   *Editor → Inspections → Python → Set language level to 3.10+*

3. **Ensure correct SDK:**
   *Project Structure → Project SDK → Python 3.10+*

---

## Installation

```bash
# Activate venv (Windows PowerShell path)
~\venvs\scaev\Scripts\Activate.ps1

# Install deps
pip install -r requirements.txt

# Playwright browsers
playwright install chromium
```

---

## Database Configuration (PostgreSQL)

The scraper now uses PostgreSQL (no more SQLite files). Configure via `DATABASE_URL`:

- Default (baked in):
  `postgresql://auction:heel-goed-wachtwoord@192.168.1.159:5432/auctiondb`
- Override for your environment:

```bash
# Windows PowerShell
$env:DATABASE_URL = "postgresql://user:pass@host:5432/dbname"

# Linux/macOS
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
```

Packages used:
- Driver: `psycopg[binary]`

Nothing is written to local `.db` files anymore.

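
The override-or-default lookup can be sketched as follows. This is an assumption about how such a lookup might be written, not the project's actual config code: `resolve_database_url`, `redact`, and `FALLBACK_URL` are hypothetical names, and the fallback shown is the placeholder URL from the example above rather than the real default.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

# Placeholder fallback for this sketch only (not the project's baked-in default).
FALLBACK_URL = "postgresql://user:pass@host:5432/dbname"


def resolve_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer DATABASE_URL from the environment, else use the fallback."""
    env = os.environ if env is None else env
    return env.get("DATABASE_URL") or FALLBACK_URL


def redact(url: str) -> str:
    """Hide the password so the URL is safe to print in logs."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://{parts.username}:***@{host}{port}{parts.path}"
```
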
---

## Verify

```bash
python -c "import sys; print(sys.version)"
python main.py --help
```

Common fixes:

```bash
pip install playwright
playwright install chromium
```

---

# Auto‑Start (Monitor)

## Linux (systemd) - Recommended

```bash
cd ~/scaev
chmod +x install_service.sh
./install_service.sh
```

Service features:
- Auto‑start
- Auto‑restart
- Logs: `~/scaev/logs/monitor.log`

```bash
sudo systemctl status scaev-monitor
journalctl -u scaev-monitor -f
```

---

## Windows (Task Scheduler)

```powershell
cd C:\vibe\scaev
.\setup_windows_task.ps1
```

Manage:

```powershell
Start-ScheduledTask "ScaevAuctionMonitor"
```

---

# Cron Alternative (Linux)

```bash
# Open the crontab editor, then add the two entries below
crontab -e

# Start the monitor at boot
@reboot cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1

# Hourly: restart the monitor if it is no longer running
0 * * * * pgrep -f monitor.py || (cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)
```

---

# Status Checks

```bash
# Linux
ps aux | grep monitor.py

# Windows
tasklist | findstr python
```

---

# Troubleshooting

- Wrong interpreter → Set Python 3.10+
- Multiple monitors running → kill extra processes
- PostgreSQL connectivity → verify `DATABASE_URL`, network/firewall, and credentials
- Service fails → check `journalctl -u scaev-monitor`

---

# Java Extractor (Short Version)

Prereqs: **Java 21**, **Maven**

Install:

```bash
mvn clean install
mvn exec:java -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```

Run:

```bash
mvn exec:java -Dexec.args="--max-visits 3"
```

Enable native access (IntelliJ → VM Options):

```
--enable-native-access=ALL-UNNAMED
```

---
This file keeps everything compact, Python‑focused, and ready for onboarding.