- Added a targeted test to reproduce and validate handling of GraphQL 403 errors.

- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly, so it is independent of `sys.path` quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
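
For illustration, the shape of such a test can be sketched with the standard library alone. This is not the contents of `test/test_graphql_403.py`: `FakeResponse`, `FakeSession`, the `example.invalid` URL, and the simplified `fetch_lot_bidding_data(session, lot_id)` signature are all assumptions made so the sketch runs without `aiohttp` or the project modules.

```python
import asyncio
import contextlib
import io


class FakeResponse:
    """Hypothetical stand-in for an HTTP response that always reports 403."""
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class FakeSession:
    """Hypothetical stand-in for a client session; post() yields FakeResponse."""

    def post(self, url, **kwargs):
        return FakeResponse()


async def fetch_lot_bidding_data(session, lot_id):
    # Simplified, assumed version of the client function under test.
    async with session.post("https://example.invalid/graphql", json={"lot": lot_id}) as resp:
        if resp.status != 200:
            body = await resp.text()
            print(f"  GraphQL API error: {resp.status} (lot={lot_id}) - {body[:200]}")
            return None
        return {}  # response parsing omitted in this sketch


def test_403_returns_none_and_logs():
    buf = io.StringIO()
    # Capture printed log lines instead of monkeypatching builtins.print.
    with contextlib.redirect_stdout(buf):
        result = asyncio.run(fetch_lot_bidding_data(FakeSession(), "A1-40179-35"))
    assert result is None
    assert "GraphQL API error: 403" in buf.getvalue()
```

The two assertions mirror the checks described above: no crash (a `None` return) and a clear 403 line in the captured output.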

- Root cause insights (from investigation and log improvements):
  - 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a tiny retry with backoff for 403/429 to handle transient protection triggers. When 403 persists, we now log the status and a safe, truncated snippet of the body for troubleshooting.
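
A minimal sketch of the retry-with-backoff idea, under stated assumptions: `post_with_retry`, its parameters, and the delay values are hypothetical and not taken from `src/graphql_client.py`. The request itself is injected as an async callable so the sketch stays independent of any HTTP library.

```python
import asyncio

# Statuses treated as possibly transient protection triggers.
RETRY_STATUSES = {403, 429}


async def post_with_retry(send, max_attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is an async callable returning a (status, body) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = await send()
        # Return on success, on a non-retryable error, or on the final attempt.
        if status not in RETRY_STATUSES or attempt == max_attempts:
            return status, body
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        await sleep(base_delay * 2 ** (attempt - 1))
```

Keeping `max_attempts` small matters here: if the 403 is a persistent block rather than a transient trigger, further retries only add load and delay before the error is logged.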

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
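
The logging behaviour described above could look roughly like the following. The function names are hypothetical and this is not the code in `src/scraper.py`; only the message formats are taken from this commit message.

```python
import sys


def format_index_summary(indexes, limit=20):
    """Return up to `limit` indexes as 'a, b, c', then '(+M more)' if truncated."""
    shown = ", ".join(str(i) for i in indexes[:limit])
    extra = len(indexes) - limit
    return f"{shown} (+{extra} more)" if extra > 0 else shown


def print_progress(done, total, stream=sys.stdout):
    # '\r' moves the cursor back to the line start, so the counter is
    # overwritten in place instead of printing one line per image.
    stream.write(f"\r  Downloading images: {done}/{total}")
    stream.flush()


def report_downloads(downloaded, total, stream=sys.stdout):
    """Print the final summary once all downloads have finished."""
    stream.write("\n")  # finish the in-place progress line
    if not downloaded:
        stream.write(f"  All {total} images already cached\n")
        return
    stream.write(f"  Downloaded: {len(downloaded)}/{total} new images\n")
    stream.write(f"    Indexes: {format_index_summary(downloaded)}\n")
```

Calling `print_progress(i, n)` after each finished image and `report_downloads(new_indexes, n)` at the end would produce output in the shape described above.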

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include status, lot id, and a short body snippet.
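
As a rough sketch of the header and diagnostics pieces: the helper names, the `example.invalid` site, the lot URL pattern, the User-Agent string, and the 200-character limit are all assumptions for illustration, not values from `src/graphql_client.py`.

```python
def build_headers(lot_id, site="https://example.invalid"):
    """Browser-like headers; the Referer points at the lot page being queried."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "Origin": site,
        "Referer": f"{site}/l/{lot_id}",  # hypothetical lot URL pattern
        "Content-Type": "application/json",
    }


def body_snippet(body, limit=200):
    """Collapse whitespace and truncate so a large error page cannot flood the log."""
    flat = " ".join(body.split())
    return flat if len(flat) <= limit else flat[:limit] + "..."
```

Truncating the body keeps log lines readable and avoids dumping a full WAF challenge page into the output.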

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
Tour
2025-12-09 20:53:54 +01:00
parent 5ea2342dbc
commit 62d664c580
12 changed files with 125 additions and 1870 deletions

README.md

@@ -1,75 +1,159 @@
# Setup & IDE Configuration
# Python Setup & IDE Guide
## Python Version Requirement
Short, clear, Python-focused.
This project **requires Python 3.10 or higher**.
---
The code uses Python 3.10+ features including:
- Structural pattern matching
- Union type syntax (`X | Y`)
- Improved type hints
- Modern async/await patterns
## Requirements
## IDE Configuration
- **Python 3.10+**
Uses pattern matching, modern type hints, async improvements.
### PyCharm / IntelliJ IDEA
```bash
python --version
```
If your IDE shows "Python 2.7 syntax" warnings, configure it for Python 3.10+:
---
1. **File → Project Structure → Project Settings → Project**
- Set Python SDK to 3.10 or higher
## IDE Setup (PyCharm / IntelliJ)
2. **File → Settings → Project → Python Interpreter**
- Select Python 3.10+ interpreter
- Click gear icon → Add → System Interpreter → Browse to your Python 3.10 installation
1. **Set interpreter:**
*File → Settings → Project → Python Interpreter → Select Python 3.10+*
3. **File → Settings → Editor → Inspections → Python**
- Ensure "Python version" is set to 3.10+
- Check "Code compatibility inspection" → Set minimum version to 3.10
2. **Fix syntax warnings:**
*Editor → Inspections → Python → Set language level to 3.10+*
3. **Ensure correct SDK:**
*Project Structure → Project SDK → Python 3.10+*
---
## Installation
```bash
# Check Python version
python --version # Should be 3.10+
# Activate venv
~\venvs\scaev\Scripts\Activate.ps1
# Install dependencies
# Install deps
pip install -r requirements.txt
# Install Playwright browsers
# Playwright browsers
playwright install chromium
```
## Verifying Setup
---
## Verify
```bash
# Should print version 3.10.x or higher
python -c "import sys; print(sys.version)"
# Should run without errors
python main.py --help
```
## Common Issues
Common fixes:
### "ModuleNotFoundError: No module named 'playwright'"
```bash
pip install playwright
playwright install chromium
```
### "Python 2.7 does not support..." warnings in IDE
- Your IDE is configured for Python 2.7
- Follow IDE configuration steps above
- The code WILL work with Python 3.10+ despite warnings
---
### Script exits with "requires Python 3.10 or higher"
- You're running Python 3.9 or older
- Upgrade to Python 3.10+: https://www.python.org/downloads/
# Auto-Start (Monitor)
## Version Files
## Linux (systemd) — Recommended
- `.python-version` - Used by pyenv and similar tools
- `requirements.txt` - Package dependencies
- Runtime checks in scripts ensure Python 3.10+
```bash
cd ~/scaev
chmod +x install_service.sh
./install_service.sh
```
Service features:
- Auto-start
- Auto-restart
- Logs: `~/scaev/logs/monitor.log`
```bash
sudo systemctl status scaev-monitor
journalctl -u scaev-monitor -f
```
---
## Windows (Task Scheduler)
```powershell
cd C:\vibe\scaev
.\setup_windows_task.ps1
```
Manage:
```powershell
Start-ScheduledTask "ScaevAuctionMonitor"
```
---
# Cron Alternative (Linux)
```bash
crontab -e
@reboot cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1
0 * * * * pgrep -f monitor.py || (cd ~/scaev && python3 src/monitor.py 30 >> logs/monitor.log 2>&1 &)
```
---
# Status Checks
```bash
ps aux | grep monitor.py
tasklist | findstr python
```
---
# Troubleshooting
- Wrong interpreter → Set Python 3.10+
- Multiple monitors running → kill extra processes
- SQLite locked → ensure one instance only
- Service fails → check `journalctl -u scaev-monitor`
---
# Java Extractor (Short Version)
Prereqs: **Java 21**, **Maven**
Install:
```bash
mvn clean install
mvn exec:java -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install"
```
Run:
```bash
mvn exec:java -Dexec.args="--max-visits 3"
```
Enable native access (IntelliJ → VM Options):
```
--enable-native-access=ALL-UNNAMED
```
---
## Cache
- Path: `cache/page_cache.db`
- Clear: delete the file
---
This file keeps everything compact, Python-focused, and ready for onboarding.