- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py` (its shape is sketched below).
- Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
- Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
- Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
- Result: `pytest test/test_graphql_403.py -q` passes locally.
- Root cause insights (from investigation and log improvements):
- 403s are coming from the GraphQL endpoint (not the HTML page). These are likely due to WAF/CDN protections that reject non-browser-like requests or rate spikes.
- To mitigate, I added realistic headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to absorb transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the response body for troubleshooting.
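For reference, the test has roughly the following shape. This is a simplified sketch: the real file also loads `src/config.py` via `importlib`, and it assumes `fetch_lot_bidding_data` is an async function that takes only the lot id and posts through `aiohttp.ClientSession`.

```python
import asyncio
import importlib.util


class _Fake403Response:
    """Async context manager standing in for an aiohttp response with HTTP 403."""

    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def _load(name, path):
    # Load a module straight from its file, independent of sys.path.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def test_fetch_returns_none_on_403(monkeypatch):
    logs = []
    monkeypatch.setattr(
        "builtins.print",
        lambda *args, **kwargs: logs.append(" ".join(map(str, args))),
    )
    # Every POST from any ClientSession now yields the fake 403 response.
    monkeypatch.setattr(
        "aiohttp.ClientSession.post",
        lambda self, *args, **kwargs: _Fake403Response(),
    )

    graphql_client = _load("graphql_client", "src/graphql_client.py")

    result = asyncio.run(graphql_client.fetch_lot_bidding_data("A1-40179-35"))
    assert result is None
    assert any("GraphQL API error: 403" in line for line in logs)
```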
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
- Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
- After completion, print: `Downloaded: K/N new images`.
- Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot.
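A minimal sketch of the in-place progress pattern (the helper names here are illustrative, not the actual `src/scraper.py` symbols):

```python
import sys
from typing import List


def report_progress(done: int, total: int) -> None:
    # "\r" returns the cursor to the start of the line so the counter
    # overwrites itself in place instead of scrolling.
    sys.stdout.write(f"\rDownloading images: {done}/{total}")
    sys.stdout.flush()


def report_summary(new_indexes: List[int], total: int) -> None:
    print(f"\nDownloaded: {len(new_indexes)}/{total} new images")
    if new_indexes:
        shown = new_indexes[:20]
        extra = len(new_indexes) - len(shown)
        suffix = f" (+{extra} more)" if extra else ""
        print("Indexes: " + ", ".join(map(str, shown)) + suffix)
```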
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
- Added browser-like headers and contextual Referer.
- Added small retry with backoff for 403/429.
- Improved error logs to include status, lot id, and a short body snippet.
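In rough terms the retry/header logic looks like the sketch below. The URL, payload shape, and retry parameters are placeholders, not the exact values in `src/graphql_client.py`:

```python
import asyncio
from typing import Optional

import aiohttp

HEADERS = {
    # Browser-like headers so WAF/CDN protections treat the request
    # like one coming from a real browser.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Origin": "https://www.troostwijkauctions.com",
    "Referer": "https://www.troostwijkauctions.com/",  # made contextual per lot
}


async def post_with_retry(session: aiohttp.ClientSession, url: str,
                          payload: dict, lot_id: str,
                          attempts: int = 3) -> Optional[dict]:
    delay = 1.0
    for attempt in range(attempts):
        async with session.post(url, json=payload, headers=HEADERS) as resp:
            if resp.status == 200:
                return await resp.json()
            status = resp.status
            body = (await resp.text())[:200]  # safe, truncated snippet
        if status in (403, 429) and attempt < attempts - 1:
            await asyncio.sleep(delay)  # brief backoff on transient triggers
            delay *= 2
            continue
        print(f"GraphQL API error: {status} (lot={lot_id}) — {body}")
        return None
    return None
```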
### How your example logs look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
# Deployment (Scaev)

## Prerequisites

- Python 3.8+ installed
- Access to a server (Linux/Windows)
- Playwright and dependencies installed
## Production Setup

### 1. Install on Server

```bash
# Clone repository
git clone git@git.appmodel.nl:Tour/scaev.git
cd scaev

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
playwright install chromium
playwright install-deps  # Install system dependencies
```
### 2. Configuration

Create a configuration file or set environment variables (see the sketch below):

```python
# main.py configuration
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/var/scaev/output/cache.db"
OUTPUT_DIR = "/var/scaev/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```
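If you prefer environment variables over editing `main.py`, the same settings can be read from the environment. A minimal sketch; the `SCAEV_*` variable names are a suggested convention, not something Scaev currently defines:

```python
import os

# Environment overrides, with the file-based values as defaults.
BASE_URL = os.environ.get("SCAEV_BASE_URL", "https://www.troostwijkauctions.com")
CACHE_DB = os.environ.get("SCAEV_CACHE_DB", "/var/scaev/output/cache.db")
OUTPUT_DIR = os.environ.get("SCAEV_OUTPUT_DIR", "/var/scaev/output")
RATE_LIMIT_SECONDS = float(os.environ.get("SCAEV_RATE_LIMIT_SECONDS", "0.5"))
MAX_PAGES = int(os.environ.get("SCAEV_MAX_PAGES", "50"))
```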
### 3. Create Output Directories

```bash
sudo mkdir -p /var/scaev/output
sudo chown $USER:$USER /var/scaev
```

### 4. Run as Cron Job

Add to crontab (`crontab -e`):

```bash
# Run scraper daily at 2 AM
0 2 * * * cd /path/to/scaev && /path/to/.venv/bin/python main.py >> /var/log/scaev.log 2>&1
```
## Docker Deployment (Optional)

Create `Dockerfile`:

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps

COPY main.py .

CMD ["python", "main.py"]
```

Build and run:

```bash
docker build -t scaev .
docker run -v /path/to/output:/output scaev
```
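Note that the container-side path (`/output` above) should match `OUTPUT_DIR` in the configuration; otherwise results are written to the container's ephemeral filesystem and lost when it stops.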
## Monitoring

### Check Logs

```bash
tail -f /var/log/scaev.log
```

### Monitor Output

```bash
ls -lh /var/scaev/output/
```

## Troubleshooting

### Playwright Browser Issues

```bash
# Reinstall browsers
playwright install --force chromium
```

### Permission Issues

```bash
# Fix permissions
sudo chown -R $USER:$USER /var/scaev
```

### Memory Issues

- Reduce `MAX_PAGES` in configuration
- Run on a machine with more RAM (Playwright needs ~1 GB)