Commit 2dda1aff00

- Added a targeted test to reproduce and validate handling of GraphQL 403 errors.
- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.

### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py` (a condensed sketch follows this list).
  - Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
  - Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
  - Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
  - Result: `pytest test/test_graphql_403.py -q` passes locally.
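
A condensed sketch of what such a test can look like. It assumes `fetch_lot_bidding_data` is an async coroutine that posts through `aiohttp.ClientSession`; the fake session shape and helper names are illustrative, not the committed test:

```python
# Hypothetical sketch of test/test_graphql_403.py; the real client's module
# layout and call shape may differ.
import asyncio
import builtins
import importlib.util
import sys
from pathlib import Path

SRC = Path(__file__).resolve().parents[1] / "src"


def _load(name):
    """Load a module from src/ by file path, independent of sys.path."""
    spec = importlib.util.spec_from_file_location(name, SRC / f"{name}.py")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # so sibling imports like `import config` resolve
    spec.loader.exec_module(module)
    return module


class _Fake403Response:
    """Pretends to be an aiohttp response that always returns 403."""
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class _FakeSession:
    """Stands in for aiohttp.ClientSession; every request yields the 403 response."""
    def post(self, *args, **kwargs):
        return _Fake403Response()

    get = post  # in case the client uses GET instead of POST

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def test_fetch_lot_bidding_data_handles_403(monkeypatch):
    _load("config")
    graphql_client = _load("graphql_client")

    logged = []
    monkeypatch.setattr(builtins, "print", lambda *a, **k: logged.append(" ".join(map(str, a))))
    monkeypatch.setattr("aiohttp.ClientSession", lambda *a, **k: _FakeSession())

    result = asyncio.run(graphql_client.fetch_lot_bidding_data("A1-40179-35"))

    assert result is None  # no crash, graceful None
    assert any("GraphQL API error: 403" in line for line in logged)
```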

- Root cause insights (from investigation and log improvements):
  - The 403s come from the GraphQL endpoint (not the HTML page). They are most likely WAF/CDN protections rejecting non-browser-like requests or reacting to rate spikes.
  - To mitigate, I added realistic headers (User-Agent, Origin, Referer; sketched below) and a small retry with backoff for 403/429 to ride out transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the response body for troubleshooting.
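
A minimal sketch of the kind of browser-like headers now attached to GraphQL requests; the helper name, User-Agent string, and `BASE_URL` origin are assumptions for illustration:

```python
# Illustrative only: helper name, User-Agent string, and origin are assumptions.
BASE_URL = "https://www.troostwijkauctions.com"  # assumed site origin


def build_graphql_headers(lot_url: str) -> dict:
    """Browser-like headers so the WAF/CDN sees a plausible client."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
        ),
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Origin": BASE_URL,
        "Referer": lot_url,  # contextual: the lot page the request relates to
    }
```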

2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
  - Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
  - After completion, print: `Downloaded: K/N new images`.
  - Also list the indexes of the images that were actually downloaded (first 20, then `(+M more)` if applicable), so you can see exactly what was fetched for the lot; a sketch follows this list.
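
A sketch of the in-place progress reporting, assuming the download loop knows the total image count and which indexes it actually fetched; the function names are illustrative:

```python
# Illustrative helpers; the real scraper code may structure this differently.
import sys


def report_progress(done, total):
    """Rewrite the same console line as each image finishes (carriage return keeps it in place)."""
    sys.stdout.write(f"\r  Downloading images: {done}/{total}")
    sys.stdout.flush()


def report_summary(downloaded_indexes, total):
    """Print the final count plus the indexes that were actually downloaded."""
    sys.stdout.write("\n")
    print(f"  Downloaded: {len(downloaded_indexes)}/{total} new images")
    if downloaded_indexes:
        shown = downloaded_indexes[:20]
        extra = len(downloaded_indexes) - len(shown)
        suffix = f" (+{extra} more)" if extra else ""
        print(f"    Indexes: {', '.join(str(i) for i in shown)}{suffix}")
```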

3) GraphQL client improvements
- Updated `src/graphql_client.py`:
  - Added browser-like headers and contextual Referer.
  - Added small retry with backoff for 403/429.
  - Improved error logs to include the status, lot id, and a short body snippet (see the sketch below).
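
A minimal sketch of the retry-with-backoff and error-logging behavior described above, assuming an `aiohttp`-based client; the function name, attempt count, and delays are illustrative:

```python
# Illustrative only: names, delays, and attempt count are assumptions, not the
# committed graphql_client implementation.
import asyncio
import aiohttp

RETRYABLE = {403, 429}


async def post_graphql(session: aiohttp.ClientSession, url, payload, headers, lot_id, attempts=3):
    """POST a GraphQL query, retrying briefly on 403/429 before giving up."""
    for attempt in range(attempts):
        async with session.post(url, json=payload, headers=headers) as resp:
            if resp.status == 200:
                return await resp.json()
            body = (await resp.text())[:200]  # safe, truncated snippet for logging
            if resp.status in RETRYABLE and attempt < attempts - 1:
                await asyncio.sleep(1.5 * (attempt + 1))  # small backoff, then retry
                continue
            print(f"  GraphQL API error: {resp.status} (lot={lot_id}) — {body}")
            return None
    return None
```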

### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
  GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```

For image downloads:
```
Images: 6
  Downloading images: 0/6
 ... 6/6
  Downloaded: 6/6 new images
    Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)

### Notes
- A full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error-handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.

### scaev/src/main.py

```python
#!/usr/bin/env python3
"""
Scaev Auctions Scraper - Main Entry Point
Focuses on extracting auction lots with caching and rate limiting
"""
import sys
import asyncio
import json
import csv
from datetime import datetime
from pathlib import Path

import config
from cache import CacheManager
from scraper import TroostwijkScraper


def mask_db_url(url: str) -> str:
    try:
        from urllib.parse import urlparse
        p = urlparse(url)
        user = p.username or ''
        host = p.hostname or ''
        port = f":{p.port}" if p.port else ''
        return f"{p.scheme}://{user}:***@{host}{port}{p.path or ''}"
    except Exception:
        return url


def main():
    """Main execution"""
    # Check for test mode
    if len(sys.argv) > 1 and sys.argv[1] == "--test":
        # Import test function only when needed to avoid circular imports
        from test import test_extraction
        test_url = sys.argv[2] if len(sys.argv) > 2 else None
        if test_url:
            test_extraction(test_url)
        else:
            test_extraction()
        return

    print("Scaev Auctions Scraper")
    print("=" * 60)
    if config.OFFLINE:
        print("OFFLINE MODE ENABLED — only database and cache will be used (no network)")
    print(f"Rate limit: {config.RATE_LIMIT_SECONDS} seconds BETWEEN EVERY REQUEST")
    print(f"Database URL: {mask_db_url(config.DATABASE_URL)}")
    print(f"Output directory: {config.OUTPUT_DIR}")
    print(f"Max listing pages: {config.MAX_PAGES}")
    print("=" * 60)

    scraper = TroostwijkScraper()

    try:
        # Clear old cache (older than 7 days) - KEEP DATABASE CLEAN
        scraper.cache.clear_old(max_age_hours=168)

        # Run the crawler
        results = asyncio.run(scraper.crawl_auctions(max_pages=config.MAX_PAGES))

        # Export results to files
        print("\n" + "="*60)
        print("EXPORTING RESULTS TO FILES")
        print("="*60)
        files = scraper.export_to_files()

        print("\n" + "="*60)
        print("CRAWLING COMPLETED SUCCESSFULLY")
        print("="*60)
        print(f"Total pages scraped: {len(results)}")
        print(f"\nAuctions JSON: {files['auctions_json']}")
        print(f"Auctions CSV: {files['auctions_csv']}")
        print(f"Lots JSON: {files['lots_json']}")
        print(f"Lots CSV: {files['lots_csv']}")

        # Count auctions vs lots
        auctions = [r for r in results if r.get('type') == 'auction']
        lots = [r for r in results if r.get('type') == 'lot']
        print(f"\n Auctions: {len(auctions)}")
        print(f" Lots: {len(lots)}")

    except KeyboardInterrupt:
        print("\nScraping interrupted by user - partial results saved in output directory")
    except Exception as e:
        print(f"\nERROR during scraping: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    from cache import CacheManager
    from scraper import TroostwijkScraper
    main()
```