- Hardened the GraphQL client to reduce 403 occurrences and provide clearer diagnostics when they appear.
- Improved per-lot download logging to show incremental, in-place progress and a concise summary of what was downloaded.
### Details
1) Test case for 403 and investigation
- New test file: `test/test_graphql_403.py`.
- Uses `importlib` to load `src/config.py` and `src/graphql_client.py` directly so it’s independent of sys.path quirks.
- Mocks `aiohttp.ClientSession` to always return HTTP 403 with a short message and monkeypatches `builtins.print` to capture logs.
- Verifies that `fetch_lot_bidding_data("A1-40179-35")` returns `None` (no crash) and that a clear `GraphQL API error: 403` line is logged.
- Result: `pytest test/test_graphql_403.py -q` passes locally; a simplified sketch of the test's shape is shown after this item.
- Root cause insights (from investigation and log improvements):
- 403s come from the GraphQL endpoint (not the HTML page) and are most likely caused by WAF/CDN protections that reject non-browser-like requests or react to rate spikes.
- To mitigate, I added realistic browser headers (User-Agent, Origin, Referer) and a small retry with backoff for 403/429 to absorb transient protection triggers. When a 403 persists, we now log the status and a safe, truncated snippet of the response body for troubleshooting.
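In shape, the test looks roughly like the sketch below. The `FakeResponse`/`FakeSession` classes and the plain `import graphql_client` are simplifications for illustration; the real test loads `src/graphql_client.py` via `importlib` and captures output by monkeypatching `builtins.print` rather than using `capsys`. The sketch also assumes the client does `import aiohttp`, creates the session and the POST with the usual `async with` pattern, and exposes `fetch_lot_bidding_data()` as a module-level coroutine.

```python
# Simplified sketch only, not the actual test/test_graphql_403.py.
import asyncio


class FakeResponse:
    """Stands in for aiohttp's response object: always HTTP 403 with a short body."""
    status = 403

    async def text(self):
        return "Forbidden by WAF"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class FakeSession:
    """Stands in for aiohttp.ClientSession; every post() yields a 403."""
    def post(self, *args, **kwargs):
        return FakeResponse()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


def test_fetch_returns_none_and_logs_403(monkeypatch, capsys):
    import graphql_client  # assumption: src/ is importable; the real test uses importlib

    # Assumption: the client does `import aiohttp`, so patching the module attribute
    # makes every new session a FakeSession for the duration of the test.
    monkeypatch.setattr(graphql_client.aiohttp, "ClientSession",
                        lambda *args, **kwargs: FakeSession())

    result = asyncio.run(graphql_client.fetch_lot_bidding_data("A1-40179-35"))

    assert result is None                                   # graceful None, no crash
    assert "GraphQL API error: 403" in capsys.readouterr().out
```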
2) Incremental/in-place logging for downloads
- Updated `src/scraper.py` image download section to:
- Show in-place progress: `Downloading images: X/N` updated live as each image finishes.
- After completion, print: `Downloaded: K/N new images`.
- Also list the indexes of images that were actually downloaded (first 20, then `(+M more)` if applicable), so you see exactly what was fetched for the lot; a sketch of this progress/summary pattern follows below.
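The in-place progress is just a carriage-return rewrite of a single terminal line. The sketch below shows the pattern under illustrative names (`fetch_and_store` and `download_images` are placeholders, not the scraper's real methods):

```python
# Minimal sketch of the live "X/N" progress + summary pattern.
import asyncio


async def fetch_and_store(url: str) -> bool:
    """Placeholder for the real per-image download; True means it was newly saved."""
    await asyncio.sleep(0)  # simulate I/O
    return True


async def download_images(urls: list[str]) -> None:
    total = len(urls)
    done = 0
    new_indexes: list[int] = []

    async def download_one(i: int, url: str) -> None:
        nonlocal done
        was_new = await fetch_and_store(url)
        done += 1
        # '\r' rewrites the same terminal line, giving live "Downloading images: X/N"
        print(f"\rDownloading images: {done}/{total}", end="", flush=True)
        if was_new:
            new_indexes.append(i)

    print(f"Downloading images: 0/{total}", end="", flush=True)
    await asyncio.gather(*(download_one(i, u) for i, u in enumerate(urls)))
    print()  # leave the in-place line before printing the summary

    print(f"Downloaded: {len(new_indexes)}/{total} new images")
    shown = sorted(new_indexes)[:20]
    extra = len(new_indexes) - len(shown)
    suffix = f" (+{extra} more)" if extra else ""
    print("Indexes: " + ", ".join(map(str, shown)) + suffix)


if __name__ == "__main__":
    asyncio.run(download_images([f"https://example.com/img/{i}.jpg" for i in range(6)]))
```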
3) GraphQL client improvements
- Updated `src/graphql_client.py`:
- Added browser-like headers and contextual Referer.
- Added small retry with backoff for 403/429.
- Improved error logs to include the HTTP status, lot id, and a short body snippet (see the sketch below).
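In outline, the client change looks roughly like this. The endpoint URL, header values, and the `post_graphql` name are placeholders for illustration; the real headers, retry counts, and backoff constants live in `src/graphql_client.py`:

```python
# Sketch of the header/retry/logging approach, not the actual client code.
import asyncio
import aiohttp

GRAPHQL_URL = "https://example.com/graphql"          # placeholder endpoint
BROWSER_HEADERS = {                                  # browser-like defaults
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Origin": "https://example.com",
    "Content-Type": "application/json",
}


async def post_graphql(session: aiohttp.ClientSession, payload: dict,
                       lot_id: str, referer: str, retries: int = 3):
    headers = dict(BROWSER_HEADERS, Referer=referer)   # contextual Referer per lot
    for attempt in range(retries):
        async with session.post(GRAPHQL_URL, json=payload, headers=headers) as resp:
            if resp.status in (403, 429) and attempt < retries - 1:
                await asyncio.sleep(1.5 * (attempt + 1))   # small backoff, then retry
                continue
            if resp.status != 200:
                snippet = (await resp.text())[:200]        # safe, truncated body
                print(f"GraphQL API error: {resp.status} (lot={lot_id}) — {snippet}")
                return None
            return await resp.json()
    return None
```

In practice this would be called with a shared `aiohttp.ClientSession` and the lot page URL as `referer`, so the request looks like it originated from the browser session that viewed the lot.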
### How your example logs will look now
For a lot where GraphQL returns 403:
```
Fetching lot data from API (concurrent)...
GraphQL API error: 403 (lot=A1-40179-35) — Forbidden by WAF
```
For image downloads:
```
Images: 6
Downloading images: 0/6
... 6/6
Downloaded: 6/6 new images
Indexes: 0, 1, 2, 3, 4, 5
```
(When all cached: `All 6 images already cached`)
### Notes
- Full test run surfaced a pre-existing import error in `test/test_scraper.py` (unrelated to these changes). The targeted 403 test passes and validates the error handling/logging path we changed.
- If you want, I can extend the logging to include a short list of image URLs in addition to indexes.
For reference, the attached scraper entry point (Python, 93 lines, 2.9 KiB):
```python
#!/usr/bin/env python3
"""
Scaev Auctions Scraper - Main Entry Point
Focuses on extracting auction lots with caching and rate limiting
"""

import sys
import asyncio
import json
import csv
from datetime import datetime
from pathlib import Path

import config
from cache import CacheManager
from scraper import TroostwijkScraper


def mask_db_url(url: str) -> str:
    try:
        from urllib.parse import urlparse
        p = urlparse(url)
        user = p.username or ''
        host = p.hostname or ''
        port = f":{p.port}" if p.port else ''
        return f"{p.scheme}://{user}:***@{host}{port}{p.path or ''}"
    except Exception:
        return url


def main():
    """Main execution"""
    # Check for test mode
    if len(sys.argv) > 1 and sys.argv[1] == "--test":
        # Import test function only when needed to avoid circular imports
        from test import test_extraction
        test_url = sys.argv[2] if len(sys.argv) > 2 else None
        if test_url:
            test_extraction(test_url)
        else:
            test_extraction()
        return

    print("Scaev Auctions Scraper")
    print("=" * 60)
    if config.OFFLINE:
        print("OFFLINE MODE ENABLED — only database and cache will be used (no network)")
    print(f"Rate limit: {config.RATE_LIMIT_SECONDS} seconds BETWEEN EVERY REQUEST")
    print(f"Database URL: {mask_db_url(config.DATABASE_URL)}")
    print(f"Output directory: {config.OUTPUT_DIR}")
    print(f"Max listing pages: {config.MAX_PAGES}")
    print("=" * 60)

    scraper = TroostwijkScraper()

    try:
        # Clear old cache (older than 7 days) - KEEP DATABASE CLEAN
        scraper.cache.clear_old(max_age_hours=168)

        # Run the crawler
        results = asyncio.run(scraper.crawl_auctions(max_pages=config.MAX_PAGES))

        # Export results to files
        print("\n" + "=" * 60)
        print("EXPORTING RESULTS TO FILES")
        print("=" * 60)

        files = scraper.export_to_files()

        print("\n" + "=" * 60)
        print("CRAWLING COMPLETED SUCCESSFULLY")
        print("=" * 60)
        print(f"Total pages scraped: {len(results)}")
        print(f"\nAuctions JSON: {files['auctions_json']}")
        print(f"Auctions CSV: {files['auctions_csv']}")
        print(f"Lots JSON: {files['lots_json']}")
        print(f"Lots CSV: {files['lots_csv']}")

        # Count auctions vs lots
        auctions = [r for r in results if r.get('type') == 'auction']
        lots = [r for r in results if r.get('type') == 'lot']
        print(f"\n Auctions: {len(auctions)}")
        print(f" Lots: {len(lots)}")

    except KeyboardInterrupt:
        print("\nScraping interrupted by user - partial results saved in output directory")
    except Exception as e:
        print(f"\nERROR during scraping: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    from cache import CacheManager
    from scraper import TroostwijkScraper
    main()
```