# HTTP Rate Limiting

## Overview

The Troostwijk Scraper implements **per-host HTTP rate limiting** to avoid overloading external services (especially the Troostwijk APIs) and to avoid getting blocked.

## Features

- ✅ **Per-host rate limiting** - Different limits for different hosts
- ✅ **Token bucket algorithm** - Allows burst traffic while maintaining a steady rate
- ✅ **Automatic host detection** - Extracts the host from the URL automatically
- ✅ **Request statistics** - Tracks successful, failed, and rate-limited requests
- ✅ **Thread-safe** - Uses semaphores for concurrent request handling
- ✅ **Configurable** - Via `application.properties`

## Configuration

Edit `src/main/resources/application.properties`:

```properties
# Default rate limit for all hosts (requests per second)
auction.http.rate-limit.default-max-rps=2

# Troostwijk-specific rate limit (requests per second)
auction.http.rate-limit.troostwijk-max-rps=1

# HTTP request timeout (seconds)
auction.http.timeout-seconds=30
```

### Recommended Settings

| Service | Max RPS | Reason |
|---------|---------|--------|
| `troostwijkauctions.com` | **1 req/s** | Prevent blocking by Troostwijk |
| All other hosts (e.g. image hosts) | **2 req/s** | Balance speed and politeness |

## Usage

The `RateLimitedHttpClient` is automatically injected into services that make HTTP requests:

```java
@Inject
RateLimitedHttpClient httpClient;

// GET request for text
HttpResponse textResponse = httpClient.sendGet(url);

// GET request for binary data (images)
HttpResponse imageResponse = httpClient.sendGetBytes(imageUrl);
```

### Integrated Services

1. **TroostwijkMonitor** - API calls for bid monitoring
2. **ImageProcessingService** - Image downloads
3. **QuarkusWorkflowScheduler** - Scheduled workflows

## Monitoring

### REST API Endpoints

#### Get All Rate Limit Statistics

```bash
curl -s http://localhost:8081/api/monitor/rate-limit/stats
```

Response:

```json
{
  "hosts": 2,
  "statistics": {
    "api.troostwijkauctions.com": {
      "totalRequests": 150,
      "successfulRequests": 148,
      "failedRequests": 1,
      "rateLimitedRequests": 0,
      "averageDurationMs": 245
    },
    "images.troostwijkauctions.com": {
      "totalRequests": 320,
      "successfulRequests": 315,
      "failedRequests": 5,
      "rateLimitedRequests": 2,
      "averageDurationMs": 892
    }
  }
}
```

#### Get Statistics for Specific Host

```bash
curl -s http://localhost:8081/api/monitor/rate-limit/stats/api.troostwijkauctions.com
```

Response:

```json
{
  "host": "api.troostwijkauctions.com",
  "totalRequests": 150,
  "successfulRequests": 148,
  "failedRequests": 1,
  "rateLimitedRequests": 0,
  "averageDurationMs": 245
}
```

## How It Works

### Token Bucket Algorithm

1. **Bucket initialization** - The bucket starts with `maxRequestsPerSecond` tokens
2. **Request consumption** - Each request consumes 1 token
3. **Token refill** - The bucket refills every second
4. **Blocking** - If no tokens are available, the request waits until the next refill

### Per-Host Rate Limiting

The client automatically:

1. Extracts the hostname from the URL (e.g., `api.troostwijkauctions.com`)
2. Creates or retrieves the rate limiter for that host
3. Applies the configured limit (Troostwijk-specific or default)
4. Tracks statistics per host

### Request Flow

```
Request → Extract Host → Get Rate Limiter → Acquire Token → Send Request → Record Stats
                                ↓
                     troostwijkauctions.com?
                                ↓
                    Yes: 1 req/s | No: 2 req/s
```

## Warning Signs

Monitor for these indicators of rate limiting issues:

| Metric | Warning Threshold | Action |
|--------|-------------------|--------|
| `rateLimitedRequests` | > 0 | The server is rate limiting you - reduce `max-rps` |
| `failedRequests` | > 5% of `totalRequests` | Investigate connection issues or increase the timeout |
| `averageDurationMs` | > 3000 ms | The server may be slow - reduce load |

## Testing

### Manual Test via cURL

```bash
# Send 10 requests, two per second, to the local monitor endpoint
for i in {1..10}; do
  echo "Request $i at $(date +%T)"
  curl -s http://localhost:8081/api/monitor/status > /dev/null
  sleep 0.5
done

# Check statistics
curl -s http://localhost:8081/api/monitor/rate-limit/stats | jq .
```

### Check Logs

Requests are logged at DEBUG level; HTTP 429 responses are logged at WARN level:

```
03:15:23 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (245ms)
03:15:24 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (251ms)
03:15:25 WARN  [RateLimitedHttpClient] ⚠️ Rate limited by api.troostwijkauctions.com (HTTP 429)
```

## Troubleshooting

### Problem: Getting HTTP 429 (Too Many Requests)

**Solution:** Decrease `max-rps` for that host:

```properties
# 0.5 = one request every 2 seconds (only valid if the property accepts fractional values)
auction.http.rate-limit.troostwijk-max-rps=0.5
```

### Problem: Requests too slow

**Solution:** Increase `max-rps` (be careful not to get blocked):

```properties
auction.http.rate-limit.default-max-rps=3
```

### Problem: Requests timing out

**Solution:** Increase the timeout:

```properties
auction.http.timeout-seconds=60
```

## Best Practices

1. **Start conservative** - Begin with low limits (1 req/s)
2. **Monitor statistics** - Watch the `rateLimitedRequests` metric
3. **Respect robots.txt** - Check the host's crawling policy
4. **Use off-peak hours** - Run heavy scraping during low-traffic times
5. **Implement exponential backoff** - If receiving 429s, wait longer between retries

## Future Enhancements

Potential improvements:

- [ ] Dynamic rate adjustment based on 429 responses
- [ ] Exponential backoff on failures
- [ ] Per-endpoint rate limiting (not just per-host)
- [ ] Request queue visualization
- [ ] Integration with external rate limit APIs (e.g., Redis)
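
Exponential backoff is listed above both as a best practice and as a not-yet-implemented enhancement. The following is a minimal sketch of what it could look like, not the project's implementation. `BackoffSketch`, `delayMillis`, and `sendWithBackoff` are hypothetical names, and the `IntSupplier` is only a stand-in for "send one request and return its HTTP status code"; it is not part of `RateLimitedHttpClient`.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper illustrating capped exponential backoff on HTTP 429. */
final class BackoffSketch {

    private BackoffSketch() {}

    /**
     * Delay before retry number {@code attempt} (0-based):
     * baseMs * 2^attempt, capped at maxMs.
     */
    static long delayMillis(int attempt, long baseMs, long maxMs) {
        if (attempt < 0 || baseMs <= 0 || maxMs < baseMs) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMs;
        // Double once per attempt, stopping as soon as the cap is reached.
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    /**
     * Calls {@code request} and retries while it returns 429,
     * sleeping longer before each retry. Returns the last status code seen.
     */
    static int sendWithBackoff(IntSupplier request, int maxRetries, long baseMs, long maxMs) {
        int status = request.getAsInt();
        for (int attempt = 0; status == 429 && attempt < maxRetries; attempt++) {
            try {
                Thread.sleep(delayMillis(attempt, baseMs, maxMs));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            status = request.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        // Print the delay schedule for a 1 s base and a 30 s cap.
        for (int attempt = 0; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + ": " + delayMillis(attempt, 1000, 30000) + " ms");
        }
    }
}
```

With a 1000 ms base and a 30000 ms cap, the delays for attempts 0 to 5 are 1000, 2000, 4000, 8000, 16000, and 30000 ms. Capping the delay keeps a long run of 429s from stalling a scheduled workflow indefinitely; if the server sends a `Retry-After` header, honoring it would usually be preferable to a computed delay.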