HTTP Rate Limiting
Overview
The Troostwijk Scraper implements per-host HTTP rate limiting to prevent overloading external services (especially Troostwijk APIs) and avoid getting blocked.
Features
- ✅ Per-host rate limiting - Different limits for different hosts
- ✅ Token bucket algorithm - Allows burst traffic while maintaining steady rate
- ✅ Automatic host detection - Extracts host from URL automatically
- ✅ Request statistics - Tracks success/failure/rate-limited requests
- ✅ Thread-safe - Uses semaphores for concurrent request handling
- ✅ Configurable - Via application.properties
Configuration
Edit src/main/resources/application.properties:
# Default rate limit for all hosts (requests per second)
auction.http.rate-limit.default-max-rps=2
# Troostwijk-specific rate limit (requests per second)
auction.http.rate-limit.troostwijk-max-rps=1
# HTTP request timeout (seconds)
auction.http.timeout-seconds=30
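To make the meaning of the three keys concrete, here is a minimal sketch of reading them with plain `java.util.Properties`. The class `RateLimitConfig` is hypothetical, and the fallback values simply mirror the ones shown above. How the application actually binds these keys is not shown in this document. The limits are parsed as decimals on the assumption that fractional values such as 0.5 are accepted.

```java
import java.util.Properties;

/** Hypothetical holder for the three settings; not the application's real config class. */
final class RateLimitConfig {
    final double defaultMaxRps;
    final double troostwijkMaxRps;
    final int timeoutSeconds;

    private RateLimitConfig(double defaultMaxRps, double troostwijkMaxRps, int timeoutSeconds) {
        this.defaultMaxRps = defaultMaxRps;
        this.troostwijkMaxRps = troostwijkMaxRps;
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Reads the keys, falling back to the values from the example file (an assumption). */
    static RateLimitConfig from(Properties props) {
        double defaultMaxRps = Double.parseDouble(
                props.getProperty("auction.http.rate-limit.default-max-rps", "2"));
        double troostwijkMaxRps = Double.parseDouble(
                props.getProperty("auction.http.rate-limit.troostwijk-max-rps", "1"));
        int timeoutSeconds = Integer.parseInt(
                props.getProperty("auction.http.timeout-seconds", "30"));

        // Reject values that would disable or break rate limiting.
        if (defaultMaxRps <= 0 || troostwijkMaxRps <= 0 || timeoutSeconds <= 0) {
            throw new IllegalArgumentException("Rate limits and timeout must be positive");
        }
        return new RateLimitConfig(defaultMaxRps, troostwijkMaxRps, timeoutSeconds);
    }
}
```

Validating at load time surfaces a misconfigured limit at startup rather than as stalled requests later.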
Recommended Settings
| Service | Max RPS | Reason |
|---|---|---|
| troostwijkauctions.com | 1 req/s | Prevent blocking by Troostwijk |
| Other image hosts | 2 req/s | Balance speed and politeness |
Usage
The RateLimitedHttpClient is automatically injected into services that make HTTP requests:
@Inject
RateLimitedHttpClient httpClient;
// GET request for text
HttpResponse<String> response = httpClient.sendGet(url);
// GET request for binary data (images)
HttpResponse<byte[]> response = httpClient.sendGetBytes(imageUrl);
Integrated Services
- TroostwijkMonitor - API calls for bid monitoring
- ImageProcessingService - Image downloads
- QuarkusWorkflowScheduler - Scheduled workflows
Monitoring
REST API Endpoints
Get All Rate Limit Statistics
GET http://localhost:8081/api/monitor/rate-limit/stats
Response:
{
"hosts": 2,
"statistics": {
"api.troostwijkauctions.com": {
"totalRequests": 150,
"successfulRequests": 148,
"failedRequests": 1,
"rateLimitedRequests": 0,
"averageDurationMs": 245
},
"images.troostwijkauctions.com": {
"totalRequests": 320,
"successfulRequests": 315,
"failedRequests": 5,
"rateLimitedRequests": 2,
"averageDurationMs": 892
}
}
}
Get Statistics for Specific Host
GET http://localhost:8081/api/monitor/rate-limit/stats/api.troostwijkauctions.com
Response:
{
"host": "api.troostwijkauctions.com",
"totalRequests": 150,
"successfulRequests": 148,
"failedRequests": 1,
"rateLimitedRequests": 0,
"averageDurationMs": 245
}
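The same endpoints can be polled from Java. The sketch below uses the JDK's `java.net.http` client. The class `RateLimitStatsClient` and its method names are hypothetical, and the base URL is whatever address the monitor is reachable at (`http://localhost:8081` in the examples above). It returns the raw JSON body and leaves parsing to the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper for reading the rate-limit statistics endpoints. */
final class RateLimitStatsClient {
    private final String baseUrl;
    private final HttpClient client = HttpClient.newHttpClient();

    RateLimitStatsClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Builds a GET for all hosts (host == null) or for a single host. */
    HttpRequest statsRequest(String host) {
        String path = "/api/monitor/rate-limit/stats" + (host == null ? "" : "/" + host);
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .timeout(Duration.ofSeconds(10))   // arbitrary timeout chosen for this sketch
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    /** Sends the request and returns the JSON body, failing on any non-200 status. */
    String fetchStats(String host) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(statsRequest(host), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

For example, `new RateLimitStatsClient("http://localhost:8081").fetchStats("api.troostwijkauctions.com")` would request the per-host statistics shown above.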
How It Works
Token Bucket Algorithm
- Bucket initialization - Starts with maxRequestsPerSecond tokens
- Request consumption - Each request consumes 1 token
- Token refill - Bucket refills every second
- Blocking - If no tokens available, request waits
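The four steps above can be sketched as a small class. This is an illustration under stated assumptions, not the scraper's actual implementation: the name `TokenBucket` and the injectable clock are hypothetical, the sketch uses `synchronized` rather than semaphores for brevity, and it refills the whole bucket once per second as the list describes.

```java
import java.util.function.LongSupplier;

/** Illustrative token bucket: starts full and is refilled to capacity once per second. */
final class TokenBucket {
    private static final long REFILL_INTERVAL_NANOS = 1_000_000_000L;

    private final int maxRequestsPerSecond;
    private final LongSupplier nanoClock;   // injectable so the refill logic can be tested
    private int tokens;
    private long lastRefillNanos;

    TokenBucket(int maxRequestsPerSecond, LongSupplier nanoClock) {
        if (maxRequestsPerSecond <= 0) {
            throw new IllegalArgumentException("maxRequestsPerSecond must be positive");
        }
        this.maxRequestsPerSecond = maxRequestsPerSecond;
        this.nanoClock = nanoClock;
        this.tokens = maxRequestsPerSecond;          // bucket initialization
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Consumes one token if available; returns false when the bucket is empty. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        if (now - lastRefillNanos >= REFILL_INTERVAL_NANOS) {
            tokens = maxRequestsPerSecond;           // token refill
            lastRefillNanos = now;
        }
        if (tokens == 0) {
            return false;
        }
        tokens--;                                    // request consumption
        return true;
    }

    /** Blocks the caller until a token becomes available. */
    void acquire() throws InterruptedException {
        while (!tryAcquire()) {
            Thread.sleep(10);                        // simple polling wait
        }
    }
}
```

In production code the clock would typically be `System::nanoTime`. Because this sketch counts whole tokens per second, a fractional limit such as 0.5 req/s would need a different representation, for example one token with a two-second refill interval.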
Per-Host Rate Limiting
The client automatically:
- Extracts hostname from URL (e.g., api.troostwijkauctions.com)
- Creates/retrieves rate limiter for that host
- Applies configured limit (Troostwijk-specific or default)
- Tracks statistics per host
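A minimal sketch of the host extraction and limit selection, assuming that any host equal to or ending in `troostwijkauctions.com` gets the Troostwijk-specific limit. The class `HostLimits` and this matching rule are assumptions for illustration. For brevity the map caches the resolved limit per host, where a real client would cache a rate limiter object.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative per-host limit lookup; the real client's matching rule may differ. */
final class HostLimits {
    private static final String TROOSTWIJK_DOMAIN = "troostwijkauctions.com";

    private final double defaultMaxRps;
    private final double troostwijkMaxRps;
    // One entry per host, created on first use and shared between threads.
    private final Map<String, Double> limitByHost = new ConcurrentHashMap<>();

    HostLimits(double defaultMaxRps, double troostwijkMaxRps) {
        this.defaultMaxRps = defaultMaxRps;
        this.troostwijkMaxRps = troostwijkMaxRps;
    }

    /** Extracts the lower-cased hostname from a URL. */
    static String extractHost(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    /** Returns the limit for the URL's host, resolving and caching it on first use. */
    double limitFor(String url) {
        return limitByHost.computeIfAbsent(extractHost(url), host ->
                host.equals(TROOSTWIJK_DOMAIN) || host.endsWith("." + TROOSTWIJK_DOMAIN)
                        ? troostwijkMaxRps
                        : defaultMaxRps);
    }
}
```

Matching on the domain suffix with a leading dot avoids treating an unrelated host such as `nottroostwijkauctions.com` as a Troostwijk host.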
Request Flow
Request → Extract Host → Get Rate Limiter → Acquire Token → Send Request → Record Stats
↓
troostwijkauctions.com?
↓
Yes: 1 req/s | No: 2 req/s
Warning Signs
Monitor for these indicators of rate limiting issues:
| Metric | Warning Threshold | Action |
|---|---|---|
| rateLimitedRequests | > 0 | Server is rate limiting you - reduce max-rps |
| failedRequests | > 5% | Investigate connection issues or increase timeout |
| averageDurationMs | > 3000ms | Server may be slow - reduce load |
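The thresholds in the table can be checked automatically against the statistics returned by the monitoring endpoint. The sketch below assumes the 5% threshold means `failedRequests` as a share of `totalRequests`. The record `HostStats` and the class `RateLimitWarnings` are hypothetical names that mirror the JSON fields.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mirror of the per-host statistics JSON. */
record HostStats(long totalRequests, long successfulRequests, long failedRequests,
                 long rateLimitedRequests, long averageDurationMs) {}

/** Applies the warning thresholds from the table to one host's statistics. */
final class RateLimitWarnings {
    static List<String> check(HostStats stats) {
        List<String> warnings = new ArrayList<>();
        if (stats.rateLimitedRequests() > 0) {
            warnings.add("Server is rate limiting you: reduce max-rps");
        }
        // failed / total > 5%, written with integers to avoid floating-point division
        if (stats.totalRequests() > 0
                && stats.failedRequests() * 100 > stats.totalRequests() * 5) {
            warnings.add("High failure rate: investigate connection issues or increase timeout");
        }
        if (stats.averageDurationMs() > 3000) {
            warnings.add("Slow responses: reduce load");
        }
        return warnings;
    }
}
```

Applied to the sample response for `images.troostwijkauctions.com` (320 total, 5 failed, 2 rate-limited, 892 ms average), only the rate-limiting warning fires, since 5 of 320 requests is below the 5% failure threshold.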
Testing
Manual Test via cURL
# Test Troostwijk API rate limiting
for i in {1..10}; do
echo "Request $i at $(date +%T)"
curl -s http://localhost:8081/api/monitor/status > /dev/null
sleep 0.5
done
# Check statistics
curl http://localhost:8081/api/monitor/rate-limit/stats | jq
Check Logs
Requests are logged at DEBUG level, and HTTP 429 responses at WARN level:
03:15:23 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (245ms)
03:15:24 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (251ms)
03:15:25 WARN [RateLimitedHttpClient] ⚠️ Rate limited by api.troostwijkauctions.com (HTTP 429)
Troubleshooting
Problem: Getting HTTP 429 (Too Many Requests)
Solution: Decrease max-rps for that host:
auction.http.rate-limit.troostwijk-max-rps=0.5
Problem: Requests too slow
Solution: Increase max-rps (be careful not to get blocked):
auction.http.rate-limit.default-max-rps=3
Problem: Requests timing out
Solution: Increase timeout:
auction.http.timeout-seconds=60
Best Practices
- Start conservative - Begin with low limits (1 req/s)
- Monitor statistics - Watch the rateLimitedRequests metric
- Respect robots.txt - Check host's crawling policy
- Use off-peak hours - Run heavy scraping during low-traffic times
- Implement exponential backoff - If receiving 429s, wait longer between retries
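As an illustration of the exponential backoff practice, the helper below computes the wait before each retry by doubling a base delay per attempt and capping it at a maximum. The class `Backoff` and its parameters are hypothetical. This is a sketch of the general technique, not code from the scraper.

```java
import java.time.Duration;

/** Illustrative exponential backoff: base * 2^attempt, capped at max. */
final class Backoff {
    /**
     * @param attempt zero-based retry number (0 = first retry)
     * @param base    delay before the first retry
     * @param max     upper bound on any single delay
     */
    static Duration delayFor(int attempt, Duration base, Duration max) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must not be negative");
        }
        // Cap the exponent so the multiplier stays within a safe range.
        long multiplier = 1L << Math.min(attempt, 20);
        Duration delay = base.multipliedBy(multiplier);
        return delay.compareTo(max) > 0 ? max : delay;
    }
}
```

With a base of 1 second and a cap of 30 seconds, successive retries wait 1, 2, 4, 8 and 16 seconds, then 30 seconds from the sixth retry onward. Adding random jitter to each delay is a common refinement when several workers retry at once.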
Future Enhancements
Potential improvements:
- Dynamic rate adjustment based on 429 responses
- Exponential backoff on failures
- Per-endpoint rate limiting (not just per-host)
- Request queue visualization
- Integration with external rate limit APIs (e.g., Redis)