# HTTP Rate Limiting

## Overview

The Troostwijk Scraper implements **per-host HTTP rate limiting** to avoid overloading external services (especially the Troostwijk APIs) and to reduce the risk of being blocked.

## Features

- ✅ **Per-host rate limiting** - Different limits for different hosts
- ✅ **Token bucket algorithm** - Allows burst traffic while maintaining steady rate
- ✅ **Automatic host detection** - Extracts host from URL automatically
- ✅ **Request statistics** - Tracks success/failure/rate-limited requests
- ✅ **Thread-safe** - Uses semaphores for concurrent request handling
- ✅ **Configurable** - Via `application.properties`

## Configuration

Edit `src/main/resources/application.properties`:

```properties
# Default rate limit for all hosts (requests per second)
auction.http.rate-limit.default-max-rps=2

# Troostwijk-specific rate limit (requests per second)
auction.http.rate-limit.troostwijk-max-rps=1

# HTTP request timeout (seconds)
auction.http.timeout-seconds=30
```

### Recommended Settings

| Host | Max RPS | Reason |
|------|---------|--------|
| `troostwijkauctions.com` | **1 req/s** | Prevent blocking by Troostwijk |
| All other hosts (e.g., image hosts) | **2 req/s** | Balance speed and politeness |
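
To make the types and defaults of these settings concrete, here is a minimal sketch of reading them from properties text. `RateLimitSettings` is a hypothetical name used only for illustration, and the fallback values are an assumption taken from the example above; the scraper's actual binding (for instance through framework config injection) may look different.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the three settings above; not the scraper's actual class.
final class RateLimitSettings {
    final double defaultMaxRps;
    final double troostwijkMaxRps;
    final int timeoutSeconds;

    private RateLimitSettings(double defaultMaxRps, double troostwijkMaxRps, int timeoutSeconds) {
        this.defaultMaxRps = defaultMaxRps;
        this.troostwijkMaxRps = troostwijkMaxRps;
        this.timeoutSeconds = timeoutSeconds;
    }

    // Parses properties text; missing keys fall back to the example values (an assumption).
    static RateLimitSettings parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RateLimitSettings(
            Double.parseDouble(value(props, "auction.http.rate-limit.default-max-rps", "2")),
            Double.parseDouble(value(props, "auction.http.rate-limit.troostwijk-max-rps", "1")),
            Integer.parseInt(value(props, "auction.http.timeout-seconds", "30")));
    }

    // Looks up a key and trims surrounding whitespace so number parsing does not fail on it.
    private static String value(Properties props, String key, String fallback) {
        return props.getProperty(key, fallback).trim();
    }
}
```

Parsing the rates as `double` is what would allow a fractional value such as `0.5`; whether the real client accepts fractions depends on its implementation.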

## Usage

The `RateLimitedHttpClient` is automatically injected into services that make HTTP requests:

```java
@Inject
RateLimitedHttpClient httpClient;

// GET request for text
HttpResponse<String> textResponse = httpClient.sendGet(url);

// GET request for binary data (images)
HttpResponse<byte[]> imageResponse = httpClient.sendGetBytes(imageUrl);
```

### Integrated Services

1. **TroostwijkMonitor** - API calls for bid monitoring
2. **ImageProcessingService** - Image downloads
3. **QuarkusWorkflowScheduler** - Scheduled workflows

## Monitoring

### REST API Endpoints

#### Get All Rate Limit Statistics
```bash
curl http://localhost:8081/api/monitor/rate-limit/stats
```

Response:
```json
{
  "hosts": 2,
  "statistics": {
    "api.troostwijkauctions.com": {
      "totalRequests": 150,
      "successfulRequests": 148,
      "failedRequests": 1,
      "rateLimitedRequests": 0,
      "averageDurationMs": 245
    },
    "images.troostwijkauctions.com": {
      "totalRequests": 320,
      "successfulRequests": 315,
      "failedRequests": 5,
      "rateLimitedRequests": 2,
      "averageDurationMs": 892
    }
  }
}
```

#### Get Statistics for Specific Host
```bash
curl http://localhost:8081/api/monitor/rate-limit/stats/api.troostwijkauctions.com
```

Response:
```json
{
  "host": "api.troostwijkauctions.com",
  "totalRequests": 150,
  "successfulRequests": 148,
  "failedRequests": 1,
  "rateLimitedRequests": 0,
  "averageDurationMs": 245
}
```
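
The fields in these responses suggest a small set of per-host counters. The sketch below shows one way such counters could be kept safely under concurrent requests. `HostStats` and its methods are hypothetical names, and two rules are assumptions rather than confirmed behavior: a 429 is counted as both failed and rate-limited, and any other non-2xx status counts as failed.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-host counters mirroring the JSON fields above.
final class HostStats {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong successful = new AtomicLong();
    private final AtomicLong failed = new AtomicLong();
    private final AtomicLong rateLimited = new AtomicLong();
    private final AtomicLong totalDurationMs = new AtomicLong();

    // Records one completed request by status code and duration.
    void record(int statusCode, long durationMs) {
        total.incrementAndGet();
        totalDurationMs.addAndGet(durationMs);
        if (statusCode >= 200 && statusCode < 300) {
            successful.incrementAndGet();
        } else {
            failed.incrementAndGet();
            if (statusCode == 429) {
                // Assumption: a 429 is a failed request that is also rate-limited.
                rateLimited.incrementAndGet();
            }
        }
    }

    long totalRequests() { return total.get(); }
    long successfulRequests() { return successful.get(); }
    long failedRequests() { return failed.get(); }
    long rateLimitedRequests() { return rateLimited.get(); }

    // Integer average over all recorded requests; 0 when nothing was recorded.
    long averageDurationMs() {
        long n = total.get();
        return n == 0 ? 0 : totalDurationMs.get() / n;
    }
}
```

Atomic counters keep each update safe without a lock, at the cost that a reader may briefly see counters that do not add up while a request is being recorded.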

## How It Works

### Token Bucket Algorithm

1. **Bucket initialization** - The bucket starts with `maxRequestsPerSecond` tokens
2. **Request consumption** - Each request consumes 1 token
3. **Token refill** - The bucket refills every second
4. **Blocking** - If no tokens are available, the request waits
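
The four steps above can be sketched as a small class. This is an illustration under stated assumptions, not the scraper's implementation: `TokenBucket` is a hypothetical name, it uses `synchronized` rather than the semaphores mentioned under Features, it refills the whole bucket once per second, and it only supports whole-number rates. The clock is passed in so the refill logic can be exercised without real waiting.

```java
import java.util.function.LongSupplier;

// Hypothetical token bucket that is refilled to full capacity once per second.
final class TokenBucket {
    private final int capacity;
    private final LongSupplier clockMillis;
    private int tokens;
    private long windowStart;

    TokenBucket(int maxRequestsPerSecond, LongSupplier clockMillis) {
        if (maxRequestsPerSecond < 1) {
            throw new IllegalArgumentException("maxRequestsPerSecond must be >= 1");
        }
        this.capacity = maxRequestsPerSecond;
        this.clockMillis = clockMillis;
        this.tokens = maxRequestsPerSecond;          // step 1: start full
        this.windowStart = clockMillis.getAsLong();
    }

    // Non-blocking: consumes one token if available (step 2), otherwise returns false.
    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= 1000) {             // step 3: refill every second
            tokens = capacity;
            windowStart = now;
        }
        if (tokens == 0) {
            return false;
        }
        tokens--;
        return true;
    }

    // Blocking: sleeps until a token becomes available (step 4).
    void acquire() throws InterruptedException {
        while (!tryAcquire()) {
            Thread.sleep(Math.max(1, millisUntilRefill()));
        }
    }

    private synchronized long millisUntilRefill() {
        return windowStart + 1000 - clockMillis.getAsLong();
    }
}
```

With a real clock, a caller would create `new TokenBucket(1, System::currentTimeMillis)` and call `acquire()` before each request. Supporting a fractional rate such as `0.5` would need a different refill scheme, for example one token every two seconds.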

### Per-Host Rate Limiting

The client automatically:
1. Extracts hostname from URL (e.g., `api.troostwijkauctions.com`)
2. Creates/retrieves rate limiter for that host
3. Applies configured limit (Troostwijk-specific or default)
4. Tracks statistics per host
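
Steps 1 to 3 can be sketched with `URI.getHost()` and a concurrent map. `PerHostLimiters` and its nested `Limiter` are hypothetical names, `Limiter` is only a stand-in that remembers its limit, and matching the Troostwijk limit on the domain and its subdomains is an assumption about how the host check works. Statistics tracking (step 4) is left out.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry that hands out one limiter per host.
final class PerHostLimiters {
    // Stand-in for a real rate limiter; it only remembers the limit it was created with.
    static final class Limiter {
        final double maxRps;
        Limiter(double maxRps) { this.maxRps = maxRps; }
    }

    private static final String TROOSTWIJK_DOMAIN = "troostwijkauctions.com";

    private final double troostwijkMaxRps;
    private final double defaultMaxRps;
    private final Map<String, Limiter> limiters = new ConcurrentHashMap<>();

    PerHostLimiters(double troostwijkMaxRps, double defaultMaxRps) {
        this.troostwijkMaxRps = troostwijkMaxRps;
        this.defaultMaxRps = defaultMaxRps;
    }

    // Step 1: extract the lower-cased hostname from the URL.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Steps 2 and 3: create the limiter on first use, then reuse it for that host.
    Limiter limiterFor(String url) {
        return limiters.computeIfAbsent(hostOf(url),
            host -> new Limiter(isTroostwijk(host) ? troostwijkMaxRps : defaultMaxRps));
    }

    // Assumption: the Troostwijk limit applies to the domain and all its subdomains.
    private static boolean isTroostwijk(String host) {
        return host.equals(TROOSTWIJK_DOMAIN) || host.endsWith("." + TROOSTWIJK_DOMAIN);
    }
}
```

`computeIfAbsent` makes the create-or-retrieve step atomic, so two threads requesting the same new host end up sharing one limiter.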

### Request Flow

```
Request → Extract Host → Get Rate Limiter → Acquire Token → Send Request → Record Stats
                               ↓
                    troostwijkauctions.com?
                               ↓
                   Yes: 1 req/s | No: 2 req/s
```

## Warning Signs

Monitor for these indicators of rate limiting issues:

| Metric | Warning Threshold | Action |
|--------|------------------|--------|
| `rateLimitedRequests` | > 0 | Server is rate limiting you - reduce `max-rps` |
| `failedRequests` | > 5% | Investigate connection issues or increase timeout |
| `averageDurationMs` | > 3000ms | Server may be slow - reduce load |
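
As an illustration, the thresholds in the table can be turned into a simple check over one host's statistics. `RateLimitWarnings` is a hypothetical helper, and reading the 5% threshold as `failedRequests` relative to `totalRequests` is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that applies the warning thresholds from the table above.
final class RateLimitWarnings {
    static List<String> check(long totalRequests, long failedRequests,
                              long rateLimitedRequests, long averageDurationMs) {
        List<String> warnings = new ArrayList<>();
        if (rateLimitedRequests > 0) {
            warnings.add("Server is rate limiting you - reduce max-rps");
        }
        // Assumption: the 5% threshold is measured against totalRequests.
        if (totalRequests > 0 && failedRequests * 100.0 / totalRequests > 5.0) {
            warnings.add("Investigate connection issues or increase timeout");
        }
        if (averageDurationMs > 3000) {
            warnings.add("Server may be slow - reduce load");
        }
        return warnings;
    }
}
```

Applied to the sample statistics shown under Monitoring, the `api.troostwijkauctions.com` numbers would raise no warning, while the `images.troostwijkauctions.com` numbers would raise the rate-limiting warning only.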

## Testing

### Manual Test via cURL

```bash
# Test Troostwijk API rate limiting
for i in {1..10}; do
  echo "Request $i at $(date +%T)"
  curl -s http://localhost:8081/api/monitor/status > /dev/null
  sleep 0.5
done

# Check statistics
curl -s http://localhost:8081/api/monitor/rate-limit/stats | jq .
```

### Check Logs

Requests are logged at DEBUG level; rate-limit responses (HTTP 429) are logged at WARN level:

```
03:15:23 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (245ms)
03:15:24 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (251ms)
03:15:25 WARN [RateLimitedHttpClient] ⚠️ Rate limited by api.troostwijkauctions.com (HTTP 429)
```

## Troubleshooting

### Problem: Getting HTTP 429 (Too Many Requests)

**Solution:** Decrease `max-rps` for that host:
```properties
auction.http.rate-limit.troostwijk-max-rps=0.5
```

### Problem: Requests too slow

**Solution:** Increase `max-rps` (be careful not to get blocked):
```properties
auction.http.rate-limit.default-max-rps=3
```

### Problem: Requests timing out

**Solution:** Increase timeout:
```properties
auction.http.timeout-seconds=60
```

## Best Practices

1. **Start conservative** - Begin with low limits (1 req/s)
2. **Monitor statistics** - Watch `rateLimitedRequests` metric
3. **Respect robots.txt** - Check host's crawling policy
4. **Use off-peak hours** - Run heavy scraping during low-traffic times
5. **Implement exponential backoff** - If receiving 429s, wait longer between retries
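
For the last point, a caller-side delay calculation could look like the sketch below. `Backoff.delayMillis` is a hypothetical helper, not part of the scraper, and the base and cap values are left to the caller. Random jitter, which is often added in practice, is omitted to keep the example deterministic.

```java
// Hypothetical helper for computing retry delays with exponential backoff.
final class Backoff {
    // Delay before retry number `attempt` (0-based): baseMillis * 2^attempt, capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long factor = 1L << Math.min(attempt, 30);   // bound the shift to avoid overflow
        if (baseMillis > maxMillis / factor) {
            return maxMillis;                        // the doubled delay would exceed the cap
        }
        return baseMillis * factor;
    }
}
```

With a base of 1000 ms and a cap of 30000 ms, successive retries would wait 1 s, 2 s, 4 s, 8 s and 16 s, and then stay at 30 s.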

## Future Enhancements

Potential improvements:
- [ ] Dynamic rate adjustment based on 429 responses
- [ ] Exponential backoff on failures
- [ ] Per-endpoint rate limiting (not just per-host)
- [ ] Request queue visualization
- [ ] Integration with an external rate-limit store (e.g., Redis)