Fix mock tests
209
wiki/RATE_LIMITING.md
Normal file
@@ -0,0 +1,209 @@
# HTTP Rate Limiting

## Overview

The Troostwijk Scraper implements **per-host HTTP rate limiting** to prevent overloading external services (especially Troostwijk APIs) and avoid getting blocked.

## Features

- ✅ **Per-host rate limiting** - Different limits for different hosts
- ✅ **Token bucket algorithm** - Allows burst traffic while maintaining steady rate
- ✅ **Automatic host detection** - Extracts host from URL automatically
- ✅ **Request statistics** - Tracks success/failure/rate-limited requests
- ✅ **Thread-safe** - Uses semaphores for concurrent request handling
- ✅ **Configurable** - Via `application.properties`

## Configuration

Edit `src/main/resources/application.properties`:

```properties
# Default rate limit for all hosts (requests per second)
auction.http.rate-limit.default-max-rps=2

# Troostwijk-specific rate limit (requests per second)
auction.http.rate-limit.troostwijk-max-rps=1

# HTTP request timeout (seconds)
auction.http.timeout-seconds=30
```

### Recommended Settings

| Service | Max RPS | Reason |
|---------|---------|--------|
| `troostwijkauctions.com` | **1 req/s** | Prevent blocking by Troostwijk |
| Other image hosts | **2 req/s** | Balance speed and politeness |

## Usage

The `RateLimitedHttpClient` is automatically injected into services that make HTTP requests:

```java
@Inject
RateLimitedHttpClient httpClient;

// GET request for text (rate limited by the host of `url`)
HttpResponse<String> textResponse = httpClient.sendGet(url);

// GET request for binary data (images)
HttpResponse<byte[]> imageResponse = httpClient.sendGetBytes(imageUrl);
```

### Integrated Services

1. **TroostwijkMonitor** - API calls for bid monitoring
2. **ImageProcessingService** - Image downloads
3. **QuarkusWorkflowScheduler** - Scheduled workflows

## Monitoring

### REST API Endpoints

#### Get All Rate Limit Statistics
```bash
curl http://localhost:8081/api/monitor/rate-limit/stats
```

Response:
```json
{
  "hosts": 2,
  "statistics": {
    "api.troostwijkauctions.com": {
      "totalRequests": 150,
      "successfulRequests": 148,
      "failedRequests": 1,
      "rateLimitedRequests": 0,
      "averageDurationMs": 245
    },
    "images.troostwijkauctions.com": {
      "totalRequests": 320,
      "successfulRequests": 315,
      "failedRequests": 5,
      "rateLimitedRequests": 2,
      "averageDurationMs": 892
    }
  }
}
```

#### Get Statistics for Specific Host
```bash
curl http://localhost:8081/api/monitor/rate-limit/stats/api.troostwijkauctions.com
```

Response:
```json
{
  "host": "api.troostwijkauctions.com",
  "totalRequests": 150,
  "successfulRequests": 148,
  "failedRequests": 1,
  "rateLimitedRequests": 0,
  "averageDurationMs": 245
}
```

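The per-host endpoint can also be called from Java. The following is a minimal sketch that only builds the request; the class name `StatsRequests`, the 10-second timeout, and the `Accept` header are assumptions for illustration and are not part of the project.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical helper (not part of the project): builds a request for the per-host stats endpoint. */
final class StatsRequests {

    private StatsRequests() {
    }

    /**
     * @param baseUrl base URL of the running application, e.g. "http://localhost:8081"
     * @param host    host name as it appears in the statistics, e.g. "api.troostwijkauctions.com"
     */
    static HttpRequest forHost(String baseUrl, String host) {
        URI uri = URI.create(baseUrl + "/api/monitor/rate-limit/stats/" + host);
        return HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(10))      // assumed timeout, adjust as needed
                .header("Accept", "application/json")
                .GET()
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which returns the JSON shown above as a string when the application is running.
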
## How It Works

### Token Bucket Algorithm

1. **Bucket initialization** - Starts with `maxRequestsPerSecond` tokens
2. **Request consumption** - Each request consumes 1 token
3. **Token refill** - Bucket refills every second
4. **Blocking** - If no tokens available, request waits

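The four steps above can be sketched as follows. This is an illustrative sketch, not the code used by `RateLimitedHttpClient`: the class name `SimpleTokenBucket`, the explicit `nowNanos` parameter (added so the logic can be exercised without a real clock), and the 10 ms polling interval in `acquire()` are all assumptions.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the refill-every-second bucket described above. */
final class SimpleTokenBucket {

    private static final long ONE_SECOND_NANOS = TimeUnit.SECONDS.toNanos(1);

    private final int capacity;      // maxRequestsPerSecond
    private int tokens;              // tokens currently available
    private long lastRefillNanos;    // start of the current one-second window

    SimpleTokenBucket(int maxRequestsPerSecond, long nowNanos) {
        if (maxRequestsPerSecond < 1) {
            throw new IllegalArgumentException("maxRequestsPerSecond must be >= 1");
        }
        this.capacity = maxRequestsPerSecond;
        this.tokens = maxRequestsPerSecond;   // step 1: start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    /** Steps 2 and 3: refill if a second has passed, then try to take one token without blocking. */
    synchronized boolean tryAcquire(long nowNanos) {
        long elapsed = nowNanos - lastRefillNanos;
        if (elapsed >= ONE_SECOND_NANOS) {
            tokens = capacity;
            // Advance by whole seconds so refill boundaries do not drift.
            lastRefillNanos += (elapsed / ONE_SECOND_NANOS) * ONE_SECOND_NANOS;
        }
        if (tokens > 0) {
            tokens--;
            return true;
        }
        return false;
    }

    /** Step 4: block the calling thread until a token is available. */
    void acquire() throws InterruptedException {
        while (!tryAcquire(System.nanoTime())) {
            Thread.sleep(10);   // assumed polling interval
        }
    }
}
```

A caller would create the bucket with `new SimpleTokenBucket(2, System.nanoTime())` and call `acquire()` before each request. With a capacity of 2, two requests pass immediately and a third waits until the next one-second window starts.
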
### Per-Host Rate Limiting

The client automatically:
1. Extracts hostname from URL (e.g., `api.troostwijkauctions.com`)
2. Creates/retrieves rate limiter for that host
3. Applies configured limit (Troostwijk-specific or default)
4. Tracks statistics per host

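Host extraction and limit selection might look roughly like the sketch below. The class `HostLimits` is hypothetical, it caches only the chosen limit per host rather than a full rate limiter, and the rule that a host counts as Troostwijk when it equals or ends with `troostwijkauctions.com` is an assumption; the actual matching logic in the project may differ.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: picks a requests-per-second limit per host. */
final class HostLimits {

    private static final String TROOSTWIJK_DOMAIN = "troostwijkauctions.com";

    private final int defaultMaxRps;
    private final int troostwijkMaxRps;
    // One cached entry per host, created on first use (thread-safe).
    private final Map<String, Integer> limitByHost = new ConcurrentHashMap<>();

    HostLimits(int defaultMaxRps, int troostwijkMaxRps) {
        this.defaultMaxRps = defaultMaxRps;
        this.troostwijkMaxRps = troostwijkMaxRps;
    }

    /** Extracts the lower-cased host name from a URL. */
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    /** Returns the limit for the URL's host, creating the cache entry if needed. */
    int limitFor(String url) {
        return limitByHost.computeIfAbsent(hostOf(url), host ->
                // Assumed matching rule: the domain itself or any subdomain of it.
                host.equals(TROOSTWIJK_DOMAIN) || host.endsWith("." + TROOSTWIJK_DOMAIN)
                        ? troostwijkMaxRps
                        : defaultMaxRps);
    }
}
```

With `new HostLimits(2, 1)`, a URL on `api.troostwijkauctions.com` resolves to 1 req/s and any other host resolves to 2 req/s, matching the configuration defaults shown earlier.
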
### Request Flow

```
Request → Extract Host → Get Rate Limiter → Acquire Token → Send Request → Record Stats
                                ↓
                     troostwijkauctions.com?
                                ↓
                   Yes: 1 req/s | No: 2 req/s
```

## Warning Signs

Monitor for these indicators of rate limiting issues:

| Metric | Warning Threshold | Action |
|--------|------------------|--------|
| `rateLimitedRequests` | > 0 | Server is rate limiting you; reduce `max-rps` |
| `failedRequests` | > 5% of `totalRequests` | Investigate connection issues or increase timeout |
| `averageDurationMs` | > 3000 ms | Server may be slow; reduce load |

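The thresholds in the table can be checked mechanically against the statistics returned by the monitoring endpoint. The sketch below is one possible way to do that; the class `RateLimitWarnings` and its method are hypothetical, and it assumes the 5% threshold is measured against `totalRequests`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: applies the warning thresholds to one host's statistics. */
final class RateLimitWarnings {

    private RateLimitWarnings() {
    }

    static List<String> evaluate(long totalRequests, long failedRequests,
                                 long rateLimitedRequests, long averageDurationMs) {
        List<String> warnings = new ArrayList<>();
        if (rateLimitedRequests > 0) {
            warnings.add("rateLimitedRequests > 0: reduce max-rps");
        }
        // failed / total > 5%, written with integer arithmetic to avoid rounding issues
        if (totalRequests > 0 && failedRequests * 100 > totalRequests * 5) {
            warnings.add("failedRequests > 5%: investigate connection issues or increase timeout");
        }
        if (averageDurationMs > 3000) {
            warnings.add("averageDurationMs > 3000: server may be slow, reduce load");
        }
        return warnings;
    }
}
```

For the sample statistics shown under Monitoring, `images.troostwijkauctions.com` (320 total, 5 failed, 2 rate limited, 892 ms) triggers only the `rateLimitedRequests` warning, because 5 of 320 requests is about 1.6%.
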
## Testing

### Manual Test via cURL

```bash
# Test Troostwijk API rate limiting
for i in {1..10}; do
  echo "Request $i at $(date +%T)"
  curl -s http://localhost:8081/api/monitor/status > /dev/null
  sleep 0.5
done

# Check statistics
curl http://localhost:8081/api/monitor/rate-limit/stats | jq
```

### Check Logs

Individual requests are logged at DEBUG level; HTTP 429 responses from a host are logged at WARN level:

```
03:15:23 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (245ms)
03:15:24 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (251ms)
03:15:25 WARN  [RateLimitedHttpClient] ⚠️ Rate limited by api.troostwijkauctions.com (HTTP 429)
```

## Troubleshooting

### Problem: Getting HTTP 429 (Too Many Requests)

**Solution:** Decrease `max-rps` for that host. For example, 0.5 req/s means one request every 2 seconds (this assumes the property accepts fractional values):
```properties
auction.http.rate-limit.troostwijk-max-rps=0.5
```

### Problem: Requests too slow

**Solution:** Increase `max-rps` (be careful not to get blocked):
```properties
auction.http.rate-limit.default-max-rps=3
```

### Problem: Requests timing out

**Solution:** Increase the timeout:
```properties
auction.http.timeout-seconds=60
```

## Best Practices

1. **Start conservative** - Begin with low limits (1 req/s)
2. **Monitor statistics** - Watch the `rateLimitedRequests` metric
3. **Respect robots.txt** - Check each host's crawling policy
4. **Use off-peak hours** - Run heavy scraping during low-traffic times
5. **Implement exponential backoff** - If receiving 429s, wait longer between retries

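For the exponential backoff mentioned in item 5, a delay calculation could look like the sketch below. The class `Backoff`, its method, and the base and maximum delays are hypothetical; the sketch doubles the delay on each retry and caps it at a maximum.

```java
import java.time.Duration;

/** Hypothetical sketch: delay = base * 2^attempt, capped at max. */
final class Backoff {

    private Backoff() {
    }

    /**
     * @param attempt zero-based retry attempt (0 = first retry)
     * @param base    delay before the first retry
     * @param max     upper bound for any delay
     */
    static Duration delayForAttempt(int attempt, Duration base, Duration max) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long factor = 1L << Math.min(attempt, 30);   // limit the shift to avoid overflow
        long millis;
        try {
            millis = Math.multiplyExact(base.toMillis(), factor);
        } catch (ArithmeticException overflow) {
            millis = Long.MAX_VALUE;                 // will be capped by max below
        }
        return Duration.ofMillis(Math.min(millis, max.toMillis()));
    }
}
```

With a base of 1 second and a maximum of 30 seconds, successive retries wait 1 s, 2 s, 4 s, 8 s, 16 s, and then 30 s for every later attempt. Adding random jitter to each delay is a common refinement when several workers retry at the same time.
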
## Future Enhancements

Potential improvements:
- [ ] Dynamic rate adjustment based on 429 responses
- [ ] Exponential backoff on failures
- [ ] Per-endpoint rate limiting (not just per-host)
- [ ] Request queue visualization
- [ ] Integration with external rate limit APIs (e.g., Redis)