USe ASM 9.8 with Java 25

This commit is contained in:
Tour
2025-12-04 03:31:27 +01:00
parent d2000a46bc
commit cad27f1842
10 changed files with 578 additions and 28 deletions

209
RATE_LIMITING.md Normal file
View File

@@ -0,0 +1,209 @@
# HTTP Rate Limiting
## Overview
The Troostwijk Scraper implements **per-host HTTP rate limiting** to prevent overloading external services (especially Troostwijk APIs) and avoid getting blocked.
## Features
-**Per-host rate limiting** - Different limits for different hosts
-**Token bucket algorithm** - Allows burst traffic while maintaining steady rate
-**Automatic host detection** - Extracts host from URL automatically
-**Request statistics** - Tracks success/failure/rate-limited requests
-**Thread-safe** - Uses semaphores for concurrent request handling
-**Configurable** - Via `application.properties`
## Configuration
Edit `src/main/resources/application.properties`:
```properties
# Default rate limit for all hosts (requests per second)
auction.http.rate-limit.default-max-rps=2
# Troostwijk-specific rate limit (requests per second)
auction.http.rate-limit.troostwijk-max-rps=1
# HTTP request timeout (seconds)
auction.http.timeout-seconds=30
```
### Recommended Settings
| Service | Max RPS | Reason |
|---------|---------|--------|
| `troostwijkauctions.com` | **1 req/s** | Prevent blocking by Troostwijk |
| Other image hosts | **2 req/s** | Balance speed and politeness |
## Usage
The `RateLimitedHttpClient` is automatically injected into services that make HTTP requests:
```java
@Inject
RateLimitedHttpClient httpClient;
// GET request for text
HttpResponse<String> response = httpClient.sendGet(url);
// GET request for binary data (images)
HttpResponse<byte[]> response = httpClient.sendGetBytes(imageUrl);
```
### Integrated Services
1. **TroostwijkMonitor** - API calls for bid monitoring
2. **ImageProcessingService** - Image downloads
3. **QuarkusWorkflowScheduler** - Scheduled workflows
## Monitoring
### REST API Endpoints
#### Get All Rate Limit Statistics
```bash
GET http://localhost:8081/api/monitor/rate-limit/stats
```
Response:
```json
{
"hosts": 2,
"statistics": {
"api.troostwijkauctions.com": {
"totalRequests": 150,
"successfulRequests": 148,
"failedRequests": 1,
"rateLimitedRequests": 0,
"averageDurationMs": 245
},
"images.troostwijkauctions.com": {
"totalRequests": 320,
"successfulRequests": 315,
"failedRequests": 5,
"rateLimitedRequests": 2,
"averageDurationMs": 892
}
}
}
```
#### Get Statistics for Specific Host
```bash
GET http://localhost:8081/api/monitor/rate-limit/stats/api.troostwijkauctions.com
```
Response:
```json
{
"host": "api.troostwijkauctions.com",
"totalRequests": 150,
"successfulRequests": 148,
"failedRequests": 1,
"rateLimitedRequests": 0,
"averageDurationMs": 245
}
```
## How It Works
### Token Bucket Algorithm
1. **Bucket initialization** - Starts with `maxRequestsPerSecond` tokens
2. **Request consumption** - Each request consumes 1 token
3. **Token refill** - Bucket refills every second
4. **Blocking** - If no tokens available, request waits
### Per-Host Rate Limiting
The client automatically:
1. Extracts hostname from URL (e.g., `api.troostwijkauctions.com`)
2. Creates/retrieves rate limiter for that host
3. Applies configured limit (Troostwijk-specific or default)
4. Tracks statistics per host
### Request Flow
```
Request → Extract Host → Get Rate Limiter → Acquire Token → Send Request → Record Stats
troostwijkauctions.com?
Yes: 1 req/s | No: 2 req/s
```
## Warning Signs
Monitor for these indicators of rate limiting issues:
| Metric | Warning Threshold | Action |
|--------|------------------|--------|
| `rateLimitedRequests` | > 0 | Server is rate limiting you - reduce `max-rps` |
| `failedRequests` | > 5% | Investigate connection issues or increase timeout |
| `averageDurationMs` | > 3000ms | Server may be slow - reduce load |
## Testing
### Manual Test via cURL
```bash
# Test Troostwijk API rate limiting
for i in {1..10}; do
echo "Request $i at $(date +%T)"
curl -s http://localhost:8081/api/monitor/status > /dev/null
sleep 0.5
done
# Check statistics
curl http://localhost:8081/api/monitor/rate-limit/stats | jq
```
### Check Logs
Rate limiting is logged at DEBUG level:
```
03:15:23 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (245ms)
03:15:24 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (251ms)
03:15:25 WARN [RateLimitedHttpClient] ⚠️ Rate limited by api.troostwijkauctions.com (HTTP 429)
```
## Troubleshooting
### Problem: Getting HTTP 429 (Too Many Requests)
**Solution:** Decrease `max-rps` for that host:
```properties
auction.http.rate-limit.troostwijk-max-rps=0.5
```
### Problem: Requests too slow
**Solution:** Increase `max-rps` (be careful not to get blocked):
```properties
auction.http.rate-limit.default-max-rps=3
```
### Problem: Requests timing out
**Solution:** Increase timeout:
```properties
auction.http.timeout-seconds=60
```
## Best Practices
1. **Start conservative** - Begin with low limits (1 req/s)
2. **Monitor statistics** - Watch `rateLimitedRequests` metric
3. **Respect robots.txt** - Check host's crawling policy
4. **Use off-peak hours** - Run heavy scraping during low-traffic times
5. **Implement exponential backoff** - If receiving 429s, wait longer between retries
## Future Enhancements
Potential improvements:
- [ ] Dynamic rate adjustment based on 429 responses
- [ ] Exponential backoff on failures
- [ ] Per-endpoint rate limiting (not just per-host)
- [ ] Request queue visualization
- [ ] Integration with external rate limit APIs (e.g., Redis)