auctiora/wiki/RATE_LIMITING.md
2025-12-04 04:30:44 +01:00

HTTP Rate Limiting

Overview

The Troostwijk Scraper implements per-host HTTP rate limiting to prevent overloading external services (especially Troostwijk APIs) and avoid getting blocked.

Features

  • Per-host rate limiting - Different limits for different hosts
  • Token bucket algorithm - Allows burst traffic while maintaining steady rate
  • Automatic host detection - Extracts host from URL automatically
  • Request statistics - Tracks success/failure/rate-limited requests
  • Thread-safe - Uses semaphores for concurrent request handling
  • Configurable - Via application.properties

Configuration

Edit src/main/resources/application.properties:

# Default rate limit for all hosts (requests per second)
auction.http.rate-limit.default-max-rps=2

# Troostwijk-specific rate limit (requests per second)
auction.http.rate-limit.troostwijk-max-rps=1

# HTTP request timeout (seconds)
auction.http.timeout-seconds=30

| Service | Max RPS | Reason |
|---|---|---|
| troostwijkauctions.com | 1 req/s | Prevent blocking by Troostwijk |
| Other image hosts | 2 req/s | Balance speed and politeness |

Usage

The RateLimitedHttpClient is automatically injected into services that make HTTP requests:

@Inject
RateLimitedHttpClient httpClient;

// GET request for text
HttpResponse<String> textResponse = httpClient.sendGet(url);

// GET request for binary data (images)
HttpResponse<byte[]> imageResponse = httpClient.sendGetBytes(imageUrl);

Integrated Services

  1. TroostwijkMonitor - API calls for bid monitoring
  2. ImageProcessingService - Image downloads
  3. QuarkusWorkflowScheduler - Scheduled workflows

Monitoring

REST API Endpoints

Get All Rate Limit Statistics

GET http://localhost:8081/api/monitor/rate-limit/stats

Response:

{
  "hosts": 2,
  "statistics": {
    "api.troostwijkauctions.com": {
      "totalRequests": 150,
      "successfulRequests": 148,
      "failedRequests": 1,
      "rateLimitedRequests": 0,
      "averageDurationMs": 245
    },
    "images.troostwijkauctions.com": {
      "totalRequests": 320,
      "successfulRequests": 315,
      "failedRequests": 5,
      "rateLimitedRequests": 2,
      "averageDurationMs": 892
    }
  }
}

Get Statistics for Specific Host

GET http://localhost:8081/api/monitor/rate-limit/stats/api.troostwijkauctions.com

Response:

{
  "host": "api.troostwijkauctions.com",
  "totalRequests": 150,
  "successfulRequests": 148,
  "failedRequests": 1,
  "rateLimitedRequests": 0,
  "averageDurationMs": 245
}

How It Works

Token Bucket Algorithm

  1. Bucket initialization - Starts with maxRequestsPerSecond tokens
  2. Request consumption - Each request consumes 1 token
  3. Token refill - Bucket refills every second
  4. Blocking - If no tokens available, request waits
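
The four steps above can be sketched as follows. This is a minimal illustration, not the project's actual limiter: the class name `TokenBucketSketch` and its methods are hypothetical, and it uses a synchronized counter instead of the semaphores mentioned under Features to keep the example short. The current time is passed in explicitly so the refill behaviour can be checked without waiting.

```java
/** Hypothetical token bucket sketch; names are illustrative only. */
final class TokenBucketSketch {
    private static final long REFILL_INTERVAL_NANOS = 1_000_000_000L; // one second

    private final int capacity;   // maxRequestsPerSecond
    private int tokens;           // tokens currently available
    private long lastRefillNanos; // time of the last refill

    TokenBucketSketch(int maxRequestsPerSecond, long nowNanos) {
        this.capacity = maxRequestsPerSecond;
        this.tokens = maxRequestsPerSecond; // step 1: start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    /** Tries to take one token at the given time; returns false if the bucket is empty. */
    synchronized boolean tryAcquire(long nowNanos) {
        // step 3: refill to full capacity once a second has elapsed
        if (nowNanos - lastRefillNanos >= REFILL_INTERVAL_NANOS) {
            tokens = capacity;
            lastRefillNanos = nowNanos;
        }
        if (tokens > 0) {
            tokens--; // step 2: each request consumes one token
            return true;
        }
        return false;
    }

    /** Step 4: blocks (by polling) until a token becomes available. */
    void acquire() throws InterruptedException {
        while (!tryAcquire(System.nanoTime())) {
            Thread.sleep(10);
        }
    }
}
```

With a capacity of 2, two calls in the same second succeed immediately (the burst), a third waits for the next refill, and the long-run rate stays at 2 req/s.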

Per-Host Rate Limiting

The client automatically:

  1. Extracts hostname from URL (e.g., api.troostwijkauctions.com)
  2. Creates/retrieves rate limiter for that host
  3. Applies configured limit (Troostwijk-specific or default)
  4. Tracks statistics per host
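
A rough sketch of the host lookup described above, under some assumptions: the class `HostLimitsSketch` is hypothetical, it stores only the numeric limit per host (the real client would keep a rate limiter there), and it assumes that every subdomain of troostwijkauctions.com gets the Troostwijk-specific limit.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-host limit lookup; names and matching rule are assumptions. */
final class HostLimitsSketch {
    private final int defaultMaxRps;
    private final int troostwijkMaxRps;
    // One entry per host, created lazily on first use.
    private final Map<String, Integer> limitsByHost = new ConcurrentHashMap<>();

    HostLimitsSketch(int defaultMaxRps, int troostwijkMaxRps) {
        this.defaultMaxRps = defaultMaxRps;
        this.troostwijkMaxRps = troostwijkMaxRps;
    }

    /** Step 1: extract the hostname from a URL. */
    static String extractHost(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    /** Steps 2 and 3: create or retrieve the entry for the host and pick its limit. */
    int limitFor(String url) {
        return limitsByHost.computeIfAbsent(extractHost(url), host ->
                // Assumption: the bare domain and all of its subdomains count as Troostwijk.
                host.equals("troostwijkauctions.com") || host.endsWith(".troostwijkauctions.com")
                        ? troostwijkMaxRps
                        : defaultMaxRps);
    }
}
```

`computeIfAbsent` keeps the lookup thread-safe, so concurrent requests to the same host share a single entry.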

Request Flow

Request → Extract Host → Get Rate Limiter → Acquire Token → Send Request → Record Stats
                              ↓
                      troostwijkauctions.com?
                              ↓
                    Yes: 1 req/s | No: 2 req/s

Warning Signs

Monitor for these indicators of rate limiting issues:

| Metric | Warning Threshold | Action |
|---|---|---|
| rateLimitedRequests | > 0 | Server is rate limiting you - reduce max-rps |
| failedRequests | > 5% | Investigate connection issues or increase timeout |
| averageDurationMs | > 3000ms | Server may be slow - reduce load |
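
These thresholds could be checked automatically against the statistics returned by the monitoring endpoint. The sketch below is illustrative only: the class `RateLimitWarningsSketch` is hypothetical, and it assumes the 5% failure threshold is measured against totalRequests.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical threshold check over one host's statistics. */
final class RateLimitWarningsSketch {
    static List<String> check(long totalRequests, long failedRequests,
                              long rateLimitedRequests, long averageDurationMs) {
        List<String> warnings = new ArrayList<>();
        if (rateLimitedRequests > 0) {
            warnings.add("Server is rate limiting you: reduce max-rps");
        }
        // Assumption: the failure rate is failedRequests / totalRequests.
        if (totalRequests > 0 && failedRequests * 100 > totalRequests * 5) {
            warnings.add("Failure rate above 5%: investigate connection issues or increase timeout");
        }
        if (averageDurationMs > 3000) {
            warnings.add("Average duration above 3000ms: reduce load");
        }
        return warnings;
    }
}
```

Applied to the sample statistics shown earlier, images.troostwijkauctions.com (320 total, 5 failed, 2 rate limited, 892 ms) triggers only the rate-limiting warning, while api.troostwijkauctions.com triggers none.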

Testing

Manual Test via cURL

# Test Troostwijk API rate limiting
for i in {1..10}; do
  echo "Request $i at $(date +%T)"
  curl -s http://localhost:8081/api/monitor/status > /dev/null
  sleep 0.5
done

# Check statistics
curl http://localhost:8081/api/monitor/rate-limit/stats | jq

Check Logs

Requests are logged at DEBUG level; HTTP 429 responses from the server are logged at WARN level:

03:15:23 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (245ms)
03:15:24 DEBUG [RateLimitedHttpClient] HTTP 200 GET api.troostwijkauctions.com (251ms)
03:15:25 WARN  [RateLimitedHttpClient] ⚠️  Rate limited by api.troostwijkauctions.com (HTTP 429)

Troubleshooting

Problem: Getting HTTP 429 (Too Many Requests)

Solution: Decrease max-rps for that host:

auction.http.rate-limit.troostwijk-max-rps=0.5

Problem: Requests too slow

Solution: Increase max-rps (be careful not to get blocked):

auction.http.rate-limit.default-max-rps=3

Problem: Requests timing out

Solution: Increase timeout:

auction.http.timeout-seconds=60

Best Practices

  1. Start conservative - Begin with low limits (1 req/s)
  2. Monitor statistics - Watch rateLimitedRequests metric
  3. Respect robots.txt - Check host's crawling policy
  4. Use off-peak hours - Run heavy scraping during low-traffic times
  5. Implement exponential backoff - If receiving 429s, wait longer between retries
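
Since exponential backoff is listed as a future enhancement rather than a built-in feature, callers currently have to handle it themselves. A minimal sketch, assuming the request is wrapped in a `Callable` that returns the HTTP status code; the class `BackoffSketch` and its parameters are hypothetical:

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper with exponential backoff on HTTP 429. */
final class BackoffSketch {
    /** Delay before retry number `attempt` (0-based): baseMs doubled per attempt, capped at maxMs. */
    static long delayMs(int attempt, long baseMs, long maxMs) {
        long delay = baseMs;
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    /** Repeats the request while it returns 429, up to maxRetries retries. */
    static int callWithBackoff(Callable<Integer> request, int maxRetries,
                               long baseMs, long maxMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            int status = request.call();
            if (status != 429 || attempt >= maxRetries) {
                return status; // success, a different error, or retries exhausted
            }
            Thread.sleep(delayMs(attempt, baseMs, maxMs));
        }
    }
}
```

With a base delay of 1000 ms and a cap of 30000 ms, the waits grow as 1 s, 2 s, 4 s, 8 s, and so on until they reach the cap.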

Future Enhancements

Potential improvements:

  • Dynamic rate adjustment based on 429 responses
  • Exponential backoff on failures
  • Per-endpoint rate limiting (not just per-host)
  • Request queue visualization
  • Integration with an external store for shared rate limit state (e.g., Redis)