Files
auctiora/wiki/DATABASE_ARCHITECTURE.md
2025-12-06 05:39:59 +01:00

259 lines
7.8 KiB
Markdown

# Database Architecture
## Overview
The Auctiora auction monitoring system uses **SQLite** as its database engine, shared between the scraper process and the monitor application for simplicity and performance.
## Current State (Dec 2025)
- **Database**: `C:\mnt\okcomputer\output\cache.db`
- **Size**: 1.6 GB
- **Records**: 16,006 lots, 536,502 images
- **Concurrent Processes**: 2 (scraper + monitor)
- **Access Pattern**: Scraper writes, Monitor reads + occasional updates
## Why SQLite?
### ✅ Advantages for This Use Case
1. **Embedded Architecture**
- No separate database server to manage
- Zero network latency (local file access)
- Perfect for single-machine scraping + monitoring
2. **Excellent Read Performance**
- Monitor performs mostly SELECT queries
- Well-indexed access by `lot_id`, `url`, `auction_id`
- Sub-millisecond query times for simple lookups
3. **Simplicity**
- Single file database
- Automatic backup via file copy
- No connection pooling or authentication overhead
4. **Proven Scalability**
- Tested up to 281 TB database size
- 1.6 GB is only 0.0006% of capacity
- Handles billions of rows efficiently
5. **WAL Mode for Concurrency**
- Multiple readers don't block each other
- Readers don't block writers
- Writers don't block readers
- Perfect for scraper + monitor workload
## Configuration
### Connection String (DatabaseService.java:28)
```java
jdbc:sqlite:C:\mnt\okcomputer\output\cache.db?journal_mode=WAL&busy_timeout=10000
```
### Key PRAGMAs (DatabaseService.java:38-40)
```sql
PRAGMA journal_mode=WAL; -- Write-Ahead Logging for concurrency
PRAGMA busy_timeout=10000; -- 10s retry on lock contention
PRAGMA synchronous=NORMAL; -- Balance safety and performance
```
### What These Settings Do
| Setting | Purpose | Impact |
|---------|---------|--------|
| `journal_mode=WAL` | Write-Ahead Logging | Enables concurrent read/write access |
| `busy_timeout=10000` | Wait 10s on lock | Prevents immediate `SQLITE_BUSY` errors |
| `synchronous=NORMAL` | Balanced sync mode | Faster writes, still crash-safe |
## Schema Integration
### Scraper Schema (Read-Only for Monitor)
```sql
CREATE TABLE lots (
lot_id TEXT PRIMARY KEY,
auction_id TEXT,
url TEXT UNIQUE, -- ⚠️ Enforced by scraper
title TEXT,
current_bid TEXT,
closing_time TEXT,
manufacturer TEXT,
type TEXT,
year INTEGER,
currency TEXT DEFAULT 'EUR',
closing_notified INTEGER DEFAULT 0,
...
)
```
### Monitor Schema (Tables Created by Monitor)
```sql
CREATE TABLE images (
id INTEGER PRIMARY KEY AUTOINCREMENT,
lot_id INTEGER,
url TEXT,
local_path TEXT,
labels TEXT, -- Object detection results
processed_at INTEGER,
FOREIGN KEY (lot_id) REFERENCES lots(lot_id)
)
```
### Handling Schema Conflicts
**Problem**: Scraper has `UNIQUE` constraint on `lots.url`
**Solution** (DatabaseService.java:361-424):
```java
// Try UPDATE first
UPDATE lots SET ... WHERE lot_id = ?
// If no rows updated, INSERT OR IGNORE
INSERT OR IGNORE INTO lots (...) VALUES (...)
```
This approach:
- ✅ Updates existing lots by `lot_id`
- ✅ Skips inserts that violate UNIQUE constraints
- ✅ No crashes on re-imports or duplicate URLs
## Performance Characteristics
### Current Performance
- Simple SELECT by ID: <1ms
- Full table scan (16K lots): ~50ms
- Image INSERT: <5ms
- Concurrent operations: No blocking observed
### Scalability Projections
| Metric | Current | 1 Year | 3 Years | SQLite Limit |
|--------|---------|--------|---------|--------------|
| Lots | 16K | 365K | 1M | 1B+ rows |
| Images | 536K | 19M | 54M | 1B+ rows |
| DB Size | 1.6GB | 36GB | 100GB | 281TB |
| Queries | <1ms | <5ms | <20ms | Depends on indexes |
## When to Migrate to PostgreSQL/MySQL
### 🚨 Migration Triggers
Consider migrating if you encounter **any** of these:
1. **Concurrency Limits**
- >5 concurrent writers needed
- Frequent `SQLITE_BUSY` errors despite WAL mode
- Need for distributed access across multiple servers
2. **Performance Degradation**
- Database >50GB AND queries >1s for simple SELECTs
- Complex JOIN queries become bottleneck
- Index sizes exceed available RAM
3. **Operational Requirements**
- Need for replication (master/slave)
- Geographic distribution required
- High availability / failover needed
- Remote access from multiple locations
4. **Advanced Features**
- Full-text search on large text fields
- Complex analytical queries (window functions, CTEs)
- User management and fine-grained permissions
- Connection pooling for web applications
### Migration Path (If Needed)
1. **Choose Database**: PostgreSQL (recommended) or MySQL
2. **Schema Export**: Use SQLite `.schema` command
3. **Data Migration**: Use `sqlite3-to-postgres` or custom scripts
4. **Update Connection**: Change JDBC URL in `application.properties`
5. **Update Queries**: Fix SQL dialect differences
6. **Performance Tuning**: Create appropriate indexes
Example PostgreSQL configuration:
```properties
# application.properties
auction.database.url=jdbc:postgresql://localhost:5432/auctiora
auction.database.username=monitor
auction.database.password=${DB_PASSWORD}
```
## Current Recommendation: ✅ **Stick with SQLite**
### Rationale
1. **Sufficient Capacity**: 1.6GB is 0.0006% of SQLite's limit
2. **Excellent Performance**: Sub-millisecond queries
3. **Simple Operations**: No complex transactions or analytics
4. **Low Concurrency**: Only 2 processes (scraper + monitor)
5. **Local Architecture**: No need for network DB access
6. **Zero Maintenance**: No DB server to manage or monitor
### Monitoring Dashboard Metrics
Track these to know when to reconsider:
```sql
-- Add to praetium.html dashboard
SELECT
(SELECT COUNT(*) FROM lots) as lot_count,
(SELECT COUNT(*) FROM images) as image_count,
(SELECT page_count * page_size FROM pragma_page_count(), pragma_page_size()) as db_size_bytes,
(SELECT (page_count - freelist_count) * 100.0 / page_count FROM pragma_page_count(), pragma_freelist_count()) as db_utilization
```
**Review decision when**:
- Database >20GB
- Query times >500ms for simple lookups
- More than 3 concurrent processes needed
## Backup Strategy
### Recommended Approach
```bash
# Nightly backup via Windows Task Scheduler
sqlite3 C:\mnt\okcomputer\output\cache.db ".backup C:\backups\cache_$(date +%Y%m%d).db"
# Keep last 30 days
forfiles /P C:\backups /M cache_*.db /D -30 /C "cmd /c del @path"
```
### WAL File Management
SQLite creates additional files in WAL mode:
- `cache.db` - Main database
- `cache.db-wal` - Write-Ahead Log
- `cache.db-shm` - Shared memory
**Important**: Backup all three files together for consistency.
## Integration Points
### Scraper Process
- **Writes**: INSERT new lots, auctions, images
- **Schema Owner**: Creates tables, enforces constraints
- **Frequency**: Continuous (every 30 minutes)
### Monitor Process (Auctiora)
- **Reads**: SELECT lots, auctions for monitoring
- **Writes**: UPDATE bid amounts, notification flags; INSERT image processing results
- **Schema**: Adds `images` table for object detection
- **Frequency**: Every 15 seconds (dashboard refresh)
### Conflict Resolution
| Conflict | Strategy | Implementation |
|----------|----------|----------------|
| Duplicate lot_id | UPDATE instead of INSERT | DatabaseService.upsertLot() |
| Duplicate URL | INSERT OR IGNORE | Silent skip |
| Oversized IDs (>Long.MAX_VALUE) | Return 0L, skip import | ScraperDataAdapter.extractNumericId() |
| Invalid timestamps | Try-catch, log, continue | DatabaseService.getAllAuctions() |
| Database locked | 10s busy_timeout + WAL | Connection string |
## References
- [SQLite Documentation](https://www.sqlite.org/docs.html)
- [WAL Mode](https://www.sqlite.org/wal.html)
- [SQLite Limits](https://www.sqlite.org/limits.html)
- [When to Use SQLite](https://www.sqlite.org/whentouse.html)