auctiora/docs/DATABASE_CLEANUP_GUIDE.md
2025-12-07 09:59:08 +01:00


# Database Cleanup Guide
## Problem: Mixed Data Formats
Your production database (`cache.db`) contains data from two different scrapers:
### Valid Data (99.92%)
- **Format**: `A1-34732-49` (lot_id) + `c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab` (auction_id UUID)
- **Count**: 16,794 lots
- **Source**: Current GraphQL-based scraper
- **Status**: ✅ Clean, with proper auction_id
### Invalid Data (0.08%)
- **Format**: `bmw-550i-4-4-v8-high-executive-...` (slug as lot_id) + `""` (empty auction_id)
- **Count**: 13 lots
- **Source**: Old legacy scraper
- **Status**: ❌ Missing auction_id; breaks analytics and auction linking (see Impact)
## Impact
These 13 invalid entries:
- Cause `NullPointerException` in analytics when grouping by country
- Cannot be properly linked to auctions
- Skew statistics slightly
- May cause issues with intelligence features that rely on auction_id
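To check whether a local copy is affected, the two populations can be counted directly. The following is a minimal Python sketch using the standard `sqlite3` module. It assumes only the `lots` table and its `auction_id` column as they appear in the SQL later in this guide; the helper name `lot_quality` is hypothetical and not part of the project's scripts.

```python
import sqlite3


def lot_quality(conn: sqlite3.Connection) -> dict:
    """Count lots with and without a usable auction_id."""
    cur = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(auction_id IS NULL OR auction_id = ''), 0)
        FROM lots
        """
    )
    try:
        total, invalid = cur.fetchone()
    finally:
        cur.close()
    # Percentage of invalid lots, guarded against an empty table.
    pct = round(100.0 * invalid / total, 2) if total else 0.0
    return {
        "total": total,
        "valid": total - invalid,
        "invalid": invalid,
        "invalid_pct": pct,
    }
```

Calling `lot_quality(sqlite3.connect("cache.db"))` against a synced copy should report a non-zero `invalid` count if the legacy entries are present.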
## Solution 1: Clean Sync (Recommended)
The updated sync script now **automatically removes old local data** before syncing:
```bash
# Windows PowerShell
.\scripts\Sync-ProductionData.ps1
# Linux/Mac
./scripts/sync-production-data.sh --db-only
```
**What it does**:
1. Backs up existing database to `cache.db.backup-YYYYMMDD-HHMMSS`
2. **Removes old local database completely**
3. Downloads fresh copy from production
4. Shows data quality report
**Output includes**:
```
Database statistics:
┌─────────────┬────────┐
│ table_name  │ count  │
├─────────────┼────────┤
│ auctions    │ 526    │
│ lots        │ 16807  │
│ images      │ 536502 │
│ cache       │ 2134   │
└─────────────┴────────┘
Data quality:
┌────────────────────────────────────┬────────┬────────────┐
│ metric                             │ count  │ percentage │
├────────────────────────────────────┼────────┼────────────┤
│ Valid lots                         │ 16794  │ 99.92%     │
│ Invalid lots (missing auction_id)  │ 13     │ 0.08%      │
│ Lots with intelligence fields      │ 0      │ 0.00%      │
└────────────────────────────────────┴────────┴────────────┘
```
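The backup-and-remove steps (1 and 2 above) can be sketched as follows. This is an illustration only, not the sync script's actual implementation: the function name `backup_and_remove` is hypothetical, the download step is omitted, and it assumes the database is a single file with no SQLite `-wal`/`-shm` sidecar files left open.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_and_remove(db_path: Path) -> Optional[Path]:
    """Copy db_path to cache.db.backup-YYYYMMDD-HHMMSS style name, then delete it.

    Returns the backup path, or None if there was no database to back up.
    """
    if not db_path.exists():
        return None
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = db_path.with_name(f"{db_path.name}.backup-{stamp}")
    # Copy before deleting, so a failed copy leaves the original untouched.
    shutil.copy2(db_path, backup)
    db_path.unlink()
    return backup
```

Copying first and unlinking second means an interrupted run can leave an extra backup behind, but never a missing database without one.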
## Solution 2: Manual Cleanup
If you want to clean your existing local database without re-downloading:
```bash
# Dry run (see what would be deleted)
./scripts/cleanup-database.sh --dry-run
# Actual cleanup
./scripts/cleanup-database.sh
```
**What it does**:
1. Creates backup before cleanup
2. Deletes lots with missing auction_id
3. Deletes orphaned images (images without matching lots)
4. Compacts database (VACUUM) to reclaim space
5. Shows before/after statistics
**Example output**:
```
Current database state:
┌───────────────────────────────────┬────────┐
│ metric                            │ count  │
├───────────────────────────────────┼────────┤
│ Total lots                        │ 16807  │
│ Valid lots (with auction_id)      │ 16794  │
│ Invalid lots (missing auction_id) │ 13     │
└───────────────────────────────────┴────────┘
Analyzing data to clean up...
→ Invalid lots to delete: 13
→ Orphaned images to delete: 0
This will permanently delete the above records.
Continue? (y/N) y
Cleaning up database...
[1/2] Deleting invalid lots...
✓ Deleted 13 invalid lots
[2/2] Deleting orphaned images...
✓ Deleted 0 orphaned images
[3/3] Compacting database...
✓ Database compacted
Final database state:
┌───────────────┬────────┐
│ metric        │ count  │
├───────────────┼────────┤
│ Total lots    │ 16794  │
│ Total images  │ 536502 │
└───────────────┴────────┘
Database size: 8.9G
```
## Solution 3: SQL Manual Cleanup
If you prefer to clean up manually with SQL:
```sql
-- Backup first!
-- cp cache.db cache.db.backup
-- Check invalid entries
SELECT COUNT(*), 'Invalid' as type
FROM lots
WHERE auction_id IS NULL OR auction_id = ''
UNION ALL
SELECT COUNT(*), 'Valid'
FROM lots
WHERE auction_id IS NOT NULL AND auction_id != '';
-- Delete invalid lots
DELETE FROM lots
WHERE auction_id IS NULL OR auction_id = '';
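-- Delete orphaned images (NOT EXISTS stays correct even if a lots.lot_id is NULL,
-- whereas NOT IN would then match no rows at all)
DELETE FROM images
WHERE NOT EXISTS (SELECT 1 FROM lots WHERE lots.lot_id = images.lot_id);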
-- Compact database
VACUUM;
```
## Prevention: Production Database Cleanup
To prevent these invalid entries from accumulating on production, you can:
1. **Clean production database** (one-time):
```bash
ssh tour@athena.lan
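# The stock alpine image does not ship sqlite3, so install it inside the container first
docker run --rm -v shared-auction-data:/data alpine sh -c "apk add --no-cache sqlite >/dev/null && sqlite3 /data/cache.db \"DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';\""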
```
2. **Update scraper** to ensure all lots have auction_id
3. **Add validation** in scraper to reject lots without auction_id
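The validation in step 3 could look roughly like the sketch below. Everything here is hypothetical: the scraper's real data model is not shown in this guide, so a lot is represented as a plain `dict`, the function name `validate_lot` is invented, and the strict UUID check is an assumption based on the example `auction_id` at the top of this guide.

```python
import re
from typing import List

# Assumption: auction IDs are UUIDs such as c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def validate_lot(lot: dict) -> List[str]:
    """Return a list of problems; an empty list means the lot may be stored."""
    problems = []
    auction_id = lot.get("auction_id")
    if not isinstance(auction_id, str) or not auction_id.strip():
        problems.append("missing auction_id")
    elif not UUID_RE.fullmatch(auction_id.strip()):
        problems.append("auction_id is not a UUID")
    return problems
```

A scraper would call this before each insert and skip (or log) any lot for which the returned list is non-empty.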
## When to Clean
### Immediately if:
- ❌ Seeing `NullPointerException` in analytics
- ❌ Dashboard insights failing
- ❌ Country distribution not working
### Periodically:
- 🔄 After syncing from production (if production has invalid data)
- 🔄 Weekly/monthly maintenance
- 🔄 Before major testing or demos
## Recommendation
**Use Solution 1 (Clean Sync)** for simplicity:
- ✅ Guarantees a fresh local copy that matches production
- ✅ No manual SQL needed
- ✅ Shows data quality report
- ✅ Safe (automatic backup)
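The 13 invalid entries come from the old scraper and make up only 0.08% of the data, so removing them has a negligible effect on the dataset while preventing the errors described above. Because a clean sync copies production as-is, these entries reappear locally until production itself has been cleaned (see Prevention).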
---
**Related Documentation**:
- [Sync Scripts README](../scripts/README.md)
- [Data Sync Setup](DATA_SYNC_SETUP.md)
- [Database Architecture](../wiki/DATABASE_ARCHITECTURE.md)