# Database Cleanup Guide

## Problem: Mixed Data Formats

Your production database (`cache.db`) contains data from two different scrapers:
### Valid Data (99.92%)

- Format: `A1-34732-49` (lot_id) + `c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab` (auction_id UUID)
- Count: 16,794 lots
- Source: Current GraphQL-based scraper
- Status: ✅ Clean, with proper auction_id
### Invalid Data (0.08%)

- Format: `bmw-550i-4-4-v8-high-executive-...` (slug as lot_id) + `""` (empty auction_id)
- Count: 13 lots
- Source: Old legacy scraper
- Status: ❌ Missing auction_id, causes issues
### Impact

These 13 invalid entries:

- Cause `NullPointerException` in analytics when grouping by country
- Cannot be properly linked to auctions
- Skew statistics slightly
- May cause issues with intelligence features that rely on auction_id
## Solution 1: Clean Sync (Recommended)

The updated sync script now automatically removes old local data before syncing:

```powershell
# Windows PowerShell
.\scripts\Sync-ProductionData.ps1
```

```bash
# Linux/Mac
./scripts/sync-production-data.sh --db-only
```
**What it does:**

- Backs up existing database to `cache.db.backup-YYYYMMDD-HHMMSS`
- Removes old local database completely
- Downloads fresh copy from production
- Shows data quality report
**Output includes:**

```
Database statistics:
┌─────────────┬────────┐
│ table_name  │ count  │
├─────────────┼────────┤
│ auctions    │ 526    │
│ lots        │ 16807  │
│ images      │ 536502 │
│ cache       │ 2134   │
└─────────────┴────────┘

Data quality:
┌────────────────────────────────────┬────────┬────────────┐
│ metric                             │ count  │ percentage │
├────────────────────────────────────┼────────┼────────────┤
│ Valid lots                         │ 16794  │ 99.92%     │
│ Invalid lots (missing auction_id)  │ 13     │ 0.08%      │
│ Lots with intelligence fields      │ 0      │ 0.00%      │
└────────────────────────────────────┴────────┴────────────┘
```
## Solution 2: Manual Cleanup

If you want to clean your existing local database without re-downloading:

```bash
# Dry run (see what would be deleted)
./scripts/cleanup-database.sh --dry-run

# Actual cleanup
./scripts/cleanup-database.sh
```
**What it does:**

- Creates backup before cleanup
- Deletes lots with missing auction_id
- Deletes orphaned images (images without matching lots)
- Compacts database (VACUUM) to reclaim space
- Shows before/after statistics
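Those steps boil down to a backup plus three SQL statements. A minimal sketch against a tiny throwaway fixture (`demo-cleanup.db` and its rows are invented for illustration; the real script operates on `cache.db`):

```shell
set -euo pipefail
DB="demo-cleanup.db"   # stand-in for cache.db

# Fixture: one valid lot, one legacy slug lot, one orphaned image
sqlite3 "$DB" <<'SQL'
CREATE TABLE lots (lot_id TEXT PRIMARY KEY, auction_id TEXT);
CREATE TABLE images (image_id INTEGER PRIMARY KEY, lot_id TEXT);
INSERT INTO lots VALUES ('A1-34732-49', 'c1f44ec2'), ('bmw-550i-slug', '');
INSERT INTO images VALUES (1, 'A1-34732-49'), (2, 'deleted-lot');
SQL

cp "$DB" "$DB.backup"                       # backup before cleanup
sqlite3 "$DB" "DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';"
sqlite3 "$DB" "DELETE FROM images WHERE lot_id NOT IN (SELECT lot_id FROM lots);"
sqlite3 "$DB" "VACUUM;"                     # compact to reclaim space

# After: only the valid lot and its image remain
sqlite3 "$DB" "SELECT (SELECT COUNT(*) FROM lots) || ' lots, ' ||
               (SELECT COUNT(*) FROM images) || ' images';"
# → 1 lots, 1 images
```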
**Example output:**

```
Current database state:
┌───────────────────────────────────┬────────┐
│ metric                            │ count  │
├───────────────────────────────────┼────────┤
│ Total lots                        │ 16807  │
│ Valid lots (with auction_id)      │ 16794  │
│ Invalid lots (missing auction_id) │ 13     │
└───────────────────────────────────┴────────┘

Analyzing data to clean up...
  → Invalid lots to delete: 13
  → Orphaned images to delete: 0

This will permanently delete the above records.
Continue? (y/N) y

Cleaning up database...
[1/3] Deleting invalid lots...
  ✓ Deleted 13 invalid lots
[2/3] Deleting orphaned images...
  ✓ Deleted 0 orphaned images
[3/3] Compacting database...
  ✓ Database compacted

Final database state:
┌───────────────┬────────┐
│ metric        │ count  │
├───────────────┼────────┤
│ Total lots    │ 16794  │
│ Total images  │ 536502 │
└───────────────┴────────┘

Database size: 8.9G
```
## Solution 3: SQL Manual Cleanup

If you prefer to clean manually using SQL:

```sql
-- Backup first!
-- cp cache.db cache.db.backup

-- Check invalid entries
SELECT COUNT(*), 'Invalid' AS type
FROM lots
WHERE auction_id IS NULL OR auction_id = ''
UNION ALL
SELECT COUNT(*), 'Valid'
FROM lots
WHERE auction_id IS NOT NULL AND auction_id != '';

-- Delete invalid lots
DELETE FROM lots
WHERE auction_id IS NULL OR auction_id = '';

-- Delete orphaned images
DELETE FROM images
WHERE lot_id NOT IN (SELECT lot_id FROM lots);

-- Compact database
VACUUM;
```
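After running the deletes, a quick check confirms nothing invalid remains. A minimal sketch, shown against a throwaway `demo-verify.db` with empty tables (touching the real `cache.db` here would be destructive; point the command at your own file in practice):

```shell
DB="demo-verify.db"   # throwaway stand-in; use cache.db in practice
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS lots (lot_id TEXT, auction_id TEXT);
               CREATE TABLE IF NOT EXISTS images (image_id INTEGER, lot_id TEXT);"

# Both counts should be zero after a successful cleanup
sqlite3 "$DB" "SELECT
  (SELECT COUNT(*) FROM lots
   WHERE auction_id IS NULL OR auction_id = '') AS invalid_lots,
  (SELECT COUNT(*) FROM images
   WHERE lot_id NOT IN (SELECT lot_id FROM lots)) AS orphaned_images;"
# → 0|0
```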
## Prevention: Production Database Cleanup

To prevent these invalid entries from accumulating on production, you can:

1. **Clean the production database (one-time):**

   ```bash
   ssh tour@athena.lan
   # Note: the base alpine image does not ship sqlite3, so install it first
   docker run --rm -v shared-auction-data:/data alpine sh -c \
     "apk add --no-cache sqlite && sqlite3 /data/cache.db \"DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';\""
   ```

2. **Update the scraper** to ensure all lots have an auction_id
3. **Add validation** in the scraper to reject lots without an auction_id
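The same rule can also be enforced at the database layer as a backstop, independent of any one scraper. A hedged sketch, not the project's actual setup: the trigger name and the `demo-guard.db` file are invented for illustration, and the real schema may differ.

```shell
set -e
DB="demo-guard.db"   # illustrative file; the real database is cache.db

# A BEFORE INSERT trigger that aborts any lot arriving without an auction_id
sqlite3 "$DB" <<'SQL'
CREATE TABLE IF NOT EXISTS lots (lot_id TEXT PRIMARY KEY, auction_id TEXT);
CREATE TRIGGER IF NOT EXISTS lots_require_auction_id
BEFORE INSERT ON lots
WHEN NEW.auction_id IS NULL OR NEW.auction_id = ''
BEGIN
  SELECT RAISE(ABORT, 'lot is missing auction_id');
END;
SQL

sqlite3 "$DB" "INSERT INTO lots VALUES ('A1-34732-49', 'c1f44ec2');"    # accepted
sqlite3 "$DB" "INSERT INTO lots VALUES ('bmw-550i-slug', '');" || true  # rejected by trigger
sqlite3 "$DB" "SELECT COUNT(*) FROM lots;"
# → 1
```

This catches bad rows from any writer, at the cost of turning silent data drift into hard insert failures the scraper must handle.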
## When to Clean

**Immediately if:**

- ❌ Seeing `NullPointerException` in analytics
- ❌ Dashboard insights failing
- ❌ Country distribution not working

**Periodically:**

- 🔄 After syncing from production (if production has invalid data)
- 🔄 Weekly/monthly maintenance
- 🔄 Before major testing or demos
## Recommendation

Use **Solution 1 (Clean Sync)** for simplicity:

- ✅ Guarantees a clean state
- ✅ No manual SQL needed
- ✅ Shows a data quality report
- ✅ Safe (automatic backup)

The 13 invalid entries come from the old scraper and represent only 0.08% of the data, so cleaning them up has minimal impact but prevents future errors.
**Related Documentation:**