auctiora/docs/DATABASE_CLEANUP_GUIDE.md
2025-12-07 09:59:08 +01:00

Database Cleanup Guide

Problem: Mixed Data Formats

Your production database (cache.db) contains data from two different scrapers:

Valid Data (99.92%)

  • Format: A1-34732-49 (lot_id) + c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab (auction_id UUID)
  • Count: 16,794 lots
  • Source: Current GraphQL-based scraper
  • Status: Clean, with proper auction_id

Invalid Data (0.08%)

  • Format: bmw-550i-4-4-v8-high-executive-... (slug as lot_id) + "" (empty auction_id)
  • Count: 13 lots
  • Source: Old legacy scraper
  • Status: Missing auction_id, causes issues

Impact

These 13 invalid entries:

  • Cause NullPointerException in analytics when grouping by country
  • Cannot be properly linked to auctions
  • Skew statistics slightly
  • May cause issues with intelligence features that rely on auction_id

Solution 1: Clean Sync

The updated sync script now automatically removes old local data before syncing:

# Windows PowerShell
.\scripts\Sync-ProductionData.ps1

# Linux/Mac
./scripts/sync-production-data.sh --db-only

What it does:

  1. Backs up existing database to cache.db.backup-YYYYMMDD-HHMMSS
  2. Removes old local database completely
  3. Downloads fresh copy from production
  4. Shows data quality report
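The backup step can be sketched in a few lines. This is an illustration only, not the sync script's actual code: the function name `backup_database` is hypothetical, and a plain file copy is assumed to be safe because nothing is writing to the local database while it runs.

```python
import shutil
import time
from pathlib import Path


def backup_database(db_path: str) -> Path:
    """Copy db_path to <name>.backup-YYYYMMDD-HHMMSS in the same directory.

    Hypothetical helper; assumes no process is writing to the database.
    """
    src = Path(db_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")  # local time, e.g. 20251207-095908
    dest = src.with_name(f"{src.name}.backup-{stamp}")
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest
```

Calling `backup_database("cache.db")` would return the path of the new copy, which can be kept until the fresh download has been checked.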

Output includes:

Database statistics:
┌─────────────┬────────┐
│ table_name  │ count  │
├─────────────┼────────┤
│ auctions    │ 526    │
│ lots        │ 16807  │
│ images      │ 536502 │
│ cache       │ 2134   │
└─────────────┴────────┘

Data quality:
┌────────────────────────────────────┬────────┬────────────┐
│ metric                             │ count  │ percentage │
├────────────────────────────────────┼────────┼────────────┤
│ Valid lots                         │ 16794  │ 99.92%     │
│ Invalid lots (missing auction_id)  │ 13     │ 0.08%      │
│ Lots with intelligence fields      │ 0      │ 0.00%      │
└────────────────────────────────────┴────────┴────────────┘
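The valid/invalid rows of this report can be reproduced with two counts. The sketch below assumes only what the SQL later in this guide uses (a `lots` table with an `auction_id` column); the function name `data_quality` is hypothetical, and the "intelligence fields" row is left out because its columns are not described here.

```python
import sqlite3


def data_quality(conn: sqlite3.Connection) -> dict:
    """Return (count, percentage) for valid and invalid lots."""
    total = conn.execute("SELECT COUNT(*) FROM lots").fetchone()[0]
    invalid = conn.execute(
        "SELECT COUNT(*) FROM lots WHERE auction_id IS NULL OR auction_id = ''"
    ).fetchone()[0]
    valid = total - invalid

    def pct(n: int) -> float:
        # Guard against an empty table to avoid division by zero.
        return round(100.0 * n / total, 2) if total else 0.0

    return {"valid": (valid, pct(valid)), "invalid": (invalid, pct(invalid))}
```

With the counts shown above, 13 of 16,807 lots rounds to 0.08% and 16,794 of 16,807 rounds to 99.92%.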

Solution 2: Manual Cleanup

If you want to clean your existing local database without re-downloading:

# Dry run (see what would be deleted)
./scripts/cleanup-database.sh --dry-run

# Actual cleanup
./scripts/cleanup-database.sh

What it does:

  1. Creates backup before cleanup
  2. Deletes lots with missing auction_id
  3. Deletes orphaned images (images without matching lots)
  4. Compacts database (VACUUM) to reclaim space
  5. Shows before/after statistics
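For readers who want to see the logic behind these steps, here is a minimal sketch of a dry-run-aware cleanup. It is not the contents of `cleanup-database.sh`: the function name `cleanup` is hypothetical, an `images.lot_id` column is assumed from the SQL later in this guide, and the backup step is omitted. It uses `NOT EXISTS` rather than `NOT IN` so that a NULL `lot_id` in `lots` cannot silently prevent any image from matching.

```python
import sqlite3

INVALID_LOTS = "auction_id IS NULL OR auction_id = ''"
# An image is orphaned if no *valid* lot references its lot_id. The same
# condition holds before and after the invalid lots are deleted, so the
# dry run predicts exactly what the real run removes.
ORPHANED_IMAGES = (
    "NOT EXISTS (SELECT 1 FROM lots WHERE lots.lot_id = images.lot_id "
    "AND lots.auction_id IS NOT NULL AND lots.auction_id != '')"
)


def cleanup(conn: sqlite3.Connection, dry_run: bool = True) -> dict:
    """Count (dry run) or delete invalid lots and orphaned images."""
    if dry_run:
        lots = conn.execute(
            f"SELECT COUNT(*) FROM lots WHERE {INVALID_LOTS}").fetchone()[0]
        images = conn.execute(
            f"SELECT COUNT(*) FROM images WHERE {ORPHANED_IMAGES}").fetchone()[0]
        return {"lots": lots, "images": images}

    lots = conn.execute(f"DELETE FROM lots WHERE {INVALID_LOTS}").rowcount
    images = conn.execute(f"DELETE FROM images WHERE {ORPHANED_IMAGES}").rowcount
    conn.commit()            # VACUUM cannot run inside an open transaction
    conn.execute("VACUUM")   # rebuild the file to reclaim freed pages
    return {"lots": lots, "images": images}
```

Running it first with `dry_run=True` mirrors the `--dry-run` flag: nothing is modified, and the returned counts show what a real run would delete.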

Example output:

Current database state:
┌──────────────────────────────────┬────────┐
│ metric                            │ count  │
├──────────────────────────────────┼────────┤
│ Total lots                        │ 16807  │
│ Valid lots (with auction_id)      │ 16794  │
│ Invalid lots (missing auction_id) │ 13     │
└──────────────────────────────────┴────────┘

Analyzing data to clean up...
  → Invalid lots to delete: 13
  → Orphaned images to delete: 0

This will permanently delete the above records.
Continue? (y/N) y

Cleaning up database...
  [1/2] Deleting invalid lots...
  ✓ Deleted 13 invalid lots
  [2/2] Deleting orphaned images...
  ✓ Deleted 0 orphaned images
  [3/3] Compacting database...
  ✓ Database compacted

Final database state:
┌───────────────┬────────┐
│ metric        │ count  │
├───────────────┼────────┤
│ Total lots    │ 16794  │
│ Total images  │ 536502 │
└───────────────┴────────┘

Database size: 8.9G

Solution 3: SQL Manual Cleanup

If you prefer to clean up manually with SQL:

-- Backup first!
-- cp cache.db cache.db.backup

-- Check invalid entries
SELECT COUNT(*), 'Invalid' as type
FROM lots
WHERE auction_id IS NULL OR auction_id = ''
UNION ALL
SELECT COUNT(*), 'Valid'
FROM lots
WHERE auction_id IS NOT NULL AND auction_id != '';

-- Delete invalid lots
DELETE FROM lots
WHERE auction_id IS NULL OR auction_id = '';

-- Delete orphaned images
DELETE FROM images
WHERE lot_id NOT IN (SELECT lot_id FROM lots);

-- Compact database
VACUUM;

Prevention: Production Database Cleanup

To prevent these invalid entries from accumulating on production, you can:

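  1. Clean production database (one-time). Back up cache.db first. The stock alpine image does not include sqlite3, so the command installs it before running the delete:
ssh tour@athena.lan
docker run --rm -v shared-auction-data:/data alpine sh -c "apk add --no-cache sqlite && sqlite3 /data/cache.db \"DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';\""
  2. Update scraper to ensure all lots have auction_id
  3. Add validation in scraper to reject lots without auction_id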
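The scraper-side validation suggested above could look roughly like the following. The scraper's language and data model are not shown in this guide, so this is a sketch under assumptions: lots are represented as dictionaries, the function name `is_valid_lot` is hypothetical, and requiring `auction_id` to parse as a UUID is stricter than the "not empty" rule the guide actually states.

```python
import uuid


def is_valid_lot(lot: dict) -> bool:
    """Accept a lot only if it carries a non-empty, UUID-shaped auction_id."""
    auction_id = lot.get("auction_id")
    if not isinstance(auction_id, str) or not auction_id:
        return False  # missing, None, or empty string
    try:
        uuid.UUID(auction_id)  # assumption: auction ids are always UUIDs
    except ValueError:
        return False
    return True
```

A scraper would call this before writing each lot and log or skip the rejects, so legacy-style entries never reach cache.db.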

When to Clean

Immediately if:

  • Seeing NullPointerException in analytics
  • Dashboard insights failing
  • Country distribution not working

Periodically:

  • 🔄 After syncing from production (if production has invalid data)
  • 🔄 Weekly/monthly maintenance
  • 🔄 Before major testing or demos

Recommendation

Use Solution 1 (Clean Sync) for simplicity:

  • Guarantees clean state
  • No manual SQL needed
  • Shows data quality report
  • Safe (automatic backup)

The 13 invalid entries are from an old scraper and represent only 0.08% of data, so cleaning them up has minimal impact but prevents future errors.


Related Documentation: