# Database Cleanup Guide

## Problem: Mixed Data Formats

Your production database (`cache.db`) contains data from two different scrapers:

### Valid Data (99.92%)

- **Format**: `A1-34732-49` (lot_id) + `c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab` (auction_id UUID)
- **Count**: 16,794 lots
- **Source**: Current GraphQL-based scraper
- **Status**: ✅ Clean, with proper auction_id

### Invalid Data (0.08%)

- **Format**: `bmw-550i-4-4-v8-high-executive-...` (slug as lot_id) + `""` (empty auction_id)
- **Count**: 13 lots
- **Source**: Old legacy scraper
- **Status**: ❌ Missing auction_id, causes issues

## Impact

These 13 invalid entries:

- Cause `NullPointerException` in analytics when grouping by country
- Cannot be properly linked to auctions
- Skew statistics slightly
- May cause issues with intelligence features that rely on auction_id

## Solution 1: Clean Sync (Recommended)

The updated sync script now **automatically removes old local data** before syncing:

```bash
# Windows PowerShell
.\scripts\Sync-ProductionData.ps1

# Linux/Mac
./scripts/sync-production-data.sh --db-only
```

**What it does**:

1. Backs up the existing database to `cache.db.backup-YYYYMMDD-HHMMSS`
2. **Removes the old local database completely**
3. Downloads a fresh copy from production
4. Shows a data quality report

**Output includes**:

```
Database statistics:
┌─────────────┬────────┐
│ table_name  │ count  │
├─────────────┼────────┤
│ auctions    │ 526    │
│ lots        │ 16807  │
│ images      │ 536502 │
│ cache       │ 2134   │
└─────────────┴────────┘

Data quality:
┌────────────────────────────────────┬────────┬────────────┐
│ metric                             │ count  │ percentage │
├────────────────────────────────────┼────────┼────────────┤
│ Valid lots                         │ 16794  │ 99.92%     │
│ Invalid lots (missing auction_id)  │ 13     │ 0.08%      │
│ Lots with intelligence fields      │ 0      │ 0.00%      │
└────────────────────────────────────┴────────┴────────────┘
```

## Solution 2: Manual Cleanup

If you want to clean your existing local database without re-downloading:

```bash
# Dry run (see what would be deleted)
./scripts/cleanup-database.sh --dry-run

# Actual cleanup
./scripts/cleanup-database.sh
```

**What it does**:

1. Creates a backup before cleanup
2. Deletes lots with a missing auction_id
3. Deletes orphaned images (images without matching lots)
4. Compacts the database (VACUUM) to reclaim space
5. Shows before/after statistics

**Example output**:

```
Current database state:
┌───────────────────────────────────┬────────┐
│ metric                            │ count  │
├───────────────────────────────────┼────────┤
│ Total lots                        │ 16807  │
│ Valid lots (with auction_id)      │ 16794  │
│ Invalid lots (missing auction_id) │ 13     │
└───────────────────────────────────┴────────┘

Analyzing data to clean up...
  → Invalid lots to delete: 13
  → Orphaned images to delete: 0

This will permanently delete the above records. Continue? (y/N) y

Cleaning up database...
[1/3] Deleting invalid lots...
  ✓ Deleted 13 invalid lots
[2/3] Deleting orphaned images...
  ✓ Deleted 0 orphaned images
[3/3] Compacting database...
  ✓ Database compacted

Final database state:
┌───────────────┬────────┐
│ metric        │ count  │
├───────────────┼────────┤
│ Total lots    │ 16794  │
│ Total images  │ 536502 │
└───────────────┴────────┘

Database size: 8.9G
```

## Solution 3: SQL Manual Cleanup

If you prefer to clean up manually using SQL:

```sql
-- Backup first!
-- cp cache.db cache.db.backup

-- Check invalid entries
SELECT COUNT(*), 'Invalid' AS type FROM lots
WHERE auction_id IS NULL OR auction_id = ''
UNION ALL
SELECT COUNT(*), 'Valid' FROM lots
WHERE auction_id IS NOT NULL AND auction_id != '';

-- Delete invalid lots
DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';

-- Delete orphaned images
DELETE FROM images WHERE lot_id NOT IN (SELECT lot_id FROM lots);

-- Compact database
VACUUM;
```

## Prevention: Production Database Cleanup

To prevent these invalid entries from accumulating on production, you can:

1. **Clean the production database** (one-time):

   ```bash
   ssh tour@athena.lan
   docker run --rm -v shared-auction-data:/data alpine \
     sh -c "apk add --no-cache sqlite && sqlite3 /data/cache.db \"DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';\""
   ```

   (The base `alpine` image does not ship `sqlite3`, so it is installed first.)

2. **Update the scraper** to ensure all lots have an auction_id
3. **Add validation** in the scraper to reject lots without an auction_id

## When to Clean

### Immediately if:

- ❌ Seeing `NullPointerException` in analytics
- ❌ Dashboard insights failing
- ❌ Country distribution not working

### Periodically:

- 🔄 After syncing from production (if production has invalid data)
- 🔄 Weekly/monthly maintenance
- 🔄 Before major testing or demos

## Recommendation

**Use Solution 1 (Clean Sync)** for simplicity:

- ✅ Guarantees a clean state
- ✅ No manual SQL needed
- ✅ Shows a data quality report
- ✅ Safe (automatic backup)

The 13 invalid entries are from an old scraper and represent only 0.08% of the data, so cleaning them up has minimal impact but prevents future errors.
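The SQL in Solution 3 can also be run as one scripted unit, so the backup, both DELETEs, and the VACUUM always happen together. Below is a minimal sketch using Python's stdlib `sqlite3` module; the `cache.db` path, the `lots`/`images` column names, and the backup naming scheme are taken from this guide, while the `cleanup` function itself is illustrative, not part of the shipped scripts:

```python
import shutil
import sqlite3
from datetime import datetime

def cleanup(db_path: str = "cache.db") -> tuple[int, int]:
    """Delete lots without an auction_id, then images orphaned by that delete."""
    # Backup first, mirroring the sync script's cache.db.backup-YYYYMMDD-HHMMSS naming.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    shutil.copyfile(db_path, f"{db_path}.backup-{stamp}")

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # both DELETEs commit (or roll back) as one transaction
            bad_lots = conn.execute(
                "DELETE FROM lots WHERE auction_id IS NULL OR auction_id = ''"
            ).rowcount
            orphans = conn.execute(
                "DELETE FROM images WHERE lot_id NOT IN (SELECT lot_id FROM lots)"
            ).rowcount
        conn.execute("VACUUM")  # VACUUM cannot run inside an open transaction
    finally:
        conn.close()
    return bad_lots, orphans
```

The deletion order matters: removing invalid lots first is what turns their images into orphans, so the second DELETE picks them up in the same run.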
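Prevention step 3 (rejecting lots without an auction_id at scrape time) amounts to a one-line guard before writing to `cache.db`. A hypothetical sketch, assuming lots arrive as dicts; the function name and lot shape are illustrative, not the scraper's real API, though the UUID format matches the valid data shown above:

```python
import re

# UUID-style auction_id, as in the valid data (e.g. c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab).
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def is_valid_lot(lot: dict) -> bool:
    """Reject lots with a missing, empty, or non-UUID auction_id."""
    auction_id = (lot.get("auction_id") or "").strip()
    return bool(UUID_RE.match(auction_id))
```

Checking the format (not just non-emptiness) also catches the legacy scraper's slug-style identifiers before they reach the database.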
---

**Related Documentation**:

- [Sync Scripts README](../scripts/README.md)
- [Data Sync Setup](DATA_SYNC_SETUP.md)
- [Database Architecture](../wiki/DATABASE_ARCHITECTURE.md)