# Database Cleanup Guide

## Problem: Mixed Data Formats

Your production database (`cache.db`) contains data from two different scrapers:

### Valid Data (99.92%)
- **Format**: `A1-34732-49` (lot_id) + `c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab` (auction_id UUID)
- **Count**: 16,794 lots
- **Source**: Current GraphQL-based scraper
- **Status**: ✅ Clean, with proper auction_id

### Invalid Data (0.08%)
- **Format**: `bmw-550i-4-4-v8-high-executive-...` (slug as lot_id) + `""` (empty auction_id)
- **Count**: 13 lots
- **Source**: Old legacy scraper
- **Status**: ❌ Missing auction_id, causes issues

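The two formats above can be told apart programmatically. The following is a minimal sketch, not the scraper's actual validation: `is_valid_lot` is a hypothetical helper, and the lot_id pattern is an assumption inferred from the single example `A1-34732-49`.

```python
import re
import uuid
from typing import Optional

# Assumed pattern, inferred from the one example `A1-34732-49`:
# an uppercase letter plus digits, then two dash-separated digit groups.
LOT_ID_PATTERN = re.compile(r"^[A-Z]\d+-\d+-\d+$")


def is_valid_lot(lot_id: str, auction_id: Optional[str]) -> bool:
    """Return True if the row looks like data from the current scraper."""
    if not auction_id:  # None or "" marks a legacy row
        return False
    try:
        uuid.UUID(auction_id)  # auction_id must parse as a UUID
    except ValueError:
        return False
    return bool(LOT_ID_PATTERN.match(lot_id))
```

With this sketch, `is_valid_lot("A1-34732-49", "c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab")` is true, while any row with an empty or missing auction_id is rejected regardless of its lot_id.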
## Impact

These 13 invalid entries:
- Cause `NullPointerException` in analytics when grouping by country
- Cannot be properly linked to auctions
- Skew statistics slightly
- May cause issues with intelligence features that rely on auction_id

## Solution 1: Clean Sync (Recommended)

The updated sync script now **automatically removes old local data** before syncing:

```bash
# Windows PowerShell
.\scripts\Sync-ProductionData.ps1

# Linux/Mac
./scripts/sync-production-data.sh --db-only
```

**What it does**:
1. Backs up existing database to `cache.db.backup-YYYYMMDD-HHMMSS`
2. **Removes old local database completely**
3. Downloads fresh copy from production
4. Shows data quality report

**Output includes**:
```
Database statistics:
┌─────────────┬────────┐
│ table_name  │ count  │
├─────────────┼────────┤
│ auctions    │ 526    │
│ lots        │ 16807  │
│ images      │ 536502 │
│ cache       │ 2134   │
└─────────────┴────────┘

Data quality:
┌────────────────────────────────────┬────────┬────────────┐
│ metric                             │ count  │ percentage │
├────────────────────────────────────┼────────┼────────────┤
│ Valid lots                         │ 16794  │ 99.92%     │
│ Invalid lots (missing auction_id)  │ 13     │ 0.08%      │
│ Lots with intelligence fields      │ 0      │ 0.00%      │
└────────────────────────────────────┴────────┴────────────┘
```

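The valid/invalid rows of the data quality report can be reproduced with a single query. This is a rough sketch rather than the sync script's actual implementation: `data_quality` is a hypothetical helper, and the only schema it assumes is the `lots.auction_id` column used by the SQL in this guide.

```python
import sqlite3


def data_quality(conn: sqlite3.Connection) -> dict:
    """Count valid and invalid lots, plus the valid share in percent."""
    total, invalid = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(auction_id IS NULL OR auction_id = ''), 0)
        FROM lots
        """
    ).fetchone()
    valid = total - invalid
    valid_pct = round(100.0 * valid / total, 2) if total else 0.0
    return {"valid": valid, "invalid": invalid, "valid_pct": valid_pct}


# In-memory stand-in for cache.db, filled with the counts quoted above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lots (lot_id TEXT, auction_id TEXT)")
conn.executemany(
    "INSERT INTO lots VALUES (?, ?)",
    [(f"valid-{i}", "c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab") for i in range(16794)]
    + [(f"legacy-{i}", "") for i in range(13)],
)
print(data_quality(conn))
# → {'valid': 16794, 'invalid': 13, 'valid_pct': 99.92}
```

Against a real database, pass `sqlite3.connect("cache.db")` instead of the in-memory connection.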
## Solution 2: Manual Cleanup

If you want to clean your existing local database without re-downloading:

```bash
# Dry run (see what would be deleted)
./scripts/cleanup-database.sh --dry-run

# Actual cleanup
./scripts/cleanup-database.sh
```

**What it does**:
1. Creates backup before cleanup
2. Deletes lots with missing auction_id
3. Deletes orphaned images (images without matching lots)
4. Compacts database (VACUUM) to reclaim space
5. Shows before/after statistics

**Example output**:
```
Current database state:
┌────────────────────────────────────┬────────┐
│ metric                             │ count  │
├────────────────────────────────────┼────────┤
│ Total lots                         │ 16807  │
│ Valid lots (with auction_id)       │ 16794  │
│ Invalid lots (missing auction_id)  │ 13     │
└────────────────────────────────────┴────────┘

Analyzing data to clean up...
→ Invalid lots to delete: 13
→ Orphaned images to delete: 0

This will permanently delete the above records.
Continue? (y/N) y

Cleaning up database...
[1/2] Deleting invalid lots...
✓ Deleted 13 invalid lots
[2/2] Deleting orphaned images...
✓ Deleted 0 orphaned images
[3/3] Compacting database...
✓ Database compacted

Final database state:
┌───────────────┬────────┐
│ metric        │ count  │
├───────────────┼────────┤
│ Total lots    │ 16794  │
│ Total images  │ 536502 │
└───────────────┴────────┘

Database size: 8.9G
```

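For readers who want to see the dry-run and cleanup steps in one place, here is a minimal sketch of the same flow. It is not the contents of `cleanup-database.sh`: the `cleanup` function and its report keys are hypothetical, and the schema is assumed to be just `lots(lot_id, auction_id)` and `images(lot_id)` as used in the SQL in this guide.

```python
import sqlite3
import time

INVALID_LOT = "lots.auction_id IS NULL OR lots.auction_id = ''"
# An image counts as orphaned if no valid lot references its lot_id,
# so the dry run already includes images of lots about to be deleted.
ORPHANED_IMAGE = (
    "NOT EXISTS (SELECT 1 FROM lots WHERE lots.lot_id = images.lot_id "
    "AND lots.auction_id IS NOT NULL AND lots.auction_id != '')"
)


def cleanup(db_path: str, dry_run: bool = True) -> dict:
    """Report invalid lots and orphaned images; delete them unless dry_run."""
    conn = sqlite3.connect(db_path)
    try:
        def count(sql: str) -> int:
            return conn.execute(sql).fetchone()[0]

        report = {
            "invalid_lots": count(f"SELECT COUNT(*) FROM lots WHERE {INVALID_LOT}"),
            "orphaned_images": count(f"SELECT COUNT(*) FROM images WHERE {ORPHANED_IMAGE}"),
            "applied": False,
        }
        if dry_run:
            return report

        # Timestamped backup through SQLite's online backup API.
        backup = sqlite3.connect(f"{db_path}.backup-{time.strftime('%Y%m%d-%H%M%S')}")
        conn.backup(backup)
        backup.close()

        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(f"DELETE FROM lots WHERE {INVALID_LOT}")
            conn.execute(f"DELETE FROM images WHERE {ORPHANED_IMAGE}")
        conn.execute("VACUUM")  # has to run outside a transaction
        report["applied"] = True
        return report
    finally:
        conn.close()
```

Calling `cleanup("cache.db")` only reports counts; `cleanup("cache.db", dry_run=False)` writes the backup, deletes the rows, and compacts the file. Both deletes share one transaction so a failure midway leaves the database unchanged.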
## Solution 3: SQL Manual Cleanup

If you prefer to clean up manually with SQL:

```sql
-- Back up first (run this in a shell, not inside sqlite3):
-- cp cache.db cache.db.backup

-- Check invalid entries
SELECT COUNT(*), 'Invalid' as type
FROM lots
WHERE auction_id IS NULL OR auction_id = ''
UNION ALL
SELECT COUNT(*), 'Valid'
FROM lots
WHERE auction_id IS NOT NULL AND auction_id != '';

-- Delete invalid lots
DELETE FROM lots
WHERE auction_id IS NULL OR auction_id = '';

-- Delete orphaned images
DELETE FROM images
WHERE lot_id NOT IN (SELECT lot_id FROM lots);

-- Compact database
VACUUM;
```

## Prevention: Production Database Cleanup

To prevent these invalid entries from accumulating on production, you can:

1. **Clean production database** (one-time):
   ```bash
   ssh tour@athena.lan
   # The stock alpine image ships without sqlite3, so install it first.
   docker run --rm -v shared-auction-data:/data alpine sh -c \
     "apk add --no-cache sqlite >/dev/null && sqlite3 /data/cache.db \"DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';\""
   ```

2. **Update scraper** to ensure all lots have auction_id
3. **Add validation** in scraper to reject lots without auction_id

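Validation in the scraper can be backed by a guard in the database itself, so that a stray legacy writer cannot reintroduce empty auction_ids. The sketch below is one possible approach and an assumption on our part, not something the current scraper does: the trigger name `lots_require_auction_id` and the helper `install_guard` are made up for illustration.

```python
import sqlite3

# Hypothetical trigger: abort any INSERT into lots that lacks an auction_id.
GUARD_SQL = """
CREATE TRIGGER IF NOT EXISTS lots_require_auction_id
BEFORE INSERT ON lots
WHEN NEW.auction_id IS NULL OR NEW.auction_id = ''
BEGIN
    SELECT RAISE(ABORT, 'lot rejected: missing auction_id');
END;
"""


def install_guard(conn: sqlite3.Connection) -> None:
    """Install the insert guard on an existing lots table."""
    conn.executescript(GUARD_SQL)
```

Once installed, an insert with a NULL or empty auction_id fails with a `sqlite3.DatabaseError` while valid rows are unaffected. The trigger only covers inserts; an `UPDATE` that blanks auction_id would need a second trigger.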
## When to Clean

### Immediately if:
- ❌ Seeing `NullPointerException` in analytics
- ❌ Dashboard insights failing
- ❌ Country distribution not working

### Periodically:
- 🔄 After syncing from production (if production has invalid data)
- 🔄 Weekly/monthly maintenance
- 🔄 Before major testing or demos

## Recommendation

**Use Solution 1 (Clean Sync)** for simplicity:
- ✅ Guarantees clean state
- ✅ No manual SQL needed
- ✅ Shows data quality report
- ✅ Safe (automatic backup)

The 13 invalid entries come from an old scraper and represent only 0.08% of the data, so removing them has minimal impact but prevents future errors.

---

**Related Documentation**:
- [Sync Scripts README](../scripts/README.md)
- [Data Sync Setup](DATA_SYNC_SETUP.md)
- [Database Architecture](../wiki/DATABASE_ARCHITECTURE.md)