# Database Cleanup Guide

## Problem: Mixed Data Formats

Your production database (`cache.db`) contains data from two different scrapers:

### Valid Data (99.92%)

- **Format**: `A1-34732-49` (lot_id) + `c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab` (auction_id UUID)
- **Count**: 16,794 lots
- **Source**: Current GraphQL-based scraper
- **Status**: ✅ Clean, with proper auction_id

### Invalid Data (0.08%)

- **Format**: `bmw-550i-4-4-v8-high-executive-...` (slug as lot_id) + `""` (empty auction_id)
- **Count**: 13 lots
- **Source**: Old legacy scraper
- **Status**: ❌ Missing auction_id, causes issues
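
The two formats are distinct enough to tell apart from the `lot_id` alone. A minimal sketch (the regex is an assumption inferred from the single `A1-34732-49` example, not a documented format; the authoritative check remains the empty `auction_id`):

```python
import re

# Assumed shape based on the example "A1-34732-49"; the real format
# may allow other prefixes or digit counts.
LOT_ID_PATTERN = re.compile(r"^[A-Z]\d+-\d+-\d+$")

def looks_like_current_format(lot_id: str) -> bool:
    """Return True if lot_id matches the current scraper's id style."""
    return bool(LOT_ID_PATTERN.match(lot_id))

print(looks_like_current_format("A1-34732-49"))                     # True
print(looks_like_current_format("bmw-550i-4-4-v8-high-executive"))  # False
```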

## Impact

These 13 invalid entries:
- Cause `NullPointerException` in analytics when grouping by country
- Cannot be properly linked to auctions
- Skew statistics slightly
- May cause issues with intelligence features that rely on auction_id
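
The grouping failure can be reproduced in miniature: looking up the auction for a lot with an empty `auction_id` returns nothing, and touching the result without a guard blows up (a Python analogue of the reported behaviour, not the actual analytics code):

```python
auctions = {"c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab": {"country": "NL"}}
lots = [
    {"lot_id": "A1-34732-49", "auction_id": "c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab"},
    {"lot_id": "bmw-550i-4-4-v8-high-executive", "auction_id": ""},  # legacy entry
]

by_country = {}
for lot in lots:
    auction = auctions.get(lot["auction_id"])  # None for the legacy entry
    if auction is None:
        # Without this guard, auction["country"] on the next line raises,
        # the analogue of the NullPointerException in analytics.
        continue
    by_country.setdefault(auction["country"], []).append(lot["lot_id"])

print(by_country)  # {'NL': ['A1-34732-49']}
```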

## Solution 1: Clean Sync (Recommended)

The updated sync script now **automatically removes old local data** before syncing:

```bash
# Windows PowerShell
.\scripts\Sync-ProductionData.ps1

# Linux/Mac
./scripts/sync-production-data.sh --db-only
```

**What it does**:
1. Backs up existing database to `cache.db.backup-YYYYMMDD-HHMMSS`
2. **Removes old local database completely**
3. Downloads fresh copy from production
4. Shows data quality report
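
The backup step can be sketched as follows (the filename pattern is taken from the guide; the default `cache.db` path is an assumption, pass your own):

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_database(db_path: str = "cache.db") -> Path:
    """Copy the database to <db_path>.backup-YYYYMMDD-HHMMSS before syncing."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = Path(f"{db_path}.backup-{stamp}")
    shutil.copy2(db_path, backup)  # copy2 preserves timestamps
    return backup
```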

**Output includes**:
```
Database statistics:
┌─────────────┬────────┐
│ table_name  │ count  │
├─────────────┼────────┤
│ auctions    │ 526    │
│ lots        │ 16807  │
│ images      │ 536502 │
│ cache       │ 2134   │
└─────────────┴────────┘

Data quality:
┌────────────────────────────────────┬────────┬────────────┐
│ metric                             │ count  │ percentage │
├────────────────────────────────────┼────────┼────────────┤
│ Valid lots                         │ 16794  │ 99.92%     │
│ Invalid lots (missing auction_id)  │ 13     │ 0.08%      │
│ Lots with intelligence fields      │ 0      │ 0.00%      │
└────────────────────────────────────┴────────┴────────────┘
```
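
The percentages in the report follow directly from the counts (a minimal recomputation):

```python
total = 16807
valid = 16794
invalid = total - valid  # 13

print(f"Valid:   {valid / total:.2%}")    # 99.92%
print(f"Invalid: {invalid / total:.2%}")  # 0.08%
```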

## Solution 2: Manual Cleanup

If you want to clean your existing local database without re-downloading:

```bash
# Dry run (see what would be deleted)
./scripts/cleanup-database.sh --dry-run

# Actual cleanup
./scripts/cleanup-database.sh
```

**What it does**:
1. Creates backup before cleanup
2. Deletes lots with missing auction_id
3. Deletes orphaned images (images without matching lots)
4. Compacts database (VACUUM) to reclaim space
5. Shows before/after statistics
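
Steps 2 to 4 can be sketched with Python's built-in `sqlite3` module (in-memory toy schema; the real tables have more columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE lots (lot_id TEXT PRIMARY KEY, auction_id TEXT);
    CREATE TABLE images (image_id INTEGER PRIMARY KEY, lot_id TEXT);
    INSERT INTO lots VALUES
        ('A1-34732-49', 'c1f44ec2-ad6e-4c98-b0e2-cb1d8ccddcab'),
        ('bmw-550i-slug', '');
    INSERT INTO images VALUES (1, 'A1-34732-49'), (2, 'bmw-550i-slug');
""")

# Step 2: delete lots with a missing auction_id
invalid = con.execute(
    "DELETE FROM lots WHERE auction_id IS NULL OR auction_id = ''"
).rowcount
# Step 3: delete images whose lot no longer exists
orphaned = con.execute(
    "DELETE FROM images WHERE lot_id NOT IN (SELECT lot_id FROM lots)"
).rowcount
con.commit()
con.execute("VACUUM")  # step 4: reclaim the freed space

print(invalid, orphaned)  # 1 1
```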

**Example output**:
```
Current database state:
┌───────────────────────────────────┬────────┐
│ metric                            │ count  │
├───────────────────────────────────┼────────┤
│ Total lots                        │ 16807  │
│ Valid lots (with auction_id)      │ 16794  │
│ Invalid lots (missing auction_id) │ 13     │
└───────────────────────────────────┴────────┘

Analyzing data to clean up...
→ Invalid lots to delete: 13
→ Orphaned images to delete: 0

This will permanently delete the above records.
Continue? (y/N) y

Cleaning up database...
[1/3] Deleting invalid lots...
✓ Deleted 13 invalid lots
[2/3] Deleting orphaned images...
✓ Deleted 0 orphaned images
[3/3] Compacting database...
✓ Database compacted

Final database state:
┌───────────────┬────────┐
│ metric        │ count  │
├───────────────┼────────┤
│ Total lots    │ 16794  │
│ Total images  │ 536502 │
└───────────────┴────────┘

Database size: 8.9G
```

## Solution 3: SQL Manual Cleanup

If you prefer to manually clean using SQL:

```sql
-- Backup first!
-- cp cache.db cache.db.backup

-- Check invalid entries
SELECT COUNT(*), 'Invalid' AS type
FROM lots
WHERE auction_id IS NULL OR auction_id = ''
UNION ALL
SELECT COUNT(*), 'Valid'
FROM lots
WHERE auction_id IS NOT NULL AND auction_id != '';

-- Delete invalid lots
DELETE FROM lots
WHERE auction_id IS NULL OR auction_id = '';

-- Delete orphaned images
DELETE FROM images
WHERE lot_id NOT IN (SELECT lot_id FROM lots);

-- Compact database
VACUUM;
```

## Prevention: Production Database Cleanup

To prevent these invalid entries from accumulating on production, you can:

1. **Clean production database** (one-time):
   ```bash
   ssh tour@athena.lan
   # The base alpine image ships without sqlite3, so install it first
   docker run --rm -v shared-auction-data:/data alpine sh -c "apk add --no-cache sqlite && sqlite3 /data/cache.db \"DELETE FROM lots WHERE auction_id IS NULL OR auction_id = '';\""
   ```
2. **Update scraper** to ensure all lots have auction_id
3. **Add validation** in scraper to reject lots without auction_id
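
The validation in point 3 could be as simple as the following sketch (hypothetical helper; only the `lot_id` and `auction_id` field names come from the schema above):

```python
def validate_lot(lot: dict) -> dict:
    """Reject scraped lots that would create invalid database rows."""
    if not lot.get("auction_id"):
        raise ValueError(
            f"lot {lot.get('lot_id')!r} has no auction_id; refusing to store"
        )
    return lot
```

Calling this before every insert turns a silent bad row into a loud scraper error.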

## When to Clean

### Immediately if:
- ❌ Seeing `NullPointerException` in analytics
- ❌ Dashboard insights failing
- ❌ Country distribution not working

### Periodically:
- 🔄 After syncing from production (if production has invalid data)
- 🔄 Weekly/monthly maintenance
- 🔄 Before major testing or demos

## Recommendation

**Use Solution 1 (Clean Sync)** for simplicity:
- ✅ Guarantees clean state
- ✅ No manual SQL needed
- ✅ Shows data quality report
- ✅ Safe (automatic backup)

The 13 invalid entries are from an old scraper and represent only 0.08% of the data, so cleaning them up has minimal impact but prevents future errors.

---

**Related Documentation**:
- [Sync Scripts README](../scripts/README.md)
- [Data Sync Setup](DATA_SYNC_SETUP.md)
- [Database Architecture](../wiki/DATABASE_ARCHITECTURE.md)