# Fixing Malformed Database Entries

## Problem

After the initial scrape run with less strict validation, the database contains entries with incomplete or incorrect data.

### Examples of Malformed Data

```csv
A1-34327,"",https://...,"",€Huidig ​​bod,0,gap,"","","",...
A1-39577,"",https://...,"",€Huidig ​​bod,0,gap,"","","",...
```

**Issues identified:**

1. ❌ Missing `auction_id` (empty string)
2. ❌ Missing `title` (empty string)
3. ❌ Invalid bid value: `€Huidig ​​bod` (Dutch for "Current bid", i.e. placeholder text)
4. ❌ Invalid timestamp: `gap` (should be empty or a valid date)
5. ❌ Missing `viewing_time`, `pickup_date`, and other fields

## Root Cause

Earlier scraping runs:

- Used less strict validation
- Fell back to HTML parsing when `__NEXT_DATA__` JSON extraction failed
- Extracted placeholder text as actual values in the HTML parser
- Continued on errors instead of flagging incomplete data

## Solution

### Step 1: Parser Improvements ✅

**Fixed in `src/parse.py`:**

1. **Timestamp parsing** (lines 37-70):
   - Filters invalid strings like "gap" and "materieel wegens vereffening"
   - Returns an empty string instead of an invalid value
   - Handles Unix timestamps in seconds and milliseconds

2. **Bid extraction** (lines 246-280):
   - Rejects placeholder text like "€Huidig ​​bod" and "€Huidig bod"
   - Removes zero-width Unicode spaces
   - Returns "No bids" instead of invalid placeholder text

### Step 2: Detection and Repair Scripts ✅

Created two scripts to fix existing data:

#### A. `script/migrate_reparse_lots.py`

**Purpose:** Re-parse ALL cached entries with improved JSON extraction

```bash
# Preview what would be changed
python script/migrate_reparse_lots.py --dry-run

# Apply changes
python script/migrate_reparse_lots.py

# Use custom database path
python script/migrate_reparse_lots.py --db /path/to/cache.db
```

**What it does:**

- Reads all cached HTML pages from the `cache` table
- Re-parses them using the improved `__NEXT_DATA__` JSON extraction
- Updates existing database entries with the newly extracted fields
- Populates missing `auction_id`, `viewing_time`, `pickup_date`, etc.

#### B. `script/fix_malformed_entries.py` ⭐ **RECOMMENDED**

**Purpose:** Detect and fix ONLY malformed entries

```bash
# Preview malformed entries and fixes
python script/fix_malformed_entries.py --dry-run

# Fix malformed entries
python script/fix_malformed_entries.py

# Use custom database path
python script/fix_malformed_entries.py --db /path/to/cache.db
```

**What it detects:**

```sql
-- Auctions with issues
SELECT * FROM auctions
WHERE auction_id = '' OR auction_id IS NULL
   OR title = '' OR title IS NULL
   OR first_lot_closing_time = 'gap';

-- Lots with issues
SELECT * FROM lots
WHERE auction_id = '' OR auction_id IS NULL
   OR title = '' OR title IS NULL
   OR current_bid LIKE '%Huidig%bod%'
   OR closing_time = 'gap' OR closing_time = '';
```
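The repair pattern is the same in both scripts: flag the rows matching the criteria above (or all cached pages, in the case of `migrate_reparse_lots.py`), re-parse the cached HTML, sanitize the recovered values, and write them back. The sketch below only illustrates that pattern and is not the actual script code: it assumes a `cache(url, html)` column layout (the real schema may differ), uses a hypothetical `parse_lot_html()` stand-in for the `__NEXT_DATA__` extraction in `src/parse.py`, and adds illustrative `clean_bid()` / `clean_timestamp()` helpers that mirror the filtering described in Step 1.

```python
import sqlite3
from typing import Dict, Optional

ZERO_WIDTH_SPACE = "\u200b"          # the character hiding inside the "Huidig bod" placeholder
INVALID_TIMESTAMPS = {"", "gap"}     # markers the old parser stored as timestamps


def clean_bid(raw: str) -> str:
    """Illustrative version of the Step 1 bid filtering: strip zero-width
    spaces and map "Huidig bod" placeholder text to "No bids"."""
    value = (raw or "").replace(ZERO_WIDTH_SPACE, "").strip()
    if not value or "Huidig bod" in value:
        return "No bids"
    return value


def clean_timestamp(raw: str) -> str:
    """Illustrative version of the Step 1 timestamp filtering: return an
    empty string for known-invalid markers such as "gap"."""
    value = (raw or "").strip()
    return "" if value.lower() in INVALID_TIMESTAMPS else value


def parse_lot_html(html: str) -> Optional[Dict[str, str]]:
    """Hypothetical stand-in for the __NEXT_DATA__ JSON extraction in
    src/parse.py; expected to return a flat dict of lot fields or None."""
    return None  # wire this to the project's real parser


def repair_malformed_lots(db_path: str, dry_run: bool = True) -> None:
    """Flag malformed lots (same criteria as the SQL above), re-parse the
    cached HTML, sanitize the recovered values, and write them back."""
    conn = sqlite3.connect(db_path)
    flagged = conn.execute(
        """SELECT url FROM lots
            WHERE auction_id = '' OR auction_id IS NULL
               OR title = '' OR title IS NULL
               OR current_bid LIKE '%Huidig%bod%'
               OR closing_time = 'gap' OR closing_time = ''"""
    ).fetchall()
    print(f"Found {len(flagged)} malformed lot entries")

    for (url,) in flagged:
        # Assumed cache(url, html) layout; adjust to the actual schema.
        cached = conn.execute("SELECT html FROM cache WHERE url = ?", (url,)).fetchone()
        if cached is None:
            continue  # no cached page for this URL; counted as "failed"
        fields = parse_lot_html(cached[0])
        if not fields:
            continue  # extraction failed; counted as "failed"

        current_bid = clean_bid(fields.get("current_bid", ""))
        closing_time = clean_timestamp(fields.get("closing_time", ""))
        if dry_run:
            print(f"Would fix {url}: bid={current_bid!r}, closing={closing_time!r}")
            continue
        conn.execute(
            "UPDATE lots SET auction_id = ?, title = ?, current_bid = ?, "
            "closing_time = ? WHERE url = ?",
            (fields.get("auction_id", ""), fields.get("title", ""),
             current_bid, closing_time, url),
        )

    if not dry_run:
        conn.commit()
    conn.close()
```

The real script additionally reports each repaired row and prints the summary shown in the example output below.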
**Example output:**

```
=================================================================
MALFORMED ENTRY DETECTION AND REPAIR
=================================================================

1. CHECKING AUCTIONS...
   Found 23 malformed auction entries

   Fixing auction: A1-39577
   URL: https://www.troostwijkauctions.com/a/...-A1-39577
   ✓ Parsed successfully:
     auction_id: A1-39577
     title: Bootveiling Rotterdam - Console boten, RIB, speedboten...
     location: Rotterdam, NL
     lots: 45
   ✓ Database updated

2. CHECKING LOTS...
   Found 127 malformed lot entries

   Fixing lot: A1-39529-10
   URL: https://www.troostwijkauctions.com/l/...-A1-39529-10
   ✓ Parsed successfully:
     lot_id: A1-39529-10
     auction_id: A1-39529
     title: Audi A7 Sportback Personenauto
     bid: No bids
     closing: 2024-12-08 15:30:00
   ✓ Database updated

=================================================================
SUMMARY
=================================================================
Auctions:
  - Found: 23
  - Fixed: 21
  - Failed: 2

Lots:
  - Found: 127
  - Fixed: 124
  - Failed: 3
```

### Step 3: Verification

After running the fix script, verify the data:

```bash
# Check if malformed entries still exist
python -c "
import sqlite3
conn = sqlite3.connect('path/to/cache.db')
print('Auctions with empty auction_id:')
print(conn.execute('SELECT COUNT(*) FROM auctions WHERE auction_id = \"\" OR auction_id IS NULL').fetchone()[0])
print('Lots with invalid bids:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE current_bid LIKE \"%Huidig%bod%\"').fetchone()[0])
print('Lots with \"gap\" timestamps:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE closing_time = \"gap\"').fetchone()[0])
"
```

Expected result after the fix: **all counts should be 0**.

### Step 4: Prevention

To prevent future occurrences:

1. **Validation in the scraper** - add validation before saving to the database:

```python
from typing import Dict


def validate_lot_data(lot_data: Dict) -> bool:
    """Validate lot data before saving"""
    required_fields = ['lot_id', 'title', 'url']
    invalid_values = ['gap', '€Huidig bod', '€Huidig ​​bod', '']

    for field in required_fields:
        value = lot_data.get(field, '')
        if not value or value in invalid_values:
            print(f"  ⚠️ Invalid {field}: {value}")
            return False
    return True


# In the save_lot method:
if not validate_lot_data(lot_data):
    print(f"  ❌ Skipping invalid lot: {lot_data.get('url')}")
    return
```

2. **Prefer JSON over HTML** - ensure `__NEXT_DATA__` parsing is tried first (already implemented)

3. **Logging** - log the fallback to HTML parsing:

```python
if next_data:
    return next_data
else:
    print(f"  ⚠️ No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
    # HTML parsing...
```

## Recommended Workflow

```bash
# 1. First, run a dry-run to see what will be fixed
python script/fix_malformed_entries.py --dry-run

# 2. Review the output - check if the fixes look correct

# 3. Run the actual fix
python script/fix_malformed_entries.py

# 4. Verify the results
python script/fix_malformed_entries.py --dry-run
# Should show "Found 0 malformed auction entries" and "Found 0 malformed lot entries"

# 5. (Optional) Run the full migration to ensure all fields are populated
python script/migrate_reparse_lots.py
```

## Files Modified/Created

### Modified:

- ✅ `src/parse.py` - Improved timestamp and bid parsing with validation

### Created:

- ✅ `script/fix_malformed_entries.py` - Targeted fix for malformed entries
- ✅ `script/migrate_reparse_lots.py` - Full re-parse migration
- ✅ `_wiki/JAVA_FIXES_NEEDED.md` - Java-side fixes documentation
- ✅ `_wiki/FIXING_MALFORMED_ENTRIES.md` - This file

## Database Location

If you get "no such table" errors, find your actual database:

```bash
# Find all .db files
find . -name "*.db"

# Check which one has data
sqlite3 path/to/cache.db "SELECT COUNT(*) FROM lots"

# Use that path with the --db flag
python script/fix_malformed_entries.py --db /actual/path/to/cache.db
```

## Next Steps

After fixing malformed entries:

1. ✅ Run `fix_malformed_entries.py` to repair bad data
2. ⏳ Apply Java-side fixes (see `_wiki/JAVA_FIXES_NEEDED.md`)
3. ⏳ Re-run Java monitoring process
4. ✅ Add validation to prevent future issues