Fixing Malformed Database Entries
Problem
After the initial scrape run with less strict validation, the database contains entries with incomplete or incorrect data:
Examples of Malformed Data
A1-34327,"",https://...,"",€Huidig bod,0,gap,"","","",...
A1-39577,"",https://...,"",€Huidig bod,0,gap,"","","",...
Issues identified:
- ❌ Missing auction_id (empty string)
- ❌ Missing title (empty string)
- ❌ Invalid bid value: €Huidig bod (Dutch for "Current bid", placeholder text)
- ❌ Invalid timestamp: gap (should be empty or a valid date)
- ❌ Missing viewing_time, pickup_date, and other fields
Root Cause
Earlier scraping runs:
- Used less strict validation
- Fell back to HTML parsing when __NEXT_DATA__ JSON extraction failed
- HTML parser extracted placeholder text as actual values
- Continued on errors instead of flagging incomplete data
Solution
Step 1: Parser Improvements ✅
Fixed in src/parse.py:
- Timestamp parsing (lines 37-70):
  - Filters invalid strings like "gap", "materieel wegens vereffening"
  - Returns an empty string instead of the invalid value
  - Handles Unix timestamps in seconds and milliseconds
- Bid extraction (lines 246-280):
  - Rejects placeholder text like "€Huidig bod", including variants that contain invisible Unicode spaces
  - Removes zero-width Unicode spaces
  - Returns "No bids" instead of invalid placeholder text
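The parser rules listed above can be sketched roughly as follows. This is an illustration only, not the code in src/parse.py: the function names normalize_timestamp and clean_bid, the millisecond threshold, and the digit check used to spot placeholder bids are all assumptions.

```python
import re
from datetime import datetime, timezone

# Assumed placeholder strings; the real list in src/parse.py may differ.
INVALID_TIMESTAMPS = {"gap", "materieel wegens vereffening"}
# Zero-width space, zero-width non-joiner, zero-width joiner, byte-order mark.
INVISIBLE_CHARS = re.compile("[\u200b\u200c\u200d\ufeff]")


def normalize_timestamp(value) -> str:
    """Return 'YYYY-MM-DD HH:MM:SS' or '' when the input is unusable."""
    if value is None:
        return ""
    text = str(value).strip()
    if not text or text.lower() in INVALID_TIMESTAMPS:
        return ""
    if text.isdigit():
        seconds = int(text)
        # Assumption: a value this large is in milliseconds, not seconds.
        if seconds > 10**11:
            seconds //= 1000
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
        return moment.strftime("%Y-%m-%d %H:%M:%S")
    try:
        # ISO-like strings are reformatted as-is, without timezone conversion.
        return datetime.fromisoformat(text).strftime("%Y-%m-%d %H:%M:%S")
    except ValueError:
        return ""


def clean_bid(raw) -> str:
    """Return the bid text, or 'No bids' when only label text was scraped."""
    text = INVISIBLE_CHARS.sub("", raw or "").replace("\u00a0", " ").strip()
    # Assumption: a real bid contains at least one digit; "€Huidig bod" does not.
    if not any(ch.isdigit() for ch in text):
        return "No bids"
    return text


print(repr(normalize_timestamp("gap")))
print(normalize_timestamp(1733671800000))
print(clean_bid("€Huidig\u200b bod"))
```

Treating any digit-free bid string as a placeholder is a deliberately blunt choice for the sketch; a stricter parser could match an explicit currency pattern instead.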
Step 2: Detection and Repair Scripts ✅
Created two scripts to fix existing data:
A. script/migrate_reparse_lots.py
Purpose: Re-parse ALL cached entries with improved JSON extraction
# Preview what would be changed
python script/migrate_reparse_lots.py --dry-run
# Apply changes
python script/migrate_reparse_lots.py
# Use custom database path
python script/migrate_reparse_lots.py --db /path/to/cache.db
# Example with an explicit path (adjust to your environment)
python script/migrate_reparse_lots.py --db C:/mnt/okcomputer/output/cache.db
What it does:
- Reads all cached HTML pages from the cache table
- Re-parses them using the improved __NEXT_DATA__ JSON extraction
- Updates existing database entries with newly extracted fields
- Populates missing auction_id, viewing_time, pickup_date, etc.
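The re-parse step depends on pulling the embedded JSON out of each cached page. A minimal sketch of that extraction is shown here, assuming the page embeds its data in a script tag with id __NEXT_DATA__ (the usual Next.js layout). The function name extract_next_data and the keys in the sample payload are hypothetical, and the migration script may do this differently.

```python
import json
import re
from typing import Optional

# Matches the script tag with id __NEXT_DATA__ and captures its body.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


# Hypothetical sample page; real payload keys will differ.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"lot": {"title": "Audi A7"}}}}'
    "</script></body></html>"
)
payload = extract_next_data(sample_html)
print(payload["props"]["pageProps"]["lot"]["title"])
```

Returning None rather than raising lets the caller decide whether to fall back to HTML parsing or to flag the entry as incomplete.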
B. script/fix_malformed_entries.py ⭐ RECOMMENDED
Purpose: Detect and fix ONLY malformed entries
# Preview malformed entries and fixes
python script/fix_malformed_entries.py --dry-run
# Fix malformed entries
python script/fix_malformed_entries.py
# Use custom database path
python script/fix_malformed_entries.py --db /path/to/cache.db
What it detects:
-- Auctions with issues
SELECT * FROM auctions WHERE
    auction_id = '' OR auction_id IS NULL
    OR title = '' OR title IS NULL
    OR first_lot_closing_time = 'gap';
-- Lots with issues
SELECT * FROM lots WHERE
    auction_id = '' OR auction_id IS NULL
    OR title = '' OR title IS NULL
    OR current_bid LIKE '%Huidig%bod%'
    OR closing_time = 'gap' OR closing_time = '';
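To see how the lot query behaves, it can be run against a throwaway in-memory database. The sketch below is not taken from fix_malformed_entries.py: the table layout, the helper name find_malformed_lots, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Same conditions as the lot query above, selecting two columns for brevity.
MALFORMED_LOTS_SQL = """
SELECT lot_id, url FROM lots
WHERE auction_id = '' OR auction_id IS NULL
   OR title = '' OR title IS NULL
   OR current_bid LIKE '%Huidig%bod%'
   OR closing_time = 'gap' OR closing_time = ''
"""


def find_malformed_lots(conn: sqlite3.Connection) -> list:
    """Return (lot_id, url) pairs for lots that need repair."""
    return conn.execute(MALFORMED_LOTS_SQL).fetchall()


# Assumed minimal schema; the real lots table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lots (lot_id TEXT, auction_id TEXT, url TEXT, "
    "title TEXT, current_bid TEXT, closing_time TEXT)"
)
conn.executemany(
    "INSERT INTO lots VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Made-up sample rows: one healthy, one malformed.
        ("SAMPLE-1", "SAMPLE", "https://example.invalid/l/1",
         "Audi A7 Sportback Personenauto", "No bids", "2024-12-08 15:30:00"),
        ("SAMPLE-2", "", "https://example.invalid/l/2",
         "", "€Huidig bod", "gap"),
    ],
)
malformed = find_malformed_lots(conn)
print(f"Found {len(malformed)} malformed lot entries")
```

Only the second sample row is reported, because it has an empty auction_id and title, a placeholder bid, and a "gap" timestamp.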
Example output:
=================================================================
MALFORMED ENTRY DETECTION AND REPAIR
=================================================================
1. CHECKING AUCTIONS...
Found 23 malformed auction entries
Fixing auction: A1-39577
URL: https://www.troostwijkauctions.com/a/...-A1-39577
✓ Parsed successfully:
auction_id: A1-39577
title: Bootveiling Rotterdam - Console boten, RIB, speedboten...
location: Rotterdam, NL
lots: 45
✓ Database updated
2. CHECKING LOTS...
Found 127 malformed lot entries
Fixing lot: A1-39529-10
URL: https://www.troostwijkauctions.com/l/...-A1-39529-10
✓ Parsed successfully:
lot_id: A1-39529-10
auction_id: A1-39529
title: Audi A7 Sportback Personenauto
bid: No bids
closing: 2024-12-08 15:30:00
✓ Database updated
=================================================================
SUMMARY
=================================================================
Auctions:
- Found: 23
- Fixed: 21
- Failed: 2
Lots:
- Found: 127
- Fixed: 124
- Failed: 3
Step 3: Verification
After running the fix script, verify the data:
# Check if malformed entries still exist
python -c "
import sqlite3
conn = sqlite3.connect('path/to/cache.db')
print('Auctions with empty auction_id:')
print(conn.execute('SELECT COUNT(*) FROM auctions WHERE auction_id = \"\" OR auction_id IS NULL').fetchone()[0])
print('Lots with invalid bids:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE current_bid LIKE \"%Huidig%bod%\"').fetchone()[0])
print('Lots with \"gap\" timestamps:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE closing_time = \"gap\"').fetchone()[0])
"
Expected result after fix: All counts should be 0
Step 4: Prevention
To prevent future occurrences:
- Validation in scraper - Add validation before saving to database:
from typing import Dict

def validate_lot_data(lot_data: Dict) -> bool:
    """Validate lot data before saving."""
    required_fields = ['lot_id', 'title', 'url']
    # Placeholder strings the HTML fallback is known to emit
    invalid_values = ['gap', '€Huidig bod', '']
    for field in required_fields:
        # Normalize invisible spaces so disguised placeholders are still caught
        value = str(lot_data.get(field) or '').replace('\u200b', '').replace('\u00a0', ' ').strip()
        if not value or value in invalid_values:
            print(f"  ⚠️ Invalid {field}: {value!r}")
            return False
    return True

# In save_lot method:
if not validate_lot_data(lot_data):
    print(f"  ❌ Skipping invalid lot: {lot_data.get('url')}")
    return
- Prefer JSON over HTML - Ensure __NEXT_DATA__ parsing is tried first (already implemented)
- Logging - Add logging for fallback to HTML parsing:
if next_data:
    return next_data
else:
    print(f"  ⚠️ No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
    # HTML parsing...
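If the fallback should be visible in log files rather than only on stdout, the same idea can be expressed with the standard logging module. This is a sketch under assumptions: parse_page, the logger name, and the two parser callables are hypothetical stand-ins for whatever the scraper actually uses.

```python
import logging
from typing import Callable, Optional

# Hypothetical logger name; use whatever naming the scraper already follows.
logger = logging.getLogger("scraper.parse")


def parse_page(
    url: str,
    html: str,
    parse_json: Callable[[str], Optional[dict]],
    parse_html: Callable[[str], dict],
) -> dict:
    """Try the JSON parser first and log a warning when falling back to HTML."""
    data = parse_json(html)
    if data:
        return data
    logger.warning(
        "No __NEXT_DATA__ found, falling back to HTML parsing: %s", url
    )
    return parse_html(html)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # Stand-in parsers: the JSON parser finds nothing, so the fallback runs.
    result = parse_page(
        "https://example.invalid/l/1",
        "<html></html>",
        lambda html: None,
        lambda html: {"title": "from HTML"},
    )
    print(result)
```

A warning-level record makes it straightforward to count fallbacks per run and to spot pages whose data should be re-checked.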
Recommended Workflow
# 1. First, run dry-run to see what will be fixed
python script/fix_malformed_entries.py --dry-run
# 2. Review the output - check if fixes look correct
# 3. Run the actual fix
python script/fix_malformed_entries.py
# 4. Verify the results
python script/fix_malformed_entries.py --dry-run
# Should show "Found 0 malformed auction entries" and "Found 0 malformed lot entries"
# 5. (Optional) Run full migration to ensure all fields are populated
python script/migrate_reparse_lots.py
Files Modified/Created
Modified:
- ✅ src/parse.py - Improved timestamp and bid parsing with validation
Created:
- ✅ script/fix_malformed_entries.py - Targeted fix for malformed entries
- ✅ script/migrate_reparse_lots.py - Full re-parse migration
- ✅ _wiki/JAVA_FIXES_NEEDED.md - Java-side fixes documentation
- ✅ _wiki/FIXING_MALFORMED_ENTRIES.md - This file
Database Location
If you get "no such table" errors, find your actual database:
# Find all .db files
find . -name "*.db"
# Check which one has data
sqlite3 path/to/cache.db "SELECT COUNT(*) FROM lots"
# Use that path with --db flag
python script/fix_malformed_entries.py --db /actual/path/to/cache.db
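On systems without find or the sqlite3 command-line tool, the same search can be done from Python. The helper below is a hypothetical sketch, not part of the repository: it walks a directory tree, opens each .db file read-only, and reports how many rows its lots table holds.

```python
import sqlite3
from pathlib import Path
from typing import Dict


def count_lots_per_db(root: str = ".") -> Dict[str, int]:
    """Map each .db file under root that has a lots table to its row count."""
    results: Dict[str, int] = {}
    for path in sorted(Path(root).rglob("*.db")):
        # Read-only URI, so probing never creates or modifies a file.
        uri = path.resolve().as_uri() + "?mode=ro"
        try:
            conn = sqlite3.connect(uri, uri=True)
            try:
                row = conn.execute("SELECT COUNT(*) FROM lots").fetchone()
                results[str(path)] = row[0]
            finally:
                conn.close()
        except sqlite3.Error:
            # Not an SQLite file, or it has no lots table: skip it.
            continue
    return results


if __name__ == "__main__":
    for db_path, count in count_lots_per_db(".").items():
        print(f"{db_path}: {count} lots")
```

Pass the path with the highest count to the --db flag. Opening in read-only mode matters because a plain sqlite3.connect call on a wrong path silently creates an empty database, which is one way to end up with "no such table" errors.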
Next Steps
After fixing malformed entries:
- ✅ Run fix_malformed_entries.py to repair bad data
- ⏳ Apply Java-side fixes (see _wiki/JAVA_FIXES_NEEDED.md)
- ⏳ Re-run Java monitoring process
- ✅ Add validation to prevent future issues