Fixing Malformed Database Entries

Problem

After the initial scrape run with less strict validation, the database contains entries with incomplete or incorrect data:

Examples of Malformed Data

A1-34327,"",https://...,"",€Huidig bod,0,gap,"","","",...
A1-39577,"",https://...,"",€Huidig bod,0,gap,"","","",...

Issues identified:

  1. Missing auction_id (empty string)
  2. Missing title (empty string)
  3. Invalid bid value: €Huidig bod (Dutch for "Current bid" - placeholder text)
  4. Invalid timestamp: gap (should be empty or valid date)
  5. Missing viewing_time, pickup_date, and other fields

Root Cause

Earlier scraping runs:

  • Used less strict validation
  • Fell back to HTML parsing when __NEXT_DATA__ JSON extraction failed
  • HTML parser extracted placeholder text as actual values (illustrated below)
  • Continued on errors instead of flagging incomplete data
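
For illustration, the kind of HTML fallback that produced these values looks roughly like this (the markup and selector are hypothetical, not the site's actual structure):

from bs4 import BeautifulSoup

# Hypothetical placeholder markup: the lot has no bids yet, so the page shows
# "€Huidig bod" ("Current bid"), here with a zero-width space after the euro sign.
html = '<div class="bid-label">€\u200bHuidig bod</div>'
soup = BeautifulSoup(html, "html.parser")

# Naive extraction stores the placeholder text verbatim as the "bid"
bid = soup.select_one(".bid-label").get_text(strip=True)
print(bid)  # €Huidig bod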

Solution

Step 1: Parser Improvements

Fixed in src/parse.py (a sketch of the filtering follows this list):

  1. Timestamp parsing (lines 37-70):

    • Filters invalid strings like "gap", "materieel wegens vereffening"
    • Returns empty string instead of invalid value
    • Handles Unix timestamps in seconds and milliseconds
  2. Bid extraction (lines 246-280):

    • Rejects placeholder text such as "€Huidig bod" (including visually identical variants containing zero-width spaces)
    • Removes zero-width Unicode spaces
    • Returns "No bids" instead of invalid placeholder text

Step 2: Detection and Repair Scripts

Created two scripts to fix existing data:

A. script/migrate_reparse_lots.py

Purpose: Re-parse ALL cached entries with improved JSON extraction

# Preview what would be changed
python script/migrate_reparse_lots.py --dry-run
# Apply changes
python script/migrate_reparse_lots.py

# Use custom database path
python script/migrate_reparse_lots.py --db /path/to/cache.db

What it does (sketched below):

  • Reads all cached HTML pages from cache table
  • Re-parses using improved __NEXT_DATA__ JSON extraction
  • Updates existing database entries with newly extracted fields
  • Populates missing auction_id, viewing_time, pickup_date, etc.
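
A rough sketch of that loop, assuming a cache table with url and html columns and a parser callable (the real script's table names, column names, and parser entry point may differ):

import sqlite3
from typing import Callable, Optional

def reparse_all(db_path: str,
                parse_lot_page: Callable[[str, str], Optional[dict]],
                dry_run: bool = False) -> None:
    """Re-parse every cached page and update the corresponding lot row."""
    conn = sqlite3.connect(db_path)
    for url, html in conn.execute("SELECT url, html FROM cache"):
        lot = parse_lot_page(html, url)  # improved __NEXT_DATA__ extraction
        if not lot:
            continue
        if dry_run:
            print(f"Would update {lot.get('lot_id')} from {url}")
            continue
        conn.execute(
            "UPDATE lots SET auction_id = ?, title = ?, current_bid = ?, closing_time = ? "
            "WHERE url = ?",
            (lot.get("auction_id"), lot.get("title"), lot.get("current_bid"),
             lot.get("closing_time"), url),
        )
    conn.commit()
    conn.close()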

B. script/fix_malformed_entries.py

Purpose: Detect and fix ONLY malformed entries

# Preview malformed entries and fixes
python script/fix_malformed_entries.py --dry-run

# Fix malformed entries
python script/fix_malformed_entries.py

# Use custom database path
python script/fix_malformed_entries.py --db /path/to/cache.db

What it detects (a repair-loop sketch follows the queries):

-- Auctions with issues
SELECT * FROM auctions WHERE
    auction_id = '' OR auction_id IS NULL
    OR title = '' OR title IS NULL
    OR first_lot_closing_time = 'gap'

-- Lots with issues
SELECT * FROM lots WHERE
    auction_id = '' OR auction_id IS NULL
    OR title = '' OR title IS NULL
    OR current_bid LIKE '%Huidig%bod%'
    OR closing_time = 'gap' OR closing_time = ''
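
Conceptually, the script runs a query like the one above and re-parses the cached HTML for each hit. A condensed sketch, with table and column names assumed as before:

import sqlite3
from typing import Callable, Optional

MALFORMED_LOTS_SQL = """
    SELECT url FROM lots WHERE
        auction_id = '' OR auction_id IS NULL
        OR title = '' OR title IS NULL
        OR current_bid LIKE '%Huidig%bod%'
        OR closing_time = 'gap' OR closing_time = ''
"""

def fix_malformed_lots(db_path: str,
                       parse_lot_page: Callable[[str, str], Optional[dict]]) -> None:
    """Re-parse only the rows flagged by the detection query."""
    conn = sqlite3.connect(db_path)
    urls = [row[0] for row in conn.execute(MALFORMED_LOTS_SQL)]
    print(f"   Found {len(urls)} malformed lot entries")
    for url in urls:
        cached = conn.execute("SELECT html FROM cache WHERE url = ?", (url,)).fetchone()
        lot = parse_lot_page(cached[0], url) if cached else None
        if not lot:
            print(f"  ✗ Could not repair {url}")
            continue
        conn.execute(
            "UPDATE lots SET auction_id = ?, title = ?, current_bid = ?, closing_time = ? "
            "WHERE url = ?",
            (lot.get("auction_id"), lot.get("title"), lot.get("current_bid"),
             lot.get("closing_time"), url),
        )
    conn.commit()
    conn.close()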

Example output:

=================================================================
MALFORMED ENTRY DETECTION AND REPAIR
=================================================================

1. CHECKING AUCTIONS...
   Found 23 malformed auction entries

  Fixing auction: A1-39577
    URL: https://www.troostwijkauctions.com/a/...-A1-39577
    ✓ Parsed successfully:
       auction_id: A1-39577
       title: Bootveiling Rotterdam - Console boten, RIB, speedboten...
       location: Rotterdam, NL
       lots: 45
    ✓ Database updated

2. CHECKING LOTS...
   Found 127 malformed lot entries

  Fixing lot: A1-39529-10
    URL: https://www.troostwijkauctions.com/l/...-A1-39529-10
    ✓ Parsed successfully:
       lot_id: A1-39529-10
       auction_id: A1-39529
       title: Audi A7 Sportback Personenauto
       bid: No bids
       closing: 2024-12-08 15:30:00
    ✓ Database updated

=================================================================
SUMMARY
=================================================================
Auctions:
  - Found:  23
  - Fixed:  21
  - Failed: 2

Lots:
  - Found:  127
  - Fixed:  124
  - Failed: 3

Step 3: Verification

After running the fix script, verify the data:

# Check if malformed entries still exist
python -c "
import sqlite3
conn = sqlite3.connect('path/to/cache.db')

print('Auctions with empty auction_id:')
print(conn.execute(\"SELECT COUNT(*) FROM auctions WHERE auction_id = '' OR auction_id IS NULL\").fetchone()[0])

print('Lots with invalid bids:')
print(conn.execute(\"SELECT COUNT(*) FROM lots WHERE current_bid LIKE '%Huidig%bod%'\").fetchone()[0])

print('Lots with \"gap\" timestamps:')
print(conn.execute(\"SELECT COUNT(*) FROM lots WHERE closing_time = 'gap'\").fetchone()[0])
"

Expected result after fix: All counts should be 0

Step 4: Prevention

To prevent future occurrences:

  1. Validation in scraper - Add validation before saving to database:

from typing import Dict

def validate_lot_data(lot_data: Dict) -> bool:
    """Validate lot data before saving"""
    required_fields = ['lot_id', 'title', 'url']
    # The second placeholder variant contains a zero-width space after the euro sign
    invalid_values = ['gap', '€Huidig bod', '€\u200bHuidig bod', '']

    for field in required_fields:
        value = lot_data.get(field, '')
        if not value or value in invalid_values:
            print(f"  ⚠️  Invalid {field}: {value}")
            return False

    return True

# In save_lot method:
if not validate_lot_data(lot_data):
    print(f"  ❌ Skipping invalid lot: {lot_data.get('url')}")
    return
  2. Prefer JSON over HTML - Ensure __NEXT_DATA__ parsing is tried first (already implemented; a sketch follows this list)

  3. Logging - Add logging for fallback to HTML parsing:

if next_data:
    return next_data
else:
    print(f"  ⚠️  No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
    # HTML parsing...
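
A minimal sketch of the JSON-first approach combining points 2 and 3, assuming an illustrative helper name (extract_next_data is not necessarily the function used in src/parse.py):

import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html: str, url: str = "") -> Optional[dict]:
    """Return the embedded Next.js JSON payload, or None when it is missing."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        print(f"  ⚠️  No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        print(f"  ⚠️  __NEXT_DATA__ present but not valid JSON: {url}")
        return None
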
Recommended Workflow

# 1. First, run dry-run to see what will be fixed
python script/fix_malformed_entries.py --dry-run

# 2. Review the output - check if fixes look correct

# 3. Run the actual fix
python script/fix_malformed_entries.py

# 4. Verify the results
python script/fix_malformed_entries.py --dry-run
# Should show "Found 0 malformed auction entries" and "Found 0 malformed lot entries"

# 5. (Optional) Run full migration to ensure all fields are populated
python script/migrate_reparse_lots.py

Files Modified/Created

Modified:

  • src/parse.py - Improved timestamp and bid parsing with validation

Created:

  • script/fix_malformed_entries.py - Targeted fix for malformed entries
  • script/migrate_reparse_lots.py - Full re-parse migration
  • _wiki/JAVA_FIXES_NEEDED.md - Java-side fixes documentation
  • _wiki/FIXING_MALFORMED_ENTRIES.md - This file

Database Location

If you get "no such table" errors, find your actual database:

# Find all .db files
find . -name "*.db"

# Check which one has data
sqlite3 path/to/cache.db "SELECT COUNT(*) FROM lots"

# Use that path with --db flag
python script/fix_malformed_entries.py --db /actual/path/to/cache.db

Next Steps

After fixing malformed entries:

  1. Run fix_malformed_entries.py to repair bad data
  2. Apply Java-side fixes (see _wiki/JAVA_FIXES_NEEDED.md)
  3. Re-run Java monitoring process
  4. Add validation to prevent future issues