integrating with monitor app

This commit is contained in:
Tour
2025-12-05 06:48:08 +01:00
parent 72afdf772b
commit aea188699f
7 changed files with 1234 additions and 7 deletions

View File

@@ -0,0 +1,262 @@
# Fixing Malformed Database Entries
## Problem
After the initial scrape run with less strict validation, the database contains entries with incomplete or incorrect data:
### Examples of Malformed Data
```csv
A1-34327,"",https://...,"",€Huidig bod,0,gap,"","","",...
A1-39577,"",https://...,"",€Huidig bod,0,gap,"","","",...
```
**Issues identified:**
1. ❌ Missing `auction_id` (empty string)
2. ❌ Missing `title` (empty string)
3. ❌ Invalid bid value: `€Huidig bod` (Dutch for "Current bid" - placeholder text)
4. ❌ Invalid timestamp: `gap` (should be empty or valid date)
5. ❌ Missing `viewing_time`, `pickup_date`, and other fields
## Root Cause
Earlier scraping runs:
- Used less strict validation
- Fell back to HTML parsing when `__NEXT_DATA__` JSON extraction failed
- HTML parser extracted placeholder text as actual values
- Continued on errors instead of flagging incomplete data
## Solution
### Step 1: Parser Improvements ✅
**Fixed in `src/parse.py`:**
1. **Timestamp parsing** (lines 37-70):
- Filters invalid strings like "gap", "materieel wegens vereffening"
- Returns empty string instead of invalid value
- Handles Unix timestamps in seconds and milliseconds
2. **Bid extraction** (lines 246-280):
- Rejects placeholder text like "€Huidig bod", "€Huidig bod"
- Removes zero-width Unicode spaces
- Returns "No bids" instead of invalid placeholder text
### Step 2: Detection and Repair Scripts ✅
Created two scripts to fix existing data:
#### A. `script/migrate_reparse_lots.py`
**Purpose:** Re-parse ALL cached entries with improved JSON extraction
```bash
# Preview what would be changed
# python script/fix_malformed_entries.py --db C:/mnt/okcomputer/output/cache.db
python script/migrate_reparse_lots.py --db C:/mnt/okcomputer/output/cache.db
```
```bash
# Preview what would be changed
python script/migrate_reparse_lots.py --dry-run
# Apply changes
python script/migrate_reparse_lots.py
# Use custom database path
python script/migrate_reparse_lots.py --db /path/to/cache.db
```
**What it does:**
- Reads all cached HTML pages from `cache` table
- Re-parses using improved `__NEXT_DATA__` JSON extraction
- Updates existing database entries with newly extracted fields
- Populates missing `auction_id`, `viewing_time`, `pickup_date`, etc.
#### B. `script/fix_malformed_entries.py` ⭐ **RECOMMENDED**
**Purpose:** Detect and fix ONLY malformed entries
```bash
# Preview malformed entries and fixes
python script/fix_malformed_entries.py --dry-run
# Fix malformed entries
python script/fix_malformed_entries.py
# Use custom database path
python script/fix_malformed_entries.py --db /path/to/cache.db
```
**What it detects:**
```sql
-- Auctions with issues
SELECT * FROM auctions WHERE
auction_id = '' OR auction_id IS NULL
OR title = '' OR title IS NULL
OR first_lot_closing_time = 'gap'
-- Lots with issues
SELECT * FROM lots WHERE
auction_id = '' OR auction_id IS NULL
OR title = '' OR title IS NULL
OR current_bid LIKE '%Huidig%bod%'
OR closing_time = 'gap' OR closing_time = ''
```
**Example output:**
```
=================================================================
MALFORMED ENTRY DETECTION AND REPAIR
=================================================================
1. CHECKING AUCTIONS...
Found 23 malformed auction entries
Fixing auction: A1-39577
URL: https://www.troostwijkauctions.com/a/...-A1-39577
✓ Parsed successfully:
auction_id: A1-39577
title: Bootveiling Rotterdam - Console boten, RIB, speedboten...
location: Rotterdam, NL
lots: 45
✓ Database updated
2. CHECKING LOTS...
Found 127 malformed lot entries
Fixing lot: A1-39529-10
URL: https://www.troostwijkauctions.com/l/...-A1-39529-10
✓ Parsed successfully:
lot_id: A1-39529-10
auction_id: A1-39529
title: Audi A7 Sportback Personenauto
bid: No bids
closing: 2024-12-08 15:30:00
✓ Database updated
=================================================================
SUMMARY
=================================================================
Auctions:
- Found: 23
- Fixed: 21
- Failed: 2
Lots:
- Found: 127
- Fixed: 124
- Failed: 3
```
### Step 3: Verification
After running the fix script, verify the data:
```bash
# Check if malformed entries still exist
python -c "
import sqlite3
conn = sqlite3.connect('path/to/cache.db')
print('Auctions with empty auction_id:')
print(conn.execute('SELECT COUNT(*) FROM auctions WHERE auction_id = \"\" OR auction_id IS NULL').fetchone()[0])
print('Lots with invalid bids:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE current_bid LIKE \"%Huidig%bod%\"').fetchone()[0])
print('Lots with \"gap\" timestamps:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE closing_time = \"gap\"').fetchone()[0])
"
```
Expected result after fix: **All counts should be 0**
### Step 4: Prevention
To prevent future occurrences:
1. **Validation in scraper** - Add validation before saving to database:
```python
def validate_lot_data(lot_data: Dict) -> bool:
"""Validate lot data before saving"""
required_fields = ['lot_id', 'title', 'url']
invalid_values = ['gap', '€Huidig bod', '€Huidig bod', '']
for field in required_fields:
value = lot_data.get(field, '')
if not value or value in invalid_values:
print(f" ⚠️ Invalid {field}: {value}")
return False
return True
# In save_lot method:
if not validate_lot_data(lot_data):
print(f" ❌ Skipping invalid lot: {lot_data.get('url')}")
return
```
2. **Prefer JSON over HTML** - Ensure `__NEXT_DATA__` parsing is tried first (already implemented)
3. **Logging** - Add logging for fallback to HTML parsing:
```python
if next_data:
return next_data
else:
print(f" ⚠️ No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
# HTML parsing...
```
## Recommended Workflow
```bash
# 1. First, run dry-run to see what will be fixed
python script/fix_malformed_entries.py --dry-run
# 2. Review the output - check if fixes look correct
# 3. Run the actual fix
python script/fix_malformed_entries.py
# 4. Verify the results
python script/fix_malformed_entries.py --dry-run
# Should show "Found 0 malformed auction entries" and "Found 0 malformed lot entries"
# 5. (Optional) Run full migration to ensure all fields are populated
python script/migrate_reparse_lots.py
```
## Files Modified/Created
### Modified:
-`src/parse.py` - Improved timestamp and bid parsing with validation
### Created:
-`script/fix_malformed_entries.py` - Targeted fix for malformed entries
-`script/migrate_reparse_lots.py` - Full re-parse migration
-`_wiki/JAVA_FIXES_NEEDED.md` - Java-side fixes documentation
-`_wiki/FIXING_MALFORMED_ENTRIES.md` - This file
## Database Location
If you get "no such table" errors, find your actual database:
```bash
# Find all .db files
find . -name "*.db"
# Check which one has data
sqlite3 path/to/cache.db "SELECT COUNT(*) FROM lots"
# Use that path with --db flag
python script/fix_malformed_entries.py --db /actual/path/to/cache.db
```
## Next Steps
After fixing malformed entries:
1. ✅ Run `fix_malformed_entries.py` to repair bad data
2. ⏳ Apply Java-side fixes (see `_wiki/JAVA_FIXES_NEEDED.md`)
3. ⏳ Re-run Java monitoring process
4. ✅ Add validation to prevent future issues