# Fixing Malformed Database Entries

## Problem

After the initial scrape run with less strict validation, the database contains entries with incomplete or incorrect data:

### Examples of Malformed Data

```csv
A1-34327,"",https://...,"",€Huidig bod,0,gap,"","","",...
A1-39577,"",https://...,"",€Huidig bod,0,gap,"","","",...
```

**Issues identified:**

1. ❌ Missing `auction_id` (empty string)
2. ❌ Missing `title` (empty string)
3. ❌ Invalid bid value: `€Huidig bod` (Dutch for "Current bid" - placeholder text)
4. ❌ Invalid timestamp: `gap` (should be empty or a valid date)
5. ❌ Missing `viewing_time`, `pickup_date`, and other fields

## Root Cause

Earlier scraping runs:

- Used less strict validation
- Fell back to HTML parsing when `__NEXT_DATA__` JSON extraction failed
- Let the HTML parser capture placeholder text as actual values
- Continued on errors instead of flagging incomplete data

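The failure mode is easy to reproduce: the server-rendered HTML contains placeholder text that is only replaced client-side, so a naive HTML fallback stores the placeholder as if it were a real value. A minimal illustration (the markup and CSS class below are invented for this example, not taken from the real pages):

```python
# Illustration only: the markup and selector are made up for this example.
from bs4 import BeautifulSoup

# Server-rendered HTML before client-side hydration fills in the real bid.
html = '<span class="lot-bid">\u200b€Huidig bod</span>'

soup = BeautifulSoup(html, "html.parser")
bid = soup.select_one(".lot-bid").get_text(strip=True)
print(bid)  # prints the placeholder "€Huidig bod", not a price
```
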
## Solution

### Step 1: Parser Improvements ✅

**Fixed in `src/parse.py`:**

1. **Timestamp parsing** (lines 37-70):
   - Filters invalid strings like "gap" and "materieel wegens vereffening"
   - Returns an empty string instead of an invalid value
   - Handles Unix timestamps in both seconds and milliseconds

2. **Bid extraction** (lines 246-280):
   - Rejects placeholder text such as "€Huidig bod" (including variants padded with zero-width spaces)
   - Removes zero-width Unicode spaces
   - Returns "No bids" instead of invalid placeholder text

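The exact implementation lives in `src/parse.py`; the sketch below only restates the rules listed above (function names, the placeholder list, and the milliseconds threshold are illustrative):

```python
# Sketch of the validation rules above, not the exact code in src/parse.py.
from datetime import datetime, timezone

INVALID_TIMESTAMP_STRINGS = {"gap", "materieel wegens vereffening"}
PLACEHOLDER_BID_TEXTS = {"€huidig bod"}  # compared lowercase

def parse_timestamp(value) -> str:
    """Return a formatted UTC timestamp, or '' for known-bad values."""
    if isinstance(value, str):
        cleaned = value.strip()
        if not cleaned or cleaned.lower() in INVALID_TIMESTAMP_STRINGS:
            return ""
        try:
            value = float(cleaned)
        except ValueError:
            return cleaned  # assume it is already a formatted date string
    if value > 1e12:  # heuristic: very large values are milliseconds
        value /= 1000.0
    return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

def extract_bid(raw: str) -> str:
    """Strip zero-width spaces and reject placeholder text like '€Huidig bod'."""
    cleaned = raw.replace("\u200b", "").strip()
    if not cleaned or cleaned.lower() in PLACEHOLDER_BID_TEXTS:
        return "No bids"
    return cleaned
```
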
### Step 2: Detection and Repair Scripts ✅

Created two scripts to fix existing data:

#### A. `script/migrate_reparse_lots.py`

**Purpose:** Re-parse ALL cached entries with improved JSON extraction

```bash
# Preview what would be changed
python script/migrate_reparse_lots.py --dry-run

# Apply changes
python script/migrate_reparse_lots.py

# Use custom database path
python script/migrate_reparse_lots.py --db /path/to/cache.db

# Example with an explicit Windows database path
python script/migrate_reparse_lots.py --db C:/mnt/okcomputer/output/cache.db
```

**What it does:**

- Reads all cached HTML pages from the `cache` table
- Re-parses them using the improved `__NEXT_DATA__` JSON extraction
- Updates existing database entries with the newly extracted fields
- Populates missing `auction_id`, `viewing_time`, `pickup_date`, etc.

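For reference, `__NEXT_DATA__` extraction comes down to pulling the JSON blob that Next.js embeds in every page. A minimal sketch (the regex-based approach and the function name are assumptions; the real logic is in `script/migrate_reparse_lots.py`):

```python
import json
import re
from typing import Optional

# Next.js embeds the full page state as JSON in a <script id="__NEXT_DATA__"> tag.
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if missing or broken."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```
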
#### B. `script/fix_malformed_entries.py` ⭐ **RECOMMENDED**

**Purpose:** Detect and fix ONLY malformed entries

```bash
# Preview malformed entries and fixes
python script/fix_malformed_entries.py --dry-run

# Fix malformed entries
python script/fix_malformed_entries.py

# Use custom database path
python script/fix_malformed_entries.py --db /path/to/cache.db
```

**What it detects:**

```sql
-- Auctions with issues
SELECT * FROM auctions WHERE
    auction_id = '' OR auction_id IS NULL
    OR title = '' OR title IS NULL
    OR first_lot_closing_time = 'gap';

-- Lots with issues
SELECT * FROM lots WHERE
    auction_id = '' OR auction_id IS NULL
    OR title = '' OR title IS NULL
    OR current_bid LIKE '%Huidig%bod%'
    OR closing_time = 'gap' OR closing_time = '';
```

**Example output:**

```
=================================================================
MALFORMED ENTRY DETECTION AND REPAIR
=================================================================

1. CHECKING AUCTIONS...
   Found 23 malformed auction entries

   Fixing auction: A1-39577
     URL: https://www.troostwijkauctions.com/a/...-A1-39577
     ✓ Parsed successfully:
       auction_id: A1-39577
       title: Bootveiling Rotterdam - Console boten, RIB, speedboten...
       location: Rotterdam, NL
       lots: 45
     ✓ Database updated

2. CHECKING LOTS...
   Found 127 malformed lot entries

   Fixing lot: A1-39529-10
     URL: https://www.troostwijkauctions.com/l/...-A1-39529-10
     ✓ Parsed successfully:
       lot_id: A1-39529-10
       auction_id: A1-39529
       title: Audi A7 Sportback Personenauto
       bid: No bids
       closing: 2024-12-08 15:30:00
     ✓ Database updated

=================================================================
SUMMARY
=================================================================
Auctions:
  - Found: 23
  - Fixed: 21
  - Failed: 2

Lots:
  - Found: 127
  - Fixed: 124
  - Failed: 3
```

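Conceptually, the repair step loops over the rows matched by the queries above, looks up the cached HTML for each URL, re-parses it, and updates the row. A rough sketch (the `cache(url, html)` layout and the `parse_lot_page` helper are assumptions; the real logic is in `script/fix_malformed_entries.py`):

```python
import sqlite3

from src.parse import parse_lot_page  # assumed import; the real helper name may differ

MALFORMED_LOTS_QUERY = """
    SELECT url FROM lots WHERE
        auction_id = '' OR auction_id IS NULL
        OR title = '' OR title IS NULL
        OR current_bid LIKE '%Huidig%bod%'
        OR closing_time IN ('gap', '')
"""

def fix_malformed_lots(db_path: str, dry_run: bool = True) -> None:
    """Re-parse cached HTML for malformed lots and update the rows in place."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    for row in conn.execute(MALFORMED_LOTS_QUERY).fetchall():
        # Assumes the cache table stores the raw HTML keyed by url.
        cached = conn.execute(
            "SELECT html FROM cache WHERE url = ?", (row["url"],)
        ).fetchone()
        if cached is None:
            continue  # nothing cached for this URL; it would need a re-scrape
        fields = parse_lot_page(cached["html"])
        if fields and not dry_run:
            conn.execute(
                "UPDATE lots SET auction_id = ?, title = ?, current_bid = ?,"
                " closing_time = ? WHERE url = ?",
                (fields["auction_id"], fields["title"], fields["current_bid"],
                 fields["closing_time"], row["url"]),
            )
    conn.commit()
    conn.close()
```
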
### Step 3: Verification

After running the fix script, verify the data:

```bash
# Check whether malformed entries still exist
python -c "
import sqlite3
conn = sqlite3.connect('path/to/cache.db')

print('Auctions with empty auction_id:')
print(conn.execute('SELECT COUNT(*) FROM auctions WHERE auction_id = ? OR auction_id IS NULL', ('',)).fetchone()[0])

print('Lots with invalid bids:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE current_bid LIKE ?', ('%Huidig%bod%',)).fetchone()[0])

print('Lots with \"gap\" timestamps:')
print(conn.execute('SELECT COUNT(*) FROM lots WHERE closing_time = ?', ('gap',)).fetchone()[0])
"
```

Expected result after the fix: **all counts should be 0**.

### Step 4: Prevention

To prevent future occurrences:

1. **Validation in scraper** - Add validation before saving to the database:

```python
from typing import Dict

def validate_lot_data(lot_data: Dict) -> bool:
    """Validate lot data before saving"""
    required_fields = ['lot_id', 'title', 'url']
    invalid_values = ['gap', '€Huidig bod', '']

    for field in required_fields:
        # Strip zero-width spaces so placeholder variants are caught as well
        value = str(lot_data.get(field, '')).replace('\u200b', '').strip()
        if not value or value in invalid_values:
            print(f"  ⚠️ Invalid {field}: {value}")
            return False

    return True

# In save_lot method:
if not validate_lot_data(lot_data):
    print(f"  ❌ Skipping invalid lot: {lot_data.get('url')}")
    return
```

2. **Prefer JSON over HTML** - Ensure `__NEXT_DATA__` parsing is tried first (already implemented)

3. **Logging** - Add logging for the fallback to HTML parsing:

```python
if next_data:
    return next_data
else:
    print(f"  ⚠️ No __NEXT_DATA__ found, falling back to HTML parsing: {url}")
    # HTML parsing...
```

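If the project switches from `print` to the standard `logging` module, the same fallback warning could look like the sketch below (the logger name and surrounding function are invented; `extract_next_data` refers to the sketch earlier in this document and `parse_html_fallback` is a hypothetical fallback parser):

```python
import logging

logger = logging.getLogger("scraper.parse")  # logger name is arbitrary

def extract_page_data(html: str, url: str) -> dict:
    """Prefer __NEXT_DATA__; log a warning when falling back to HTML parsing."""
    next_data = extract_next_data(html)
    if next_data:
        return next_data
    logger.warning("No __NEXT_DATA__ found, falling back to HTML parsing: %s", url)
    return parse_html_fallback(html)  # hypothetical HTML fallback parser
```
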
## Recommended Workflow

```bash
# 1. First, run dry-run to see what will be fixed
python script/fix_malformed_entries.py --dry-run

# 2. Review the output - check if fixes look correct

# 3. Run the actual fix
python script/fix_malformed_entries.py

# 4. Verify the results
python script/fix_malformed_entries.py --dry-run
# Should show "Found 0 malformed auction entries" and "Found 0 malformed lot entries"

# 5. (Optional) Run full migration to ensure all fields are populated
python script/migrate_reparse_lots.py
```

## Files Modified/Created

### Modified:

- ✅ `src/parse.py` - Improved timestamp and bid parsing with validation

### Created:

- ✅ `script/fix_malformed_entries.py` - Targeted fix for malformed entries
- ✅ `script/migrate_reparse_lots.py` - Full re-parse migration
- ✅ `_wiki/JAVA_FIXES_NEEDED.md` - Java-side fixes documentation
- ✅ `_wiki/FIXING_MALFORMED_ENTRIES.md` - This file

## Database Location

If you get "no such table" errors, find your actual database:

```bash
# Find all .db files
find . -name "*.db"

# Check which one has data
sqlite3 path/to/cache.db "SELECT COUNT(*) FROM lots"

# Use that path with --db flag
python script/fix_malformed_entries.py --db /actual/path/to/cache.db
```

## Next Steps

After fixing malformed entries:

1. ✅ Run `fix_malformed_entries.py` to repair bad data
2. ⏳ Apply Java-side fixes (see `_wiki/JAVA_FIXES_NEEDED.md`)
3. ⏳ Re-run Java monitoring process
4. ✅ Add validation to prevent future issues