scaev/_wiki/TESTING.md
Tour b12f3a5ee2 a
2025-12-04 14:53:55 +01:00

# Testing & Migration Guide
## Overview
This guide covers:
1. Migrating existing cache to compressed format
2. Running the test suite
3. Understanding test results
## Step 1: Migrate Cache to Compressed Format
If you have an existing database with uncompressed entries (from before compression was added), run the migration script:
```bash
python migrate_compress_cache.py
```
### What it does:
- Finds all cache entries where data is uncompressed
- Compresses them using zlib (level 9)
- Reports compression statistics and space saved
- Verifies all entries are compressed
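The steps above can be sketched roughly as follows. This is an illustration, not the actual contents of `migrate_compress_cache.py`: the table name `cache` and the columns `url`, `content`, and `compressed` are assumptions made for the example.

```python
import sqlite3
import zlib


def compress_uncompressed_entries(conn: sqlite3.Connection) -> dict:
    """Compress every row flagged as uncompressed and return size statistics.

    Assumes a hypothetical schema:
        cache(url TEXT PRIMARY KEY, content BLOB, compressed INTEGER)
    """
    rows = conn.execute(
        "SELECT url, content FROM cache WHERE compressed = 0"
    ).fetchall()

    original_bytes = 0
    compressed_bytes = 0
    for url, content in rows:
        # Legacy rows may hold text or bytes; normalise to bytes first.
        raw = content.encode("utf-8") if isinstance(content, str) else bytes(content)
        packed = zlib.compress(raw, 9)  # level 9 = best compression
        conn.execute(
            "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
            (packed, url),
        )
        original_bytes += len(raw)
        compressed_bytes += len(packed)
    conn.commit()

    # Verification step: count rows that are still uncompressed.
    remaining = conn.execute(
        "SELECT COUNT(*) FROM cache WHERE compressed = 0"
    ).fetchone()[0]
    reduction = 100.0 * (1 - compressed_bytes / original_bytes) if original_bytes else 0.0
    return {
        "entries": len(rows),
        "original_bytes": original_bytes,
        "compressed_bytes": compressed_bytes,
        "reduction_pct": reduction,
        "uncompressed_remaining": remaining,
    }
```

The returned dictionary corresponds to the figures printed in the migration report (entries compressed, original and compressed size, reduction, and the verification count).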
### Expected output:
```
Cache Compression Migration Tool
============================================================
Initial database size: 1024.56 MB
Found 1134 uncompressed cache entries
Starting compression...
Compressed 100/1134 entries... (78.3% reduction so far)
Compressed 200/1134 entries... (79.1% reduction so far)
...
============================================================
MIGRATION COMPLETE
============================================================
Entries compressed: 1134
Original size: 1024.56 MB
Compressed size: 198.34 MB
Space saved: 826.22 MB
Compression ratio: 80.6%
============================================================
VERIFICATION:
Compressed entries: 1134
Uncompressed entries: 0
✓ All cache entries are compressed!
Final database size: 1024.56 MB
Database size reduced by: 0.00 MB
✓ Migration complete! You can now run VACUUM to reclaim disk space:
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```
### Reclaim disk space:
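After migration, the database file is no smaller than before: by default SQLite keeps the pages freed by the old uncompressed data inside the file for reuse instead of returning them to the filesystem. To actually reclaim the disk space: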
```bash
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```
This will rebuild the database file and reduce its size significantly.
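If you prefer to run the same step from Python, a minimal sketch using the standard `sqlite3` module could look like this. The helper name `vacuum_database` is hypothetical; pass it the path to your own cache database.

```python
import os
import sqlite3


def vacuum_database(db_path: str) -> tuple[int, int]:
    """Run VACUUM on the database and return (size_before, size_after) in bytes."""
    size_before = os.path.getsize(db_path)
    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return size_before, os.path.getsize(db_path)
```

Comparing the two returned sizes shows how much space the rebuild actually reclaimed.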
## Step 2: Run Tests
The test suite validates that auction and lot parsing works correctly using **cached data only** (no live requests to the server).
```bash
python test_scraper.py
```
### What it tests:
**Auction Pages:**
- Type detection (must be 'auction')
- auction_id extraction
- title extraction
- location extraction
- lots_count extraction
- first_lot_closing_time extraction
**Lot Pages:**
- Type detection (must be 'lot')
- lot_id extraction
- title extraction (must not be '...', 'N/A', or empty)
- location extraction (must not be 'Locatie', 'Location', or empty)
- current_bid extraction (must not be '€Huidig bod' or invalid)
- closing_time extraction
- images array extraction
- bid_count validation
- viewing_time and pickup_date (optional)
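The lot checks above could be expressed along these lines. This is a sketch, not the code in `test_scraper.py`: the function name `validate_lot`, the dictionary keys, and the exact rules for what counts as "invalid" (for example, treating an empty bid or a negative `bid_count` as a failure) are assumptions.

```python
# Placeholder strings that suggest the parser picked up a label instead of a value.
INVALID_TITLES = {"", "...", "N/A"}
INVALID_LOCATIONS = {"", "Locatie", "Location"}
INVALID_BIDS = {"", "€Huidig bod"}


def validate_lot(lot: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the lot passed."""
    errors = []
    if lot.get("type") != "lot":
        errors.append(f"type: expected 'lot', got {lot.get('type')!r}")
    if not lot.get("lot_id"):
        errors.append("lot_id: missing")
    if str(lot.get("title") or "").strip() in INVALID_TITLES:
        errors.append(f"title: placeholder or empty ({lot.get('title')!r})")
    if str(lot.get("location") or "").strip() in INVALID_LOCATIONS:
        errors.append(f"location: placeholder or empty ({lot.get('location')!r})")
    if str(lot.get("current_bid") or "").strip() in INVALID_BIDS:
        errors.append(f"current_bid: placeholder or empty ({lot.get('current_bid')!r})")
    if not lot.get("closing_time"):
        errors.append("closing_time: missing")
    if not isinstance(lot.get("images"), list):
        errors.append("images: expected a list")
    bid_count = lot.get("bid_count")
    if not isinstance(bid_count, int) or bid_count < 0:
        errors.append(f"bid_count: expected a non-negative integer, got {bid_count!r}")
    # viewing_time and pickup_date are optional, so they are not checked here.
    return errors
```

Returning the error messages rather than a boolean makes it easy to report which field failed and what value was extracted.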
### Expected output:
```
======================================================================
TROOSTWIJK SCRAPER TEST SUITE
======================================================================
This test suite uses CACHED data only - no live requests to server
======================================================================
======================================================================
CACHE STATUS CHECK
======================================================================
Total cache entries: 1134
Compressed: 1134 (100.0%)
Uncompressed: 0 (0.0%)
✓ All cache entries are compressed!
======================================================================
TEST URL CACHE STATUS:
======================================================================
✓ https://www.troostwijkauctions.com/a/online-auction-cnc-lat...
✓ https://www.troostwijkauctions.com/a/faillissement-bab-sho...
✓ https://www.troostwijkauctions.com/a/industriele-goederen-...
✓ https://www.troostwijkauctions.com/l/%25282x%2529-duo-bure...
✓ https://www.troostwijkauctions.com/l/tos-sui-50-1000-unive...
✓ https://www.troostwijkauctions.com/l/rolcontainer-%25282x%...
6/6 test URLs are cached
======================================================================
TESTING AUCTIONS
======================================================================
======================================================================
Testing Auction: https://www.troostwijkauctions.com/a/online-auction...
======================================================================
✓ Cache hit (age: 12.3 hours)
✓ auction_id: A7-39813
✓ title: Online Auction: CNC Lathes, Machining Centres & Precision...
✓ location: Cluj-Napoca, RO
✓ first_lot_closing_time: 2024-12-05 14:30:00
✓ lots_count: 45
======================================================================
TESTING LOTS
======================================================================
======================================================================
Testing Lot: https://www.troostwijkauctions.com/l/%25282x%2529-duo...
======================================================================
✓ Cache hit (age: 8.7 hours)
✓ lot_id: A1-28505-5
✓ title: (2x) Duo Bureau - 160x168 cm
✓ location: Dongen, NL
✓ current_bid: No bids
✓ closing_time: 2024-12-10 16:00:00
✓ images: 2 images
1. https://media.tbauctions.com/image-media/c3f9825f-e3fd...
2. https://media.tbauctions.com/image-media/45c85ced-9c63...
✓ bid_count: 0
✓ viewing_time: 2024-12-08 09:00:00 - 2024-12-08 17:00:00
✓ pickup_date: 2024-12-11 09:00:00 - 2024-12-11 15:00:00
======================================================================
TEST SUMMARY
======================================================================
Total tests: 6
Passed: 6 ✓
Failed: 0 ✗
Success rate: 100.0%
======================================================================
```
## Test URLs
The test suite checks these specific URLs (you can change them in `test_scraper.py`):
**Auctions:**
- https://www.troostwijkauctions.com/a/online-auction-cnc-lathes-machining-centres-precision-measurement-romania-A7-39813
- https://www.troostwijkauctions.com/a/faillissement-bab-shortlease-i-ii-b-v-%E2%80%93-2024-big-ass-energieopslagsystemen-A1-39557
- https://www.troostwijkauctions.com/a/industriele-goederen-uit-diverse-bedrijfsbeeindigingen-A1-38675
**Lots:**
- https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5
- https://www.troostwijkauctions.com/l/tos-sui-50-1000-universele-draaibank-A7-39568-9
- https://www.troostwijkauctions.com/l/rolcontainer-%25282x%2529-A1-40191-101
## Adding More Test Cases
To add more test URLs, edit `test_scraper.py`:
```python
TEST_AUCTIONS = [
    "https://www.troostwijkauctions.com/a/your-auction-url",
    # ... add more
]
TEST_LOTS = [
    "https://www.troostwijkauctions.com/l/your-lot-url",
    # ... add more
]
```
Then run the main scraper to cache these URLs:
```bash
python main.py
```
Then run tests:
```bash
python test_scraper.py
```
## Troubleshooting
### "NOT IN CACHE" errors
If tests show URLs are not cached, run the main scraper first:
```bash
python main.py
```
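To see in advance which URLs would trigger this error, you can query the cache directly. The sketch below assumes a hypothetical `cache` table with a `url` column; adjust the names to match the real schema.

```python
import sqlite3


def find_uncached(conn: sqlite3.Connection, urls: list[str]) -> list[str]:
    """Return the URLs that have no row in the (assumed) cache table."""
    missing = []
    for url in urls:
        row = conn.execute(
            "SELECT 1 FROM cache WHERE url = ? LIMIT 1", (url,)
        ).fetchone()
        if row is None:
            missing.append(url)
    return missing
```

An empty result means every URL is already cached and the tests can run without scraping first.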
### "Failed to decompress cache" warnings
This means you have uncompressed legacy data. Run the migration:
```bash
python migrate_compress_cache.py
```
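The warning appears because plain HTML is not valid zlib data, so decompression fails. The sketch below illustrates the failure mode with a reader that tolerates both formats. The function name `read_cached_content` is hypothetical, and this is not necessarily how the scraper handles legacy rows.

```python
import zlib


def read_cached_content(blob) -> str:
    """Decode a cache payload, accepting zlib-compressed and legacy plain data."""
    if isinstance(blob, str):
        return blob  # legacy row stored as plain text
    try:
        return zlib.decompress(blob).decode("utf-8")
    except zlib.error:
        # Not valid zlib data: treat it as legacy uncompressed bytes.
        return bytes(blob).decode("utf-8")
```

Running the migration remains the proper fix, because it removes the legacy rows instead of working around them.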
### Tests failing with parsing errors
Check the detailed error output in the TEST SUMMARY section. It will show:
- Which field failed validation
- The actual value that was extracted
- Why it failed (empty, wrong type, invalid format)
## Cache Behavior
The test suite uses cached data with these characteristics:
- **No rate limiting** - reads from DB instantly
- **No server load** - zero HTTP requests
- **Repeatable** - same results every time
- **Fast** - all tests run in < 5 seconds
This allows you to:
- Test parsing changes without re-scraping
- Run tests repeatedly during development
- Validate changes before deploying
- Ensure data quality without server impact
## Continuous Integration
You can integrate these tests into CI/CD:
```bash
# Run migration if needed
python migrate_compress_cache.py
# Run tests
python test_scraper.py
# Exit code: 0 = success, 1 = failure
```
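If your CI job is driven from Python rather than a shell, the same sequence can be sketched with `subprocess`. The helper `run_steps` and the `CI_STEPS` list are illustrative names, and the sketch assumes both scripts signal failure with a non-zero exit code.

```python
import subprocess
import sys

# The two commands from the shell snippet, run with the current interpreter.
CI_STEPS = [
    [sys.executable, "migrate_compress_cache.py"],
    [sys.executable, "test_scraper.py"],
]


def run_steps(steps: list[list[str]]) -> int:
    """Run each command in order; stop and return the first non-zero exit code."""
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `sys.exit(run_steps(CI_STEPS))` at the end of a CI script passes the result on to the CI system.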
## Performance Benchmarks
Approximate figures, based on typical HTML page sizes:
| Metric | Before Compression | After Compression | Improvement |
|--------|-------------------|-------------------|-------------|
| Avg page size | 800 KB | 150 KB | 81.3% smaller |
| 1000 pages | 800 MB | 150 MB | 650 MB saved |
| 10,000 pages | 8 GB | 1.5 GB | 6.5 GB saved |
| DB read speed | ~50 ms | ~5 ms | 10x faster |
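The size figures in the table follow from simple arithmetic, shown here as a small sketch. The function name `compression_savings` is made up for the example, and the table uses decimal units (1 MB = 1000 KB).

```python
def compression_savings(before_kb: float, after_kb: float, pages: int = 1) -> tuple[float, float]:
    """Return (percent_reduction, kb_saved) for `pages` pages of the given average size."""
    percent = 100.0 * (before_kb - after_kb) / before_kb
    saved_kb = (before_kb - after_kb) * pages
    return percent, saved_kb


percent, saved_kb = compression_savings(800, 150, pages=1000)
print(f"{percent:.2f}% smaller, {saved_kb / 1000:.0f} MB saved")
# prints "81.25% smaller, 650 MB saved"
```

The 81.25% result is what the table shows rounded to 81.3%.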
## Best Practices
1. **Always run migration after upgrading** to the compressed cache version
2. **Run VACUUM** after migration to reclaim disk space
3. **Run tests after major changes** to parsing logic
4. **Add test cases for edge cases** you encounter in production
5. **Keep test URLs diverse** - different auctions, lot types, languages
6. **Monitor cache hit rates** to ensure effective caching