# Testing & Migration Guide

## Overview

This guide covers:
1. Migrating existing cache to compressed format
2. Running the test suite
3. Understanding test results

## Step 1: Migrate Cache to Compressed Format

If you have an existing database with uncompressed entries (from before compression was added), run the migration script:

```bash
python migrate_compress_cache.py
```

### What it does:
- Finds all cache entries where data is uncompressed
- Compresses them using zlib (level 9)
- Reports compression statistics and space saved
- Verifies all entries are compressed

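The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `migrate_compress_cache.py`: the table name `cache`, its columns `url`, `content`, and `compressed`, and the function name are all assumptions made for the example.

```python
import sqlite3
import zlib


def compress_uncompressed_entries(conn: sqlite3.Connection) -> dict:
    """Compress every row flagged as uncompressed and return size statistics.

    Assumes a hypothetical schema:
    cache(url TEXT PRIMARY KEY, content BLOB, compressed INTEGER).
    """
    rows = conn.execute(
        "SELECT url, content FROM cache WHERE compressed = 0"
    ).fetchall()

    original = compressed = 0
    for url, content in rows:
        # Legacy rows may hold text or bytes; normalise to bytes first.
        raw = content.encode("utf-8") if isinstance(content, str) else content
        packed = zlib.compress(raw, 9)  # level 9 = best compression
        conn.execute(
            "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
            (packed, url),
        )
        original += len(raw)
        compressed += len(packed)
    conn.commit()

    # Verification step: count rows still flagged as uncompressed.
    remaining = conn.execute(
        "SELECT COUNT(*) FROM cache WHERE compressed = 0"
    ).fetchone()[0]
    reduction = (1 - compressed / original) * 100 if original else 0.0
    return {
        "entries": len(rows),
        "original_bytes": original,
        "compressed_bytes": compressed,
        "reduction_pct": reduction,
        "uncompressed_remaining": remaining,
    }
```

Returning the statistics as a dictionary keeps the reporting separate from the database work, so the same function could feed either console output or a log.
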
### Expected output:
```
Cache Compression Migration Tool
============================================================
Initial database size: 1024.56 MB

Found 1134 uncompressed cache entries
Starting compression...
Compressed 100/1134 entries... (78.3% reduction so far)
Compressed 200/1134 entries... (79.1% reduction so far)
...

============================================================
MIGRATION COMPLETE
============================================================
Entries compressed: 1134
Original size: 1024.56 MB
Compressed size: 198.34 MB
Space saved: 826.22 MB
Compression ratio: 80.6%
============================================================

VERIFICATION:
Compressed entries: 1134
Uncompressed entries: 0
✓ All cache entries are compressed!

Final database size: 1024.56 MB
Database size reduced by: 0.00 MB

✓ Migration complete! You can now run VACUUM to reclaim disk space:
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```

### Reclaim disk space:
After migration, the database file still holds the pages freed by the old uncompressed data, so its size on disk does not change (note the "Database size reduced by: 0.00 MB" line in the output). To reclaim that space, run:

```bash
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```

`VACUUM` rebuilds the database file, so its size should drop substantially. It needs temporary free disk space while it runs; SQLite documents a requirement of up to twice the size of the original file.

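If the `sqlite3` command-line tool is not available, the same operation can be run from Python's standard library. The sketch below is one possible way to do it; the function name `vacuum_database` is hypothetical, and the path you pass is whatever your cache database actually uses.

```python
import os
import sqlite3


def vacuum_database(db_path: str) -> tuple[int, int]:
    """Run VACUUM and return (size_before, size_after) in bytes."""
    before = os.path.getsize(db_path)
    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return before, os.path.getsize(db_path)
```

Comparing the two returned sizes gives the actual space reclaimed, which the migration script itself cannot report because it runs before `VACUUM`.
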
## Step 2: Run Tests

The test suite validates that auction and lot parsing works correctly using **cached data only** (no live requests to the server).

```bash
python test_scraper.py
```

### What it tests:

**Auction Pages:**
- Type detection (must be 'auction')
- auction_id extraction
- title extraction
- location extraction
- lots_count extraction
- first_lot_closing_time extraction

**Lot Pages:**
- Type detection (must be 'lot')
- lot_id extraction
- title extraction (must not be '...', 'N/A', or empty)
- location extraction (must not be 'Locatie', 'Location', or empty)
- current_bid extraction (must not be '€Huidig bod' or invalid)
- closing_time extraction
- images array extraction
- bid_count validation
- viewing_time and pickup_date (optional)

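The lot checks listed above mostly amount to rejecting empty values and leftover placeholder text. A minimal sketch of that idea is shown below. It is not the code in `test_scraper.py`: the dict-based lot record, the `INVALID_VALUES` table, and the function name `validate_lot` are assumptions, and only the placeholder strings come from the list above.

```python
# Placeholder strings taken from the checklist above; the dict-based lot
# record and all names here are assumptions made for this sketch.
INVALID_VALUES = {
    "title": {"", "...", "N/A"},
    "location": {"", "Locatie", "Location"},
    "current_bid": {"", "€Huidig bod"},
}


def validate_lot(lot: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the lot passed."""
    problems = []
    if lot.get("type") != "lot":
        problems.append(f"type: expected 'lot', got {lot.get('type')!r}")
    for field in ("lot_id", "closing_time"):
        if not lot.get(field):
            problems.append(f"{field}: missing or empty")
    for field, bad in INVALID_VALUES.items():
        value = str(lot.get(field) or "").strip()
        if value in bad:
            problems.append(f"{field}: placeholder or empty value {value!r}")
    if not isinstance(lot.get("images"), list):
        problems.append("images: expected a list")
    bid_count = lot.get("bid_count")
    if not isinstance(bid_count, int) or bid_count < 0:
        problems.append(
            f"bid_count: expected a non-negative integer, got {bid_count!r}"
        )
    # viewing_time and pickup_date are optional, so they are not checked.
    return problems
```

Collecting every problem instead of stopping at the first one makes a failing page easier to diagnose, since a single broken selector often affects several fields at once.
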
### Expected output:

```
======================================================================
TROOSTWIJK SCRAPER TEST SUITE
======================================================================

This test suite uses CACHED data only - no live requests to server
======================================================================

======================================================================
CACHE STATUS CHECK
======================================================================
Total cache entries: 1134
Compressed: 1134 (100.0%)
Uncompressed: 0 (0.0%)

✓ All cache entries are compressed!

======================================================================
TEST URL CACHE STATUS:
======================================================================
✓ https://www.troostwijkauctions.com/a/online-auction-cnc-lat...
✓ https://www.troostwijkauctions.com/a/faillissement-bab-sho...
✓ https://www.troostwijkauctions.com/a/industriele-goederen-...
✓ https://www.troostwijkauctions.com/l/%25282x%2529-duo-bure...
✓ https://www.troostwijkauctions.com/l/tos-sui-50-1000-unive...
✓ https://www.troostwijkauctions.com/l/rolcontainer-%25282x%...

6/6 test URLs are cached

======================================================================
TESTING AUCTIONS
======================================================================

======================================================================
Testing Auction: https://www.troostwijkauctions.com/a/online-auction...
======================================================================
✓ Cache hit (age: 12.3 hours)
✓ auction_id: A7-39813
✓ title: Online Auction: CNC Lathes, Machining Centres & Precision...
✓ location: Cluj-Napoca, RO
✓ first_lot_closing_time: 2024-12-05 14:30:00
✓ lots_count: 45

======================================================================
TESTING LOTS
======================================================================

======================================================================
Testing Lot: https://www.troostwijkauctions.com/l/%25282x%2529-duo...
======================================================================
✓ Cache hit (age: 8.7 hours)
✓ lot_id: A1-28505-5
✓ title: (2x) Duo Bureau - 160x168 cm
✓ location: Dongen, NL
✓ current_bid: No bids
✓ closing_time: 2024-12-10 16:00:00
✓ images: 2 images
1. https://media.tbauctions.com/image-media/c3f9825f-e3fd...
2. https://media.tbauctions.com/image-media/45c85ced-9c63...
✓ bid_count: 0
✓ viewing_time: 2024-12-08 09:00:00 - 2024-12-08 17:00:00
✓ pickup_date: 2024-12-11 09:00:00 - 2024-12-11 15:00:00

======================================================================
TEST SUMMARY
======================================================================

Total tests: 6
Passed: 6 ✓
Failed: 0 ✗
Success rate: 100.0%

======================================================================
```

## Test URLs

The test suite checks these specific URLs (you can change them in `test_scraper.py`):

**Auctions:**
- https://www.troostwijkauctions.com/a/online-auction-cnc-lathes-machining-centres-precision-measurement-romania-A7-39813
- https://www.troostwijkauctions.com/a/faillissement-bab-shortlease-i-ii-b-v-%E2%80%93-2024-big-ass-energieopslagsystemen-A1-39557
- https://www.troostwijkauctions.com/a/industriele-goederen-uit-diverse-bedrijfsbeeindigingen-A1-38675

**Lots:**
- https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5
- https://www.troostwijkauctions.com/l/tos-sui-50-1000-universele-draaibank-A7-39568-9
- https://www.troostwijkauctions.com/l/rolcontainer-%25282x%2529-A1-40191-101

## Adding More Test Cases

To add more test URLs, edit `test_scraper.py`:

```python
TEST_AUCTIONS = [
    "https://www.troostwijkauctions.com/a/your-auction-url",
    # ... add more
]

TEST_LOTS = [
    "https://www.troostwijkauctions.com/l/your-lot-url",
    # ... add more
]
```

Then run the main scraper to cache these URLs:
```bash
python main.py
```

Then run tests:
```bash
python test_scraper.py
```

## Troubleshooting

### "NOT IN CACHE" errors
If tests show URLs are not cached, run the main scraper first:
```bash
python main.py
```

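To see which URLs are missing before running the full suite, a quick lookup against the cache database can help. The sketch below assumes a hypothetical `cache` table with a `url` column; the real schema and the function name `find_uncached` are not taken from the project.

```python
import sqlite3


def find_uncached(conn: sqlite3.Connection, urls: list[str]) -> list[str]:
    """Return the URLs from `urls` that have no row in the cache table."""
    missing = []
    for url in urls:
        row = conn.execute(
            "SELECT 1 FROM cache WHERE url = ? LIMIT 1", (url,)
        ).fetchone()
        if row is None:
            missing.append(url)
    return missing
```

An empty result means every test URL is already cached and the scraper does not need to run again.
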
### "Failed to decompress cache" warnings
This means you have uncompressed legacy data. Run the migration:
```bash
python migrate_compress_cache.py
```

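Until the migration has run, a reader that tolerates both formats avoids the warning altogether. The following is only a sketch of that fallback idea, assuming cache values are either zlib-compressed bytes or legacy plain text or bytes; `read_cached_html` is a hypothetical name, not a function from the scraper.

```python
import zlib


def read_cached_html(blob) -> str:
    """Decode a cache value that may or may not be zlib-compressed."""
    if isinstance(blob, str):
        return blob  # legacy row stored as plain text
    try:
        return zlib.decompress(blob).decode("utf-8")
    except zlib.error:
        # Not a valid zlib stream: treat it as legacy uncompressed bytes.
        return blob.decode("utf-8")
```

Running the migration remains the cleaner fix, because it leaves a single storage format to reason about.
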
### Tests failing with parsing errors
Check the detailed error output in the TEST SUMMARY section. It will show:
- Which field failed validation
- The actual value that was extracted
- Why it failed (empty, wrong type, invalid format)

## Cache Behavior

The test suite uses cached data with these characteristics:
- **No rate limiting** - reads from DB instantly
- **No server load** - zero HTTP requests
- **Repeatable** - same results every time
- **Fast** - all tests run in < 5 seconds

This allows you to:
- Test parsing changes without re-scraping
- Run tests repeatedly during development
- Validate changes before deploying
- Ensure data quality without server impact

## Continuous Integration

You can integrate these tests into CI/CD:

```bash
# Run migration if needed
python migrate_compress_cache.py

# Run tests
python test_scraper.py

# Exit code: 0 = success, 1 = failure
```

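A CI job usually needs to stop at the first failing step and pass that exit code on. One possible wrapper is sketched below; `run_steps` is a hypothetical helper, and the two script names in the usage comment are the ones used in this guide.

```python
import subprocess
import sys


def run_steps(commands: list[list[str]]) -> int:
    """Run each command in order; stop at and return the first non-zero exit code."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0


# Example usage in a CI entry point (left commented out in this sketch):
# sys.exit(run_steps([
#     [sys.executable, "migrate_compress_cache.py"],
#     [sys.executable, "test_scraper.py"],
# ]))
```

Using `sys.executable` runs both scripts with the same interpreter as the wrapper, which avoids picking up a different Python from the CI machine's `PATH`.
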
## Performance Benchmarks

Based on typical HTML sizes:

| Metric | Before Compression | After Compression | Improvement |
|--------|--------------------|-------------------|-------------|
| Avg page size | 800 KB | 150 KB | 81.3% |
| 1000 pages | 800 MB | 150 MB | 650 MB saved |
| 10,000 pages | 8 GB | 1.5 GB | 6.5 GB saved |
| DB read speed | ~50 ms | ~5 ms | 10x faster |

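The figures above are based on typical sizes, so results on your own data may differ. The sketch below shows how the "Improvement" percentage is derived and how a single page could be measured; both function names are made up for this example.

```python
import zlib


def reduction_percent(original_size: float, compressed_size: float) -> float:
    """Percentage of space saved, as used in the Improvement column."""
    return (1 - compressed_size / original_size) * 100


def measure_html(html: str, level: int = 9) -> float:
    """Compress one page with zlib and return the reduction in percent."""
    raw = html.encode("utf-8")
    return reduction_percent(len(raw), len(zlib.compress(raw, level)))
```

For the first table row, `reduction_percent(800, 150)` evaluates to 81.25, which the table shows as 81.3%.
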
## Best Practices

1. **Always run migration after upgrading** to the compressed cache version
2. **Run VACUUM** after migration to reclaim disk space
3. **Run tests after major changes** to parsing logic
4. **Add test cases for edge cases** you encounter in production
5. **Keep test URLs diverse** - different auctions, lot types, languages
6. **Monitor cache hit rates** to ensure effective caching