wiki/ARCHITECTURE.md

# Scaev - Architecture & Data Flow

## System Overview

The scraper follows a **3-phase hierarchical crawling pattern** to extract auction and lot data from the Troostwijk Auctions website.

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                       TROOSTWIJK SCRAPER                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: COLLECT AUCTION URLs                                  │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Listing Page │────────▶│ Extract /a/  │                      │
│  │  /auctions?  │         │ auction URLs │                      │
│  │  page=1..N   │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                    [ List of Auction URLs ]                     │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXTRACT LOT URLs FROM AUCTIONS                        │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Auction Page │────────▶│ Parse        │                      │
│  │   /a/...     │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│         │                 └──────────────┘                      │
│         │                        │                              │
│         ▼                        ▼                              │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │ Save Auction │         │ Extract /l/  │                      │
│  │ Metadata     │         │ lot URLs     │                      │
│  │ to DB        │         └──────────────┘                      │
│  └──────────────┘                │                              │
│                                  ▼                              │
│                      [ List of Lot URLs ]                       │
└─────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: SCRAPE LOT DETAILS                                    │
│  ┌──────────────┐         ┌──────────────┐                      │
│  │   Lot Page   │────────▶│ Parse        │                      │
│  │   /l/...     │         │ __NEXT_DATA__│                      │
│  └──────────────┘         │ JSON         │                      │
│                           └──────────────┘                      │
│                                  │                              │
│         ┌────────────────────────┴─────────────────┐            │
│         ▼                                          ▼            │
│  ┌──────────────┐                          ┌──────────────┐     │
│  │ Save Lot     │                          │ Save Images  │     │
│  │ Details      │                          │ URLs to DB   │     │
│  │ to DB        │                          └──────────────┘     │
│  └──────────────┘                                 │             │
│                                                   ▼             │
│                                        [Optional Download]      │
└─────────────────────────────────────────────────────────────────┘
```

## Database Schema

```
┌──────────────────────────────────────────────────────────────────┐
│  CACHE TABLE (HTML Storage with Compression)                     │
├──────────────────────────────────────────────────────────────────┤
│  cache                                                           │
│  ├── url (TEXT, PRIMARY KEY)                                     │
│  ├── content (BLOB)           -- Compressed HTML (zlib)          │
│  ├── timestamp (REAL)                                            │
│  ├── status_code (INTEGER)                                       │
│  └── compressed (INTEGER)     -- 1=compressed, 0=plain           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  AUCTIONS TABLE                                                  │
├──────────────────────────────────────────────────────────────────┤
│  auctions                                                        │
│  ├── auction_id (TEXT, PRIMARY KEY)   -- e.g. "A7-39813"         │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── location (TEXT)                  -- e.g. "Cluj-Napoca, RO"  │
│  ├── lots_count (INTEGER)                                        │
│  ├── first_lot_closing_time (TEXT)                               │
│  └── scraped_at (TEXT)                                           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  LOTS TABLE                                                      │
├──────────────────────────────────────────────────────────────────┤
│  lots                                                            │
│  ├── lot_id (TEXT, PRIMARY KEY)       -- e.g. "A1-28505-5"       │
│  ├── auction_id (TEXT)                -- FK to auctions          │
│  ├── url (TEXT, UNIQUE)                                          │
│  ├── title (TEXT)                                                │
│  ├── current_bid (TEXT)               -- "€123.45" or "No bids"  │
│  ├── bid_count (INTEGER)                                         │
│  ├── closing_time (TEXT)                                         │
│  ├── viewing_time (TEXT)                                         │
│  ├── pickup_date (TEXT)                                          │
│  ├── location (TEXT)                  -- e.g. "Dongen, NL"       │
│  ├── description (TEXT)                                          │
│  ├── category (TEXT)                                             │
│  └── scraped_at (TEXT)                                           │
│  FOREIGN KEY (auction_id) → auctions(auction_id)                 │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  IMAGES TABLE (Image URLs & Download Status)                     │
├──────────────────────────────────────────────────────────────────┤
│  images                          ◀── THIS TABLE HOLDS IMAGE LINKS│
│  ├── id (INTEGER, PRIMARY KEY AUTOINCREMENT)                     │
│  ├── lot_id (TEXT)                    -- FK to lots              │
│  ├── url (TEXT)                       -- Image URL               │
│  ├── local_path (TEXT)                -- Path after download     │
│  └── downloaded (INTEGER)             -- 0=pending, 1=downloaded │
│  FOREIGN KEY (lot_id) → lots(lot_id)                             │
└──────────────────────────────────────────────────────────────────┘
```
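
For reference, the schema above translates into DDL along these lines — a sketch using Python's built-in `sqlite3`, with column types taken from the diagram (the actual `CREATE TABLE` statements live in `main.py` and may differ in detail):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cache (
    url         TEXT PRIMARY KEY,
    content     BLOB,               -- zlib-compressed HTML
    timestamp   REAL,
    status_code INTEGER,
    compressed  INTEGER DEFAULT 1   -- 1=compressed, 0=plain
);
CREATE TABLE IF NOT EXISTS auctions (
    auction_id             TEXT PRIMARY KEY,   -- e.g. "A7-39813"
    url                    TEXT UNIQUE,
    title                  TEXT,
    location               TEXT,
    lots_count             INTEGER,
    first_lot_closing_time TEXT,
    scraped_at             TEXT
);
CREATE TABLE IF NOT EXISTS lots (
    lot_id       TEXT PRIMARY KEY,             -- e.g. "A1-28505-5"
    auction_id   TEXT REFERENCES auctions(auction_id),
    url          TEXT UNIQUE,
    title        TEXT,
    current_bid  TEXT,                         -- "€123.45" or "No bids"
    bid_count    INTEGER,
    closing_time TEXT,
    viewing_time TEXT,
    pickup_date  TEXT,
    location     TEXT,
    description  TEXT,
    category     TEXT,
    scraped_at   TEXT
);
CREATE TABLE IF NOT EXISTS images (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    lot_id     TEXT REFERENCES lots(lot_id),
    url        TEXT,
    local_path TEXT,
    downloaded INTEGER DEFAULT 0               -- 0=pending, 1=downloaded
);
"""

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
conn.executescript(SCHEMA)
conn.commit()
```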

## Sequence Diagram

```
User           Scraper          Playwright      Cache DB       Data Tables
 │                │                 │               │               │
 │  Run           │                 │               │               │
 ├───────────────▶│                 │               │               │
 │                │                 │               │               │
 │                │  Phase 1: Listing Pages         │               │
 │                ├────────────────▶│               │               │
 │                │     goto()      │               │               │
 │                │◀────────────────┤               │               │
 │                │      HTML       │               │               │
 │                ├────────────────────────────────▶│               │
 │                │        compress & cache         │               │
 │                │                 │               │               │
 │                │  Phase 2: Auction Pages         │               │
 │                ├────────────────▶│               │               │
 │                │◀────────────────┤               │               │
 │                │      HTML       │               │               │
 │                │                 │               │               │
 │                │  Parse __NEXT_DATA__ JSON       │               │
 │                │────────────────────────────────────────────────▶│
 │                │                 │               │  INSERT auctions
 │                │                 │               │               │
 │                │  Phase 3: Lot Pages             │               │
 │                ├────────────────▶│               │               │
 │                │◀────────────────┤               │               │
 │                │      HTML       │               │               │
 │                │                 │               │               │
 │                │  Parse __NEXT_DATA__ JSON       │               │
 │                │────────────────────────────────────────────────▶│
 │                │                 │               │  INSERT lots  │
 │                │────────────────────────────────────────────────▶│
 │                │                 │               │  INSERT images│
 │                │                 │               │               │
 │                │  Export to CSV/JSON             │               │
 │                │◀────────────────────────────────────────────────┤
 │                │         Query all data          │               │
 │◀───────────────┤                 │               │               │
 │    Results     │                 │               │               │
```

## Data Flow Details

### 1. **Page Retrieval & Caching**
```
Request URL
    │
    ├──▶ Check cache DB (with timestamp validation)
    │         │
    │         ├─[HIT]──▶ Decompress (if compressed=1)
    │         │              └──▶ Return HTML
    │         │
    │         └─[MISS]─▶ Fetch via Playwright
    │                        │
    │                        ├──▶ Compress HTML (zlib level 9)
    │                        │        ~70-90% size reduction
    │                        │
    │                        └──▶ Store in cache DB (compressed=1)
    │
    └──▶ Return HTML for parsing
```
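
A condensed sketch of this flow (illustrative names; the real scraper runs Playwright asynchronously, shown here with the sync API for brevity):

```python
import time
import zlib

CACHE_MAX_AGE_HOURS = 24
RATE_LIMIT_SECONDS = 0.5

def get_page(conn, page, url):
    """Return HTML for `url`, serving it from the cache when fresh."""
    row = conn.execute(
        "SELECT content, timestamp, compressed FROM cache WHERE url = ?",
        (url,),
    ).fetchone()
    if row:
        content, ts, compressed = row
        if time.time() - ts < CACHE_MAX_AGE_HOURS * 3600:  # timestamp validation
            return zlib.decompress(content).decode("utf-8") if compressed else content

    time.sleep(RATE_LIMIT_SECONDS)            # rate limit only on cache misses
    page.goto(url, wait_until="networkidle")  # fetch via Playwright
    html = page.content()

    blob = zlib.compress(html.encode("utf-8"), level=9)  # ~70-90% smaller
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, content, timestamp, status_code, compressed) "
        "VALUES (?, ?, ?, ?, 1)",
        (url, blob, time.time(), 200),  # real code records the actual status
    )
    conn.commit()
    return html
```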

### 2. **JSON Parsing Strategy**
```
HTML Content
    │
    └──▶ Extract <script id="__NEXT_DATA__">
              │
              ├──▶ Parse JSON
              │        │
              │        ├─[has pageProps.lot]──▶ Individual LOT
              │        │        └──▶ Extract: title, bid, location, images, etc.
              │        │
              │        └─[has pageProps.auction]──▶ AUCTION
              │                 │
              │                 ├─[has lots[] array]──▶ Auction with lots
              │                 │        └──▶ Extract: title, location, lots_count
              │                 │
              │                 └─[no lots[] array]──▶ Old format lot
              │                          └──▶ Parse as lot
              │
              └──▶ Fallback to HTML regex parsing (if JSON fails)
```
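
In code, the strategy reduces to a few branches. A sketch, assuming the standard Next.js `props.pageProps` layout described above:

```python
import json
import re

def fallback_regex_parse(html: str):
    """Placeholder for the regex-based HTML fallback described above."""
    return None

def parse_page(html: str):
    """Classify a page as lot or auction from its embedded Next.js state."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if not m:
        return fallback_regex_parse(html)
    try:
        props = json.loads(m.group(1))["props"]["pageProps"]
    except (json.JSONDecodeError, KeyError):
        return fallback_regex_parse(html)

    if "lot" in props:                   # individual lot page
        return ("lot", props["lot"])
    if "auction" in props:
        auction = props["auction"]
        if auction.get("lots"):          # auction with a lots[] array
            return ("auction", auction)
        return ("lot", auction)          # old format: parse as a lot
    return fallback_regex_parse(html)
```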

### 3. **Image Handling**
```
Lot Page Parsed
    │
    ├──▶ Extract images[] from JSON
    │         │
    │         └──▶ INSERT INTO images (lot_id, url, downloaded=0)
    │
    └──▶ [If DOWNLOAD_IMAGES=True]
              │
              ├──▶ Download each image
              │        │
              │        ├──▶ Save to: /images/{lot_id}/001.jpg
              │        │
              │        └──▶ UPDATE images SET local_path=?, downloaded=1
              │
              └──▶ Rate limit between downloads (0.5s)
```
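
A sketch of the download pass, following the path scheme and columns above (the real implementation is async and shares the scraper's rate limiter; `urllib` stands in for it here):

```python
import time
import urllib.request
from pathlib import Path

IMAGES_DIR = Path("/mnt/okcomputer/output/images")
RATE_LIMIT_SECONDS = 0.5

def download_pending_images(conn):
    rows = conn.execute(
        "SELECT id, lot_id, url FROM images WHERE downloaded = 0 ORDER BY id"
    ).fetchall()
    counters = {}  # per-lot sequence numbers: 001.jpg, 002.jpg, ...
    for image_id, lot_id, url in rows:
        seq = counters[lot_id] = counters.get(lot_id, 0) + 1
        dest = IMAGES_DIR / lot_id / f"{seq:03d}.jpg"  # e.g. images/A1-28505-5/001.jpg
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
        conn.execute(
            "UPDATE images SET local_path = ?, downloaded = 1 WHERE id = ?",
            (str(dest), image_id),
        )
        conn.commit()
        time.sleep(RATE_LIMIT_SECONDS)  # rate limit between downloads
```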

## Key Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| `CACHE_DB` | `/mnt/okcomputer/output/cache.db` | SQLite database path |
| `IMAGES_DIR` | `/mnt/okcomputer/output/images` | Downloaded images storage |
| `RATE_LIMIT_SECONDS` | `0.5` | Delay between requests |
| `DOWNLOAD_IMAGES` | `False` | Toggle image downloading |
| `MAX_PAGES` | `50` | Number of listing pages to crawl |

## Output Files

```
/mnt/okcomputer/output/
├── cache.db                    # SQLite database (compressed HTML + data)
├── auctions_{timestamp}.json   # Exported auctions
├── auctions_{timestamp}.csv    # Exported auctions
├── lots_{timestamp}.json       # Exported lots
├── lots_{timestamp}.csv        # Exported lots
└── images/                     # Downloaded images (if enabled)
    ├── A1-28505-5/
    │   ├── 001.jpg
    │   └── 002.jpg
    └── A1-28505-6/
        └── 001.jpg
```

## Extension Points for Integration

### 1. **Downstream Processing Pipeline**
```sql
-- Query images that have not been downloaded yet
SELECT lot_id, url FROM images WHERE downloaded = 0;

-- Process images: OCR, classification, etc.
-- Update status when complete
UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?;
```
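
The same pipeline from Python — a sketch where `download_and_process` is a hypothetical hook for whatever OCR or classification step you plug in:

```python
import sqlite3

def download_and_process(url: str) -> str:
    """Hypothetical hook: fetch the image, run OCR/classification,
    and return the local path it was saved to."""
    ...

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
pending = conn.execute(
    "SELECT id, url FROM images WHERE downloaded = 0"
).fetchall()
for image_id, url in pending:
    local_path = download_and_process(url)
    conn.execute(
        "UPDATE images SET downloaded = 1, local_path = ? WHERE id = ?",
        (local_path, image_id),
    )
conn.commit()
```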

### 2. **Real-time Monitoring**
```sql
-- Check for new lots every N minutes
SELECT COUNT(*) FROM lots WHERE scraped_at > datetime('now', '-1 hour');

-- Monitor bid changes
SELECT lot_id, current_bid, bid_count FROM lots WHERE bid_count > 0;
```

### 3. **Analytics & Reporting**
```sql
-- Top locations
SELECT location, COUNT(*) AS lot_count FROM lots GROUP BY location;

-- Auction statistics
SELECT
    a.auction_id,
    a.title,
    COUNT(l.lot_id) AS actual_lots,
    SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
```
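
These queries can be run from Python as well; a small sketch that prints the auction statistics above:

```python
import sqlite3

QUERY = """
SELECT a.auction_id, a.title,
       COUNT(l.lot_id)                                  AS actual_lots,
       SUM(CASE WHEN l.bid_count > 0 THEN 1 ELSE 0 END) AS lots_with_bids
FROM auctions a
LEFT JOIN lots l ON a.auction_id = l.auction_id
GROUP BY a.auction_id;
"""

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
for auction_id, title, actual_lots, lots_with_bids in conn.execute(QUERY):
    print(f"{auction_id}: {actual_lots} lots, {lots_with_bids} with bids - {title}")
```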

### 4. **Image Processing Integration**
```sql
-- Get all images for a lot
SELECT url, local_path FROM images WHERE lot_id = 'A1-28505-5';

-- Batch process unprocessed images
SELECT i.id, i.lot_id, i.local_path, l.title, l.category
FROM images i
JOIN lots l ON i.lot_id = l.lot_id
WHERE i.downloaded = 1 AND i.local_path IS NOT NULL;
```

## Performance Characteristics

- **Compression**: ~70-90% HTML size reduction (1GB → ~100-300MB)
- **Rate Limiting**: Exactly 0.5s between requests (respectful scraping)
- **Caching**: 24-hour default cache validity (configurable)
- **Throughput**: ~7,200 pages/hour (with 0.5s rate limit)
- **Scalability**: SQLite handles millions of rows efficiently

## Error Handling

- **Network failures**: Cached as status_code=500, retried after cache expiry
- **Parse failures**: Falls back to HTML regex patterns
- **Compression errors**: Auto-detects and handles uncompressed legacy data
- **Missing fields**: Defaults to "No bids", empty string, or 0
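
The legacy-data handling amounts to a guarded decompress. A sketch, mirroring the `compressed` flag in the cache schema:

```python
import zlib

def load_cached_html(content, compressed: int) -> str:
    """Decompress a cache row, tolerating uncompressed legacy entries."""
    if compressed:
        try:
            return zlib.decompress(content).decode("utf-8")
        except zlib.error:
            pass  # flag was wrong: fall through and treat as plain data
    return content.decode("utf-8") if isinstance(content, bytes) else content
```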

## Rate Limiting & Ethics

- **REQUIRED**: 0.5 second delay between ALL requests
- **Respects cache**: Avoids unnecessary re-fetching
- **User-Agent**: Identifies as a standard browser
- **No parallelization**: Single-threaded sequential crawling

wiki/Deployment.md

# Deployment

## Prerequisites

- Python 3.8+ installed
- Access to a server (Linux/Windows)
- Playwright and dependencies installed

## Production Setup

### 1. Install on Server

```bash
# Clone repository
git clone git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
playwright install chromium
playwright install-deps  # Install system dependencies
```

### 2. Configuration

Adjust the configuration constants in `main.py` (or wire them up to environment variables):

```python
# main.py configuration
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/var/troost-scraper/cache.db"
OUTPUT_DIR = "/var/troost-scraper/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```

### 3. Create Output Directories

```bash
sudo mkdir -p /var/troost-scraper/output
sudo chown $USER:$USER /var/troost-scraper
```

### 4. Run as Cron Job

Add to crontab (`crontab -e`):

```bash
# Run scraper daily at 2 AM
0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
```

## Docker Deployment (Optional)

Create a `Dockerfile`:

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps

COPY main.py .

CMD ["python", "main.py"]
```

Build and run:

```bash
docker build -t troost-scraper .
docker run -v /path/to/output:/output troost-scraper
```

## Monitoring

### Check Logs

```bash
tail -f /var/log/troost-scraper.log
```

### Monitor Output

```bash
ls -lh /var/troost-scraper/output/
```

## Troubleshooting

### Playwright Browser Issues

```bash
# Reinstall browsers
playwright install --force chromium
```

### Permission Issues

```bash
# Fix permissions
sudo chown -R $USER:$USER /var/troost-scraper
```

### Memory Issues

- Reduce `MAX_PAGES` in configuration
- Run on a machine with more RAM (Playwright needs ~1GB)

wiki/Getting-Started.md

# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation

### 1. Clone the repository

```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Install Playwright browsers

```bash
playwright install chromium
```

## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"   # Path to cache database
OUTPUT_DIR = "/path/to/output"   # Output directory
RATE_LIMIT_SECONDS = 0.5         # Delay between requests
MAX_PAGES = 50                   # Number of listing pages
```

**Windows users:** use escaped backslashes in paths, e.g. `C:\\output\\cache.db`.

## Usage

### Basic scraping

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs

### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/l/lot-url"
```

## Output

The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)

wiki/HOLISTIC.md

# Architecture

## Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Core Components

### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection

### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`

### 3. **Rate Limiter**
- Enforces exactly 0.5 seconds between requests
- Prevents server overload
- Tracks last request time globally
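
A global rate limiter of this kind can be as small as a timestamp check before each request — a sketch of the idea, not the exact implementation:

```python
import asyncio
import time

RATE_LIMIT_SECONDS = 0.5
_last_request = 0.0  # shared across all fetches

async def rate_limit() -> None:
    """Sleep just long enough to keep 0.5 s between consecutive requests."""
    global _last_request
    wait = RATE_LIMIT_SECONDS - (time.monotonic() - _last_request)
    if wait > 0:
        await asyncio.sleep(wait)
    _last_request = time.monotonic()
```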

### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions

### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking

## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
          ↓
2. For each lot URL → Check cache → If cached: use cached content
          ↓                         If not: fetch with rate limit
          ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
          ↓
4. Every 10 lots → Save progress checkpoint
          ↓
5. All lots complete → Export final JSON + CSV
```

## Key Design Decisions

### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing

### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources

## File Structure

```
troost-scraper/
├── main.py              # Main scraper logic
├── requirements.txt     # Python dependencies
├── README.md            # Documentation
├── .gitignore           # Git exclusions
└── output/              # Generated files (not in git)
    ├── cache.db         # SQLite cache
    ├── *_partial_*.json # Progress checkpoints
    ├── *_final_*.json   # Final JSON output
    └── *_final_*.csv    # Final CSV output
```

## Classes

### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries

### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
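
Wired together, the two classes are used roughly like this (a sketch based only on the signatures above; it assumes both classes are importable from `main.py` and that `crawl_auctions` is async, per the scalability notes below):

```python
import asyncio

from main import CacheManager, TroostwijkScraper  # assumed module layout

async def run() -> None:
    cache = CacheManager("output/cache.db")
    cache.clear_old(max_age_hours=7 * 24)  # drop entries older than 7 days

    scraper = TroostwijkScraper()          # constructor arguments, if any, omitted
    await scraper.crawl_auctions(max_pages=50)

asyncio.run(run())
```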

## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O

wiki/Home.md

# Scaev Wiki

Welcome to the Scaev documentation.

## Contents

- [Getting Started](Getting-Started)
- [Architecture](Architecture)
- [Deployment](Deployment)

## Overview

The Scaev Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Quick Links

- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)

wiki/TESTING.md

# Testing & Migration Guide

## Overview

This guide covers:
1. Migrating existing cache to compressed format
2. Running the test suite
3. Understanding test results

## Step 1: Migrate Cache to Compressed Format

If you have an existing database with uncompressed entries (from before compression was added), run the migration script:

```bash
python migrate_compress_cache.py
```

### What it does:
- Finds all cache entries where data is uncompressed
- Compresses them using zlib (level 9)
- Reports compression statistics and space saved
- Verifies all entries are compressed
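
Conceptually, the core of the migration is a single loop — a simplified sketch; the real script also tracks the statistics shown in the report below:

```python
import sqlite3
import zlib

conn = sqlite3.connect("/mnt/okcomputer/output/cache.db")
rows = conn.execute(
    "SELECT url, content FROM cache WHERE compressed = 0 OR compressed IS NULL"
).fetchall()

for url, content in rows:
    raw = content if isinstance(content, bytes) else content.encode("utf-8")
    conn.execute(
        "UPDATE cache SET content = ?, compressed = 1 WHERE url = ?",
        (zlib.compress(raw, level=9), url),
    )
conn.commit()
```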

### Expected output:
```
Cache Compression Migration Tool
============================================================
Initial database size: 1024.56 MB

Found 1134 uncompressed cache entries
Starting compression...
  Compressed 100/1134 entries... (78.3% reduction so far)
  Compressed 200/1134 entries... (79.1% reduction so far)
  ...

============================================================
MIGRATION COMPLETE
============================================================
Entries compressed: 1134
Original size: 1024.56 MB
Compressed size: 198.34 MB
Space saved: 826.22 MB
Compression ratio: 80.6%
============================================================

VERIFICATION:
  Compressed entries: 1134
  Uncompressed entries: 0
  ✓ All cache entries are compressed!

Final database size: 1024.56 MB
Database size reduced by: 0.00 MB

✓ Migration complete! You can now run VACUUM to reclaim disk space:
  sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```

### Reclaim disk space:
After migration, the database file still contains the space used by the old uncompressed data. To actually reclaim the disk space:

```bash
sqlite3 /mnt/okcomputer/output/cache.db 'VACUUM;'
```

This will rebuild the database file and reduce its size significantly.

## Step 2: Run Tests

The test suite validates that auction and lot parsing works correctly using **cached data only** (no live requests to the server).

```bash
python test_scraper.py
```

### What it tests:

**Auction Pages:**
- Type detection (must be 'auction')
- auction_id extraction
- title extraction
- location extraction
- lots_count extraction
- first_lot_closing_time extraction

**Lot Pages:**
- Type detection (must be 'lot')
- lot_id extraction
- title extraction (must not be '...', 'N/A', or empty)
- location extraction (must not be 'Locatie', 'Location', or empty)
- current_bid extraction (must not be '€Huidig bod' or invalid)
- closing_time extraction
- images array extraction
- bid_count validation
- viewing_time and pickup_date (optional)

### Expected output:

```
======================================================================
TROOSTWIJK SCRAPER TEST SUITE
======================================================================

This test suite uses CACHED data only - no live requests to server
======================================================================

======================================================================
CACHE STATUS CHECK
======================================================================
Total cache entries: 1134
  Compressed: 1134 (100.0%)
  Uncompressed: 0 (0.0%)

✓ All cache entries are compressed!

======================================================================
TEST URL CACHE STATUS:
======================================================================
  ✓ https://www.troostwijkauctions.com/a/online-auction-cnc-lat...
  ✓ https://www.troostwijkauctions.com/a/faillissement-bab-sho...
  ✓ https://www.troostwijkauctions.com/a/industriele-goederen-...
  ✓ https://www.troostwijkauctions.com/l/%25282x%2529-duo-bure...
  ✓ https://www.troostwijkauctions.com/l/tos-sui-50-1000-unive...
  ✓ https://www.troostwijkauctions.com/l/rolcontainer-%25282x%...

6/6 test URLs are cached

======================================================================
TESTING AUCTIONS
======================================================================

======================================================================
Testing Auction: https://www.troostwijkauctions.com/a/online-auction...
======================================================================
  ✓ Cache hit (age: 12.3 hours)
  ✓ auction_id: A7-39813
  ✓ title: Online Auction: CNC Lathes, Machining Centres & Precision...
  ✓ location: Cluj-Napoca, RO
  ✓ first_lot_closing_time: 2024-12-05 14:30:00
  ✓ lots_count: 45

======================================================================
TESTING LOTS
======================================================================

======================================================================
Testing Lot: https://www.troostwijkauctions.com/l/%25282x%2529-duo...
======================================================================
  ✓ Cache hit (age: 8.7 hours)
  ✓ lot_id: A1-28505-5
  ✓ title: (2x) Duo Bureau - 160x168 cm
  ✓ location: Dongen, NL
  ✓ current_bid: No bids
  ✓ closing_time: 2024-12-10 16:00:00
  ✓ images: 2 images
      1. https://media.tbauctions.com/image-media/c3f9825f-e3fd...
      2. https://media.tbauctions.com/image-media/45c85ced-9c63...
  ✓ bid_count: 0
  ✓ viewing_time: 2024-12-08 09:00:00 - 2024-12-08 17:00:00
  ✓ pickup_date: 2024-12-11 09:00:00 - 2024-12-11 15:00:00

======================================================================
TEST SUMMARY
======================================================================

Total tests: 6
Passed: 6 ✓
Failed: 0 ✗
Success rate: 100.0%

======================================================================
```

## Test URLs

The test suite tests these specific URLs (you can modify them in `test_scraper.py`):

**Auctions:**
- https://www.troostwijkauctions.com/a/online-auction-cnc-lathes-machining-centres-precision-measurement-romania-A7-39813
- https://www.troostwijkauctions.com/a/faillissement-bab-shortlease-i-ii-b-v-%E2%80%93-2024-big-ass-energieopslagsystemen-A1-39557
- https://www.troostwijkauctions.com/a/industriele-goederen-uit-diverse-bedrijfsbeeindigingen-A1-38675

**Lots:**
- https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5
- https://www.troostwijkauctions.com/l/tos-sui-50-1000-universele-draaibank-A7-39568-9
- https://www.troostwijkauctions.com/l/rolcontainer-%25282x%2529-A1-40191-101

## Adding More Test Cases

To add more test URLs, edit `test_scraper.py`:

```python
TEST_AUCTIONS = [
    "https://www.troostwijkauctions.com/a/your-auction-url",
    # ... add more
]

TEST_LOTS = [
    "https://www.troostwijkauctions.com/l/your-lot-url",
    # ... add more
]
```

Then run the main scraper to cache these URLs:
```bash
python main.py
```

Then run the tests:
```bash
python test_scraper.py
```

## Troubleshooting

### "NOT IN CACHE" errors
If tests show URLs are not cached, run the main scraper first:
```bash
python main.py
```

### "Failed to decompress cache" warnings
This means you have uncompressed legacy data. Run the migration:
```bash
python migrate_compress_cache.py
```

### Tests failing with parsing errors
Check the detailed error output in the TEST SUMMARY section. It will show:
- Which field failed validation
- The actual value that was extracted
- Why it failed (empty, wrong type, invalid format)

## Cache Behavior

The test suite uses cached data with these characteristics:
- **No rate limiting** - reads from the DB instantly
- **No server load** - zero HTTP requests
- **Repeatable** - same results every time
- **Fast** - all tests run in < 5 seconds

This allows you to:
- Test parsing changes without re-scraping
- Run tests repeatedly during development
- Validate changes before deploying
- Ensure data quality without server impact

## Continuous Integration

You can integrate these tests into CI/CD:

```bash
# Run migration if needed
python migrate_compress_cache.py

# Run tests
python test_scraper.py

# Exit code: 0 = success, 1 = failure
```

## Performance Benchmarks

Based on typical HTML sizes:

| Metric | Before Compression | After Compression | Improvement |
|--------|--------------------|-------------------|-------------|
| Avg page size | 800 KB | 150 KB | 81.3% |
| 1,000 pages | 800 MB | 150 MB | 650 MB saved |
| 10,000 pages | 8 GB | 1.5 GB | 6.5 GB saved |
| DB read speed | ~50 ms | ~5 ms | 10x faster |

## Best Practices

1. **Always run migration after upgrading** to the compressed cache version
2. **Run VACUUM** after migration to reclaim disk space
3. **Run tests after major changes** to parsing logic
4. **Add test cases for edge cases** you encounter in production
5. **Keep test URLs diverse** - different auctions, lot types, languages
6. **Monitor cache hit rates** to ensure effective caching