Add complete wiki documentation: Home, Getting Started, Architecture, and Deployment guides

2025-12-03 12:33:06 +01:00
parent 87212cd612
commit 56caa38978
4 changed files with 318 additions and 0 deletions
--- a/Architecture.md
+++ b/Architecture.md
@@ -0,0 +1,107 @@
+# Architecture
+
+## Overview
+
+The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
+
+## Core Components
+
+### 1. **Browser Automation (Playwright)**
+- Launches Chromium browser in headless mode
+- Bypasses Cloudflare protection
+- Handles dynamic content rendering
+- Supports network idle detection
+
+### 2. **Cache Manager (SQLite)**
+- Caches every fetched page
+- Prevents redundant requests
+- Stores page content, timestamps, and status codes
+- Auto-cleans entries older than 7 days
+- Database: `cache.db`
+
+### 3. **Rate Limiter**
+- Enforces a minimum delay of 0.5 seconds between requests (`RATE_LIMIT_SECONDS`)
+- Prevents server overload
+- Tracks last request time globally
+
+### 4. **Data Extractor**
+- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
+- **Fallback method:** HTML pattern matching with regex
+- Extracts: title, location, bid info, dates, images, descriptions
+
+### 5. **Output Manager**
+- Exports data in JSON and CSV formats
+- Saves progress checkpoints every 10 lots
+- Timestamped filenames for tracking
+
+## Data Flow
+
+```
+1. Listing Pages → Extract lot URLs → Store in memory
+                                           ↓
+2. For each lot URL → Check cache → If cached: use cached content
+                          ↓              If not: fetch with rate limit
+                          ↓
+3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
+                          ↓
+4. Every 10 lots → Save progress checkpoint
+                          ↓
+5. All lots complete → Export final JSON + CSV
+```
+
+## Key Design Decisions
+
+### Why Playwright?
+- Handles JavaScript-rendered content (Next.js)
+- Bypasses Cloudflare protection
+- More reliable than requests/BeautifulSoup for modern SPAs
+
+### Why JSON extraction?
+- Site uses Next.js with embedded `__NEXT_DATA__`
+- JSON is more reliable than HTML pattern matching
+- Avoids breaking when HTML/CSS changes
+- Faster parsing
+
+### Why SQLite caching?
+- Persistent across runs
+- Reduces load on target server
+- Enables test mode without re-fetching
+- Respects website resources
+
+## File Structure
+
+```
+troost-scraper/
+├── main.py                    # Main scraper logic
+├── requirements.txt           # Python dependencies
+├── README.md                  # Documentation
+├── .gitignore                 # Git exclusions
+└── output/                    # Generated files (not in git)
+    ├── cache.db              # SQLite cache
+    ├── *_partial_*.json      # Progress checkpoints
+    ├── *_final_*.json        # Final JSON output
+    └── *_final_*.csv         # Final CSV output
+```
+
+## Classes
+
+### `CacheManager`
+- `__init__(db_path)` - Initialize cache database
+- `get(url, max_age_hours)` - Retrieve cached page
+- `set(url, content, status_code)` - Cache a page
+- `clear_old(max_age_hours)` - Remove old entries
+
+### `TroostwijkScraper`
+- `crawl_auctions(max_pages)` - Main entry point
+- `crawl_listing_page(page, page_num)` - Extract lot URLs
+- `crawl_lot(page, url)` - Scrape individual lot
+- `_extract_nextjs_data(content)` - Parse JSON data
+- `_parse_lot_page(content, url)` - Extract all fields
+- `save_final_results(data)` - Export JSON + CSV
+
+## Scalability Notes
+
+- **Rate limiting** reduces the risk of IP blocks but slows execution: at 0.5 seconds per request, 1,000 uncached pages take at least 500 seconds
+- **Caching** lets subsequent runs skip the network fetch for every page still in the cache
+- **Progress checkpoints** allow resuming after interruption
+- **Async/await** is used throughout for non-blocking I/O
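The interplay of caching and rate limiting described in Architecture.md can be sketched roughly as below. This is a minimal illustration, not the code from `main.py`: the table schema, the `RateLimiter` class, the `get_page` helper, and the injected `fetch` coroutine are all assumptions made for the example. Only the `CacheManager` method names follow the list in the Classes section.

```python
import asyncio
import sqlite3
import time


class CacheManager:
    """Sketch of a SQLite page cache; the schema here is an assumption."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, content TEXT, "
            "status_code INTEGER, timestamp REAL)"
        )

    def get(self, url, max_age_hours=24):
        # Return cached content only if it is newer than the cutoff.
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM cache WHERE url = ? AND timestamp >= ?",
            (url, cutoff),
        ).fetchone()
        return row[0] if row else None

    def set(self, url, content, status_code=200):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours=168):
        # 168 hours = 7 days, matching the clean-up window in the docs.
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM cache WHERE timestamp < ?", (cutoff,))
        self.conn.commit()


class RateLimiter:
    """Hypothetical helper: keeps a minimum interval between requests."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous request

    async def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self._last = time.monotonic()


async def get_page(url, cache, limiter, fetch):
    """Use the cache when possible; otherwise fetch under the rate limit."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    await limiter.wait()
    content = await fetch(url)  # fetch is any coroutine returning page text
    cache.set(url, content)
    return content
```

Because cache hits return before `limiter.wait()` is reached, only real network requests pay the delay, which is why a second run over the same URLs finishes much faster than the first.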
--- a/Deployment.md
+++ b/Deployment.md
@@ -0,0 +1,122 @@
+# Deployment
+
+## Prerequisites
+
+- Python 3.8+ installed
+- Access to a server (Linux/Windows)
+- Playwright and dependencies installed
+
+## Production Setup
+
+### 1. Install on Server
+
+```bash
+# Clone repository
+git clone git@git.appmodel.nl:Tour/troost-scraper.git
+cd troost-scraper
+
+# Create virtual environment
+python -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+playwright install chromium
+playwright install-deps  # Install system dependencies
+```
+
+### 2. Configuration
+
+Set the configuration constants in `main.py` to the server paths:
+
+```python
+# main.py configuration
+BASE_URL = "https://www.troostwijkauctions.com"
+CACHE_DB = "/var/troost-scraper/cache.db"
+OUTPUT_DIR = "/var/troost-scraper/output"
+RATE_LIMIT_SECONDS = 0.5
+MAX_PAGES = 50
+```
+
+### 3. Create Output Directories
+
+```bash
+sudo mkdir -p /var/troost-scraper/output
+sudo chown -R $USER:$USER /var/troost-scraper
+```
+
+### 4. Run as Cron Job
+
+Add to crontab (`crontab -e`):
+
+```bash
+# Run scraper daily at 2 AM; the cron user must be able to write the log file
+0 2 * * * cd /path/to/troost-scraper && /path/to/troost-scraper/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
+```
+
+## Docker Deployment (Optional)
+
+Create `Dockerfile`:
+
+```dockerfile
+FROM python:3.10-slim
+
+WORKDIR /app
+
+# Install system dependencies for Playwright
+RUN apt-get update && apt-get install -y \
+    wget \
+    gnupg \
+    && rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+RUN playwright install chromium
+RUN playwright install-deps
+
+COPY main.py .
+
+CMD ["python", "main.py"]
+```
+
+Build and run:
+
+```bash
+docker build -t troost-scraper .
+docker run -v /path/to/output:/var/troost-scraper/output troost-scraper  # container path must match OUTPUT_DIR
+```
+
+## Monitoring
+
+### Check Logs
+
+```bash
+tail -f /var/log/troost-scraper.log
+```
+
+### Monitor Output
+
+```bash
+ls -lh /var/troost-scraper/output/
+```
+
+## Troubleshooting
+
+### Playwright Browser Issues
+
+```bash
+# Reinstall browsers
+playwright install --force chromium
+```
+
+### Permission Issues
+
+```bash
+# Fix permissions
+sudo chown -R $USER:$USER /var/troost-scraper
+```
+
+### Memory Issues
+
+- Reduce `MAX_PAGES` in configuration
+- Run on a machine with more RAM (Playwright needs ~1 GB)
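Lowering `MAX_PAGES` currently means editing `main.py`. One possible way to make it adjustable per run, assuming you are free to change how `main.py` defines the constant, is to read it from an environment variable with a fallback. The `env_int` helper and the variable name `TROOST_MAX_PAGES` are hypothetical; nothing in the documentation indicates that the scraper reads them today.

```python
import os


def env_int(name, default, environ=os.environ):
    """Return a positive integer setting from the environment, or the default."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value < 1:
        raise ValueError(f"{name} must be at least 1, got {value}")
    return value


# TROOST_MAX_PAGES is a hypothetical name chosen for this sketch.
MAX_PAGES = env_int("TROOST_MAX_PAGES", 50)
```

With such a change in place, a low-memory run could be started as `TROOST_MAX_PAGES=10 python main.py` without touching the source.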
--- a/Getting-Started.md
+++ b/Getting-Started.md
@@ -0,0 +1,71 @@
+# Getting Started
+
+## Prerequisites
+
+- Python 3.8+
+- Git
+- pip (Python package manager)
+
+## Installation
+
+### 1. Clone the repository
+
+```bash
+git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
+cd troost-scraper
+```
+
+### 2. Install dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 3. Install Playwright browsers
+
+```bash
+playwright install chromium
+```
+
+## Configuration
+
+Edit the configuration in `main.py`:
+
+```python
+BASE_URL = "https://www.troostwijkauctions.com"
+CACHE_DB = "/path/to/cache.db"           # Path to cache database
+OUTPUT_DIR = "/path/to/output"            # Output directory
+RATE_LIMIT_SECONDS = 0.5                  # Delay between requests
+MAX_PAGES = 50                            # Number of listing pages
+```
+
+**Windows users:** Use paths like `C:\\output\\cache.db`
+
+## Usage
+
+### Basic scraping
+
+```bash
+python main.py
+```
+
+This will:
+1. Crawl listing pages to collect lot URLs
+2. Scrape each individual lot page
+3. Save results in JSON and CSV formats
+4. Cache all pages for future runs
+
+### Test mode
+
+Debug extraction on a specific URL:
+
+```bash
+python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
+```
+
+## Output
+
+The scraper generates:
+- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
+- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
+- `cache.db` - SQLite cache (persistent)
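The timestamped output names listed in Getting-Started.md could be produced along the following lines. This is a sketch under assumptions: `export_results` is a hypothetical stand-alone function, not the `save_final_results` method from `main.py`, and it assumes each lot is a flat dictionary.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def export_results(lots, output_dir, now=None):
    """Write a list of lot dicts to timestamped JSON and CSV files."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")  # e.g. 20251203_123306
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / f"troostwijk_lots_final_{stamp}.json"
    csv_path = out / f"troostwijk_lots_final_{stamp}.csv"

    with json_path.open("w", encoding="utf-8") as f:
        json.dump(lots, f, ensure_ascii=False, indent=2)

    # Union of keys in first-seen order, so lots with missing fields still fit.
    fieldnames = list(dict.fromkeys(key for lot in lots for key in lot))
    with csv_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(lots)  # missing fields are written as empty cells

    return json_path, csv_path
```

Putting the timestamp in the filename keeps each run's output separate, so a later run never overwrites an earlier export.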
--- a/Home.md
+++ b/Home.md
@@ -0,0 +1,18 @@
+# troost-scraper Wiki
+
+Welcome to the troost-scraper documentation.
+
+## Contents
+
+- [Getting Started](Getting-Started)
+- [Architecture](Architecture)
+- [Deployment](Deployment)
+
+## Overview
+
+The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
+
+## Quick Links
+
+- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
+- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)