Add complete wiki documentation: Home, Getting Started, Architecture, and Deployment guides

Tour
2025-12-03 12:33:06 +01:00
parent 87212cd612
commit 56caa38978
4 changed files with 318 additions and 0 deletions

Architecture.md Normal file

@@ -0,0 +1,107 @@
# Architecture
## Overview
The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Core Components
### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
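A minimal sketch of such a fetch with Playwright's async API (the URL and wait strategy are illustrative, not the project's exact settings):
```python
# Minimal fetch sketch using Playwright's async API.
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits until there have been no network
        # connections for ~500 ms, so Next.js has finished rendering
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

if __name__ == "__main__":
    html = asyncio.run(fetch_page("https://www.troostwijkauctions.com"))
    print(len(html))
```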
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
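A sketch of the cache lookup under an assumed schema (the real `CacheManager` lives in `main.py`; the column names here are illustrative):
```python
# Illustrative cache schema and lookup; the actual CacheManager in
# main.py may use different column names.
import sqlite3
import time

conn = sqlite3.connect("cache.db")
conn.execute("""CREATE TABLE IF NOT EXISTS cache (
    url         TEXT PRIMARY KEY,
    content     TEXT,
    status_code INTEGER,
    fetched_at  REAL)""")

def get_cached(url, max_age_hours=7 * 24):
    """Return cached content, or None if missing or older than 7 days."""
    cutoff = time.time() - max_age_hours * 3600
    row = conn.execute(
        "SELECT content FROM cache WHERE url = ? AND fetched_at >= ?",
        (url, cutoff),
    ).fetchone()
    return row[0] if row else None
```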
### 3. **Rate Limiter**
- Enforces a minimum delay of 0.5 seconds between requests (sketched below)
- Prevents server overload
- Tracks last request time globally
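One way to implement this, as a sketch (the 0.5 s interval matches the documentation above; the mechanics are assumptions):
```python
# Sketch of a global minimum-interval rate limiter.
import asyncio
import time

_last_request = 0.0

async def rate_limit(min_interval: float = 0.5) -> None:
    """Sleep just long enough to keep min_interval between requests."""
    global _last_request
    wait = _last_request + min_interval - time.monotonic()
    if wait > 0:
        await asyncio.sleep(wait)
    _last_request = time.monotonic()
```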
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
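The primary path can be sketched as pulling the `__NEXT_DATA__` script tag out of the rendered HTML (the tag id is standard Next.js; the JSON paths to individual lot fields are site-specific and omitted):
```python
# Sketch of the primary extraction path: locate the embedded
# __NEXT_DATA__ JSON that Next.js renders into every page.
import json
import re

def extract_nextjs_data(html):
    m = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(m.group(1)) if m else None
```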
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
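A sketch of the export step (the filename pattern matches the Output section of Getting Started; the rest is illustrative):
```python
# Sketch of timestamped JSON + CSV export.
import csv
import json
from datetime import datetime

def export_results(lots, out_dir="output"):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base = f"{out_dir}/troostwijk_lots_final_{stamp}"
    with open(f"{base}.json", "w", encoding="utf-8") as f:
        json.dump(lots, f, ensure_ascii=False, indent=2)
    if lots:
        with open(f"{base}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0]))
            writer.writeheader()
            writer.writerows(lots)
```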
## Data Flow
```
1. Listing Pages → Extract lot URLs → Store in memory
2. For each lot URL → Check cache
     ├─ Cached:     use cached content
     └─ Not cached: fetch with rate limit, then cache
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
4. Every 10 lots → Save progress checkpoint
5. All lots complete → Export final JSON + CSV
```
## Key Design Decisions
### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs
### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing
### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources
## File Structure
```
troost-scraper/
├── main.py # Main scraper logic
├── requirements.txt # Python dependencies
├── README.md # Documentation
├── .gitignore # Git exclusions
└── output/ # Generated files (not in git)
├── cache.db # SQLite cache
├── *_partial_*.json # Progress checkpoints
├── *_final_*.json # Final JSON output
└── *_final_*.csv # Final CSV output
```
## Classes
### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries
### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
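Hypothetical end-to-end usage, assuming `main.py` guards its entry point with `if __name__ == "__main__"`, a no-argument constructor, and an async `crawl_auctions` (the Scalability Notes below state async/await is used throughout):
```python
# Hypothetical usage; the import and constructor are assumptions
# based on the method list above.
import asyncio
from main import TroostwijkScraper

async def run():
    scraper = TroostwijkScraper()
    await scraper.crawl_auctions(max_pages=5)

asyncio.run(run())
```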
## Scalability Notes
- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs near-instant for already-cached pages (within the 7-day window)
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O

Deployment.md Normal file

@@ -0,0 +1,122 @@
# Deployment
## Prerequisites
- Python 3.8+ installed
- Access to a server (Linux/Windows)
- Playwright and dependencies installed
## Production Setup
### 1. Install on Server
```bash
# Clone repository
git clone git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
playwright install chromium
playwright install-deps # Install system dependencies
```
### 2. Configuration
The scraper is configured through module-level constants at the top of `main.py`; adjust the paths for your server (an environment-variable override is sketched after the block):
```python
# main.py configuration
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/var/troost-scraper/cache.db"
OUTPUT_DIR = "/var/troost-scraper/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```
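If you prefer environment variables to editing the file, one possible override pattern (an assumption, not shipped code) is:
```python
# Possible env-var override pattern (an assumption, not shipped code).
import os

CACHE_DB = os.environ.get("CACHE_DB", "/var/troost-scraper/cache.db")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/var/troost-scraper/output")
RATE_LIMIT_SECONDS = float(os.environ.get("RATE_LIMIT_SECONDS", "0.5"))
MAX_PAGES = int(os.environ.get("MAX_PAGES", "50"))
```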
### 3. Create Output Directories
```bash
sudo mkdir -p /var/troost-scraper/output
sudo chown $USER:$USER /var/troost-scraper
```
### 4. Run as Cron Job
Add to crontab (`crontab -e`):
```bash
# Run scraper daily at 2 AM
0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
```
## Docker Deployment (Optional)
Create `Dockerfile`:
```dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
wget \
gnupg \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps
COPY main.py .
CMD ["python", "main.py"]
```
Build and run, mounting a host directory for results (the container-side path must match `OUTPUT_DIR` in `main.py`):
```bash
docker build -t troost-scraper .
docker run -v /path/to/output:/output troost-scraper
```
## Monitoring
### Check Logs
```bash
tail -f /var/log/troost-scraper.log
```
### Monitor Output
```bash
ls -lh /var/troost-scraper/output/
```
## Troubleshooting
### Playwright Browser Issues
```bash
# Reinstall browsers
playwright install --force chromium
```
### Permission Issues
```bash
# Fix permissions
sudo chown -R $USER:$USER /var/troost-scraper
```
### Memory Issues
- Reduce `MAX_PAGES` in configuration
- Run on a machine with more RAM (Playwright's Chromium needs ~1 GB)

Getting-Started.md Normal file

@@ -0,0 +1,71 @@
# Getting Started
## Prerequisites
- Python 3.8+
- Git
- pip (Python package manager)
## Installation
### 1. Clone the repository
```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
### 3. Install Playwright browsers
```bash
playwright install chromium
```
## Configuration
Edit the configuration in `main.py`:
```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db" # Path to cache database
OUTPUT_DIR = "/path/to/output" # Output directory
RATE_LIMIT_SECONDS = 0.5 # Delay between requests
MAX_PAGES = 50 # Number of listing pages
```
**Windows users:** Use double backslashes in the Python string literals, e.g. `C:\\output\\cache.db`
## Usage
### Basic scraping
```bash
python main.py
```
This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs
### Test mode
Debug extraction on a specific URL:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output
The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)

Home.md Normal file

@@ -0,0 +1,18 @@
# troost-scraper Wiki
Welcome to the troost-scraper documentation.
## Contents
- [Getting Started](Getting-Started)
- [Architecture](Architecture)
- [Deployment](Deployment)
## Overview
Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Quick Links
- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)