From 56caa3897839d232f818a08cef80885506a6aaf0 Mon Sep 17 00:00:00 2001
From: Tour
Date: Wed, 3 Dec 2025 12:33:06 +0100
Subject: [PATCH] Add complete wiki documentation: Home, Getting Started,
 Architecture, and Deployment guides
---
 Architecture.md    | 107 +++++++++++++++++++++++++++++++++++++++
 Deployment.md      | 122 +++++++++++++++++++++++++++++++++++++++++++++
 Getting-Started.md |  71 ++++++++++++++++++++++++++
 Home.md            |  18 +++++++
 4 files changed, 318 insertions(+)
 create mode 100644 Architecture.md
 create mode 100644 Deployment.md
 create mode 100644 Getting-Started.md
 create mode 100644 Home.md

diff --git a/Architecture.md b/Architecture.md
new file mode 100644
index 0000000..2b6d3f7
--- /dev/null
+++ b/Architecture.md
@@ -0,0 +1,107 @@
+# Architecture
+
+## Overview
+
+The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data, using Playwright for browser automation and SQLite for caching.
+
+## Core Components
+
+### 1. **Browser Automation (Playwright)**
+- Launches a Chromium browser in headless mode
+- Bypasses Cloudflare protection
+- Handles dynamically rendered content
+- Supports network-idle detection
+
+### 2. **Cache Manager (SQLite)**
+- Caches every fetched page
+- Prevents redundant requests
+- Stores page content, timestamps, and status codes
+- Auto-cleans entries older than 7 days
+- Database: `cache.db`
+
+### 3. **Rate Limiter**
+- Enforces a 0.5-second delay between requests
+- Prevents server overload
+- Tracks the last request time globally
+
+### 4. **Data Extractor**
+- **Primary method:** parses the `__NEXT_DATA__` JSON embedded in Next.js pages
+- **Fallback method:** HTML pattern matching with regex
+- Extracts title, location, bid info, dates, images, and descriptions
+
+### 5. **Output Manager**
+- Exports data in JSON and CSV formats
+- Saves progress checkpoints every 10 lots
+- Uses timestamped filenames for tracking
+
+## Data Flow
+
+```
+1. Listing pages → Extract lot URLs → Store in memory
+   ↓
+2. For each lot URL → Check cache → If cached: use cached content
+   ↓                                If not: fetch with rate limit
+   ↓
+3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
+   ↓
+4. Every 10 lots → Save progress checkpoint
+   ↓
+5. All lots complete → Export final JSON + CSV
+```
+
+## Key Design Decisions
+
+### Why Playwright?
+- Handles JavaScript-rendered content (Next.js)
+- Bypasses Cloudflare protection
+- More reliable than requests/BeautifulSoup for modern SPAs
+
+### Why JSON extraction?
+- The site uses Next.js with an embedded `__NEXT_DATA__` payload
+- JSON is more reliable than HTML pattern matching (see the sketch below)
+- Avoids breaking when the HTML/CSS changes
+- Faster parsing
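+
+A minimal sketch of this extraction path (illustrative only; the actual `_extract_nextjs_data` in `main.py` may differ in detail):
+
+```python
+import json
+import re
+from typing import Optional
+
+# Next.js embeds the page state as a JSON <script> island with a fixed id.
+NEXT_DATA_RE = re.compile(
+    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
+    re.DOTALL,
+)
+
+def extract_nextjs_data(content: str) -> Optional[dict]:
+    """Return the parsed __NEXT_DATA__ payload, or None if it is missing."""
+    match = NEXT_DATA_RE.search(content)
+    if not match:
+        return None  # caller falls back to HTML pattern matching
+    try:
+        return json.loads(match.group(1))
+    except json.JSONDecodeError:
+        return None
+```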
+
+### Why SQLite caching?
+- Persistent across runs
+- Reduces load on the target server
+- Enables test mode without re-fetching
+- Respects the website's resources
+
+## File Structure
+
+```
+troost-scraper/
+├── main.py            # Main scraper logic
+├── requirements.txt   # Python dependencies
+├── README.md          # Documentation
+├── .gitignore         # Git exclusions
+└── output/            # Generated files (not in git)
+    ├── cache.db           # SQLite cache
+    ├── *_partial_*.json   # Progress checkpoints
+    ├── *_final_*.json     # Final JSON output
+    └── *_final_*.csv      # Final CSV output
+```
+
+## Classes
+
+### `CacheManager`
+- `__init__(db_path)` - Initialize the cache database
+- `get(url, max_age_hours)` - Retrieve a cached page
+- `set(url, content, status_code)` - Cache a page
+- `clear_old(max_age_hours)` - Remove old entries (sketched below)
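+
+A rough sketch of this interface, assuming a single `pages` table (the schema here is hypothetical; see `main.py` for the real one):
+
+```python
+import sqlite3
+import time
+from typing import Optional
+
+class CacheManager:
+    """Sketch of the cache described above: one row per fetched URL."""
+
+    def __init__(self, db_path: str = "cache.db"):
+        self.conn = sqlite3.connect(db_path)
+        self.conn.execute(
+            "CREATE TABLE IF NOT EXISTS pages ("
+            "url TEXT PRIMARY KEY, content TEXT, "
+            "status_code INTEGER, fetched_at REAL)"
+        )
+
+    def get(self, url: str, max_age_hours: float = 168) -> Optional[str]:
+        """Return cached content newer than max_age_hours (default: 7 days)."""
+        cutoff = time.time() - max_age_hours * 3600
+        row = self.conn.execute(
+            "SELECT content FROM pages WHERE url = ? AND fetched_at > ?",
+            (url, cutoff),
+        ).fetchone()
+        return row[0] if row else None
+
+    def set(self, url: str, content: str, status_code: int = 200) -> None:
+        self.conn.execute(
+            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
+            (url, content, status_code, time.time()),
+        )
+        self.conn.commit()
+
+    def clear_old(self, max_age_hours: float = 168) -> None:
+        """Remove entries older than max_age_hours."""
+        cutoff = time.time() - max_age_hours * 3600
+        self.conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
+        self.conn.commit()
+```
+
+Because `get()` filters on age, test mode can replay cached pages without touching the live site.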
+
+### `TroostwijkScraper`
+- `crawl_auctions(max_pages)` - Main entry point
+- `crawl_listing_page(page, page_num)` - Extract lot URLs
+- `crawl_lot(page, url)` - Scrape an individual lot
+- `_extract_nextjs_data(content)` - Parse the embedded JSON
+- `_parse_lot_page(content, url)` - Extract all fields
+- `save_final_results(data)` - Export JSON + CSV
+
+## Scalability Notes
+
+- **Rate limiting** prevents IP blocks but slows execution
+- **Caching** makes subsequent runs near-instant for unchanged pages
+- **Progress checkpoints** allow resuming after an interruption
+- **Async/await** is used throughout for non-blocking I/O
diff --git a/Deployment.md b/Deployment.md
new file mode 100644
index 0000000..2944db5
--- /dev/null
+++ b/Deployment.md
@@ -0,0 +1,122 @@
+# Deployment
+
+## Prerequisites
+
+- Python 3.8+ installed
+- Access to a server (Linux/Windows)
+- Playwright and its dependencies installed
+
+## Production Setup
+
+### 1. Install on the Server
+
+```bash
+# Clone the repository
+git clone git@git.appmodel.nl:Tour/troost-scraper.git
+cd troost-scraper
+
+# Create a virtual environment
+python -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+playwright install chromium
+playwright install-deps  # Install system dependencies (Linux)
+```
+
+### 2. Configuration
+
+Set the configuration constants in `main.py` (or wire them to environment variables):
+
+```python
+# main.py configuration
+BASE_URL = "https://www.troostwijkauctions.com"
+CACHE_DB = "/var/troost-scraper/cache.db"
+OUTPUT_DIR = "/var/troost-scraper/output"
+RATE_LIMIT_SECONDS = 0.5
+MAX_PAGES = 50
+```
+
+### 3. Create Output Directories
+
+```bash
+sudo mkdir -p /var/troost-scraper/output
+sudo chown -R $USER:$USER /var/troost-scraper
+```
+
+### 4. Run as a Cron Job
+
+Add to the crontab (`crontab -e`):
+
+```bash
+# Run the scraper daily at 2 AM
+0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
+```
+
+## Docker Deployment (Optional)
+
+Create a `Dockerfile`:
+
+```dockerfile
+FROM python:3.10-slim
+
+WORKDIR /app
+
+# Install system dependencies for Playwright
+RUN apt-get update && apt-get install -y \
+    wget \
+    gnupg \
+    && rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+RUN playwright install chromium
+RUN playwright install-deps
+
+COPY main.py .
+
+CMD ["python", "main.py"]
+```
+
+Build and run:
+
+```bash
+docker build -t troost-scraper .
+# The mount target should match OUTPUT_DIR inside the container
+docker run -v /path/to/output:/output troost-scraper
+```
+
+## Monitoring
+
+### Check Logs
+
+```bash
+tail -f /var/log/troost-scraper.log
+```
+
+### Monitor Output
+
+```bash
+ls -lh /var/troost-scraper/output/
+```
+
+## Troubleshooting
+
+### Playwright Browser Issues
+
+```bash
+# Reinstall the browsers
+playwright install --force chromium
+```
+
+### Permission Issues
+
+```bash
+# Fix permissions
+sudo chown -R $USER:$USER /var/troost-scraper
+```
+
+### Memory Issues
+
+- Reduce `MAX_PAGES` in the configuration
+- Run on a machine with more RAM (Playwright needs ~1 GB)
diff --git a/Getting-Started.md b/Getting-Started.md
new file mode 100644
index 0000000..160c7ce
--- /dev/null
+++ b/Getting-Started.md
@@ -0,0 +1,71 @@
+# Getting Started
+
+## Prerequisites
+
+- Python 3.8+
+- Git
+- pip (Python package manager)
+
+## Installation
+
+### 1. Clone the repository
+
+```bash
+git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
+cd troost-scraper
+```
+
+### 2. Install dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 3. Install Playwright browsers
+
+```bash
+playwright install chromium
+```
+
+## Configuration
+
+Edit the configuration in `main.py`:
+
+```python
+BASE_URL = "https://www.troostwijkauctions.com"
+CACHE_DB = "/path/to/cache.db"   # Path to the cache database
+OUTPUT_DIR = "/path/to/output"   # Output directory
+RATE_LIMIT_SECONDS = 0.5         # Delay between requests
+MAX_PAGES = 50                   # Number of listing pages to crawl
+```
+
+**Windows users:** use paths like `C:\\output\\cache.db`.
+
+## Usage
+
+### Basic scraping
+
+```bash
+python main.py
+```
+
+This will:
+1. Crawl the listing pages to collect lot URLs
+2. Scrape each individual lot page
+3. Save the results in JSON and CSV formats
+4. Cache all pages for future runs
+
+### Test mode
+
+Debug extraction on a specific URL:
+
+```bash
+python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
+```
+
+## Output
+
+The scraper generates:
+- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
+- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
+- `cache.db` - SQLite cache (persistent across runs)
diff --git a/Home.md b/Home.md
new file mode 100644
index 0000000..3a3f86b
--- /dev/null
+++ b/Home.md
@@ -0,0 +1,18 @@
+# troost-scraper Wiki
+
+Welcome to the troost-scraper documentation.
+
+## Contents
+
+- [Getting Started](Getting-Started)
+- [Architecture](Architecture)
+- [Deployment](Deployment)
+
+## Overview
+
+The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data, using Playwright for browser automation and SQLite for caching.
+
+## Quick Links
+
+- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
+- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)