Add complete wiki documentation: Home, Getting Started, Architecture, and Deployment guides
Architecture.md (new file, 107 lines)
# Architecture

## Overview

The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Core Components
### 1. **Browser Automation (Playwright)**

- Launches a Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network-idle detection
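A minimal sketch of this layer using Playwright's async API (the function name and options here are illustrative, not lifted from `main.py`):

```python
import asyncio

from playwright.async_api import async_playwright


async def fetch_rendered(url: str) -> str:
    """Fetch a fully rendered page. Illustrative sketch, not the project's code."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits for network activity to settle,
        # giving Next.js time to render dynamic content
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html


# html = asyncio.run(fetch_rendered("https://www.troostwijkauctions.com/"))
```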
### 2. **Cache Manager (SQLite)**

- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
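A rough sketch of such a cache (the table layout is an assumption; the actual schema lives in `main.py`):

```python
import sqlite3
import time


class CacheManager:
    """Sketch of the page cache; the real schema in main.py may differ."""

    def __init__(self, db_path: str = "cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content TEXT, "
            "status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 24.0):
        row = self.conn.execute(
            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < max_age_hours * 3600:
            return row[0]  # fresh enough: reuse cached content
        return None  # missing or stale: caller fetches again

    def set(self, url: str, content: str, status_code: int = 200) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()
```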
### 3. **Rate Limiter**

- Enforces a minimum of 0.5 seconds between requests
- Prevents server overload
- Tracks the last request time globally
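In an asyncio scraper that can look roughly like this (a sketch; the class and attribute names are illustrative):

```python
import asyncio
import time


class RateLimiter:
    """Sketch: enforce a minimum interval between requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_request = 0.0  # shared across all fetches

    async def wait(self) -> None:
        elapsed = time.time() - self._last_request
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self._last_request = time.time()
```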
### 4. **Data Extractor**

- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
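Next.js embeds its page state in a `<script id="__NEXT_DATA__">` tag, so the primary path can be as simple as this sketch (the real parser's patterns and error handling may differ):

```python
import json
import re
from typing import Optional


def extract_nextjs_data(html: str) -> Optional[dict]:
    """Pull the embedded __NEXT_DATA__ JSON out of a rendered page."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        return None  # trigger the regex-on-HTML fallback
    return json.loads(match.group(1))
```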
### 5. **Output Manager**

- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
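A checkpoint write might look roughly like this (the filename pattern follows the output files listed under File Structure; the helper itself is illustrative):

```python
import json
from datetime import datetime
from pathlib import Path


def save_checkpoint(lots, output_dir="output"):
    """Sketch: dump partial results to a timestamped JSON file."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(output_dir) / f"troostwijk_lots_partial_{stamp}.json"
    path.write_text(json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```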
## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
        ↓
2. For each lot URL → Check cache → If cached: use cached content
        ↓ If not: fetch with rate limit
        ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
        ↓
4. Every 10 lots → Save progress checkpoint
        ↓
5. All lots complete → Export final JSON + CSV
```
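Wired together, the per-lot loop implied by steps 2-4 could look roughly like this (a sketch reusing the component sketches above; none of these names are guaranteed to match `main.py`):

```python
# Relies on the CacheManager, RateLimiter, extract_nextjs_data, and
# save_checkpoint sketches defined in the component sections above.
async def process_lot(page, url, cache, limiter, results):
    """Sketch of steps 2-4: cache check, rate-limited fetch, parse, checkpoint."""
    html = cache.get(url, max_age_hours=24)  # step 2: check cache
    if html is None:
        await limiter.wait()  # rate-limited fetch
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        cache.set(url, html)
    data = extract_nextjs_data(html)  # step 3: parse __NEXT_DATA__
    if data is not None:
        results.append(data)
    if results and len(results) % 10 == 0:  # step 4: checkpoint every 10 lots
        save_checkpoint(results)
```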
## Key Design Decisions

### Why Playwright?

- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?

- The site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when the HTML/CSS changes
- Faster parsing

### Why SQLite caching?

- Persistent across runs
- Reduces load on the target server
- Enables test mode without re-fetching
- Respects website resources

## File Structure
```
troost-scraper/
├── main.py              # Main scraper logic
├── requirements.txt     # Python dependencies
├── README.md            # Documentation
├── .gitignore           # Git exclusions
└── output/              # Generated files (not in git)
    ├── cache.db         # SQLite cache
    ├── *_partial_*.json # Progress checkpoints
    ├── *_final_*.json   # Final JSON output
    └── *_final_*.csv    # Final CSV output
```
## Classes

### `CacheManager`

- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries
### `TroostwijkScraper`

- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
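Putting the two classes together, a typical run might look like this (a sketch that assumes `crawl_auctions` is a coroutine returning the scraped lots; check `main.py` for the real signatures):

```python
import asyncio

from main import TroostwijkScraper  # assumed import path


async def run() -> None:
    scraper = TroostwijkScraper()
    lots = await scraper.crawl_auctions(max_pages=50)
    scraper.save_final_results(lots)


asyncio.run(run())
```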
## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O
Deployment.md (new file, 122 lines)
# Deployment

## Prerequisites

- Python 3.8+ installed
- Access to a server (Linux/Windows)
- Playwright and dependencies installed

## Production Setup

### 1. Install on Server
```bash
# Clone repository
git clone git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
playwright install chromium
playwright install-deps  # Install system dependencies
```
### 2. Configuration

Set the configuration constants at the top of `main.py`, or wire them to environment variables (sketched after the block):

```python
# main.py configuration
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/var/troost-scraper/cache.db"
OUTPUT_DIR = "/var/troost-scraper/output"
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```
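If you would rather not hard-code paths, a small shim can pull the same settings from the environment (the `TROOST_*` variable names are illustrative, not ones the project defines):

```python
import os

# Fall back to the hard-coded defaults when a variable is unset
CACHE_DB = os.environ.get("TROOST_CACHE_DB", "/var/troost-scraper/cache.db")
OUTPUT_DIR = os.environ.get("TROOST_OUTPUT_DIR", "/var/troost-scraper/output")
RATE_LIMIT_SECONDS = float(os.environ.get("TROOST_RATE_LIMIT", "0.5"))
MAX_PAGES = int(os.environ.get("TROOST_MAX_PAGES", "50"))
```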
### 3. Create Output Directories

```bash
sudo mkdir -p /var/troost-scraper/output
sudo chown -R $USER:$USER /var/troost-scraper
```
### 4. Run as Cron Job

Add to crontab (`crontab -e`):

```bash
# Run scraper daily at 2 AM
0 2 * * * cd /path/to/troost-scraper && /path/to/.venv/bin/python main.py >> /var/log/troost-scraper.log 2>&1
```
## Docker Deployment (Optional)

Create `Dockerfile`:

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps

COPY main.py .

CMD ["python", "main.py"]
```
Build and run (the mount target should match `OUTPUT_DIR` inside the container):

```bash
docker build -t troost-scraper .
docker run -v /path/to/output:/output troost-scraper
```
## Monitoring

### Check Logs

```bash
tail -f /var/log/troost-scraper.log
```

### Monitor Output

```bash
ls -lh /var/troost-scraper/output/
```
## Troubleshooting

### Playwright Browser Issues

```bash
# Reinstall browsers
playwright install --force chromium
```

### Permission Issues

```bash
# Fix permissions
sudo chown -R $USER:$USER /var/troost-scraper
```

### Memory Issues

- Reduce `MAX_PAGES` in the configuration
- Run on a machine with more RAM (Playwright needs ~1 GB)
Getting-Started.md (new file, 71 lines)
# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation

### 1. Clone the repository
```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Install Playwright browsers

```bash
playwright install chromium
```
## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"    # Path to cache database
OUTPUT_DIR = "/path/to/output"    # Output directory
RATE_LIMIT_SECONDS = 0.5          # Delay between requests
MAX_PAGES = 50                    # Number of listing pages
```

**Windows users:** use double-backslash paths, e.g. `C:\\output\\cache.db`
## Usage

### Basic scraping

```bash
python main.py
```
This will:

1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs
### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output

The scraper generates:

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)
Home.md (new file, 18 lines)
# troost-scraper Wiki

Welcome to the troost-scraper documentation.

## Contents

- [Getting Started](Getting-Started)
- [Architecture](Architecture)
- [Deployment](Deployment)

## Overview

Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Quick Links

- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)