# Getting Started

## Prerequisites

- Python 3.8+ (a quick version check is sketched below)
- Git
- pip (Python package manager)
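
Before installing anything, you can confirm your interpreter meets the version floor (a minimal check, not part of the project itself):

```python
# Quick sanity check for the Python 3.8+ prerequisite listed above.
import sys

if sys.version_info < (3, 8):
    raise SystemExit(f"Python 3.8+ required, found {sys.version.split()[0]}")
print("Python version OK")
```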
## Installation

### 1. Clone the repository

```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```
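
If you already cloned without `--recurse-submodules`, you can fetch the submodules afterwards with `git submodule update --init --recursive`.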
### 2. Install dependencies

```bash
pip install -r requirements.txt
```
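
To keep these dependencies isolated from your system Python, consider installing them inside a virtual environment (`python -m venv .venv`, then activate it before running `pip install`).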
### 3. Install Playwright browsers

```bash
playwright install chromium
```
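
This downloads a Chromium build into Playwright's browser cache. If you later upgrade the `playwright` package, run the command again so the downloaded browser matches the library version.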
## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"    # Path to the SQLite cache database
OUTPUT_DIR = "/path/to/output"    # Directory for JSON/CSV output
RATE_LIMIT_SECONDS = 0.5          # Delay between requests, in seconds
MAX_PAGES = 50                    # Maximum number of listing pages to crawl
```
**Windows users:** escape the backslashes in path strings (e.g. `CACHE_DB = "C:\\output\\cache.db"`) or use forward slashes (`"C:/output/cache.db"`), which Python accepts on Windows too.
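
If you would rather avoid escaping entirely, `pathlib` is an alternative (a sketch; the config values above are plain strings, and `str()` keeps them that way):

```python
# Hypothetical pathlib variant of the Windows paths; main.py uses plain strings.
from pathlib import Path

CACHE_DB = str(Path("C:/output") / "cache.db")
OUTPUT_DIR = str(Path("C:/output"))
```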
## Usage

### Basic scraping

```bash
python main.py
```

This will:

1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs (the caching pattern is sketched below)
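
The get-or-fetch behaviour behind step 4 can be pictured with a minimal sketch (hypothetical names; the real scraper renders pages with Playwright's Chromium rather than `urllib`, but the idea is the same):

```python
# Minimal sketch of a get-or-fetch page cache with rate limiting.
# Illustrative only -- not the actual implementation in main.py.
import sqlite3
import time
import urllib.request

CACHE_DB = "cache.db"
RATE_LIMIT_SECONDS = 0.5

conn = sqlite3.connect(CACHE_DB)
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

def get_page(url: str) -> str:
    """Return cached HTML when available; otherwise fetch, store, and rate-limit."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row:
        return row[0]  # cache hit: no network request, no delay
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()
    time.sleep(RATE_LIMIT_SECONDS)  # be polite between real requests
    return html
```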
### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output

The scraper generates:

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete scraped lot data (see the loading snippet below)
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export of the same data
- `cache.db` - SQLite page cache (persists across runs)
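
Because the filenames are timestamped, picking up the most recent JSON export takes one extra step (a hypothetical helper, assuming the file holds a list of lot records):

```python
# Hypothetical helper: load the newest JSON export from OUTPUT_DIR.
import glob
import json
import os

OUTPUT_DIR = "/path/to/output"  # same value as configured in main.py

candidates = glob.glob(os.path.join(OUTPUT_DIR, "troostwijk_lots_final_*.json"))
latest = max(candidates, key=os.path.getmtime)  # most recently written export
with open(latest, encoding="utf-8") as f:
    lots = json.load(f)
print(f"Loaded {len(lots)} lots from {latest}")
```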