# Getting Started
## Prerequisites
- Python 3.8+
- Git
- pip (Python package manager)
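A quick way to verify the prerequisites from Python. This helper is illustrative only and not part of the scraper:

```python
# Minimal sketch: report which prerequisites are missing.
import shutil
import sys

def check_prerequisites():
    """Return a list of missing prerequisites (empty list means all present)."""
    missing = []
    if sys.version_info < (3, 8):
        missing.append("Python 3.8+")
    if shutil.which("git") is None:
        missing.append("Git")
    if shutil.which("pip") is None and shutil.which("pip3") is None:
        missing.append("pip")
    return missing
```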
## Installation
### 1. Clone the repository
```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
### 3. Install Playwright browsers
```bash
playwright install chromium
```
## Configuration
Edit the configuration in `main.py`:
```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"  # Path to the SQLite cache database
OUTPUT_DIR = "/path/to/output"  # Directory for JSON/CSV output
RATE_LIMIT_SECONDS = 0.5        # Delay between requests, in seconds
MAX_PAGES = 50                  # Maximum number of listing pages to crawl
```
**Windows users:** Escape backslashes in string paths, e.g. `C:\\output\\cache.db`
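Alternatively, the backslash-escaping issue can be avoided with `pathlib`, which handles separators on both Linux and Windows. A minimal sketch, assuming the same setting names as in `main.py` (the `output` directory is a placeholder):

```python
from pathlib import Path

# Path objects render with the correct separator per OS,
# so no double backslashes are needed on Windows.
BASE_URL = "https://www.troostwijkauctions.com"
OUTPUT_DIR = Path("output")            # placeholder directory
CACHE_DB = OUTPUT_DIR / "cache.db"     # becomes e.g. output\cache.db on Windows
RATE_LIMIT_SECONDS = 0.5
MAX_PAGES = 50
```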
## Usage
### Basic scraping
```bash
python main.py
```
This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs
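The four steps above boil down to one collect-scrape-save loop. The sketch below illustrates that flow only; `collect_lot_urls` and `scrape_lot` are hypothetical stand-ins returning canned data, not the real logic in `main.py`:

```python
import csv
import json
from pathlib import Path

def collect_lot_urls(max_pages):
    """Stand-in for crawling listing pages; returns fabricated lot URLs."""
    return [f"https://www.troostwijkauctions.com/a/lot-{i}" for i in range(max_pages)]

def scrape_lot(url):
    """Stand-in for scraping one lot page (real code would rate-limit here)."""
    return {"url": url, "title": "example lot"}

def run(output_dir, max_pages=2):
    """Collect URLs, scrape each lot, and save JSON and CSV exports."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    lots = [scrape_lot(u) for u in collect_lot_urls(max_pages)]
    (output_dir / "lots.json").write_text(json.dumps(lots, indent=2))
    with (output_dir / "lots.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(lots)
    return lots
```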
### Test mode
Debug extraction on a specific URL:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output
The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)
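Because the export filenames embed a `YYYYMMDD_HHMMSS` timestamp, the newest run sorts last lexicographically. A small sketch for picking up the latest export and peeking at the cache; the cache schema is not documented here, so only table names are listed, and both function names are illustrative:

```python
import sqlite3
from pathlib import Path

def latest_export(output_dir, suffix=".json"):
    """Return the most recent troostwijk_lots_final_* file, or None if absent."""
    files = sorted(Path(output_dir).glob(f"troostwijk_lots_final_*{suffix}"))
    return files[-1] if files else None

def cache_tables(cache_db):
    """List table names in the SQLite cache without assuming its schema."""
    with sqlite3.connect(cache_db) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    return [name for (name,) in rows]
```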