# Getting Started

## Prerequisites

- Python 3.8+ (a quick version check is sketched below)
- Git
- pip (Python package manager)
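
Before installing anything, you can confirm your interpreter meets the version floor (a minimal check, not part of the project itself):

```python
# Quick sanity check for the Python 3.8+ prerequisite listed above.
import sys

if sys.version_info < (3, 8):
    raise SystemExit(f"Python 3.8+ required, found {sys.version.split()[0]}")
print("Python version OK")
```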
## Installation

### 1. Clone the repository

```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```
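
If you already cloned without `--recurse-submodules`, you can fetch the submodules afterwards with `git submodule update --init --recursive`.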
### 2. Install dependencies

```bash
pip install -r requirements.txt
```
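
To keep these dependencies isolated from your system Python, consider installing them inside a virtual environment (`python -m venv .venv`, then activate it before running `pip install`).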
### 3. Install Playwright browsers

```bash
playwright install chromium
```
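
This downloads a Chromium build into Playwright's browser cache. If you later upgrade the `playwright` package, run the command again so the downloaded browser matches the library version.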
## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"    # Path to the SQLite cache database
OUTPUT_DIR = "/path/to/output"    # Directory for JSON/CSV output
RATE_LIMIT_SECONDS = 0.5          # Delay between requests, in seconds
MAX_PAGES = 50                    # Maximum number of listing pages to crawl
```
**Windows users:** escape the backslashes in path strings (e.g. `CACHE_DB = "C:\\output\\cache.db"`) or use forward slashes (`"C:/output/cache.db"`), which Python accepts on Windows too.
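
If you would rather avoid escaping entirely, `pathlib` is an alternative (a sketch; the config values above are plain strings, and `str()` keeps them that way):

```python
# Hypothetical pathlib variant of the Windows paths; main.py uses plain strings.
from pathlib import Path

CACHE_DB = str(Path("C:/output") / "cache.db")
OUTPUT_DIR = str(Path("C:/output"))
```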
## Usage

### Basic scraping

```bash
python main.py
```

This will:

1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs (the caching pattern is sketched below)
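
The get-or-fetch behaviour behind step 4 can be pictured with a minimal sketch (hypothetical names; the real scraper renders pages with Playwright's Chromium rather than `urllib`, but the idea is the same):

```python
# Minimal sketch of a get-or-fetch page cache with rate limiting.
# Illustrative only -- not the actual implementation in main.py.
import sqlite3
import time
import urllib.request

CACHE_DB = "cache.db"
RATE_LIMIT_SECONDS = 0.5

conn = sqlite3.connect(CACHE_DB)
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

def get_page(url: str) -> str:
    """Return cached HTML when available; otherwise fetch, store, and rate-limit."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row:
        return row[0]  # cache hit: no network request, no delay
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()
    time.sleep(RATE_LIMIT_SECONDS)  # be polite between real requests
    return html
```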
### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output

The scraper generates:

- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete scraped lot data (see the loading snippet below)
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export of the same data
- `cache.db` - SQLite page cache (persists across runs)
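
Because the filenames are timestamped, picking up the most recent JSON export takes one extra step (a hypothetical helper, assuming the file holds a list of lot records):

```python
# Hypothetical helper: load the newest JSON export from OUTPUT_DIR.
import glob
import json
import os

OUTPUT_DIR = "/path/to/output"  # same value as configured in main.py

candidates = glob.glob(os.path.join(OUTPUT_DIR, "troostwijk_lots_final_*.json"))
latest = max(candidates, key=os.path.getmtime)  # most recently written export
with open(latest, encoding="utf-8") as f:
    lots = json.load(f)
print(f"Loaded {len(lots)} lots from {latest}")
```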