Update wiki documentation with Python/Playwright architecture and setup instructions

Tour
2025-12-03 12:27:16 +01:00
parent 7b370ee380
commit c1e024bcb0
7 changed files with 238 additions and 216 deletions

15
Architect.md Normal file

@@ -0,0 +1,15 @@
# Architecture
## Overview
Document your application architecture here.
## Components
- **Frontend**: Description
- **Backend**: Description
- **Database**: Description
## Diagrams
Add architecture diagrams here.

107
Architecture.md Normal file

@@ -0,0 +1,107 @@
# Architecture
## Overview
The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Core Components
### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
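A minimal sketch of this fetch step, assuming the async Playwright API; `fetch_page` is a hypothetical helper, not the function in `main.py`:
```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Fetch a fully rendered page with headless Chromium (illustrative sketch)."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits for network activity to settle, so the
        # JavaScript-rendered (Next.js) content is present in the DOM
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

if __name__ == "__main__":
    html = asyncio.run(fetch_page("https://www.troostwijkauctions.com/auctions?page=1"))
    print(len(html))
```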
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
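A minimal sketch of this component, assuming a single `pages` table with URL, content, status code, and fetch-timestamp columns; the real schema in `main.py` may differ:
```python
import sqlite3
import time
from typing import Optional

class CacheManager:
    """Illustrative SQLite page cache; table and column names are assumptions."""

    def __init__(self, db_path: str = "cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content TEXT, status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 168) -> Optional[str]:
        """Return cached content if it is newer than max_age_hours (default: 7 days)."""
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM pages WHERE url = ? AND fetched_at > ?", (url, cutoff)
        ).fetchone()
        return row[0] if row else None

    def set(self, url: str, content: str, status_code: int = 200) -> None:
        """Insert or refresh a cached page."""
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: float = 168) -> None:
        """Drop entries older than max_age_hours."""
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()
```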
### 3. **Rate Limiter**
- Enforces exactly 0.5 seconds between requests
- Prevents server overload
- Tracks last request time globally
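A minimal sketch of that throttle, assuming an asyncio event loop and a module-level timestamp; names here are illustrative:
```python
import asyncio
import time

RATE_LIMIT_SECONDS = 0.5    # same value the configuration section documents
_last_request_time = 0.0    # global "last request" timestamp

async def rate_limit() -> None:
    """Sleep just long enough to keep at least 0.5 s between consecutive requests."""
    global _last_request_time
    elapsed = time.monotonic() - _last_request_time
    if elapsed < RATE_LIMIT_SECONDS:
        await asyncio.sleep(RATE_LIMIT_SECONDS - elapsed)
    _last_request_time = time.monotonic()
```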
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
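A minimal sketch of the primary path, assuming the standard Next.js `<script id="__NEXT_DATA__" type="application/json">` tag; returning `None` lets callers fall back to regex-based HTML matching:
```python
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_nextjs_data(content: str) -> Optional[dict]:
    """Parse the embedded Next.js JSON blob out of a rendered page."""
    match = NEXT_DATA_RE.search(content)
    if not match:
        return None  # caller falls back to HTML pattern matching
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```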
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
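A minimal sketch of the export step, assuming the lots arrive as a flat list of dicts; filenames follow the timestamped pattern shown in the file structure below:
```python
import csv
import json
from datetime import datetime
from pathlib import Path

def save_final_results(lots: list, output_dir: str = "output") -> None:
    """Write timestamped JSON and CSV exports (illustrative sketch)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Complete data as JSON
    (out / f"troostwijk_lots_final_{stamp}.json").write_text(
        json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Same data flattened to CSV; field names are taken from the records themselves
    fieldnames = sorted({key for lot in lots for key in lot})
    with open(out / f"troostwijk_lots_final_{stamp}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(lots)
```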
## Data Flow
```
1. Listing Pages → Extract lot URLs → Store in memory
2. For each lot URL → Check cache → If cached: use cached content
                                  → If not cached: fetch with rate limit
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
4. Every 10 lots → Save progress checkpoint
5. All lots complete → Export final JSON + CSV
```
## Key Design Decisions
### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs
### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing
### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources
## File Structure
```
troost-scraper/
├── main.py # Main scraper logic
├── requirements.txt # Python dependencies
├── README.md # Documentation
├── .gitignore # Git exclusions
└── output/ # Generated files (not in git)
├── cache.db # SQLite cache
├── *_partial_*.json # Progress checkpoints
├── *_final_*.json # Final JSON output
└── *_final_*.csv # Final CSV output
```
## Classes
### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries
### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
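A hypothetical usage sketch showing how these pieces might be wired together; it assumes both classes live in `main.py` and that `TroostwijkScraper` takes no constructor arguments:
```python
import asyncio

# Assumed import path and signatures; adjust to the real code in main.py.
from main import CacheManager, TroostwijkScraper

async def run() -> None:
    cache = CacheManager("output/cache.db")      # persistent page cache
    cache.clear_old(max_age_hours=168)           # drop entries older than 7 days
    scraper = TroostwijkScraper()
    await scraper.crawl_auctions(max_pages=50)   # collect URLs, scrape lots, export JSON + CSV

if __name__ == "__main__":
    asyncio.run(run())
```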
## Scalability Notes
- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O

2
Architecture.md.url Normal file

@@ -0,0 +1,2 @@
[InternetShortcut]
URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md

27
Deployment.md Normal file

@@ -0,0 +1,27 @@
# Deployment
## Automatic Deployment
This project uses automatic deployment via git hooks.
### Pipeline
1. Push to `main` branch
2. Git hook triggers
3. Code syncs to `/opt/apps/troost-scraper`
4. Docker builds and deploys
5. Container starts automatically
### Manual Deployment
```bash
sudo apps:deploy troost-scraper
```
## Monitoring
View logs:
```bash
tail -f /home/git/logs/apps:deploy-troost-scraper.log
docker logs troost-scraper
```

71
Getting-Started.md Normal file

@@ -0,0 +1,71 @@
# Getting Started
## Prerequisites
- Python 3.8+
- Git
- pip (Python package manager)
## Installation
### 1. Clone the repository
```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
### 3. Install Playwright browsers
```bash
playwright install chromium
```
## Configuration
Edit the configuration in `main.py`:
```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db" # Path to cache database
OUTPUT_DIR = "/path/to/output" # Output directory
RATE_LIMIT_SECONDS = 0.5 # Delay between requests
MAX_PAGES = 50 # Number of listing pages
```
**Windows users:** Use paths like `C:\\output\\cache.db`
## Usage
### Basic scraping
```bash
python main.py
```
This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs
### Test mode
Debug extraction on a specific URL:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output
The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)

14
Home.md Normal file

@@ -0,0 +1,14 @@
# troost-scraper Wiki
Welcome to the troost-scraper documentation.
## Contents
- [Getting Started](Getting-Started.md)
- [Architecture](Architecture.md)
- [Deployment](Deployment.md)
## Quick Links
- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)

218
README.md

@@ -1,217 +1,3 @@
# troost-scraper-wiki
Wiki for troost-scraper

# Troostwijk Auctions Scraper
A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.
## Features
- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages
## Requirements
- Python 3.8+
- Playwright (with Chromium browser)
## Installation
1. **Clone or download this project**
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
3. **Install Playwright browsers:**
```bash
playwright install chromium
```
## Configuration
Edit the configuration variables in `main.py`:
```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db" # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output" # Output directory
RATE_LIMIT_SECONDS = 0.5 # Delay between requests
MAX_PAGES = 50 # Number of listing pages to crawl
```
**Note:** Update the paths to match your system (especially on Windows, use paths like `C:\\output\\cache.db`).
## Usage
### Basic Scraping
Run the scraper to collect auction lot data:
```bash
python main.py
```
This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching
### Test Mode
Test extraction patterns on a specific cached URL:
```bash
# Test with default URL
python main.py --test
# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```
This is useful for debugging extraction patterns and verifying data is being extracted correctly.
## Output Files
The scraper generates the following files:
### During Execution
- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)
### Final Output
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format
### Cache
- `cache.db` - SQLite database with cached page content (persistent across runs)
## Data Extracted
For each auction lot, the scraper extracts:
- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
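For orientation, one record might look like the following; every value is a placeholder and the exact key names in the real output may differ:
```python
example_lot = {
    "url": "https://www.troostwijkauctions.com/a/example-lot",  # placeholder URL
    "lot_id": "A7-35847",                 # ID format from the list above; value illustrative
    "title": "Example lot title",
    "current_bid": "€ 100",               # placeholder amount
    "bid_count": 3,
    "end_date": "2025-01-01T12:00:00",
    "location": "Example city, NL",
    "description": "Placeholder description",
    "category": "Example category",
    "images": ["https://example.com/1.jpg"],  # up to 5 image URLs
    "scraped_at": "2025-01-01T10:00:00",
}
```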
## How It Works
### Phase 1: Collect Lot URLs
The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.
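A minimal sketch of this phase, assuming lot links use the `/a/...` path seen in the test URLs elsewhere in this README; the actual link pattern in `main.py` may differ:
```python
import re

LOT_LINK_RE = re.compile(r'href="(/a/[^"]+)"')

def collect_lot_urls(listing_html: str,
                     base_url: str = "https://www.troostwijkauctions.com") -> list:
    """Pull lot URLs out of one rendered listing page (link pattern is assumed)."""
    urls = []
    for path in LOT_LINK_RE.findall(listing_html):
        url = base_url + path
        if url not in urls:   # preserve order, drop duplicates
            urls.append(url)
    return urls
```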
### Phase 2: Scrape Individual Lots
Each lot page is visited and data is extracted from the embedded JSON data (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast.
### Caching Strategy
- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
### Rate Limiting
- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
## Troubleshooting
### Issue: "Huidig bod" / "Locatie" instead of actual values
**✓ FIXED!** The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON. The scraper now automatically extracts data from JSON first, falling back to HTML pattern matching only if needed.
The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`
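Mapped onto the parsed `__NEXT_DATA__` dict, the lookup might look roughly like this; the `props.pageProps` nesting and the `city` sub-key are assumptions, and only the field names listed above come from the scraper:
```python
def map_auction_fields(next_data: dict) -> dict:
    """Illustrative mapping from __NEXT_DATA__ to output fields; nesting is assumed."""
    auction = next_data.get("props", {}).get("pageProps", {}).get("auction", {})
    days = auction.get("viewingDays") or auction.get("collectionDays") or [{}]
    return {
        "title": auction.get("name"),
        "lot_id": auction.get("displayId"),
        "end_date": auction.get("minEndDate"),
        "image": (auction.get("image") or {}).get("url"),
        "location": days[0].get("city"),   # "city" is an assumed sub-key
    }
```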
To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```
**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
### Issue: No lots found
- Check if the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database
### Issue: Cloudflare blocking
- Playwright should bypass this automatically
- If issues persist, try adjusting user agent or headers in `crawl_auctions()`
### Issue: Slow scraping
- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (not recommended below 0.5s)
- First run will be slower; subsequent runs use cache
## Project Structure
```
troost-scraper/
├── main.py # Main scraper script
├── requirements.txt # Python dependencies
├── README.md # This file
└── output/ # Generated output files (created automatically)
├── cache.db # SQLite cache
├── *.json # JSON output files
└── *.csv # CSV output files
```
## Development
### Adding New Extraction Fields
1. Add extraction method in `TroostwijkScraper` class:
```python
def _extract_new_field(self, content: str) -> str:
    pattern = r'your-regex-pattern'
    match = re.search(pattern, content)
    return match.group(1) if match else ""
```
2. Add field to `_parse_lot_page()`:
```python
data = {
    # ... existing fields ...
    'new_field': self._extract_new_field(content),
}
```
3. Add field to CSV export in `save_final_results()`:
```python
fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
```
### Testing Extraction Patterns
Use test mode to verify patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```
## License
This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.
## Notes
- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** Website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days