Update wiki documentation with Python/Playwright architecture and setup instructions
15
Architect.md
Normal file
@@ -0,0 +1,15 @@
# Architecture

## Overview

Document your application architecture here.

## Components

- **Frontend**: Description
- **Backend**: Description
- **Database**: Description

## Diagrams

Add architecture diagrams here.
107
Architecture.md
Normal file
@@ -0,0 +1,107 @@
# Architecture

## Overview

The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Core Components

### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
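A minimal sketch of what this layer does, using Playwright's async API; the function name and wait strategy below are illustrative, not lifted from `main.py`:

```python
# Illustrative sketch only; the real implementation lives in main.py.
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Fetch a fully rendered page with headless Chromium."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Wait for network activity to settle so Next.js has rendered its data
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

# e.g. html = asyncio.run(fetch_page("https://www.troostwijkauctions.com/auctions"))
```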
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
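A rough sketch of the caching approach using the standard `sqlite3` module; the table name and schema below are assumptions for illustration (the real `CacheManager` interface is listed under Classes):

```python
# Illustrative sketch; actual schema and cleanup logic live in main.py.
import sqlite3
import time
from typing import Optional

class CacheManager:
    def __init__(self, db_path: str = "cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages "
            "(url TEXT PRIMARY KEY, content TEXT, status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 168) -> Optional[str]:
        """Return cached content if it is newer than max_age_hours (default: 7 days)."""
        row = self.conn.execute(
            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < max_age_hours * 3600:
            return row[0]
        return None

    def set(self, url: str, content: str, status_code: int = 200) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()
```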
### 3. **Rate Limiter**
- Enforces exactly 0.5 seconds between requests
- Prevents server overload
- Tracks last request time globally
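The rate limiter can be as small as the sketch below (async, since the scraper uses async/await throughout); the module-level variable is illustrative:

```python
# Illustrative sketch: guarantee at least RATE_LIMIT_SECONDS between requests.
import asyncio
import time

RATE_LIMIT_SECONDS = 0.5
_last_request_time = 0.0

async def rate_limit() -> None:
    """Sleep just long enough to keep requests RATE_LIMIT_SECONDS apart."""
    global _last_request_time
    elapsed = time.time() - _last_request_time
    if elapsed < RATE_LIMIT_SECONDS:
        await asyncio.sleep(RATE_LIMIT_SECONDS - elapsed)
    _last_request_time = time.time()
```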
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
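Next.js pages embed their state in a `<script id="__NEXT_DATA__" type="application/json">` tag, so the primary path can be sketched as below; the exact JSON paths used for each field live in `_parse_lot_page()` and are not reproduced here:

```python
# Illustrative sketch of the primary extraction path.
import json
import re
from typing import Optional

def extract_nextjs_data(content: str) -> Optional[dict]:
    """Pull the embedded __NEXT_DATA__ JSON out of a rendered Next.js page."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        content,
        re.DOTALL,
    )
    if not match:
        return None  # caller falls back to HTML pattern matching
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```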
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
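Export needs nothing beyond the standard library; a minimal sketch, assuming the filename pattern documented under Output in Getting Started and a flat list of per-lot dictionaries:

```python
# Illustrative sketch of timestamped JSON + CSV export.
import csv
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, List

def save_results(lots: List[Dict], output_dir: str = "output") -> None:
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Complete data as JSON
    with open(out / f"troostwijk_lots_final_{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(lots, f, ensure_ascii=False, indent=2)

    # Same data as CSV, one row per lot
    if lots:
        with open(out / f"troostwijk_lots_final_{stamp}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(lots[0].keys()))
            writer.writeheader()
            writer.writerows(lots)
```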
## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
        ↓
2. For each lot URL → Check cache → If cached: use cached content
        ↓ If not: fetch with rate limit
        ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
        ↓
4. Every 10 lots → Save progress checkpoint
        ↓
5. All lots complete → Export final JSON + CSV
```
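Put together, the flow above corresponds roughly to the loop below. It reuses the illustrative helpers sketched under Core Components (`fetch_page`, `rate_limit`, `extract_nextjs_data`, `save_results`) and assumes `cache` is a `CacheManager` instance; the real orchestration lives in `TroostwijkScraper.crawl_auctions()`:

```python
# Illustrative orchestration sketch, not the actual crawl_auctions() implementation.
async def scrape_lots(lot_urls: list) -> list:
    results = []
    for i, url in enumerate(lot_urls, start=1):
        content = cache.get(url)              # 2. check the cache first
        if content is None:
            await rate_limit()                #    rate-limited fetch on a cache miss
            content = await fetch_page(url)
            cache.set(url, content)
        data = extract_nextjs_data(content)   # 3. parse __NEXT_DATA__
        if data:
            results.append(data)
        if i % 10 == 0:                       # 4. checkpoint every 10 lots
            save_results(results)
    save_results(results)                     # 5. final JSON + CSV export
    return results
```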
## Key Design Decisions
### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing

### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources
## File Structure
```
troost-scraper/
├── main.py              # Main scraper logic
├── requirements.txt     # Python dependencies
├── README.md            # Documentation
├── .gitignore           # Git exclusions
└── output/              # Generated files (not in git)
    ├── cache.db         # SQLite cache
    ├── *_partial_*.json # Progress checkpoints
    ├── *_final_*.json   # Final JSON output
    └── *_final_*.csv    # Final CSV output
```
## Classes
### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries

### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
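Because these methods are async, the scraper is presumably driven with `asyncio`; a hedged usage sketch (the constructor signature is an assumption):

```python
# Illustrative usage sketch; see main.py for the real entry point.
import asyncio
from main import TroostwijkScraper  # assumes the class is importable from main.py

async def run() -> None:
    scraper = TroostwijkScraper()
    await scraper.crawl_auctions(max_pages=50)

if __name__ == "__main__":
    asyncio.run(run())
```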
## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O
2
Architecture.md.url
Normal file
@@ -0,0 +1,2 @@
[InternetShortcut]
URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md
27
Deployment.md
Normal file
@@ -0,0 +1,27 @@
# Deployment

## Automatic Deployment

This project uses automatic deployment via git hooks.

### Pipeline

1. Push to `main` branch
2. Git hook triggers
3. Code syncs to `/opt/apps/troost-scraper`
4. Docker builds and deploys
5. Container starts automatically

### Manual Deployment

```bash
sudo apps:deploy troost-scraper
```

## Monitoring

View logs:

```bash
tail -f /home/git/logs/apps:deploy-troost-scraper.log
docker logs troost-scraper
```
71
Getting-Started.md
Normal file
@@ -0,0 +1,71 @@
# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation

### 1. Clone the repository

```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Install Playwright browsers

```bash
playwright install chromium
```

## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"    # Path to cache database
OUTPUT_DIR = "/path/to/output"    # Output directory
RATE_LIMIT_SECONDS = 0.5          # Delay between requests
MAX_PAGES = 50                    # Number of listing pages
```

**Windows users:** Use paths like `C:\\output\\cache.db`

## Usage

### Basic scraping

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs

### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```

## Output

The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)
14
Home.md
Normal file
@@ -0,0 +1,14 @@
# troost-scraper Wiki

Welcome to the troost-scraper documentation.

## Contents

- [Getting Started](Getting-Started.md)
- [Architecture](Architecture.md)
- [Deployment](Deployment.md)

## Quick Links

- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)
218
README.md
@@ -1,217 +1,3 @@
# Troostwijk Auctions Scraper
# troost-scraper-wiki

A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.
Wiki for troost-scraper
## Features

- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages

## Requirements

- Python 3.8+
- Playwright (with Chromium browser)
## Installation

1. **Clone or download this project**

2. **Install dependencies:**
```bash
pip install -r requirements.txt
```

3. **Install Playwright browsers:**
```bash
playwright install chromium
```

## Configuration

Edit the configuration variables in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db"   # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output"          # Output directory
RATE_LIMIT_SECONDS = 0.5                       # Delay between requests
MAX_PAGES = 50                                 # Number of listing pages to crawl
```

**Note:** Update the paths to match your system (especially on Windows, use paths like `C:\\output\\cache.db`).
## Usage

### Basic Scraping

Run the scraper to collect auction lot data:

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching

### Test Mode

Test extraction patterns on a specific cached URL:

```bash
# Test with default URL
python main.py --test

# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```

This is useful for debugging extraction patterns and verifying data is being extracted correctly.

## Output Files

The scraper generates the following files:

### During Execution
- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)

### Final Output
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format

### Cache
- `cache.db` - SQLite database with cached page content (persistent across runs)
## Data Extracted

For each auction lot, the scraper extracts:

- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection

## How It Works

### Phase 1: Collect Lot URLs
The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.

### Phase 2: Scrape Individual Lots
Each lot page is visited and data is extracted from the embedded JSON data (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast.

### Caching Strategy
- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying

### Rate Limiting
- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
## Troubleshooting

### Issue: "Huidig bod" / "Locatie" instead of actual values

**✓ FIXED!** The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON. The scraper now automatically extracts data from JSON first, falling back to HTML pattern matching only if needed.

The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`

To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```

**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.

### Issue: No lots found

- Check if the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database

### Issue: Cloudflare blocking

- Playwright should bypass this automatically
- If issues persist, try adjusting user agent or headers in `crawl_auctions()`

### Issue: Slow scraping

- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (not recommended below 0.5s)
- First run will be slower; subsequent runs use cache
## Project Structure

```
troost-scraper/
├── main.py              # Main scraper script
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── output/              # Generated output files (created automatically)
    ├── cache.db         # SQLite cache
    ├── *.json           # JSON output files
    └── *.csv            # CSV output files
```
## Development

### Adding New Extraction Fields

1. Add extraction method in `TroostwijkScraper` class:
```python
def _extract_new_field(self, content: str) -> str:
    pattern = r'your-regex-pattern'
    match = re.search(pattern, content)
    return match.group(1) if match else ""
```

2. Add field to `_parse_lot_page()`:
```python
data = {
    # ... existing fields ...
    'new_field': self._extract_new_field(content),
}
```

3. Add field to CSV export in `save_final_results()`:
```python
fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
```

### Testing Extraction Patterns

Use test mode to verify patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```
## License

This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.

## Notes

- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** Website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days