Architect.md

# Architecture

## Overview

Document your application architecture here.

## Components

- **Frontend**: Description
- **Backend**: Description
- **Database**: Description

## Diagrams

Add architecture diagrams here.

Architecture.md

# Architecture

## Overview

The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.

## Core Components

### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
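
As a rough illustration of the fetch step above, a Playwright call might look like the following minimal sketch. The function name, timeout, and wait strategy are assumptions for illustration, not taken from `main.py`:

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Minimal sketch: headless Chromium fetch that waits for network idle."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits until network activity settles, giving the
        # Next.js frontend time to render dynamic content
        await page.goto(url, wait_until="networkidle", timeout=60_000)
        html = await page.content()
        await browser.close()
        return html

# Example: html = asyncio.run(fetch_page("https://www.troostwijkauctions.com"))
```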

### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`

### 3. **Rate Limiter**
- Enforces a minimum delay of 0.5 seconds between requests (see the sketch below)
- Prevents server overload
- Tracks the last request time globally
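
A minimal sketch of such a global rate limiter, assuming the simple "remember the last request time and sleep the remainder" approach; the names here are illustrative:

```python
import asyncio
import time

RATE_LIMIT_SECONDS = 0.5   # matches the configured delay between requests
_last_request_time = 0.0   # module-level state: time of the last request

async def rate_limit() -> None:
    """Sleep just long enough to keep at least RATE_LIMIT_SECONDS between requests."""
    global _last_request_time
    elapsed = time.monotonic() - _last_request_time
    if elapsed < RATE_LIMIT_SECONDS:
        await asyncio.sleep(RATE_LIMIT_SECONDS - elapsed)
    _last_request_time = time.monotonic()
```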

### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
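
Next.js pages embed their state in a `<script id="__NEXT_DATA__" type="application/json">` tag, so the primary method can be a single regex plus `json.loads`. A sketch of the idea follows; the exact attributes on the script tag can vary, and the field paths inside the payload are site-specific and omitted here:

```python
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_nextjs_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if it is missing/invalid."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None  # caller falls back to HTML pattern matching
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```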

### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
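
A sketch of what the final export could look like, assuming each lot is a flat dict. The filename pattern follows the Output section of Getting Started; the helper name itself is illustrative:

```python
import csv
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, List

def save_results(lots: List[Dict], output_dir: str, prefix: str = "troostwijk_lots") -> None:
    """Write timestamped JSON and CSV exports for the scraped lots."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Full-fidelity JSON dump
    (out / f"{prefix}_final_{stamp}.json").write_text(
        json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Flat CSV export; the header is the union of all keys seen
    if lots:
        fieldnames = sorted({key for lot in lots for key in lot})
        with open(out / f"{prefix}_final_{stamp}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(lots)
```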

## Data Flow

```
1. Listing Pages → Extract lot URLs → Store in memory
        ↓
2. For each lot URL → Check cache → If cached: use cached content
        ↓ If not: fetch with rate limit
        ↓
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
        ↓
4. Every 10 lots → Save progress checkpoint
        ↓
5. All lots complete → Export final JSON + CSV
```
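
To make the flow concrete, the per-lot loop can be sketched as a function that is handed the cache, fetch, parse, and checkpoint pieces. Everything here is illustrative and deliberately simplified relative to the real `TroostwijkScraper` methods listed under Classes below:

```python
from typing import Awaitable, Callable, Dict, List, Optional

async def crawl_lots(
    lot_urls: List[str],
    get_cached: Callable[[str], Optional[str]],   # e.g. a CacheManager.get wrapper
    set_cached: Callable[[str, str], None],       # e.g. a CacheManager.set wrapper
    fetch: Callable[[str], Awaitable[str]],       # rate-limited Playwright fetch
    parse: Callable[[str], Optional[Dict]],       # __NEXT_DATA__ extractor
    checkpoint: Callable[[List[Dict]], None],     # writes a *_partial_* file
) -> List[Dict]:
    """Illustrative composition of the data flow above."""
    results: List[Dict] = []
    for i, url in enumerate(lot_urls, start=1):
        html = get_cached(url)           # 2. reuse cached content when available
        if html is None:
            html = await fetch(url)      # 2. otherwise fetch, respecting the rate limit
            set_cached(url, html)
        data = parse(html)               # 3. extract fields from __NEXT_DATA__
        if data is not None:
            results.append(data)
        if i % 10 == 0:                  # 4. progress checkpoint every 10 lots
            checkpoint(results)
    return results                       # 5. caller exports the final JSON + CSV
```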

## Key Design Decisions

### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs

### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing

### Why SQLite caching?
- Persistent across runs
- Reduces load on the target server
- Enables test mode without re-fetching
- Respects website resources

## File Structure

```
troost-scraper/
├── main.py              # Main scraper logic
├── requirements.txt     # Python dependencies
├── README.md            # Documentation
├── .gitignore           # Git exclusions
└── output/              # Generated files (not in git)
    ├── cache.db         # SQLite cache
    ├── *_partial_*.json # Progress checkpoints
    ├── *_final_*.json   # Final JSON output
    └── *_final_*.csv    # Final CSV output
```

## Classes

### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries
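
A condensed sketch of how such a class could be implemented with the standard-library `sqlite3` module; the table name and column layout are assumptions, and the real implementation in `main.py` may differ:

```python
import sqlite3
import time
from typing import Optional

class CacheManager:
    """SQLite-backed page cache keyed by URL (illustrative sketch)."""

    def __init__(self, db_path: str = "cache.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content TEXT, status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 24 * 7) -> Optional[str]:
        row = self.conn.execute(
            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None or time.time() - row[1] > max_age_hours * 3600:
            return None  # missing or stale entry
        return row[0]

    def set(self, url: str, content: str, status_code: int = 200) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: float = 24 * 7) -> None:
        self.conn.execute(
            "DELETE FROM pages WHERE fetched_at < ?",
            (time.time() - max_age_hours * 3600,),
        )
        self.conn.commit()
```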

### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape an individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV

## Scalability Notes

- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs near-instant for unchanged pages
- **Progress checkpoints** allow resuming after an interruption
- **Async/await** is used throughout for non-blocking I/O

A Windows `.url` shortcut in the repository points to the wiki page above:

```
[InternetShortcut]
URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md
```

Deployment.md

# Deployment

## Automatic Deployment

This project uses automatic deployment via git hooks.

### Pipeline

1. Push to `main` branch
2. Git hook triggers
3. Code syncs to `/opt/apps/troost-scraper`
4. Docker builds and deploys
5. Container starts automatically
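
The hook script itself is not part of this wiki. Purely as an illustration of steps 2–5, a post-receive hook could look roughly like the sketch below; the bare-repo path, image name, and container flags are hypothetical, and the real pipeline goes through `apps:deploy`:

```python
#!/usr/bin/env python3
"""Hypothetical post-receive hook: sync the checkout and rebuild the container."""
import subprocess

APP_DIR = "/opt/apps/troost-scraper"                      # target path from step 3
REPO_DIR = "/home/git/repositories/troost-scraper.git"    # hypothetical bare repo

# Step 3: check out the pushed main branch into the app directory
subprocess.run(["git", "--work-tree", APP_DIR, "--git-dir", REPO_DIR,
                "checkout", "-f", "main"], check=True)

# Steps 4-5: rebuild the image and restart the container
subprocess.run(["docker", "build", "-t", "troost-scraper", APP_DIR], check=True)
subprocess.run(["docker", "rm", "-f", "troost-scraper"], check=False)
subprocess.run(["docker", "run", "-d", "--name", "troost-scraper", "troost-scraper"],
               check=True)
```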

### Manual Deployment

```bash
sudo apps:deploy troost-scraper
```

## Monitoring

View logs:

```bash
tail -f /home/git/logs/apps:deploy-troost-scraper.log
docker logs troost-scraper
```

Getting-Started.md

# Getting Started

## Prerequisites

- Python 3.8+
- Git
- pip (Python package manager)

## Installation

### 1. Clone the repository

```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Install Playwright browsers

```bash
playwright install chromium
```

## Configuration

Edit the configuration in `main.py`:

```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db"    # Path to cache database
OUTPUT_DIR = "/path/to/output"    # Output directory
RATE_LIMIT_SECONDS = 0.5          # Delay between requests
MAX_PAGES = 50                    # Number of listing pages
```

**Windows users:** Use paths like `C:\\output\\cache.db`

## Usage

### Basic scraping

```bash
python main.py
```

This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs

### Test mode

Debug extraction on a specific URL:

```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```

## Output

The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)

Home.md

# troost-scraper Wiki

Welcome to the troost-scraper documentation.

## Contents

- [Getting Started](Getting-Started.md)
- [Architecture](Architecture.md)
- [Deployment](Deployment.md)

## Quick Links

- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)

Welcome to the wiki.