From 87212cd61205e5ab60b77539ecf0fbcb1409f021 Mon Sep 17 00:00:00 2001
From: Tour
Date: Wed, 3 Dec 2025 12:29:05 +0100
Subject: [PATCH] all

---
 Architect.md        |  15 -------
 Architecture.md     | 107 --------------------------------------------
 Architecture.md.url |   2 -
 Deployment.md       |  27 -----------
 Getting-Started.md  |  71 -----------------------------
 Home.md             |  14 ------
 Home2223.md         |   1 -
 7 files changed, 237 deletions(-)
 delete mode 100644 Architect.md
 delete mode 100644 Architecture.md
 delete mode 100644 Architecture.md.url
 delete mode 100644 Deployment.md
 delete mode 100644 Getting-Started.md
 delete mode 100644 Home.md
 delete mode 100644 Home2223.md

diff --git a/Architect.md b/Architect.md
deleted file mode 100644
index 5c7d6dc..0000000
--- a/Architect.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Architecture1
-
-## Overview
-
-Document your application architecture here.
-
-## Components
-
-- **Frontend**: Description
-- **Backend**: Description
-- **Database**: Description
-
-## Diagrams
-
-Add architecture diagrams here.
diff --git a/Architecture.md b/Architecture.md
deleted file mode 100644
index 2b6d3f7..0000000
--- a/Architecture.md
+++ /dev/null
@@ -1,107 +0,0 @@
-# Architecture
-
-## Overview
-
-The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
-
-## Core Components
-
-### 1. **Browser Automation (Playwright)**
-- Launches Chromium browser in headless mode
-- Bypasses Cloudflare protection
-- Handles dynamic content rendering
-- Supports network idle detection
-
-### 2. **Cache Manager (SQLite)**
-- Caches every fetched page
-- Prevents redundant requests
-- Stores page content, timestamps, and status codes
-- Auto-cleans entries older than 7 days
-- Database: `cache.db`
-
-### 3. **Rate Limiter**
-- Enforces exactly 0.5 seconds between requests
-- Prevents server overload
-- Tracks last request time globally
-
-### 4. **Data Extractor**
-- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
-- **Fallback method:** HTML pattern matching with regex
-- Extracts: title, location, bid info, dates, images, descriptions
-
-### 5. **Output Manager**
-- Exports data in JSON and CSV formats
-- Saves progress checkpoints every 10 lots
-- Timestamped filenames for tracking
-
-## Data Flow
-
-```
-1. Listing Pages → Extract lot URLs → Store in memory
-   ↓
-2. For each lot URL → Check cache → If cached: use cached content
-   ↓                                If not: fetch with rate limit
-   ↓
-3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
-   ↓
-4. Every 10 lots → Save progress checkpoint
-   ↓
-5. All lots complete → Export final JSON + CSV
-```
-
-## Key Design Decisions
-
-### Why Playwright?
-- Handles JavaScript-rendered content (Next.js)
-- Bypasses Cloudflare protection
-- More reliable than requests/BeautifulSoup for modern SPAs
-
-### Why JSON extraction?
-- Site uses Next.js with embedded `__NEXT_DATA__`
-- JSON is more reliable than HTML pattern matching
-- Avoids breaking when HTML/CSS changes
-- Faster parsing
-
-### Why SQLite caching?
-- Persistent across runs
-- Reduces load on target server
-- Enables test mode without re-fetching
-- Respects website resources
-
-## File Structure
-
-```
-troost-scraper/
-├── main.py              # Main scraper logic
-├── requirements.txt     # Python dependencies
-├── README.md            # Documentation
-├── .gitignore           # Git exclusions
-└── output/              # Generated files (not in git)
-    ├── cache.db         # SQLite cache
-    ├── *_partial_*.json # Progress checkpoints
-    ├── *_final_*.json   # Final JSON output
-    └── *_final_*.csv    # Final CSV output
-```
-
-## Classes
-
-### `CacheManager`
-- `__init__(db_path)` - Initialize cache database
-- `get(url, max_age_hours)` - Retrieve cached page
-- `set(url, content, status_code)` - Cache a page
-- `clear_old(max_age_hours)` - Remove old entries
-
-### `TroostwijkScraper`
-- `crawl_auctions(max_pages)` - Main entry point
-- `crawl_listing_page(page, page_num)` - Extract lot URLs
-- `crawl_lot(page, url)` - Scrape individual lot
-- `_extract_nextjs_data(content)` - Parse JSON data
-- `_parse_lot_page(content, url)` - Extract all fields
-- `save_final_results(data)` - Export JSON + CSV
-
-## Scalability Notes
-
-- **Rate limiting** prevents IP blocks but slows execution
-- **Caching** makes subsequent runs instant for unchanged pages
-- **Progress checkpoints** allow resuming after interruption
-- **Async/await** used throughout for non-blocking I/O
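For reference, the `CacheManager` listed under "Classes" in the deleted Architecture.md is described only by its method signatures. A minimal sketch of that interface over SQLite, assuming an illustrative `pages` table and column names (the real schema in `main.py` may differ):

```python
import sqlite3
import time


class CacheManager:
    """Sketch of the page cache described above; table/column names are illustrative."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content TEXT, status_code INTEGER, fetched_at REAL)"
        )
        self.conn.commit()

    def get(self, url, max_age_hours=24 * 7):
        """Return cached page content, or None if absent or older than max_age_hours."""
        row = self.conn.execute(
            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None or time.time() - row[1] > max_age_hours * 3600:
            return None
        return row[0]

    def set(self, url, content, status_code):
        """Insert or refresh a cached page with the current timestamp."""
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, content, status_code, fetched_at) "
            "VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours=24 * 7):
        """Remove entries older than max_age_hours (the wiki mentions a 7-day window)."""
        self.conn.execute(
            "DELETE FROM pages WHERE fetched_at < ?",
            (time.time() - max_age_hours * 3600,),
        )
        self.conn.commit()
```

This cache-then-fetch contract is what makes repeat runs and test mode cheap, as the Scalability Notes point out.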
diff --git a/Architecture.md.url b/Architecture.md.url
deleted file mode 100644
index fd64c11..0000000
--- a/Architecture.md.url
+++ /dev/null
@@ -1,2 +0,0 @@
-[InternetShortcut]
-URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md
diff --git a/Deployment.md b/Deployment.md
deleted file mode 100644
index 74d6002..0000000
--- a/Deployment.md
+++ /dev/null
@@ -1,27 +0,0 @@
-# Deployment
-
-## Automatic Deployment
-
-This project uses automatic deployment via git hooks.
-
-### Pipeline
-
-1. Push to `main` branch
-2. Git hook triggers
-3. Code syncs to `/opt/apps/troost-scraper`
-4. Docker builds and deploys
-5. Container starts automatically
-
-### Manual Deployment
-
-```bash
-sudo apps:deploy troost-scraper
-```
-
-## Monitoring
-
-View logs:
-```bash
-tail -f /home/git/logs/apps:deploy-troost-scraper.log
-docker logs troost-scraper
-```
diff --git a/Getting-Started.md b/Getting-Started.md
deleted file mode 100644
index abee363..0000000
--- a/Getting-Started.md
+++ /dev/null
@@ -1,71 +0,0 @@
-# Getting Started
-
-## Prerequisites
-
-- Python 3.8+
-- Git
-- pip (Python package manager)
-
-## Installation
-
-### 1. Clone the repository
-
-```bash
-git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
-cd troost-scraper
-```
-
-### 2. Install dependencies
-
-```bash
-pip install -r requirements.txt
-```
-
-### 3. Install Playwright browsers
-
-```bash
-playwright install chromium
-```
-
-## Configuration
-
-Edit the configuration in `main.py`:
-
-```python
-BASE_URL = "https://www.troostwijkauctions.com"
-CACHE_DB = "/path/to/cache.db"    # Path to cache database
-OUTPUT_DIR = "/path/to/output"    # Output directory
-RATE_LIMIT_SECONDS = 0.5          # Delay between requests
-MAX_PAGES = 50                    # Number of listing pages
-```
-
-**Windows users:** Use paths like `C:\\output\\cache.db`
-
-## Usage
-
-### Basic scraping
-
-```bash
-python main.py
-```
-
-This will:
-1. Crawl listing pages to collect lot URLs
-2. Scrape each individual lot page
-3. Save results in JSON and CSV formats
-4. Cache all pages for future runs
-
-### Test mode
-
-Debug extraction on a specific URL:
-
-```bash
-python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
-```
-
-## Output
-
-The scraper generates:
-- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
-- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
-- `cache.db` - SQLite cache (persistent)
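For reference, the `__NEXT_DATA__` parsing that Architecture.md names as the primary extraction method, and that the `--test` mode above is meant to debug, amounts to pulling a JSON blob out of the Next.js script tag. A minimal sketch, assuming the standard `<script id="__NEXT_DATA__" type="application/json">` embed (attribute order can vary per page; the real `_extract_nextjs_data()` in `main.py` may differ):

```python
import json
import re
from typing import Optional

# Next.js embeds page state as <script id="__NEXT_DATA__" type="application/json">...</script>.
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_nextjs_data(html: str) -> Optional[dict]:
    """Return the embedded __NEXT_DATA__ payload as a dict, or None if missing/unparsable."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None  # caller would fall back to HTML pattern matching
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

When the tag is missing or the JSON does not parse, the documented fallback is regex-based HTML matching.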
diff --git a/Home.md b/Home.md
deleted file mode 100644
index 2d784db..0000000
--- a/Home.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# troost-scraper Wiki
-
-Welcome to the troost-scraper documentation.
-
-## Contents
-
-- [Getting Started](Getting-Started.md)
-- [Architecture](Architecture.md)
-- [Deployment](Deployment.md)
-
-## Quick Links
-
-- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
-- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)
diff --git a/Home2223.md b/Home2223.md
deleted file mode 100644
index 1eb345d..0000000
--- a/Home2223.md
+++ /dev/null
@@ -1 +0,0 @@
-Welkom op de wiki.
\ No newline at end of file
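For reference, the timestamped export names listed under "Output" in the deleted Getting-Started.md (`troostwijk_lots_final_YYYYMMDD_HHMMSS.json` and `.csv`) follow a plain `strftime` pattern. A minimal sketch with a hypothetical helper name, not the project's actual code:

```python
from datetime import datetime
from pathlib import Path


def final_output_paths(output_dir: str):
    """Illustrative helper: build the timestamped JSON/CSV export paths named above."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20251203_122905
    base = Path(output_dir) / f"troostwijk_lots_final_{stamp}"
    return base.with_suffix(".json"), base.with_suffix(".csv")
```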