2025-12-03 12:29:05 +01:00
parent 999a6d056f
commit 87212cd612
7 changed files with 0 additions and 237 deletions
--- a/Architect.md
+++ b/Architect.md
@@ -1,15 +0,0 @@
 # Architecture1
 ## Overview
 Document your application architecture here.
 ## Components 
 - **Frontend**: Description
 - **Backend**: Description
 - **Database**: Description
 ## Diagrams
 Add architecture diagrams here.
--- a/Architecture.md
+++ b/Architecture.md
@@ -1,107 +0,0 @@
 # Architecture
 ## Overview
 The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
 ## Core Components
 ### 1. **Browser Automation (Playwright)**
 - Launches Chromium browser in headless mode
 - Bypasses Cloudflare protection
 - Handles dynamic content rendering
 - Supports network idle detection
 ### 2. **Cache Manager (SQLite)**
 - Caches every fetched page
 - Prevents redundant requests
 - Stores page content, timestamps, and status codes
 - Auto-cleans entries older than 7 days
 - Database: `cache.db`
 ### 3. **Rate Limiter**
 - Enforces a minimum delay of 0.5 seconds between requests
 - Prevents server overload
 - Tracks last request time globally
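The rate limiter described above could look roughly like the following. This is a minimal sketch, not the code from `main.py`: the class name `RateLimiter`, the `wait()` method, and the use of an `asyncio.Lock` are all assumptions made for illustration.

```python
import asyncio
import time


class RateLimiter:
    """Hypothetical helper: spaces consecutive requests at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 0.5) -> None:
        self.min_interval = min_interval
        self._last_request = None  # monotonic timestamp of the previous request
        # Create the limiter inside the running event loop (see usage note below).
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # The lock serialises callers so that concurrent tasks share one global timer.
        async with self._lock:
            if self._last_request is not None:
                remaining = self.min_interval - (time.monotonic() - self._last_request)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_request = time.monotonic()
```

A caller would `await limiter.wait()` immediately before each page fetch. Constructing the limiter inside the coroutine that uses it avoids event-loop binding problems on older Python versions.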
 ### 4. **Data Extractor**
 - **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
 - **Fallback method:** HTML pattern matching with regex
 - Extracts: title, location, bid info, dates, images, descriptions
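As an illustration of the primary method, the sketch below pulls the `__NEXT_DATA__` JSON out of a page with a regular expression. The function name `extract_next_data` and the `props.pageProps.lot` layout in the sample are assumptions; the real payload structure of the site is not documented here.

```python
import json
import re
from typing import Optional

# Next.js embeds its page data in a <script id="__NEXT_DATA__"> element.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if it is missing or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None  # caller can fall back to HTML pattern matching
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Hypothetical sample page; the key names are illustrative only.
sample = (
    '<html><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"lot": {"title": "Forklift"}}}}'
    "</script></html>"
)
data = extract_next_data(sample)
```

Returning `None` rather than raising keeps the fallback path simple: the caller only switches to regex-based HTML parsing when no usable JSON was found.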
 ### 5. **Output Manager**
 - Exports data in JSON and CSV formats
 - Saves progress checkpoints every 10 lots
 - Timestamped filenames for tracking
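The export step might be sketched as follows. The function name `export_lots` and the choice to use the union of all keys as CSV columns are assumptions; only the `troostwijk_lots_final_YYYYMMDD_HHMMSS` naming pattern comes from this wiki.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def export_lots(lots, output_dir, prefix="troostwijk_lots_final"):
    """Write `lots` (a list of dicts) to timestamped JSON and CSV files."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    json_path = out / f"{prefix}_{stamp}.json"
    csv_path = out / f"{prefix}_{stamp}.csv"

    json_path.write_text(
        json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Use every key seen in any lot as a column; missing values become "".
    fieldnames = sorted({key for lot in lots for key in lot})
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(lots)

    return json_path, csv_path
```

Nested values such as image lists would be written to the CSV as their string representation, so a real exporter may want to flatten them first.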
 ## Data Flow
 ```
 1. Listing Pages → Extract lot URLs → Store in memory
                                           ↓
 2. For each lot URL → Check cache → If cached: use cached content
                          ↓              If not: fetch with rate limit
                          ↓
 3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
                          ↓
 4. Every 10 lots → Save progress checkpoint
                          ↓
 5. All lots complete → Export final JSON + CSV
 ```
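Steps 2 to 4 of the flow above can be sketched as a plain loop. This is a simplified, synchronous illustration: `process_lots` and its `fetch`, `parse`, and `save_checkpoint` callables are hypothetical stand-ins, and a dict plays the role of the SQLite cache.

```python
from typing import Callable, Dict, List


def process_lots(
    urls: List[str],
    cache: Dict[str, str],
    fetch: Callable[[str], str],
    parse: Callable[[str, str], dict],
    save_checkpoint: Callable[[List[dict]], None],
    checkpoint_every: int = 10,
) -> List[dict]:
    """Process each lot URL: cache lookup, fetch on miss, parse, periodic checkpoint."""
    results: List[dict] = []
    for url in urls:
        content = cache.get(url)
        if content is None:
            # Cache miss: this is where the rate-limited fetch would happen.
            content = fetch(url)
            cache[url] = content
        results.append(parse(content, url))
        if len(results) % checkpoint_every == 0:
            # Pass a copy so the checkpoint is a stable snapshot.
            save_checkpoint(list(results))
    return results
```

Passing the collaborators in as callables keeps the loop testable without a browser or a database.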
 ## Key Design Decisions
 ### Why Playwright?
 - Handles JavaScript-rendered content (Next.js)
 - Bypasses Cloudflare protection
 - More reliable than requests/BeautifulSoup for modern SPAs
 ### Why JSON extraction?
 - Site uses Next.js with embedded `__NEXT_DATA__`
 - JSON is more reliable than HTML pattern matching
 - Avoids breaking when HTML/CSS changes
 - Faster parsing
 ### Why SQLite caching?
 - Persistent across runs
 - Reduces load on target server
 - Enables test mode without re-fetching
 - Respects website resources
 ## File Structure
 ```
 troost-scraper/
 ├── main.py                    # Main scraper logic
 ├── requirements.txt           # Python dependencies
 ├── README.md                  # Documentation
 ├── .gitignore                 # Git exclusions
 └── output/                    # Generated files (not in git)
    ├── cache.db              # SQLite cache
    ├── *_partial_*.json      # Progress checkpoints
    ├── *_final_*.json        # Final JSON output
    └── *_final_*.csv         # Final CSV output
 ```
 ## Classes
 ### `CacheManager`
 - `__init__(db_path)` - Initialize cache database
 - `get(url, max_age_hours)` - Retrieve cached page
 - `set(url, content, status_code)` - Cache a page
 - `clear_old(max_age_hours)` - Remove old entries
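A minimal sketch of a class with this interface, built on the standard `sqlite3` module. The table schema and the default values (24 hours for `get`, 168 hours for `clear_old`, matching the 7-day cleanup mentioned earlier) are assumptions, not the schema actually used in `main.py`.

```python
import sqlite3
import time
from typing import Optional


class CacheManager:
    """Illustrative page cache keyed by URL (schema is an assumption)."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, content TEXT, "
            "status_code INTEGER, timestamp REAL)"
        )
        self.conn.commit()

    def get(self, url: str, max_age_hours: float = 24) -> Optional[str]:
        """Return cached content if it is newer than `max_age_hours`, else None."""
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM cache WHERE url = ? AND timestamp >= ?",
            (url, cutoff),
        ).fetchone()
        return row[0] if row else None

    def set(self, url: str, content: str, status_code: int = 200) -> None:
        """Insert or overwrite the cached page for `url`."""
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (url, content, status_code, timestamp) "
            "VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: float = 168) -> int:
        """Delete entries older than `max_age_hours`; return the number removed."""
        cutoff = time.time() - max_age_hours * 3600
        cur = self.conn.execute("DELETE FROM cache WHERE timestamp < ?", (cutoff,))
        self.conn.commit()
        return cur.rowcount
```

Passing `":memory:"` as `db_path` gives a throwaway cache, which is convenient for trying the class out without touching `cache.db`.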
 ### `TroostwijkScraper`
 - `crawl_auctions(max_pages)` - Main entry point
 - `crawl_listing_page(page, page_num)` - Extract lot URLs
 - `crawl_lot(page, url)` - Scrape individual lot
 - `_extract_nextjs_data(content)` - Parse JSON data
 - `_parse_lot_page(content, url)` - Extract all fields
 - `save_final_results(data)` - Export JSON + CSV
 ## Scalability Notes
 - **Rate limiting** prevents IP blocks but slows execution
 - **Caching** lets subsequent runs skip the network fetch for pages that are still in the cache
 - **Progress checkpoints** allow resuming after interruption
 - **Async/await** used throughout for non-blocking I/O
--- a/Architecture.md.url
+++ b/Architecture.md.url
@@ -1,2 +0,0 @@
 [InternetShortcut]
 URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md
--- a/Deployment.md
+++ b/Deployment.md
@@ -1,27 +0,0 @@
 # Deployment
 ## Automatic Deployment
 This project uses automatic deployment via git hooks.
 ### Pipeline 
 1. Push to `main` branch
 2. Git hook triggers
 3. Code syncs to `/opt/apps/troost-scraper`
 4. Docker builds and deploys
 5. Container starts automatically
 ### Manual Deployment
 ```bash
 sudo apps:deploy troost-scraper
 ```
 ## Monitoring
 View logs:
 ```bash
 tail -f /home/git/logs/apps:deploy-troost-scraper.log
 docker logs troost-scraper
 ```
--- a/Getting-Started.md
+++ b/Getting-Started.md
@@ -1,71 +0,0 @@
 # Getting Started
 ## Prerequisites
 - Python 3.8+
 - Git
 - pip (Python package manager)
 ## Installation
 ### 1. Clone the repository
 ```bash
 git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
 cd troost-scraper
 ```
 ### 2. Install dependencies
 ```bash
 pip install -r requirements.txt
 ```
 ### 3. Install Playwright browsers
 ```bash
 playwright install chromium
 ```
 ## Configuration
 Edit the configuration in `main.py`:
 ```python
 BASE_URL = "https://www.troostwijkauctions.com"
 CACHE_DB = "/path/to/cache.db"           # Path to cache database
 OUTPUT_DIR = "/path/to/output"            # Output directory
 RATE_LIMIT_SECONDS = 0.5                  # Delay between requests
 MAX_PAGES = 50                            # Number of listing pages
 ```
 **Windows users:** Use paths like `C:\\output\\cache.db`
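For the path settings, a raw string or `pathlib` avoids the doubled backslashes. The sketch below only illustrates the notation; the relative `output` directory is an assumed example value, not a required location.

```python
from pathlib import Path, PureWindowsPath

# Escaped and raw string literals describe the same Windows path.
escaped = "C:\\output\\cache.db"
raw = r"C:\output\cache.db"
same_path = escaped == raw

# PureWindowsPath parses Windows-style paths on any operating system.
win_path = PureWindowsPath(raw)

# pathlib joins path segments with the right separator for the current platform.
OUTPUT_DIR = Path("output")        # assumed example location
CACHE_DB = OUTPUT_DIR / "cache.db"
```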
 ## Usage
 ### Basic scraping
 ```bash
 python main.py
 ```
 This will:
 1. Crawl listing pages to collect lot URLs
 2. Scrape each individual lot page
 3. Save results in JSON and CSV formats
 4. Cache all pages for future runs
 ### Test mode
 Debug extraction on a specific URL:
 ```bash
 python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
 ```
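A `--test` flag like the one above could be wired up with `argparse` as sketched here. The function name `build_parser` and the help text are assumptions; how `main.py` actually parses its arguments is not shown in this wiki.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI definition with an optional --test URL."""
    parser = argparse.ArgumentParser(description="Troostwijk Auctions scraper")
    parser.add_argument(
        "--test",
        metavar="URL",
        default=None,
        help="debug extraction on a single lot URL instead of running a full crawl",
    )
    return parser


# Parse an explicit argument list here; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(
    ["--test", "https://www.troostwijkauctions.com/a/lot-url"]
)
mode = "test" if args.test else "full crawl"
```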
 ## Output
 The scraper generates:
 - `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
 - `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
 - `cache.db` - SQLite cache (persistent)
--- a/Home.md
+++ b/Home.md
@@ -1,14 +0,0 @@
 # troost-scraper Wiki
 Welcome to the troost-scraper documentation.
 ## Contents
 - [Getting Started](Getting-Started.md)
 - [Architecture](Architecture.md)
 - [Deployment](Deployment.md)
 ## Quick Links
 - [Repository](https://git.appmodel.nl/Tour/troost-scraper)
 - [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)
--- a/Home2223.md
+++ b/Home2223.md
@@ -1 +0,0 @@
 Welcome to the wiki.