From 87212cd61205e5ab60b77539ecf0fbcb1409f021 Mon Sep 17 00:00:00 2001
From: Tour
Date: Wed, 3 Dec 2025 12:29:05 +0100
Subject: [PATCH] all

---
 Architect.md        |  15 -------
 Architecture.md     | 107 --------------------------------------------
 Architecture.md.url |   2 -
 Deployment.md       |  27 -----------
 Getting-Started.md  |  71 -----------------------------
 Home.md             |  14 ------
 Home2223.md         |   1 -
 7 files changed, 237 deletions(-)
 delete mode 100644 Architect.md
 delete mode 100644 Architecture.md
 delete mode 100644 Architecture.md.url
 delete mode 100644 Deployment.md
 delete mode 100644 Getting-Started.md
 delete mode 100644 Home.md
 delete mode 100644 Home2223.md

diff --git a/Architect.md b/Architect.md
deleted file mode 100644
index 5c7d6dc..0000000
--- a/Architect.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Architecture1
-
-## Overview
-
-Document your application architecture here.
-
-## Components
-
-- **Frontend**: Description
-- **Backend**: Description
-- **Database**: Description
-
-## Diagrams
-
-Add architecture diagrams here.
diff --git a/Architecture.md b/Architecture.md
deleted file mode 100644
index 2b6d3f7..0000000
--- a/Architecture.md
+++ /dev/null
@@ -1,107 +0,0 @@
-# Architecture
-
-## Overview
-
-The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
-
-## Core Components
-
-### 1. **Browser Automation (Playwright)**
-- Launches Chromium browser in headless mode
-- Bypasses Cloudflare protection
-- Handles dynamic content rendering
-- Supports network idle detection
-
-### 2. **Cache Manager (SQLite)**
-- Caches every fetched page
-- Prevents redundant requests
-- Stores page content, timestamps, and status codes
-- Auto-cleans entries older than 7 days
-- Database: `cache.db`
-
-### 3. **Rate Limiter**
-- Enforces exactly 0.5 seconds between requests
-- Prevents server overload
-- Tracks last request time globally
-
-### 4. **Data Extractor**
-- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
-- **Fallback method:** HTML pattern matching with regex
-- Extracts: title, location, bid info, dates, images, descriptions
-
-### 5. **Output Manager**
-- Exports data in JSON and CSV formats
-- Saves progress checkpoints every 10 lots
-- Timestamped filenames for tracking
-
-## Data Flow
-
-```
-1. Listing Pages → Extract lot URLs → Store in memory
-   ↓
-2. For each lot URL → Check cache → If cached: use cached content
-   ↓                                If not: fetch with rate limit
-   ↓
-3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
-   ↓
-4. Every 10 lots → Save progress checkpoint
-   ↓
-5. All lots complete → Export final JSON + CSV
-```
-
-## Key Design Decisions
-
-### Why Playwright?
-- Handles JavaScript-rendered content (Next.js)
-- Bypasses Cloudflare protection
-- More reliable than requests/BeautifulSoup for modern SPAs
-
-### Why JSON extraction?
-- Site uses Next.js with embedded `__NEXT_DATA__`
-- JSON is more reliable than HTML pattern matching
-- Avoids breaking when HTML/CSS changes
-- Faster parsing
-
-### Why SQLite caching?
-- Persistent across runs
-- Reduces load on target server
-- Enables test mode without re-fetching
-- Respects website resources
-
-## File Structure
-
-```
-troost-scraper/
-├── main.py              # Main scraper logic
-├── requirements.txt     # Python dependencies
-├── README.md            # Documentation
-├── .gitignore           # Git exclusions
-└── output/              # Generated files (not in git)
-    ├── cache.db         # SQLite cache
-    ├── *_partial_*.json # Progress checkpoints
-    ├── *_final_*.json   # Final JSON output
-    └── *_final_*.csv    # Final CSV output
-```
-
-## Classes
-
-### `CacheManager`
-- `__init__(db_path)` - Initialize cache database
-- `get(url, max_age_hours)` - Retrieve cached page
-- `set(url, content, status_code)` - Cache a page
-- `clear_old(max_age_hours)` - Remove old entries
-
-### `TroostwijkScraper`
-- `crawl_auctions(max_pages)` - Main entry point
-- `crawl_listing_page(page, page_num)` - Extract lot URLs
-- `crawl_lot(page, url)` - Scrape individual lot
-- `_extract_nextjs_data(content)` - Parse JSON data
-- `_parse_lot_page(content, url)` - Extract all fields
-- `save_final_results(data)` - Export JSON + CSV
-
-## Scalability Notes
-
-- **Rate limiting** prevents IP blocks but slows execution
-- **Caching** makes subsequent runs instant for unchanged pages
-- **Progress checkpoints** allow resuming after interruption
-- **Async/await** used throughout for non-blocking I/O
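For reference, the `CacheManager` listed under "Classes" in the deleted Architecture.md is described only by its method signatures. A minimal sketch of that interface over SQLite, assuming an illustrative `pages` table and column names (the real schema in `main.py` may differ):

```python
import sqlite3
import time


class CacheManager:
    """Sketch of the page cache described above; table/column names are illustrative."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content TEXT, status_code INTEGER, fetched_at REAL)"
        )
        self.conn.commit()

    def get(self, url, max_age_hours=24 * 7):
        """Return cached page content, or None if absent or older than max_age_hours."""
        row = self.conn.execute(
            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None or time.time() - row[1] > max_age_hours * 3600:
            return None
        return row[0]

    def set(self, url, content, status_code):
        """Insert or refresh a cached page with the current timestamp."""
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, content, status_code, fetched_at) "
            "VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours=24 * 7):
        """Remove entries older than max_age_hours (the wiki mentions a 7-day window)."""
        self.conn.execute(
            "DELETE FROM pages WHERE fetched_at < ?",
            (time.time() - max_age_hours * 3600,),
        )
        self.conn.commit()
```

This cache-then-fetch contract is what makes repeat runs and test mode cheap, as the Scalability Notes point out.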
diff --git a/Architecture.md.url b/Architecture.md.url
deleted file mode 100644
index fd64c11..0000000
--- a/Architecture.md.url
+++ /dev/null
@@ -1,2 +0,0 @@
-[InternetShortcut]
-URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md
diff --git a/Deployment.md b/Deployment.md
deleted file mode 100644
index 74d6002..0000000
--- a/Deployment.md
+++ /dev/null
@@ -1,27 +0,0 @@
-# Deployment
-
-## Automatic Deployment
-
-This project uses automatic deployment via git hooks.
-
-### Pipeline
-
-1. Push to `main` branch
-2. Git hook triggers
-3. Code syncs to `/opt/apps/troost-scraper`
-4. Docker builds and deploys
-5. Container starts automatically
-
-### Manual Deployment
-
-```bash
-sudo apps:deploy troost-scraper
-```
-
-## Monitoring
-
-View logs:
-```bash
-tail -f /home/git/logs/apps:deploy-troost-scraper.log
-docker logs troost-scraper
-```
diff --git a/Getting-Started.md b/Getting-Started.md
deleted file mode 100644
index abee363..0000000
--- a/Getting-Started.md
+++ /dev/null
@@ -1,71 +0,0 @@
-# Getting Started
-
-## Prerequisites
-
-- Python 3.8+
-- Git
-- pip (Python package manager)
-
-## Installation
-
-### 1. Clone the repository
-
-```bash
-git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
-cd troost-scraper
-```
-
-### 2. Install dependencies
-
-```bash
-pip install -r requirements.txt
-```
-
-### 3. Install Playwright browsers
-
-```bash
-playwright install chromium
-```
-
-## Configuration
-
-Edit the configuration in `main.py`:
-
-```python
-BASE_URL = "https://www.troostwijkauctions.com"
-CACHE_DB = "/path/to/cache.db"    # Path to cache database
-OUTPUT_DIR = "/path/to/output"    # Output directory
-RATE_LIMIT_SECONDS = 0.5          # Delay between requests
-MAX_PAGES = 50                    # Number of listing pages
-```
-
-**Windows users:** Use paths like `C:\\output\\cache.db`
-
-## Usage
-
-### Basic scraping
-
-```bash
-python main.py
-```
-
-This will:
-1. Crawl listing pages to collect lot URLs
-2. Scrape each individual lot page
-3. Save results in JSON and CSV formats
-4. Cache all pages for future runs
-
-### Test mode
-
-Debug extraction on a specific URL:
-
-```bash
-python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
-```
-
-## Output
-
-The scraper generates:
-- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
-- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
-- `cache.db` - SQLite cache (persistent)
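For reference, the `__NEXT_DATA__` parsing that Architecture.md names as the primary extraction method, and that the `--test` mode above is meant to debug, amounts to pulling a JSON blob out of the Next.js script tag. A minimal sketch, assuming the standard `<script id="__NEXT_DATA__" type="application/json">` embed (attribute order can vary per page; the real `_extract_nextjs_data()` in `main.py` may differ):

```python
import json
import re
from typing import Optional

# Next.js embeds page state as <script id="__NEXT_DATA__" type="application/json">...</script>.
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_nextjs_data(html: str) -> Optional[dict]:
    """Return the embedded __NEXT_DATA__ payload as a dict, or None if missing/unparsable."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None  # caller would fall back to HTML pattern matching
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

When the tag is missing or the JSON does not parse, the documented fallback is regex-based HTML matching.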
diff --git a/Home.md b/Home.md
deleted file mode 100644
index 2d784db..0000000
--- a/Home.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# troost-scraper Wiki
-
-Welcome to the troost-scraper documentation.
-
-## Contents
-
-- [Getting Started](Getting-Started.md)
-- [Architecture](Architecture.md)
-- [Deployment](Deployment.md)
-
-## Quick Links
-
-- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
-- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)
diff --git a/Home2223.md b/Home2223.md
deleted file mode 100644
index 1eb345d..0000000
--- a/Home2223.md
+++ /dev/null
@@ -1 +0,0 @@
-Welkom op de wiki.
\ No newline at end of file
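For reference, the timestamped export names listed under "Output" in the deleted Getting-Started.md (`troostwijk_lots_final_YYYYMMDD_HHMMSS.json` and `.csv`) follow a plain `strftime` pattern. A minimal sketch with a hypothetical helper name, not the project's actual code:

```python
from datetime import datetime
from pathlib import Path


def final_output_paths(output_dir: str):
    """Illustrative helper: build the timestamped JSON/CSV export paths named above."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20251203_122905
    base = Path(output_dir) / f"troostwijk_lots_final_{stamp}"
    return base.with_suffix(".json"), base.with_suffix(".csv")
```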