Update wiki documentation with Python/Playwright architecture and setup instructions

Tour
2025-12-03 12:27:16 +01:00
parent 7b370ee380
commit c1e024bcb0
7 changed files with 238 additions and 216 deletions

15
Architect.md Normal file

@@ -0,0 +1,15 @@
# Architecture
## Overview
Document your application architecture here.
## Components
- **Frontend**: Description
- **Backend**: Description
- **Database**: Description
## Diagrams
Add architecture diagrams here.

107
Architecture.md Normal file

@@ -0,0 +1,107 @@
# Architecture
## Overview
The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
## Core Components
### 1. **Browser Automation (Playwright)**
- Launches Chromium browser in headless mode
- Bypasses Cloudflare protection
- Handles dynamic content rendering
- Supports network idle detection
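A minimal sketch of this fetch step, assuming the async Playwright API; `fetch_page` is a hypothetical helper, not the function in `main.py`:
```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    """Fetch a fully rendered page with headless Chromium (illustrative sketch)."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # "networkidle" waits for network activity to settle, so the
        # JavaScript-rendered (Next.js) content is present in the DOM
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

if __name__ == "__main__":
    html = asyncio.run(fetch_page("https://www.troostwijkauctions.com/auctions?page=1"))
    print(len(html))
```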
### 2. **Cache Manager (SQLite)**
- Caches every fetched page
- Prevents redundant requests
- Stores page content, timestamps, and status codes
- Auto-cleans entries older than 7 days
- Database: `cache.db`
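A minimal sketch of this component, assuming a single `pages` table with URL, content, status code, and fetch-timestamp columns; the real schema in `main.py` may differ:
```python
import sqlite3
import time
from typing import Optional

class CacheManager:
    """Illustrative SQLite page cache; table and column names are assumptions."""

    def __init__(self, db_path: str = "cache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content TEXT, status_code INTEGER, fetched_at REAL)"
        )

    def get(self, url: str, max_age_hours: float = 168) -> Optional[str]:
        """Return cached content if it is newer than max_age_hours (default: 7 days)."""
        cutoff = time.time() - max_age_hours * 3600
        row = self.conn.execute(
            "SELECT content FROM pages WHERE url = ? AND fetched_at > ?", (url, cutoff)
        ).fetchone()
        return row[0] if row else None

    def set(self, url: str, content: str, status_code: int = 200) -> None:
        """Insert or refresh a cached page."""
        self.conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, content, status_code, time.time()),
        )
        self.conn.commit()

    def clear_old(self, max_age_hours: float = 168) -> None:
        """Drop entries older than max_age_hours."""
        cutoff = time.time() - max_age_hours * 3600
        self.conn.execute("DELETE FROM pages WHERE fetched_at < ?", (cutoff,))
        self.conn.commit()
```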
### 3. **Rate Limiter**
- Enforces exactly 0.5 seconds between requests
- Prevents server overload
- Tracks last request time globally
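A minimal sketch of that throttle, assuming an asyncio event loop and a module-level timestamp; names here are illustrative:
```python
import asyncio
import time

RATE_LIMIT_SECONDS = 0.5    # same value the configuration section documents
_last_request_time = 0.0    # global "last request" timestamp

async def rate_limit() -> None:
    """Sleep just long enough to keep at least 0.5 s between consecutive requests."""
    global _last_request_time
    elapsed = time.monotonic() - _last_request_time
    if elapsed < RATE_LIMIT_SECONDS:
        await asyncio.sleep(RATE_LIMIT_SECONDS - elapsed)
    _last_request_time = time.monotonic()
```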
### 4. **Data Extractor**
- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
- **Fallback method:** HTML pattern matching with regex
- Extracts: title, location, bid info, dates, images, descriptions
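A minimal sketch of the primary path, assuming the standard Next.js `<script id="__NEXT_DATA__" type="application/json">` tag; returning `None` lets callers fall back to regex-based HTML matching:
```python
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_nextjs_data(content: str) -> Optional[dict]:
    """Parse the embedded Next.js JSON blob out of a rendered page."""
    match = NEXT_DATA_RE.search(content)
    if not match:
        return None  # caller falls back to HTML pattern matching
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```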
### 5. **Output Manager**
- Exports data in JSON and CSV formats
- Saves progress checkpoints every 10 lots
- Timestamped filenames for tracking
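A minimal sketch of the export step, assuming the lots arrive as a flat list of dicts; filenames follow the timestamped pattern shown in the file structure below:
```python
import csv
import json
from datetime import datetime
from pathlib import Path

def save_final_results(lots: list, output_dir: str = "output") -> None:
    """Write timestamped JSON and CSV exports (illustrative sketch)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Complete data as JSON
    (out / f"troostwijk_lots_final_{stamp}.json").write_text(
        json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Same data flattened to CSV; field names are taken from the records themselves
    fieldnames = sorted({key for lot in lots for key in lot})
    with open(out / f"troostwijk_lots_final_{stamp}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(lots)
```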
## Data Flow
```
1. Listing Pages → Extract lot URLs → Store in memory
2. For each lot URL → Check cache → If cached: use cached content
                                  → If not cached: fetch with rate limit
3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
4. Every 10 lots → Save progress checkpoint
5. All lots complete → Export final JSON + CSV
```
## Key Design Decisions
### Why Playwright?
- Handles JavaScript-rendered content (Next.js)
- Bypasses Cloudflare protection
- More reliable than requests/BeautifulSoup for modern SPAs
### Why JSON extraction?
- Site uses Next.js with embedded `__NEXT_DATA__`
- JSON is more reliable than HTML pattern matching
- Avoids breaking when HTML/CSS changes
- Faster parsing
### Why SQLite caching?
- Persistent across runs
- Reduces load on target server
- Enables test mode without re-fetching
- Respects website resources
## File Structure
```
troost-scraper/
├── main.py # Main scraper logic
├── requirements.txt # Python dependencies
├── README.md # Documentation
├── .gitignore # Git exclusions
└── output/ # Generated files (not in git)
├── cache.db # SQLite cache
├── *_partial_*.json # Progress checkpoints
├── *_final_*.json # Final JSON output
└── *_final_*.csv # Final CSV output
```
## Classes
### `CacheManager`
- `__init__(db_path)` - Initialize cache database
- `get(url, max_age_hours)` - Retrieve cached page
- `set(url, content, status_code)` - Cache a page
- `clear_old(max_age_hours)` - Remove old entries
### `TroostwijkScraper`
- `crawl_auctions(max_pages)` - Main entry point
- `crawl_listing_page(page, page_num)` - Extract lot URLs
- `crawl_lot(page, url)` - Scrape individual lot
- `_extract_nextjs_data(content)` - Parse JSON data
- `_parse_lot_page(content, url)` - Extract all fields
- `save_final_results(data)` - Export JSON + CSV
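A hypothetical usage sketch showing how these pieces might be wired together; it assumes both classes live in `main.py` and that `TroostwijkScraper` takes no constructor arguments:
```python
import asyncio

# Assumed import path and signatures; adjust to the real code in main.py.
from main import CacheManager, TroostwijkScraper

async def run() -> None:
    cache = CacheManager("output/cache.db")      # persistent page cache
    cache.clear_old(max_age_hours=168)           # drop entries older than 7 days
    scraper = TroostwijkScraper()
    await scraper.crawl_auctions(max_pages=50)   # collect URLs, scrape lots, export JSON + CSV

if __name__ == "__main__":
    asyncio.run(run())
```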
## Scalability Notes
- **Rate limiting** prevents IP blocks but slows execution
- **Caching** makes subsequent runs instant for unchanged pages
- **Progress checkpoints** allow resuming after interruption
- **Async/await** used throughout for non-blocking I/O

2
Architecture.md.url Normal file

@@ -0,0 +1,2 @@
[InternetShortcut]
URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md

27
Deployment.md Normal file

@@ -0,0 +1,27 @@
# Deployment
## Automatic Deployment
This project uses automatic deployment via git hooks.
### Pipeline
1. Push to `main` branch
2. Git hook triggers
3. Code syncs to `/opt/apps/troost-scraper`
4. Docker builds and deploys
5. Container starts automatically
### Manual Deployment
```bash
sudo apps:deploy troost-scraper
```
## Monitoring
View logs:
```bash
tail -f /home/git/logs/apps:deploy-troost-scraper.log
docker logs troost-scraper
```

71
Getting-Started.md Normal file

@@ -0,0 +1,71 @@
# Getting Started
## Prerequisites
- Python 3.8+
- Git
- pip (Python package manager)
## Installation
### 1. Clone the repository
```bash
git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
cd troost-scraper
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
### 3. Install Playwright browsers
```bash
playwright install chromium
```
## Configuration
Edit the configuration in `main.py`:
```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/path/to/cache.db" # Path to cache database
OUTPUT_DIR = "/path/to/output" # Output directory
RATE_LIMIT_SECONDS = 0.5 # Delay between requests
MAX_PAGES = 50 # Number of listing pages
```
**Windows users:** Use paths like `C:\\output\\cache.db`
## Usage
### Basic scraping
```bash
python main.py
```
This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in JSON and CSV formats
4. Cache all pages for future runs
### Test mode
Debug extraction on a specific URL:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
```
## Output
The scraper generates:
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
- `cache.db` - SQLite cache (persistent)

14
Home.md Normal file

@@ -0,0 +1,14 @@
# troost-scraper Wiki
Welcome to the troost-scraper documentation.
## Contents
- [Getting Started](Getting-Started.md)
- [Architecture](Architecture.md)
- [Deployment](Deployment.md)
## Quick Links
- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)

218
README.md

@@ -1,217 +1,3 @@
# troost-scraper-wiki
Wiki for troost-scraper

# Troostwijk Auctions Scraper
A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.
## Features
- **Playwright-based scraping** - Bypasses Cloudflare protection
- **SQLite caching** - Caches every page to avoid redundant requests
- **Rate limiting** - Strictly enforces 0.5 seconds between requests
- **Multi-format output** - Exports data in both JSON and CSV formats
- **Progress saving** - Automatically saves progress every 10 lots
- **Test mode** - Debug extraction patterns on cached pages
## Requirements
- Python 3.8+
- Playwright (with Chromium browser)
## Installation
1. **Clone or download this project**
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
3. **Install Playwright browsers:**
```bash
playwright install chromium
```
## Configuration
Edit the configuration variables in `main.py`:
```python
BASE_URL = "https://www.troostwijkauctions.com"
CACHE_DB = "/mnt/okcomputer/output/cache.db" # Path to cache database
OUTPUT_DIR = "/mnt/okcomputer/output" # Output directory
RATE_LIMIT_SECONDS = 0.5 # Delay between requests
MAX_PAGES = 50 # Number of listing pages to crawl
```
**Note:** Update the paths to match your system (especially on Windows, use paths like `C:\\output\\cache.db`).
## Usage
### Basic Scraping
Run the scraper to collect auction lot data:
```bash
python main.py
```
This will:
1. Crawl listing pages to collect lot URLs
2. Scrape each individual lot page
3. Save results in both JSON and CSV formats
4. Cache all pages to avoid re-fetching
### Test Mode
Test extraction patterns on a specific cached URL:
```bash
# Test with default URL
python main.py --test
# Test with specific URL
python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
```
This is useful for debugging extraction patterns and verifying data is being extracted correctly.
## Output Files
The scraper generates the following files:
### During Execution
- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)
### Final Output
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format
### Cache
- `cache.db` - SQLite database with cached page content (persistent across runs)
## Data Extracted
For each auction lot, the scraper extracts:
- **URL** - Direct link to the lot
- **Lot ID** - Unique identifier (e.g., A7-35847)
- **Title** - Lot title/description
- **Current Bid** - Current bid amount
- **Bid Count** - Number of bids placed
- **End Date** - Auction end time
- **Location** - Physical location of the item
- **Description** - Detailed description
- **Category** - Auction category
- **Images** - Up to 5 product images
- **Scraped At** - Timestamp of data collection
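For orientation, one record might look like the following; every value is a placeholder and the exact key names in the real output may differ:
```python
example_lot = {
    "url": "https://www.troostwijkauctions.com/a/example-lot",  # placeholder URL
    "lot_id": "A7-35847",                 # ID format from the list above; value illustrative
    "title": "Example lot title",
    "current_bid": "€ 100",               # placeholder amount
    "bid_count": 3,
    "end_date": "2025-01-01T12:00:00",
    "location": "Example city, NL",
    "description": "Placeholder description",
    "category": "Example category",
    "images": ["https://example.com/1.jpg"],  # up to 5 image URLs
    "scraped_at": "2025-01-01T10:00:00",
}
```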
## How It Works
### Phase 1: Collect Lot URLs
The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.
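A minimal sketch of this phase, assuming lot links use the `/a/...` path seen in the test URLs elsewhere in this README; the actual link pattern in `main.py` may differ:
```python
import re

LOT_LINK_RE = re.compile(r'href="(/a/[^"]+)"')

def collect_lot_urls(listing_html: str,
                     base_url: str = "https://www.troostwijkauctions.com") -> list:
    """Pull lot URLs out of one rendered listing page (link pattern is assumed)."""
    urls = []
    for path in LOT_LINK_RE.findall(listing_html):
        url = base_url + path
        if url not in urls:   # preserve order, drop duplicates
            urls.append(url)
    return urls
```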
### Phase 2: Scrape Individual Lots
Each lot page is visited and data is extracted from the embedded JSON data (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast.
### Caching Strategy
- Every successfully fetched page is cached in SQLite
- Cache is checked before making any request
- Cache entries older than 7 days are automatically cleaned
- Failed requests (500 errors) are also cached to avoid retrying
### Rate Limiting
- Enforces exactly 0.5 seconds between ALL requests
- Applies to both listing pages and individual lot pages
- Prevents server overload and potential IP blocking
## Troubleshooting
### Issue: "Huidig bod" / "Locatie" instead of actual values
**✓ FIXED!** The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON. The scraper now automatically extracts data from JSON first, falling back to HTML pattern matching only if needed.
The scraper correctly extracts:
- **Title** from `auction.name`
- **Location** from `viewingDays` or `collectionDays`
- **Images** from `auction.image.url`
- **End date** from `minEndDate`
- **Lot ID** from `auction.displayId`
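Mapped onto the parsed `__NEXT_DATA__` dict, the lookup might look roughly like this; the `props.pageProps` nesting and the `city` sub-key are assumptions, and only the field names listed above come from the scraper:
```python
def map_auction_fields(next_data: dict) -> dict:
    """Illustrative mapping from __NEXT_DATA__ to output fields; nesting is assumed."""
    auction = next_data.get("props", {}).get("pageProps", {}).get("auction", {})
    days = auction.get("viewingDays") or auction.get("collectionDays") or [{}]
    return {
        "title": auction.get("name"),
        "lot_id": auction.get("displayId"),
        "end_date": auction.get("minEndDate"),
        "image": (auction.get("image") or {}).get("url"),
        "location": days[0].get("city"),   # "city" is an assumed sub-key
    }
```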
To verify extraction is working:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
```
**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
### Issue: No lots found
- Check if the website structure has changed
- Verify `BASE_URL` is correct
- Try clearing the cache database
### Issue: Cloudflare blocking
- Playwright should bypass this automatically
- If issues persist, try adjusting user agent or headers in `crawl_auctions()`
### Issue: Slow scraping
- This is intentional due to rate limiting (0.5s between requests)
- Adjust `RATE_LIMIT_SECONDS` if needed (not recommended below 0.5s)
- First run will be slower; subsequent runs use cache
## Project Structure
```
troost-scraper/
├── main.py # Main scraper script
├── requirements.txt # Python dependencies
├── README.md # This file
└── output/ # Generated output files (created automatically)
├── cache.db # SQLite cache
├── *.json # JSON output files
└── *.csv # CSV output files
```
## Development
### Adding New Extraction Fields
1. Add extraction method in `TroostwijkScraper` class:
```python
def _extract_new_field(self, content: str) -> str:
    pattern = r'your-regex-pattern'
    match = re.search(pattern, content)
    return match.group(1) if match else ""
```
2. Add field to `_parse_lot_page()`:
```python
data = {
    # ... existing fields ...
    'new_field': self._extract_new_field(content),
}
```
3. Add field to CSV export in `save_final_results()`:
```python
fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
```
### Testing Extraction Patterns
Use test mode to verify patterns work correctly:
```bash
python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
```
## License
This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.
## Notes
- **Be respectful:** The rate limiting is intentionally conservative
- **Check legality:** Ensure web scraping is permitted in your jurisdiction
- **Monitor changes:** Website structure may change over time, requiring pattern updates
- **Cache management:** Old cache entries are auto-cleaned after 7 days