diff --git a/Architect.md b/Architect.md
new file mode 100644
index 0000000..5c7d6dc
--- /dev/null
+++ b/Architect.md
@@ -0,0 +1,15 @@
+# Architecture
+
+## Overview
+
+Document your application architecture here.
+
+## Components
+
+- **Frontend**: Description
+- **Backend**: Description
+- **Database**: Description
+
+## Diagrams
+
+Add architecture diagrams here.
diff --git a/Architecture.md b/Architecture.md
new file mode 100644
index 0000000..2b6d3f7
--- /dev/null
+++ b/Architecture.md
@@ -0,0 +1,107 @@
+# Architecture
+
+## Overview
+
+The Troostwijk Auctions Scraper is a Python-based web scraper that extracts auction lot data using Playwright for browser automation and SQLite for caching.
+
+## Core Components
+
+### 1. **Browser Automation (Playwright)**
+- Launches Chromium browser in headless mode
+- Bypasses Cloudflare protection
+- Handles dynamic content rendering
+- Supports network idle detection
+
+### 2. **Cache Manager (SQLite)**
+- Caches every fetched page
+- Prevents redundant requests
+- Stores page content, timestamps, and status codes
+- Auto-cleans entries older than 7 days
+- Database: `cache.db`
+
+### 3. **Rate Limiter**
+- Enforces a minimum of 0.5 seconds between requests
+- Prevents server overload
+- Tracks last request time globally
+
+### 4. **Data Extractor**
+- **Primary method:** Parses `__NEXT_DATA__` JSON from Next.js pages
+- **Fallback method:** HTML pattern matching with regex
+- Extracts: title, location, bid info, dates, images, descriptions
+
+### 5. **Output Manager**
+- Exports data in JSON and CSV formats
+- Saves progress checkpoints every 10 lots
+- Timestamped filenames for tracking
+
+## Data Flow
+
+```
+1. Listing Pages → Extract lot URLs → Store in memory
+   ↓
+2. For each lot URL → Check cache → If cached: use cached content
+   ↓                               If not: fetch with rate limit
+   ↓
+3. Parse __NEXT_DATA__ JSON → Extract fields → Store in results
+   ↓
+4. Every 10 lots → Save progress checkpoint
+   ↓
+5. All lots complete → Export final JSON + CSV
+```
+
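+As a concrete illustration of step 2, the sketch below shows how a cached, rate-limited fetch can be wired together. It is a minimal sketch, not the actual implementation in `main.py`: the function name `fetch_page`, the 7-day cache window, and the fallback status code are illustrative assumptions.
+
+```python
+import asyncio
+import time
+
+RATE_LIMIT_SECONDS = 0.5
+_last_request = 0.0
+
+async def fetch_page(page, url: str, cache) -> str:
+    """Return page HTML, preferring the SQLite cache and enforcing the rate limit."""
+    cached = cache.get(url, max_age_hours=168)      # 7-day cache window (illustrative)
+    if cached:
+        return cached
+    global _last_request
+    wait = RATE_LIMIT_SECONDS - (time.time() - _last_request)
+    if wait > 0:
+        await asyncio.sleep(wait)                   # keep at least 0.5 s between requests
+    _last_request = time.time()
+    response = await page.goto(url, wait_until="networkidle")
+    content = await page.content()
+    cache.set(url, content, response.status if response else 200)
+    return content
+```
+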
+## Key Design Decisions
+
+### Why Playwright?
+- Handles JavaScript-rendered content (Next.js)
+- Bypasses Cloudflare protection
+- More reliable than requests/BeautifulSoup for modern SPAs
+
+### Why JSON extraction?
+- Site uses Next.js with embedded `__NEXT_DATA__`
+- JSON is more reliable than HTML pattern matching
+- Avoids breaking when HTML/CSS changes
+- Faster parsing
+
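+A minimal sketch of this extraction path (Next.js embeds its payload in a `<script id="__NEXT_DATA__">` tag; the real `_extract_nextjs_data` may differ in detail):
+
+```python
+import json
+import re
+
+def extract_nextjs_data(content: str) -> dict:
+    """Return the page's embedded __NEXT_DATA__ payload, or {} if it is missing."""
+    match = re.search(
+        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
+        content,
+        re.DOTALL,
+    )
+    return json.loads(match.group(1)) if match else {}
+```
+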
+### Why SQLite caching?
+- Persistent across runs
+- Reduces load on target server
+- Enables test mode without re-fetching
+- Respects website resources
+
+## File Structure
+
+```
+troost-scraper/
+├── main.py              # Main scraper logic
+├── requirements.txt     # Python dependencies
+├── README.md            # Documentation
+├── .gitignore           # Git exclusions
+└── output/              # Generated files (not in git)
+    ├── cache.db             # SQLite cache
+    ├── *_partial_*.json     # Progress checkpoints
+    ├── *_final_*.json       # Final JSON output
+    └── *_final_*.csv        # Final CSV output
+```
+
+## Classes
+
+### `CacheManager`
+- `__init__(db_path)` - Initialize cache database
+- `get(url, max_age_hours)` - Retrieve cached page
+- `set(url, content, status_code)` - Cache a page
+- `clear_old(max_age_hours)` - Remove old entries
+
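+A minimal sketch of how these methods could be implemented; the table layout inside `cache.db` is an assumption and may differ from the real schema:
+
+```python
+import sqlite3
+import time
+
+class CacheManager:
+    """Illustrative page cache backed by SQLite (schema is assumed, not taken from main.py)."""
+
+    def __init__(self, db_path: str):
+        self.conn = sqlite3.connect(db_path)
+        self.conn.execute(
+            "CREATE TABLE IF NOT EXISTS pages ("
+            "url TEXT PRIMARY KEY, content TEXT, status_code INTEGER, fetched_at REAL)")
+
+    def get(self, url: str, max_age_hours: int = 168):
+        # Return cached content only if it is newer than max_age_hours
+        row = self.conn.execute(
+            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)).fetchone()
+        if row and time.time() - row[1] < max_age_hours * 3600:
+            return row[0]
+        return None
+
+    def set(self, url: str, content: str, status_code: int = 200) -> None:
+        self.conn.execute(
+            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
+            (url, content, status_code, time.time()))
+        self.conn.commit()
+
+    def clear_old(self, max_age_hours: int = 168) -> None:
+        self.conn.execute(
+            "DELETE FROM pages WHERE fetched_at < ?",
+            (time.time() - max_age_hours * 3600,))
+        self.conn.commit()
+```
+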
+### `TroostwijkScraper`
+- `crawl_auctions(max_pages)` - Main entry point
+- `crawl_listing_page(page, page_num)` - Extract lot URLs
+- `crawl_lot(page, url)` - Scrape individual lot
+- `_extract_nextjs_data(content)` - Parse JSON data
+- `_parse_lot_page(content, url)` - Extract all fields
+- `save_final_results(data)` - Export JSON + CSV
+
+## Scalability Notes
+
+- **Rate limiting** prevents IP blocks but slows execution
+- **Caching** makes subsequent runs instant for unchanged pages
+- **Progress checkpoints** allow resuming after interruption
+- **Async/await** used throughout for non-blocking I/O
diff --git a/Architecture.md.url b/Architecture.md.url
new file mode 100644
index 0000000..fd64c11
--- /dev/null
+++ b/Architecture.md.url
@@ -0,0 +1,2 @@
+[InternetShortcut]
+URL=https://git.appmodel.nl/Tour/troost-scraper-wiki/src/branch/main/Architect.md
diff --git a/Deployment.md b/Deployment.md
new file mode 100644
index 0000000..74d6002
--- /dev/null
+++ b/Deployment.md
@@ -0,0 +1,27 @@
+# Deployment
+
+## Automatic Deployment
+
+This project uses automatic deployment via git hooks.
+
+### Pipeline
+
+1. Push to `main` branch
+2. Git hook triggers
+3. Code syncs to `/opt/apps/troost-scraper`
+4. Docker builds and deploys
+5. Container starts automatically
+
+### Manual Deployment
+
+```bash
+sudo apps:deploy troost-scraper
+```
+
+## Monitoring
+
+View logs:
+```bash
+tail -f /home/git/logs/apps:deploy-troost-scraper.log
+docker logs troost-scraper
+```
diff --git a/Getting-Started.md b/Getting-Started.md
new file mode 100644
index 0000000..abee363
--- /dev/null
+++ b/Getting-Started.md
@@ -0,0 +1,71 @@
+# Getting Started
+
+## Prerequisites
+
+- Python 3.8+
+- Git
+- pip (Python package manager)
+
+## Installation
+
+### 1. Clone the repository
+
+```bash
+git clone --recurse-submodules git@git.appmodel.nl:Tour/troost-scraper.git
+cd troost-scraper
+```
+
+### 2. Install dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 3. Install Playwright browsers
+
+```bash
+playwright install chromium
+```
+
+## Configuration
+
+Edit the configuration in `main.py`:
+
+```python
+BASE_URL = "https://www.troostwijkauctions.com"
+CACHE_DB = "/path/to/cache.db"    # Path to cache database
+OUTPUT_DIR = "/path/to/output"    # Output directory
+RATE_LIMIT_SECONDS = 0.5          # Delay between requests
+MAX_PAGES = 50                    # Number of listing pages to crawl
+```
+
+**Windows users:** Use paths like `C:\\output\\cache.db`
+
+## Usage
+
+### Basic scraping
+
+```bash
+python main.py
+```
+
+This will:
+1. Crawl listing pages to collect lot URLs
+2. Scrape each individual lot page
+3. Save results in JSON and CSV formats
+4. Cache all pages for future runs
+
+### Test mode
+
+Debug extraction on a specific URL:
+
+```bash
+python main.py --test "https://www.troostwijkauctions.com/a/lot-url"
+```
+
+## Output
+
+The scraper generates:
+- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data
+- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - CSV export
+- `cache.db` - SQLite cache (persistent)
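+
+To sanity-check a finished run, the final JSON can be loaded directly. This is an illustrative snippet: it assumes at least one export exists, that the export is a JSON array of lot records, and that field names follow the columns described in the architecture notes (e.g. `lot_id`, `title`, `current_bid`).
+
+```python
+import glob
+import json
+
+# Pick the most recent final export (path pattern is illustrative)
+latest = sorted(glob.glob("output/troostwijk_lots_final_*.json"))[-1]
+with open(latest, encoding="utf-8") as fh:
+    lots = json.load(fh)
+
+print(f"{len(lots)} lots scraped")
+for lot in lots[:5]:
+    print(lot.get("lot_id"), lot.get("title"), lot.get("current_bid"))
+```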
diff --git a/Home.md b/Home.md
new file mode 100644
index 0000000..2d784db
--- /dev/null
+++ b/Home.md
@@ -0,0 +1,14 @@
+# troost-scraper Wiki
+
+Welcome to the troost-scraper documentation.
+
+## Contents
+
+- [Getting Started](Getting-Started.md)
+- [Architecture](Architecture.md)
+- [Deployment](Deployment.md)
+
+## Quick Links
+
+- [Repository](https://git.appmodel.nl/Tour/troost-scraper)
+- [Issues](https://git.appmodel.nl/Tour/troost-scraper/issues)
diff --git a/README.md b/README.md
index 94aa631..15205f3 100644
--- a/README.md
+++ b/README.md
@@ -1,217 +1,3 @@
-# Troostwijk Auctions Scraper
+# troost-scraper-wiki
-A robust web scraper for extracting auction lot data from Troostwijk Auctions, featuring intelligent caching, rate limiting, and Cloudflare bypass capabilities.
-
-## Features
-
-- **Playwright-based scraping** - Bypasses Cloudflare protection
-- **SQLite caching** - Caches every page to avoid redundant requests
-- **Rate limiting** - Strictly enforces 0.5 seconds between requests
-- **Multi-format output** - Exports data in both JSON and CSV formats
-- **Progress saving** - Automatically saves progress every 10 lots
-- **Test mode** - Debug extraction patterns on cached pages
-
-## Requirements
-
-- Python 3.8+
-- Playwright (with Chromium browser)
-
-## Installation
-
-1. **Clone or download this project**
-
-2. **Install dependencies:**
-   ```bash
-   pip install -r requirements.txt
-   ```
-
-3. **Install Playwright browsers:**
-   ```bash
-   playwright install chromium
-   ```
-
-## Configuration
-
-Edit the configuration variables in `main.py`:
-
-```python
-BASE_URL = "https://www.troostwijkauctions.com"
-CACHE_DB = "/mnt/okcomputer/output/cache.db"   # Path to cache database
-OUTPUT_DIR = "/mnt/okcomputer/output"          # Output directory
-RATE_LIMIT_SECONDS = 0.5                       # Delay between requests
-MAX_PAGES = 50                                 # Number of listing pages to crawl
-```
-
-**Note:** Update the paths to match your system (especially on Windows, use paths like `C:\\output\\cache.db`).
-
-## Usage
-
-### Basic Scraping
-
-Run the scraper to collect auction lot data:
-
-```bash
-python main.py
-```
-
-This will:
-1. Crawl listing pages to collect lot URLs
-2. Scrape each individual lot page
-3. Save results in both JSON and CSV formats
-4. Cache all pages to avoid re-fetching
-
-### Test Mode
-
-Test extraction patterns on a specific cached URL:
-
-```bash
-# Test with default URL
-python main.py --test
-
-# Test with specific URL
-python main.py --test "https://www.troostwijkauctions.com/a/lot-url-here"
-```
-
-This is useful for debugging extraction patterns and verifying data is being extracted correctly.
-
-## Output Files
-
-The scraper generates the following files:
-
-### During Execution
-- `troostwijk_lots_partial_YYYYMMDD_HHMMSS.json` - Progress checkpoints (every 10 lots)
-
-### Final Output
-- `troostwijk_lots_final_YYYYMMDD_HHMMSS.json` - Complete data in JSON format
-- `troostwijk_lots_final_YYYYMMDD_HHMMSS.csv` - Complete data in CSV format
-
-### Cache
-- `cache.db` - SQLite database with cached page content (persistent across runs)
-
-## Data Extracted
-
-For each auction lot, the scraper extracts:
-
-- **URL** - Direct link to the lot
-- **Lot ID** - Unique identifier (e.g., A7-35847)
-- **Title** - Lot title/description
-- **Current Bid** - Current bid amount
-- **Bid Count** - Number of bids placed
-- **End Date** - Auction end time
-- **Location** - Physical location of the item
-- **Description** - Detailed description
-- **Category** - Auction category
-- **Images** - Up to 5 product images
-- **Scraped At** - Timestamp of data collection
-
-## How It Works
-
-### Phase 1: Collect Lot URLs
-The scraper iterates through auction listing pages (`/auctions?page=N`) and collects all lot URLs.
-
-### Phase 2: Scrape Individual Lots
-Each lot page is visited and data is extracted from the embedded JSON data (`__NEXT_DATA__`). The site is built with Next.js and includes all auction/lot data in a JSON structure, making extraction reliable and fast.
-
-### Caching Strategy
-- Every successfully fetched page is cached in SQLite
-- Cache is checked before making any request
-- Cache entries older than 7 days are automatically cleaned
-- Failed requests (500 errors) are also cached to avoid retrying
-
-### Rate Limiting
-- Enforces exactly 0.5 seconds between ALL requests
-- Applies to both listing pages and individual lot pages
-- Prevents server overload and potential IP blocking
-
-## Troubleshooting
-
-### Issue: "Huidig bod" / "Locatie" instead of actual values
-
-**✓ FIXED!** The site uses Next.js with all data embedded in `__NEXT_DATA__` JSON. The scraper now automatically extracts data from JSON first, falling back to HTML pattern matching only if needed.
-
-The scraper correctly extracts:
-- **Title** from `auction.name`
-- **Location** from `viewingDays` or `collectionDays`
-- **Images** from `auction.image.url`
-- **End date** from `minEndDate`
-- **Lot ID** from `auction.displayId`
-
-To verify extraction is working:
-```bash
-python main.py --test "https://www.troostwijkauctions.com/a/your-auction-url"
-```
-
-**Note:** Some URLs point to auction pages (collections of lots) rather than individual lots. Individual lots within auctions may have bid information, while auction pages show the collection details.
-
-### Issue: No lots found
-
-- Check if the website structure has changed
-- Verify `BASE_URL` is correct
-- Try clearing the cache database
-
-### Issue: Cloudflare blocking
-
-- Playwright should bypass this automatically
-- If issues persist, try adjusting user agent or headers in `crawl_auctions()`
-
-### Issue: Slow scraping
-
-- This is intentional due to rate limiting (0.5s between requests)
-- Adjust `RATE_LIMIT_SECONDS` if needed (not recommended below 0.5s)
-- First run will be slower; subsequent runs use cache
-
-## Project Structure
-
-```
-troost-scraper/
-├── main.py              # Main scraper script
-├── requirements.txt     # Python dependencies
-├── README.md            # This file
-└── output/              # Generated output files (created automatically)
-    ├── cache.db         # SQLite cache
-    ├── *.json           # JSON output files
-    └── *.csv            # CSV output files
-```
-
-## Development
-
-### Adding New Extraction Fields
-
-1. Add extraction method in `TroostwijkScraper` class:
-   ```python
-   def _extract_new_field(self, content: str) -> str:
-       pattern = r'your-regex-pattern'
-       match = re.search(pattern, content)
-       return match.group(1) if match else ""
-   ```
-
-2. Add field to `_parse_lot_page()`:
-   ```python
-   data = {
-       # ... existing fields ...
-       'new_field': self._extract_new_field(content),
-   }
-   ```
-
-3. Add field to CSV export in `save_final_results()`:
-   ```python
-   fieldnames = ['url', 'lot_id', ..., 'new_field', ...]
-   ```
-
-### Testing Extraction Patterns
-
-Use test mode to verify patterns work correctly:
-```bash
-python main.py --test "https://www.troostwijkauctions.com/a/your-test-url"
-```
-
-## License
-
-This scraper is for educational and research purposes. Please respect Troostwijk Auctions' terms of service and robots.txt when using this tool.
-
-## Notes
-
-- **Be respectful:** The rate limiting is intentionally conservative
-- **Check legality:** Ensure web scraping is permitted in your jurisdiction
-- **Monitor changes:** Website structure may change over time, requiring pattern updates
-- **Cache management:** Old cache entries are auto-cleaned after 7 days
+
+Wiki for troost-scraper
\ No newline at end of file