# Sophena - Troostwijk Auctions Data Extraction

A full-stack application for scraping and analyzing auction data from Troostwijk Auctions, consisting of a Quarkus backend and a Python scraper.

## Prerequisites

- **Java 25** (for Quarkus)
- **Maven 3.8+**
- **Python 3.8+**
- **pip** (Python package manager)

## Project Structure

```
scrape-ui/
├── src/                             # Quarkus Java backend
├── python/                          # Python scrapers
│   ├── kimki-troost.py              # Main scraper
│   ├── advanced_crawler.py          # Advanced crawling system
│   └── troostwijk_data_extractor.py
├── public/                          # Static web assets
├── pom.xml                          # Maven configuration
└── README.md
```

## Getting Started

### 1. Starting the Quarkus Application

#### Development Mode (with hot reload)

```bash
mvn quarkus:dev
```

The application will start on `http://localhost:8080`.

#### Production Mode

Build the application:

```bash
mvn clean package
```

Run the packaged application:

```bash
java -jar target/quarkus-app/quarkus-run.jar
```

#### Using Docker

Build the Docker image:

```bash
docker build -t scrape-ui .
```

Run the container:

```bash
docker run -p 8080:8080 scrape-ui
```

### 2. Running the Python Scraper

#### Install Dependencies

```bash
cd python
pip install -r requirements.txt
```

If `requirements.txt` doesn't exist, install the common dependencies:

```bash
pip install requests beautifulsoup4 selenium lxml
```

#### Run the Main Scraper

```bash
python kimki-troost.py
```

#### Alternative Scrapers

**Advanced Crawler** (with fallback strategies):

```bash
python advanced_crawler.py
```

**Data Extractor** (with mock data):

```bash
python troostwijk_data_extractor.py
```

## Features

### Quarkus Backend

- RESTful API with JAX-RS
- JSON serialization with Jackson
- Dependency injection with CDI
- Hot reload in development mode
- Optimized for Java 25

### Python Scraper

- Multiple scraping strategies
- User agent rotation
- Anti-detection mechanisms
- Data export to JSON/CSV
- Interactive dashboard generation

## API Endpoints

Access the Quarkus REST endpoints at:

- `http://localhost:8080/api/*`

## Development

### Quarkus Dev Mode Features

- Automatic code reload on changes
- Dev UI available at `http://localhost:8080/q/dev`
- Built-in debugging support

### Python Development

- Scrapers output data to timestamped files
- Generated files include JSON, CSV, and analysis reports
- Interactive dashboard created as `index.html`

## Configuration

### Quarkus Configuration

Edit `src/main/resources/application.properties` to set:

- Server port
- Database settings
- CORS configuration
- Logging levels

### Python Configuration

Modify scraper parameters directly in the Python files:

- Request delays
- User agents
- Target URLs
- Output formats

## Troubleshooting

### Quarkus Issues

- Ensure Java 25 is installed: `java -version`
- Clean and rebuild: `mvn clean install`
- Check that port 8080 is available

### Python Scraper Issues

- Website access restrictions may require a proxy
- Increase delays between requests to avoid rate limiting
- Check for CAPTCHA requirements
- Verify that the target website structure hasn't changed

## Data Output

Scraped data is saved in the `python/` directory:

- `troostwijk_kavels_*.json` - Complete dataset
- `troostwijk_kavels_*.csv` - CSV format
- `troostwijk_analysis_*.json` - Statistical analysis
- `index.html` - Interactive visualization dashboard

## License

[Your License Here]
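## Example Snippets

The snippets below are minimal, hedged sketches that illustrate sections above; they are not part of the project code. They assume the Python dependencies listed under *Install Dependencies*.

**Calling the backend API.** The *API Endpoints* section exposes REST resources under `http://localhost:8080/api/*`. The exact resource paths depend on the JAX-RS classes in `src/`, so the `/api/lots` path below is a hypothetical placeholder:

```python
# api_client_sketch.py - minimal sketch for querying the Quarkus backend.
# The "/api/lots" path is a hypothetical placeholder; substitute a real
# resource path exposed by the JAX-RS classes under src/.
import json

import requests

BASE_URL = "http://localhost:8080"


def fetch_json(path):
    """GET a JSON payload from the backend and decode it."""
    response = requests.get(f"{BASE_URL}{path}", timeout=10)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    data = fetch_json("/api/lots")  # hypothetical endpoint
    print(json.dumps(data, indent=2))
```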
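**User agent rotation and request delays.** The *Python Scraper* features and the *Python Configuration* section describe request delays, user agents, and target URLs as tunable parameters. The sketch below shows one way to combine them; the constants are illustrative assumptions, not the actual values used by `kimki-troost.py`:

```python
# scraper_sketch.py - illustrative sketch of user agent rotation and
# request delays as described in the Features section. USER_AGENTS,
# DELAY_SECONDS, and START_URL are assumptions for the example, not
# values taken from kimki-troost.py.
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
DELAY_SECONDS = 2.0  # pause between requests to avoid rate limiting
START_URL = "https://www.troostwijkauctions.com/"  # example target URL


def fetch(url):
    """Fetch a page with a randomly chosen user agent, then wait."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    time.sleep(DELAY_SECONDS)
    return response.text


if __name__ == "__main__":
    html = fetch(START_URL)
    print(f"Fetched {len(html)} characters")
```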
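**Loading the scraped output.** The *Data Output* section lists timestamped files written to `python/`. The sketch below locates the most recent dataset and analysis files by modification time; it assumes the files contain valid JSON but makes no assumption about their internal record structure:

```python
# load_output_sketch.py - locate and load the newest scraper output files.
# Assumes the files are valid JSON; their record structure is not
# documented here, so the script only reports what it loaded.
import glob
import json
import os


def latest(pattern):
    """Return the newest file matching the glob pattern, or None."""
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None


if __name__ == "__main__":
    for pattern in ("python/troostwijk_kavels_*.json",
                    "python/troostwijk_analysis_*.json"):
        path = latest(pattern)
        if path is None:
            print(f"No files match {pattern}")
            continue
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        print(f"Loaded {path} ({len(data)} top-level entries)")
```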