# Sophena - Troostwijk Auctions Data Extraction

A full-stack application for scraping and analyzing auction data from Troostwijk Auctions, consisting of a Quarkus backend and a Python scraper.

## Prerequisites

- **Java 25** (for Quarkus)
- **Maven 3.8+**
- **Python 3.8+**
- **pip** (Python package manager)

## Project Structure

```
scrape-ui/
├── src/                              # Quarkus Java backend
├── python/                           # Python scrapers
│   ├── kimki-troost.py               # Main scraper
│   ├── advanced_crawler.py           # Advanced crawling system
│   └── troostwijk_data_extractor.py  # Data extractor (mock data)
├── public/                           # Static web assets
├── pom.xml                           # Maven configuration
└── README.md
```

## Getting Started

### 1. Starting the Quarkus Application

#### Development Mode (with hot reload)

```bash
mvn quarkus:dev
```

The application will start on `http://localhost:8080`.

#### Production Mode

Build the application:

```bash
mvn clean package
```

Run the packaged application:

```bash
java -jar target/quarkus-app/quarkus-run.jar
```

#### Using Docker

Build the Docker image:

```bash
docker build -t scrape-ui .
```

Run the container:

```bash
docker run -p 8080:8080 scrape-ui
```

### 2. Running the Python Scraper

#### Install Dependencies

```bash
cd python
pip install -r requirements.txt
```

If `requirements.txt` doesn't exist, install common dependencies:

```bash
pip install requests beautifulsoup4 selenium lxml
```

#### Run the Main Scraper

```bash
python kimki-troost.py
```

#### Alternative Scrapers

**Advanced Crawler** (with fallback strategies):

```bash
python advanced_crawler.py
```

**Data Extractor** (with mock data):

```bash
python troostwijk_data_extractor.py
```

## Features

### Quarkus Backend

- RESTful API with JAX-RS
- JSON serialization with Jackson
- Dependency injection with CDI
- Hot reload in development mode
- Optimized for Java 25

### Python Scraper

- Multiple scraping strategies
- User-agent rotation
- Anti-detection mechanisms
- Data export to JSON/CSV
- Interactive dashboard generation
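
The user-agent rotation can be sketched roughly as follows; the agent strings below are illustrative placeholders, not the exact ones the scrapers ship with:

```python
import random

# Illustrative user-agent pool -- the strings the project's scrapers
# actually rotate through may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    """Build request headers with a randomly chosen user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Passing `random_headers()` as the `headers=` argument of each HTTP request keeps successive requests from presenting an identical fingerprint.
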
## API Endpoints

Access the Quarkus REST endpoints at:

- `http://localhost:8080/api/*`
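
A minimal sketch of calling these endpoints from Python; the `lots` resource name below is a hypothetical example, since the concrete paths depend on the JAX-RS resources the backend defines:

```python
BASE_URL = "http://localhost:8080"

def api_url(path):
    """Build a full URL for a backend API path."""
    return f"{BASE_URL}/api/{path.lstrip('/')}"

# With the app running, fetch JSON using only the standard library:
# import json, urllib.request
# data = json.load(urllib.request.urlopen(api_url("lots")))
```
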
## Development

### Quarkus Dev Mode Features

- Automatic code reload on changes
- Dev UI available at `http://localhost:8080/q/dev`
- Built-in debugging support

### Python Development

- Scrapers output data to timestamped files
- Generated files include JSON, CSV, and analysis reports
- Interactive dashboard created as `index.html`

## Configuration

### Quarkus Configuration

Edit `src/main/resources/application.properties` for:

- Server port
- Database settings
- CORS configuration
- Logging levels
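
A minimal sketch of such a file, using standard Quarkus property names (the port and origin values are placeholders to adjust for your environment):

```properties
# HTTP
quarkus.http.port=8080

# CORS, e.g. for a separately served frontend (origin is a placeholder)
quarkus.http.cors=true
quarkus.http.cors.origins=http://localhost:3000

# Logging
quarkus.log.level=INFO
```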

### Python Configuration

Modify scraper parameters in the Python files:

- Request delays
- User agents
- Target URLs
- Output formats
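
As a sketch, such parameters might be grouped in a settings block like the one below; every name and value here is a placeholder, since the real settings live inline in the scraper files:

```python
# Illustrative settings block -- adapt names and values to the scrapers.
SCRAPER_CONFIG = {
    "base_url": "https://www.troostwijkauctions.com",  # target site
    "request_delay_seconds": 2.0,  # pause between consecutive requests
    "max_retries": 3,              # retry budget per page
    "output_formats": ["json", "csv"],
}
```
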
## Troubleshooting

### Quarkus Issues

- Ensure Java 25 is installed: `java -version`
- Clean and rebuild: `mvn clean install`
- Check that port 8080 is available

### Python Scraper Issues

- Website access restrictions may require proxy usage
- Increase delays between requests to avoid rate limiting
- Check for CAPTCHA requirements
- Verify the target website structure hasn't changed
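
A common way to implement those growing delays is exponential backoff with jitter. This is a generic pattern, not the scrapers' actual retry logic:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt`.

    Doubles the nominal delay each attempt, caps it, then randomizes to
    50-100% of the nominal value so retries don't fire in lockstep.
    """
    nominal = min(cap, base * (2 ** attempt))
    return nominal * (0.5 + random.random() / 2)

# In a scraping loop: time.sleep(backoff_delay(attempt))
```
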
## Data Output

Scraped data is saved in the `python/` directory:

- `troostwijk_kavels_*.json` - Complete dataset
- `troostwijk_kavels_*.csv` - CSV format
- `troostwijk_analysis_*.json` - Statistical analysis
- `index.html` - Interactive visualization dashboard
## License

[Your License Here]