Sophena - Troostwijk Auctions Data Extraction
A full-stack application for scraping and analyzing auction data from Troostwijk Auctions, consisting of a Quarkus backend and a Python scraper.
Prerequisites
- Java 25 (for Quarkus)
- Maven 3.8+
- Python 3.8+
- pip (Python package manager)
Project Structure
scrape-ui/
├── src/ # Quarkus Java backend
├── python/ # Python scrapers
│ ├── kimki-troost.py # Main scraper
│ ├── advanced_crawler.py # Advanced crawling system
│ └── troostwijk_data_extractor.py
├── public/ # Static web assets
├── pom.xml # Maven configuration
└── README.md
Getting Started
1. Starting the Quarkus Application
Development Mode (with hot reload)
mvn quarkus:dev
The application will start on http://localhost:8080
Production Mode
Build the application:
mvn clean package
Run the packaged application:
java -jar target/quarkus-app/quarkus-run.jar
Using Docker
Build the Docker image:
docker build -t scrape-ui .
Run the container:
docker run -p 8080:8080 scrape-ui
2. Running the Python Scraper
Install Dependencies
cd python
pip install -r requirements.txt
If requirements.txt doesn't exist, install common dependencies:
pip install requests beautifulsoup4 selenium lxml
Run the Main Scraper
python kimki-troost.py
Alternative Scrapers
Advanced Crawler (with fallback strategies):
python advanced_crawler.py
Data Extractor (with mock data):
python troostwijk_data_extractor.py
Features
Quarkus Backend
- RESTful API with JAX-RS
- JSON serialization with Jackson
- Dependency injection with CDI
- Hot reload in development mode
- Optimized for Java 25
Python Scraper
- Multiple scraping strategies
- User agent rotation
- Anti-detection mechanisms
- Data export to JSON/CSV
- Interactive dashboard generation
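As a rough illustration of the user agent rotation listed above, the sketch below cycles through a small pool of header sets. It is not taken from the scrapers in this repository: the `USER_AGENTS` values and the `rotating_headers` helper are hypothetical names chosen for the example.

```python
import itertools

# Hypothetical user-agent strings, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/3.0",
]


def rotating_headers(user_agents):
    """Yield request headers forever, cycling through the user agents in order."""
    for agent in itertools.cycle(user_agents):
        yield {"User-Agent": agent}


headers = rotating_headers(USER_AGENTS)
# Take one more header set than there are agents to show the wrap-around.
picked = [next(headers)["User-Agent"] for _ in range(len(USER_AGENTS) + 1)]
```

Each header dictionary could then be passed to an HTTP client such as `requests` through its `headers` argument.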
API Endpoints
Access the Quarkus REST endpoints at:
http://localhost:8080/api/*
Development
Quarkus Dev Mode Features
- Automatic code reload on changes
- Dev UI available at http://localhost:8080/q/dev
- Built-in debugging support
Python Development
- Scrapers output data to timestamped files
- Generated files include JSON, CSV, and analysis reports
- Interactive dashboard created as index.html
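A minimal sketch of how timestamped JSON and CSV output could be produced is shown below. It assumes a `YYYYMMDD_HHMMSS` timestamp and reuses the `troostwijk_kavels_` prefix from the Data Output section; the `export_lots` helper and the sample fields (`lot_id`, `title`) are hypothetical and may not match what the scrapers actually write.

```python
import csv
import json
import tempfile
from datetime import datetime
from pathlib import Path


def export_lots(lots, out_dir, now=None):
    """Write lots to timestamped JSON and CSV files and return both paths."""
    # Assumed timestamp format; the real scrapers may use a different one.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    json_path = out_dir / f"troostwijk_kavels_{stamp}.json"
    json_path.write_text(
        json.dumps(lots, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Use the union of all keys so rows with missing fields still fit the header.
    fieldnames = sorted({key for lot in lots for key in lot})
    csv_path = out_dir / f"troostwijk_kavels_{stamp}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(lots)

    return json_path, csv_path


# Hypothetical sample data written to a throwaway directory.
sample = [{"lot_id": "1", "title": "Forklift"}, {"lot_id": "2", "title": "Lathe"}]
with tempfile.TemporaryDirectory() as tmp:
    json_path, csv_path = export_lots(sample, tmp, now=datetime(2025, 1, 2, 3, 4, 5))
    names = (json_path.name, csv_path.name)
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
```

Passing `now` explicitly keeps the file names predictable in tests; in normal use the current time is taken.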
Configuration
Quarkus Configuration
Edit src/main/resources/application.properties for:
- Server port
- Database settings
- CORS configuration
- Logging levels
Python Configuration
Modify scraper parameters in the Python files:
- Request delays
- User agents
- Target URLs
- Output formats
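One way to keep these parameters in a single place is a small configuration object, sketched below. This is an assumption about how the settings might be grouped, not the layout used in the scraper files: `ScraperConfig`, its field names, and the placeholder URL are all hypothetical.

```python
import random
from dataclasses import dataclass, field

# Output formats named in this README.
SUPPORTED_FORMATS = {"json", "csv"}


@dataclass
class ScraperConfig:
    # Placeholder URL; replace with the real target.
    target_url: str = "https://example.com/auctions"
    min_delay: float = 1.0  # seconds
    max_delay: float = 3.0  # seconds
    user_agents: list = field(default_factory=lambda: ["ExampleBrowser/1.0"])
    output_formats: list = field(default_factory=lambda: ["json", "csv"])

    def __post_init__(self):
        # Reject inconsistent settings early instead of failing mid-scrape.
        if self.min_delay < 0 or self.max_delay < self.min_delay:
            raise ValueError("delays must satisfy 0 <= min_delay <= max_delay")
        unknown = set(self.output_formats) - SUPPORTED_FORMATS
        if unknown:
            raise ValueError(f"unsupported output formats: {sorted(unknown)}")

    def next_delay(self):
        """Return a random delay between min_delay and max_delay, in seconds."""
        return random.uniform(self.min_delay, self.max_delay)


config = ScraperConfig(min_delay=0.5, max_delay=1.5)
```

A scraper loop could call `time.sleep(config.next_delay())` between requests.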
Troubleshooting
Quarkus Issues
- Ensure Java 25 is installed: java -version
- Clean and rebuild: mvn clean install
- Check port 8080 is available
Python Scraper Issues
- Website access restrictions may require proxy usage
- Increase delays between requests to avoid rate limiting
- Check for CAPTCHA requirements
- Verify target website structure hasn't changed
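For the rate-limiting advice above, one common approach is exponential backoff: wait longer after each failed attempt, up to a cap. The sketch below only computes the delay schedule; `backoff_delays` and its parameters are hypothetical and are not part of the existing scrapers.

```python
import random


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, attempts=5, jitter=0.0):
    """Return a list of delays in seconds that grow exponentially up to max_delay."""
    delays = []
    for attempt in range(attempts):
        # The delay grows by `factor` each attempt and is capped at max_delay.
        delay = min(base * factor ** attempt, max_delay)
        # Optional random jitter spreads out retries; 0.0 disables it.
        delay += random.uniform(0, jitter)
        delays.append(delay)
    return delays


schedule = backoff_delays(base=1.0, factor=2.0, max_delay=10.0, attempts=5)
```

A retry loop could sleep for each value in `schedule` before the next attempt and give up once the list is exhausted.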
Data Output
Scraped data is saved in the python/ directory:
- troostwijk_kavels_*.json - Complete dataset
- troostwijk_kavels_*.csv - CSV format
- troostwijk_analysis_*.json - Statistical analysis
- index.html - Interactive visualization dashboard
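Because the file names contain a timestamp, several runs leave several datasets side by side. The sketch below picks the most recently modified file for a given pattern. The `latest_file` helper is hypothetical, and the demo file names are made up to match the `troostwijk_kavels_*.json` pattern.

```python
import os
import tempfile
from pathlib import Path


def latest_file(directory, pattern):
    """Return the most recently modified file matching pattern, or None."""
    matches = list(Path(directory).glob(pattern))
    if not matches:
        return None
    return max(matches, key=lambda path: path.stat().st_mtime)


# Demo in a throwaway directory with two made-up dataset files.
with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "troostwijk_kavels_a.json"
    newer = Path(tmp) / "troostwijk_kavels_b.json"
    older.write_text("[]", encoding="utf-8")
    newer.write_text("[]", encoding="utf-8")
    # Set modification times explicitly so the ordering is unambiguous.
    os.utime(older, (1_000, 1_000))
    os.utime(newer, (2_000, 2_000))
    found = latest_file(tmp, "troostwijk_kavels_*.json").name
    missing = latest_file(tmp, "troostwijk_analysis_*.json")
```

In the project itself the call would look like `latest_file("python", "troostwijk_kavels_*.json")`.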
License
[Your License Here]