Tour
2025-12-04 11:35:53 +01:00
commit d306a65c11
23 changed files with 1896 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,166 @@
# Sophena - Troostwijk Auctions Data Extraction
A full-stack application for scraping and analyzing auction data from Troostwijk Auctions. It consists of a Quarkus backend and a set of Python scrapers.
## Prerequisites
- **Java 25** (for Quarkus)
- **Maven 3.8+**
- **Python 3.8+**
- **pip** (Python package manager)
## Project Structure
```
scrape-ui/
├── src/ # Quarkus Java backend
├── python/ # Python scrapers
│ ├── kimki-troost.py # Main scraper
│ ├── advanced_crawler.py # Advanced crawling system
│ └── troostwijk_data_extractor.py
├── public/ # Static web assets
├── pom.xml # Maven configuration
└── README.md
```
## Getting Started
### 1. Starting the Quarkus Application
#### Development Mode (with hot reload)
```bash
mvn quarkus:dev
```
The application starts on `http://localhost:8080`.
#### Production Mode
Build the application:
```bash
mvn clean package
```
Run the packaged application:
```bash
java -jar target/quarkus-app/quarkus-run.jar
```
#### Using Docker
Build the Docker image:
```bash
docker build -t scrape-ui .
```
Run the container:
```bash
docker run -p 8080:8080 scrape-ui
```
### 2. Running the Python Scraper
#### Install Dependencies
```bash
cd python
pip install -r requirements.txt
```
If `requirements.txt` doesn't exist, install common dependencies:
```bash
pip install requests beautifulsoup4 selenium lxml
```
#### Run the Main Scraper
```bash
python kimki-troost.py
```
#### Alternative Scrapers
**Advanced Crawler** (with fallback strategies):
```bash
python advanced_crawler.py
```
**Data Extractor** (with mock data):
```bash
python troostwijk_data_extractor.py
```
## Features
### Quarkus Backend
- RESTful API with JAX-RS
- JSON serialization with Jackson
- Dependency injection with CDI
- Hot reload in development mode
- Targets Java 25 (see Prerequisites)
### Python Scraper
- Multiple scraping strategies
- User agent rotation
- Anti-detection mechanisms
- Data export to JSON/CSV
- Interactive dashboard generation
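As an illustration of the user agent rotation listed above, here is a minimal sketch using only the standard library. The `USER_AGENTS` pool and the `build_request` helper are hypothetical names chosen for this example; the actual scrapers may implement rotation differently.

```python
import random
import urllib.request
from typing import Optional

# Hypothetical pool: the list actually used by the scrapers is not shown in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_request(url: str, rng: Optional[random.Random] = None) -> urllib.request.Request:
    """Return a Request whose User-Agent header is drawn at random from the pool."""
    chooser = rng if rng is not None else random
    headers = {"User-Agent": chooser.choice(USER_AGENTS)}
    # The request is only constructed here; nothing is sent until urlopen() is called.
    return urllib.request.Request(url, headers=headers)
```

Passing a seeded `random.Random` makes the choice reproducible, which helps when debugging a run.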
## API Endpoints
The Quarkus REST endpoints are served under the `/api` path:
- `http://localhost:8080/api/*`
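The README does not list individual endpoints, so the following is only a sketch of how a Python client might call the backend. The resource name `lots` is a hypothetical placeholder, as are the helper names `api_url` and `fetch_json`; substitute whatever paths the backend actually exposes.

```python
import json
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://localhost:8080/api/"  # base path taken from this README


def api_url(resource: str) -> str:
    """Join a resource name onto the API base URL."""
    return urljoin(BASE_URL, resource.lstrip("/"))


def fetch_json(resource: str, timeout: float = 10.0):
    """GET a resource and decode the JSON response body."""
    request = urllib.request.Request(
        api_url(resource), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # "lots" is a made-up resource name used purely for illustration.
    print(api_url("lots"))
```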
## Development
### Quarkus Dev Mode Features
- Automatic code reload on changes
- Dev UI available at `http://localhost:8080/q/dev`
- Built-in debugging support
### Python Development
- Scrapers output data to timestamped files
- Generated files include JSON, CSV, and analysis reports
- Interactive dashboard created as `index.html`
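The timestamped output described above could be produced along these lines. This is a sketch, not the scrapers' actual code: the `write_outputs` function, the `%Y%m%d_%H%M%S` timestamp format, and the record shape (a list of dicts) are assumptions; only the `troostwijk_kavels` prefix comes from the Data Output section.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def write_outputs(records, out_dir=".", prefix="troostwijk_kavels", now=None):
    """Write a list of dicts to timestamped JSON and CSV files; return both paths."""
    # Assumed timestamp format; the real scrapers may use a different one.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{prefix}_{stamp}.json"
    csv_path = out / f"{prefix}_{stamp}.csv"

    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # Union of keys in first-seen order, so rows with missing fields still fit.
    fieldnames = list(dict.fromkeys(key for row in records for key in row))
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return json_path, csv_path
```

Writing both files from the same timestamp keeps each JSON/CSV pair easy to match up afterwards.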
## Configuration
### Quarkus Configuration
Edit `src/main/resources/application.properties` for:
- Server port
- Database settings
- CORS configuration
- Logging levels
### Python Configuration
Modify scraper parameters in the Python files:
- Request delays
- User agents
- Target URLs
- Output formats
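One way to keep these parameters in a single place is a small configuration object. The sketch below is an assumption about how that could look, not how the scrapers are currently written: the `ScraperConfig` class, its field names, and all default values (including the placeholder URL) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ScraperConfig:
    """Illustrative settings container; every name and default is hypothetical."""

    # Placeholder URL, not the real target.
    target_urls: List[str] = field(default_factory=lambda: ["https://example.com/auctions"])
    user_agents: List[str] = field(default_factory=lambda: ["Mozilla/5.0 (X11; Linux x86_64)"])
    # Minimum and maximum pause between requests, in seconds.
    delay_range: Tuple[float, float] = (2.0, 5.0)
    output_formats: Tuple[str, ...] = ("json", "csv")

    def __post_init__(self):
        low, high = self.delay_range
        if low < 0 or high < low:
            raise ValueError("delay_range must satisfy 0 <= min <= max")
        unknown = set(self.output_formats) - {"json", "csv"}
        if unknown:
            raise ValueError(f"unsupported output formats: {sorted(unknown)}")


if __name__ == "__main__":
    print(ScraperConfig(delay_range=(3.0, 8.0)))
```

Validating in `__post_init__` surfaces a bad setting at startup rather than partway through a crawl.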
## Troubleshooting
### Quarkus Issues
- Ensure Java 25 is installed: `java -version`
- Clean and rebuild: `mvn clean install`
- Check port 8080 is available
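A quick, cross-platform way to test whether something is already listening on port 8080 is a short Python check. The `port_in_use` helper is a name made up for this sketch.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8080 in use:", port_in_use(8080))
```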
### Python Scraper Issues
- Website access restrictions may require proxy usage
- Increase delays between requests to avoid rate limiting
- Check for CAPTCHA requirements
- Verify target website structure hasn't changed
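Increasing delays can be automated with exponential backoff. The following is a sketch under stated assumptions: `backoff_delays` and `fetch_with_backoff` are hypothetical helpers, `fetch` stands for whatever function the scraper uses to download a page, and the default timings are arbitrary.

```python
import random
import time


def backoff_delays(base=2.0, factor=2.0, max_delay=60.0, retries=5):
    """Yield delay ceilings: base, base*factor, ... each capped at max_delay."""
    delay = base
    for _ in range(retries):
        yield min(delay, max_delay)
        delay *= factor


def fetch_with_backoff(fetch, url, sleep=time.sleep, **backoff_options):
    """Call fetch(url), retrying with jittered exponential backoff on failure."""
    ceilings = list(backoff_delays(**backoff_options))
    last_error = None
    for attempt, ceiling in enumerate(ceilings, start=1):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to the client's error types in real code
            last_error = exc
            if attempt < len(ceilings):
                # Jitter spreads retries out so they do not arrive in lockstep.
                sleep(random.uniform(ceiling / 2, ceiling))
    raise RuntimeError(f"giving up on {url}") from last_error
```

The `sleep` parameter exists so that the waiting can be replaced in tests; in normal use the default `time.sleep` applies.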
## Data Output
Scraped data is saved in the `python/` directory:
- `troostwijk_kavels_*.json` - Complete dataset
- `troostwijk_kavels_*.csv` - CSV format
- `troostwijk_analysis_*.json` - Statistical analysis
- `index.html` - Interactive visualization dashboard
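Because the file names contain a variable part, downstream code has to locate the newest dataset before loading it. A minimal sketch, assuming the JSON file holds a standard JSON document and that modification time is an acceptable proxy for "latest"; `latest_dataset` and `load_latest` are hypothetical helper names.

```python
import json
from pathlib import Path


def latest_dataset(directory="python", pattern="troostwijk_kavels_*.json"):
    """Return the most recently modified file matching pattern, or None."""
    matches = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    return matches[-1] if matches else None


def load_latest(directory="python"):
    """Load and return the parsed contents of the newest dataset file."""
    path = latest_dataset(directory)
    if path is None:
        raise FileNotFoundError(f"no troostwijk_kavels_*.json file in {directory}")
    return json.loads(path.read_text(encoding="utf-8"))
```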
## License
[Your License Here]