# Sophena - Troostwijk Auctions Data Extraction

A full-stack application for scraping and analyzing auction data from Troostwijk Auctions, consisting of a Quarkus backend and a set of Python scrapers.

## Prerequisites

- **Java 25** (for Quarkus)
- **Maven 3.8+**
- **Python 3.8+**
- **pip** (Python package manager)

## Project Structure

```
scrape-ui/
├── src/                           # Quarkus Java backend
├── python/                        # Python scrapers
│   ├── kimki-troost.py            # Main scraper
│   ├── advanced_crawler.py        # Advanced crawling system
│   └── troostwijk_data_extractor.py
├── public/                        # Static web assets
├── pom.xml                        # Maven configuration
└── README.md
```

## Getting Started

### 1. Starting the Quarkus Application

#### Development Mode (with hot reload)

```bash
mvn quarkus:dev
```

The application will start on `http://localhost:8080`.

#### Production Mode

Build the application:

```bash
mvn clean package
```

Run the packaged application:

```bash
java -jar target/quarkus-app/quarkus-run.jar
```

#### Using Docker

Build the Docker image:

```bash
docker build -t scrape-ui .
```

Run the container:

```bash
docker run -p 8080:8080 scrape-ui
```

### 2. Running the Python Scraper

#### Install Dependencies

```bash
cd python
pip install -r requirements.txt
```

If `requirements.txt` doesn't exist, install common dependencies:

```bash
pip install requests beautifulsoup4 selenium lxml
```

#### Run the Main Scraper

```bash
python kimki-troost.py
```

#### Alternative Scrapers

**Advanced Crawler** (with fallback strategies):

```bash
python advanced_crawler.py
```

**Data Extractor** (with mock data):

```bash
python troostwijk_data_extractor.py
```

## Features

### Quarkus Backend

- RESTful API with JAX-RS
- JSON serialization with Jackson
- Dependency injection with CDI
- Hot reload in development mode
- Optimized for Java 25

### Python Scraper

- Multiple scraping strategies
- User agent rotation
- Anti-detection mechanisms
- Data export to JSON/CSV
- Interactive dashboard generation
## API Endpoints

Access the Quarkus REST endpoints at:

- `http://localhost:8080/api/*`
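
A client call might look like the sketch below; the concrete resource paths depend on the JAX-RS resources defined in `src/`, so `"lots"` here is a hypothetical example, not a documented endpoint:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"

def api_url(resource, base=BASE_URL):
    """Build the full URL for a resource under the /api/ prefix."""
    return f"{base}/api/{resource.lstrip('/')}"

def fetch(resource):
    """GET a resource from the running backend and decode the JSON body."""
    with urllib.request.urlopen(api_url(resource)) as resp:
        return json.load(resp)

# Usage (with the backend running): fetch("lots")
```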
## Development

### Quarkus Dev Mode Features

- Automatic code reload on changes
- Dev UI available at `http://localhost:8080/q/dev`
- Built-in debugging support

### Python Development

- Scrapers output data to timestamped files
- Generated files include JSON, CSV, and analysis reports
- Interactive dashboard created as `index.html`

## Configuration

### Quarkus Configuration

Edit `src/main/resources/application.properties` for:

- Server port
- Database settings
- CORS configuration
- Logging levels
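
For example, a minimal `application.properties` covering these areas (the values shown are illustrative defaults, not the project's actual settings):

```properties
# Server port (8080 is the Quarkus default)
quarkus.http.port=8080

# CORS
quarkus.http.cors=true

# Logging
quarkus.log.level=INFO
```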
### Python Configuration

Modify scraper parameters in the Python files:

- Request delays
- User agents
- Target URLs
- Output formats
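
As a rough sketch of the kind of parameters involved (the names below are illustrative; the actual variables are defined inside each script):

```python
# Illustrative scraper settings; the real scripts define their own names.
SCRAPER_CONFIG = {
    "request_delay_seconds": 2.0,          # pause between requests
    "user_agents": ["Mozilla/5.0 ..."],    # pool to rotate through
    "target_url": "https://www.troostwijkauctions.com",
    "output_formats": ["json", "csv"],     # which files to write
}
```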
## Troubleshooting

### Quarkus Issues

- Ensure Java 25 is installed: `java -version`
- Clean and rebuild: `mvn clean install`
- Check that port 8080 is available

### Python Scraper Issues

- Website access restrictions may require proxy usage
- Increase delays between requests to avoid rate limiting
- Check for CAPTCHA requirements
- Verify that the target website's structure hasn't changed
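
When rate limiting is suspected, a simple exponential backoff between retries often helps; this is a generic sketch, not code taken from the scrapers:

```python
import time

def polite_get(fetch, url, delay=2.0, retries=3):
    """Try fetch(url) up to `retries` times, sleeping `delay` seconds
    after each failure and doubling the delay (exponential backoff)."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2
```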
## Data Output

Scraped data is saved in the `python/` directory:

- `troostwijk_kavels_*.json` - Complete dataset
- `troostwijk_kavels_*.csv` - CSV format
- `troostwijk_analysis_*.json` - Statistical analysis
- `index.html` - Interactive visualization dashboard
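
Since the filenames are timestamped, the most recent run can be located with a glob (a small sketch assuming the default filename patterns above):

```python
import glob
import os

def latest_output(pattern="python/troostwijk_kavels_*.json"):
    """Return the most recently modified file matching `pattern`,
    or None if no scraper run has produced output yet."""
    files = glob.glob(pattern)
    return max(files, key=os.path.getmtime) if files else None
```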
## License
[Your License Here]