sophena/README.md
2025-12-04 11:35:53 +01:00

Sophena - Troostwijk Auctions Data Extraction

A full-stack application for scraping and analyzing auction data from Troostwijk Auctions, consisting of a Quarkus backend and a Python scraper.

Prerequisites

  • Java 25 (for Quarkus)
  • Maven 3.8+
  • Python 3.8+
  • pip (Python package manager)

Project Structure

scrape-ui/
├── src/                    # Quarkus Java backend
├── python/                 # Python scrapers
│   ├── kimki-troost.py            # Main scraper
│   ├── advanced_crawler.py        # Advanced crawling system
│   └── troostwijk_data_extractor.py
├── public/                 # Static web assets
├── pom.xml                # Maven configuration
└── README.md

Getting Started

1. Starting the Quarkus Application

Development Mode (with hot reload)

mvn quarkus:dev

The application will start on http://localhost:8080

Production Mode

Build the application:

mvn clean package

Run the packaged application:

java -jar target/quarkus-app/quarkus-run.jar

Using Docker

Build the Docker image:

docker build -t scrape-ui .

Run the container:

docker run -p 8080:8080 scrape-ui

2. Running the Python Scraper

Install Dependencies

cd python
pip install -r requirements.txt

If requirements.txt doesn't exist, install the common dependencies directly:

pip install requests beautifulsoup4 selenium lxml

Run the Main Scraper

python kimki-troost.py

Alternative Scrapers

Advanced Crawler (with fallback strategies):

python advanced_crawler.py

Data Extractor (with mock data):

python troostwijk_data_extractor.py

Features

Quarkus Backend

  • RESTful API with JAX-RS
  • JSON serialization with Jackson
  • Dependency injection with CDI
  • Hot reload in development mode
  • Targets Java 25

Python Scraper

  • Multiple scraping strategies
  • User agent rotation
  • Anti-detection mechanisms
  • Data export to JSON/CSV
  • Interactive dashboard generation
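
User agent rotation can be illustrated with a minimal sketch. This is not the code from kimki-troost.py: the USER_AGENTS pool and the request_factory helper are hypothetical names chosen for the example, and only the standard library is used. Requests are built but not sent.

```python
import itertools
import urllib.request

# Hypothetical pool; substitute the user agents the scrapers actually use.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
)


def request_factory(user_agents=USER_AGENTS):
    """Return a function that builds requests, cycling through user_agents."""
    pool = itertools.cycle(user_agents)

    def build(url):
        # Each call takes the next user agent in round-robin order.
        # The request object is only constructed here, not sent.
        return urllib.request.Request(url, headers={"User-Agent": next(pool)})

    return build


if __name__ == "__main__":
    build = request_factory()
    for path in ("a", "b", "c"):
        req = build(f"https://example.com/{path}")
        # urllib stores header names capitalized as "User-agent".
        print(req.full_url, "->", req.get_header("User-agent"))
```

Round-robin rotation keeps the behaviour deterministic and easy to test; a random choice per request would work equally well if predictability is not needed.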

API Endpoints

The Quarkus REST endpoints are served under the /api path:

  • http://localhost:8080/api/*

Development

Quarkus Dev Mode Features

  • Automatic code reload on changes
  • Dev UI available at http://localhost:8080/q/dev-ui (on Quarkus 3.x, /q/dev redirects there)
  • Built-in debugging support

Python Development

  • Scrapers output data to timestamped files
  • Generated files include JSON, CSV, and analysis reports
  • Interactive dashboard created as index.html
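
As a rough sketch of what timestamped output could look like, the function below writes a list of lot records to a JSON file and a CSV file that share one timestamp. The export_lots name, the record fields, and the YYYYMMDD_HHMMSS timestamp format are assumptions made for illustration; the actual scrapers may name and structure their files differently.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def export_lots(lots, out_dir=".", prefix="troostwijk_kavels", now=None):
    """Write lots (a list of flat dicts) to timestamped JSON and CSV files.

    Returns the two paths as (json_path, csv_path).
    """
    # Assumed timestamp format; the README only says files are timestamped.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / f"{prefix}_{stamp}.json"
    json_path.write_text(
        json.dumps(lots, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # Use the union of all keys (first-seen order) as CSV columns, so that
    # records with missing fields still fit; missing values become "".
    fieldnames = list(dict.fromkeys(key for lot in lots for key in lot))
    csv_path = out / f"{prefix}_{stamp}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(lots)

    return json_path, csv_path
```

Passing now explicitly makes the file names reproducible, which helps when testing.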

Configuration

Quarkus Configuration

Edit src/main/resources/application.properties to configure:

  • Server port
  • Database settings
  • CORS configuration
  • Logging levels

Python Configuration

Modify scraper parameters in the Python files:

  • Request delays
  • User agents
  • Target URLs
  • Output formats
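
One way to keep these four parameters in a single place is a small configuration object. The sketch below is an assumption, not the structure used in the existing scripts: ScraperConfig, its field names, and the default values are all hypothetical.

```python
from dataclasses import dataclass

# Formats mentioned in this README; extend if the scrapers support more.
SUPPORTED_FORMATS = {"json", "csv"}


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical container for the tunable scraper parameters."""

    target_url: str = "https://www.troostwijkauctions.com"  # assumed default
    request_delay_seconds: float = 2.0  # assumed default
    user_agents: tuple = (
        "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    )
    output_formats: tuple = ("json", "csv")

    def __post_init__(self):
        # Fail early on values that would break a scraping run.
        if self.request_delay_seconds < 0:
            raise ValueError("request_delay_seconds must not be negative")
        if not self.user_agents:
            raise ValueError("at least one user agent is required")
        unknown = set(self.output_formats) - SUPPORTED_FORMATS
        if unknown:
            raise ValueError(f"unsupported output formats: {sorted(unknown)}")


if __name__ == "__main__":
    # Override only what differs from the defaults.
    config = ScraperConfig(request_delay_seconds=5.0, output_formats=("json",))
    print(config)
```

A frozen dataclass makes the configuration immutable for the duration of a run, so a scraper cannot change its own delay or target by accident.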

Troubleshooting

Quarkus Issues

  • Ensure Java 25 is installed: java -version
  • Clean and rebuild: mvn clean install
  • Check that port 8080 is available
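
If no system tool is at hand for the port check, a few lines of Python can do it. This is a minimal sketch: port_is_free is a hypothetical helper that only tests whether something accepts TCP connections on the local loopback address.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success, i.e. something is listening.
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    state = "free" if port_is_free(8080) else "already in use"
    print(f"Port 8080 is {state}")
```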

Python Scraper Issues

  • Website access restrictions may require proxy usage
  • Increase delays between requests to avoid rate limiting
  • Check for CAPTCHA requirements
  • Verify that the target website's structure hasn't changed
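
Increasing delays can be automated with exponential backoff. The following is a sketch under assumptions, not the retry logic of the existing scrapers: fetch and is_rate_limited are caller-supplied hypothetical callables, so the example stays independent of any particular HTTP library.

```python
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, attempts=5):
    """Yield exponentially growing delays in seconds, capped at max_delay."""
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor


def fetch_with_backoff(fetch, is_rate_limited, sleep=time.sleep, **backoff):
    """Call fetch() until its response is no longer rate limited.

    fetch:           callable returning a response object (hypothetical)
    is_rate_limited: callable(response) -> bool, e.g. a check for HTTP 429
    sleep:           injectable so tests do not have to wait
    """
    for delay in backoff_delays(**backoff):
        response = fetch()
        if not is_rate_limited(response):
            return response
        # Wait longer after every rate-limited attempt.
        sleep(delay)
    raise RuntimeError("still rate limited after all attempts")
```

With the defaults, the waits are 1, 2, 4, 8, and 16 seconds before the function gives up; raise base or attempts if the site throttles more aggressively.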

Data Output

Scraped data is saved in the python/ directory:

  • troostwijk_kavels_*.json - Complete dataset in JSON format ("kavels" is Dutch for "lots")
  • troostwijk_kavels_*.csv - Dataset in CSV format
  • troostwijk_analysis_*.json - Statistical analysis
  • index.html - Interactive visualization dashboard
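
Because each run produces new files, downstream code usually wants the most recent one. The sketch below picks it by modification time; latest_output and load_latest_lots are hypothetical helpers, and the assumption that the JSON file holds a single JSON document is not confirmed by the scrapers themselves.

```python
import json
from pathlib import Path


def latest_output(directory="python", pattern="troostwijk_kavels_*.json"):
    """Return the most recently modified file matching pattern, or None."""
    candidates = sorted(
        Path(directory).glob(pattern), key=lambda path: path.stat().st_mtime
    )
    return candidates[-1] if candidates else None


def load_latest_lots(directory="python"):
    """Parse the newest dataset file; assumes it contains one JSON document."""
    path = latest_output(directory)
    if path is None:
        raise FileNotFoundError(f"no troostwijk_kavels_*.json in {directory}")
    return json.loads(path.read_text(encoding="utf-8"))
```

Sorting by modification time avoids depending on the exact timestamp format in the file names.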

License

[Your License Here]