Tour d306a65c11 Init

2025-12-04 11:35:53 +01:00

Sophena - Troostwijk Auctions Data Extraction

A full-stack application for scraping and analyzing auction data from Troostwijk Auctions. It consists of a Quarkus backend and a set of Python scrapers.

Prerequisites

Java 25 (for Quarkus)
Maven 3.8+
Python 3.8+
pip (Python package manager)

Project Structure

scrape-ui/
├── src/                    # Quarkus Java backend
├── python/                 # Python scrapers
│   ├── kimki-troost.py            # Main scraper
│   ├── advanced_crawler.py        # Advanced crawling system
│   └── troostwijk_data_extractor.py
├── public/                 # Static web assets
├── pom.xml                # Maven configuration
└── README.md

Getting Started

1. Starting the Quarkus Application

Development Mode (with hot reload)

mvn quarkus:dev

The application will start on http://localhost:8080

Production Mode

Build the application:

mvn clean package

Run the packaged application:

java -jar target/quarkus-app/quarkus-run.jar

Using Docker

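Build the Docker image (this assumes a Dockerfile in the project root, which is not listed in the project structure above):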

docker build -t scrape-ui .

Run the container:

docker run -p 8080:8080 scrape-ui

2. Running the Python Scraper

Install Dependencies

cd python
pip install -r requirements.txt

If requirements.txt doesn't exist, install common dependencies:

pip install requests beautifulsoup4 selenium lxml

Run the Main Scraper

python kimki-troost.py

Alternative Scrapers

Advanced Crawler (with fallback strategies):

python advanced_crawler.py

Data Extractor (with mock data):

python troostwijk_data_extractor.py

Features

Quarkus Backend

RESTful API with JAX-RS
JSON serialization with Jackson
Dependency injection with CDI
Hot reload in development mode
Built for Java 25

Python Scraper

Multiple scraping strategies
User agent rotation
Anti-detection mechanisms
Data export to JSON/CSV
Interactive dashboard generation
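
As a rough illustration of user agent rotation combined with randomized request delays, the sketch below cycles through a list of user agents and pauses between requests. It is not taken from the scrapers in this repository: the names `USER_AGENTS`, `rotating_headers`, and `polite_delay` are hypothetical, and the user-agent strings are placeholders.

```python
import itertools
import random
import time

# Placeholder user-agent strings (assumption); replace with real values.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/3.0",
]


def rotating_headers(user_agents):
    """Yield request headers forever, cycling through the given user agents."""
    for agent in itertools.cycle(user_agents):
        yield {"User-Agent": agent}


def polite_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval and return the chosen delay in seconds."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Each call to `next()` on the generator returns the headers for the next request, for example `requests.get(url, headers=next(headers))` followed by `polite_delay()`.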

API Endpoints

Access the Quarkus REST endpoints at:

http://localhost:8080/api/*
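
A minimal sketch of calling the API from Python, using only the standard library. The base URL comes from this README. The individual resource paths are not documented here, so any path passed to `api_url` or `get_json` (both hypothetical helper names) is an assumption you need to replace with a real endpoint.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/api/"  # API root as stated in this README


def api_url(resource, base_url=BASE_URL):
    """Build the full URL for a resource path below the API root."""
    return base_url.rstrip("/") + "/" + resource.lstrip("/")


def get_json(resource, timeout=10):
    """GET a resource and decode the response body as JSON.

    Requires the Quarkus application to be running locally.
    """
    with urllib.request.urlopen(api_url(resource), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```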

Development

Quarkus Dev Mode Features

Automatic code reload on changes
Dev UI available at http://localhost:8080/q/dev
Built-in debugging support

Python Development

Scrapers output data to timestamped files
Generated files include JSON, CSV, and analysis reports
Interactive dashboard created as index.html
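
The following sketch shows one way to write records to timestamped JSON and CSV files. It is illustrative only: the `YYYYMMDD_HHMMSS` timestamp format, the function names, and the assumption that each record is a flat dict are not taken from the scrapers themselves.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def timestamped_name(prefix, extension, now=None):
    """Return a file name such as prefix_20250102_030405.json."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.{extension}"


def export_records(records, out_dir=".", prefix="troostwijk_kavels", now=None):
    """Write records (a list of flat dicts) to JSON and CSV; return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    now = now or datetime.now()  # same timestamp for both files
    json_path = out_dir / timestamped_name(prefix, "json", now)
    csv_path = out_dir / timestamped_name(prefix, "csv", now)

    json_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Column order: union of all keys, in the order they first appear.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
    return json_path, csv_path
```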

Configuration

Quarkus Configuration

Edit src/main/resources/application.properties for:

Server port
Database settings
CORS configuration
Logging levels

Python Configuration

Modify scraper parameters in the Python files:

Request delays
User agents
Target URLs
Output formats
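
If you prefer to keep these parameters in one place instead of scattered through the scripts, a small settings object is one option. The sketch below is hypothetical: `ScraperConfig`, its field names, and its default values are assumptions and do not exist in the current Python files.

```python
from dataclasses import dataclass, field


@dataclass
class ScraperConfig:
    """Hypothetical settings container; names and defaults are illustrative."""

    # Placeholder URL and user agent (assumptions), not the real targets.
    target_urls: list = field(
        default_factory=lambda: ["https://example.com/auctions"]
    )
    user_agents: list = field(
        default_factory=lambda: ["Mozilla/5.0 (compatible; ExampleScraper/1.0)"]
    )
    min_delay_seconds: float = 1.0
    max_delay_seconds: float = 3.0
    output_formats: tuple = ("json", "csv")

    def __post_init__(self):
        # Fail early on settings that cannot work.
        if self.min_delay_seconds < 0 or self.max_delay_seconds < self.min_delay_seconds:
            raise ValueError("delays must satisfy 0 <= min <= max")
        unknown = set(self.output_formats) - {"json", "csv"}
        if unknown:
            raise ValueError(f"unsupported output formats: {sorted(unknown)}")
```

For example, `ScraperConfig(min_delay_seconds=2.0, max_delay_seconds=5.0)` slows the scraper down without touching any other setting.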

Troubleshooting

Quarkus Issues

Ensure Java 25 is installed: java -version
Clean and rebuild: mvn clean install
Check that port 8080 is available
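
One way to check the port without platform-specific tools is a short Python probe. This is a minimal sketch; `port_is_free` is a hypothetical helper, and it only tests whether something accepts TCP connections on the local loopback address.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # which means another process is already listening.
        return probe.connect_ex((host, port)) != 0


if __name__ == "__main__":
    print("port 8080 free:", port_is_free(8080))
```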

Python Scraper Issues

If the website restricts access, a proxy may be required
Increase delays between requests to avoid rate limiting
Check for CAPTCHA requirements
Verify that the target website's structure hasn't changed
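
For rate limiting in particular, retrying with progressively longer waits often helps. The sketch below shows exponential backoff around an arbitrary fetch function. `fetch_with_backoff` and its parameters are hypothetical and are not part of the existing scrapers.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each failed attempt.

    `fetch` is any callable that returns the page or raises an exception.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Wait base_delay, then 2x, then 4x, ... before retrying.
            sleep(base_delay * 2 ** attempt)
```

With `requests`, a suitable `fetch` could be a small function that calls `requests.get(url)` followed by `response.raise_for_status()`, so that HTTP error responses also trigger a retry.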

Data Output

Scraped data is saved in the python/ directory:

troostwijk_kavels_*.json - Complete dataset
troostwijk_kavels_*.csv - CSV format
troostwijk_analysis_*.json - Statistical analysis
index.html - Interactive visualization dashboard
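
To pick up the newest dataset for further analysis, you can match the file pattern listed above and sort by modification time. This is a sketch under assumptions: `latest_dataset` and `load_dataset` are hypothetical helpers, and the structure of the JSON content is not specified in this README.

```python
import json
from pathlib import Path


def latest_dataset(directory="python", pattern="troostwijk_kavels_*.json"):
    """Return the most recently modified file matching the pattern, or None."""
    matches = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    return matches[-1] if matches else None


def load_dataset(path):
    """Parse a scraped JSON file and return its content."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```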

License

[Your License Here]