# Sophena - Troostwijk Auctions Data Extraction

A full-stack application for scraping and analyzing auction data from Troostwijk Auctions, consisting of a Quarkus backend and a set of Python scrapers.

## Prerequisites

- **Java 25** (for Quarkus)
- **Maven 3.8+**
- **Python 3.8+**
- **pip** (Python package manager)

## Project Structure

```
scrape-ui/
├── src/                           # Quarkus Java backend
├── python/                        # Python scrapers
│   ├── kimki-troost.py            # Main scraper
│   ├── advanced_crawler.py        # Advanced crawling system
│   └── troostwijk_data_extractor.py
├── public/                        # Static web assets
├── pom.xml                        # Maven configuration
└── README.md
```

## Getting Started

### 1. Starting the Quarkus Application

#### Development Mode (with hot reload)

```bash
mvn quarkus:dev
```

The application will start on `http://localhost:8080`.

#### Production Mode

Build the application:

```bash
mvn clean package
```

Run the packaged application:

```bash
java -jar target/quarkus-app/quarkus-run.jar
```

#### Using Docker

Build the Docker image:

```bash
docker build -t scrape-ui .
```

Run the container:

```bash
docker run -p 8080:8080 scrape-ui
```

### 2. Running the Python Scraper

#### Install Dependencies

```bash
cd python
pip install -r requirements.txt
```

If `requirements.txt` doesn't exist, install common dependencies:

```bash
pip install requests beautifulsoup4 selenium lxml
```

#### Run the Main Scraper

```bash
python kimki-troost.py
```

#### Alternative Scrapers

**Advanced Crawler** (with fallback strategies):

```bash
python advanced_crawler.py
```

**Data Extractor** (with mock data):

```bash
python troostwijk_data_extractor.py
```

## Features

### Quarkus Backend

- RESTful API with JAX-RS
- JSON serialization with Jackson
- Dependency injection with CDI
- Hot reload in development mode
- Optimized for Java 25

### Python Scraper

- Multiple scraping strategies
- User agent rotation
- Anti-detection mechanisms
- Data export to JSON/CSV
- Interactive dashboard generation
## API Endpoints

Access the Quarkus REST endpoints at:

- `http://localhost:8080/api/*`
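
A client call might look like the sketch below; the concrete resource paths depend on the JAX-RS resources defined in `src/`, so `"lots"` here is a hypothetical example, not a documented endpoint:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"

def api_url(resource, base=BASE_URL):
    """Build the full URL for a resource under the /api/ prefix."""
    return f"{base}/api/{resource.lstrip('/')}"

def fetch(resource):
    """GET a resource from the running backend and decode the JSON body."""
    with urllib.request.urlopen(api_url(resource)) as resp:
        return json.load(resp)

# Usage (with the backend running): fetch("lots")
```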
## Development

### Quarkus Dev Mode Features

- Automatic code reload on changes
- Dev UI available at `http://localhost:8080/q/dev`
- Built-in debugging support

### Python Development

- Scrapers output data to timestamped files
- Generated files include JSON, CSV, and analysis reports
- Interactive dashboard created as `index.html`

## Configuration

### Quarkus Configuration

Edit `src/main/resources/application.properties` for:

- Server port
- Database settings
- CORS configuration
- Logging levels
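
For example, a minimal `application.properties` covering these areas (the values shown are illustrative defaults, not the project's actual settings):

```properties
# Server port (8080 is the Quarkus default)
quarkus.http.port=8080

# CORS
quarkus.http.cors=true

# Logging
quarkus.log.level=INFO
```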
### Python Configuration

Modify scraper parameters in the Python files:

- Request delays
- User agents
- Target URLs
- Output formats
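
As a rough sketch of the kind of parameters involved (the names below are illustrative; the actual variables are defined inside each script):

```python
# Illustrative scraper settings; the real scripts define their own names.
SCRAPER_CONFIG = {
    "request_delay_seconds": 2.0,          # pause between requests
    "user_agents": ["Mozilla/5.0 ..."],    # pool to rotate through
    "target_url": "https://www.troostwijkauctions.com",
    "output_formats": ["json", "csv"],     # which files to write
}
```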
## Troubleshooting

### Quarkus Issues

- Ensure Java 25 is installed: `java -version`
- Clean and rebuild: `mvn clean install`
- Check that port 8080 is available

### Python Scraper Issues

- Website access restrictions may require proxy usage
- Increase delays between requests to avoid rate limiting
- Check for CAPTCHA requirements
- Verify that the target website's structure hasn't changed
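
When rate limiting is suspected, a simple exponential backoff between retries often helps; this is a generic sketch, not code taken from the scrapers:

```python
import time

def polite_get(fetch, url, delay=2.0, retries=3):
    """Try fetch(url) up to `retries` times, sleeping `delay` seconds
    after each failure and doubling the delay (exponential backoff)."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2
```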
## Data Output

Scraped data is saved in the `python/` directory:

- `troostwijk_kavels_*.json` - Complete dataset
- `troostwijk_kavels_*.csv` - CSV format
- `troostwijk_analysis_*.json` - Statistical analysis
- `index.html` - Interactive visualization dashboard
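
Since the filenames are timestamped, the most recent run can be located with a glob (a small sketch assuming the default filename patterns above):

```python
import glob
import os

def latest_output(pattern="python/troostwijk_kavels_*.json"):
    """Return the most recently modified file matching `pattern`,
    or None if no scraper run has produced output yet."""
    files = glob.glob(pattern)
    return max(files, key=os.path.getmtime) if files else None
```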
## License
[Your License Here]