# Sophena - Troostwijk Auctions Data Extraction

A full-stack application for scraping and analyzing auction data from Troostwijk Auctions, consisting of a Quarkus backend and a Python scraper.

## Prerequisites

- **Java 25** (for Quarkus)
- **Maven 3.8+**
- **Python 3.8+**
- **pip** (Python package manager)

## Project Structure

```
scrape-ui/
├── src/                              # Quarkus Java backend
├── python/                           # Python scrapers
│   ├── kimki-troost.py               # Main scraper
│   ├── advanced_crawler.py           # Advanced crawling system
│   └── troostwijk_data_extractor.py  # Data extractor (mock data)
├── public/                           # Static web assets
├── pom.xml                           # Maven configuration
└── README.md
```

## Getting Started

### 1. Starting the Quarkus Application

#### Development Mode (with hot reload)

```bash
mvn quarkus:dev
```

The application will start on `http://localhost:8080`.

#### Production Mode

Build the application:

```bash
mvn clean package
```

Run the packaged application:

```bash
java -jar target/quarkus-app/quarkus-run.jar
```

#### Using Docker

Build the Docker image:

```bash
docker build -t scrape-ui .
```

Run the container:

```bash
docker run -p 8080:8080 scrape-ui
```

### 2. Running the Python Scraper

#### Install Dependencies

```bash
cd python
pip install -r requirements.txt
```

If `requirements.txt` doesn't exist, install common dependencies:

```bash
pip install requests beautifulsoup4 selenium lxml
```

#### Run the Main Scraper

```bash
python kimki-troost.py
```

#### Alternative Scrapers

**Advanced Crawler** (with fallback strategies):

```bash
python advanced_crawler.py
```

**Data Extractor** (with mock data):

```bash
python troostwijk_data_extractor.py
```

## Features

### Quarkus Backend

- RESTful API with JAX-RS
- JSON serialization with Jackson
- Dependency injection with CDI
- Hot reload in development mode
- Optimized for Java 25

### Python Scraper

- Multiple scraping strategies
- User-agent rotation
- Anti-detection mechanisms
- Data export to JSON/CSV
- Interactive dashboard generation
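
The user-agent rotation can be sketched roughly as follows; the agent strings below are illustrative placeholders, not the exact ones the scrapers ship with:

```python
import random

# Illustrative user-agent pool -- the strings the project's scrapers
# actually rotate through may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    """Build request headers with a randomly chosen user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Passing `random_headers()` as the `headers=` argument of each HTTP request keeps successive requests from presenting an identical fingerprint.
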
## API Endpoints

Access the Quarkus REST endpoints at:

- `http://localhost:8080/api/*`
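
A minimal sketch of calling these endpoints from Python; the `lots` resource name below is a hypothetical example, since the concrete paths depend on the JAX-RS resources the backend defines:

```python
BASE_URL = "http://localhost:8080"

def api_url(path):
    """Build a full URL for a backend API path."""
    return f"{BASE_URL}/api/{path.lstrip('/')}"

# With the app running, fetch JSON using only the standard library:
# import json, urllib.request
# data = json.load(urllib.request.urlopen(api_url("lots")))
```
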
## Development

### Quarkus Dev Mode Features

- Automatic code reload on changes
- Dev UI available at `http://localhost:8080/q/dev`
- Built-in debugging support

### Python Development

- Scrapers output data to timestamped files
- Generated files include JSON, CSV, and analysis reports
- Interactive dashboard created as `index.html`

## Configuration

### Quarkus Configuration

Edit `src/main/resources/application.properties` for:

- Server port
- Database settings
- CORS configuration
- Logging levels
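
A minimal sketch of such a file, using standard Quarkus property names (the port and origin values are placeholders to adjust for your environment):

```properties
# HTTP
quarkus.http.port=8080

# CORS, e.g. for a separately served frontend (origin is a placeholder)
quarkus.http.cors=true
quarkus.http.cors.origins=http://localhost:3000

# Logging
quarkus.log.level=INFO
```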

### Python Configuration

Modify scraper parameters in the Python files:

- Request delays
- User agents
- Target URLs
- Output formats
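
As a sketch, such parameters might be grouped in a settings block like the one below; every name and value here is a placeholder, since the real settings live inline in the scraper files:

```python
# Illustrative settings block -- adapt names and values to the scrapers.
SCRAPER_CONFIG = {
    "base_url": "https://www.troostwijkauctions.com",  # target site
    "request_delay_seconds": 2.0,  # pause between consecutive requests
    "max_retries": 3,              # retry budget per page
    "output_formats": ["json", "csv"],
}
```
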
## Troubleshooting

### Quarkus Issues

- Ensure Java 25 is installed: `java -version`
- Clean and rebuild: `mvn clean install`
- Check that port 8080 is available

### Python Scraper Issues

- Website access restrictions may require proxy usage
- Increase delays between requests to avoid rate limiting
- Check for CAPTCHA requirements
- Verify the target website structure hasn't changed
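
A common way to implement those growing delays is exponential backoff with jitter. This is a generic pattern, not the scrapers' actual retry logic:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt`.

    Doubles the nominal delay each attempt, caps it, then randomizes to
    50-100% of the nominal value so retries don't fire in lockstep.
    """
    nominal = min(cap, base * (2 ** attempt))
    return nominal * (0.5 + random.random() / 2)

# In a scraping loop: time.sleep(backoff_delay(attempt))
```
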
## Data Output

Scraped data is saved in the `python/` directory:

- `troostwijk_kavels_*.json` - Complete dataset
- `troostwijk_kavels_*.csv` - CSV format
- `troostwijk_analysis_*.json` - Statistical analysis
- `index.html` - Interactive visualization dashboard
## License

[Your License Here]