chore: update 6 file(s)

mike
2025-12-17 22:30:41 +01:00
parent a53c0e2902
commit 4343b7a5a2
6 changed files with 1122 additions and 220 deletions

README.md

@@ -1,16 +1,15 @@
# Verbatim Dicta
Real-time audio transcription using Whisper AI with optional LLM-powered analysis. Captures system audio via loopback and transcribes it with configurable models and processing options.
Real-time audio transcription using Whisper AI with optional LLM analysis. Captures microphone and speaker audio simultaneously and transcribes each stream separately.
## Features
- Real-time transcription of system audio (Windows/Linux)
- Multiple Whisper model sizes (tiny to large)
- Multi-language support
- **Sentence extraction mode** - Stitches audio chunks into complete sentences
- Optional LLM analysis for fact-checking and question generation (via Ollama)
- GPU acceleration support
- Flexible audio device configuration
- **Dual audio capture** - Record microphone and speaker output simultaneously
- **Real-time transcription** - Process audio as it's captured with Whisper models
- **LLM analysis** - Optional fact-checking and question generation via Ollama
- **Multi-language** - Support for 50+ languages
- **File output** - Save transcripts with timestamps and analysis
- **GPU acceleration** - CUDA support for faster processing
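The sentence extraction mode is only described at the feature level here. As a rough illustration of the idea, the sketch below buffers transcribed chunk text and releases only sentences that end in `.`, `!` or `?`. `SentenceStitcher` is a hypothetical name, not the code in `transcribe_speakers.py`. A naive punctuation split like this will also break on abbreviations such as "Dr.".

```python
import re
from typing import List

# Split on whitespace that follows terminal punctuation (hypothetical helper,
# not the project's implementation).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceStitcher:
    """Accumulates transcribed chunks and emits only complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk_text: str) -> List[str]:
        # Join the carried-over fragment with the new chunk.
        text = (self._pending + " " + chunk_text).strip()
        parts = _SENTENCE_END.split(text)
        # The last part is complete only if it ends with terminal punctuation.
        if parts and not parts[-1].endswith((".", "!", "?")):
            self._pending = parts.pop()
        else:
            self._pending = ""
        return [p for p in parts if p]

    def flush(self) -> str:
        # Hand back any incomplete text, e.g. when capture stops.
        rest, self._pending = self._pending, ""
        return rest
```

Incomplete text carries over to the next chunk. A sentence split across two Whisper windows therefore comes out whole, and `flush()` returns the leftover fragment at shutdown.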
## Quick Start
@@ -18,17 +17,14 @@ Real-time audio transcription using Whisper AI with optional LLM-powered analysi
# Install dependencies
pip install -r requirements.txt
# Basic transcription (no LLM)
python transcribe_speakers.py
# With LLM analysis (optional)
python transcribe_speakers.py --enable-llm
# With sentence extraction
python transcribe_speakers.py --sentence-mode
# List audio devices
python transcribe_speakers.py --list-devices
./run_transcribe.sh --list-devices
# Basic transcription
./run_transcribe.sh --model medium --language en
# With LLM analysis and file output
./run_transcribe.sh --model medium --enable-llm --output transcript.txt
```
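`--list-devices` helps you find the names to pass to `--mic` and `--monitor`. Below is a minimal sketch of such a listing. It assumes the `sounddevice` package named in the architecture notes, whose `query_devices()` returns one entry per device with a `name` and channel counts. Only devices with input channels can be recorded from, and loopback sources such as Stereo Mix or VB-Cable's `CABLE Output` are recording devices. The function names are hypothetical, and the script's real output format may differ.

```python
from typing import Iterable, List, Mapping


def format_input_devices(devices: Iterable[Mapping]) -> List[str]:
    """Return one line per device that can record (has input channels)."""
    lines = []
    for index, dev in enumerate(devices):
        if dev["max_input_channels"] > 0:
            lines.append(f"{index:3d}  {dev['name']}  ({dev['max_input_channels']} ch in)")
    return lines


def list_input_devices() -> None:
    """Print recordable devices as reported by sounddevice (hypothetical helper)."""
    import sounddevice as sd  # imported lazily so the formatter has no audio dependency

    for line in format_input_devices(sd.query_devices()):
        print(line)
```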
## Requirements
@@ -58,172 +54,153 @@ For CUDA 12.1:
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### 3. Audio Loopback Setup
### 3. Audio Setup
**Windows - Option A (Stereo Mix):**
1. Right-click speaker icon → Sounds → Recording tab
2. Right-click → Show Disabled Devices
3. Enable and set Stereo Mix as default
**Linux (PulseAudio/PipeWire):**
```bash
# List devices to find your monitor device
./run_transcribe.sh --list-devices
**Windows - Option B (VB-Cable, recommended):**
1. Download from [vb-audio.com](https://vb-audio.com/Cable/)
2. Install and restart
3. Use `--device "CABLE Output"`
# Use with monitor device
./run_transcribe.sh --monitor "alsa_output.monitor"
```
**Linux:**
Configure PulseAudio loopback or use `transcribe_dual_linux.py`
**Windows:**
- Enable "Stereo Mix" in Sound settings, or
- Install VB-Cable from [vb-audio.com](https://vb-audio.com/Cable/)
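The options table describes `--device` as a partial name match. One way to resolve a partial name to a device index is sketched below. It prefers an exact case-insensitive match and otherwise falls back to the first input device whose name contains the query. `find_input_device` is a hypothetical helper. `sounddevice` can also resolve a name string itself, but it raises an error when several devices match.

```python
from typing import Mapping, Optional, Sequence


def find_input_device(devices: Sequence[Mapping], query: str) -> Optional[int]:
    """Resolve a (partial) device name to an index; hypothetical helper."""
    q = query.casefold()
    inputs = [i for i, dev in enumerate(devices) if dev["max_input_channels"] > 0]
    # 1) An exact, case-insensitive name match wins.
    for i in inputs:
        if devices[i]["name"].casefold() == q:
            return i
    # 2) Otherwise take the first input device whose name contains the query.
    for i in inputs:
        if q in devices[i]["name"].casefold():
            return i
    return None
```

The returned index can be passed as `device=` to `sounddevice.InputStream`.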
### 4. LLM Features (Optional)
### 4. LLM Support (Optional)
```bash
# Install Ollama from ollama.com
ollama pull llama3.2
ollama pull qwen2.5:3b
```
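Once Ollama is running, LLM analysis amounts to sending each transcript segment to the local Ollama server. The sketch below uses only the standard library, Ollama's `/api/generate` endpoint, and its default port 11434. The function names and the example prompt are assumptions, not the script's actual prompt.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def build_generate_request(model: str, prompt: str) -> bytes:
    # stream=False asks Ollama for one JSON object instead of a stream of chunks.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def parse_generate_response(body: bytes) -> str:
    # The generated text is in the "response" field.
    return json.loads(body)["response"]


def ask_ollama(prompt: str, model: str = "qwen2.5:3b", timeout: float = 60.0) -> str:
    """Send one prompt to a local Ollama server (hypothetical helper)."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_generate_response(resp.read())
```

For example, `ask_ollama("Is this statement factual? The Earth orbits the Sun in 365 days.")` returns the model's reply as a string, provided the server is running and the model has been pulled.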
## Usage
### Available Scripts
- `transcribe_speakers.py` - Main script with all features (LLM optional via `--enable-llm`)
- `transcribe_dual_linux.py` - Linux-specific with dual audio support
### Common Commands
### Command Line Options
```bash
# Quick start with GPU (English)
./RUN_GPU.sh
python transcribe.py [OPTIONS]
# Dutch language
./RUN_DUTCH.sh
# Dutch with LLM analysis
./RUN_DUTCH_LLM.sh
# With LLM analysis
./RUN_GPU.sh --enable-llm
# Save to file
./RUN_GPU.sh --output transcript.txt
# Other languages (Spanish, French, German, etc.)
./RUN_GPU.sh --language es # Spanish
./RUN_GPU.sh --language fr # French
./RUN_GPU.sh --language de # German
# Maximum accuracy with LLM and sentence extraction
python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt
# Force CPU (if GPU issues)
python transcribe_speakers.py --force-cpu
Options:
--model {tiny,base,small,medium,large} Whisper model (default: tiny)
--language CODE Language code (default: en)
--mic DEVICE Microphone device name
--monitor DEVICE Speaker monitor device name
--interval SECONDS Processing interval (default: 5.0)
--min-duration SECONDS Minimum audio duration (default: 2.0)
--enable-llm Enable LLM analysis
--llm-model MODEL Ollama model (default: qwen2.5:3b)
--output FILE Save transcript to file
--force-cpu Force CPU processing
--list-devices List audio devices
```
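The option list maps directly onto `argparse`. The sketch below reproduces the flags and defaults shown above (`tiny`, `en`, 5.0 s, 2.0 s, `qwen2.5:3b`). It is a hypothetical reconstruction for reference, not the parser in `transcribe.py`. The older options table in this README lists different defaults (`base`, 8.0 s, 3.0 s, `llama3.2`).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line options."""
    p = argparse.ArgumentParser(description="Real-time transcription (sketch of the documented CLI)")
    p.add_argument("--model", choices=["tiny", "base", "small", "medium", "large"], default="tiny")
    p.add_argument("--language", default="en", help="Language code, e.g. en, nl, es")
    p.add_argument("--mic", help="Microphone device name")
    p.add_argument("--monitor", help="Speaker monitor device name")
    p.add_argument("--interval", type=float, default=5.0, help="Processing interval in seconds")
    p.add_argument("--min-duration", type=float, default=2.0, help="Minimum audio duration in seconds")
    p.add_argument("--enable-llm", action="store_true", help="Enable LLM analysis")
    p.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model")
    p.add_argument("--output", help="Save transcript to file")
    p.add_argument("--force-cpu", action="store_true", help="Force CPU processing")
    p.add_argument("--list-devices", action="store_true", help="List audio devices")
    return p
```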
### Key Options
### Examples
| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Model size: tiny/base/small/medium/large | base |
| `--language` | Language code (en/es/fr/de/ja/etc.) | en |
| `--device` | Audio device name (partial match) | Auto |
| `--interval` | Processing interval (seconds) | 8.0 |
| `--min-duration` | Minimum audio duration | 3.0 |
| `--fast-mode` | Fast mode (3-5x faster, lower accuracy) | False |
| `--enable-llm` | Enable fact-checking and questions | False |
| `--llm-model` | Ollama model to use | llama3.2 |
| `--output` | Save to file | None |
| `--force-cpu` | Disable GPU | False |
| `--gpu-index` | GPU device index | 0 |
| `--sentence-mode` | Extract complete sentences from chunks | False |
```bash
# Dutch transcription with LLM
./run_transcribe.sh --model medium --language nl --enable-llm
# High-quality meeting transcription
./run_transcribe.sh --model large --interval 8 --output meeting.txt
# Fast real-time transcription
./run_transcribe.sh --model tiny --interval 3 --min-duration 2
# Specific devices
./run_transcribe.sh --mic "USB Mic" --monitor "Monitor of Speakers"
```
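`--interval` and `--min-duration` trade latency for accuracy. Short intervals give faster output, while a larger minimum duration gives Whisper more context per call. One plausible way the two interact is sketched below, assuming 16 kHz audio. Buffered samples go to the model only when two conditions hold: at least `interval` seconds have passed since the last hand-off, and at least `min_duration` seconds of audio have accumulated. `ChunkScheduler` is a hypothetical helper, and the script's buffering may work differently.

```python
from typing import List, Optional, Sequence


class ChunkScheduler:
    """Hypothetical helper: decides when buffered audio is handed to Whisper."""

    def __init__(self, interval: float = 5.0, min_duration: float = 2.0,
                 sample_rate: int = 16_000, start: float = 0.0) -> None:
        self.interval = interval
        self.min_samples = int(min_duration * sample_rate)
        self._buffer: List[float] = []
        self._last_flush = start

    def add(self, samples: Sequence[float]) -> None:
        self._buffer.extend(samples)

    def maybe_flush(self, now: float) -> Optional[List[float]]:
        """Return the buffered samples if both conditions hold, otherwise None."""
        if now - self._last_flush < self.interval:
            return None  # too soon since the last hand-off
        if len(self._buffer) < self.min_samples:
            return None  # not enough audio yet; keep accumulating
        chunk, self._buffer = self._buffer, []
        self._last_flush = now
        return chunk
```

In a capture loop, `add()` would be called with incoming audio and `maybe_flush(time.monotonic())` from the processing loop. If the audio callback runs on another thread, guard both calls with a lock.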
## Model Performance
| Model | Size | Speed | Quality | Best For |
|-------|------|-------|---------|----------|
| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
| base | ~145 MB | Fast | Good | General real-time use |
| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
| medium | ~1.5 GB | Slow | Great | High accuracy needs |
| large | ~3 GB | Slowest | Best | Maximum accuracy |
## Optimization Presets
**Low Latency (Real-Time):**
```bash
python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5
```
**Balanced:**
```bash
python transcribe_speakers.py --model base --interval 5
```
**High Accuracy:**
```bash
python transcribe_speakers.py --model large --interval 10 --enable-llm
```
| Model | Size | Speed | Quality | Use Case |
|--------|--------|----------|---------|------------------------|
| tiny | 75 MB | Fastest | Basic | Real-time, low latency |
| base | 145 MB | Fast | Good | General use |
| small | 485 MB | Moderate | Better | Balanced |
| medium | 1.5 GB | Slow | Great | High accuracy |
| large | 3 GB | Slowest | Best | Maximum quality |
## Troubleshooting
**No loopback device:**
- Windows: Enable Stereo Mix or install VB-Cable
- Linux: Configure PulseAudio loopback
**No audio devices found:**
```bash
# List all devices
./run_transcribe.sh --list-devices
# Specify devices explicitly
./run_transcribe.sh --mic "device_name" --monitor "monitor_name"
```
**CUDA errors:**
```bash
python transcribe_speakers.py --force-cpu
# Force CPU processing
./run_transcribe.sh --force-cpu
```
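`--force-cpu` is the manual fix. The sketch below automates the fallback when loading a model with faster-whisper. It first tries `device="cuda"` with `float16`, then retries on the CPU with `int8` if loading raises. The helper names are hypothetical. The check covers load-time failures only: some CUDA library errors can surface later, during the first transcription, and still need `--force-cpu`.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(load: Callable[[str, str], T], force_cpu: bool = False) -> Tuple[T, str]:
    """Try CUDA first (unless forced to CPU); on failure, retry on CPU. Hypothetical helper."""
    attempts = [("cpu", "int8")] if force_cpu else [("cuda", "float16"), ("cpu", "int8")]
    last_error = None
    for device, compute_type in attempts:
        try:
            return load(device, compute_type), device
        except Exception as exc:  # e.g. missing CUDA libraries or out of GPU memory
            last_error = exc
    raise RuntimeError("could not load the model on any device") from last_error


def load_whisper(model_size: str = "tiny", force_cpu: bool = False):
    """Load a faster-whisper model with automatic CPU fallback."""
    from faster_whisper import WhisperModel  # imported lazily

    return load_with_fallback(
        lambda device, compute_type: WhisperModel(model_size, device=device, compute_type=compute_type),
        force_cpu=force_cpu,
    )
```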
**No audio captured:**
- Verify audio is playing
- Check device: `--list-devices`
- Increase system volume
**Ollama connection failed:**
```bash
# Start Ollama service
ollama serve
**Poor quality:**
- Use larger model: `--model medium`
# Pull required model
ollama pull qwen2.5:3b
```
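Both failures above can be detected before transcription starts. The sketch queries Ollama's `/api/tags` endpoint, which lists locally pulled models, so it can tell "server not running" apart from "model not pulled". The helper names are hypothetical. The check relies on Ollama storing an untagged pull such as `llama3.2` under the name `llama3.2:latest`.

```python
import json
import urllib.request
from typing import Set


def model_names(tags_body: bytes) -> Set[str]:
    """Extract model names from an /api/tags response body."""
    return {m["name"] for m in json.loads(tags_body).get("models", [])}


def has_model(names: Set[str], wanted: str) -> bool:
    # An untagged pull such as `ollama pull llama3.2` is listed as "llama3.2:latest".
    if ":" not in wanted:
        wanted += ":latest"
    return wanted in names


def check_ollama(wanted: str = "qwen2.5:3b", base_url: str = "http://localhost:11434") -> str:
    """Return "ok" or a hint for fixing the setup (hypothetical helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            names = model_names(resp.read())
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return "Ollama not reachable: start it with `ollama serve`"
    if not has_model(names, wanted):
        return f"Model missing: run `ollama pull {wanted}`"
    return "ok"
```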
**Poor transcription quality:**
- Use larger model: `--model medium` or `--model large`
- Increase interval: `--interval 10`
- Specify language: `--language <code>`
**Ollama errors:**
- Ensure Ollama is running
- Pull model: `ollama pull llama3.2`
- Specify the spoken language explicitly, e.g. `--language nl`
- Ensure good audio quality (reduce background noise)
## Output Format
**Standard:**
### Standard Output
```
[14:23:15] Transcribed audio segment.
[14:23:23] Another segment with timestamp.
🎤 [14:23:15] User speaking into microphone
🔊 [14:23:18] Audio from speakers or system
```
**With LLM (--enable-llm):**
### With LLM Analysis
```
🎤 [14:23:15] The Earth orbits the Sun in 365 days.
✅ FACTUAL (0.98): Scientifically accurate orbital period.
❓ Questions:
1. Why do we need leap years?
2. How does the elliptical orbit affect seasons?
3. What factors influence Earth's orbital velocity?
```
### File Output
```
[14:23:15] MIC: User speaking into microphone
[14:23:18] SPEAKER: Audio from speakers
======================================================================
[14:23:15] The Earth revolves around the Sun in 365 days.
[14:23:25] MIC: The Earth orbits the Sun in 365 days.
📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate. Earth's orbital period is 365.25 days.
💡 Scientifically accurate orbital period.
❓ Questions:
1. Why do we need leap years?
2. How does Earth's orbit affect seasons?
======================================================================
2. How does the elliptical orbit affect seasons?
3. What factors influence Earth's orbital velocity?
```
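The file format above can be produced with plain string formatting. The sketch assumes the layout shown in the example: a timestamp, a `MIC`/`SPEAKER` label, then an optional fact-check block. The function names and the 70-character separator width are assumptions read off the sample output, not taken from the code.

```python
from datetime import datetime
from typing import Optional, Sequence

SEPARATOR = "=" * 70  # width assumed from the sample output


def format_line(source: str, text: str, when: datetime) -> str:
    """One transcript line, e.g. '[14:23:15] MIC: text'."""
    return f"[{when:%H:%M:%S}] {source}: {text}"


def format_analysis(verdict: str, confidence: float, explanation: str,
                    questions: Sequence[str]) -> str:
    """Fact-check block written under an analysed line."""
    lines = [f"📊 Fact Check: {verdict} (confidence: {confidence:.2f})",
             f"💡 {explanation}",
             "❓ Questions:"]
    lines += [f"{i}. {q}" for i, q in enumerate(questions, start=1)]
    return "\n".join(lines)


def append_entry(path: str, line: str, analysis: Optional[str] = None) -> None:
    """Append a line (and its analysis, framed by separators) to the transcript file."""
    with open(path, "a", encoding="utf-8") as f:
        if analysis is None:
            f.write(line + "\n")
        else:
            f.write(f"{SEPARATOR}\n{line}\n{analysis}\n{SEPARATOR}\n")
```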
## Technical Stack
## Architecture
- **Audio**: sounddevice, soundfile (16kHz mono, 16-bit PCM)
- **Transcription**: faster-whisper (optimized Whisper)
- **LLM**: Ollama (local inference)
- **Capture**: WASAPI loopback (Windows), PulseAudio (Linux)
- **Audio Capture**: sounddevice with dual-stream support
- **Transcription**: faster-whisper (optimized Whisper implementation)
- **LLM**: Ollama for local inference
- **Format**: 16kHz mono, 16-bit PCM
- **Processing**: Independent mic/speaker buffers with beam_size=3
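faster-whisper accepts either a file path or a float32 waveform at 16 kHz, so 16-bit PCM from the capture side has to be scaled to [-1.0, 1.0) first. The sketch below does that conversion with the standard library and keeps separate buffers per source, as the Processing line describes. `pcm16_to_float`, `transcribe_buffer` and the `buffers` dictionary are hypothetical names. `numpy` and an already loaded faster-whisper model are needed only for the last function.

```python
from array import array
from typing import Dict, List

# Independent per-source buffers of raw int16 bytes (hypothetical structure).
buffers: Dict[str, bytearray] = {"mic": bytearray(), "speaker": bytearray()}


def pcm16_to_float(raw: bytes) -> List[float]:
    """Scale signed 16-bit PCM (native byte order, even length) to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(raw)
    return [s / 32768.0 for s in samples]


def transcribe_buffer(model, source: str, language: str = "en") -> str:
    """Transcribe and clear one source's buffer with faster-whisper using beam_size=3."""
    import numpy as np  # only needed here

    raw = bytes(buffers[source])
    buffers[source].clear()
    audio = np.asarray(pcm16_to_float(raw), dtype=np.float32)
    segments, _info = model.transcribe(audio, language=language, beam_size=3)
    return " ".join(seg.text.strip() for seg in segments)
```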
## Future Work
## Contributing
- Real-time streaming transcription with reduced buffering
- Speaker diarization improvements
- Web interface for remote monitoring
- Multi-device simultaneous transcription
- Cloud LLM integration options
- Custom vocabulary and domain adaptation
- Noise reduction preprocessing
Contributions welcome! Please open issues or submit pull requests.
## License