chore: update 6 file(s)

2025-12-17 22:30:41 +01:00
parent a53c0e2902
commit 4343b7a5a2
6 changed files with 1122 additions and 220 deletions
--- a/README.md
+++ b/README.md
@@ -1,16 +1,15 @@
 # Verbatim Dicta

-Real-time audio transcription using Whisper AI with optional LLM-powered analysis. Captures system audio via loopback and transcribes it with configurable models and processing options.
+Real-time audio transcription using Whisper AI with optional LLM analysis. Captures microphone and speaker audio simultaneously for comprehensive transcription.

 ## Features

-- Real-time transcription of system audio (Windows/Linux)
-- Multiple Whisper model sizes (tiny to large)
-- Multi-language support
-- **Sentence extraction mode** - Stitches audio chunks into complete sentences
-- Optional LLM analysis for fact-checking and question generation (via Ollama)
-- GPU acceleration support
-- Flexible audio device configuration
+- **Dual audio capture** - Record microphone and speaker output simultaneously
+- **Real-time transcription** - Process audio as it's captured with Whisper models
+- **LLM analysis** - Optional fact-checking and question generation via Ollama
+- **Multi-language** - Support for 50+ languages
+- **File output** - Save transcripts with timestamps and analysis
+- **GPU acceleration** - CUDA support for faster processing

 ## Quick Start

@@ -18,17 +17,14 @@ Real-time audio transcription using Whisper AI with optional LLM-powered analysi
 # Install dependencies
 pip install -r requirements.txt

-# Basic transcription (no LLM)
-python transcribe_speakers.py
-
-# With LLM analysis (optional)
-python transcribe_speakers.py --enable-llm
-
-# With sentence extraction
-python transcribe_speakers.py --sentence-mode
-
 # List audio devices
-python transcribe_speakers.py --list-devices
+./run_transcribe.sh --list-devices
+
+# Basic transcription
+./run_transcribe.sh --model medium --language en
+
+# With LLM analysis and file output
+./run_transcribe.sh --model medium --enable-llm --output transcript.txt
 ```

 ## Requirements
@@ -58,172 +54,153 @@ For CUDA 12.1:
 pip install torch==2.8.0+cu121 --index-url https://download.pytorch.org/whl/cu121
 ```

-### 3. Audio Loopback Setup
+### 3. Audio Setup

-**Windows - Option A (Stereo Mix):**
-1. Right-click speaker icon → Sounds → Recording tab
-2. Right-click → Show Disabled Devices
-3. Enable and set Stereo Mix as default
+**Linux (PulseAudio/PipeWire):**
+```bash
+# List devices to find your monitor device
+./run_transcribe.sh --list-devices

-**Windows - Option B (VB-Cable, recommended):**
-1. Download from [vb-audio.com](https://vb-audio.com/Cable/)
-2. Install and restart
-3. Use `--device "CABLE Output"`
+# Use with monitor device
+./run_transcribe.sh --monitor "alsa_output.monitor"
+```

-**Linux:**
-Configure PulseAudio loopback or use `transcribe_dual_linux.py`
+**Windows:**
+- Enable "Stereo Mix" in Sound settings, or
+- Install VB-Cable from [vb-audio.com](https://vb-audio.com/Cable/)

-### 4. LLM Features (Optional)
+### 4. LLM Support (Optional)

 ```bash
 # Install Ollama from ollama.ai
-ollama pull llama3.2
+ollama pull qwen2.5:3b
 ```

 ## Usage

-### Available Scripts
-
-- `transcribe_speakers.py` - Main script with all features (LLM optional via `--enable-llm`)
-- `transcribe_dual_linux.py` - Linux-specific with dual audio support
-
-### Common Commands
+### Command Line Options

 ```bash
-# Quick start with GPU (English)
-./RUN_GPU.sh
+python transcribe.py [OPTIONS]

-# Dutch language
-./RUN_DUTCH.sh
-
-# Dutch with LLM analysis
-./RUN_DUTCH_LLM.sh
-
-# With LLM analysis
-./RUN_GPU.sh --enable-llm
-
-# Save to file
-./RUN_GPU.sh --output transcript.txt
-
-# Other languages (Spanish, French, German, etc.)
-./RUN_GPU.sh --language es  # Spanish
-./RUN_GPU.sh --language fr  # French
-./RUN_GPU.sh --language de  # German
-
-# Maximum accuracy with LLM and sentence extraction
-python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt
-
-# Force CPU (if GPU issues)
-python transcribe_speakers.py --force-cpu
+Options:
+  --model {tiny,base,small,medium,large}  Whisper model (default: tiny)
+  --language CODE                         Language code (default: en)
+  --mic DEVICE                            Microphone device name
+  --monitor DEVICE                        Speaker monitor device name
+  --interval SECONDS                      Processing interval (default: 5.0)
+  --min-duration SECONDS                  Minimum audio duration (default: 2.0)
+  --enable-llm                            Enable LLM analysis
+  --llm-model MODEL                       Ollama model (default: qwen2.5:3b)
+  --output FILE                           Save transcript to file
+  --force-cpu                             Force CPU processing
+  --list-devices                          List audio devices
 ```

-### Key Options
+### Examples

-| Option | Description | Default |
-|--------|-------------|---------|
-| `--model` | Model size: tiny/base/small/medium/large | base |
-| `--language` | Language code (en/es/fr/de/ja/etc.) | en |
-| `--device` | Audio device name (partial match) | Auto |
-| `--interval` | Processing interval (seconds) | 8.0 |
-| `--min-duration` | Minimum audio duration | 3.0 |
-| `--fast-mode` | Fast mode (3-5x faster, lower accuracy) | False |
-| `--enable-llm` | Enable fact-checking and questions | False |
-| `--llm-model` | Ollama model to use | llama3.2 |
-| `--output` | Save to file | None |
-| `--force-cpu` | Disable GPU | False |
-| `--gpu-index` | GPU device index | 0 |
-| `--sentence-mode` | Extract complete sentences from chunks | False |
+```bash
+# Dutch transcription with LLM
+./run_transcribe.sh --model medium --language nl --enable-llm
+
+# High-quality meeting transcription
+./run_transcribe.sh --model large --interval 8 --output meeting.txt
+
+# Fast real-time transcription
+./run_transcribe.sh --model tiny --interval 3 --min-duration 2
+
+# Specific devices
+./run_transcribe.sh --mic "USB Mic" --monitor "Monitor of Speakers"
+```

 ## Model Performance

-| Model | Size | Speed | Quality | Best For |
-|-------|------|-------|---------|----------|
-| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
-| base | ~145 MB | Fast | Good | General real-time use |
-| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
-| medium | ~1.5 GB | Slow | Great | High accuracy needs |
-| large | ~3 GB | Slowest | Best | Maximum accuracy |
-
-## Optimization Presets
-
-**Low Latency (Real-Time):**
-```bash
-python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5
-```
-
-**Balanced:**
-```bash
-python transcribe_speakers.py --model base --interval 5
-```
-
-**High Accuracy:**
-```bash
-python transcribe_speakers.py --model large --interval 10 --enable-llm
-```
+| Model  | Size   | Speed    | Quality | Use Case               |
+|--------|--------|----------|---------|------------------------|
+| tiny   | 75 MB  | Fastest  | Basic   | Real-time, low latency |
+| base   | 145 MB | Fast     | Good    | General use            |
+| small  | 485 MB | Moderate | Better  | Balanced               |
+| medium | 1.5 GB | Slow     | Great   | High accuracy          |
+| large  | 3 GB   | Slowest  | Best    | Maximum quality        |

 ## Troubleshooting

-**No loopback device:**
-- Windows: Enable Stereo Mix or install VB-Cable
-- Linux: Configure PulseAudio loopback
+**No audio devices found:**
+```bash
+# List all devices
+./run_transcribe.sh --list-devices
+
+# Specify devices explicitly
+./run_transcribe.sh --mic "device_name" --monitor "monitor_name"
+```

 **CUDA errors:**
 ```bash
-python transcribe_speakers.py --force-cpu
+# Force CPU processing
+./run_transcribe.sh --force-cpu
 ```

-**No audio captured:**
-- Verify audio is playing
-- Check device: `--list-devices`
-- Increase system volume
+**Ollama connection failed:**
+```bash
+# Start Ollama service
+ollama serve

-**Poor quality:**
-- Use larger model: `--model medium`
+# Pull required model
+ollama pull qwen2.5:3b
+```
+
+**Poor transcription quality:**
+- Use larger model: `--model medium` or `--model large`
 - Increase interval: `--interval 10`
-- Specify language: `--language <code>`
-
-**Ollama errors:**
-- Ensure Ollama is running
-- Pull model: `ollama pull llama3.2`
+- Specify language: `--language nl`
+- Ensure good audio quality (reduce background noise)

 ## Output Format

-**Standard:**
+### Standard Output
 ```
-[14:23:15] Transcribed audio segment.
-[14:23:23] Another segment with timestamp.
+🎤 [14:23:15] User speaking into microphone
+🔊 [14:23:18] Audio from speakers or system
 ```

-**With LLM (--enable-llm):**
+### With LLM Analysis
 ```
+🎤 [14:23:15] The Earth orbits the Sun in 365 days.
+   ✅ FACTUAL (0.98): Scientifically accurate orbital period.
+   ❓ Questions:
+      1. Why do we need leap years?
+      2. How does the elliptical orbit affect seasons?
+      3. What factors influence Earth's orbital velocity?
+```
+
+### File Output
+```
+[14:23:15] MIC: User speaking into microphone
+[14:23:18] SPEAKER: Audio from speakers
+
 ======================================================================
-[14:23:15] The Earth revolves around the Sun in 365 days.
+[14:23:25] MIC: The Earth orbits the Sun in 365 days.

 📊 Fact Check: FACTUAL (confidence: 0.98)
-💡 Scientifically accurate. Earth's orbital period is 365.25 days.
+💡 Scientifically accurate orbital period.

 ❓ Questions:
 1. Why do we need leap years?
-2. How does Earth's orbit affect seasons?
-======================================================================
+2. How does the elliptical orbit affect seasons?
+3. What factors influence Earth's orbital velocity?
 ```

-## Technical Stack
+## Architecture

-- **Audio**: sounddevice, soundfile (16kHz mono, 16-bit PCM)
-- **Transcription**: faster-whisper (optimized Whisper)
-- **LLM**: Ollama (local inference)
-- **Capture**: WASAPI loopback (Windows), PulseAudio (Linux)
+- **Audio Capture**: sounddevice with dual-stream support
+- **Transcription**: faster-whisper (optimized Whisper implementation)
+- **LLM**: Ollama for local inference
+- **Format**: 16kHz mono, 16-bit PCM
+- **Processing**: Independent mic/speaker buffers with beam_size=3

-## Future Work
+## Contributing

-- Real-time streaming transcription with reduced buffering
-- Speaker diarization improvements
-- Web interface for remote monitoring
-- Multi-device simultaneous transcription
-- Cloud LLM integration options
-- Custom vocabulary and domain adaptation
-- Noise reduction preprocessing
+Contributions welcome! Please open issues or submit pull requests.

 ## License