init

2025-12-17 16:33:19 +01:00
commit ae818f0b4b
10 changed files with 2206 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,222 @@
+# Verbatim Dicta
+
+Real-time transcription of system audio using OpenAI's Whisper (via faster-whisper), with optional LLM-powered analysis. Audio is captured from a loopback device and transcribed with a configurable model size, language, and processing interval.
+
+## Features
+
+- Real-time transcription of system audio (Windows/Linux)
+- Multiple Whisper model sizes (tiny to large)
+- Multi-language support
+- **Sentence extraction mode** - Stitches audio chunks into complete sentences
+- Optional LLM analysis for fact-checking and question generation (via Ollama)
+- GPU acceleration support
+- Flexible audio device configuration
+
+## Quick Start
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Basic transcription (no LLM)
+python transcribe_speakers.py
+
+# With LLM analysis (optional)
+python transcribe_speakers.py --enable-llm
+
+# With sentence extraction
+python transcribe_speakers.py --sentence-mode
+
+# List audio devices
+python transcribe_speakers.py --list-devices
+```
+
+## Requirements
+
+- **OS**: Windows 10/11 or Linux
+- **Python**: 3.8+
+- **Audio**: Loopback device (Stereo Mix/VB-Cable on Windows, PulseAudio on Linux)
+- **Optional**: CUDA-capable GPU, Ollama for LLM features
+
+## Installation
+
+### 1. Install Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 2. GPU Support (Optional)
+
+For CUDA 11.8:
+```bash
+pip install torch --index-url https://download.pytorch.org/whl/cu118
+```
+
+For CUDA 12.1:
+```bash
+pip install torch --index-url https://download.pytorch.org/whl/cu121
+```
+
+### 3. Audio Loopback Setup
+
+**Windows - Option A (Stereo Mix):**
+1. Right-click speaker icon → Sounds → Recording tab
+2. Right-click → Show Disabled Devices
+3. Enable and set Stereo Mix as default
+
+**Windows - Option B (VB-Cable, recommended):**
+1. Download from [vb-audio.com](https://vb-audio.com/Cable/)
+2. Install and restart
+3. Use `--device "CABLE Output"`
+
+**Linux:**
+Record from the monitor source of your output sink (run `pactl list short sources` and look for a name ending in `.monitor`), or use `transcribe_dual_linux.py`.
+
+### 4. LLM Features (Optional)
+
+```bash
+# Install Ollama from ollama.ai
+ollama pull llama3.2
+```
+
+## Usage
+
+### Available Scripts
+
+- `transcribe_speakers.py` - Main script with all features (LLM optional via `--enable-llm`)
+- `transcribe_dual_linux.py` - Linux-specific with dual audio support
+
+### Common Commands
+
+```bash
+# Specify device and model
+python transcribe_speakers.py --device "CABLE Output" --model medium
+
+# Save to file with language
+python transcribe_speakers.py --language es --output transcript.txt
+
+# Fast mode (low latency)
+python transcribe_speakers.py --fast-mode --model tiny --interval 3
+
+# Extract complete sentences from chunks
+python transcribe_speakers.py --sentence-mode --output sentences.txt
+
+# Maximum accuracy with LLM and sentence extraction
+python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt
+
+# Force CPU (avoid GPU issues)
+python transcribe_speakers.py --force-cpu
+```
+
+### Key Options
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `--model` | Model size: tiny/base/small/medium/large | base |
+| `--language` | Language code (en/es/fr/de/ja/etc.) | en |
+| `--device` | Audio device name (partial match) | Auto |
+| `--interval` | Processing interval (seconds) | 8.0 |
+| `--min-duration` | Minimum audio duration (seconds) | 3.0 |
+| `--fast-mode` | Trade accuracy for speed (3-5x faster) | False |
+| `--enable-llm` | Enable LLM fact-checking and question generation | False |
+| `--llm-model` | Ollama model to use | llama3.2 |
+| `--output` | Save to file | None |
+| `--force-cpu` | Disable GPU | False |
+| `--gpu-index` | GPU device index | 0 |
+| `--sentence-mode` | Extract complete sentences from chunks | False |
+
+## Model Performance
+
+| Model | Size | Speed | Quality | Best For |
+|-------|------|-------|---------|----------|
+| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
+| base | ~145 MB | Fast | Good | General real-time use |
+| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
+| medium | ~1.5 GB | Slow | Great | High accuracy needs |
+| large | ~3 GB | Slowest | Best | Maximum accuracy |
+
+## Optimization Presets
+
+**Low Latency (Real-Time):**
+```bash
+python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5
+```
+
+**Balanced:**
+```bash
+python transcribe_speakers.py --model base --interval 5
+```
+
+**High Accuracy:**
+```bash
+python transcribe_speakers.py --model large --interval 10 --enable-llm
+```
+
+## Troubleshooting
+
+**No loopback device:**
+- Windows: Enable Stereo Mix or install VB-Cable
+- Linux: Configure PulseAudio loopback
+
+**CUDA errors:**
+```bash
+python transcribe_speakers.py --force-cpu
+```
+
+**No audio captured:**
+- Verify audio is playing
+- Check device: `--list-devices`
+- Increase system volume
+
+**Poor quality:**
+- Use larger model: `--model medium`
+- Increase interval: `--interval 10`
+- Specify language: `--language <code>`
+
+**Ollama errors:**
+- Ensure Ollama is running
+- Pull model: `ollama pull llama3.2`
+
+## Output Format
+
+**Standard:**
+```
+[14:23:15] Transcribed audio segment.
+[14:23:23] Another segment with timestamp.
+```
+
+**With LLM (--enable-llm):**
+```
+======================================================================
+[14:23:15] The Earth revolves around the Sun in 365 days.
+
+📊 Fact Check: FACTUAL (confidence: 0.98)
+💡 Scientifically accurate. Earth's orbital period is 365.25 days.
+
+❓ Questions:
+1. Why do we need leap years?
+2. How does Earth's orbit affect seasons?
+======================================================================
+```
+
+## Technical Stack
+
+- **Audio**: sounddevice, soundfile (16 kHz mono, 16-bit PCM)
+- **Transcription**: faster-whisper (optimized Whisper)
+- **LLM**: Ollama (local inference)
+- **Capture**: WASAPI loopback (Windows), PulseAudio (Linux)
+
+## Future Work
+
+- Real-time streaming transcription with reduced buffering
+- Speaker diarization improvements
+- Web interface for remote monitoring
+- Multi-device simultaneous transcription
+- Cloud LLM integration options
+- Custom vocabulary and domain adaptation
+- Noise reduction preprocessing
+
+## License
+
+This project uses [Whisper](https://github.com/openai/whisper) (OpenAI), [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (SYSTRAN), and [Ollama](https://ollama.ai).
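
The README describes `--sentence-mode` as stitching chunks into complete sentences but does not show how. One way this could work is sketched below in Python, under stated assumptions: the stitcher receives the transcribed text of each chunk, treats `.`, `!`, or `?` followed by whitespace as a sentence boundary, and holds back any trailing fragment until a later chunk completes it. `SentenceStitcher`, `feed`, and `flush` are hypothetical names used for illustration, not identifiers taken from `transcribe_speakers.py`.

```python
import re
from typing import List

# Assumed boundary rule: terminal punctuation followed by whitespace.
# The actual --sentence-mode logic is not shown in this README.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceStitcher:
    """Accumulates transcribed chunks and emits only complete sentences."""

    def __init__(self) -> None:
        self._pending = ""  # text that has not yet reached a sentence end

    def feed(self, chunk: str) -> List[str]:
        """Add one chunk of text; return the sentences it completed."""
        text = f"{self._pending} {chunk}".strip()
        parts = _SENTENCE_END.split(text)
        if parts[-1].endswith((".", "!", "?")):
            # The final piece is itself a finished sentence.
            self._pending = ""
        else:
            # Hold the unfinished tail back for the next chunk.
            self._pending = parts.pop()
        return [p for p in parts if p]

    def flush(self) -> str:
        """Return any leftover fragment (e.g. at shutdown) and reset."""
        rest, self._pending = self._pending, ""
        return rest


if __name__ == "__main__":
    stitcher = SentenceStitcher()
    for chunk in ["The Earth revolves around", "the Sun. Why do we", "need leap years?"]:
        for sentence in stitcher.feed(chunk):
            print(sentence)
```

In this sketch the first chunk produces no output, the second releases the completed first sentence, and the fragment `Why do we` waits for the third chunk. A punctuation-only rule like this one also splits after abbreviations such as `Dr.`, so a production version would likely need a more robust boundary test.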